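This post presents the daily list of the latest papers fetched from arXiv, grouped by category: natural language processing, information retrieval, computer vision, and others.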

Statistics

514 papers were updated today, including:

  • Natural language processing: 142
  • Information retrieval: 15
  • Computer vision: 93

Natural Language Processing

1. 【2502.14866】LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Link: https://arxiv.org/abs/2502.14866

Authors: Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)

Keywords: Large language models, large memory footprint, models remains challenging, processing long sequences, shown remarkable potential

Comments: Accepted by MLSys 2025. Code available at: [this https URL](https://github.com/mit-han-lab/omniserve)

Abstract:Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is released at https://github.com/mit-han-lab/omniserve.

2. 【2502.14862】Interpretable Text Embeddings and Text Similarity Explanation: A Primer

Link: https://arxiv.org/abs/2502.14862

Authors: Juri Opitz, Lucas Möller, Andrianos Michail, Simon Clematide

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: NLP systems, involving search, text embedding models, similarity scores, Text embeddings

Abstract:Text embeddings and text embedding models are a backbone of many AI and NLP systems, particularly those involving search. However, interpretability challenges persist, especially in explaining obtained similarity scores, which is crucial for applications requiring transparency. In this paper, we give a structured overview of interpretability methods specializing in explaining those similarity scores, an emerging research area. We study the methods' individual ideas and techniques, evaluating their potential for improving interpretability of text embeddings and explaining predicted similarities.

3. 【2502.14860】Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning

Link: https://arxiv.org/abs/2502.14860

Authors: Shuyue Stella Li, Jimin Mun, Faeze Brahman, Jonathan S. Ilgen, Yulia Tsvetkov, Maarten Sap

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, making them unreliable, essential for decision-making, proactive information-gathering

Comments: 22 pages, 8 figures, 8 tables

Abstract:Large language models (LLMs) often fail to ask effective questions under uncertainty, making them unreliable in domains where proactive information-gathering is essential for decision-making. We present ALFA, a framework that improves LLM question-asking by (i) decomposing the notion of a "good" question into a set of theory-grounded attributes (e.g., clarity, relevance), (ii) controllably synthesizing attribute-specific question variations, and (iii) aligning models via preference-based optimization to explicitly learn to ask better questions along these fine-grained attributes. Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs dataset, composed of 17k real-world clinical interactions augmented with 80k attribute-specific preference pairs of follow-up questions, as well as a novel expert-annotated interactive healthcare QA task to evaluate question-asking abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on MediQ-AskDocs compared to SOTA instruction-tuned LLMs, with a question-level win-rate of 64.4% and strong generalizability. Our findings suggest that explicitly guiding question-asking with structured, fine-grained attributes offers a scalable path to improve LLMs, especially in expert application domains.

4. 【2502.14856】FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

Link: https://arxiv.org/abs/2502.14856

Authors: Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: auto-regressive generation process, Speculative sampling, large language models, mechanism to produce, forward pass

Abstract:Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12× speedup over the state-of-the-art speculative sampling method EAGLE-2.
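
The idea of restricting the draft search to a frequency-prioritized token subset can be illustrated with a small toy sketch. This is not the authors' implementation: the function names, the tiny four-token vocabulary, and the list-based LM head below are all hypothetical, and the verification step against the full vocabulary is not shown.

```python
from collections import Counter

def frequent_token_subset(corpus_token_ids, keep):
    """Return the `keep` most frequent token ids seen in a reference corpus."""
    counts = Counter(corpus_token_ids)
    return [tok for tok, _ in counts.most_common(keep)]

def draft_logits(hidden, lm_head, subset):
    """Compute logits only for the LM-head rows whose token id is in `subset`."""
    return {
        tok: sum(h * w for h, w in zip(hidden, lm_head[tok]))
        for tok in subset
    }

# Hypothetical toy data: a 4-token vocabulary with a 2-dimensional hidden state.
corpus = [3, 3, 3, 1, 1, 0, 2, 2, 2, 2]
lm_head = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]
hidden = [2.0, 0.5]

subset = frequent_token_subset(corpus, keep=2)   # token ids 2 and 3
logits = draft_logits(hidden, lm_head, subset)   # only 2 of 4 rows are scored
draft_token = max(logits, key=logits.get)        # id in the full vocabulary
```

Because the subset stores original token ids, the drafted token needs no remapping before it is handed to the full-vocabulary target model for verification.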

5. 【2502.14855】Prompt-to-Leaderboard

Link: https://arxiv.org/abs/2502.14855

Authors: Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large language model, evaluations typically rely, Large language, typically rely, rely on aggregated

Abstract:Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt. The core idea is to train an LLM taking natural language prompts as input to output a vector of Bradley-Terry coefficients which are then used to predict the human preference vote. The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the #1 spot in the Chatbot Arena leaderboard. Our code is available at this GitHub link: this https URL.
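
As a rough illustration of how a vector of Bradley-Terry coefficients yields both a preference prediction and a leaderboard, here is a minimal sketch. The coefficient values and model names are hypothetical placeholders; in P2L the coefficients would come from the trained LLM for a given prompt, which is not modeled here.

```python
import math

def bt_win_probability(coef_a: float, coef_b: float) -> float:
    """Bradley-Terry probability that model A is preferred over model B."""
    return 1.0 / (1.0 + math.exp(-(coef_a - coef_b)))

def leaderboard(coefs: dict) -> list:
    """Rank model names by coefficient, strongest first."""
    return sorted(coefs, key=coefs.get, reverse=True)

# Hypothetical coefficients for a single prompt.
prompt_coefs = {"model_a": 1.2, "model_b": 0.4, "model_c": -0.3}

ranking = leaderboard(prompt_coefs)
p_a_over_b = bt_win_probability(prompt_coefs["model_a"], prompt_coefs["model_b"])
```

Routing a query then amounts to picking the first entry of `ranking` for that prompt, under whatever cost constraints apply.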

6. 【2502.14854】CLIPPER: Compression enables long-context synthetic data generation

Link: https://arxiv.org/abs/2502.14854

Authors: Chau Minh Pham, Yapei Chang, Mohit Iyyer

Categories: Computation and Language (cs.CL)

Keywords: LLM developers, tasks remains challenging, generating high-quality data, remains challenging, developers are increasingly

Abstract:LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification - a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model achieves breakthrough results on narrative claim verification (from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).

7. 【2502.14848】GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks

Link: https://arxiv.org/abs/2502.14848

Authors: Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, shown great promise, efficiently construct reliable

Comments: 8 pages of main text, 38 pages of appendices

Abstract:Large Language Models (LLMs) have shown great promise in tool-making, yet existing frameworks often struggle to efficiently construct reliable toolsets and are limited to single-task settings. To address these challenges, we propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that dynamically constructs and evolves a hierarchical graph of reusable tools across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, TabMWP). Our results show that GATE achieves up to 4.3x faster milestone completion in Minecraft compared to the previous SOTA, and provides an average improvement of 9.23% over existing tool-making methods in code generation tasks and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, balancing tool quantity, complexity, and functionality while maintaining high efficiency. Code and data are available at this https URL.

8. 【2502.14846】Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

Link: https://arxiv.org/abs/2502.14846

Authors: Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: charts and documents, critical application, Reasoning, data, Reasoning about images

Comments: 20 pages, 19 figures, 9 tables, website: [this https URL](https://yueyang1996.github.io/cosyn/)

Abstract:Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.

9. 【2502.14838】Revealing and Mitigating Over-Attention in Knowledge Editing

Link: https://arxiv.org/abs/2502.14838

Authors: Pinzheng Wang, Zecheng Tang, Keyan Zhou, Juntao Li, Qiaoming Zhu, Min Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, demonstrated superior performance, exhibit undesirable errors, Specificity Failure

Abstract:Large Language Models have demonstrated superior performance across a wide range of tasks, but they still exhibit undesirable errors due to incorrect knowledge learned from the training data. To avoid this, knowledge editing methods emerged to precisely edit the specific model knowledge via efficiently modifying a very small percentage of parameters. However, those methods can lead to the problem of Specificity Failure, where the existing knowledge and capabilities are severely degraded due to editing. Our preliminary analysis indicates that Specificity Failure primarily stems from the model's attention heads assigning excessive attention scores to entities related to the edited knowledge, thereby unduly focusing on specific snippets within the context, which we denote as the Attention Drift phenomenon. To mitigate this Attention Drift issue, we introduce a simple yet effective method, Selective Attention Drift Restriction (SADR), which introduces an additional regularization term during the knowledge editing process to restrict changes in the attention weight distribution, thereby preventing undue focus on the edited entity. Experiments on five frequently used strong LLMs demonstrate the effectiveness of our method, where SADR can significantly mitigate Specificity Failure in the predominant knowledge editing tasks.

10. 【2502.14837】Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs

Link: https://arxiv.org/abs/2502.14837

Authors: Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Multi-head Latent Attention, innovative architecture proposed, latent vector, Multi-head Latent, employing Multi-Head Attention

Comments: 16 pages, 8 figures

Abstract:Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, we remove RoPE from dimensions of queries and keys that contribute less to the attention scores; for low-rank approximation, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These carefully designed strategies enable MHA2MLA to recover performance using only a small fraction (0.3% to 0.6%) of the data, significantly reducing inference costs while seamlessly integrating with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 0.5% drop in LongBench performance.

11. 【2502.14834】LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models

Link: https://arxiv.org/abs/2502.14834

Authors: Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Existing Large Vision-Language, Existing Large, Large Vision-Language Models, Large Vision-Language, generate coherent outputs

Abstract:Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: this https URL

12. 【2502.14830】Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs

Link: https://arxiv.org/abs/2502.14830

Authors: Danni Liu, Jan Niehues

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: models demonstrate remarkable, large language models, language models demonstrate, demonstrate remarkable capabilities, extending these benefits

Abstract:While large language models demonstrate remarkable capabilities at task-specific applications through fine-tuning, extending these benefits across diverse languages is essential for broad accessibility. However, effective cross-lingual transfer is hindered by LLM performance gaps across languages and the scarcity of fine-tuning data in many languages. Through analysis of LLM internal representations from more than 1,000 language pairs, we discover that middle layers exhibit the strongest potential for cross-lingual alignment. Building on this finding, we propose a middle-layer alignment objective integrated into task-specific training. Our experiments on slot filling, machine translation, and structured text generation show consistent improvements in cross-lingual transfer, especially to lower-resource languages. The method is robust to the choice of alignment languages and generalizes to languages unseen during alignment. Furthermore, we show that separately trained alignment modules can be merged with existing task-specific modules, improving cross-lingual capabilities without full re-training. Our code is publicly available (this https URL).

13. 【2502.14829】Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps

Link: https://arxiv.org/abs/2502.14829

Authors: Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasović, Yonatan Belinkov

Categories: Computation and Language (cs.CL)

Keywords: chain of thought, reasoning, reasoning steps, FUR, steps

Abstract:When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. However, despite much work on CoT prompting, it is unclear if CoT reasoning is faithful to the models' parametric beliefs. We introduce a framework for measuring parametric faithfulness of generated reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an instance of this framework. FUR erases information contained in reasoning steps from model parameters. We perform experiments unlearning CoTs of four LMs prompted on four multi-choice question answering (MCQA) datasets. Our experiments show that FUR is frequently able to change the underlying models' prediction by unlearning key steps, indicating when a CoT is parametrically faithful. Further analysis shows that CoTs generated by models post-unlearning support different answers, hinting at a deeper effect of unlearning. Importantly, CoT steps identified as important by FUR do not align well with human notions of plausibility, emphasizing the need for specialized alignment.

14. 【2502.14820】eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables

Link: https://arxiv.org/abs/2502.14820

Authors: Luis Antonio Gutiérrez Guanilo, Mir Tafseer Nayeem, Cristian López, Davood Rafiei

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Human-Computer Interaction (cs.HC)

Keywords: Large Language Models, Large Language, demonstrated exceptional versatility, remains underexplored due, e-commerce remains underexplored

Comments: NAACL 2025 (Industry Track)

Abstract:Large Language Models (LLMs) have demonstrated exceptional versatility across diverse domains, yet their application in e-commerce remains underexplored due to a lack of domain-specific datasets. To address this gap, we introduce eC-Tab2Text, a novel dataset designed to capture the intricacies of e-commerce, including detailed product attributes and user-specific queries. Leveraging eC-Tab2Text, we focus on text generation from product tables, enabling LLMs to produce high-quality, attribute-specific product reviews from structured tabular data. Fine-tuned models were rigorously evaluated using standard Table2Text metrics, alongside correctness, faithfulness, and fluency assessments. Our results demonstrate substantial improvements in generating contextually accurate reviews, highlighting the transformative potential of tailored datasets and fine-tuning methodologies in optimizing e-commerce workflows. This work highlights the potential of LLMs in e-commerce workflows and the essential role of domain-specific datasets in tailoring them to industry-specific challenges.

15. 【2502.14815】Optimizing Model Selection for Compound AI Systems

Link: https://arxiv.org/abs/2502.14815

Authors: Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: combine multiple LLM, multiple LLM calls, achieve strong performance, LLM, achieve strong

Abstract:Compound AI systems that combine multiple LLM calls, such as self-refine and multi-agent-debate, achieve strong performance on many AI tasks. We address a core question in optimizing compound systems: for each LLM call or module in the system, how should one decide which LLM to use? We show that these LLM choices have a large effect on quality, but the search space is exponential. We propose LLMSelector, an efficient framework for model selection in compound systems, which leverages two key empirical insights: (i) end-to-end performance is often monotonic in how well each module performs, with all other modules held fixed, and (ii) per-module performance can be estimated accurately by an LLM. Building upon these insights, LLMSelector iteratively selects one module and allocates to it the model with the highest module-wise performance, as estimated by an LLM, until no further gain is possible. LLMSelector is applicable to any compound system with a bounded number of modules, and its number of API calls scales linearly with the number of modules, achieving high-quality model allocation both empirically and theoretically. Experiments with popular compound systems such as multi-agent debate and self-refine using LLMs such as GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 show that LLMSelector confers 5%-70% accuracy gains compared to using the same LLM for all modules.
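
The iterative allocation described in the abstract can be sketched as a simple loop. This is only a toy reading of the abstract, not the LLMSelector code: `module_score` stands in for the LLM-based performance estimator, and the module names, model names, score table, and `max_rounds` safeguard are all hypothetical.

```python
def select_models(modules, models, module_score, initial, max_rounds=10):
    """Repeatedly give each module its best-scoring model until nothing improves."""
    allocation = dict(initial)
    for _ in range(max_rounds):
        improved = False
        for module in modules:
            current = module_score(module, allocation[module], allocation)
            best = max(models, key=lambda m: module_score(module, m, allocation))
            if module_score(module, best, allocation) > current:
                allocation[module] = best
                improved = True
        if not improved:
            break
    return allocation

# Hypothetical per-module scores; a real estimator would depend on the task.
scores = {
    ("generator", "m1"): 0.6, ("generator", "m2"): 0.8,
    ("critic", "m1"): 0.9, ("critic", "m2"): 0.7,
}

def toy_score(module, model, allocation):
    # This toy estimator ignores the other modules' current allocation.
    return scores[(module, model)]

result = select_models(
    modules=["generator", "critic"],
    models=["m1", "m2"],
    module_score=toy_score,
    initial={"generator": "m1", "critic": "m1"},
)
```

Each pass makes one scoring sweep per module, so the number of estimator calls per round grows linearly with the number of modules rather than exponentially with the number of possible allocations.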

16. 【2502.14802】From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Link: https://arxiv.org/abs/2502.14802

Authors: Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: continuously acquire, full potential, key feature, approximate to unlock, unlock their full

Comments: Code and data to be released at: [this https URL](https://github.com/OSU-NLP-Group/HippoRAG)

Abstract:Our ability to continuously acquire, organize, and leverage knowledge is a key feature of human intelligence that AI systems must approximate to unlock their full potential. Given the challenges in continual learning with large language models (LLMs), retrieval-augmented generation (RAG) has become the dominant way to introduce new information. However, its reliance on vector retrieval hinders its ability to mimic the dynamic and interconnected nature of human long-term memory. Recent RAG approaches augment vector embeddings with various structures like knowledge graphs to address some of these gaps, namely sense-making and associativity. However, their performance on more basic factual memory tasks drops considerably below standard RAG. We address this unintended deterioration and propose HippoRAG 2, a framework that outperforms standard RAG comprehensively on factual, sense-making, and associative memory tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in HippoRAG and enhances it with deeper passage integration and more effective online use of an LLM. This combination pushes this RAG system closer to the effectiveness of human long-term memory, achieving a 7% improvement in associative memory tasks over the state-of-the-art embedding model while also exhibiting superior factual knowledge and sense-making memory capabilities. This work paves the way for non-parametric continual learning for LLMs. Our code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG.

17. 【2502.14791】Rapid Word Learning Through Meta In-Context Learning

Link: https://arxiv.org/abs/2502.14791

Authors: Wentao Wang, Guangyuan Jiang, Tal Linzen, Brenden M. Lake

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Humans can quickly, quickly learn, systematically and flexibly, word, word learning

Abstract:Humans can quickly learn a new word from a few illustrative examples, and then systematically and flexibly use it in novel contexts. Yet the abilities of current language models for few-shot word learning, and methods for improving these abilities, are underexplored. In this study, we introduce a novel method, Meta-training for IN-context learNing Of Words (Minnow). This method trains language models to generate new examples of a word's usage given a few in-context examples, using a special placeholder token to represent the new word. This training is repeated on many new words to develop a general word-learning ability. We find that training models from scratch with Minnow on human-scale child-directed language enables strong few-shot word learning, comparable to a large language model (LLM) pre-trained on orders of magnitude more data. Furthermore, through discriminative and generative evaluations, we demonstrate that finetuning pre-trained LLMs with Minnow improves their ability to discriminate between new words, identify syntactic categories of new words, and generate reasonable new usages and definitions for new words, based on one or a few in-context examples. These findings highlight the data efficiency of Minnow and its potential to improve language model performance in word learning tasks.

18. 【2502.14780】ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting

Link: https://arxiv.org/abs/2502.14780

Authors: Abhijit Mishra, Richard Noh, Hsiang Fu, Mingda Li, Minji Kim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Efficient and privacy-preserving, privacy-preserving multimodal interaction, human-computer communication, Instruction Rewriting, modern smartphones

Comments: 12 pages, 7 figures, 3 tables

Abstract:Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.

19. 【2502.14778】Harnessing PDF Data for Improving Japanese Large Multimodal Models

Link: https://arxiv.org/abs/2502.14778

Authors: Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Multimodal Models, remains limited due, Japanese remains limited, Large Multimodal, Japanese LMMs

Comments: 15 pages, 8 figures

Abstract:Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 3.9% to 13.8% on Heron-Bench. Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs. We plan to make the source code and data publicly available upon acceptance.

20. 【2502.14776】SurveyX: Academic Survey Automation via Large Language Models

Link: https://arxiv.org/abs/2502.14776

Authors: Xun Liang, Jiawei Yang, Yezhaohui Wang, Chen Tang, Zifan Zheng, Simin Niu, Shichao Song, Hanyu Wang, Bo Tang, Feiyu Xiong, Keming Mao, Zhiyu Li

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, vast knowledge base, demonstrated exceptional comprehension

Comments: 15 pages, 16 figures

Abstract:Large Language Models (LLMs) have demonstrated exceptional comprehension capabilities and a vast knowledge base, suggesting that LLMs can serve as efficient tools for automated survey generation. However, recent research related to automated survey generation remains constrained by some critical limitations like finite context window, lack of in-depth content discussion, and absence of systematic evaluation frameworks. Inspired by human writing processes, we propose SurveyX, an efficient and organized system for automated survey generation that decomposes the survey composing process into two phases: the Preparation and Generation phases. By innovatively introducing online reference retrieval, a pre-processing method called AttributeTree, and a re-polishing process, SurveyX significantly enhances the efficacy of survey composition. Experimental evaluation results show that SurveyX outperforms existing automated survey generation systems in content quality (0.259 improvement) and citation quality (1.76 enhancement), approaching human expert performance across multiple evaluation dimensions. Examples of surveys generated by SurveyX are available on this http URL

21. 【2502.14768】Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Link: https://arxiv.org/abs/2502.14768

Authors: Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: rule-based reinforcement learning, reinforcement learning, explore the potential, potential of rule-based, rule-based reinforcement

Abstract:Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make some key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it demonstrates generalization abilities to the challenging math benchmarks AIME and AMC.
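
The abstract does not spell out the format reward, so the following is only a minimal sketch of what a rule-based format check could look like. The `<think>`/`<answer>` tag names and the reward values of 1.0 and -1.0 are assumptions made for illustration, not the paper's actual settings.

```python
import re

# Assumed layout: exactly one <think> block followed by exactly one <answer> block.
_PATTERN = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)


def format_reward(output: str) -> float:
    """Return +1.0 if the output follows the assumed layout, else -1.0."""
    match = _PATTERN.fullmatch(output)
    if match is None:
        return -1.0
    # Reject repeated tags, which would indicate a malformed or padded response.
    for tag in ("<think>", "</think>", "<answer>", "</answer>"):
        if output.count(tag) != 1:
            return -1.0
    # Reject blank reasoning, i.e. a "shortcut" that skips the thinking step.
    if not match.group(1).strip():
        return -1.0
    return 1.0
```

A check like this only scores structure; answer correctness would need a separate rule-based reward.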

22. 【2502.14767】Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis

Link: https://arxiv.org/abs/2502.14767

Authors: Priyanka Kargupta, Ishika Agarwal, Tal August, Jiawei Han

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: improved accessibility, exponential growth, facilitated by modern, modern technology, technology and improved

Comments: Code available at: [this https URL](https://github.com/pkargupta/tree-of-debate)

Abstract:With the exponential growth of research facilitated by modern technology and improved accessibility, scientific discoveries have become increasingly fragmented within and across fields. This makes it challenging to assess the significance, novelty, incremental findings, and equivalent ideas between related works, particularly those from different research communities. Large language models (LLMs) have recently demonstrated strong quantitative and qualitative reasoning abilities, and multi-agent LLM debates have shown promise in handling complex reasoning tasks by exploring diverse perspectives and reasoning paths. Inspired by this, we introduce Tree-of-Debate (ToD), a framework which converts scientific papers into LLM personas that debate their respective novelties. To emphasize structured, critical reasoning rather than focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling fine-grained analysis of independent novelty arguments within scholarly articles. Through experiments on scientific literature across various domains, evaluated by expert researchers, we demonstrate that ToD generates informative arguments, effectively contrasts papers, and supports researchers in their literature review.

23. 【2502.14765】Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning

Link: https://arxiv.org/abs/2502.14765

Authors: Juraj Vladika, Ivana Hacajová, Florian Matthes

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: aims to assess, Fact verification, assess the veracity, based on relevant, relevant evidence

Comments: Accepted to NAACL 2025 (Main)

Abstract:Fact verification (FV) aims to assess the veracity of a claim based on relevant evidence. The traditional approach for automated FV includes a three-part pipeline relying on short evidence snippets and encoder-only inference models. More recent approaches leverage the multi-turn nature of LLMs to address FV as a step-by-step problem where questions inquiring additional context are generated and answered until there is enough information to make a decision. This iterative method makes the verification process rational and explainable. While these methods have been tested for encyclopedic claims, exploration on domain-specific and realistic claims is missing. In this work, we apply an iterative FV system on three medical fact-checking datasets and evaluate it with multiple settings, including different LLMs, external web search, and structured reasoning using logic predicates. We demonstrate improvements in the final performance over traditional approaches and the high potential of step-by-step FV systems for domain-specific claims.
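
The iterative loop described in this abstract can be sketched roughly as follows. This is only an illustration: `decide`, `ask`, and `answer` are hypothetical callables standing in for LLM or web-search calls, and the `max_steps` cap and the fallback label are assumptions, not details from the paper.

```python
def iterative_verify(claim, decide, ask, answer, max_steps=5):
    """Gather question-answer evidence until `decide` returns a verdict.

    All three callables are hypothetical placeholders:
      decide(claim, evidence) -> verdict string, or None if still undecided
      ask(claim, evidence)    -> follow-up question string
      answer(question)        -> answer string (e.g. from retrieval)
    """
    evidence = []
    for _ in range(max_steps):
        verdict = decide(claim, evidence)
        if verdict is not None:
            return verdict, evidence
        question = ask(claim, evidence)
        evidence.append((question, answer(question)))
    # Assumed fallback when the step budget runs out.
    return "NOT ENOUGH INFO", evidence
```

Returning the collected question-answer pairs alongside the verdict is what makes the decision traceable step by step.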

24. 【2502.14759】On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems

Link: https://arxiv.org/abs/2502.14759

Authors: Juraj Vladika, Florian Matthes

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Retrieval-augmented generation, improving answer factuality, large language models, augment large language, language models

Comments: Accepted to Findings of NAACL 2025

Abstract:Retrieval-augmented generation (RAG) has emerged as an approach to augment large language models (LLMs) by reducing their reliance on static knowledge and improving answer factuality. RAG retrieves relevant context snippets and generates an answer based on them. Despite its increasing industrial adoption, systematic exploration of RAG components is lacking, particularly regarding the ideal size of provided context, and the choice of base LLM and retrieval method. To help guide development of robust RAG systems, we evaluate various context sizes, BM25 and semantic search as retrievers, and eight base LLMs. Moving away from the usual RAG evaluation with short answers, we explore the more challenging long-form question answering in two domains, where a good answer has to utilize the entire context. Our findings indicate that final QA performance improves steadily with up to 15 snippets but stagnates or declines beyond that. Finally, we show that the general-purpose LLMs that excel in the biomedical domain differ from those that excel in the encyclopedic one, and that open-domain evidence retrieval in large corpora is challenging.
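
To make the BM25 retriever and the context-size knob concrete, here is a small self-contained sketch of Okapi BM25 ranking over snippets. The whitespace tokenizer, the `k1` and `b` values, and the non-negative IDF variant are illustrative choices rather than the paper's configuration; the default `k=15` simply mirrors the snippet count mentioned in the abstract.

```python
import math
from collections import Counter


def bm25_top_k(query, snippets, k=15, k1=1.5, b=0.75):
    """Rank snippets against a query with Okapi BM25; return top-k indices."""
    docs = [s.lower().split() for s in snippets]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of snippets containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant (adds 1 inside the logarithm).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Varying `k` here corresponds to varying how many snippets are placed in the LLM's context.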

25. 【2502.14752】TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

Link: https://arxiv.org/abs/2502.14752

Authors: Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Python-like language designed, high-level Python-like language, deep learning frameworks, learning frameworks due, high-level Python-like

Abstract:Triton, a high-level Python-like language designed for building efficient GPU kernels, is widely adopted in deep learning frameworks due to its portability, flexibility, and accessibility. However, programming and parallel optimization still require considerable trial and error from Triton developers. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code, as they lack awareness of its specifications and the complexities of GPU programming. More critically, there is an urgent need for systematic evaluations tailored to Triton. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation. TritonBench features two evaluation channels: a curated set of 184 real-world operators from GitHub and a collection of operators aligned with PyTorch interfaces. Unlike conventional code benchmarks prioritizing functional correctness, TritonBench also profiles efficiency performance on widely deployed GPUs aligned with industry applications. Our study reveals that current state-of-the-art code LLMs struggle to generate efficient Triton operators, highlighting a significant gap in high-performance code generation. TritonBench will be available at this https URL.

26. 【2502.14748】Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs

Link: https://arxiv.org/abs/2502.14748

Authors: Zongxia Li, Lorena Calvo-Bartolomé, Alexander Hoyle, Paiheng Xu, Alden Dima, Juan Francisco Fung, Jordan Boyd-Graber

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, large document collections, NLP

Comments: 21 pages. LLM for Data Exploration and content analysis

Abstract:A common use of NLP is to facilitate the understanding of large document collections, with a shift from using traditional topic models to Large Language Models. Yet the effectiveness of using LLMs for large corpus understanding in real-world applications remains under-explored. This study measures the knowledge users acquire with unsupervised and supervised LLM-based exploratory approaches or traditional topic models on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to the LLM generation process improves data exploration by mitigating hallucination and over-genericity but requires greater human effort. In contrast, traditional models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. We show that LLMs struggle to describe the haystack of large corpora without human help, particularly domain-specific data, and face scaling and hallucination limitations due to context length constraints. Dataset available at https://huggingface.co/datasets/zli12321/Bills.

27. 【2502.14744】HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States

Link: https://arxiv.org/abs/2502.14744

Authors: Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, Xiangyu Yue

Categories: Computation and Language (cs.CL)

Keywords: additional modalities increases, large vision-language models, language-only counterparts, integration of additional, additional modalities

Abstract:The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work, we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts, which can be leveraged to detect and mitigate adversarial inputs without requiring extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety. Experimental results show that HiddenDetect surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs. By utilizing intrinsic safety-aware patterns, our method provides an efficient and scalable solution for strengthening LVLM robustness against multimodal threats. Our code will be released publicly at this https URL.

28. 【2502.14739】SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Link: https://arxiv.org/abs/2502.14739

Authors: M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jingyang Zhang, Xiyue Zhang, Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, demonstrated remarkable proficiency, Large language, mainstream academic disciplines, computer science

Abstract:Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.

29. 【2502.14734】Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models

Link: https://arxiv.org/abs/2502.14734

Authors: Hongji Li, Andrianos Michail, Reto Gubelmann, Simon Clematide, Juri Opitz

Categories: Computation and Language (cs.CL)

Keywords: Sentence Smith framework, Sentence Smith, framework that enables, enables controlled, Smith framework

Abstract:We propose the Sentence Smith framework that enables controlled and specified manipulation of text meaning. It consists of three main steps: 1. Parsing a sentence into a semantic graph, 2. Applying human-designed semantic manipulation rules, and 3. Generating text from the manipulated graph. A final filtering step (4.) ensures the validity of the applied transformation. To demonstrate the utility of Sentence Smith in an application study, we use it to generate hard negative pairs that challenge text embedding models. Since the controllable generation makes it possible to clearly isolate different types of semantic shifts, we can gain deeper insights into the specific strengths and weaknesses of widely used text embedding models, also addressing an issue in current benchmarking where linguistic phenomena remain opaque. Human validation confirms that the generations produced by Sentence Smith are highly accurate.

30. 【2502.14718】Entity Framing and Role Portrayal in the News

Link: https://arxiv.org/abs/2502.14718

Authors: Tarek Mahmoud, Zhuohan Xie, Dimitar Dimitrov, Nikolaos Nikolaidis, Purificação Silvano, Roman Yangarber, Shivam Sharma, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov

Categories: Computation and Language (cs.CL)

Keywords: hierarchical corpus annotated, multilingual hierarchical corpus, role portrayal, corpus annotated, entity framing

Comments: 23 pages, 12 figures. Submitted to ACL Rolling Review (ARR)

Abstract:We introduce a novel multilingual hierarchical corpus annotated for entity framing and role portrayal in news articles. The dataset uses a unique taxonomy inspired by storytelling elements, comprising 22 fine-grained roles, or archetypes, nested within three main categories: protagonist, antagonist, and innocent. Each archetype is carefully defined, capturing nuanced portrayals of entities such as guardian, martyr, and underdog for protagonists; tyrant, deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for innocents. The dataset includes 1,378 recent news articles in five languages (Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two critical domains of global significance: the Ukraine-Russia War and Climate Change. Over 5,800 entity mentions have been annotated with role labels. This dataset serves as a valuable resource for research into role portrayal and has broader implications for news analysis. We describe the characteristics of the dataset and the annotation process, and we report evaluation results on fine-tuned state-of-the-art multilingual transformers and hierarchical zero-shot learning using LLMs at the level of a document, a paragraph, and a sentence.

31. 【2502.14714】From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT

Link: https://arxiv.org/abs/2502.14714

Authors: Ahmed Abdeen Hamed, Byung Suk Lee

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: LLM models present, models present opportunities, LLM model, select LLM model, LLM models

Comments: 26 pages, 6 figures, In Review with a Cell Press Journal

Abstract:The generative capabilities of LLMs present opportunities for accelerating tasks but also raise concerns about the authenticity of the knowledge they produce. To address these concerns, we present a computational approach that systematically evaluates the factual accuracy of biomedical knowledge that an LLM has been prompted to generate. Our approach encompasses two processes: the generation of disease-centric associations and their verification using the semantic knowledge of biomedical ontologies. Using ChatGPT as the selected LLM, we designed a set of prompt-engineering processes to generate linkages between diseases, drugs, symptoms, and genes to establish grounds for assessments. Experimental results demonstrate high accuracy in identifying disease terms (88%-97%), drug names (90%-91%), and genetic information (88%-98%). The symptom term identification accuracy was notably lower (49%-61%), as verified against the DOID, ChEBI, SYMPTOM, and GO ontologies accordingly. The verification of associations reveals literature coverage rates of 89%-91% among disease-drug and disease-gene associations. The low identification accuracy for symptom terms also contributed to the lower verification rates of symptom-related associations (49%-62%).

32. 【2502.14709】Data-Efficient Pretraining with Group-Level Data Influence Modeling

Link: https://arxiv.org/abs/2502.14709

Authors: Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, Chenyan Xiong

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: elevate scaling laws, shown tremendous potential, scaling laws, data, shown tremendous

Abstract:Data-efficient pretraining has shown tremendous potential to elevate scaling laws. This paper argues that effective pretraining data should be curated at the group level, treating a set of data points as a whole rather than as independent contributors. To achieve that, we propose Group-Level Data Influence Modeling (Group-MATES), a novel data-efficient pretraining method that captures and optimizes group-level data utility. Specifically, Group-MATES collects oracle group-level influences by locally probing the pretraining model with data sets. It then fine-tunes a relational data influence model to approximate oracles as relationship-weighted aggregations of individual influences. The fine-tuned model selects the data subset by maximizing its group-level influence prediction, with influence-aware clustering to enable efficient inference. Experiments on the DCLM benchmark demonstrate that Group-MATES achieves a 10% relative core score improvement on 22 downstream tasks over DCLM-Baseline and 5% over individual-influence-based methods, establishing a new state-of-the-art. Further analyses highlight the effectiveness of relational data influence models in capturing intricate interactions between data points.

33. 【2502.14693】I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search

Link: https://arxiv.org/abs/2502.14693

Authors: Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, Xinhui Wu

Categories: Computation and Language (cs.CL)

Keywords: shown remarkable potential, automating machine learning, Monte Carlo Tree, Carlo Tree Search, machine learning tasks

Abstract:Recent advancements in large language models (LLMs) have shown remarkable potential in automating machine learning tasks. However, existing LLM-based agents often struggle with low-diversity and suboptimal code generation. While recent work has introduced Monte Carlo Tree Search (MCTS) to address these issues, limitations persist in the quality and diversity of thoughts generated, as well as in the scalar value feedback mechanisms used for node selection. In this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a novel approach that iteratively expands tree nodes through an introspective process that meticulously analyzes solutions and results from parent and sibling nodes. This facilitates a continuous refinement of the node in the search tree, thereby enhancing the overall decision-making process. Furthermore, we integrate a Large Language Model (LLM)-based value model to facilitate direct evaluation of each node's solution prior to conducting comprehensive computational rollouts. A hybrid rewarding mechanism is implemented to seamlessly transition the Q-value from LLM-estimated scores to actual performance scores. This allows higher-quality nodes to be traversed earlier. Applied to the various ML tasks, our approach demonstrates a 6% absolute improvement in performance compared to the strong open-source AutoML agents, showcasing its effectiveness in enhancing agentic AutoML systems.
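
The abstract does not give the formula for the hybrid reward, so the snippet below is only one plausible reading: a linear blend that shifts weight from the LLM-estimated score to the mean of observed rollout scores. The function name, the `warmup` parameter, and the linear schedule are all assumptions made for illustration.

```python
def hybrid_q_value(llm_score, rollout_scores, warmup=3):
    """Blend an LLM-estimated score with the mean of observed rollout scores.

    Hypothetical schedule: the weight on real scores grows linearly with the
    number of rollouts and reaches 1.0 after `warmup` rollouts.
    """
    if not rollout_scores:
        # No real evaluation yet: rely entirely on the LLM estimate.
        return llm_score
    weight = min(len(rollout_scores) / warmup, 1.0)
    actual = sum(rollout_scores) / len(rollout_scores)
    return (1.0 - weight) * llm_score + weight * actual
```

Under this schedule a freshly expanded node can be ranked before any rollout has run, and its value converges to measured performance as rollouts accumulate.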

34. 【2502.14682】Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup

Link: https://arxiv.org/abs/2502.14682

Authors: Yonghui Kong, Hongbing Hu, Dan Zhang, Siyuan Chai, Fan Zhang, Wei Wang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, in-context learning capabilities, powerful in-context learning, Large language, demonstrated excellent performance

Abstract:Large language models have demonstrated excellent performance in many tasks, including Text-to-SQL, due to their powerful in-context learning capabilities. They are becoming the mainstream approach for Text-to-SQL. However, these methods still have a significant gap compared to human performance, especially on complex questions. As the complexity of questions increases, the gap between questions and SQLs increases. We identify two important gaps: the structural mapping gap and the lexical mapping gap. To tackle these two gaps, we propose PAS-SQL, an efficient SQL generation pipeline based on LLMs, which alleviates gaps through Abstract Query Pattern (AQP) and Contextual Schema Markup (CSM). AQP aims to obtain the structural pattern of the question by removing database-related information, which enables us to find structurally similar demonstrations. CSM aims to associate database-related text span in the question with specific tables or columns in the database, which alleviates the lexical mapping gap. Experimental results on the Spider and BIRD datasets demonstrate the effectiveness of our proposed method. Specifically, PAS-SQL + GPT-4o sets a new state-of-the-art on the Spider benchmark with an execution accuracy of 87.9%, and achieves leading results on the BIRD dataset with an execution accuracy of 64.67%.

35. 【2502.14678】How to Get Your LLM to Generate Challenging Problems for Evaluation

Link: https://arxiv.org/abs/2502.14678

Authors: Arkil Patel, Siva Reddy, Dzmitry Bahdanau

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, evolution of Large, necessitates new approaches

Abstract:The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.

36. 【2502.14677】Data-Constrained Synthesis of Training Data for De-Identification

Link: https://arxiv.org/abs/2502.14677

Authors: Thomas Vakili, Aron Henriksson, Hercules Dalianis

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: NER models, lack widely, privacy risks, due to privacy, synthetic NER models

Comments: Under review

Abstract:Many sensitive domains, such as the clinical domain, lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process is almost entirely contingent on the performance of the machine-annotating NER models trained using the original data.

37. 【2502.14671】Explanations of Deep Language Models Explain Language Representations in the Brain

Link: https://arxiv.org/abs/2502.14671

Authors: Maryam Rahimi (1), Yadollah Yaghoobzadeh (2 and 4), Mohammad Reza Daliri (1 and 3) ((1) Biomedical Engineering Department, School of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran, (2) Electrical and Computer Engineering Department, University of Tehran, Tehran, Iran, (3) School of Cognitive Sciences, Institute for Research in Fundamental Sciences, Tehran, Iran, (4) Tehran Institute for Advanced Studies, Khatam University, Tehran, Iran)

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)

Keywords: achieve human-like performance, share computational principles, Recent advances, large language models, advances in artificial

Abstract:Recent advances in artificial intelligence have given rise to large language models (LLMs) that not only achieve human-like performance but also share computational principles with the brain's language processing mechanisms. While previous research has primarily focused on aligning LLMs' internal representations with neural activity, we introduce a novel approach that leverages explainable AI (XAI) methods to forge deeper connections between the two domains. Using attribution methods, we quantified how preceding words contribute to an LLM's next-word predictions and employed these explanations to predict fMRI recordings from participants listening to the same narratives. Our findings demonstrate that attribution methods robustly predict brain activity across the language network, surpassing traditional internal representations in early language areas. This alignment is hierarchical: early-layer explanations correspond to the initial stages of language processing in the brain, while later layers align with more advanced stages. Moreover, the layers more influential on LLM next-word prediction (those with higher attribution scores) exhibited stronger alignment with neural activity. This work establishes a bidirectional bridge between AI and neuroscience. First, we demonstrate that attribution methods offer a powerful lens for investigating the neural mechanisms of language comprehension, revealing how meaning emerges from preceding context. Second, we propose using brain alignment as a metric to evaluate the validity of attribution methods, providing a framework for assessing their biological plausibility.

38. 【2502.14669】AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO

Link: https://arxiv.org/abs/2502.14669

Authors: Alan Dao (Gia Tuan Dao), Dinh Bach Vu

Categories: Computation and Language (cs.CL)

Keywords: demonstrated impressive capabilities, Large Language Models, requiring genuine visual, tasks requiring genuine, Large Language

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring genuine visual spatial reasoning. In this paper, we introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation. First, we leverage Supervised Fine Tuning (SFT) on a curated dataset of tokenized maze representations to teach the model to predict step-by-step movement commands. Next, we apply Group Relative Policy Optimization (GRPO), a technique used in DeepSeekR1, with a carefully crafted reward function to refine the model's sequential decision-making and encourage emergent chain-of-thought behaviors. Experimental results on synthetically generated mazes show that while a baseline model fails to navigate the maze, the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more robust and self-corrective reasoning, highlighting the potential of our approach to bridge the gap between language models and visual spatial tasks. These findings offer promising implications for applications in robotics, autonomous navigation, and other domains that require integrated visual and sequential reasoning.
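
For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization; the example rewards are made up, and the use of the population standard deviation and the `eps` constant are illustrative choices, not details taken from this paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled maze solutions to the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group's average get a positive advantage and the rest a negative one, so no separate value network is needed for the baseline.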

39. 【2502.14662】InstructAgent: Building User Controllable Recommender via LLM Agent

Link: https://arxiv.org/abs/2502.14662

Authors: Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng Zhang

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: platform recommendation algorithms, directly exposed, recommendation algorithms, users, paradigm

Comments: WWW2025@HCRS

Abstract:Traditional recommender systems usually take the user-platform paradigm, where users are directly exposed under the control of the platform's recommendation algorithms. However, the defects of recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are often designed with commercial objectives in mind, focusing on the platform's benefits, which may hinder their ability to protect and capture users' true interests. Second, these models are typically optimized using data from all users, which may overlook individual users' preferences. Due to these shortcomings, users may experience several disadvantages under the traditional user-platform direct exposure paradigm, such as lack of control over the recommender system, potential manipulation by the platform, echo chamber effects, or lack of personalization for less active users due to the dominance of active users during collaborative learning. Therefore, there is an urgent need to develop a new paradigm to protect user interests and alleviate these issues. Recently, some researchers have introduced LLM agents to simulate user behaviors, but these approaches primarily aim to optimize platform-side performance, leaving core issues in recommender systems unresolved. To address these limitations, we propose a new user-agent-platform paradigm, where the agent serves as a protective shield between the user and the recommender system and enables indirect exposure. To this end, we first construct four recommendation datasets, along with user instructions for each record.

40. 【2502.14645】Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs

Link: https://arxiv.org/abs/2502.14645

Authors: Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: requiring full retraining, full retraining, Knowledge Democracy Edit, efficient adaptation, adaptation of large

Comments

Abstract:Knowledge editing allows for efficient adaptation of large language models (LLMs) to new information or corrections without requiring full retraining. However, prior methods typically focus on either single-language editing or basic multilingual editing, failing to achieve true cross-linguistic knowledge synchronization. To address this, we present a simple and practical state-of-the-art (SOTA) recipe Cross-Lingual Knowledge Democracy Edit (X-KDE), designed to propagate knowledge from a dominant language to other languages effectively. Our X-KDE comprises two stages: (i) Cross-lingual Edition Instruction Tuning (XE-IT), which fine-tunes the model on a curated parallel dataset to modify in-scope knowledge while preserving unrelated information, and (ii) Target-language Preference Optimization (TL-PO), which applies advanced optimization techniques to ensure consistency across languages, fostering the transfer of updates. Additionally, we contribute a high-quality, cross-lingual dataset, specifically designed to enhance knowledge transfer across languages. Extensive experiments on the Bi-ZsRE and MzsRE benchmarks show that X-KDE significantly enhances cross-lingual performance, achieving an average improvement of +8.19%, while maintaining high accuracy in monolingual settings.

41. 【2502.14644】LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning

Link: https://arxiv.org/abs/2502.14644

Authors: Yansheng Mao, Yufei Xu, Jiaqi Li, Fanxu Meng, Haotong Yang, Zilong Zheng, Xiyuan Wang, Muhan Zhang

Categories: Computation and Language (cs.CL)

Keywords: language models due, large language models, Long Input, understanding remains challenging, Long context understanding

Comments: arXiv admin note: text overlap with [arXiv:2412.13626](https://arxiv.org/abs/2412.13626)

Abstract:Long context understanding remains challenging for large language models due to their limited context windows. This paper presents Long Input Fine-Tuning (LIFT), a novel framework for long-context modeling that can improve the long-context performance of arbitrary (short-context) LLMs by dynamically adapting model parameters based on the long input. Importantly, LIFT, rather than endlessly extending the context window size to accommodate increasingly longer inputs in context, chooses to store and absorb the long input in parameter. By fine-tuning the long input into model parameters, LIFT allows short-context LLMs to answer questions even when the required information is not provided in the context during inference. Furthermore, to enhance LIFT performance while maintaining the original in-context learning (ICL) capabilities, we introduce Gated Memory, a specialized attention adapter that automatically balances long input memorization and ICL. We provide a comprehensive analysis of the strengths and limitations of LIFT on long context understanding, offering valuable directions for future research.

42. 【2502.14643】Length-Controlled Margin-Based Preference Optimization without Reference Model

Link: https://arxiv.org/abs/2502.14643

Authors: Gengxu Li, Tingyu Xia, Yi Chang, Yuan Wu

Categories: Computation and Language (cs.CL)

Keywords: Direct Preference Optimization, widely adopted offline, adopted offline algorithm, preference-based reinforcement learning, Direct Preference

Comments

Abstract:Direct Preference Optimization (DPO) is a widely adopted offline algorithm for preference-based reinforcement learning from human feedback (RLHF), designed to improve training simplicity and stability by redefining reward functions. However, DPO is hindered by several limitations, including length bias, memory inefficiency, and probability degradation. To address these challenges, we propose Length-Controlled Margin-Based Preference Optimization (LMPO), a more efficient and robust alternative. LMPO introduces a uniform reference model as an upper bound for the DPO loss, enabling a more accurate approximation of the original optimization objective. Additionally, an average log-probability optimization strategy is employed to minimize discrepancies between training and inference phases. A key innovation of LMPO lies in its Length-Controlled Margin-Based loss function, integrated within the Bradley-Terry framework. This loss function regulates response length while simultaneously widening the margin between preferred and rejected outputs. By doing so, it mitigates probability degradation for both accepted and discarded responses, addressing a significant limitation of existing methods. We evaluate LMPO against state-of-the-art preference optimization techniques on two open-ended large language models, Mistral and LLaMA3, across six conditional benchmarks. Our experimental results demonstrate that LMPO effectively controls response length, reduces probability degradation, and outperforms existing approaches. The code is available at this https URL.

43. 【2502.14642】How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation

Link: https://arxiv.org/abs/2502.14642

Authors: Rui Li, Heming Xia, Xinfeng Yuan, Qingxiu Dong, Lei Sha, Wenjie Li, Zhifang Sui

Categories: Computation and Language (cs.CL)

Keywords: virtual proxies designed, garnered increasing attention, autonomously perform tasks, human digital twins, digital twins

Comments

Abstract:Recently, LLMs have garnered increasing attention across academic disciplines for their potential as human digital twins, virtual proxies designed to replicate individuals and autonomously perform tasks such as decision-making, problem-solving, and reasoning on their behalf. However, current evaluations of LLMs primarily emphasize dialogue simulation while overlooking human behavior simulation, which is crucial for digital twins. To address this gap, we introduce BehaviorChain, the first benchmark for evaluating LLMs' ability to simulate continuous human behavior. BehaviorChain comprises diverse, high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors across 1,001 unique personas, each with detailed history and profile metadata. For evaluation, we integrate persona metadata into LLMs and employ them to iteratively infer contextually appropriate behaviors within dynamic scenarios provided by BehaviorChain. Comprehensive evaluation results demonstrated that even state-of-the-art models struggle with accurately simulating continuous human behavior.

44. 【2502.14638】NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization

Link: https://arxiv.org/abs/2502.14638

Authors: Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires complex reasoning, cultural contexts, predicting the specific, specific location, requires complex

Comments

Abstract:Image geo-localization is the task of predicting the specific location of an image and requires complex reasoning across visual, geographical, and cultural contexts. While prior Vision Language Models (VLMs) have the best accuracy at this task, there is a dearth of high-quality datasets and models for analytical reasoning. We first create NaviClues, a high-quality dataset derived from GeoGuessr, a popular geography game, to supply examples of expert reasoning from language. Using this dataset, we present Navig, a comprehensive image geo-localization framework integrating global and fine-grained image information. By reasoning with language, Navig reduces the average distance error by 14% compared to previous state-of-the-art models while requiring fewer than 1000 training samples. Our dataset and code are available at this https URL.

45. 【2502.14628】PEARL: Towards Permutation-Resilient LLMs

Link: https://arxiv.org/abs/2502.14628

Authors: Liang Chen, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: large language models, capability of large, large language, provided demonstrations, ICL

Comments: ICLR 2025

Abstract:The in-context learning (ICL) capability of large language models (LLMs) enables them to perform challenging tasks using provided demonstrations. However, ICL is highly sensitive to the ordering of demonstrations, leading to instability in predictions. This paper shows that this vulnerability can be exploited to design a natural attack, difficult for model providers to detect, that achieves nearly 80% success rate on LLaMA-3 by simply permuting the demonstrations. Existing mitigation methods primarily rely on post-processing and fail to enhance the model's inherent robustness to input permutations, raising concerns about safety and reliability of LLMs. To address this issue, we propose Permutation-resilient learning (PEARL), a novel framework based on distributionally robust optimization (DRO), which optimizes model performance against the worst-case input permutation. Specifically, PEARL consists of a permutation-proposal network (P-Net) and the LLM. The P-Net generates the most challenging permutations by treating it as an optimal transport problem, which is solved using an entropy-constrained Sinkhorn algorithm. Through minimax optimization, the P-Net and the LLM iteratively optimize against each other, progressively improving the LLM's robustness. Experiments on synthetic pre-training and real-world instruction tuning tasks demonstrate that PEARL effectively mitigates permutation attacks and enhances performance. Notably, despite being trained on fewer shots and shorter contexts, PEARL achieves performance gains of up to 40% when scaled to many-shot and long-context scenarios, highlighting its efficiency and generalization capabilities.
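
For readers unfamiliar with the Sinkhorn step mentioned above, the following is a generic sketch of entropy-regularized Sinkhorn scaling, not the paper's P-Net. It turns a square cost matrix into an approximately doubly stochastic matrix, which can be read as a soft permutation. The cost matrix, `epsilon`, and the iteration count are illustrative assumptions.

```python
import math


def sinkhorn(cost, epsilon=0.5, n_iters=200):
    """Return an approximately doubly stochastic plan for a square cost matrix.

    Uses uniform (all-ones) marginals, so every row and column of the
    result should sum to about 1.
    """
    n = len(cost)
    # Gibbs kernel: a lower cost gives a larger weight.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * n
    for _ in range(n_iters):
        # Rescale rows so each row of diag(u) K diag(v) sums to 1.
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(n)) for i in range(n)]
        # Rescale columns so each column sums to 1.
        v = [1.0 / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n)] for i in range(n)]


# Hypothetical cost of placing demonstration i at position j.
cost = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
plan = sinkhorn(cost)
```

A smaller `epsilon` pushes the plan closer to a hard permutation matrix, while a larger value spreads mass more evenly across positions.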

46. 【2502.14625】Multi-Record Web Page Information Extraction From News Websites

Link: https://arxiv.org/abs/2502.14625

Authors: Alexander Kustenkov, Maksim Varlamov, Alexander Yatskov

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: massive web data, pages, web pages, problem of extracting, growing importance

Comments

Abstract:In this paper, we focused on the problem of extracting information from web pages containing many records, a task of growing importance in the era of massive web data. Recently, the development of neural network methods has improved the quality of information extraction from web pages. Nevertheless, most of the research and datasets are aimed at studying detailed pages. This has left multi-record "list pages" relatively understudied, despite their widespread presence and practical significance. To address this gap, we created a large-scale, open-access dataset specifically designed for list pages. This is the first dataset for this task in the Russian language. Our dataset contains 13,120 web pages with news lists, significantly exceeding existing datasets in both scale and complexity. Our dataset contains attributes of various types, including optional and multi-valued, providing a realistic representation of real-world list pages. These features make our dataset a valuable resource for studying information extraction from pages containing many records. Furthermore, we proposed our own multi-stage information extraction methods. In this work, we explore and demonstrate several strategies for applying MarkupLM to the specific challenges of multi-record web pages. Our experiments validate the advantages of our methods. By releasing our dataset to the public, we aim to advance the field of information extraction from multi-record pages.

47. 【2502.14620】Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity

Link: https://arxiv.org/abs/2502.14620

Authors: Xinghan Pan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: language model architecture, linear attention mechanism, Research Paraphrase Corpus, Microsoft Research Paraphrase, attention mechanism

Comments: 17 pages, 3 tables, preprint on arXiv, includes detailed analysis of RWKV for semantic similarity tasks

Abstract:This paper investigates the efficacy of RWKV, a novel language model architecture known for its linear attention mechanism, for generating sentence embeddings in a zero-shot setting. I conduct a layer-wise analysis to evaluate the semantic similarity captured by embeddings from different hidden layers of a pre-trained RWKV model. The performance is assessed on the Microsoft Research Paraphrase Corpus (MRPC) dataset using Spearman correlation and compared against a GloVe-based baseline. My results indicate that while RWKV embeddings capture some semantic relatedness, they underperform compared to the GloVe baseline in terms of Spearman correlation. I also analyze the inference time and GPU memory usage, highlighting the computational trade-offs associated with RWKV embeddings. The findings suggest that while RWKV offers potential advantages in terms of linear scaling, its zero-shot sentence embedding quality for semantic similarity tasks requires further investigation and potential task-specific fine-tuning to match or exceed simpler baselines.
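
The evaluation described above (cosine similarity of sentence embeddings, scored against paraphrase labels with Spearman correlation) can be sketched as follows. This is a minimal illustration, not the paper's code: the two-dimensional "embeddings" and labels are toy values, and no RWKV or GloVe model is loaded.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy sentence-pair embeddings with paraphrase labels (1 = paraphrase).
pairs = [
    ([1.0, 0.0], [0.9, 0.1]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.5, 0.5], [0.6, 0.4]),
    ([1.0, 0.2], [-1.0, 0.3]),
]
labels = [1, 0, 1, 0]
scores = [cosine(u, v) for u, v in pairs]
rho = spearman(scores, labels)
```

Because MRPC labels are binary, the label vector contains many ties, which is why the ranking helper averages tied ranks rather than breaking ties arbitrarily.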

48. 【2502.14619】Reward Models Identify Consistency, Not Causality

Link: https://arxiv.org/abs/2502.14619

Authors: Yuhui Xu, Hanze Dong, Lei Wang, Caiming Xiong, Junnan Li

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: aligning large language, large language models, play a crucial, aligning large, large language

Comments: 16 pages

Abstract:Reward models (RMs) play a crucial role in aligning large language models (LLMs) with human preferences and enhancing reasoning quality. Traditionally, RMs are trained to rank candidate outputs based on their correctness and coherence. However, in this work, we present several surprising findings that challenge common assumptions about RM behavior. Our analysis reveals that state-of-the-art reward models prioritize structural consistency over causal correctness. Specifically, removing the problem statement has minimal impact on reward scores, whereas altering numerical values or disrupting the reasoning flow significantly affects RM outputs. Furthermore, RMs exhibit a strong dependence on complete reasoning trajectories: truncated or incomplete steps lead to significant variations in reward assignments, indicating that RMs primarily rely on learned reasoning patterns rather than explicit problem comprehension. These findings hold across multiple architectures, datasets, and tasks, leading to three key insights: (1) RMs primarily assess coherence rather than true reasoning quality; (2) The role of explicit problem comprehension in reward assignment is overstated; (3) Current RMs may be more effective at ranking responses than verifying logical validity. Our results suggest a fundamental limitation in existing reward modeling approaches, emphasizing the need for a shift toward causality-aware reward models that go beyond consistency-driven evaluation.

49. 【2502.14614】FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis

Link: https://arxiv.org/abs/2502.14614

Authors: Mingyi Jia, Junwen Duan, Yan Song, Jianxin Wang

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Retrieval-Augmented Large Language, Language Models, Large Language, shown remarkable performance

Comments

Abstract:Retrieval-Augmented Large Language Models (LLMs), which integrate external knowledge into LLMs, have shown remarkable performance in various medical domains, including clinical diagnosis. However, existing RAG methods struggle to effectively assess task difficulty to make retrieval decisions, thereby failing to meet the clinical requirements for balancing efficiency and accuracy. In this paper, we propose FIND (Fine-grained Information Density Guided Adaptive RAG), a novel framework that improves the reliability of RAG in disease diagnosis scenarios. FIND incorporates a fine-grained adaptive control module to determine whether retrieval is necessary based on the information density of the input. By optimizing the retrieval process and implementing a knowledge filtering module, FIND ensures that the retrieval is better suited to clinical scenarios. Experiments on three Chinese electronic medical record datasets demonstrate that FIND significantly outperforms various baseline methods, highlighting its effectiveness in clinical diagnosis tasks.

50. 【2502.14613】Behavioral Analysis of Information Salience in Large Language Models

Link: https://arxiv.org/abs/2502.14613

Authors: Jan Trienes, Jörg Schlötterer, Junyi Jessy Li, Christin Seifert

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, select content based, excel at text, Language Models

Comments

Abstract:Large Language Models (LLMs) excel at text summarization, a task that requires models to select content based on its importance. However, the exact notion of salience that LLMs have internalized remains unclear. To bridge this gap, we introduce an explainable framework to systematically derive and investigate information salience in LLMs through their summarization behavior. Using length-controlled summarization as a behavioral probe into the content selection process, and tracing the answerability of Questions Under Discussion throughout, we derive a proxy for how models prioritize information. Our experiments on 13 models across four datasets reveal that LLMs have a nuanced, hierarchical notion of salience, generally consistent across model families and sizes. While models show highly consistent behavior and hence salience patterns, this notion of salience cannot be accessed through introspection, and only weakly correlates with human perceptions of information salience.

51. 【2502.14581】A Statistical Case Against Empirical Human-AI Alignment

Link: https://arxiv.org/abs/2502.14581

Authors: Julian Rodemann, Esteban Garces Arias, Christoph Luther, Christoph Jansen, Thomas Augustin

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Other Statistics (stat.OT)

Keywords: observed human behavior, human-AI alignment aims, Empirical human-AI alignment, human behavior, aims to make

Comments: 24 pages, 2 figures, 5 tables

Abstract:Empirical human-AI alignment aims to make AI systems act in line with observed human behavior. While noble in its goals, we argue that empirical alignment can inadvertently introduce statistical biases that warrant caution. This position paper thus advocates against naive empirical alignment, offering prescriptive alignment and a posteriori empirical alignment as alternatives. We substantiate our principled argument by tangible examples like human-centric decoding of language models.

52. 【2502.14565】ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification

Link: https://arxiv.org/abs/2502.14565

Authors: Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, Jihoon Tack

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: large language models, human intelligence, making its replication, language models, ability to assess

Comments

Abstract:Self-awareness, i.e., the ability to assess and correct one's own generation, is a fundamental aspect of human intelligence, making its replication in large language models (LLMs) an important yet challenging task. Previous works tackle this by employing extensive reinforcement learning or by relying on large external verifiers. In this work, we propose Refine via Intrinsic Self-Verification (ReVISE), an efficient and effective framework that enables LLMs to self-correct their outputs through self-verification. The core idea of ReVISE is to enable LLMs to verify their reasoning processes and continually rethink reasoning trajectories based on their verification. We introduce a structured curriculum based upon online preference learning to implement this efficiently. Specifically, as ReVISE involves two challenging tasks (i.e., self-verification and reasoning correction), we tackle each task sequentially using curriculum learning, collecting both failed and successful reasoning paths to construct preference pairs for efficient training. During inference, our approach enjoys natural test-time scaling by integrating self-verification and correction capabilities, further enhanced by our proposed confidence-aware decoding mechanism. Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.

53. 【2502.14561】Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs

Link: https://arxiv.org/abs/2502.14561

Authors: Paris Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis, Christos Tryfonopoulos

Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)

Keywords: open Large Language, Large Language Models, Large Language, open Large, learning and fine-tuning

Comments

Abstract:This work investigates the ability of open Large Language Models (LLMs) to predict citation intent through in-context learning and fine-tuning. Unlike traditional approaches that rely on pre-trained models like SciBERT, which require extensive domain-specific pretraining and specialized architectures, we demonstrate that general-purpose LLMs can be adapted to this task with minimal task-specific data. We evaluate twelve model variations across five prominent open LLM families using zero, one, few, and many-shot prompting to assess performance across scenarios. Our experimental study identifies the top-performing model through extensive experimentation of in-context learning-related parameters, which we fine-tune to further enhance task performance. The results highlight the strengths and limitations of LLMs in recognizing citation intents, providing valuable insights for model selection and prompt engineering. Additionally, we make our end-to-end evaluation framework and models openly available for future use.

54. 【2502.14560】Less is More: Improving LLM Alignment via Preference Data Selection

Link: https://arxiv.org/abs/2502.14560

Authors: Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Direct Preference Optimization, aligning large language, Direct Preference, large language models, DPO

Comments

Abstract:Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from the largely overlooked but critical aspect of data selection. Specifically, we address the issue of parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To accurately estimate margins for data selection, we propose a dual-margin guided approach that considers both external reward margins and implicit DPO reward margins. Extensive experiments demonstrate that our method reduces computational cost dramatically while improving performance. Remarkably, by using just 10% of the Ultrafeedback dataset, our approach achieves 3% to 8% improvements across various Llama and Mistral series models on the AlpacaEval 2.0 benchmark. Furthermore, our approach seamlessly extends to iterative DPO, yielding a roughly 3% improvement with 25% online data, while further reducing training time. These results highlight the potential of data selection strategies for advancing preference optimization.
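
The general idea of keeping only high-margin preference pairs can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's dual-margin method: it uses a single, hypothetical external reward score per response, and the field names `reward_chosen` and `reward_rejected` are invented for the example. How the paper combines external and implicit DPO margins is not described in the abstract.

```python
def select_by_margin(pairs, keep_fraction=0.1):
    """Keep the preference pairs with the largest reward margin.

    Each pair is a dict with hypothetical keys 'reward_chosen' and
    'reward_rejected'; the margin is their difference.
    """
    ranked = sorted(
        pairs,
        key=lambda p: p["reward_chosen"] - p["reward_rejected"],
        reverse=True,
    )
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


# Ten toy preference pairs with made-up reward scores.
dataset = [
    {"id": i, "reward_chosen": 1.0 + 0.1 * i, "reward_rejected": 1.0}
    for i in range(10)
]
selected = select_by_margin(dataset, keep_fraction=0.1)
```

With a keep fraction of 0.1, only the single pair with the widest gap between chosen and rejected scores survives, mirroring the 10% subset size quoted in the abstract.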

55. 【2502.14553】Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling

Link: https://arxiv.org/abs/2502.14553

Authors: Eric Egli, Matteo Manica, Jannis Born

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: promising building block, Byte Language Models, Multiscale Byte Language, Byte Language, form the basis

Comments: Under Review

Abstract:Bytes form the basis of the digital world and thus are a promising building block for multimodal foundation models. Recently, Byte Language Models (BLMs) have emerged to overcome tokenization, yet the excessive length of bytestreams requires new architectural paradigms. Therefore, we present the Multiscale Byte Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows training with context windows of 5M bytes on a single GPU in full model precision. We thoroughly examine MBLM's performance with Transformer and Mamba blocks on both unimodal and multimodal tasks. Our experiments demonstrate that hybrid architectures are efficient in handling extremely long byte sequences during training while achieving near-linear generational efficiency. To the best of our knowledge, we present the first evaluation of BLMs on visual Q&A tasks and find that, despite serializing images and the absence of an encoder, an MBLM with pure next token prediction can match custom CNN-LSTM architectures with designated classification heads. We show that MBLMs exhibit strong adaptability in integrating diverse data representations, including pixel and image filestream bytes, underlining their potential toward omnimodal foundation models. Source code is publicly available at: this https URL

56. 【2502.14541】LLM-based User Profile Management for Recommender System

Link: https://arxiv.org/abs/2502.14541

Authors: Seunghwan Bang, Hwanjun Song

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, advancement of Large, enabling zero-shot recommendation

Comments: Submitted to ACL 2025

Abstract:The rapid advancement of Large Language Models (LLMs) has opened new opportunities in recommender systems by enabling zero-shot recommendation without conventional training. Despite their potential, most existing works rely solely on users' purchase histories, leaving significant room for improvement by incorporating user-generated textual data, such as reviews and product descriptions. Addressing this gap, we propose PURE, a novel LLM-based recommendation framework that builds and maintains evolving user profiles by systematically extracting and summarizing key information from user reviews. PURE consists of three core components: a Review Extractor for identifying user preferences and key product features, a Profile Updater for refining and updating user profiles, and a Recommender for generating personalized recommendations using the most current profile. To evaluate PURE, we introduce a continuous sequential recommendation task that reflects real-world scenarios by adding reviews over time and updating predictions incrementally. Our experimental results on Amazon datasets demonstrate that PURE outperforms existing LLM-based methods, effectively leveraging long-term user information while managing token limitations.

57. 【2502.14538】LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization

Link: https://arxiv.org/abs/2502.14538

Authors: Yupeng Chang, Chenlu Guo, Yi Chang, Yuan Wu

Categories: Computation and Language (cs.CL)

Keywords: achieved remarkable success, Large Language Models, fine-tuning remains resource-intensive, full fine-tuning remains, Large Language

Comments

Abstract:Large Language Models (LLMs) have achieved remarkable success in natural language processing, but their full fine-tuning remains resource-intensive. Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), have emerged as a practical solution by approximating parameter updates with low-rank matrices. However, LoRA often exhibits a "double descent" phenomenon during fine-tuning, where model performance degrades due to overfitting and limited expressiveness caused by low-rank constraints. To address this issue, we propose LoRA-GGPO (Gradient-Guided Perturbation Optimization), a novel method that leverages gradient and weight norms to generate targeted perturbations. By optimizing the sharpness of the loss landscape, LoRA-GGPO guides the model toward flatter minima, mitigating the double descent problem and improving generalization. Extensive experiments on natural language understanding (NLU) and generation (NLG) tasks demonstrate that LoRA-GGPO outperforms LoRA and its state-of-the-art variants. Furthermore, extended experiments specifically designed to analyze the double descent phenomenon confirm that LoRA-GGPO effectively alleviates this issue, producing more robust and generalizable models. Our work provides a robust and efficient solution for fine-tuning LLMs, with broad applicability in real-world scenarios. The code is available at this https URL.
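
As background for the low-rank approximation mentioned above, here is a minimal sketch of the standard LoRA weight update, W' = W + (alpha / r) · B · A, using plain Python lists. It does not implement the gradient-guided perturbation of LoRA-GGPO, whose details are not given in the abstract; the matrix values, `alpha`, and `r` are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight used at inference.

    W is d_out x d_in (frozen), B is d_out x r and A is r x d_in
    (trainable), so the update has rank at most r.
    """
    delta = matmul(B, A)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy shapes: d_out = 2, d_in = 3, rank r = 1.
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
B = [[1.0], [0.0]]
A = [[0.1, 0.2, 0.3]]
W_adapted = lora_effective_weight(W, A, B, alpha=2.0, r=1)
```

Only `A` and `B` are trained, so the number of trainable parameters grows with `r * (d_in + d_out)` rather than `d_in * d_out`.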

58. 【2502.14529】CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models

Link: https://arxiv.org/abs/2502.14529

Authors: Zhenhong Zhou, Zherui Li, Jie Zhang, Yuanhe Zhang, Kun Wang, Yang Liu, Qing Guo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Model-based, Language Model-based Multi-Agent, Model-based Multi-Agent Systems, Large Language, Language Model-based

Comments

Abstract:Large Language Model-based Multi-Agent Systems (LLM-MASs) have demonstrated remarkable real-world capabilities, effectively collaborating to complete complex tasks. While these systems are designed with safety mechanisms, such as rejecting harmful instructions through alignment, their security remains largely unexplored. This gap leaves LLM-MASs vulnerable to targeted disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks (Corba), a novel and simple yet highly effective attack that disrupts interactions between agents within an LLM-MAS. Corba leverages two key properties: its contagious nature allows it to propagate across arbitrary network topologies, while its recursive property enables sustained depletion of computational resources. Notably, these blocking attacks often involve seemingly benign instructions, making them particularly challenging to mitigate using conventional alignment methods. We evaluate Corba on two widely-used LLM-MASs, namely, AutoGen and Camel across various topologies and commercial models. Additionally, we conduct more extensive experiments in open-ended interactive LLM-MASs, demonstrating the effectiveness of Corba in complex topology structures and open-source models. Our code is available at: this https URL.

59. 【2502.14523】Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation

Link: https://arxiv.org/abs/2502.14523

Authors: Austin A. Barr, Robert Rozman, Eddie Guo

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Real Estate Valuation, data, tabular data, Fish Measurements, Estate Valuation

Comments: 12 pages, 7 figures, 5 tables

Click to view abstract

Abstract:We propose a new framework for zero-shot generation of synthetic tabular data. Using the large language model (LLM) GPT-4o and plain-language prompting, we demonstrate the ability to generate high-fidelity tabular data without task-specific fine-tuning or access to real-world data (RWD) for pre-training. To benchmark GPT-4o, we compared the fidelity and privacy of LLM-generated synthetic data against data generated with the conditional tabular generative adversarial network (CTGAN), across three open-access datasets: Iris, Fish Measurements, and Real Estate Valuation. Despite the zero-shot approach, GPT-4o outperformed CTGAN in preserving means, 95% confidence intervals, bivariate correlations, and data privacy of RWD, even at amplified sample sizes. Notably, correlations between parameters were consistently preserved with appropriate direction and strength. However, refinement is necessary to better retain distributional characteristics. These findings highlight the potential of LLMs in tabular data synthesis, offering an accessible alternative to generative adversarial networks and variational autoencoders.

60. 【2502.14509】MultiSlav: Using Cross-Lingual Knowledge Transfer to Combat the Curse of Multilinguality

Link: https://arxiv.org/abs/2502.14509

Authors: Artur Kot, Mikołaj Koszowski, Wojciech Chojnowski, Mieszko Rutkowski, Artur Nowakowski, Kamil Guttmann, Mikołaj Pokrywka

Categories: Computation and Language (cs.CL)

Keywords: Cross-lingual Knowledge Transfer, Knowledge Transfer, Neural Machine Translation, Cross-lingual Knowledge, Slavic Neural Machine

Comments:

Click to view abstract

Abstract:Does multilingual Neural Machine Translation (NMT) lead to the Curse of Multilinguality, or does it provide Cross-lingual Knowledge Transfer within a language family? In this study, we explore multiple approaches for extending the available data-regime in NMT and we prove cross-lingual benefits even in the 0-shot translation regime for low-resource languages. With this paper, we provide state-of-the-art open-source NMT models for translating between selected Slavic languages. We released our models on the HuggingFace Hub (this https URL) under the CC BY 4.0 license. The Slavic language family comprises morphologically rich Central and Eastern European languages. Although these languages count hundreds of millions of native speakers, Slavic Neural Machine Translation is under-studied in our opinion. Recently, most NMT research has focused on one of the following: high-resource languages like English, Spanish, and German (in the WMT23 General Translation Task, 7 out of 8 task directions are from or to English); massively multilingual models covering multiple language groups; or evaluation techniques.

61. 【2502.14507】Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases

Link: https://arxiv.org/abs/2502.14507

Authors: Rena Gao, Xuetong Wu, Tatsuki Kuribayashi, Mingrui Ye, Siya Qi, Carsten Roever, Yuanxing Liu, Zheng Yuan, Jey Han Lau

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models', study evaluates Large, evaluates Large Language, evaluates Large, ability to simulate

Comments:

Click to view abstract

Abstract:This study evaluates Large Language Models' (LLMs) ability to simulate non-native-like English use observed in human second language (L2) learners interfered with by their native first language (L1). In dialogue-based interviews, we prompt LLMs to mimic L2 English learners with specific L1s (e.g., Japanese, Thai, Urdu) across seven languages, comparing their outputs to real L2 learner data. Our analysis examines L1-driven linguistic biases, such as reference word usage and avoidance behaviors, using information-theoretic and distributional density measures. Results show that modern LLMs (e.g., Qwen2.5, LLAMA3.3, DeepseekV3, GPT-4o) replicate L1-dependent patterns observed in human L2 data, with distinct influences from various languages (e.g., Japanese, Korean, and Mandarin significantly affect tense agreement, and Urdu influences noun-verb collocations). Our results reveal the potential of LLMs for L2 dialogue generation and evaluation for future educational applications.

62. 【2502.14502】How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?

Link: https://arxiv.org/abs/2502.14502

Authors: Sergey Pletenev, Maria Marina, Daniil Moskovskiy, Vasily Konovalov, Pavel Braslavski, Alexander Panchenko, Mikhail Salnikov

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, tasks is greatly, greatly limited

Comments:

Click to view abstract

Abstract:The performance of Large Language Models (LLMs) on many tasks is greatly limited by the knowledge learned during pre-training and stored in the model's parameters. Low-rank adaptation (LoRA) is a popular and efficient training technique for updating LLMs or adapting them to specific domains. In this study, we investigate how new facts can be incorporated into the LLM using LoRA without compromising the previously learned knowledge. We fine-tuned Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our experiments have shown that the best results are obtained when the training data contains a mixture of known and new facts. However, this approach is still potentially harmful because the model's performance on external question-answering benchmarks declines after such fine-tuning. When the training data is biased towards certain entities, the model tends to regress to a few overrepresented answers. In addition, we found that the model becomes more confident and refuses to provide an answer in only a few cases. These findings highlight the potential pitfalls of LoRA-based LLM updates and underscore the importance of training data composition and tuning parameters to balance new knowledge integration and general model capabilities.

63. 【2502.14501】Towards a Perspectivist Turn in Argument Quality Assessment

Link: https://arxiv.org/abs/2502.14501

Authors: Julia Romberg, Maximilian Maurer, Henning Wachsmuth, Gabriella Lapesa

Categories: Computation and Language (cs.CL)

Keywords: multiple valid assessments, unequivocal ground truth, well-established logical, unavoidably subjective, multiple valid

Comments: Accepted to NAACL 2025

Click to view abstract

Abstract:The assessment of argument quality depends on well-established logical, rhetorical, and dialectical properties that are unavoidably subjective: multiple valid assessments may exist, there is no unequivocal ground truth. This aligns with recent paths in machine learning, which embrace the co-existence of different perspectives. However, this potential remains largely unexplored in NLP research on argument quality. One crucial reason seems to be the yet unexplored availability of suitable datasets. We fill this gap by conducting a systematic review of argument quality datasets. We assign them to a multi-layered categorization targeting two aspects: (a) What has been annotated: we collect the quality dimensions covered in datasets and consolidate them in an overarching taxonomy, increasing dataset comparability and interoperability. (b) Who annotated: we survey what information is given about annotators, enabling perspectivist research and grounding our recommendations for future actions. To this end, we discuss datasets suitable for developing perspectivist models (i.e., those containing individual, non-aggregated annotations), and we showcase the importance of a controlled selection of annotators in a pilot study.

64. 【2502.14499】MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Link: https://arxiv.org/abs/2502.14499

Authors: Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: introduce Meta MLGym, introduce Meta, developing LLM agents, Meta MLGym, evaluating and developing

Comments: 35 pages, 12 figures, 10 tables

Click to view abstract

Abstract:We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-bench consists of 13 diverse and open-ended AI research tasks from diverse domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs) on our benchmarks such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, as well as develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.

65. 【2502.14497】Stories that (are) Move(d by) Markets: A Causal Exploration of Market Shocks and Semantic Shifts across Different Partisan Groups

Link: https://arxiv.org/abs/2502.14497

Authors: Felix Drinkall, Stefan Zohren, Michael McMahon, Janet B. Pierrehumbert

Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); General Economics (econ.GN)

Keywords: mutually reinforcing cycle, Macroeconomic fluctuations, reinforcing cycle, public discourse, stories that propagate

Comments:

Click to view abstract

Abstract:Macroeconomic fluctuations and the narratives that shape them form a mutually reinforcing cycle: public discourse can spur behavioural changes leading to economic shifts, which then result in changes in the stories that propagate. We show that shifts in semantic embedding space can be causally linked to financial market shocks -- deviations from the expected market behaviour. Furthermore, we show how partisanship can influence the predictive power of text for market fluctuations and shape reactions to those same shocks. We also provide some evidence that text-based signals are particularly salient during unexpected events such as COVID-19, highlighting the value of language data as an exogenous variable in economic forecasting. Our findings underscore the bidirectional relationship between news outlets and market shocks, offering a novel empirical approach to studying their effect on each other.

66. 【2502.14496】Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization

Link: https://arxiv.org/abs/2502.14496

Authors: Zhitao He, Zijun Liu, Peng Li, May Fung, Ming Yan, Ji Zhang, Fei Huang, Yang Liu

Categories: Computation and Language (cs.CL)

Keywords: made significant advancements, web browsing, made significant, significant advancements, mobile operations

Comments: 24 pages, under review

Click to view abstract

Abstract:LLM-based agents have made significant advancements in interactive environments, such as mobile operations and web browsing, and other domains beyond computer use. Current multi-agent systems universally excel in performance, compared to single agents, but struggle with generalization across environments due to predefined roles and inadequate strategies for generalizing language agents. The challenge of achieving both strong performance and good generalization has hindered the progress of multi-agent systems for interactive environments. To address these issues, we propose CollabUIAgents, a multi-agent reinforcement learning framework with a novel multi-agent credit re-assignment (CR) strategy, assigning process rewards with LLMs rather than environment-specific rewards and learning with synthesized preference data, in order to foster generalizable, collaborative behaviors among the role-free agents' policies. Empirical results show that our framework improves both performance and cross-environment generalizability of multi-agent systems. Moreover, our 7B-parameter system achieves results on par with or exceeding those of strong closed-source models and of the LLM that guides the CR. We also provide insights into using granular CR rewards effectively for environment generalization, and accommodating trained LLMs in multi-agent systems.

67. 【2502.14494】StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following

Link: https://arxiv.org/abs/2502.14494

Authors: Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu

Categories: Computation and Language (cs.CL)

Keywords: large language models, real-world applications, constitutes a core, core competency, competency of large

Comments: 18 pages, 8 figures, 8 tables

Click to view abstract

Abstract:Multi-turn instruction following capability constitutes a core competency of large language models (LLMs) in real-world applications. Existing evaluation benchmarks predominantly focus on fine-grained constraint satisfaction and domain-specific capability assessment, yet overlook the crucial structural dependency between dialogue turns that distinguishes multi-turn from single-turn interactions. This structural dependency not only reflects user intent but also establishes a second dimension for instruction following evaluation beyond constraint satisfaction. To address this gap, we propose StructFlowBench, a multi-turn instruction following benchmark with structural flow modeling. The benchmark innovatively defines a structural flow framework comprising six fundamental inter-turn relationships, which not only introduces novel structural constraints for model evaluation but also serves as generation parameters for creating customized dialogue flows tailored to specific scenarios. Adopting established LLM-based automatic evaluation methodologies, we conduct systematic evaluations of 13 leading open-source and closed-source LLMs. Experimental results reveal significant deficiencies in current models' comprehension of multi-turn dialogue structures. The code is available at this https URL.

68. 【2502.14486】How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation

Link: https://arxiv.org/abs/2502.14486

Authors: Zhuohang Long, Siyuan Wang, Shujun Liu, Yuhang Lai, Xuanjing Huang, Zhongyu Wei

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: prompts bypass generative, bypass generative models', generative models' built-in, models' built-in safety, harmful prompts bypass

Comments:

Click to view abstract

Abstract:Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem to assess model refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish between harmful and benign inputs. Using these mechanisms, we develop two ensemble defense strategies (inter-mechanism ensembles and intra-mechanism ensembles) to balance safety and helpfulness. Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these strategies effectively improve model safety or optimize the trade-off between safety and helpfulness.

69. 【2502.14482】NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models

Link: https://arxiv.org/abs/2502.14482

Authors: Chenlu Guo, Yuan Wu, Yi Chang

Categories: Computation and Language (cs.CL)

Keywords: adapting large language, Parameter-efficient fine-tuning, essential for adapting, adapting large, Singular Value Decomposition

Comments:

Click to view abstract

Abstract:Parameter-efficient fine-tuning (PEFT) is essential for adapting large language models (LLMs), with low-rank adaptation (LoRA) being the most popular approach. However, LoRA suffers from slow convergence, and some recent LoRA variants, such as PiSSA, primarily rely on Singular Value Decomposition (SVD) for initialization, leading to expensive computation. To mitigate these problems, we use the Nyström method, which follows a three-matrix manipulation. We first introduce StructuredLoRA (SLoRA), which investigates adding a small intermediate matrix between the low-rank matrices A and B. Secondly, we propose NyströmLoRA (NLoRA), which leverages Nyström-based initialization for SLoRA to improve its effectiveness and efficiency. Finally, we propose IntermediateTune (IntTune), which explores fine-tuning exclusively on the intermediate matrix of NLoRA to further boost LLM efficiency. We evaluate our methods on five natural language generation (NLG) tasks and eight natural language understanding (NLU) tasks. On GSM8K, SLoRA and NLoRA achieve accuracies of 56.48% and 57.70%, surpassing LoRA by 33.52% and 36.41%, with only 3.67 million additional trainable parameters. IntTune improves average NLG performance over LoRA by 7.45% while using only 1.25% of its parameters. These results demonstrate the efficiency and effectiveness of our approach in enhancing model performance with minimal parameter overhead.
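
To make the parameter trade-off in this abstract concrete, the sketch below counts trainable parameters for a single adapted weight matrix under the three setups it names: plain LoRA, LoRA with an added intermediate matrix, and training the intermediate matrix alone. The function names, layer shape, and rank are illustrative assumptions; the abstract does not say which layers or ranks produce its reported totals.

```python
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters of a plain LoRA update: two low-rank factors."""
    return r * d_in + d_out * r


def slora_params(d_out: int, d_in: int, r: int) -> int:
    """LoRA plus an r x r intermediate matrix placed between the two factors."""
    return lora_params(d_out, d_in, r) + r * r


def inttune_params(r: int) -> int:
    """Only the r x r intermediate matrix is trained; both factors stay frozen."""
    return r * r


if __name__ == "__main__":
    # Assumed layer shape and rank, chosen only for illustration.
    d_out, d_in, r = 4096, 4096, 128
    print("LoRA:", lora_params(d_out, d_in, r))
    print("LoRA + intermediate:", slora_params(d_out, d_in, r))
    print("intermediate only:", inttune_params(r))
```

Because the intermediate matrix has only r x r entries, its cost is independent of the layer width, which is why training it alone needs a small fraction of the LoRA parameter budget.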

70. 【2502.14477】Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression

Link: https://arxiv.org/abs/2502.14477

Authors: Haoyu Wang, Tong Teng, Tianyu Guo, An Xiao, Duyu Tang, Hanting Chen, Yunhe Wang

Categories: Computation and Language (cs.CL)

Keywords: Handling long-context sequences, large language models, Handling long-context, language models, remains a significant

Comments: 14 pages, 2 figures

Click to view abstract

Abstract:Handling long-context sequences efficiently remains a significant challenge in large language models (LLMs). Existing methods for token selection in sequence extrapolation either employ a permanent eviction strategy or select tokens by chunk, which may lead to the loss of critical information. We propose Efficient Selective Attention (ESA), a novel approach that extends context length by efficiently selecting the most critical tokens at the token level to compute attention. ESA reduces the computational complexity of token selection by compressing query and key vectors into lower-dimensional representations. We evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using open-source LLMs with context lengths of 8k and 32k. ESA outperforms other selective attention methods, especially in tasks requiring the retrieval of multiple pieces of information, achieving comparable performance to full-attention extrapolation methods across various tasks, with superior results in certain tasks.

71. 【2502.14476】Argument-Based Comparative Question Answering Evaluation Benchmark

Link: https://arxiv.org/abs/2502.14476

Authors: Irina Nikishina, Saba Anwar, Nikolay Dolgov, Maria Manina, Daria Ignatenko, Viktor Moskvoretskii, Artem Shelmanov, Tim Baldwin, Chris Biemann

Categories: Computation and Language (cs.CL)

Keywords: comparative question answering, automatic comparative question, comparative question, aim to solve, solve the problems

Comments: 8 pages, 7 Tables, 13 Figures, 18 pages with Appendix

Click to view abstract

Abstract:In this paper, we aim to solve the problems standing in the way of automatic comparative question answering. To this end, we propose an evaluation framework to assess the quality of comparative question answering summaries. We formulate 15 criteria for assessing comparative answers created using manual annotation and annotation from 6 large language models and two comparative question answering datasets. We perform our tests using several LLMs and manual annotation under different settings and demonstrate the constituency of both evaluations. Our results show that the Llama-3 70B Instruct model gives the best results for summary evaluation, while GPT-4 is the best for answering comparative questions. All used data, code, and evaluation results are publicly available at this https URL.

72. 【2502.14469】Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models

Link: https://arxiv.org/abs/2502.14469

Authors: Aurora Polo-Rodríguez, Laura Fiorini, Erika Rovini, Filippo Cavallo, Javier Medina-Quero

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Keywords: Large Language Models, leveraging Large Language, Language Models, Large Language, leveraging Large

Comments: 11 pages, 3 figures

Click to view abstract

Abstract:This work presents a novel architecture for context-aware interactions within smart environments, leveraging Large Language Models (LLMs) to enhance user experiences. Our system integrates user location data obtained through UWB tags and sensor-equipped smart homes with real-time human activity recognition (HAR) to provide a comprehensive understanding of user context. This contextual information is then fed to an LLM-powered chatbot, enabling it to generate personalised interactions and recommendations based on the user's current activity and environment. This approach moves beyond traditional static chatbot interactions by dynamically adapting to the user's real-time situation. A case study conducted from a real-world dataset demonstrates the feasibility and effectiveness of our proposed architecture, showcasing its potential to create more intuitive and helpful interactions within smart homes. The results highlight the significant benefits of integrating LLM with real-time activity and location data to deliver personalised and contextually relevant user experiences.

73. 【2502.14451】Optimal word order for non-causal text generation with Large Language Models: the Spanish case

Link: https://arxiv.org/abs/2502.14451

Authors: Andrea Busto-Castiñeira, Silvia García-Méndez, Francisco de Arriba-Pérez, Francisco J. González-Castaño

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Natural Language Generation, zero-shot inference capabilities, Natural Language, progress in Large

Comments:

Click to view abstract

Abstract:Natural Language Generation (NLG) popularity has increased owing to the progress in Large Language Models (LLMs), with zero-shot inference capabilities. However, most neural systems utilize decoder-only causal (unidirectional) transformer models, which are effective for English but may reduce the richness of languages with less strict word order, subject omission, or different relative clause attachment preferences. This is the first work that analytically addresses optimal text generation order for non-causal language models. We present a novel Viterbi algorithm-based methodology for maximum likelihood word order estimation. We analyze the non-causal most-likelihood order probability for NLG in Spanish and, then, the probability of generating the same phrases with Spanish causal NLG. This comparative analysis reveals that causal NLG prefers English-like SVO structures. We also analyze the relationship between optimal generation order and causal left-to-right generation order using Spearman's rank correlation. Our results demonstrate that the ideal order predicted by the maximum likelihood estimator is not closely related to the causal order and may be influenced by the syntactic structure of the target sentence.
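
The comparison between an estimated generation order and the causal left-to-right order relies on Spearman's rank correlation. A minimal sketch of that statistic for two orderings without ties follows; the example orders are made up for illustration and do not come from the paper.

```python
def spearman_rho(order_a: list[int], order_b: list[int]) -> float:
    """Spearman's rank correlation between two orderings of the same distinct items."""
    if len(set(order_a)) != len(order_a) or sorted(order_a) != sorted(order_b):
        raise ValueError("both orders must be permutations of the same distinct items")
    n = len(order_a)
    if n < 2:
        raise ValueError("need at least two items")
    # The rank of an item is the step at which it appears in each ordering.
    rank_a = {item: step for step, item in enumerate(order_a)}
    rank_b = {item: step for step, item in enumerate(order_b)}
    d_squared = sum((rank_a[item] - rank_b[item]) ** 2 for item in order_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    causal_order = [0, 1, 2, 3, 4]      # word positions generated left to right
    generation_order = [2, 0, 4, 1, 3]  # hypothetical non-causal generation order
    print(spearman_rho(causal_order, generation_order))
```

A value near 1 means the generation order closely follows left-to-right order, a value near -1 means it runs right to left, and values near 0 indicate little relationship between the two.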

74. 【2502.14445】PredictaBoard: Benchmarking LLM Score Predictability

Link: https://arxiv.org/abs/2502.14445

Authors: Lorenzo Pacchiardi, Konstantinos Voudouris, Ben Slater, Fernando Martínez-Plumed, José Hernández-Orallo, Lexin Zhou, Wout Schellaert

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Keywords: Large Language Models, Large Language, Language Models, possessing impressive skills, demonstrating inconsistent success

Comments:

Click to view abstract

Abstract:Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative benchmarking framework designed to evaluate the ability of score predictors (referred to as assessors) to anticipate LLM errors on specific task instances (i.e., prompts) from existing datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering the rejection rate at different tolerance errors. As such, PredictaBoard stimulates research into developing better assessors and making LLMs more predictable, not only with a higher average performance. We conduct illustrative experiments using baseline assessors and state-of-the-art LLMs. PredictaBoard highlights the critical need to evaluate predictability alongside performance, paving the way for safer AI systems where errors are not only minimised but also anticipated and effectively mitigated. Code for our benchmark can be found at this https URL

75. 【2502.14444】An Enhancement of Jiang, Z., et al.'s Compression-Based Classification Algorithm Applied to News Article Categorization

Link: https://arxiv.org/abs/2502.14444

Authors: Sean Lester C. Benavides, Cid Antonio F. Masapol, Jonathan C. Morano, Dan Michael A. Cortez

Categories: Computation and Language (cs.CL)

Keywords: study enhances Jiang, detecting semantic similarities, enhances Jiang, detecting semantic, semantic similarities

Comments: 11 pages, 5 figures, 1 table

Click to view abstract

Abstract:This study enhances Jiang et al.'s compression-based classification algorithm by addressing its limitations in detecting semantic similarities between text documents. The proposed improvements focus on unigram extraction and optimized concatenation, eliminating reliance on entire document compression. By compressing extracted unigrams, the algorithm mitigates sliding window limitations inherent to gzip, improving compression efficiency and similarity detection. The optimized concatenation strategy replaces direct concatenation with the union of unigrams, reducing redundancy and enhancing the accuracy of Normalized Compression Distance (NCD) calculations. Experimental results across datasets of varying sizes and complexities demonstrate an average accuracy improvement of 5.73%, with gains of up to 11% on datasets containing longer documents. Notably, these improvements are more pronounced in datasets with high-label diversity and complex text structures. The methodology achieves these results while maintaining computational efficiency, making it suitable for resource-constrained environments. This study provides a robust, scalable solution for text classification, emphasizing lightweight preprocessing techniques to achieve efficient compression, which in turn enables more accurate classification.
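
The two changes described in this abstract can be illustrated with a small sketch of Normalized Compression Distance using gzip. The first function is the baseline that compresses whole documents and their direct concatenation; the second compresses unigram sets and uses their union in place of concatenation. Lowercasing, whitespace tokenization, and sorting the unigrams are assumptions of this sketch, not details confirmed by the abstract, and all function names are hypothetical.

```python
import gzip


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Baseline NCD: compress each document and their direct concatenation."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def unigrams(text: str) -> set[str]:
    """Distinct lowercase whitespace-separated tokens (an assumed tokenization)."""
    return set(text.lower().split())


def unigram_ncd(x: str, y: str) -> float:
    """NCD over unigram sets, with the union replacing direct concatenation."""
    ux, uy = unigrams(x), unigrams(y)
    # Sorting makes the serialized sets deterministic (an assumption of this sketch).
    cx = compressed_len(" ".join(sorted(ux)))
    cy = compressed_len(" ".join(sorted(uy)))
    cxy = compressed_len(" ".join(sorted(ux | uy)))
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    finance = "stock markets fell sharply after the central bank raised interest rates"
    sports = "the home team scored twice in the final minutes to win the match"
    print(ncd(finance, sports), unigram_ncd(finance, sports))
```

Since shared words appear only once in the union, two documents with the same vocabulary get a distance of exactly zero in the unigram variant, whereas direct concatenation still pays a small cost for encoding the repeated text. A classifier would then assign each test document the label of its nearest training documents under this distance.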

76. 【2502.14437】Natural Language Generation

Link: https://arxiv.org/abs/2502.14437

Authors: Ehud Reiter

Categories: Computation and Language (cs.CL)

Keywords: Natural Language Generation, NLG, Language Generation, broad overview, user requirements

Comments: This is a preprint of the following work: Ehud Reiter, Natural Language Generation, 2024, Springer reproduced with permission of Springer Nature Switzerland AG. The final authenticated version is available online at: [this http URL](http://dx.doi.org/10.1007/978-3-031-68582-8)

Click to view abstract

Abstract:This book provides a broad overview of Natural Language Generation (NLG), including technology, user requirements, evaluation, and real-world applications. The focus is on concepts and insights which hopefully will remain relevant for many years, not on the latest LLM innovations. It draws on decades of work by the author and others on NLG. The book has the following chapters: Introduction to NLG; Rule-Based NLG; Machine Learning and Neural NLG; Requirements; Evaluation; Safety, Maintenance, and Testing; and Applications. All chapters include examples and anecdotes from the author's personal experiences, and end with a Further Reading section. The book should be especially useful to people working on applied NLG, including NLG researchers, people in other fields who want to use NLG, and commercial developers. It will not however be useful to people who want to understand the latest LLM technology. There is a companion site with more information at this https URL

Journal reference: Book published by Springer in 2024

Submission history: From Ehud Reiter, [v1] Thu, 20 Feb 2025 10:41:34 UTC (2,680 KB)

77. 【2502.14429】Early-Exit and Instant Confidence Translation Quality Estimation

Link: https://arxiv.org/abs/2502.14429

Authors: Vilém Zouhar, Maike Züfle, Beni Egressy, Julius Cheng, Jan Niehues

Categories: Computation and Language (cs.CL)

Keywords: Quality estimation, estimation, Quality, quality estimation model, Instant Confidence COMET

Comments:

Click to view abstract

Abstract:Quality estimation is omnipresent in machine translation, for both evaluation and generation. Unfortunately, quality estimation models are often opaque and computationally expensive, making them impractical to be part of large-scale pipelines. In this work, we tackle two connected challenges: (1) reducing the cost of quality estimation at scale, and (2) developing an inexpensive uncertainty estimation method for quality estimation. To address the latter, we introduce Instant Confidence COMET, an uncertainty-aware quality estimation model that matches the performance of previous approaches at a fraction of their costs. We extend this to Early-Exit COMET, a quality estimation model that can compute quality scores and associated confidences already at early model layers, allowing us to early-exit computations and reduce evaluation costs. We also apply our model to machine translation reranking. We combine Early-Exit COMET with an upper confidence bound bandit algorithm to find the best candidate from a large pool without having to run the full evaluation model on all candidates. In both cases (evaluation and reranking) our methods reduce the required compute by 50% with very little degradation in performance.

78. 【2502.14427】Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models

Link: https://arxiv.org/abs/2502.14427

Authors: Artem Vazhentsev, Lyudmila Rvanova, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Artem Shelmanov

Categories: Computation and Language (cs.CL)

Keywords: eliciting truthful answers, large language models, eliciting truthful, truthful answers, answers from large

Abstract:Uncertainty quantification (UQ) is a prominent approach for eliciting truthful answers from large language models (LLMs). To date, information-based and consistency-based UQ have been the dominant UQ methods for text generation via LLMs. Density-based methods, despite being very effective for UQ in text classification with encoder-based models, have not been very successful with generative LLMs. In this work, we adapt Mahalanobis Distance (MD) - a well-established UQ technique in classification tasks - for text generation and introduce a new supervised UQ method. Our method extracts token embeddings from multiple layers of LLMs, computes MD scores for each token, and uses linear regression trained on these features to provide robust uncertainty scores. Through extensive experiments on eleven datasets, we demonstrate that our approach substantially improves over existing UQ methods, providing accurate and computationally efficient uncertainty scores for both sequence-level selective generation and claim-level fact-checking tasks. Our method also exhibits strong generalization to out-of-domain data, making it suitable for a wide range of LLM-based applications.
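For readers unfamiliar with the Mahalanobis Distance mentioned above, the following minimal sketch computes it for a single embedding against a set of reference embeddings, using only the standard library. It covers the distance itself, not the paper's multi-layer feature extraction or its regression step. The helper names, the 2-d toy vectors, and the `ridge` term added to keep the covariance invertible are assumptions made for illustration.

```python
import math

def mean_and_cov(vectors, ridge=1e-3):
    """Mean and (ridge-regularized) covariance of a list of equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    mu = [sum(v[j] for v in vectors) / n for j in range(d)]
    cov = [[sum((v[a] - mu[a]) * (v[b] - mu[b]) for v in vectors) / n
            for b in range(d)] for a in range(d)]
    for a in range(d):
        cov[a][a] += ridge                 # assumption: small ridge for stability
    return mu, cov

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def mahalanobis(x, mu, cov):
    """sqrt((x - mu)^T cov^{-1} (x - mu))"""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    z = solve(cov, diff)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))

# Hypothetical 2-d "token embeddings" from reference generations.
train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mu, cov = mean_and_cov(train)
near = mahalanobis([0.6, 0.5], mu, cov)
far = mahalanobis([5.0, 5.0], mu, cov)
```

A token whose embedding lies far from the reference distribution (`far`) gets a larger distance than one close to it (`near`), which is the intuition behind using the distance as an uncertainty feature.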

79. 【2502.14425】A Survey on Data Contamination for Large Language Models

Link: https://arxiv.org/abs/2502.14425

Authors: Yuxing Cheng, Yi Chang, Yuan Wu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, demonstrated significant progress, Recent advancements, advancements in Large

Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis. However, the reliability of performance evaluation has come under scrutiny due to data contamination, the unintended overlap between training and test datasets. This overlap has the potential to artificially inflate model performance, as LLMs are typically trained on extensive datasets scraped from publicly available sources. These datasets often inadvertently overlap with the benchmarks used for evaluation, leading to an overestimation of the models' true generalization capabilities. In this paper, we first examine the definition and impacts of data contamination. Secondly, we review methods for contamination-free evaluation, focusing on three strategies: data updating-based methods, data rewriting-based methods, and prevention-based methods. Specifically, we highlight dynamic benchmarks and LLM-driven evaluation methods. Finally, we categorize contamination detection methods based on model information dependency: white-box, gray-box, and black-box detection approaches. Our survey highlights the requirements for more rigorous evaluation protocols and proposes future directions for addressing data contamination challenges.

80. 【2502.14409】Unstructured Evidence Attribution for Long Context Query Focused Summarization

Link: https://arxiv.org/abs/2502.14409

Authors: Dustin Wright, Zain Muhammad Mujahid, Lu Wang, Isabelle Augenstein, David Jurgens

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large language models, Large language, generating coherent summaries, capable of generating, generating coherent

Comments: 24 pages; 21 figures; 5 tables

Abstract:Large language models (LLMs) are capable of generating coherent summaries from very long contexts given a user query. Extracting and properly citing evidence spans could help improve the transparency and reliability of these summaries. At the same time, LLMs suffer from positional biases in terms of which information they understand and attend to, which could affect evidence citation. Whereas previous work has focused on evidence citation with predefined levels of granularity (e.g. sentence, paragraph, document, etc.), we propose the task of long-context query focused summarization with unstructured evidence citation. We show how existing systems struggle to generate and properly cite unstructured evidence from their context, and that evidence tends to be "lost-in-the-middle". To help mitigate this, we create the Summaries with Unstructured Evidence Text dataset (SUnsET), a synthetic dataset generated using a novel domain-agnostic pipeline which can be used as supervision to adapt LLMs to this task. We demonstrate across 5 LLMs of different sizes and 4 datasets with varying document types and lengths that LLMs adapted with SUnsET data generate more relevant and factually consistent evidence than their base models, extract evidence from more diverse locations in their context, and can generate more relevant and consistent summaries.

81. 【2502.14403】A Macro- and Micro-Hierarchical Transfer Learning Framework for Cross-Domain Fake News Detection

Link: https://arxiv.org/abs/2502.14403

Authors: Xuankai Yang, Yan Wang, Xiuzhen Zhang, Shoujin Wang, Huaxiong Wang, Kwok Yan Lam

Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: mitigate domain shift, improve detection performance, aims to mitigate, shift and improve, fake news detection

Comments: 11 pages, 8 figures

Abstract:Cross-domain fake news detection aims to mitigate domain shift and improve detection performance by transferring knowledge across domains. Existing approaches transfer knowledge based on news content and user engagements from a source domain to a target domain. However, these approaches face two main limitations, hindering effective knowledge transfer and optimal fake news detection performance. Firstly, from a micro perspective, they neglect the negative impact of veracity-irrelevant features in news content when transferring domain-shared features across domains. Secondly, from a macro perspective, existing approaches ignore the relationship between user engagement and news content, which reveals shared behaviors of common users across domains and can facilitate more effective knowledge transfer. To address these limitations, we propose a novel macro- and micro- hierarchical transfer learning framework (MMHT) for cross-domain fake news detection. Firstly, we propose a micro-hierarchical disentangling module to disentangle veracity-relevant and veracity-irrelevant features from news content in the source domain for improving fake news detection performance in the target domain. Secondly, we propose a macro-hierarchical transfer learning module to generate engagement features based on common users' shared behaviors in different domains for improving effectiveness of knowledge transfer. Extensive experiments on real-world datasets demonstrate that our framework significantly outperforms the state-of-the-art baselines.

82. 【2502.14394】Enhancing Portuguese Variety Identification with Cross-Domain Approaches

Link: https://arxiv.org/abs/2502.14394

Authors: Hugo Sousa, Rúben Almeida, Purificação Silvano, Inês Cantante, Ricardo Campos, Alípio Jorge

Categories: Computation and Language (cs.CL)

Keywords: produce coherent text, Recent advances, natural language processing, Brazilian Portuguese, Brazilian Portuguese corpora

Comments: AAAI 2025

Abstract:Recent advances in natural language processing have raised expectations for generative models to produce coherent text across diverse language varieties. In the particular case of the Portuguese language, the predominance of Brazilian Portuguese corpora online introduces linguistic biases in these models, limiting their applicability outside of Brazil. To address this gap and promote the creation of European Portuguese resources, we developed a cross-domain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese. Motivated by the findings of our literature review, we compiled the PtBrVarId corpus, a cross-domain LVI dataset, and study the effectiveness of transformer-based LVI classifiers for cross-domain scenarios. Although this research focuses on two Portuguese varieties, our contribution can be extended to other varieties and languages. We open source the code, corpus, and models to foster further research in this task.

83. 【2502.14389】Leveraging Small LLMs for Argument Mining in Education: Argument Component Identification, Classification, and Assessment

Link: https://arxiv.org/abs/2502.14389

Authors: Lucile Favero, Juan Antonio Pérez-Ortiz, Tanja Käser, Nuria Oliver

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: mining algorithms analyze, providing targeted feedback, students' argumentation skills, Argument mining algorithms, algorithms analyze

Abstract:Argument mining algorithms analyze the argumentative structure of essays, making them a valuable tool for enhancing education by providing targeted feedback on the students' argumentation skills. While current methods often use encoder or encoder-decoder deep learning architectures, decoder-only models remain largely unexplored, offering a promising research direction. This paper proposes leveraging open-source, small Large Language Models (LLMs) for argument mining through few-shot prompting and fine-tuning. These models' small size and open-source nature ensure accessibility, privacy, and computational efficiency, enabling schools and educators to adopt and deploy them locally. Specifically, we perform three tasks: segmentation of student essays into arguments, classification of the arguments by type, and assessment of their quality. We empirically evaluate the models on the Feedback Prize - Predicting Effective Arguments dataset of grade 6-12 student essays and demonstrate how fine-tuned small LLMs outperform baseline methods in segmenting the essays and determining the argument types, while few-shot prompting yields comparable performance to that of the baselines in assessing quality. This work highlights the educational potential of small, open-source LLMs to provide real-time, personalized feedback, enhancing independent learning and writing skills while ensuring low computational cost and privacy.

84. 【2502.14385】Tradutor: Building a Variety Specific Translation Model

Link: https://arxiv.org/abs/2502.14385

Authors: Hugo Sousa, Satya Almasian, Ricardo Campos, Alípio Jorge

Categories: Computation and Language (cs.CL)

Keywords: European Portuguese, Portuguese, European, Brazilian Portuguese, Abstract

Comments: AAAI 2025

Abstract:Language models have become foundational to many widely used systems. However, these seemingly advantageous models are double-edged swords. While they excel in tasks related to resource-rich languages like English, they often lose the fine nuances of language forms, dialects, and varieties that are inherent to languages spoken in multiple regions of the world. Languages like European Portuguese are neglected in favor of their more popular counterpart, Brazilian Portuguese, leading to suboptimal performance in various linguistic tasks. To address this gap, we introduce the first open-source translation model specifically tailored for European Portuguese, along with a novel dataset specifically designed for this task. Results from automatic evaluations on two benchmark datasets demonstrate that our best model surpasses existing open-source translation systems for Portuguese and approaches the performance of industry-leading closed-source systems for European Portuguese. By making our dataset, models, and code publicly available, we aim to support and encourage further research, fostering advancements in the representation of underrepresented language varieties.

85. 【2502.14383】Rumor Detection by Multi-task Suffix Learning based on Time-series Dual Sentiments

Link: https://arxiv.org/abs/2502.14383

Authors: Zhiwei Liu, Kailai Yang, Eduard Hovy, Sophia Ananiadou

Categories: Computation and Language (cs.CL)

Keywords: people lives, potentially leading, panic and fear, rumor detection, widespread dissemination

Comments: work in progress

Abstract:The widespread dissemination of rumors on social media has a significant impact on people's lives, potentially leading to public panic and fear. Rumors often evoke specific sentiments, resonating with readers and prompting sharing. To effectively detect and track rumors, it is essential to observe the fine-grained sentiments of both source and response message pairs as the rumor evolves over time. However, current rumor detection methods fail to account for this aspect. In this paper, we propose MSuf, the first multi-task suffix learning framework for rumor detection and tracking using time series dual (coupled) sentiments. MSuf includes three modules: (1) an LLM to extract sentiment intensity features and sort them chronologically; (2) a module that fuses the sorted sentiment features with their source text word embeddings to obtain an aligned embedding; (3) two hard prompts are combined with the aligned vector to perform rumor detection and sentiment analysis using one frozen LLM. MSuf effectively enhances the performance of LLMs for rumor detection with only minimal parameter fine-tuning. Evaluating MSuf on four rumor detection benchmarks, we find significant improvements compared to other emotion-based methods.

86. 【2502.14380】Affinity and Diversity: A Unified Metric for Demonstration Selection via Internal Representations

Link: https://arxiv.org/abs/2502.14380

Authors: Mariko Kato, Hakaze Cho, Yoshihiro Sakai, Naoya Inoue

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: In-Context Learning, performance of In-Context, highly sensitive, Learning, selected demonstrations

Comments: 8 pages, 10 figures

Abstract:The performance of In-Context Learning (ICL) is highly sensitive to the selected demonstrations. Existing approaches to demonstration selection optimize different objectives, yielding inconsistent results. To address this, we propose a unified metric (affinity and diversity) that leverages the ICL model's internal representations. Our experiments show that both affinity and diversity strongly correlate with test accuracies, indicating their effectiveness for demonstration selection. Moreover, we show that our proposed metrics align well with various previous works to unify the inconsistency.

87. 【2502.14376】A Similarity Paradigm Through Textual Regularization Without Forgetting

Link: https://arxiv.org/abs/2502.14376

Authors: Fangming Cui, Jan Fong, Rongfei Zeng, Xinmei Tian, Jun Yu

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: adapting pre-trained visual-language, pre-trained visual-language models, hand-crafted prompts, adapting pre-trained, pre-trained visual-language

Abstract:Prompt learning has emerged as a promising method for adapting pre-trained visual-language models (VLMs) to a range of downstream tasks. While optimizing the context can be effective for improving performance on specific tasks, it can often lead to poor generalization performance on unseen classes or datasets sampled from different distributions. It may be attributed to the fact that textual prompts tend to overfit downstream data distributions, leading to the forgetting of generalized knowledge derived from hand-crafted prompts. In this paper, we propose a novel method called Similarity Paradigm with Textual Regularization (SPTR) for prompt learning without forgetting. SPTR is a two-pronged design based on hand-crafted prompts that is an inseparable framework. 1) To avoid forgetting general textual knowledge, we introduce the optimal transport as a textual regularization to finely ensure approximation with hand-crafted features and tuning textual features. 2) In order to continuously unleash the general ability of multiple hand-crafted prompts, we propose a similarity paradigm for natural alignment score and adversarial alignment score to improve model robustness for generalization. Both modules share a common objective in addressing generalization issues, aiming to maximize the generalization capability derived from multiple hand-crafted prompts. Four representative tasks (i.e., non-generalization few-shot learning, base-to-novel generalization, cross-dataset generalization, domain generalization) across 11 datasets demonstrate that SPTR outperforms existing prompt learning methods.

88. 【2502.14366】Entropy-UID: A Method for Optimizing Information Density

Link: https://arxiv.org/abs/2502.14366

Authors: Xinpeng Shou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: efficient information flow, Uniform Information Density, flow is essential, essential for optimizing, optimizing language generation

Comments: 5 pages, 1 figure, submitting to ACL 2025

Abstract:Balanced and efficient information flow is essential for optimizing language generation models. In this work, we propose Entropy-UID, a new token selection method that balances entropy and Uniform Information Density (UID) principles for enhanced efficiency of text generation. Our approach adaptively adjusts token selection by jointly minimizing entropy and surprisal, promoting more even information distribution across generated sequences. Theoretical validation demonstrates that Entropy-UID optimally reduces information spikes while maintaining fluency and coherence. The method has been evaluated using information-theoretic metrics on multiple benchmark datasets, including WikiText-2, OpenWebText, and WMT. Experimental results show that Entropy-UID achieves lower surprisal and entropy variance compared to standard GPT-2 and alternative heuristics, leading to more balanced and human-like text generation. Our findings point towards the potential of leveraging information-theoretic constraints to refine token selection strategies in autoregressive language models.
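The quantities named in this abstract have standard definitions that are easy to compute. The sketch below shows surprisal, Shannon entropy, and the variance of surprisal across a sequence as one simple proxy for how evenly information is spread. It does not reproduce the Entropy-UID selection rule, which the abstract does not specify; the function names and the toy probabilities are hypothetical.

```python
import math

def surprisal(p):
    """Surprisal in bits of an outcome with probability p."""
    return -math.log2(p)

def entropy(dist):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def surprisal_variance(token_probs):
    """Variance of per-token surprisal; lower means more uniform information."""
    s = [surprisal(p) for p in token_probs]
    mean = sum(s) / len(s)
    return sum((x - mean) ** 2 for x in s) / len(s)

# Hypothetical per-token probabilities assigned by a language model.
even = surprisal_variance([0.5, 0.5, 0.5, 0.5])
spiky = surprisal_variance([0.9, 0.9, 0.9, 0.01])
```

The second sequence contains one very unlikely token, an "information spike" in the abstract's terms, so its surprisal variance is higher than that of the evenly distributed sequence.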

89. 【2502.14359】Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests

Link: https://arxiv.org/abs/2502.14359

Authors: Filippo Momentè, Alessandro Suglia, Mario Giulianelli, Ambra Ferrari, Alexander Koller, Oliver Lemon, David Schlangen, Raquel Fernández, Raffaella Bernardi

Categories: Computation and Language (cs.CL)

Keywords: MMLU and BBH, large question-answering benchmarks, Signalling Games, evaluation paradigms, large question-answering

Abstract:We examine three evaluation paradigms: large question-answering benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the former two (benchmarks or games) is most effective at discriminating LLMs of varying quality. Then, inspired by human cognitive assessments, we compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use, and we investigate their correlation with model performance in benchmarks and games. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models. Causal and logical reasoning correlate with both static and interactive tests, while differences emerge regarding core executive functions and social/emotional skills, which correlate more with games. We advocate the development of new interactive benchmarks and targeted cognitive tasks inspired by assessing human abilities but designed specifically for LLMs.

90. 【2502.14356】Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning

Link: https://arxiv.org/abs/2502.14356

Authors: Huimin Xu, Xin Mao, Feng-Lin Li, Xiaobao Wu, Wang Chen, Wei Zhang, Anh Tuan Luu

Categories: Computation and Language (cs.CL)

Keywords: Direct Preference Optimization, Direct Preference, Preference Optimization, struggles with long-chain, Direct

Abstract:Direct Preference Optimization (DPO) often struggles with long-chain mathematical reasoning. Existing approaches, such as Step-DPO, typically improve this by focusing on the first erroneous step in the reasoning chain. However, they overlook all other steps and rely heavily on humans or GPT-4 to identify erroneous steps. To address these issues, we propose Full-Step-DPO, a novel DPO framework tailored for mathematical reasoning. Instead of optimizing only the first erroneous step, it leverages step-wise rewards from the entire reasoning chain. This is achieved by training a self-supervised process reward model, which automatically scores each step, providing rewards while avoiding reliance on external signals. Furthermore, we introduce a novel step-wise DPO loss, which dynamically updates gradients based on these step-wise rewards. This endows stronger reasoning capabilities to language models. Extensive evaluations on both in-domain and out-of-domain mathematical reasoning benchmarks across various base language models, demonstrate that Full-Step-DPO achieves superior performance compared to state-of-the-art baselines.

91. 【2502.14354】Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment

Link: https://arxiv.org/abs/2502.14354

Authors: Moxin Li, Yuantao Zhang, Wenjie Wang, Wentao Shi, Zhuo Liu, Fuli Feng, Tat-Seng Chua

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Direct Preference Optimization, multiple human preference, align LLMs' responses, Direct Preference, human preference objectives

Comments: Under review

Abstract:Multi-Objective Alignment (MOA) aims to align LLMs' responses with multiple human preference objectives, with Direct Preference Optimization (DPO) emerging as a prominent approach. However, we find that DPO-based MOA approaches suffer from widespread preference conflicts in the data, where different objectives favor different responses. This results in conflicting optimization directions, hindering the optimization on the Pareto Front. To address this, we propose to construct Pareto-optimal responses to resolve preference conflicts. To efficiently obtain and utilize such responses, we propose a self-improving DPO framework that enables LLMs to self-generate and select Pareto-optimal responses for self-supervised preference alignment. Extensive experiments on two datasets demonstrate the superior Pareto Front achieved by our framework compared to various baselines. Code is available at this https URL.
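As a reminder of what "Pareto-optimal" means when responses are scored on several objectives, here is a minimal sketch of a dominance filter. It only illustrates the standard definition; how the paper generates, scores, and selects responses is not described in the abstract, and the response labels and the two score dimensions (say, helpfulness and harmlessness) are hypothetical.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b on every objective
    and strictly better on at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(scored):
    """Keep the responses whose score vectors no other response dominates.

    scored is a list of (response, scores) pairs.
    """
    return [name for name, s in scored
            if not any(dominates(t, s) for _, t in scored)]

# Hypothetical responses scored on two objectives.
candidates = [
    ("A", (0.9, 0.2)),
    ("B", (0.4, 0.8)),
    ("C", (0.3, 0.7)),   # worse than B on both objectives
    ("D", (0.9, 0.1)),   # tied with A on the first, worse on the second
]
front = pareto_front(candidates)
```

Responses A and B conflict (each wins on one objective), so both stay on the front, while C and D are dropped because another response is at least as good everywhere.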

92. 【2502.14352】SR-LLM: Rethinking the Structured Representation in Large Language Model

Link: https://arxiv.org/abs/2502.14352

Authors: Jiahuan Zhang, Tianheng Wang, Hanqing Wu, Ziyi Huang, Yulong Wu, Dongbai Chen, Linfeng Song, Yue Zhang, Guozheng Rao, Kaicheng Yu

Categories: Computation and Language (cs.CL)

Keywords: Abstract Meaning Representation, Abstract Meaning, Meaning Representation, computational linguistics, exemplified by Abstract

Abstract:Structured representations, exemplified by Abstract Meaning Representation (AMR), have long been pivotal in computational linguistics. However, their role remains ambiguous in the Large Language Models (LLMs) era. Initial attempts to integrate structured representation into LLMs via a zero-shot setting yielded inferior performance. We hypothesize that such a decline stems from the structure information being passed into LLMs in a code format unfamiliar to LLMs' training corpora. Consequently, we propose SR-LLM, an innovative framework with two settings to explore a superior way of integrating structured representation with LLMs from training-free and training-dependent perspectives. The former integrates structural information through natural language descriptions in LLM prompts, whereas its counterpart augments the model's inference capability through fine-tuning on linguistically described structured representations. Performance improvements were observed in widely used downstream datasets, with particularly notable gains of 3.17% and 12.38% in PAWS. To the best of our knowledge, this work represents the pioneering demonstration that leveraging structural representations can substantially enhance LLMs' inference capability. We hope that our work sheds light and encourages future research to enhance the reasoning and interoperability of LLMs with structured data.

93. 【2502.14340】Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

Link: https://arxiv.org/abs/2502.14340

Authors: Ruichen Shao, Bei Li, Gangao Liu, Yang Chen, Xiang Zhou, Jingang Wang, Xunliang Cai, Peng Li

Categories: Computation and Language (cs.CL)

Keywords: Direct Preference Optimization, aligning large language, large language models, Direct Preference, gained attention

Comments: Accepted by ICLR 2025

Abstract:Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers from a length bias, generating responses longer than those from the reference model. Existing solutions like SimPO and SamPO address this issue but uniformly treat the contribution of rewards across sequences, overlooking temporal dynamics. To this end, we propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter. This dynamic weighting mechanism adjusts the influence of each reward based on its position in the sequence, prioritizing earlier tokens that are more critical for alignment. By adaptively focusing on more relevant feedback, our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences. Experimental results on several benchmarks show that our approach consistently outperforms vanilla DPO by 5.9-8.8 points on AlpacaEval 2 and 3.3-9.7 points on Arena-Hard across different model architectures and sizes. Furthermore, additional experiments on mathematical and reasoning benchmarks (MMLU, GSM8K, and MATH) confirm that our method enhances performance without compromising general capabilities. Our codebase will be available at this https URL.
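One way to picture a position-dependent weighting of the DPO reward is a geometric decay over token positions. The sketch below assumes a weight of gamma^t on the per-token log-probability ratio at position t; the abstract only says the decay is "controlled by a gamma parameter", so this exact form, the function names, and the default `beta` and `gamma` values are assumptions rather than the authors' objective. With `gamma = 1` the reward reduces to the usual summed log-ratio.

```python
import math

def decayed_reward(policy_logps, ref_logps, beta=0.1, gamma=0.95):
    """Implicit reward with a geometric decay over token positions.

    policy_logps / ref_logps are per-token log-probabilities of one response
    under the policy and the reference model (hypothetical inputs).
    """
    return beta * sum((gamma ** t) * (p - r)
                      for t, (p, r) in enumerate(zip(policy_logps, ref_logps)))

def preference_loss(chosen, rejected, beta=0.1, gamma=0.95):
    """-log(sigmoid(margin)) for one preference pair.

    chosen and rejected are (policy_logps, ref_logps) tuples.
    """
    margin = (decayed_reward(*chosen, beta, gamma)
              - decayed_reward(*rejected, beta, gamma))
    return math.log1p(math.exp(-margin))

# The same one-nat improvement counts more when it occurs at the first token.
early = decayed_reward([-1.0, -2.0], [-2.0, -2.0], beta=1.0, gamma=0.5)
late = decayed_reward([-2.0, -1.0], [-2.0, -2.0], beta=1.0, gamma=0.5)
```

Here `early` is twice `late` because the second position is weighted by 0.5, which is the sense in which earlier tokens contribute more under this assumed weighting.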

94. 【2502.14338】English Please: Evaluating Machine Translation for Multilingual Bug Reports

Link: https://arxiv.org/abs/2502.14338

Authors: Avinash Patil, Aryan Jadon

Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: global software development, Visual Studio Code, Accurate translation, AWS Translate, software development

Comments: 8 Pages, 4 Figures, 3 Tables

Abstract:Accurate translation of bug reports is critical for efficient collaboration in global software development. In this study, we conduct the first comprehensive evaluation of machine translation (MT) performance on bug reports, analyzing the capabilities of DeepL, AWS Translate, and ChatGPT using data from the Visual Studio Code GitHub repository, specifically focusing on reports labeled with the english-please tag. To thoroughly assess the accuracy and effectiveness of each system, we employ multiple machine translation metrics, including BLEU, BERTScore, COMET, METEOR, and ROUGE. Our findings indicate that DeepL consistently outperforms the other systems across most automatic metrics, demonstrating strong lexical and semantic alignment. AWS Translate performs competitively, particularly in METEOR, while ChatGPT lags in key metrics. This study underscores the importance of domain adaptation for translating technical texts and offers guidance for integrating automated translation into bug-triaging workflows. Moreover, our results establish a foundation for future research to refine machine translation solutions for specialized engineering contexts. The code and dataset for this paper are available at GitHub: this https URL.

95. 【2502.14335】Information Types in Product Reviews

Link: https://arxiv.org/abs/2502.14335

Authors: Ori Shapira, Yuval Pinter

Categories: Computation and Language (cs.CL)

Keywords: text is communicated, Information in text, Information, types of information, product review domain

Abstract:Information in text is communicated in a way that supports a goal for its reader. Product reviews, for example, contain opinions, tips, product descriptions, and many other types of information that provide both direct insights, as well as unexpected signals for downstream applications. We devise a typology of 24 communicative goals in sentences from the product review domain, and employ a zero-shot multi-label classifier that facilitates large-scale analyses of review data. In our experiments, we find that the combination of classes in the typology forecasts helpfulness and sentiment of reviews, while supplying explanations for these decisions. In addition, our typology enables analysis of review intent, effectiveness and rhetorical structure. Characterizing the types of information in reviews unlocks many opportunities for more effective consumption of this genre.

96. 【2502.14333】A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics

Link: https://arxiv.org/abs/2502.14333

Authors: Ting-Ruen Wei, Haowei Liu, Xuyang Wu, Yi Fang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, encouraging problem solving, Recent progress, language models, progress in large

Abstract:Recent progress in large language models (LLMs) found chain-of-thought prompting strategies to improve the reasoning ability of LLMs by encouraging problem solving through multiple steps. Therefore, subsequent research aimed to integrate the multi-step reasoning process into the LLM itself through process rewards as feedback and achieved improvements over prompting strategies. Due to the cost of step-level annotation, some turn to outcome rewards as feedback. Aside from these training-based approaches, training-free techniques leverage frozen LLMs or external tools for feedback at each step to enhance the reasoning process. With the abundance of work in mathematics due to its logical nature, we present a survey of strategies utilizing feedback at the step and outcome levels to enhance multi-step math reasoning for LLMs. As multi-step reasoning emerges as a crucial component in scaling LLMs, we hope to establish its foundation for easier understanding and empower further research.

97. 【2502.14321】Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems

Link: https://arxiv.org/abs/2502.14321

Authors: Bingyu Yan, Xiaoming Zhang, Litian Zhang, Lian Zhang, Ziyi Zhou, Dezhuang Miao, Chaozhuo Li

Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)

Keywords: Large Language Models, recently demonstrated remarkable, demonstrated remarkable capabilities, Large Language, Language Models

Abstract:Large Language Models (LLMs) have recently demonstrated remarkable capabilities in reasoning, planning, and decision-making. Building upon these strengths, researchers have begun incorporating LLMs into multi-agent systems (MAS), where agents collaborate or compete through natural language interactions to tackle tasks beyond the scope of single-agent setups. In this survey, we present a communication-centric perspective on LLM-based multi-agent systems, examining key system-level features such as architecture design and communication goals, as well as internal mechanisms like communication strategies, paradigms, objects and content. We illustrate how these communication elements interplay to enable collective intelligence and flexible collaboration. Furthermore, we discuss prominent challenges, including scalability, security, and multimodal integration, and propose directions for future work to advance research in this emerging domain. Ultimately, this survey serves as a catalyst for further innovation, fostering more robust, scalable, and intelligent multi-agent systems across diverse application domains.

98. 【2502.14318】Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models

链接https://arxiv.org/abs/2502.14318

作者:James Fodor

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large language models, Large language, regularly demonstrate, wide range, Large

备注: 10 pages

点击查看摘要

Abstract:Large language models (LLMs) regularly demonstrate new and impressive performance on a wide range of language, knowledge, and reasoning benchmarks. Such rapid progress has led many commentators to argue that LLM general cognitive capabilities have likewise rapidly improved, with the implication that such models are becoming progressively more capable on various real-world tasks. Here I summarise theoretical and empirical considerations to challenge this narrative. I argue that inherent limitations with the benchmarking paradigm, along with specific limitations of existing benchmarks, render benchmark performance highly unsuitable as a metric for generalisable competence over cognitive tasks. I also contend that alternative methods for assessing LLM capabilities, including adversarial stimuli and interpretability techniques, have shown that LLMs do not have robust competence in many language and reasoning tasks, and often fail to learn representations which facilitate generalisable inferences. I conclude that benchmark performance should not be used as a reliable indicator of general LLM cognitive capabilities.

99. 【2502.14317】ParallelComp: Parallel Long-Context Compressor for Length Extrapolation

链接https://arxiv.org/abs/2502.14317

作者:Jing Xiong,Jianghan Shen,Chuanyang Zheng,Zhongwei Wan,Chenyang Zhao,Chiwun Yang,Fanghua Ye,Hongxia Yang,Lingpeng Kong,Ngai Wong

类目:Computation and Language (cs.CL)

关键词:Efficiently handling long, handling long contexts, large language models, handling long, crucial for large

备注: We will release the code soon

点击查看摘要

Abstract:Efficiently handling long contexts is crucial for large language models (LLMs). While rotary position embeddings (RoPEs) enhance length generalization, effective length extrapolation remains challenging and often requires costly fine-tuning. In contrast, recent training-free approaches suffer from the attention sink phenomenon, leading to severe performance degradation. In this paper, we introduce ParallelComp, a novel training-free method for long-context extrapolation that extends LLMs' context length from 4K to 128K while maintaining high throughput and preserving perplexity, and integrates seamlessly with Flash Attention. Our analysis offers new insights into attention biases in parallel attention mechanisms and provides practical solutions to tackle these challenges. To mitigate the attention sink issue, we propose an attention calibration strategy that reduces biases, ensuring more stable long-range attention. Additionally, we introduce a chunk eviction strategy to efficiently manage ultra-long contexts on a single A100 80GB GPU. To further enhance efficiency, we propose a parallel KV cache eviction technique, which improves chunk throughput by 1.76x, thereby achieving a 23.50x acceleration in the prefilling stage with negligible performance loss due to attention calibration. Furthermore, ParallelComp achieves 91.17% of GPT-4's performance on long-context tasks using an 8B model trained on 8K-length context, outperforming powerful closed-source models such as Claude-2 and Kimi-Chat.

100. 【2502.14315】Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension

链接https://arxiv.org/abs/2502.14315

作者:Amir Hossein Yari,Fajri Koto

类目:Computation and Language (cs.CL)

关键词:remains largely unexplored, multilingual large language, natural language processing, culture-specific content, remains largely

备注

点击查看摘要

Abstract:Despite the impressive performance of multilingual large language models (mLLMs) in various natural language processing tasks, their ability to understand procedural texts, particularly those with culture-specific content, remains largely unexplored. Texts describing cultural procedures, including rituals, traditional craftsmanship, and social etiquette, require an inherent understanding of cultural context, presenting a significant challenge for mLLMs. In this work, we introduce CAPTex, a benchmark designed to evaluate mLLMs' ability to process and reason about culturally diverse procedural texts across multiple languages using various methodologies to assess their performance. Our findings indicate that (1) mLLMs face difficulties with culturally contextualized procedural texts, showing notable performance declines in low-resource languages, (2) model performance fluctuates across cultural domains, with some areas presenting greater difficulties, and (3) language models exhibit better performance on multiple-choice tasks within conversational frameworks compared to direct questioning. These results underscore the current limitations of mLLMs in handling culturally nuanced procedural texts and highlight the need for culturally aware benchmarks like CAPTex to enhance their adaptability and comprehension across diverse linguistic and cultural landscapes.

101. 【2502.14311】The Impact and Feasibility of Self-Confidence Shaping for AI-Assisted Decision-Making

链接https://arxiv.org/abs/2502.14311

作者:Takehiro Takayanagi,Ryuji Hashimoto,Chung-Chi Chen,Kiyoshi Izumi

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:AI-assisted decision-making, finance and healthcare, crucial but challenging, challenging for humans, humans to appropriately

备注

点击查看摘要

Abstract:In AI-assisted decision-making, it is crucial but challenging for humans to appropriately rely on AI, especially in high-stakes domains such as finance and healthcare. This paper addresses this problem from a human-centered perspective by presenting an intervention for self-confidence shaping, designed to calibrate self-confidence at a targeted level. We first demonstrate the impact of self-confidence shaping by quantifying the upper-bound improvement in human-AI team performance. Our behavioral experiments with 121 participants show that self-confidence shaping can improve human-AI team performance by nearly 50% by mitigating both over- and under-reliance on AI. We then introduce a self-confidence prediction task to identify when our intervention is needed. Our results show that simple machine-learning models achieve 67% accuracy in predicting self-confidence. We further illustrate the feasibility of such interventions. The observed relationship between sentiment and self-confidence suggests that modifying sentiment could be a viable strategy for shaping self-confidence. Finally, we outline future research directions to support the deployment of self-confidence shaping in a real-world scenario for effective human-AI collaboration.

102. 【2502.14302】MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models

链接https://arxiv.org/abs/2502.14302

作者:Shrey Pandit,Jiawei Xu,Junyuan Hong,Zhangyang Wang,Tianlong Chen,Kaidi Xu,Ying Ding

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, Advancements in Large, Large Language, question-answering necessitate rigorous, necessitate rigorous evaluation

备注: Code and dataset are available at [this https URL](https://medhallu.github.io/)

点击查看摘要

Abstract:Advancements in Large Language Models (LLMs) and their increasing use in medical question-answering necessitate rigorous evaluation of their reliability. A critical challenge lies in hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,000 high-quality question-answer pairs derived from PubMedQA, with hallucinated answers systematically generated through a controlled pipeline. Our experiments show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task, with the best model achieving an F1 score as low as 0.625 for detecting "hard" category hallucinations. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth. Through experiments, we also show incorporating domain-specific knowledge and introducing a "not sure" category as one of the answer categories improves the precision and F1 scores by up to 38% relative to baselines.

103. 【2502.14301】SEA-HELM: Southeast Asian Holistic Evaluation of Language Models

链接https://arxiv.org/abs/2502.14301

作者:Yosephine Susanto,Adithya Venkatadri Hulagadri,Jann Railey Montalan,Jian Gang Ngui,Xian Bin Yong,Weiqi Leong,Hamsawardhini Rengarajan,Peerat Limkonchotiwat,Yifan Mai,William Chandra Tjhi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Language Models, Large Language, rapid emergence, SEA languages

备注

点击查看摘要

Abstract:With the rapid emergence of novel capabilities in Large Language Models (LLMs), the need for rigorous multilingual and multicultural benchmarks that are integrated has become more pronounced. Though existing LLM benchmarks are capable of evaluating specific capabilities of LLMs in English as well as in various mid- to low-resource languages, including those in the Southeast Asian (SEA) region, a comprehensive and authentic evaluation suite for the SEA languages has not been developed thus far. Here, we present SEA-HELM, a holistic linguistic and cultural LLM evaluation suite that emphasizes SEA languages, comprising five core pillars: (1) NLP Classics, (2) LLM-specifics, (3) SEA Linguistics, (4) SEA Culture, (5) Safety. SEA-HELM currently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese. We also introduce the SEA-HELM leaderboard, which allows users to understand models' multilingual and multicultural performance in a systematic and user-friendly manner.

104. 【2502.14289】Drift: Decoding-time Personalized Alignments with Implicit User Preferences

链接https://arxiv.org/abs/2502.14289

作者:Minbeom Kim,Kang-il Lee,Seongho Joo,Hwaran Lee,Minbeom Kim

类目:Computation and Language (cs.CL)

关键词:large language models, alignments for individual, long-standing goal, goal in large, large language

备注: 19 pages, 6 figures

点击查看摘要

Abstract:Personalized alignments for individual users have been a long-standing goal in large language models (LLMs). We introduce Drift, a novel framework that personalizes LLMs at decoding time with implicit user preferences. Traditional Reinforcement Learning from Human Feedback (RLHF) requires thousands of annotated examples and expensive gradient updates. In contrast, Drift personalizes LLMs in a training-free manner, using only a few dozen examples to steer a frozen model through efficient preference modeling. Our approach models user preferences as a composition of predefined, interpretable attributes and aligns them at decoding time to enable personalized generation. Experiments on both a synthetic persona dataset (Perspective) and a real human-annotated dataset (PRISM) demonstrate that Drift significantly outperforms RLHF baselines while using only 50-100 examples. Our results and analysis show that Drift is both computationally efficient and interpretable.

105. 【2502.14285】Vulnerability of Text-to-Image Models to Prompt Template Stealing: A Differential Evolution Approach

链接https://arxiv.org/abs/2502.14285

作者:Yurong Wu,Fangwen Mu,Qiuhong Zhang,Jinjing Zhao,Xinrun Xu,Lingrui Mei,Yang Wu,Lin Shi,Junjie Wang,Zhiming Ding,Yiwei Wang

类目:Computation and Language (cs.CL)

关键词:significant intellectual property, intellectual property concern, vendors entice users, showcasing sample images, recent years

备注: 14 pages,8 figures,4 tables

点击查看摘要

Abstract:Prompt trading has emerged as a significant intellectual property concern in recent years, where vendors entice users by showcasing sample images before selling prompt templates that can generate similar images. This work investigates a critical security vulnerability: attackers can steal prompt templates using only a limited number of sample images. To investigate this threat, we introduce Prism, a prompt-stealing benchmark consisting of 50 templates and 450 images, organized into Easy and Hard difficulty levels. To identify the vulnerability of VLMs to prompt stealing, we propose EvoStealer, a novel template stealing method that operates without model fine-tuning by leveraging differential evolution algorithms. The system first initializes population sets using multimodal large language models (MLLMs) based on predefined patterns, then iteratively generates enhanced offspring through MLLMs. During evolution, EvoStealer identifies common features across offspring to derive generalized templates. Our comprehensive evaluation conducted across open-source (INTERNVL2-26B) and closed-source models (GPT-4o and GPT-4o-mini) demonstrates that EvoStealer's stolen templates can reproduce images highly similar to originals and effectively generalize to other subjects, significantly outperforming baseline methods with an average improvement of over 10%. Moreover, our cost analysis reveals that EvoStealer achieves template stealing with negligible computational expenses. Our code and dataset are available at this https URL.

106. 【2502.14280】EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts

链接https://arxiv.org/abs/2502.14280

作者:Subhajit Chaudhury,Payel Das,Sarathkrishna Swaminathan,Georgios Kollias,Elliot Nelson,Khushbu Pahwa,Tejaswini Pedapati,Igor Melnyk,Matthew Riemer

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Language Models, Large Language, yielded impressive successes, Recent advances

备注

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have yielded impressive successes on many language tasks. However, efficient processing of long contexts using LLMs remains a significant challenge. We introduce EpMAN, a method for processing long contexts in an episodic memory module while holistically attending to semantically relevant context chunks. The output of episodic attention is then used to reweigh the decoder's self-attention to the stored KV cache of the context during training and generation. When an LLM decoder is trained using EpMAN, its performance on multiple challenging single-hop long-context recall and question-answering benchmarks is found to be stronger and more robust across the range from 16k to 256k tokens than baseline decoders trained with self-attention, and popular retrieval-augmented generation frameworks.

107. 【2502.14276】STeCa: Step-level Trajectory Calibration for LLM Agent Learning

链接https://arxiv.org/abs/2502.14276

作者:Hanlin Wang,Jian Wang,Chak Tou Leong,Wenjie Li

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language model, Large language, tackling complex tasks, language model, shown promise

备注

点击查看摘要

Abstract:Large language model (LLM)-based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. Existing work primarily focuses on behavior cloning from expert demonstrations and preference learning through exploratory trajectory sampling. However, these methods often struggle in long-horizon tasks, where suboptimal actions accumulate step by step, causing agents to deviate from correct task trajectories. To address this, we highlight the importance of timely calibration and the need to automatically construct calibration trajectories for training agents. We propose Step-Level Trajectory Calibration (STeCa), a novel framework for LLM agent learning. Specifically, STeCa identifies suboptimal actions through a step-level reward comparison during exploration. It constructs calibrated trajectories using LLM-driven reflection, enabling agents to learn from improved decision-making processes. These calibrated trajectories, together with successful trajectory data, are utilized for reinforced training. Extensive experiments demonstrate that STeCa significantly outperforms existing methods. Further analysis highlights that step-level calibration enables agents to complete tasks with greater robustness. Our code and data are available at this https URL.
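
To make the idea of a step-level reward comparison concrete, here is a minimal sketch. It is not STeCa's actual procedure, which the abstract does not spell out: the function name `find_deviation_steps`, the `margin` parameter, and the reward values are all hypothetical. It only shows how per-step rewards from an explored trajectory could be compared against a reference trajectory to flag steps that might need calibration.

```python
def find_deviation_steps(explored_rewards, reference_rewards, margin=0.0):
    """Return indices of steps whose explored reward trails the reference by more than `margin`.

    Hypothetical helper: both inputs are per-step reward lists of equal length.
    """
    return [
        t
        for t, (r_exp, r_ref) in enumerate(zip(explored_rewards, reference_rewards))
        if r_exp < r_ref - margin
    ]


# Made-up per-step rewards for a four-step task.
explored = [0.9, 0.8, 0.3, 0.2]
reference = [0.9, 0.85, 0.8, 0.7]
flagged = find_deviation_steps(explored, reference, margin=0.1)
```

In this toy setup the flagged indices would be the candidates for building calibrated trajectories; how the calibration itself is produced (LLM-driven reflection in the paper) is outside the scope of the sketch.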

108. 【2502.14275】Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment

链接https://arxiv.org/abs/2502.14275

作者:Jiaxi Li,Yiwei Wang,Kai Zhang,Yujun Cai,Bryan Hooi,Nanyun Peng,Kai-Wei Chang,Jin Lu

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, Large language, downstream task domains, medical, medical knowledge

备注: 15 pages, 11 figures

点击查看摘要

Abstract:Large language models (LLMs) have been widely adopted in various downstream task domains. However, their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. Given the high-stakes nature of medical applications, where incorrect information can have critical consequences, it is essential to evaluate how well LLMs encode, retain, and recall fundamental medical facts. To bridge this gap, we introduce the Medical Knowledge Judgment (MKJ), a dataset specifically designed to measure LLMs' one-hop factual medical knowledge. MKJ is constructed from the Unified Medical Language System (UMLS), a large-scale repository of standardized biomedical vocabularies and knowledge graphs. We frame knowledge assessment as a binary judgment task, requiring LLMs to verify the correctness of medical statements extracted from reliable and structured knowledge sources. Our experiments reveal that LLMs struggle with factual medical knowledge retention, exhibiting significant performance variance across different semantic categories, particularly for rare medical conditions. Furthermore, LLMs show poor calibration, often being overconfident in incorrect answers. To mitigate these issues, we explore retrieval-augmented generation, demonstrating its effectiveness in improving factual accuracy and reducing uncertainty in medical decision-making.

109. 【2502.14272】Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models

链接https://arxiv.org/abs/2502.14272

作者:Yanggan Gu,Junzhuo Li,Sirui Huang,Xin Zou,Zhenghua Li,Xuming Hu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Aligning small language, typically involves distilling, Aligning small, involves distilling preference, distilling preference knowledge

备注: Under review

点击查看摘要

Abstract:Aligning small language models (SLMs) with human values typically involves distilling preference knowledge from large language models (LLMs). However, existing distillation methods model preference knowledge in teacher LLMs by comparing pairwise responses, overlooking the extent of difference between responses. This limitation hinders student SLMs from capturing the nuanced preferences for multiple responses. In this paper, we propose a Preference-Aligned Distillation (PAD) framework, which models the teacher's preference knowledge as a probability distribution over all potential preferences, thereby providing more nuanced supervisory signals. Our insight in developing PAD is rooted in the demonstration that language models can serve as reward functions, reflecting their intrinsic preferences. Based on this, PAD comprises three key steps: (1) sampling diverse responses using high-temperature sampling; (2) computing rewards for both teacher and student to construct their intrinsic preferences; and (3) training the student's intrinsic preference distribution to align with the teacher's. Experiments on four mainstream alignment benchmarks demonstrate that PAD consistently and significantly outperforms existing approaches, achieving over 20% improvement on AlpacaEval 2 and Arena-Hard, indicating superior alignment with human preferences. Notably, on MT-Bench, using the Gemma model family, the student trained by PAD surpasses its teacher, further validating the effectiveness of PAD.
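
As a rough illustration of step (3), the sketch below turns reward scores for a set of sampled responses into a softmax preference distribution and measures the teacher-student gap with a KL divergence. This is a simplified stand-in and not the paper's actual objective: the function names, the `temperature` parameter, and the reward values are all hypothetical.

```python
import math


def preference_distribution(rewards, temperature=1.0):
    """Softmax over per-response rewards (a hypothetical simplification)."""
    scaled = [r / temperature for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same responses."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up rewards that teacher and student assign to the same three responses.
teacher_rewards = [2.0, 0.5, -1.0]
student_rewards = [1.0, 1.0, 0.0]

teacher = preference_distribution(teacher_rewards)
student = preference_distribution(student_rewards)
loss = kl_divergence(teacher, student)  # a training loop would try to shrink this
```

Unlike a pairwise "A beats B" label, the distribution keeps track of how strongly each response is preferred, which is the kind of nuance the abstract argues pairwise comparison loses.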

110. 【2502.14271】PaperHelper: Knowledge-Based LLM QA Paper Reading Assistant

链接https://arxiv.org/abs/2502.14271

作者:Congrui Yin,Evan Wei,Zhongxing Zhang,Zaifu Zhan

类目:Computation and Language (cs.CL)

关键词:paper reading assistant, potent tool designed, understanding scientific literature, paper reading, reading assistant

备注

点击查看摘要

Abstract:In this paper, we introduce a paper reading assistant, PaperHelper, a potent tool designed to enhance the capabilities of researchers in efficiently browsing and understanding scientific literature. Utilizing the Retrieval-Augmented Generation (RAG) framework, PaperHelper effectively minimizes hallucinations commonly encountered in large language models (LLMs), optimizing the extraction of accurate, high-quality knowledge. The implementation of advanced technologies such as RAFT and RAG Fusion significantly boosts the performance, accuracy, and reliability of the LLM-based literature review process. Additionally, PaperHelper features a user-friendly interface that facilitates the batch downloading of documents and uses the Mermaid format to illustrate structural relationships between documents. Experimental results demonstrate that PaperHelper, based on a fine-tuned GPT-4 API, achieves an F1 Score of 60.04, with a latency of only 5.8 seconds, outperforming the basic RAG model by 7% in F1 Score.

111. 【2502.14268】MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels

链接https://arxiv.org/abs/2502.14268

作者:Xiaoou Liu,Zhen Lin,Longchao Da,Chacha Chen,Shubhendu Trivedi,Hua Wei

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, require robust confidence, Large Language, Language Models, robust confidence estimation

备注

点击查看摘要

Abstract:Large Language Models (LLMs) require robust confidence estimation, particularly in critical domains like healthcare and law where unreliable outputs can lead to significant consequences. Despite much recent work in confidence estimation, current evaluation frameworks rely on correctness functions -- various heuristics that are often noisy, expensive, and possibly introduce systematic biases. These methodological weaknesses tend to distort evaluation metrics and thus the comparative ranking of confidence measures. We introduce MCQA-Eval, an evaluation framework for assessing confidence measures in Natural Language Generation (NLG) that eliminates dependence on an explicit correctness function by leveraging gold-standard correctness labels from multiple-choice datasets. MCQA-Eval enables systematic comparison of both internal state-based white-box (e.g. logit-based) and consistency-based black-box confidence measures, providing a unified evaluation methodology across different approaches. Through extensive experiments on multiple LLMs and widely used QA datasets, we report that MCQA-Eval provides efficient and more reliable assessments of confidence estimation methods than existing approaches.
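
One common way to score a confidence measure once correctness labels are available is AUROC: the probability that a randomly chosen correct answer receives a higher confidence than a randomly chosen incorrect one. The abstract does not say which metric MCQA-Eval reports, so the choice of AUROC here is an assumption, and the labels and confidence values below are made up for illustration.

```python
def auroc(confidences, correct):
    """AUROC of confidence scores against boolean correctness labels.

    Computed by direct pairwise comparison; ties count as half a win.
    """
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Gold labels, e.g. whether the chosen multiple-choice option was the right one.
gold = [True, True, False, True, False]
conf = [0.9, 0.8, 0.7, 0.4, 0.2]
score = auroc(conf, gold)
```

Because the labels come straight from the answer key, no heuristic correctness function (string matching, an LLM judge, and so on) enters the computation, which is the source of noise the abstract aims to avoid.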

112. 【2502.14258】Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information

链接https://arxiv.org/abs/2502.14258

作者:Yein Park,Chanwoong Yoon,Jungwoo Park,Minbyul Jeong,Jaewoo Kang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:facts remains underexplored, temporally changing facts, changing facts remains, handle temporally changing, widely investigated

备注

点击查看摘要

Abstract:While the ability of language models to elicit facts has been widely investigated, how they handle temporally changing facts remains underexplored. We discover Temporal Heads, specific attention heads primarily responsible for processing temporal knowledge through circuit analysis. We confirm that these heads are present across multiple models, though their specific locations may vary, and their responses differ depending on the type of knowledge and its corresponding years. Disabling these heads degrades the model's ability to recall time-specific knowledge while maintaining its general capabilities without compromising time-invariant and question-answering performances. Moreover, the heads are activated not only by numeric conditions ("In 2004") but also by textual aliases ("In the year ..."), indicating that they encode a temporal dimension beyond simple numerical representation. Furthermore, we expand the potential of our findings by demonstrating how temporal knowledge can be edited by adjusting the values of these heads.

113. 【2502.14255】Effects of Prompt Length on Domain-specific Tasks for Large Language Models

链接https://arxiv.org/abs/2502.14255

作者:Qibang Liu,Wenzhe Wang,Jeffrey Willard

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, garnered significant attention, natural language tasks, natural language

备注

点击查看摘要

Abstract:In recent years, Large Language Models have garnered significant attention for their strong performance in various natural language tasks, such as machine translation and question answering. These models demonstrate an impressive ability to generalize across diverse tasks. However, their effectiveness in tackling domain-specific tasks, such as financial sentiment analysis and monetary policy understanding, remains a topic of debate, as these tasks often require specialized knowledge and precise reasoning. To address such challenges, researchers design various prompts to unlock the models' abilities. By carefully crafting input prompts, researchers can guide these models to produce more accurate responses. Consequently, prompt engineering has become a key focus of study. Despite the advancements in both models and prompt engineering, the relationship between the two, specifically how prompt design impacts models' ability to perform domain-specific tasks, remains underexplored. This paper aims to bridge this research gap.

114. 【2502.14245】Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering

链接https://arxiv.org/abs/2502.14245

作者:Rongzhi Zhu,Xiangyu Liu,Zequn Sun,Yiwei Wang,Wei Hu

类目:Computation and Language (cs.CL)

关键词:LLMs' sub-question decomposition, identify a critical, missed in LLMs', multi-hop question answering, critical problem

备注

点击查看摘要

Abstract:In this paper, we identify a critical problem, "lost-in-retrieval", in retrieval-augmented multi-hop question answering (QA): the key entities are missed in LLMs' sub-question decomposition. "Lost-in-retrieval" significantly degrades the retrieval performance, which disrupts the reasoning chain and leads to incorrect answers. To resolve this problem, we propose a progressive retrieval and rewriting method, namely ChainRAG, which sequentially handles each sub-question by completing missing key entities and retrieving relevant sentences from a sentence graph for answer generation. Each step in our retrieval and rewriting process builds upon the previous one, creating a seamless chain that leads to accurate retrieval and answers. Finally, all retrieved sentences and sub-question answers are integrated to generate a comprehensive answer to the original question. We evaluate ChainRAG on three multi-hop QA datasets (MuSiQue, 2Wiki, and HotpotQA) using three large language models: GPT4o-mini, Qwen2.5-72B, and GLM-4-Plus. Empirical results demonstrate that ChainRAG consistently outperforms baselines in both effectiveness and efficiency.

115. 【2502.14211】Transfer-Prompting: Enhancing Cross-Task Adaptation in Large Language Models via Dual-Stage Prompts Optimization

链接https://arxiv.org/abs/2502.14211

作者:Yupeng Chang,Yi Chang,Yuan Wu

类目:Computation and Language (cs.CL)

关键词:Large language models, balancing multiple high-level, Large language, face significant challenges, maintaining efficient task

备注: 17 pages

点击查看摘要

Abstract:Large language models (LLMs) face significant challenges when balancing multiple high-level objectives, such as generating coherent, relevant, and high-quality responses while maintaining efficient task adaptation across diverse tasks. To address these challenges, we introduce Transfer-Prompting, a novel two-stage framework designed to enhance cross-task adaptation in prompt generation. The framework comprises two key components: (1) source prompt construction, which refines the original prompts on source task datasets to generate source prompts with enhanced generalization ability, and (2) target prompt generation, which enhances cross-task adaptation of target prompts by fine-tuning a set of high-scored source prompts on task-specific datasets. In each optimization cycle, a reference LLM generates candidate prompts based on historical prompt-score pairs and task descriptions in our designed reference prompt. These candidate prompts are refined iteratively, while a scorer LLM evaluates their effectiveness using the multi-dimensional metrics designed in the objective prompts evaluator, a novel contribution in this work that provides a holistic evaluation of prompt quality and task performance. This feedback loop facilitates continuous refinement, optimizing both prompt quality and task-specific outcomes. We validate Transfer-Prompting through extensive experiments across 25 LLMs, including 7 foundational models and 18 specialized models, evaluated on 9 diverse datasets. The results demonstrate that Transfer-Prompting significantly improves task-specific performance, highlighting its potential for enhancing cross-task adaptation in LLMs. The code is available at this https URL.

116. 【2502.14204】On-the-fly Preference Alignment via Principle-Guided Decoding

链接https://arxiv.org/abs/2502.14204

作者:Mingye Zhu,Yi Liu,Lei Zhang,Junbo Guo,Zhendong Mao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:rapidly expanding landscape, aligning model generations, large language models, increasingly important, rapidly expanding

备注: Accepted to ICLR 2025

点击查看摘要

Abstract:With the rapidly expanding landscape of large language models, aligning model generations with human values and preferences is becoming increasingly important. Popular alignment methods, such as Reinforcement Learning from Human Feedback, have shown significant success in guiding models with greater control. However, these methods require considerable computational resources, which is inefficient, and substantial collection of training data to accommodate the diverse and pluralistic nature of human preferences, which is impractical. These limitations significantly constrain the scope and efficacy of both task-specific and general preference alignment methods. In this work, we introduce On-the-fly Preference Alignment via Principle-Guided Decoding (OPAD) to directly align model outputs with human preferences during inference, eliminating the need for fine-tuning. Our approach involves first curating a surrogate solution to an otherwise infeasible optimization problem and then designing a principle-guided reward function based on this surrogate. The final aligned policy is derived by maximizing this customized reward, which exploits the discrepancy between the constrained policy and its unconstrained counterpart. OPAD directly modifies the model's predictions during inference, ensuring principle adherence without incurring the computational overhead of retraining or fine-tuning. Experiments show that OPAD achieves competitive or superior performance in both general and personalized alignment tasks, demonstrating its efficiency and effectiveness compared to state-of-the-art baselines.
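
The decoding-time idea can be sketched for a single next-token step. The version below assumes the reward for each token is the log-probability gap between a principle-conditioned ("constrained") distribution and an unconditioned one, scaled by a hypothetical weight `beta`. The exact reward and update used by OPAD are not given in the abstract, so treat the formula, the function names, and the toy logits as assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def principle_guided_step(constrained_logits, unconstrained_logits, beta=1.0):
    """Reweight next-token probabilities by the gap between the two policies."""
    log_pc = log_softmax(constrained_logits)
    log_pu = log_softmax(unconstrained_logits)
    # Assumed reward: how much more the principle-conditioned model favours each token.
    adjusted = [c + beta * (c - u) for c, u in zip(log_pc, log_pu)]
    return [math.exp(x) for x in log_softmax(adjusted)]


# Toy three-token vocabulary: with the principle in the prompt the model favours token 0.
probs = principle_guided_step([2.0, 1.0, 0.0], [1.0, 1.0, 1.0], beta=1.0)
baseline = principle_guided_step([2.0, 1.0, 0.0], [1.0, 1.0, 1.0], beta=0.0)
```

With `beta=0.0` the step reduces to ordinary sampling from the constrained distribution; a larger `beta` pushes probability mass toward tokens that the principle makes more likely, and no model weights are updated at any point.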

117. 【2502.14202】Do LLMs Consider Security? An Empirical Study on Responses to Programming Questions

Link: https://arxiv.org/abs/2502.14202

Authors: Amirali Sajadi, Binh Le, Anh Nguyen, Kostadin Damevski, Preetha Chatterjee

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: LLM-generated content, widespread adoption, adoption of conversational, software development, development has raised

Abstract:The widespread adoption of conversational LLMs for software development has raised new security concerns regarding the safety of LLM-generated content. Our motivational study outlines ChatGPT's potential in volunteering context-specific information to the developers, promoting safe coding practices. Motivated by this finding, we conduct a study to evaluate the degree of security awareness exhibited by three prominent LLMs: Claude 3, GPT-4, and Llama 3. We prompt these LLMs with Stack Overflow questions that contain vulnerable code to evaluate whether they merely provide answers to the questions or if they also warn users about the insecure code, thereby demonstrating a degree of security awareness. Further, we assess whether LLM responses provide information about the causes, exploits, and the potential fixes of the vulnerability, to help raise users' awareness. Our findings show that all three models struggle to accurately detect and warn users about vulnerabilities, achieving a detection rate of only 12.6% to 40% across our datasets. We also observe that the LLMs tend to identify certain types of vulnerabilities related to sensitive information exposure and improper input neutralization much more frequently than other types, such as those involving external control of file names or paths. Furthermore, when LLMs do issue security warnings, they often provide more information on the causes, exploits, and fixes of vulnerabilities compared to Stack Overflow responses. Finally, we provide an in-depth discussion on the implications of our findings and present a CLI-based prompting tool that can be used to generate significantly more secure LLM responses.

118. 【2502.14192】NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM

Link: https://arxiv.org/abs/2502.14192

Authors: Jiayin Lan, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin

Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)

Keywords: Large language models, Large language, language models, widely applied, Large

Abstract:Large language models (LLMs) have been widely applied in question answering over scientific research papers. To enhance the professionalism and accuracy of responses, many studies employ external knowledge augmentation. However, existing structures of external knowledge in scientific literature often focus solely on either paper entities or domain concepts, neglecting the intrinsic connections between papers through shared domain concepts. This results in less comprehensive and specific answers when addressing questions that combine papers and concepts. To address this, we propose a novel knowledge graph framework that captures deep conceptual relations between academic papers, constructing a relational network via intra-paper semantic elements and inter-paper citation relations. Using a few-shot knowledge graph construction method based on LLM, we develop NLP-AKG, an academic knowledge graph for the NLP domain, by extracting 620,353 entities and 2,271,584 relations from 60,826 papers in ACL Anthology. Based on this, we propose a 'sub-graph community summary' method and validate its effectiveness on three NLP scientific literature question answering datasets.

119. 【2502.14189】QUAD-LLM-MLTC: Large Language Models Ensemble Learning for Healthcare Text Multi-Label Classification

Link: https://arxiv.org/abs/2502.14189

Authors: Hajar Sakai, Sarah S. Lam

Categories: Computation and Language (cs.CL)

Keywords: automated Multi-Label Text, collected healthcare textual, Large Language Models, Natural Language Processing, nuanced nature

Abstract:The escalating volume of collected healthcare textual data presents a unique challenge for automated Multi-Label Text Classification (MLTC), which is primarily due to the scarcity of annotated texts for training and their nuanced nature. Traditional machine learning models often fail to fully capture the array of expressed topics. However, Large Language Models (LLMs) have demonstrated remarkable effectiveness across numerous Natural Language Processing (NLP) tasks in various domains, which show impressive computational efficiency and suitability for unsupervised learning through prompt engineering. Consequently, these LLMs promise an effective MLTC of medical narratives. However, when dealing with various labels, different prompts can be relevant depending on the topic. To address these challenges, the proposed approach, QUAD-LLM-MLTC, leverages the strengths of four LLMs: GPT-4o, BERT, PEGASUS, and BART. QUAD-LLM-MLTC operates in a sequential pipeline in which BERT extracts key tokens, PEGASUS augments textual data, GPT-4o classifies, and BART provides topics' assignment probabilities, which results in four classifications, all in a 0-shot setting. The outputs are then combined using ensemble learning and processed through a meta-classifier to produce the final MLTC result. The approach is evaluated using three samples of annotated texts, which contrast it with traditional and single-model methods. The results show significant improvements across the majority of the topics in the classification's F1 score and consistency (F1 and Micro-F1 scores of 78.17% and 80.16% with standard deviations of 0.025 and 0.011, respectively). This research advances MLTC using LLMs and provides an efficient and scalable solution to rapidly categorize healthcare-related text data without further training.

120. 【2502.14187】Federated Fine-Tuning of Large Language Models: Kahneman-Tversky vs. Direct Preference Optimization

Link: https://arxiv.org/abs/2502.14187

Authors: Fernando Spadea, Oshani Seneviratne

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Direct Preference Optimization, evaluate Kahneman-Tversky Optimization, Preference Optimization, Direct Preference, large language models

Abstract:We evaluate Kahneman-Tversky Optimization (KTO) as a fine-tuning method for large language models (LLMs) in federated learning (FL) settings, comparing it against Direct Preference Optimization (DPO). Using Alpaca-7B as the base model, we fine-tune on a realistic dataset under both methods and evaluate performance using MT-Bench-1, Vicuna, and AdvBench benchmarks. Additionally, we introduce a redistributed dataset setup, where only KTO is applicable due to its ability to handle single-response feedback, unlike DPO's reliance on paired responses. Our results demonstrate that KTO, in both its original (KTOO) and redistributed (KTOR) configurations, consistently outperforms DPO across all benchmarks. In the redistributed setup, KTO further validates its flexibility and resilience by maintaining superior performance in scenarios where DPO cannot be applied. These findings establish KTO as a robust and scalable fine-tuning method for FL, motivating its adoption for privacy-preserving, decentralized, and heterogeneous environments.

121. 【2502.14180】On the logical skills of large language models: evaluations using arbitrarily complex first-order logic problems

Link: https://arxiv.org/abs/2502.14180

Authors: Shokhrukh Ibragimov, Arnulf Jentzen, Benno Kuckuck

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: first-order logic statements, first-order logic, generating first-order logic, multiple dimensions, logic statements

Comments: 67 pages, 24 figures

Abstract:We present a method of generating first-order logic statements whose complexity can be controlled along multiple dimensions. We use this method to automatically create several datasets consisting of questions asking for the truth or falsity of first-order logic statements in Zermelo-Fraenkel set theory. While the resolution of these questions does not require any knowledge beyond basic notation of first-order logic and set theory, it does require a degree of planning and logical reasoning, which can be controlled up to arbitrarily high difficulty by the complexity of the generated statements. Furthermore, we do extensive evaluations of the performance of various large language models, including recent models such as DeepSeek-R1 and OpenAI's o3-mini, on these datasets. All of the datasets along with the code used for generating them, as well as all data from the evaluations is publicly available at this https URL.

122. 【2502.14171】Enhancing Conversational Agents with Theory of Mind: Aligning Beliefs, Desires, and Intentions for Human-Like Interaction

Link: https://arxiv.org/abs/2502.14171

Authors: Mohammadmahdi Jafari, Devin Yuncheng Hua, Hao Xue, Flora Salim

Categories: Computation and Language (cs.CL)

Keywords: agentic Artificial Intelligence, Natural language interaction, Artificial Intelligence, Large Language Models, agentic Artificial

Abstract:Natural language interaction with agentic Artificial Intelligence (AI), driven by Large Language Models (LLMs), is expected to remain a dominant paradigm in the near future. While humans instinctively align their communication with mental states -- an ability known as Theory of Mind (ToM), current LLM powered systems exhibit significant limitations in this regard. This study examines the extent to which open source language models (LLaMA) can capture and preserve ToM related information and how effectively it contributes to consistent ToM reasoning in generated responses. We further investigate whether explicit manipulation of ToM related components, such as beliefs, desires, and intentions, can enhance response alignment. Experiments on two LLaMA 3 variants demonstrate that incorporating ToM informed alignment improves response quality, achieving win rates of 67 and 63 percent for the 3B and 8B models, respectively. These findings highlight the potential of ToM driven strategies to improve alignment in LLM based conversational agents.

123. 【2502.14155】Giving AI Personalities Leads to More Human-Like Reasoning

Link: https://arxiv.org/abs/2502.14155

Authors: Animesh Nighojkar, Bekhzodbek Moydinboyev, My Duong, John Licato

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Large Language Models, Natural Language Inference, computational cognitive modeling, significant challenge, human

Abstract: In computational cognitive modeling, capturing the full spectrum of human judgment and decision-making processes, beyond just optimal behaviors, is a significant challenge. This study explores whether Large Language Models (LLMs) can emulate the breadth of human reasoning by predicting both intuitive, fast System 1 and deliberate, slow System 2 processes. We investigate the potential of AI to mimic diverse reasoning behaviors across a human population, addressing what we call the "full reasoning spectrum problem". We designed reasoning tasks using a novel generalization of the Natural Language Inference (NLI) format to evaluate LLMs' ability to replicate human reasoning. The questions were crafted to elicit both System 1 and System 2 responses. Human responses were collected through crowd-sourcing and the entire distribution was modeled, rather than just the majority of the answers. We used personality-based prompting inspired by the Big Five personality model to elicit AI responses reflecting specific personality traits, capturing the diversity of human reasoning, and exploring how personality traits influence LLM outputs. Combined with genetic algorithms to optimize the weighting of these prompts, this method was tested alongside traditional machine learning models. The results show that LLMs can mimic human response distributions, with open-source models like Llama and Mistral outperforming proprietary GPT models. Personality-based prompting, especially when optimized with genetic algorithms, significantly enhanced LLMs' ability to predict human response distributions, suggesting that capturing suboptimal, naturalistic reasoning may require modeling techniques incorporating diverse reasoning styles and psychological profiles. The study concludes that personality-based prompting combined with genetic algorithms is promising for enhancing AI's human-ness in reasoning.

124. 【2502.14145】LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

Link: https://arxiv.org/abs/2502.14145

Authors: Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, Dong Yu

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: Achieving full-duplex communication, spoken dialogue systems, requires real-time coordination, full-duplex SDS, Achieving full-duplex

Comments: In submission to INTERSPEECH 2025

Abstract:Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.

125. 【2502.14144】UM_FHS at TREC 2024 PLABA: Exploration of Fine-tuning and AI agent approach for plain language adaptations of biomedical text

Link: https://arxiv.org/abs/2502.14144

Authors: Primoz Kocbek, Leon Kopitar, Zhihong Zhang, Emirhan Aydin, Maxim Topaz, Gregor Stiglic

Categories: Computation and Language (cs.CL)

Keywords: simplify biomedical abstracts, PLABA track, 13-14 years, years old students, biomedical abstracts

Comments: 10 pages, 2 figures, to be published in the 33rd Text REtrieval Conference (TREC 2024) proceedings

Abstract:This paper describes our submissions to the TREC 2024 PLABA track with the aim to simplify biomedical abstracts for a K8-level audience (13-14 years old students). We tested three approaches using OpenAI's gpt-4o and gpt-4o-mini models: baseline prompt engineering, a two-AI agent approach, and fine-tuning. Adaptations were evaluated using qualitative metrics (5-point Likert scales for simplicity, accuracy, completeness, and brevity) and quantitative readability scores (Flesch-Kincaid grade level, SMOG Index). Results indicated that the two-agent approach and baseline prompt engineering with gpt-4o-mini models show superior qualitative performance, while fine-tuned models excelled in accuracy and completeness but were less simple. The evaluation results demonstrated that prompt engineering with gpt-4o-mini outperforms iterative improvement strategies via two-agent approach as well as fine-tuning with gpt-4o. We intend to expand our investigation of the results and explore advanced evaluations.
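
The two readability scores named in this abstract have standard published formulas, which can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and the syllable counter is a crude vowel-group heuristic chosen only to keep the sketch self-contained, so its scores will differ somewhat from those of established readability tools.

```python
import math
import re


def count_syllables(word):
    # Assumption: approximate syllables by counting groups of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability(text):
    """Return (Flesch-Kincaid grade level, SMOG index) for English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    polysyllables = sum(1 for s in syllables if s >= 3)

    # Flesch-Kincaid grade level.
    fk = (0.39 * len(words) / len(sentences)
          + 11.8 * sum(syllables) / len(words)
          - 15.59)
    # SMOG index, with the polysyllable count normalised to 30 sentences.
    smog = 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291
    return fk, smog


simple = "The cat sat. The dog ran."
dense = "Pharmacological interventions significantly ameliorate cardiovascular complications."
print(readability(simple))
print(readability(dense))
```

Lower values on both scales indicate text that is easier to read, which is the direction a plain-language adaptation aims for.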

126. 【2502.14133】Self-Regularization with Latent Space Explanations for Controllable LLM-based Classification

Link: https://arxiv.org/abs/2502.14133

Authors: Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, Ninghao Liu

Categories: Computation and Language (cs.CL)

Keywords: methods heavily rely, Modern text classification, large language models, classification methods heavily, Modern text

Comments: Pre-print, 15 pages, 4 figures

Abstract:Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model training. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant features, to guarantee regulatory compliance or improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the LLM latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure the SAE can capture task-specific features, we further fine-tune it on task-specific datasets. In training the classification model, we propose a simple and effective regularizer, by minimizing the similarity between the classifier weights and the identified unintended feature, to remove the impacts of these unintended features toward classification. We evaluate the proposed framework on three real-world tasks, including toxic chat detection, reward modeling, and disease diagnosis. Results show that the proposed framework can significantly improve the classifier's generalizability by regularizing those features that are not semantically correlated to each task. This work pioneers controllable text classification on LLM latent spaces by leveraging interpreted features to address generalizability, fairness, and privacy challenges. We will release our code and data once accepted.
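
The abstract describes the regularizer only as minimizing the similarity between the classifier weights and the identified unintended features. One way such a penalty could look is sketched below, assuming cosine similarity as the similarity measure and a sum over features. The function names, the `strength` parameter, and the use of the absolute value are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine_similarity(u, v):
    # Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def regularized_loss(task_loss, classifier_weights, unintended_features, strength=0.1):
    # Assumption: penalise alignment (in either direction) between the weight
    # vector and each unintended feature direction, then add it to the task loss.
    penalty = sum(abs(cosine_similarity(classifier_weights, f))
                  for f in unintended_features)
    return task_loss + strength * penalty


# Weights aligned with an unintended feature are penalised; orthogonal ones are not.
print(regularized_loss(0.5, [1.0, 0.0], [[2.0, 0.0]]))
print(regularized_loss(0.5, [0.0, 1.0], [[2.0, 0.0]]))
```

In a real training loop the same penalty would be written with a differentiable tensor library so that gradients push the classifier weights away from the unintended directions.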

127. 【2502.14132】Can Community Notes Replace Professional Fact-Checkers?

Link: https://arxiv.org/abs/2502.14132

Authors: Nadav Borenstein, Greta Warren, Desmond Elliott, Isabelle Augenstein

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: platform users, commonly-employed strategies, strategies to combat, combat the rise, social media

Abstract:Two commonly-employed strategies to combat the rise of misinformation on social media are (i) fact-checking by professional organisations and (ii) community moderation by platform users. Policy changes by Twitter/X and, more recently, Meta, signal a shift away from partnerships with fact-checking organisations and towards an increased reliance on crowdsourced community notes. However, the extent and nature of dependencies between fact-checking and helpful community notes remain unclear. To address these questions, we use language models to annotate a large corpus of Twitter/X community notes with attributes such as topic, cited sources, and whether they refute claims tied to broader misinformation narratives. Our analysis reveals that community notes cite fact-checking sources up to five times more than previously reported. Fact-checking is especially crucial for notes on posts linked to broader narratives, which are twice as likely to reference fact-checking sources compared to other sources. In conclusion, our results show that successful community moderation heavily relies on professional fact-checking.

128. 【2502.14127】Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Link: https://arxiv.org/abs/2502.14127

Authors: Nishant Balepur, Rachel Rudinger, Jordan Lee Boyd-Graber

Categories: Computation and Language (cs.CL)

Keywords: Multiple choice question, choice question answering, Multiple choice, LLM evaluation due, question answering

Comments: In-progress preprint

Abstract: Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing, where LLMs construct and explain answers, better capturing user needs and knowledge while remaining easy to score. We then show even when MCQA is a useful format, its datasets suffer from: leakage; unanswerability; shortcuts; and saturation. In each issue, we give fixes from education, like rubrics to guide MCQ writing; scoring methods to bridle guessing; and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA (robustness, biases, and unfaithful explanations), showing how our prior solutions better measure or address these issues. While we do not need to desert MCQA, we encourage more efforts in refining the task based on educational testing, advancing evaluations.

129. 【2502.14122】Benchmarking LLMs for Political Science: A United Nations Perspective

Link: https://arxiv.org/abs/2502.14122

Authors: Yueqing Liang, Liangwei Yang, Chen Wang, Congying Xia, Rui Meng, Xiongxiao Xu, Haoran Wang, Ali Payani, Kai Shu

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Emerging Technologies (cs.ET)

Keywords: Large Language Models, natural language processing, remains largely unexplored, Large Language, Language Models

Abstract:Large Language Models (LLMs) have achieved significant advances in natural language processing, yet their potential for high-stake political decision-making remains largely unexplored. This paper addresses the gap by focusing on the application of LLMs to the United Nations (UN) decision-making process, where the stakes are particularly high and political decisions can have far-reaching consequences. We introduce a novel dataset comprising publicly available UN Security Council (UNSC) records from 1994 to 2024, including draft resolutions, voting records, and diplomatic speeches. Using this dataset, we propose the United Nations Benchmark (UNBench), the first comprehensive benchmark designed to evaluate LLMs across four interconnected political science tasks: co-penholder judgment, representative voting simulation, draft adoption prediction, and representative statement generation. These tasks span the three stages of the UN decision-making process--drafting, voting, and discussing--and aim to assess LLMs' ability to understand and simulate political dynamics. Our experimental analysis demonstrates the potential and challenges of applying LLMs in this domain, providing insights into their strengths and limitations in political science. This work contributes to the growing intersection of AI and political science, opening new avenues for research and practical applications in global governance. The UNBench Repository can be accessed at: this https URL.

130. 【2502.14119】Meaning Beyond Truth Conditions: Evaluating Discourse Level Understanding via Anaphora Accessibility

Link: https://arxiv.org/abs/2502.14119

Authors: Xiaomeng Zhu, Zhenghao Zhou, Simon Charlow, Robert Frank

Categories: Computation and Language (cs.CL)

Keywords: language understanding abilities, natural language understanding, sentence levels, hierarchy of natural, abilities and argue

Abstract:We present a hierarchy of natural language understanding abilities and argue for the importance of moving beyond assessments of understanding at the lexical and sentence levels to the discourse level. We propose the task of anaphora accessibility as a diagnostic for assessing discourse understanding, and to this end, present an evaluation dataset inspired by theoretical research in dynamic semantics. We evaluate human and LLM performance on our dataset and find that LLMs and humans align on some tasks and diverge on others. Such divergence can be explained by LLMs' reliance on specific lexical items during language comprehension, in contrast to human sensitivity to structural abstractions.

131. 【2502.14100】Towards Context-Robust LLMs: A Gated Representation Fine-tuning Approach

Link: https://arxiv.org/abs/2502.14100

Authors: Shenglai Zeng, Pengfei He, Kai Guo, Tianqi Zheng, Hanqing Lu, Yue Xing, Hui Liu

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large Language Models, Large Language, handling imperfect evidence, Language Models, retrieval-augmented generation

Abstract: Large Language Models (LLMs) enhanced with external contexts, such as through retrieval-augmented generation (RAG), often face challenges in handling imperfect evidence. They tend to over-rely on external knowledge, making them vulnerable to misleading and unhelpful contexts. To address this, we propose the concept of context-robust LLMs, which can effectively balance internal knowledge with external context, similar to human cognitive processes. Specifically, context-robust LLMs should rely on external context only when lacking internal knowledge, identify contradictions between internal and external knowledge, and disregard unhelpful contexts. To achieve this goal, we introduce Grft, a lightweight and plug-and-play gated representation fine-tuning approach. Grft consists of two key components: a gating mechanism to detect and filter problematic inputs, and low-rank representation adapters to adjust hidden representations. By training a lightweight intervention function with only 0.0004% of model size on fewer than 200 examples, Grft can effectively adapt LLMs towards context-robust behaviors.

132. 【2502.14095】Retrieving Versus Understanding Extractive Evidence in Few-Shot Learning

Link: https://arxiv.org/abs/2502.14095

Authors: Karl Elbakian, Samuel Carton

Categories: Computation and Language (cs.CL)

Keywords: construct document-level decisions, evidence retrieval, document-level decisions, evidence retrieval errors, evidence

Comments: 9 pages, 8 figures, Accepted to AAAI 2025 Main Conference (AI Alignment Track)

Abstract: A key aspect of alignment is the proper use of within-document evidence to construct document-level decisions. We analyze the relationship between the retrieval and interpretation of within-document evidence for large language models in a few-shot setting. Specifically, we measure the extent to which model prediction errors are associated with evidence retrieval errors with respect to gold-standard human-annotated extractive evidence for five datasets, using two popular closed proprietary models. We perform two ablation studies to investigate when both label prediction and evidence retrieval errors can be attributed to qualities of the relevant evidence. We find that there is a strong empirical relationship between model prediction and evidence retrieval error, but that evidence retrieval error is mostly not associated with evidence interpretation error -- a hopeful sign for downstream applications built on this mechanism.

133. 【2502.14086】Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning

Link: https://arxiv.org/abs/2502.14086

Authors: Cole Gawin, Yidan Sun, Mayank Kejriwal

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, generating human-like text, Large language, achieved remarkable performance, solving reasoning tasks

Comments: 5 pages, 3 figures, ACM Web Conference 2025

Abstract:Large language models (LLMs) have achieved remarkable performance in generating human-like text and solving reasoning tasks of moderate complexity, such as question-answering and mathematical problem-solving. However, their capabilities in tasks requiring deeper cognitive skills, such as common-sense understanding and abstract reasoning, remain under-explored. In this paper, we systematically evaluate abstract common-sense reasoning in LLMs using the ConceptNet knowledge graph. We propose two prompting approaches: instruct prompting, where models predict plausible semantic relationships based on provided definitions, and few-shot prompting, where models identify relations using examples as guidance. Our experiments with the gpt-4o-mini model show that in instruct prompting, consistent performance is obtained when ranking multiple relations but with substantial decline when the model is restricted to predicting only one relation. In few-shot prompting, the model's accuracy improves significantly when selecting from five relations rather than the full set, although with notable bias toward certain relations. These results suggest significant gaps still, even in commercially used LLMs' abstract common-sense reasoning abilities, compared to human-level understanding. However, the findings also highlight the promise of careful prompt engineering, based on selective retrieval, for obtaining better performance.

134. 【2502.14083】Are Rules Meant to be Broken? Understanding Multilingual Moral Reasoning as a Computational Pipeline with UniMoral

Link: https://arxiv.org/abs/2502.14083

Authors: Shivani Kumar, David Jurgens

Categories: Computation and Language (cs.CL)

Keywords: complex cognitive process, cognitive process shaped, presents unique challenges, Moral reasoning, complex cognitive

Comments: 21 pages, 10 figures, 8 tables

Abstract: Moral reasoning is a complex cognitive process shaped by individual experiences and cultural contexts and presents unique challenges for computational analysis. While natural language processing (NLP) offers promising tools for studying this phenomenon, current research lacks cohesion, employing discordant datasets and tasks that examine isolated aspects of moral reasoning. We bridge this gap with UniMoral, a unified dataset integrating psychologically grounded and social-media-derived moral dilemmas annotated with labels for action choices, ethical principles, contributing factors, and consequences, alongside annotators' moral and cultural profiles. Recognizing the cultural relativity of moral reasoning, UniMoral spans six languages, Arabic, Chinese, English, Hindi, Russian, and Spanish, capturing diverse socio-cultural contexts. We demonstrate UniMoral's utility through benchmark evaluations of three large language models (LLMs) across four tasks: action prediction, moral typology classification, factor attribution analysis, and consequence generation. Key findings reveal that while implicitly embedded moral contexts enhance the moral reasoning capability of LLMs, there remains a critical need for increasingly specialized approaches to further advance moral reasoning in these models.

135. 【2502.14074】Investigating Non-Transitivity in LLM-as-a-Judge

Link: https://arxiv.org/abs/2502.14074

Authors: Yi Xu, Laura Ruis, Tim Rocktäschel, Robert Kirk

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Automatic evaluation methods, evaluation methods based, Automatic evaluation, large language models, LLM-based agents

Comments: 8 pages, 6 figures, 2 tables (30 pages, 11 figures, 8 tables including references and appendices)

Abstract:Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% - 96.4% and 82.1% - 86.3% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
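
To make the round-robin plus Bradley-Terry idea concrete, the sketch below fits Bradley-Terry strengths to a matrix of pairwise win counts using the standard minorization-maximization update. It is a generic illustration, not the authors' implementation: the function name, the toy win counts, and the fixed iteration count are assumptions, and the code expects every model to have at least one win.

```python
def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths.

    wins[i][j] is the number of comparisons in which model i beat model j.
    Returns one positive strength per model, scaled to sum to the model count.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each opponent contributes (games played) / (sum of both strengths).
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom if denom > 0 else p[i])
        scale = n / sum(updated)
        p = [x * scale for x in updated]
    return p


# Toy round-robin with a non-transitive cycle: A beats B, B beats C, C beats A.
wins = [
    [0, 7, 4],  # A: 7 wins vs B, 4 wins vs C
    [3, 0, 7],  # B: 3 wins vs A, 7 wins vs C
    [6, 3, 0],  # C: 6 wins vs A, 3 wins vs B
]
scores = fit_bradley_terry(wins)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
print(scores, ranking)
```

Because the fit uses every pairing rather than comparisons against a single baseline, the resulting order does not depend on which model is picked as the reference, which is the sensitivity the abstract highlights.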

136. 【2502.14051】RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

Link: https://arxiv.org/abs/2502.14051

Authors: Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Transformer-based Large Language, Large Language Models, Language Models rely, Models rely critically, Transformer-based Large

Comments

Click to view abstract

Abstract:Transformer-based Large Language Models rely critically on KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy designed specifically to reduce both memory bandwidth and capacity demand of KV cache during the decode phase. RocketKV contains two consecutive stages. In the first stage, it performs coarse-grain KV cache eviction on the input sequence tokens with SnapKV++, a method improved upon SnapKV by introducing adaptive pooling size and full compatibility with grouped-query attention. In the second stage, it adopts a hybrid attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensional reductions. Combining these two stages, RocketKV achieves significant KV cache fetching bandwidth and storage savings while maintaining comparable accuracy to full KV cache attention. We show that RocketKV provides end-to-end speedup by up to 3$\times$ as well as peak memory reduction by up to 31% in the decode phase on an NVIDIA H100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks.
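
To make the idea of top-k sparse attention concrete, here is a minimal single-query sketch in plain Python. It is a generic illustration, not RocketKV itself: the scores are computed exactly here, whereas the abstract says RocketKV approximates them through head and sequence dimensional reductions. The function name `topk_sparse_attention` and the toy vectors are assumptions made for this example.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k cached keys with the highest scaled dot-product scores."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k largest scores; all other cached tokens are skipped.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the selected tokens (shifted by the max for stability).
    max_score = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - max_score) for i in top]
    total = sum(weights)
    dim = len(values[0])
    output = [0.0] * dim
    for w, i in zip(weights, top):
        for d in range(dim):
            output[d] += (w / total) * values[i][d]
    return output, sorted(top)


# Toy KV cache with four tokens (illustrative values only).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
output, selected = topk_sparse_attention(query, keys, values, k=2)
print(selected, [round(x, 3) for x in output])
```

Only the values of the selected tokens are read when forming the output. A practical system would also need a cheap way to estimate the scores, since computing them exactly, as this sketch does, still touches every key.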

137. 【2502.14050】Diversity-driven Data Selection for Language Model Tuning through Sparse Autoencoder

Link: https://arxiv.org/abs/2502.14050

Authors: Xianjun Yang, Shaoliang Nie, Lijuan Liu, Suchin Gururangan, Ujjwal Karn, Rui Hou, Madian Khabsa, Yuning Mao

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Current pre-trained large, Current pre-trained, pre-trained large language, language models typically, large language models

Comments

Click to view abstract

Abstract:Current pre-trained large language models typically need instruction tuning to align with human preferences. However, instruction tuning data is often quantity-saturated due to the large volume of data collection and fast model iteration, leaving coreset data selection important but underexplored. On the other hand, existing quality-driven data selection methods such as LIMA (NeurIPS 2023 (Zhou et al., 2024)) and AlpaGasus (ICLR 2024 (Chen et al.)) generally ignore the equal importance of data diversity and complexity. In this work, we aim to design a diversity-aware data selection strategy and creatively propose using sparse autoencoders to tackle the challenge of data diversity measure. In addition, sparse autoencoders can also provide more interpretability of model behavior and explain, e.g., the surprising effectiveness of selecting the longest response (ICML 2024 (Zhao et al.)). Using effective data selection, we experimentally prove that models trained on our selected data can outperform other methods in terms of model capabilities, reduce training cost, and potentially gain more control over model behaviors.

138. 【2502.14048】Semantic Decomposition and Selective Context Filtering -- Text Processing Techniques for Context-Aware NLP-Based Systems

Link: https://arxiv.org/abs/2502.14048

Authors: Karl John Villardar

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Selective Context Filtering, sequentially decomposes input, decomposes input prompts, specific irrelevant sections, hierarchical information schema

Comments

Click to view abstract

Abstract:In this paper, we present two techniques for use in context-aware systems: Semantic Decomposition, which sequentially decomposes input prompts into a structured, hierarchical information schema that systems can parse and process easily, and Selective Context Filtering, which enables systems to systematically filter out specific irrelevant sections of the contextual information fed through a system's NLP-based pipeline. We explore how context-aware systems and applications can use these two techniques to implement dynamic LLM-to-system interfaces, improve an LLM's ability to generate more contextually cohesive user-facing responses, and optimize complex automated workflows and pipelines.

139. 【2502.14037】DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation

Link: https://arxiv.org/abs/2502.14037

Authors: Giorgio Franceschelli, Mirco Musolesi

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large language models, reproduce training data, common grammatical structures, generate several repetitions, increasing performance

Comments

Click to view abstract

Abstract:Despite their increasing performance, large language models still tend to reproduce training data, generate several repetitions, and focus on the most common grammatical structures and words. A possible cause is the decoding strategy adopted: the most common ones either consider only the most probable tokens, reducing output diversity, or increase the likelihood of unlikely tokens at the cost of output accuracy and correctness. In this paper, we propose a family of three new decoding methods by leveraging a mathematical analysis of the token probability distribution. In particular, the difference between consecutive, sorted probabilities can be used to avoid incorrect tokens and increase the chance of low-probable but accurate words. Experiments concerning math problem solving, extreme summarization, and the divergent association task show that our approach consistently performs at least as well as current alternatives in terms of quality and diversity.
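
The abstract only states that differences between consecutive sorted probabilities guide the truncation, so the sketch below shows one plausible reading of that idea rather than any of the three methods the authors propose. It sorts the next-token probabilities, cuts the distribution after the largest drop, renormalizes, and samples from what remains. The function names and the toy distribution are hypothetical.

```python
import random


def truncate_at_largest_gap(probs):
    """Keep the tokens ranked above the largest drop between consecutive sorted probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if len(order) < 2:
        return {i: 1.0 for i in order}
    gaps = [probs[order[r]] - probs[order[r + 1]] for r in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=lambda r: gaps[r])  # rank after which the drop is largest
    kept = order[: cut + 1]
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(dist, rng=random):
    """Draw one token index from a {token: probability} mapping."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Toy next-token distribution over five token ids (illustrative values only).
probs = [0.05, 0.40, 0.35, 0.15, 0.05]
dist = truncate_at_largest_gap(probs)
print(sorted(dist), sample_token(dist))
```

With this toy distribution the largest drop falls between the second and third most probable tokens, so two tokens survive. Greedy decoding would keep only one, while a fixed top-k or top-p threshold would not adapt to where the distribution actually falls off.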

140. 【2502.14019】Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems

Link: https://arxiv.org/abs/2502.14019

Authors: Myra Cheng, Su Lin Blodgett, Alicia DeVrio, Lisa Egede, Alexandra Olteanu

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: text generation systems', raised increasing concerns, generation systems' outputs, developing emotional dependence, harmful outcomes

Comments

Click to view abstract

Abstract:As text generation systems' outputs are increasingly anthropomorphic -- perceived as human-like -- scholars have also raised increasing concerns about how such outputs can lead to harmful outcomes, such as users over-relying or developing emotional dependence on these systems. How to intervene on such system outputs to mitigate anthropomorphic behaviors and their attendant harmful outcomes, however, remains understudied. With this work, we aim to provide empirical and theoretical grounding for developing such interventions. To do so, we compile an inventory of interventions grounded both in prior literature and a crowdsourced study where participants edited system outputs to make them less human-like. Drawing on this inventory, we also develop a conceptual framework to help characterize the landscape of possible interventions, articulate distinctions between different types of interventions, and provide a theoretical basis for evaluating the effectiveness of different interventions.

141. 【2502.14010】Which Attention Heads Matter for In-Context Learning?

Link: https://arxiv.org/abs/2502.14010

Authors: Kayo Yin, Jacob Steinhardt

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, exhibit impressive in-context, Large language, impressive in-context learning, ICL

Comments

Click to view abstract

Abstract:Large language models (LLMs) exhibit impressive in-context learning (ICL) capability, enabling them to perform new tasks using only a few demonstrations in the prompt. Two different mechanisms have been proposed to explain ICL: induction heads that find and copy relevant tokens, and function vector (FV) heads whose activations compute a latent encoding of the ICL task. To better understand which of the two distinct mechanisms drives ICL, we study and compare induction heads and FV heads in 12 language models. Through detailed ablations, we discover that few-shot ICL performance depends primarily on FV heads, especially in larger models. In addition, we uncover that FV and induction heads are connected: many FV heads start as induction heads during training before transitioning to the FV mechanism. This leads us to speculate that induction facilitates learning the more complex FV mechanism that ultimately drives ICL.

142. 【2502.14008】MaskPrune: Mask-based LLM Pruning for Layer-wise Uniform Structures

Link: https://arxiv.org/abs/2502.14008

Authors: Jiayu Qin, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Wei Wang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large language models, attracted considerable attention, large language, language tasks, tasks has attracted

Comments

Click to view abstract

Abstract:The remarkable performance of large language models (LLMs) in various language tasks has attracted considerable attention. However, the ever-increasing size of these models presents growing challenges for deployment and inference. Structured pruning, an effective model compression technique, is gaining increasing attention due to its ability to enhance inference efficiency. Nevertheless, most previous optimization-based structured pruning methods sacrifice the uniform structure across layers for greater flexibility to maintain performance. The heterogeneous structure hinders the effective utilization of off-the-shelf inference acceleration techniques and impedes efficient configuration for continued training. To address this issue, we propose a novel masking learning paradigm based on minimax optimization to obtain the uniform pruned structure by optimizing the masks under sparsity regularization. Extensive experimental results demonstrate that our method can maintain high performance while ensuring the uniformity of the pruned model structure, thereby outperforming existing SOTA methods.

Information Retrieval

1. 【2502.14862】Interpretable Text Embeddings and Text Similarity Explanation: A Primer

Link: https://arxiv.org/abs/2502.14862

Authors: Juri Opitz, Lucas Möller, Andrianos Michail, Simon Clematide

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: NLP systems, involving search, text embedding models, similarity scores, Text embeddings

Comments

Click to view abstract

Abstract:Text embeddings and text embedding models are a backbone of many AI and NLP systems, particularly those involving search. However, interpretability challenges persist, especially in explaining obtained similarity scores, which is crucial for applications requiring transparency. In this paper, we give a structured overview of interpretability methods specializing in explaining those similarity scores, an emerging research area. We study the methods' individual ideas and techniques, evaluating their potential for improving interpretability of text embeddings and explaining predicted similarities.

2. 【2502.14822】A Survey of Model Architectures in Information Retrieval

Link: https://arxiv.org/abs/2502.14822

Authors: Zhichao Xu, Fengran Mo, Zhiqi Huang, Crystina Zhang, Puxuan Yu, Bei Wang, Jimmy Lin, Vivek Srikumar

Subjects: Information Retrieval (cs.IR)

Keywords: system architectures, information retrieval, key aspects, relevance estimation, survey examines

Comments

Click to view abstract

Abstract:This survey examines the evolution of model architectures in information retrieval (IR), focusing on two key aspects: backbone models for feature extraction and end-to-end system architectures for relevance estimation. The review intentionally separates architectural considerations from training methodologies to provide a focused analysis of structural innovations in IR. We trace the development from traditional term-based methods to modern neural approaches, particularly highlighting the impact of transformer-based models and subsequent large language models (LLMs). We conclude by discussing emerging challenges and future directions, including architectural optimizations for performance and scalability, handling of multimodal and multilingual data, and adaptation to novel application domains beyond traditional search paradigms.

3. 【2502.14796】A Multi-Agent Perspective on Modern Information Retrieval

Link: https://arxiv.org/abs/2502.14796

Authors: Haya Nachimovsky, Moshe Tennenholtz, Oren Kurland

Subjects: Information Retrieval (cs.IR)

Keywords: large language models, language models, rise of large, large language, era in information

Comments

Click to view abstract

Abstract:The rise of large language models (LLMs) has introduced a new era in information retrieval (IR), where queries and documents that were once assumed to be generated exclusively by humans can now also be created by automated agents. These agents can formulate queries, generate documents, and perform ranking. This shift challenges some long-standing IR paradigms and calls for a reassessment of both theoretical frameworks and practical methodologies. We advocate for a multi-agent perspective to better capture the complex interactions between query agents, document agents, and ranker agents. Through empirical exploration of various multi-agent retrieval settings, we reveal the significant impact of these interactions on system performance. Our findings underscore the need to revisit classical IR paradigms and develop new frameworks for more effective modeling and evaluation of modern retrieval systems.

4. 【2502.14735】EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration

Link: https://arxiv.org/abs/2502.14735

Authors: Minjie Hong, Yan Xia, Zehan Wang, Jieming Zhu, Ye Wang, Sihang Cai, Xiaoda Yang, Quanyu Dai, Zhenhua Dong, Zhimeng Zhang, Zhou Zhao

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, offering enhanced capabilities, advanced recommender systems, offering enhanced

Comments: 9 pages, 6 figures, accepted by WWW 2025

Click to view abstract

Abstract:Large language models (LLMs) are increasingly leveraged as foundational backbones in the development of advanced recommender systems, offering enhanced capabilities through their extensive knowledge and reasoning. Existing LLM-based recommender systems (RSs) often face challenges due to the significant differences between the linguistic semantics of pre-trained LLMs and the collaborative semantics essential for RSs. These systems use pre-trained linguistic semantics but learn collaborative semantics from scratch via the LLM backbone. However, LLMs are not designed for recommendations, leading to inefficient collaborative learning, weak result correlations, and poor integration of traditional RS features. To address these challenges, we propose EAGER-LLM, a decoder-only LLM-based generative recommendation framework that integrates endogenous and exogenous behavioral and semantic information in a non-intrusive manner. Specifically, we propose 1) dual-source knowledge-rich item indices that integrate indexing sequences for exogenous signals, enabling efficient link-wide processing; 2) non-invasive multiscale alignment reconstruction tasks that guide the model toward a deeper understanding of both collaborative and semantic signals; and 3) an annealing adapter designed to finely balance the model's recommendation performance with its comprehension capabilities. We demonstrate EAGER-LLM's effectiveness through rigorous testing on three public benchmarks.

5. 【2502.14714】From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT

Link: https://arxiv.org/abs/2502.14714

Authors: Ahmed Abdeen Hamed, Byung Suk Lee

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: LLM models present, models present opportunities, LLM model, select LLM model, LLM models

Comments: 26 pages, 6 figures, In Review with a Cell Press Journal

Click to view abstract

Abstract:The generative capabilities of LLMs present opportunities for accelerating tasks, but also raise concerns about the authenticity of the knowledge they produce. To address these concerns, we present a computational approach that systematically evaluates the factual accuracy of biomedical knowledge that an LLM has been prompted to generate. Our approach encompasses two processes: the generation of disease-centric associations and their verification using the semantic knowledge of biomedical ontologies. Using ChatGPT as the selected LLM, we designed a set of prompt-engineering processes to generate linkages between diseases, drugs, symptoms, and genes to establish grounds for assessment. Experimental results demonstrate high accuracy in identifying disease terms (88%-97%), drug names (90%-91%), and genetic information (88%-98%). The symptom term identification accuracy was notably lower (49%-61%), as verified against the DOID, ChEBI, SYMPTOM, and GO ontologies, respectively. The verification of associations reveals literature coverage rates of 89%-91% among disease-drug and disease-gene associations. The low identification accuracy for symptom terms also carried over to the verification of symptom-related associations (49%-62%).

6. 【2502.14662】InstructAgent: Building User Controllable Recommender via LLM Agent

Link: https://arxiv.org/abs/2502.14662

Authors: Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng Zhang

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: platform recommendation algorithms, directly exposed, recommendation algorithms, users, paradigm

Comments: WWW2025@HCRS

Click to view abstract

Abstract:Traditional recommender systems usually take the user-platform paradigm, where users are directly exposed under the control of the platform's recommendation algorithms. However, defects in recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are designed with commercial objectives in mind, focusing on the platform's benefits, which may hinder their ability to protect and capture users' true interests. Second, these models are typically optimized using data from all users, which may overlook individual users' preferences. Due to these shortcomings, users may experience several disadvantages under the traditional user-platform direct exposure paradigm, such as lack of control over the recommender system, potential manipulation by the platform, echo chamber effects, or lack of personalization for less active users due to the dominance of active users during collaborative learning. Therefore, there is an urgent need to develop a new paradigm to protect user interests and alleviate these issues. Although some researchers have recently introduced LLM agents to simulate user behaviors, these approaches primarily aim to optimize platform-side performance, leaving core issues in recommender systems unresolved. To address these limitations, we propose a new user-agent-platform paradigm, where an agent serves as the protective shield between the user and the recommender system, enabling indirect exposure. To this end, we first construct four recommendation datasets, denoted as $\dataset$, along with user instructions for each record.

7. 【2502.14625】Multi-Record Web Page Information Extraction From News Websites

Link: https://arxiv.org/abs/2502.14625

Authors: Alexander Kustenkov, Maksim Varlamov, Alexander Yatskov

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: massive web data, pages, web pages, problem of extracting, growing importance

Comments

Click to view abstract

Abstract:In this paper, we focused on the problem of extracting information from web pages containing many records, a task of growing importance in the era of massive web data. Recently, the development of neural network methods has improved the quality of information extraction from web pages. Nevertheless, most of the research and datasets are aimed at studying detailed pages. This has left multi-record "list pages" relatively understudied, despite their widespread presence and practical significance. To address this gap, we created a large-scale, open-access dataset specifically designed for list pages. This is the first dataset for this task in the Russian language. Our dataset contains 13,120 web pages with news lists, significantly exceeding existing datasets in both scale and complexity. Our dataset contains attributes of various types, including optional and multi-valued, providing a realistic representation of real-world list pages. These features make our dataset a valuable resource for studying information extraction from pages containing many records. Furthermore, we proposed our own multi-stage information extraction methods. In this work, we explore and demonstrate several strategies for applying MarkupLM to the specific challenges of multi-record web pages. Our experiments validate the advantages of our methods. By releasing our dataset to the public, we aim to advance the field of information extraction from multi-record pages.

Submission history: [v1] Thu, 20 Feb 2025 15:05:00 UTC (298 KB), submitted by Aleksandr Yatskov

8. 【2502.14409】Unstructured Evidence Attribution for Long Context Query Focused Summarization

Link: https://arxiv.org/abs/2502.14409

Authors: Dustin Wright, Zain Muhammad Mujahid, Lu Wang, Isabelle Augenstein, David Jurgens

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large language models, Large language, generating coherent summaries, capable of generating, generating coherent

Comments: 24 pages; 21 figures; 5 tables

Click to view abstract

Abstract:Large language models (LLMs) are capable of generating coherent summaries from very long contexts given a user query. Extracting and properly citing evidence spans could help improve the transparency and reliability of these summaries. At the same time, LLMs suffer from positional biases in terms of which information they understand and attend to, which could affect evidence citation. Whereas previous work has focused on evidence citation with predefined levels of granularity (e.g. sentence, paragraph, document, etc.), we propose the task of long-context query focused summarization with unstructured evidence citation. We show how existing systems struggle to generate and properly cite unstructured evidence from their context, and that evidence tends to be "lost-in-the-middle". To help mitigate this, we create the Summaries with Unstructured Evidence Text dataset (SUnsET), a synthetic dataset generated using a novel domain-agnostic pipeline which can be used as supervision to adapt LLMs to this task. We demonstrate across 5 LLMs of different sizes and 4 datasets with varying document types and lengths that LLMs adapted with SUnsET data generate more relevant and factually consistent evidence than their base models, extract evidence from more diverse locations in their context, and can generate more relevant and consistent summaries.

9. 【2502.14361】Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning

Link: https://arxiv.org/abs/2502.14361

Authors: Jiachen Zhu, Congmin Zheng, Jianghao Lin, Kounianhua Du, Ying Wen, Yong Yu, Jun Wang, Weinan Zhang

Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: significantly advanced mathematical, Process Reward Models, large language models, Process Reward, advanced mathematical reasoning

Comments

Click to view abstract

Abstract:While large language models (LLMs) have significantly advanced mathematical reasoning, Process Reward Models (PRMs) have been developed to evaluate the logical validity of reasoning steps. However, PRMs still struggle with out-of-distribution (OOD) challenges. This paper identifies key OOD issues, including step OOD, caused by differences in reasoning patterns across model types and sizes, and question OOD, which arises from dataset shifts between training data and real-world problems. To address these issues, we introduce Retrieval-Augmented Process Reward Model (RetrievalPRM), a novel framework designed to tackle these OOD issues. By utilizing a two-stage retrieval-enhanced mechanism, RetrievalPRM retrieves semantically similar questions and steps as a warmup, enhancing PRM's ability to evaluate target steps and improving generalization and reasoning consistency across different models and problem types. Our extensive experiments demonstrate that RetrievalPRM outperforms existing baselines across multiple real-world datasets. Our open-source contributions include a retrieval-enhanced dataset, a tuning framework for PRM training, and the RetrievalPRM model, establishing a new standard for PRM performance.

10. 【2502.14332】A Collaborative Jade Recognition System for Mobile Devices Based on Lightweight and Large Models

Link: https://arxiv.org/abs/2502.14332

Authors: Zhenyu Wang, Wenjia Li, Pengyu Zhu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: vision-based recognition applications, topic in research, widespread adoption, hot topic, mobile devices

Comments

Click to view abstract

Abstract:With the widespread adoption and development of mobile devices, vision-based recognition applications have become a hot topic in research. Jade, as an important cultural heritage and artistic item, has significant applications in fields such as jewelry identification and cultural relic preservation. However, existing jade recognition systems still face challenges in mobile implementation, such as limited computing resources, real-time requirements, and accuracy issues. To address these challenges, this paper proposes a jade recognition system based on size model collaboration, aiming to achieve efficient and accurate jade identification using mobile devices. We design a size model based on multi-scale image processing, extracting key visual information by analyzing jade's dimensions, shapes, and surface textures. Then, a collaborative multi-model classification framework is built by combining deep learning and traditional computer vision algorithms. This framework can effectively select and adjust models based on different jade characteristics, providing high accuracy results across various environments. The results show that the proposed system can provide high recognition accuracy and fast processing time on mobile devices, while consuming relatively low computational resources. The system not only holds great application potential but also provides new ideas and technical support for the intelligent development of jade identification.

11. 【2502.14305】Efficient AI in Practice: Training and Deployment of Efficient LLMs for Industry Applications

Link: https://arxiv.org/abs/2502.14305

Authors: Kayhan Behdin, Yun Dai, Ata Fatahibaarzi, Aman Gupta, Qingquan Song, Shao Tang, Hejian Sang, Gregory Dexter, Sirou Zhu, Siyu Zhu, Tejas Dharamsi, Maziar Sanjabi, Vignesh Kothapalli, Hamed Firooz, Zhoutong Fu, Yihan Cao, Pin-Lun Hsu, Fedor Borisyuk, Zhipeng Wang, Rahul Mazumder, Natesh Pillai, Luke Simon

Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: demonstrated remarkable performance, generative tasks, demonstrated remarkable, wide range, range of industrial

Comments

Click to view abstract

Abstract:Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendations to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present methods and insights for training small language models (SLMs) that deliver high performance and efficiency in deployment. We focus on two key techniques: (1) knowledge distillation and (2) model compression via quantization and pruning. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training, serving costs, and latency. We detail the impact of these techniques on a variety of use cases at a large professional social network platform and share deployment lessons - including hardware optimization strategies that enhance speed and throughput for both predictive and reasoning-based applications.
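
The abstract does not say which distillation objective the authors use. As a generic illustration of knowledge distillation, the sketch below computes the classic soft-label loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The function names, the temperature value, and the toy logits are assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Toy logits over a three-token vocabulary (illustrative values only).
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
print(round(distillation_loss(student, teacher), 4))
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge. In training it is commonly mixed with the ordinary cross-entropy loss on the ground-truth labels.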

12. 【2502.14297】An Evaluation of Sakana's AI Scientist for Autonomous Research: Wishful Thinking or an Emerging Reality Towards 'Artificial General Research Intelligence' (AGRI)?

Link: https://arxiv.org/abs/2502.14297

Authors: Joeran Beel, Min-Yen Kan, Moritz Baumgart

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: term Artificial General, Artificial General Intelligence, Artificial General, Artificial General Research, General Research Intelligence

Comments: 16 pages

Click to view abstract

Abstract:A major step toward Artificial General Intelligence (AGI) and Super Intelligence is AI's ability to autonomously conduct research - what we term Artificial General Research Intelligence (AGRI). If machines could generate hypotheses, conduct experiments, and write research papers without human intervention, it would transform science. Recently, Sakana.ai introduced the AI Scientist, a system claiming to automate the research lifecycle, generating both excitement and skepticism. We evaluated the AI Scientist and found it a milestone in AI-driven research. While it streamlines some aspects, it falls short of expectations. Literature reviews are weak, nearly half the experiments failed, and manuscripts sometimes contain hallucinated results. Most notably, users must provide an experimental pipeline, limiting the AI Scientist's autonomy in research design and execution. Despite its limitations, the AI Scientist advances research automation. Many reviewers or instructors who assess work superficially may not recognize its output as AI-generated. The system produces research papers with minimal human effort and low cost. Our analysis suggests a paper costs a few USD with a few hours of human involvement, making it significantly faster than human researchers. Compared to AI capabilities from a few years ago, this marks progress toward AGRI. The rise of AI-driven research systems requires urgent discussion within Information Retrieval (IR) and broader scientific communities. Enhancing literature retrieval, citation validation, and evaluation benchmarks could improve AI-generated research reliability. We propose concrete steps, including AGRI-specific benchmarks, refined peer review, and standardized attribution frameworks. Whether AGRI becomes a stepping stone to AGI depends on how the academic and AI communities shape its development.

Submission history: [v1] Thu, 20 Feb 2025 06:22:03 UTC (815 KB), submitted by Joeran Beel

13. 【2502.14212】Less is More: On the Importance of Data Quality for Unit Test Generation

Link: https://arxiv.org/abs/2502.14212

Authors: Junwei Zhang, Xing Hu, Shan Gao, Xin Xia, David Lo, Shanping Li

Categories: Software Engineering (cs.SE); Information Retrieval (cs.IR)

Keywords: test generation, Unit testing, Effective unit testing, generation, unit test generation

Abstract:Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, to categorize eight distinct types of noise. Further, we conduct detailed interviews with 17 domain experts to validate and assess the importance, reasonableness, and correctness of the noise taxonomy. Then, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest on two widely-used test generation datasets, i.e., Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of datasets contain noise, highlighting its prevalence. Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. The results show that filtering noise positively influences the test generation ability of the models.
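The three-filter design described in this abstract can be pictured as a chain of predicates applied to each (focal method, test) pair. The sketch below is illustrative only: the `Sample` type, the bracket-balance syntax check, the name-match relevance check, and the `coverage_fn` callable are all hypothetical stand-ins, not CleanTest's actual rules or coverage model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    focal_name: str  # name of the method under test (hypothetical field)
    test_code: str   # source text of the candidate unit test


def syntax_ok(sample: Sample) -> bool:
    """Toy rule-based syntax filter: parentheses and braces must balance."""
    pairs = {")": "(", "}": "{"}
    stack = []
    for ch in sample.test_code:
        if ch in "({":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


def relevant(sample: Sample) -> bool:
    """Toy rule-based relevance filter: the test must call the focal method."""
    return f"{sample.focal_name}(" in sample.test_code


def clean(samples: Iterable[Sample],
          coverage_fn: Callable[[Sample], float],
          min_coverage: float = 0.5) -> List[Sample]:
    """Keep samples that pass all three filters, cheapest check first.

    coverage_fn stands in for the model-based coverage filter; the
    0.5 threshold is an arbitrary placeholder.
    """
    kept = []
    for s in samples:
        if syntax_ok(s) and relevant(s) and coverage_fn(s) >= min_coverage:
            kept.append(s)
    return kept
```

Ordering the filters from cheapest to most expensive means the coverage estimate, presumably the costly step, only runs on samples that already passed the two rule-based checks.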

14. 【2502.14137】Collaborative Retrieval for Large Language Model-based Conversational Recommender Systems

Link: https://arxiv.org/abs/2502.14137

Authors: Yaochen Zhu, Chao Wan, Harald Steck, Dawen Liang, Yesu Feng, Nathan Kallus, Jundong Li

Categories: Information Retrieval (cs.IR)

Keywords: Conversational recommender systems, provide personalized recommendations, Retrieval Augmented Generation, recommender systems, aim to provide

Comments: Accepted by WWW'2025

Abstract:Conversational recommender systems (CRS) aim to provide personalized recommendations via interactive dialogues with users. While large language models (LLMs) enhance CRS with their superior understanding of context-aware user preferences, they typically struggle to leverage behavioral data, which have proven to be important for classical collaborative filtering (CF)-based approaches. For this reason, we propose CRAG, Collaborative Retrieval Augmented Generation for LLM-based CRS. To the best of our knowledge, CRAG is the first approach that combines state-of-the-art LLMs with CF for conversational recommendations. Our experiments on two publicly available movie conversational recommendation datasets, i.e., a refined Reddit dataset (which we name Reddit-v2) as well as the Redial dataset, demonstrate the superior item coverage and recommendation performance of CRAG, compared to several CRS baselines. Moreover, we observe that the improvements are mainly due to better recommendation accuracy on recently released movies. The code and data are available at this https URL.

15. 【2502.14100】Towards Context-Robust LLMs: A Gated Representation Fine-tuning Approach

Link: https://arxiv.org/abs/2502.14100

Authors: Shenglai Zeng, Pengfei He, Kai Guo, Tianqi Zheng, Hanqing Lu, Yue Xing, Hui Liu

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large Language Models, Large Language, handling imperfect evidence, Language Models, retrieval-augmented generation

Abstract:Large Language Models (LLMs) enhanced with external contexts, such as through retrieval-augmented generation (RAG), often face challenges in handling imperfect evidence. They tend to over-rely on external knowledge, making them vulnerable to misleading and unhelpful contexts. To address this, we propose the concept of context-robust LLMs, which can effectively balance internal knowledge with external context, similar to human cognitive processes. Specifically, context-robust LLMs should rely on external context only when lacking internal knowledge, identify contradictions between internal and external knowledge, and disregard unhelpful contexts. To achieve this goal, we introduce Grft, a lightweight and plug-and-play gated representation fine-tuning approach. Grft consists of two key components: a gating mechanism to detect and filter problematic inputs, and low-rank representation adapters to adjust hidden representations. By training a lightweight intervention function with only 0.0004% of model size on fewer than 200 examples, Grft can effectively adapt LLMs towards context-robust behaviors.
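To make the two components named in this abstract concrete, here is a minimal sketch of a gate combined with a low-rank adapter acting on one hidden vector. It is an assumption-laden illustration, not Grft's implementation: the function name, the sigmoid gate, and the rule `h + gate * up(down(h))` for combining gate and adapter are all hypothetical choices.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product for small illustrative shapes."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def gated_low_rank_update(h: Vector, gate_w: Vector, gate_b: float,
                          down: Matrix, up: Matrix) -> Vector:
    """Return h plus a gated low-rank correction.

    gate_w, gate_b: parameters of a scalar sigmoid gate (assumed form).
    down: r x d projection, up: d x r projection, so up @ down has rank <= r.
    """
    score = sum(w * x for w, x in zip(gate_w, h)) + gate_b
    gate = 1.0 / (1.0 + math.exp(-score))
    delta = matvec(up, matvec(down, h))
    return [x + gate * d for x, d in zip(h, delta)]
```

When the gate is near zero the hidden state passes through almost unchanged, which is one way a small intervention could leave ordinary inputs alone and only adjust representations for inputs the gate flags.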

Computer Vision

1. 【2502.14865】Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts

Link: https://arxiv.org/abs/2502.14865

Authors: Sara Ghaboura, Ketan More, Ritesh Thawkar, Wafa Alghallabi, Omkar Thawakar, Fahad Shahbaz Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: advanced computational techniques, artifacts demands human, demands human expertise, process remains complex, cultural artifacts demands

Comments: 4 pages, 6 figures

Abstract:Understanding historical and cultural artifacts demands human expertise and advanced computational techniques, yet the process remains complex and time-intensive. While large multimodal models offer promising support, their evaluation and improvement require a standardized benchmark. To address this, we introduce TimeTravel, a benchmark of 10,250 expert-verified samples spanning 266 distinct cultures across 10 major historical regions. Designed for AI-driven analysis of manuscripts, artworks, inscriptions, and archaeological discoveries, TimeTravel provides a structured dataset and robust evaluation framework to assess AI models' capabilities in classification, interpretation, and historical comprehension. By integrating AI with historical research, TimeTravel fosters AI-powered tools for historians, archaeologists, researchers, and cultural tourists to extract valuable insights while ensuring technology contributes meaningfully to historical discovery and cultural heritage preservation. We evaluate contemporary AI models on TimeTravel, highlighting their strengths and identifying areas for improvement. Our goal is to establish AI as a reliable partner in preserving cultural heritage, ensuring that technological advancements contribute meaningfully to historical discovery. Our code is available at this https URL.

2. 【2502.14864】Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework

Link: https://arxiv.org/abs/2502.14864

Authors: Yuming Yang, Jiang Zhong, Li Jin, Jingwang Huang, Jingpeng Gao, Qing Liu, Yang Bai, Jingyuan Zhang, Rui Jiang, Kaiwen Wei

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrating external knowledge, enhances reasoning capabilities, Multimodal Retrieval-Augmented Generation, external knowledge, Chart-based MRAG

Abstract:Multimodal Retrieval-Augmented Generation (MRAG) enhances reasoning capabilities by integrating external knowledge. However, existing benchmarks primarily focus on simple image-text interactions, overlooking complex visual formats like charts that are prevalent in real-world applications. In this work, we introduce a novel task, Chart-based MRAG, to address this limitation. To semi-automatically generate high-quality evaluation samples, we propose CHARt-based document question-answering GEneration (CHARGE), a framework that produces evaluation data through structured keypoint extraction, crossmodal verification, and keypoint-based generation. By combining CHARGE with expert validation, we construct Chart-MRAG Bench, a comprehensive benchmark for chart-based MRAG evaluation, featuring 4,738 question-answering pairs across 8 domains from real-world documents. Our evaluation reveals three critical limitations in current approaches: (1) unified multimodal embedding retrieval methods struggle in chart-based scenarios, (2) even with ground-truth retrieval, state-of-the-art MLLMs achieve only 58.19% Correctness and 73.87% Coverage scores, and (3) MLLMs demonstrate consistent text-over-visual modality bias during Chart-based MRAG reasoning. CHARGE and Chart-MRAG Bench are released at this https URL.

3. 【2502.14846】Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

Link: https://arxiv.org/abs/2502.14846

Authors: Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: charts and documents, critical application, Reasoning, data, Reasoning about images

Comments: 20 pages, 19 figures, 9 tables, website: https://yueyang1996.github.io/cosyn/

Abstract:Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.

4. 【2502.14844】Dynamic Concepts Personalization from Single Videos

Link: https://arxiv.org/abs/2502.14844

Authors: Rameen Abdal, Or Patashnik, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: presents unique challenges, models presents unique, personalizing Diffusion Transformers, remarkable progress, unique challenges

Comments: Webpage: https://snap-research.github.io/dynamic_concepts/

Abstract:Personalizing generative text-to-image models has seen remarkable progress, but extending this personalization to text-to-video models presents unique challenges. Unlike static concepts, personalizing text-to-video models has the potential to capture dynamic concepts, i.e., entities defined not only by their appearance but also by their motion. In this paper, we introduce Set-and-Sequence, a novel framework for personalizing Diffusion Transformers (DiTs)-based generative video models with dynamic concepts. Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features. This is achieved in two key stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an unordered set of frames from the video to learn an identity LoRA basis that represents the appearance, free from temporal interference. In the second stage, with the identity LoRAs frozen, we augment their coefficients with Motion Residuals and fine-tune them on the full video sequence, capturing motion dynamics. Our Set-and-Sequence framework results in a spatio-temporal weight space that effectively embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality while setting a new benchmark for personalizing dynamic concepts.

5. 【2502.14834】LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models

Link: https://arxiv.org/abs/2502.14834

Authors: Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Existing Large Vision-Language, Existing Large, Large Vision-Language Models, Large Vision-Language, generate coherent outputs

Abstract:Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: this https URL

6. 【2502.14831】Improving the Diffusability of Autoencoders

Link: https://arxiv.org/abs/2502.14831

Authors: Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: utilizing compressed latent, compressed latent representations, generating high-quality images, Latent diffusion models, utilizing compressed

Comments: 26 pages, 22 figures, 9 tables

Abstract:Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in the autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K 256x256 and FVD by at least 44% for video generation on Kinetics-700 17x256x256.
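One plausible reading of the scale-equivariance regularizer in this abstract is that decoding a downsampled latent should reproduce the downsampled image. The sketch below shows that idea on 1D signals in plain Python. It is a simplification under stated assumptions: the function names, the 2x average pooling, and the mean-squared-error penalty are hypothetical, and the paper works with real autoencoders on images and video.

```python
from typing import Callable, List

Signal = List[float]


def downsample2(x: Signal) -> Signal:
    """Average-pool by a factor of 2 (a trailing odd element is dropped)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def scale_equivariance_penalty(decoder: Callable[[Signal], Signal],
                               latent: Signal, target: Signal) -> float:
    """MSE between decoder(downsample(latent)) and downsample(target).

    The penalty is zero when decoding commutes with downsampling
    for this latent/target pair.
    """
    pred = decoder(downsample2(latent))
    ref = downsample2(target)
    if len(pred) != len(ref):
        raise ValueError("decoder output and downsampled target differ in length")
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)
```

In training, a term like this would be added to the usual reconstruction loss, so the decoder is pushed to treat the low-frequency part of the latent as a valid latent in its own right.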

7. 【2502.14827】Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison

Link: https://arxiv.org/abs/2502.14827

Authors: Aiswarya Baby, Tintu Thankom Koshy

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Keywords: Visual Question Answering, natural language processing, Question Answering, natural language questions, visual content

Comments: 8 pages, No figures

Abstract:Visual Question Answering (VQA) has emerged as a pivotal task in the intersection of computer vision and natural language processing, requiring models to understand and reason about visual content in response to natural language questions. Analyzing VQA datasets is essential for developing robust models that can handle the complexities of multimodal reasoning. Several approaches have been developed to examine these datasets, each offering distinct perspectives on question diversity, answer distribution, and visual-textual correlations. Despite significant progress, existing VQA models face challenges related to dataset bias, limited model complexity, commonsense reasoning gaps, rigid evaluation methods, and generalization to real world scenarios. This paper presents a comprehensive comparative study of five advanced VQA models: ABC-CNN, KICNLE, Masked Vision and Language Modeling, BLIP-2, and OFA, each employing distinct methodologies to address these challenges.

8. 【2502.14801】AVD2: Accident Video Diffusion for Accident Video Description

Link: https://arxiv.org/abs/2502.14801

Authors: Cheng Li, Keyuan Zhou, Tong Liu, Yu Wang, Mingqiao Zhuang, Huan-ang Gao, Bu Jin, Hao Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accident Video Diffusion, Accident Video Description, Accident Video Understanding, generating accident videos, Multi-Modal Accident Video

Comments: ICRA 2025, Project Page: https://an-answer-tree.github.io/

Abstract:Traffic accidents present complex challenges for autonomous driving, often featuring unpredictable scenarios that hinder accurate system interpretation and […]. […], prevailing methodologies fall short in elucidating the causes of accidents and proposing preventive measures due to the paucity of training data specific to accident […]. In this work, we introduce AVD2 (Accident Video Diffusion for Accident Video Description), a novel framework that enhances accident scene understanding by generating accident videos that are aligned with detailed natural language descriptions and reasoning, resulting in the contributed EMM-AU (Enhanced Multi-Modal Accident Video Understanding) dataset. Empirical results reveal that the integration of the EMM-AU dataset establishes state-of-the-art performance across both automated metrics and human evaluations, markedly advancing the domains of accident analysis and prevention. Project resources are available at this https URL

9. 【2502.14799】A Survey on Text-Driven 360-Degree Panorama Generation

Link: https://arxiv.org/abs/2502.14799

Authors: Hai Wang, Xiaoyu Xiang, Weihao Xia, Jing-Hao Xue

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: panoramic images directly, visual content creation, immersive visual content, enabling the synthesis, panoramic images

Abstract:The advent of text-driven 360-degree panorama generation, enabling the synthesis of 360-degree panoramic images directly from textual descriptions, marks a transformative advancement in immersive visual content creation. This innovation significantly simplifies the traditionally complex process of producing such content. Recent progress in text-to-image diffusion models has accelerated the rapid development in this emerging field. This survey presents a comprehensive review of text-driven 360-degree panorama generation, offering an in-depth analysis of state-of-the-art algorithms and their expanding applications in 360-degree 3D scene generation. Furthermore, we critically examine current limitations and propose promising directions for future research. A curated project page with relevant resources and research papers is available at this https URL.

10. 【2502.14795】Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration

Link: https://arxiv.org/abs/2502.14795

Authors: Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, Yuefan Wang, Huaicheng Zhou, Wenshuo Feng, Jiacheng Liu, Siteng Huang, Donglin Wang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: current humanoid robot, autonomous interaction capabilities, interaction capabilities due, lack autonomous interaction, humanoid robot control

Abstract:This paper addresses the limitations of current humanoid robot control frameworks, which primarily rely on reactive mechanisms and lack autonomous interaction capabilities due to data scarcity. We propose Humanoid-VLA, a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control. Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions, allowing the model to learn universal motion patterns and action semantics. We then incorporate egocentric visual context through a parameter efficient video-conditioned fine-tuning, enabling context-aware motion generation. Furthermore, we introduce a self-supervised data augmentation strategy that automatically generates pseudoannotations directly derived from motion data. This process converts raw motion sequences into informative question-answer pairs, facilitating the effective use of large-scale unlabeled video data. Built upon whole-body control architectures, extensive experiments show that Humanoid-VLA achieves object interaction and environment exploration tasks with enhanced contextual awareness, demonstrating a more human-like capacity for adaptive and intelligent engagement.

11. 【2502.14792】RendBEV: Semantic Novel View Synthesis for Self-Supervised Bird's Eye View Segmentation

Link: https://arxiv.org/abs/2502.14792

Authors: Henrique Piñeiro Monteagudo, Leonardo Taccari, Aurel Pjetri, Francesco Sambo, Samuele Salti

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Bird Eye View, Bird Eye, autonomous driving tasks, Eye View, driving tasks

Comments: Accepted at WACV 2025

Abstract:Bird's Eye View (BEV) semantic maps have recently garnered a lot of attention as a useful representation of the environment to tackle assisted and autonomous driving tasks. However, most of the existing work focuses on the fully supervised setting, training networks on large annotated datasets. In this work, we present RendBEV, a new method for the self-supervised training of BEV semantic segmentation networks, leveraging differentiable volumetric rendering to receive supervision from semantic perspective views computed by a 2D semantic segmentation model. Our method enables zero-shot BEV semantic segmentation, and already delivers competitive results in this challenging setting. When used as pretraining to then fine-tune on labeled BEV ground-truth, our method significantly boosts performance in low-annotation regimes, and sets a new state of the art when fine-tuning on all available labels.

12. 【2502.14789】Structurally Disentangled Feature Fields Distillation for 3D Understanding and Editing

Link: https://arxiv.org/abs/2502.14789

Authors: Yoel Levy, David Shavin, Itai Lang, Sagie Benaim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent work, distill pre-trained, large pre-trained, demonstrated the ability, ability to leverage

Abstract:Recent work has demonstrated the ability to leverage or distill pre-trained 2D features obtained using large pre-trained 2D models into 3D features, enabling impressive 3D editing and understanding capabilities using only 2D supervision. Although impressive, models assume that 3D features are captured using a single feature field and often make a simplifying assumption that features are view-independent. In this work, we propose instead to capture 3D features using multiple disentangled feature fields that capture different structural components of 3D features involving view-dependent and view-independent components, which can be learned from 2D feature supervision only. Subsequently, each element can be controlled in isolation, enabling semantic and structural understanding and editing capabilities. For instance, using a user click, one can segment 3D features corresponding to a given object and then segment, edit, or remove their view-dependent (reflective) properties. We evaluate our approach on the task of 3D segmentation and demonstrate a set of novel understanding and editing tasks.

13. 【2502.14786】SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Link: https://arxiv.org/abs/2502.14786

Authors: Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: encoders that build, multilingual vision-language encoders, vision-language encoders, introduce SigLIP, original image-text training

Comments: Model checkpoints are available at https://github.com/google-research/big_vision/tree/main/big_vision/configs/proj/image_text/README_siglip2.md

Abstract:We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).

14. 【2502.14780】ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting

Link: https://arxiv.org/abs/2502.14780

Authors: Abhijit Mishra, Richard Noh, Hsiang Fu, Mingda Li, Minji Kim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Efficient and privacy-preserving, privacy-preserving multimodal interaction, human-computer communication, Instruction Rewriting, modern smartphones

Comments: 12 pages, 7 figures, 3 tables

Abstract:Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.

15. 【2502.14779】DC-ControlNet: Decoupling Inter- and Intra-Element Conditions in Image Generation with Diffusion Models

Link: https://arxiv.org/abs/2502.14779

Authors: Hongji Yang, Wencheng Han, Yucheng Zhou, Jianbing Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: precisely controllable framework, highly flexible, flexible and precisely, precisely controllable, controllable framework

Abstract:In this paper, we introduce DC (Decouple)-ControlNet, a highly flexible and precisely controllable framework for multi-condition image generation. The core idea behind DC-ControlNet is to decouple control conditions, transforming global control into a hierarchical system that integrates distinct elements, contents, and layouts. This enables users to mix these individual conditions with greater flexibility, leading to more efficient and accurate image generation control. Previous ControlNet-based models rely solely on global conditions, which affect the entire image and lack the ability of element- or region-specific control. This limitation reduces flexibility and can cause condition misunderstandings in multi-conditional image generation. To address these challenges, we propose both intra-element and Inter-element Controllers in DC-ControlNet. The Intra-Element Controller handles different types of control signals within individual elements, accurately describing the content and layout characteristics of the object. For interactions between elements, we introduce the Inter-Element Controller, which accurately handles multi-element interactions and occlusion based on user-defined relationships. Extensive evaluations show that DC-ControlNet significantly outperforms existing ControlNet models and Layout-to-Image generative models in terms of control flexibility and precision in multi-condition control.

16. 【2502.14778】Harnessing PDF Data for Improving Japanese Large Multimodal Models

Link: https://arxiv.org/abs/2502.14778

Authors: Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Multimodal Models, remains limited due, Japanese remains limited, Large Multimodal, Japanese LMMs

Comments: 15 pages, 8 figures

Abstract:Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 3.9% to 13.8% on Heron-Bench. Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs. We plan to make the source code and data publicly available upon acceptance.

17. 【2502.14762】Sculpting [CLS] Features for Pre-Trained Model-Based Class-Incremental Learning

Link: https://arxiv.org/abs/2502.14762

Authors: Murat Onur Yildirim, Elif Ceren Gok Yildirim, Joaquin Vanschoren

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Class-incremental learning requires, learning requires models, Class-incremental learning, continually acquire knowledge, pre-trained models

Abstract:Class-incremental learning requires models to continually acquire knowledge of new classes without forgetting old ones. Although pre-trained models have demonstrated strong performance in class-incremental learning, they remain susceptible to catastrophic forgetting when learning new concepts. Excessive plasticity in the models breaks generalizability and causes forgetting, while strong stability results in insufficient adaptation to new classes. This necessitates effective adaptation with minimal modifications to preserve the general knowledge of pre-trained models. To address this challenge, we first introduce a new parameter-efficient fine-tuning module 'Learn and Calibrate', or LuCA, designed to acquire knowledge through an adapter-calibrator couple, enabling effective adaptation with well-refined feature representations. Second, for each learning session, we deploy a sparse LuCA module on top of the last token just before the classifier, which we refer to as 'Token-level Sparse Calibration and Adaptation', or TOSCA. This strategic design improves the orthogonality between the modules and significantly reduces both training and inference complexity. By leaving the generalization capabilities of the pre-trained models intact and adapting exclusively via the last token, our approach achieves a harmonious balance between stability and plasticity. Extensive experiments demonstrate TOSCA's state-of-the-art performance while introducing ~8 times fewer parameters compared to prior methods.

18. 【2502.14740】YOLOv12: A Breakdown of the Key Architectural Features

Link: https://arxiv.org/abs/2502.14740

Authors: Mujadded Al Rabbani Alif, Muhammad Hussain

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: introducing key improvements, object detection building, advancement in single-stage, key improvements, real-time object detection

Abstract:This paper presents an architectural analysis of YOLOv12, a significant advancement in single-stage, real-time object detection building upon the strengths of its predecessors while introducing key improvements. The model incorporates an optimised backbone (R-ELAN), 7x7 separable convolutions, and FlashAttention-driven area-based attention, improving feature extraction, efficiency, and detection robustness. With multiple model variants, similar to its predecessors, YOLOv12 offers scalable solutions for both latency-sensitive and high-accuracy applications. Experimental results show consistent gains in mean average precision (mAP) and inference speed, making YOLOv12 a compelling choice for applications in autonomous systems, security, and real-time analytics. By achieving an optimal balance between computational efficiency and performance, YOLOv12 sets a new benchmark for real-time computer vision, facilitating deployment across diverse hardware platforms, from edge devices to high-performance clusters.

19. 【2502.14721】Multi-dataset synergistic in supervised learning to pre-label structural components in point clouds from shell construction scenes

Link: https://arxiv.org/abs/2502.14721

Authors: Lukas Rauch, Thomas Braml

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant effort required, hinders computer vision, computer vision research, datasets hinders computer, significant effort

Comments: 18 pages, 8 figures, 7 tables

Abstract:The significant effort required to annotate data for new training datasets hinders computer vision research and machine learning in the construction industry. This work explores adapting standard datasets and the latest transformer model architectures for point cloud semantic segmentation in the context of shell construction sites. Unlike common approaches focused on object segmentation of building interiors and furniture, this study addresses the challenges of segmenting complex structural components in Architecture, Engineering, and Construction (AEC). We establish a baseline through supervised training and a custom validation dataset, evaluate the cross-domain inference with large-scale indoor datasets, and utilize transfer learning to maximize segmentation performance with minimal new data. The findings indicate that with minimal fine-tuning, pre-trained transformer architectures offer an effective strategy for building component segmentation. Our results are promising for automating the annotation of new, previously unseen data when creating larger training resources and for the segmentation of frequently recurring objects.

20. 【2502.14684】CDGS: Confidence-Aware Depth Regularization for 3D Gaussian Splatting

Link: https://arxiv.org/abs/2502.14684

Authors: Qilin Zhang, Olaf Wysocki, Steffen Urban, Boris Jutzi

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, shown significant advantages, achieving high rendering, high rendering speeds, view synthesis

Abstract:3D Gaussian Splatting (3DGS) has shown significant advantages in novel view synthesis (NVS), particularly in achieving high rendering speeds and high-quality results. However, its geometric accuracy in 3D reconstruction remains limited due to the lack of explicit geometric constraints during optimization. This paper introduces CDGS, a confidence-aware depth regularization approach developed to enhance 3DGS. We leverage multi-cue confidence maps of monocular depth estimation and sparse Structure-from-Motion depth to adaptively adjust depth supervision during the optimization process. Our method demonstrates improved geometric detail preservation in early training stages and achieves competitive performance in both NVS quality and geometric accuracy. Experiments on the publicly available Tanks and Temples benchmark dataset show that our method achieves more stable convergence behavior and more accurate geometric reconstruction results, with improvements of up to 2.31 dB in PSNR for NVS and consistently lower geometric errors in M3C2 distance metrics. Notably, our method reaches comparable F-scores to the original 3DGS with only 50% of the training iterations. We expect this work will facilitate the development of efficient and accurate 3D reconstruction systems for real-world applications such as digital twin creation, heritage preservation, or forestry applications.
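
As a rough illustration of confidence-aware depth supervision, the sketch below weights a per-pixel L1 depth error by a confidence value, so that unreliable depth priors contribute less to the loss. This is not the CDGS implementation: the function name `confidence_weighted_depth_loss`, the choice of an L1 error, and the normalization by total confidence are all assumptions made for this example.

```python
def confidence_weighted_depth_loss(rendered, prior, confidence):
    """Confidence-weighted mean absolute depth error (hypothetical helper).

    rendered:   depths rendered from the current scene model
    prior:      depth priors, e.g. from a monocular depth estimator
    confidence: per-pixel weights in [0, 1]; 0 means "ignore this pixel"
    """
    if not (len(rendered) == len(prior) == len(confidence)):
        raise ValueError("all inputs must have the same length")
    total_weight = sum(confidence)
    if total_weight == 0:
        # No pixel is trusted, so the depth term contributes nothing.
        return 0.0
    weighted_error = sum(
        c * abs(r - p) for r, p, c in zip(rendered, prior, confidence)
    )
    return weighted_error / total_weight


# The third pixel has a large error but zero confidence, so it is ignored.
loss = confidence_weighted_depth_loss(
    rendered=[1.0, 2.0, 4.0],
    prior=[1.0, 3.0, 0.0],
    confidence=[1.0, 1.0, 0.0],
)
print(loss)
```

Normalizing by the summed confidence keeps the loss scale comparable when many pixels are down-weighted; a real system might instead normalize by pixel count or combine several confidence cues.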

21. 【2502.14676】BP-SGCN: Behavioral Pseudo-Label Informed Sparse Graph Convolution Network for Pedestrian and Heterogeneous Trajectory Prediction

Link: https://arxiv.org/abs/2502.14676

Authors: Ruochen Li, Stamos Katsigiannis, Tae-Kyun Kim, Hubert P. H. Shum

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: short-term future movement, Trajectory prediction, decision-making in applications, applications of autonomous, surveillance by predicting

Abstract:Trajectory prediction allows better decision-making in applications of autonomous vehicles or surveillance by predicting the short-term future movement of traffic agents. It is classified into pedestrian or heterogeneous trajectory prediction. The former exploits the relatively consistent behavior of pedestrians, but is limited in real-world scenarios with heterogeneous traffic agents such as cyclists and vehicles. The latter typically relies on extra class label information to distinguish the heterogeneous agents, but such labels are costly to annotate and cannot be generalized to represent different behaviors within the same class of agents. In this work, we introduce the behavioral pseudo-labels that effectively capture the behavior distributions of pedestrians and heterogeneous agents solely based on their motion features, significantly improving the accuracy of trajectory prediction. To implement the framework, we propose the Behavioral Pseudo-Label Informed Sparse Graph Convolution Network (BP-SGCN) that learns pseudo-labels and uses them to inform a trajectory predictor. For optimization, we propose a cascaded training scheme, in which we first learn the pseudo-labels in an unsupervised manner, and then perform end-to-end fine-tuning on the labels in the direction of increasing the trajectory prediction accuracy. Experiments show that our pseudo-labels effectively model different behavior clusters and improve trajectory prediction. Our proposed BP-SGCN outperforms existing methods using both pedestrian (ETH/UCY, pedestrian-only SDD) and heterogeneous agent datasets (SDD, Argoverse 1).

22. 【2502.14659】MAGO-SP: Detection and Correction of Water-Fat Swaps in Magnitude-Only VIBE MRI

Link: https://arxiv.org/abs/2502.14659

Authors: Robert Graf, Hendrik Möller, Sophie Starck, Matan Atad, Philipp Braun, Jonathan Stelter, Annette Peters, Lilian Krist, Stefan N. Willich, Henry Völzke, Robin Bülow, Klaus Berger, Tobias Pischon, Thoralf Niendorf, Johannes Paetzold, Dimitrios Karampinos, Daniel Rueckert, Jan Kirschke

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Interpolated Breath-Hold Examination, Volume Interpolated Breath-Hold, Breath-Hold Examination, Interpolated Breath-Hold, generates images suitable

Abstract:Volume Interpolated Breath-Hold Examination (VIBE) MRI generates images suitable for water and fat signal composition estimation. While the two-point VIBE provides water-fat-separated images, the six-point VIBE allows estimation of the effective transversal relaxation rate R2* and the proton density fat fraction (PDFF), which are imaging markers for health and disease. Ambiguity during signal reconstruction can lead to water-fat swaps. This shortcoming challenges the application of VIBE-MRI for automated PDFF analyses of large-scale clinical data and of population studies. This study develops an automated pipeline to detect and correct water-fat swaps in non-contrast-enhanced VIBE images. Our three-step pipeline begins with training a segmentation network to classify volumes as "fat-like" or "water-like," using synthetic water-fat swaps generated by merging fat and water volumes with Perlin noise. Next, a denoising diffusion image-to-image network predicts water volumes as signal priors for correction. Finally, we integrate this prior into a physics-constrained model to recover accurate water and fat signals. Our approach achieves a 1% error rate in water-fat swap detection for a 6-point VIBE. Notably, swaps disproportionately affect individuals in the Underweight and Class 3 Obesity BMI categories. Our correction algorithm ensures accurate solution selection in chemical phase MRIs, enabling reliable PDFF estimation. This forms a solid technical foundation for automated large-scale population imaging analysis.

23. 【2502.14638】NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization

Link: https://arxiv.org/abs/2502.14638

Authors: Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires complex reasoning, cultural contexts, predicting the specific, specific location, requires complex

Abstract:Image geo-localization is the task of predicting the specific location of an image and requires complex reasoning across visual, geographical, and cultural contexts. While prior Vision Language Models (VLMs) have the best accuracy at this task, there is a dearth of high-quality datasets and models for analytical reasoning. We first create NaviClues, a high-quality dataset derived from GeoGuessr, a popular geography game, to supply examples of expert reasoning from language. Using this dataset, we present Navig, a comprehensive image geo-localization framework integrating global and fine-grained image information. By reasoning with language, Navig reduces the average distance error by 14% compared to previous state-of-the-art models while requiring fewer than 1000 training samples. Our dataset and code are available at this https URL.

24. 【2502.14616】Monocular Depth Estimation and Segmentation for Transparent Object with Iterative Semantic and Geometric Fusion

Link: https://arxiv.org/abs/2502.14616

Authors: Jiangyuan Liu, Hongxuan Ma, Yuxin Guo, Yuhao Zhao, Chi Zhang, Wei Sui, Wei Zou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: numerous robotic tasks, transparent objects, indispensable for numerous, numerous robotic, transparent objects remain

Comments: Accepted by ICRA 2025. The code is accessible through: [this https URL](https://github.com/L-J-Yuan/MODEST)

Abstract:Transparent object perception is indispensable for numerous robotic tasks. However, accurately segmenting and estimating the depth of transparent objects remain challenging due to complex optical properties. Existing methods primarily delve into only one task using extra inputs or specialized sensors, neglecting the valuable interactions among tasks and the subsequent refinement process, leading to suboptimal and blurry predictions. To address these issues, we propose a monocular framework, which is the first to excel in both segmentation and depth estimation of transparent objects, with only a single-image input. Specifically, we devise a novel semantic and geometric fusion module, effectively integrating the multi-scale information between tasks. In addition, drawing inspiration from human perception of objects, we further incorporate an iterative strategy, which progressively refines initial features for clearer results. Experiments on two challenging synthetic and real-world datasets demonstrate that our model surpasses state-of-the-art monocular, stereo, and multi-view methods by a large margin of about 38.8%-46.2% with only a single RGB input. Codes and models are publicly available at this https URL.

25. 【2502.14573】Self-supervised Monocular Depth Estimation Robust to Reflective Surface Leveraged by Triplet Mining

Link: https://arxiv.org/abs/2502.14573

Authors: Wonhyeok Choi, Kyumin Hwang, Wei Peng, Minwoo Choi, Sunghoon Im

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: RGB image sequences, Self-supervised monocular depth, Self-supervised monocular, RGB image, ground-truth depth labels

Comments: Accepted at ICLR 2025

Abstract:Self-supervised monocular depth estimation (SSMDE) aims to predict the dense depth map of a monocular image, by learning depth from RGB image sequences, eliminating the need for ground-truth depth labels. Although this approach simplifies data acquisition compared to supervised methods, it struggles with reflective surfaces, as they violate the assumptions of Lambertian reflectance, leading to inaccurate training on such surfaces. To tackle this problem, we propose a novel training strategy for an SSMDE by leveraging triplet mining to pinpoint reflective regions at the pixel level, guided by the camera geometry between different viewpoints. The proposed reflection-aware triplet mining loss specifically penalizes the inappropriate photometric error minimization on the localized reflective regions while preserving depth accuracy in non-reflective areas. We also incorporate a reflection-aware knowledge distillation method that enables a student model to selectively learn the pixel-level knowledge from reflective and non-reflective regions. This results in robust depth estimation across areas. Evaluation results on multiple datasets demonstrate that our method effectively enhances depth quality on reflective surfaces and outperforms state-of-the-art SSMDE baselines.

26. 【2502.14520】Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance

Link: https://arxiv.org/abs/2502.14520

Authors: Meng Wang, Fan Wu, Ruihui Li, Yunchuan Qin, Zhuo Tang, Kenli Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving perception, Semantic Scene Completion, comprehensive scene geometry, Scene Completion, temporal SSC method

Abstract:3D Semantic Scene Completion (SSC) provides comprehensive scene geometry and semantics for autonomous driving perception, which is crucial for enabling accurate and reliable decision-making. However, existing SSC methods are limited to capturing sparse information from the current frame or naively stacking multi-frame temporal features, thereby failing to acquire effective scene context. These approaches ignore critical motion dynamics and struggle to achieve temporal consistency. To address the above challenges, we propose a novel temporal SSC method FlowScene: Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance. By leveraging optical flow, FlowScene can integrate motion, different viewpoints, occlusions, and other contextual cues, thereby significantly improving the accuracy of 3D scene completion. Specifically, our framework introduces two key components: (1) a Flow-Guided Temporal Aggregation module that aligns and aggregates temporal features using optical flow, capturing motion-aware context and deformable structures; and (2) an Occlusion-Guided Voxel Refinement module that injects occlusion masks and temporally aggregated features into 3D voxel space, adaptively refining voxel representations for explicit geometric modeling. Experimental results demonstrate that FlowScene achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks.

27. 【2502.14514】A Mobile Robotic Approach to Autonomous Surface Scanning in Legal Medicine

Link: https://arxiv.org/abs/2502.14514

Authors: Sarah Grube, Sarah Latus, Martin Fischer, Vidas Raudonis, Axel Heinemann, Benjamin Ondruschka, Alexander Schlaefer

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

Keywords: Comprehensive legal medicine, Comprehensive legal, mobile robotic system, legal medicine, surface

Comments: Submitted and accepted for presentation at CARS 2025. This preprint has not undergone peer review or post-submission revisions. The final version of this work will appear in the official CARS 2025 proceedings

Abstract:Purpose: Comprehensive legal medicine documentation includes both an internal and an external examination of the corpse. Typically, this documentation is conducted manually during conventional autopsy. A systematic digital documentation would be desirable, especially for the external examination of wounds, which is becoming more relevant for legal medicine analysis. For this purpose, RGB surface scanning has been introduced. While a manual full surface scan using a handheld camera is time-consuming and operator-dependent, floor- or ceiling-mounted robotic systems require substantial space and a dedicated room. Hence, we consider whether a mobile robotic system can be used for external documentation. Methods: We develop a mobile robotic system that enables full-body RGB-D surface scanning. Our work includes a detailed configuration space analysis to identify the environmental parameters that need to be considered to successfully perform a surface scan. We validate our findings through an experimental study in the lab and demonstrate the system's application in a legal medicine environment. Results: Our configuration space analysis shows that a good trade-off between coverage and time is reached with three robot base positions, leading to a coverage of 94.96%. Experiments validate the effectiveness of the system in accurately capturing body surface geometry with an average surface coverage of 96.90 ± 3.16% and 92.45 ± 1.43% for a body phantom and actual corpses, respectively. Conclusion: This work demonstrates the potential of a mobile robotic system to automate RGB-D surface scanning in legal medicine, complementing the use of post-mortem CT scans for inner documentation. Our results indicate that the proposed system can contribute to more efficient and autonomous legal medicine documentation, reducing the need for manual intervention.

28. 【2502.14504】PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models

Link: https://arxiv.org/abs/2502.14504

Authors: Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Vision-Language Models, demonstrated remarkable capabilities, Vision-Language Models, Vision Token Pruning, Vision Token

Comments: 12 pages, 8 figures

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a range of multimodal tasks. However, their inference efficiency is constrained by the large number of visual tokens processed during decoding. To address this challenge, we propose Per-Layer Per-Head Vision Token Pruning (PLPHP), a two-level fine-grained pruning method including Layer-Level Retention Rate Allocation and Head-Level Vision Token Pruning. Motivated by the Vision Token Re-attention phenomenon across decoder layers, we dynamically adjust token retention rates layer by layer. Layers that exhibit stronger attention to visual information preserve more vision tokens, while layers with lower vision attention are aggressively pruned. Furthermore, PLPHP applies pruning at the attention head level, enabling different heads within the same layer to independently retain critical context. Experiments on multiple benchmarks demonstrate that PLPHP delivers an 18% faster decoding speed and reduces the Key-Value Cache (KV Cache) size by over 50%, all at the cost of 0.46% average performance drop, while also achieving notable performance improvements in multi-image tasks. These results highlight the effectiveness of fine-grained token pruning and contribute to advancing the efficiency and scalability of LVLMs. Our source code will be made publicly available.
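
The two levels described in this abstract can be sketched in a few lines. The code below is only an illustration of the idea, not the PLPHP algorithm: the linear mapping from attention to retention rate, the `min_rate` and `max_rate` defaults, and both function names are hypothetical choices for this example.

```python
import math


def layer_retention_rates(vision_attention_per_layer, min_rate=0.25, max_rate=0.75):
    """Map each layer's attention mass on vision tokens to a retention rate.

    Layers that attend more to vision tokens keep more of them. The linear
    rescaling between min_rate and max_rate is an assumption of this sketch.
    """
    lo = min(vision_attention_per_layer)
    hi = max(vision_attention_per_layer)
    span = hi - lo
    rates = []
    for attention in vision_attention_per_layer:
        # If every layer attends equally, fall back to the midpoint rate.
        frac = 0.5 if span == 0 else (attention - lo) / span
        rates.append(min_rate + (max_rate - min_rate) * frac)
    return rates


def prune_head(head_scores, rate):
    """Return the sorted indices of the vision tokens that one head keeps.

    head_scores: that head's attention score for each vision token.
    rate:        fraction of tokens to retain in this layer.
    """
    # The small epsilon guards against floating-point overshoot in rate * n.
    keep = max(1, math.ceil(rate * len(head_scores) - 1e-9))
    ranked = sorted(range(len(head_scores)), key=lambda i: head_scores[i], reverse=True)
    return sorted(ranked[:keep])


rates = layer_retention_rates([0.25, 0.75, 0.5])
# Two heads in the same layer share a rate but keep different tokens.
head_a = prune_head([0.1, 0.7, 0.05, 0.15], rates[2])
head_b = prune_head([0.4, 0.1, 0.3, 0.2], rates[2])
print(rates, head_a, head_b)
```

Because each head ranks tokens by its own scores, two heads in one layer can retain different tokens under the same budget, which is the head-level independence the abstract refers to.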

29. 【2502.14503】LXLv2: Enhanced LiDAR Excluded Lean 3D Object Detection with Fusion of 4D Radar and Camera

Link: https://arxiv.org/abs/2502.14503

Authors: Weiyi Xiong, Zean Zou, Qiuchi Zhao, Fengchun He, Bing Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image view transformation, sampling-based image view, predicted image depth, image depth distribution, depth distribution maps

Comments: Accepted by IEEE Robotics and Automation Letters

Abstract:As the previous state-of-the-art 4D radar-camera fusion-based 3D object detection method, LXL utilizes the predicted image depth distribution maps and radar 3D occupancy grids to assist the sampling-based image view transformation. However, the depth prediction lacks accuracy and consistency, and the concatenation-based fusion in LXL impedes the model robustness. In this work, we propose LXLv2, where modifications are made to overcome the limitations and improve the performance. Specifically, considering the position error in radar measurements, we devise a one-to-many depth supervision strategy via radar points, where the radar cross section (RCS) value is further exploited to adjust the supervision area for object-level depth consistency. Additionally, a channel and spatial attention-based fusion module named CSAFusion is introduced to improve feature adaptiveness. Experimental results on the View-of-Delft and TJ4DRadSet datasets show that the proposed LXLv2 can outperform LXL in detection accuracy, inference speed and robustness, demonstrating the effectiveness of the model.

30. 【2502.14495】Nearshore Underwater Target Detection Meets UAV-borne Hyperspectral Remote Sensing: A Novel Hybrid-level Contrastive Learning Framework and Benchmark Dataset

Link: https://arxiv.org/abs/2502.14495

Authors: Jiahao Qi, Chuanhong Zhou, Xingyue Liu, Chen Chen, Dehui Zhu, Kangcheng Bin, Ping Zhong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: UAV-borne hyperspectral remote, hyperspectral remote sensing, traditional hyperspectral UTD, remote sensing, sensing has emerged

Comments: 18 pages, 13 figures

Abstract:UAV-borne hyperspectral remote sensing has emerged as a promising approach for underwater target detection (UTD). However, its effectiveness is hindered by spectral distortions in nearshore environments, which compromise the accuracy of traditional hyperspectral UTD (HUTD) methods that rely on bathymetric models. These distortions lead to significant uncertainty in target and background spectra, challenging the detection process. To address this, we propose the Hyperspectral Underwater Contrastive Learning Network (HUCLNet), a novel framework that integrates contrastive learning with a self-paced learning paradigm for robust HUTD in nearshore regions. HUCLNet extracts discriminative features from distorted hyperspectral data through contrastive learning, while the self-paced learning strategy selectively prioritizes the most informative samples. Additionally, a reliability-guided clustering strategy enhances the robustness of learned representations. To evaluate the method's effectiveness, we construct a novel nearshore HUTD benchmark dataset, ATR2-HUTD, covering three diverse scenarios with varying water types, turbidity, and target types. Extensive experiments demonstrate that HUCLNet significantly outperforms state-of-the-art methods. The dataset and code will be publicly available at: this https URL

31. 【2502.14493】CrossFuse: Learning Infrared and Visible Image Fusion by Cross-Sensor Top-K Vision Alignment and Beyond

Link: https://arxiv.org/abs/2502.14493

Authors: Yukai Shi, Cidan Shi, Zhipeng Weng, Yin Tian, Xiaoyu Xian, Liang Lin

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: autonomous driving systems, driving systems, increasingly applied, applied in critical, critical fields

Comments: IEEE T-CSVT. We mainly discuss the out-of-distribution challenges in infrared and visible image fusion

Abstract:Infrared and visible image fusion (IVIF) is increasingly applied in critical fields such as video surveillance and autonomous driving systems. Significant progress has been made in deep learning-based fusion methods. However, these models frequently encounter out-of-distribution (OOD) scenes in real-world applications, which severely impact their performance and reliability. Therefore, addressing the challenge of OOD data is crucial for the safe deployment of these models in open-world environments. Unlike existing research, our focus is on the challenges posed by OOD data in real-world applications and on enhancing the robustness and generalization of models. In this paper, we propose an infrared-visible fusion framework based on Multi-View Augmentation. For external data augmentation, Top-k Selective Vision Alignment is employed to mitigate distribution shifts between datasets by performing RGB-wise transformations on visible images. This strategy effectively introduces augmented samples, enhancing the adaptability of the model to complex real-world scenarios. Additionally, for internal data augmentation, self-supervised learning is established using Weak-Aggressive Augmentation. This enables the model to learn more robust and general feature representations during the fusion process, thereby improving robustness and generalization. Extensive experiments demonstrate that the proposed method exhibits superior performance and robustness across various conditions and environments. Our approach significantly enhances the reliability and stability of IVIF tasks in practical applications.

32. 【2502.14487】Temporal Misalignment and Probabilistic Neurons

Link: https://arxiv.org/abs/2502.14487

Authors: Velibor Bojković, Xiaofeng Wu, Bin Gu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Artificial Neural Networks, Spiking Neural Networks, Neural Networks, biological neural principles, large-scale neural models

Abstract:Spiking Neural Networks (SNNs) offer a more energy-efficient alternative to Artificial Neural Networks (ANNs) by mimicking biological neural principles, establishing them as a promising approach to mitigate the increasing energy demands of large-scale neural models. However, fully harnessing the capabilities of SNNs remains challenging due to their discrete signal processing and temporal dynamics. ANN-SNN conversion has emerged as a practical approach, enabling SNNs to achieve competitive performance on complex machine learning tasks. In this work, we identify a phenomenon in the ANN-SNN conversion framework, termed temporal misalignment, in which random spike rearrangement across SNN layers leads to performance improvements. Based on this observation, we introduce biologically plausible two-phase probabilistic (TPP) spiking neurons, further enhancing the conversion process. We demonstrate the advantages of our proposed method both theoretically and empirically through comprehensive experiments on CIFAR-10/100, CIFAR10-DVS, and ImageNet across a variety of architectures, achieving state-of-the-art results.

33. 【2502.14471】Integrating Extra Modality Helps Segmentor Find Camouflaged Objects Well

Link: https://arxiv.org/abs/2502.14471

Authors: Chengyu Fang, Chunming He, Longxiang Tang, Yuelin Zhang, Chenyang Zhu, Yuqi Shen, Chubin Chen, Guoxia Xu, Xiu Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Camouflaged Object Segmentation, Camouflaged Object, challenging problem due, subtle visual differences, Object Segmentation

Comments: 12 pages, 5 figures, 6 tables

Abstract:Camouflaged Object Segmentation (COS) remains a challenging problem due to the subtle visual differences between camouflaged objects and backgrounds. Owing to the exceedingly limited visual cues available from the visible spectrum, previous RGB single-modality approaches often struggle to achieve satisfactory results, prompting the exploration of multimodal data to enhance detection accuracy. In this work, we present UniCOS, a novel framework that effectively leverages diverse data modalities to improve segmentation performance. UniCOS comprises two key components: a multimodal segmentor, UniSEG, and a cross-modal knowledge learning module, UniLearner. UniSEG employs a state space fusion mechanism to integrate cross-modal features within a unified state space, enhancing contextual understanding and improving robustness to integration of heterogeneous data. Additionally, it includes a fusion-feedback mechanism that facilitates feature extraction. UniLearner exploits multimodal data unrelated to the COS task to improve the segmentation ability of the COS models by generating pseudo-modal content and cross-modal semantic associations. Extensive experiments demonstrate that UniSEG outperforms existing Multimodal COS (MCOS) segmentors, regardless of whether real or pseudo-multimodal COS data is available. Moreover, in scenarios where multimodal COS data is unavailable but multimodal non-COS data is accessible, UniLearner effectively exploits these data to enhance segmentation performance. Our code will be made publicly available on GitHub at this https URL.

34. 【2502.14462】Single-image Reflectance and Transmittance Estimation from Any Flatbed Scanner

Link: https://arxiv.org/abs/2502.14462

Authors: Carlos Rodriguez-Pardo, David Pascual-Hernandez, Javier Rodriguez-Vazquez, Jorge Lopez-Moreno, Elena Garces

Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: emerged as promising, promising devices, flatbed scanner, Abstract, Flatbed

Comments: Accepted to Computers & Graphics

Abstract:Flatbed scanners have emerged as promising devices for high-resolution, single-image material capture. However, existing approaches assume very specific conditions, such as uniform diffuse illumination, which are only available in certain high-end devices, limiting scalability and raising cost. In contrast, in this work, we introduce a method inspired by intrinsic image decomposition, which accurately removes both shading and specularity, effectively allowing captures with any flatbed scanner. Further, we extend previous work on single-image material reflectance capture with the estimation of opacity and transmittance, critical components of full material appearance (SVBSDF), improving the results for any material captured with a flatbed scanner, at a very high resolution and accuracy.

35. 【2502.14454】Exploiting Deblurring Networks for Radiance Fields

Link: https://arxiv.org/abs/2502.14454

Authors: Haeyun Choi, Heemin Yang, Janghyeok Han, Sunghyun Cho

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: radiance field, blurred training views, radiance field deblurring, radiance field construction, field deblurring approach

Abstract:In this paper, we propose DeepDeblurRF, a novel radiance field deblurring approach that can synthesize high-quality novel views from blurred training views with significantly reduced training time. DeepDeblurRF leverages deep neural network (DNN)-based deblurring modules to enjoy their deblurring performance and computational efficiency. To effectively combine DNN-based deblurring and radiance field construction, we propose a novel radiance field (RF)-guided deblurring and an iterative framework that performs RF-guided deblurring and radiance field construction in an alternating manner. Moreover, DeepDeblurRF is compatible with various scene representations, such as voxel grids and 3D Gaussians, expanding its applicability. We also present BlurRF-Synth, the first large-scale synthetic dataset for training radiance field deblurring frameworks. We conduct extensive experiments on both camera motion blur and defocus blur, demonstrating that DeepDeblurRF achieves state-of-the-art novel-view synthesis quality with significantly reduced training time.

36. 【2502.14442】Stochastic Resonance Improves the Detection of Low Contrast Images in Deep Learning Models

Link: https://arxiv.org/abs/2502.14442

Authors: Siegfried Ludwig

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: types of systems, Stochastic resonance describes, improving the detectability, detectability of weak, weak signals

Comments: MSc Course Project

Abstract:Stochastic resonance describes the utility of noise in improving the detectability of weak signals in certain types of systems. It has been observed widely in natural and engineered settings, but its utility in image classification with rate-based neural networks has not been studied extensively. In this analysis a simple LSTM recurrent neural network is trained for digit recognition and classification. During the test phase, image contrast is reduced to a point where the model fails to recognize the presence of a stimulus. Controlled noise is added to partially recover classification performance. The results indicate the presence of stochastic resonance in rate-based recurrent neural networks.
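
The effect described in this abstract can be illustrated without a neural network. The sketch below is not the paper's LSTM experiment; it assumes a plain hard-threshold detector, a sub-threshold sine wave, and Gaussian noise, and the function name `detection_score` and all parameter values are hypothetical. It measures how well the detector output correlates with the hidden signal at different noise levels.

```python
import math
import random


def detection_score(noise_std, threshold=1.0, amplitude=0.5, n_samples=4000, seed=0):
    """Correlation between a sub-threshold sine wave and a thresholded, noisy copy of it."""
    rng = random.Random(seed)
    # The clean signal never reaches the threshold on its own (amplitude < threshold).
    signal = [amplitude * math.sin(2 * math.pi * t / 100) for t in range(n_samples)]
    # Hard-threshold detector: fires only when signal plus noise exceeds the threshold.
    output = [1.0 if s + rng.gauss(0.0, noise_std) > threshold else 0.0 for s in signal]

    mean_s = sum(signal) / n_samples
    mean_o = sum(output) / n_samples
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(signal, output))
    var_s = sum((s - mean_s) ** 2 for s in signal)
    var_o = sum((o - mean_o) ** 2 for o in output)
    if var_o == 0.0:
        return 0.0  # detector output is constant, so it carries no information
    return cov / math.sqrt(var_s * var_o)


if __name__ == "__main__":
    # No noise, moderate noise, and heavy noise (illustrative values only).
    for noise_std in (0.0, 0.5, 10.0):
        print(noise_std, round(detection_score(noise_std), 3))
```

Under these assumptions the score is zero without noise, rises for moderate noise, and falls again when the noise swamps the signal, which is the characteristic shape of stochastic resonance.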

37. 【2502.14433】Daily Land Surface Temperature Reconstruction in Landsat Cross-Track Areas Using Deep Ensemble Learning With Uncertainty Quantification

Link: https://arxiv.org/abs/2502.14433

Authors: Shengjie Liu, Siqin Wang, Lu Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: LST, land surface temperature, real-world applications rely, RMSE, Landsat LST

Abstract:Many real-world applications rely on land surface temperature (LST) data at high spatiotemporal resolution. In complex urban areas, LST exhibits significant variations, fluctuating dramatically within and across city blocks. Landsat provides high spatial resolution data at 100 meters but is limited by long revisit time, with cloud cover further disrupting data collection. Here, we propose DELAG, a deep ensemble learning method that integrates annual temperature cycles and Gaussian processes, to reconstruct Landsat LST in complex urban areas. Leveraging the cross-track characteristics and dual-satellite operation of Landsat since 2021, we further enhance data availability to 4 scenes every 16 days. We select New York City, London and Hong Kong from three different continents as study areas. Experiments show that DELAG successfully reconstructed LST in the three cities under clear-sky (RMSE = 0.73-0.96 K) and heavily-cloudy (RMSE = 0.84-1.62 K) situations, superior to existing methods. Additionally, DELAG can quantify uncertainty that enhances LST reconstruction reliability. We further tested the reconstructed LST to estimate near-surface air temperature, achieving results (RMSE = 1.48-2.11 K) comparable to those derived from clear-sky LST (RMSE = 1.63-2.02 K). The results demonstrate the successful reconstruction through DELAG and highlight the broader applications of LST reconstruction for estimating accurate air temperature. Our study thus provides a novel and practical method for Landsat LST reconstruction, particularly suited for complex urban areas within Landsat cross-track areas, taking one step toward addressing complex climate events at high spatiotemporal resolution.

38. 【2502.14420】ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model

Link: https://arxiv.org/abs/2502.14420

Authors: Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Ran Cheng, Yaxin Peng, Chaomin Shen, Feifei Feng

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Humans possess, unified cognitive ability, ability to perceive, physical world, cognitive ability

Abstract:Humans possess a unified cognitive ability to perceive, comprehend, and interact with the physical world. Why can't large language models replicate this holistic understanding? Through a systematic analysis of existing training paradigms in vision-language-action models (VLA), we identify two key challenges: spurious forgetting, where robot training overwrites crucial visual-text alignments, and task interference, where competing control and understanding tasks degrade performance when trained jointly. To overcome these limitations, we propose ChatVLA, a novel framework featuring Phased Alignment Training, which incrementally integrates multimodal data after initial control mastery, and a Mixture-of-Experts architecture to minimize task interference. ChatVLA demonstrates competitive performance on visual question-answering datasets and significantly surpasses state-of-the-art vision-language-action (VLA) methods on multimodal understanding benchmarks. Notably, it achieves a six times higher performance on MMMU and scores 47.2% on MMStar with a more parameter-efficient design than ECoT. Furthermore, ChatVLA demonstrates superior performance on 25 real-world robot manipulation tasks compared to existing VLA methods like OpenVLA. Our findings highlight the potential of our unified framework for achieving both robust multimodal understanding and effective robot control.

39. 【2502.14412】Evaluating Precise Geolocation Inference Capabilities of Vision Language Models

Link: https://arxiv.org/abs/2502.14412

Authors: Neel Jay, Hieu Minh Nguyen, Trung Dung Hoang, Jacob Haimes

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: raises important questions, raises important, Google Street View, prevalence of Vision-Language, important questions

Comments: AAAI 2025 Workshop DATASAFE

Abstract:The prevalence of Vision-Language Models (VLMs) raises important questions about privacy in an era where visual information is increasingly available. While foundation VLMs demonstrate broad knowledge and learned capabilities, we specifically investigate their ability to infer geographic location from previously unseen image data. This paper introduces a benchmark dataset collected from Google Street View that represents its global distribution of coverage. Foundation models are evaluated on single-image geolocation inference, with many achieving median distance errors of 300 km. We further evaluate VLM "agents" with access to supplemental tools, observing up to a 30.6% decrease in distance error. Our findings establish that modern foundation VLMs can act as powerful image geolocation tools, without being specifically trained for this task. When coupled with increasing accessibility of these models, our findings have greater implications for online privacy. We discuss these risks, as well as future work in this area.
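
The "median distance error" metric mentioned in this abstract can be sketched as follows. This is an assumption about how such a metric is typically computed (great-circle distance via the haversine formula on a spherical Earth), not the paper's evaluation code; the function names and the mean Earth radius constant are illustrative.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def median_error_km(predictions, ground_truth):
    """Median haversine error over paired lists of (lat, lon) tuples."""
    errors = [haversine_km(p[0], p[1], g[0], g[1]) for p, g in zip(predictions, ground_truth)]
    return median(errors)


if __name__ == "__main__":
    # Hypothetical predictions and labels, for illustration only.
    preds = [(48.85, 2.35), (40.71, -74.00), (35.68, 139.69)]
    truth = [(48.86, 2.34), (41.00, -73.50), (34.69, 135.50)]
    print(round(median_error_km(preds, truth), 1))
```

The median is less sensitive than the mean to a few predictions that land on the wrong continent, which is one plausible reason to report it for geolocation.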

40. 【2502.14397】PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data

Link: https://arxiv.org/abs/2502.14397

Authors: Shijie Huang, Yiren Song, Yuxuan Zhang, Hailong Guo, Xueyin Wang, Mike Zheng Shou, Jiaming Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: overlay decorative elements, facilitate photo doodling, editing framework designed, photo doodling, framework designed

Abstract:We introduce PhotoDoodle, a novel image editing framework designed to facilitate photo doodling by enabling artists to overlay decorative elements onto photographs. Photo doodling is challenging because the inserted elements must appear seamlessly integrated with the background, requiring realistic blending, perspective alignment, and contextual coherence. Additionally, the background must be preserved without distortion, and the artist's unique style must be captured efficiently from limited training data. These requirements are not addressed by previous methods that primarily focus on global style transfer or regional inpainting. The proposed method, PhotoDoodle, employs a two-stage training strategy. Initially, we train a general-purpose image editing model, OmniEditor, using large-scale data. Subsequently, we fine-tune this model with EditLoRA using a small, artist-curated dataset of before-and-after image pairs to capture distinct editing styles and techniques. To enhance consistency in the generated results, we introduce a positional encoding reuse mechanism. Additionally, we release a PhotoDoodle dataset featuring six high-quality styles. Extensive experiments demonstrate the advanced performance and robustness of our method in customized image editing, opening new possibilities for artistic creation.

41. 【2502.14377】RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers

Link: https://arxiv.org/abs/2502.14377

Authors: Ke Cao, Jing Wang, Ao Ma, Jiasong Feng, Zhanjie Zhang, Xuanhua He, Shanyuan Liu, Bo Cheng, Dawei Leng, Yuhui Yin, Jie Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Transformer plays, Diffusion Transformer, diffusion transformer methods, controlled diffusion transformer, Efficient Controllable Generation

Comments: 15 pages, 9 figures

Abstract:The Diffusion Transformer plays a pivotal role in advancing text-to-image and text-to-video generation, owing primarily to its inherent scalability. However, existing controlled diffusion transformer methods incur significant parameter and computational overheads and suffer from inefficient resource allocation due to their failure to account for the varying relevance of control information across different transformer layers. To address this, we propose the Relevance-Guided Efficient Controllable Generation framework, RelaCtrl, enabling efficient and resource-optimized integration of control signals into the Diffusion Transformer. First, we evaluate the relevance of each layer in the Diffusion Transformer to the control information by assessing the "ControlNet Relevance Score"-i.e., the impact of skipping each control layer on both the quality of generation and the control effectiveness during inference. Based on the strength of the relevance, we then tailor the positioning, parameter scale, and modeling capacity of the control layers to reduce unnecessary parameters and redundant computations. Additionally, to further improve efficiency, we replace the self-attention and FFN in the commonly used copy block with the carefully designed Two-Dimensional Shuffle Mixer (TDSM), enabling efficient implementation of both the token mixer and channel mixer. Both qualitative and quantitative experimental results demonstrate that our approach achieves superior performance with only 15% of the parameters and computational complexity compared to PixArt-delta. More examples are available at this https URL.

42. 【2502.14376】A Similarity Paradigm Through Textual Regularization Without Forgetting

Link: https://arxiv.org/abs/2502.14376

Authors: Fangming Cui, Jan Fong, Rongfei Zeng, Xinmei Tian, Jun Yu

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: adapting pre-trained visual-language, pre-trained visual-language models, hand-crafted prompts, adapting pre-trained, pre-trained visual-language

Abstract:Prompt learning has emerged as a promising method for adapting pre-trained visual-language models (VLMs) to a range of downstream tasks. While optimizing the context can be effective for improving performance on specific tasks, it can often lead to poor generalization performance on unseen classes or datasets sampled from different distributions. It may be attributed to the fact that textual prompts tend to overfit downstream data distributions, leading to the forgetting of generalized knowledge derived from hand-crafted prompts. In this paper, we propose a novel method called Similarity Paradigm with Textual Regularization (SPTR) for prompt learning without forgetting. SPTR is a two-pronged design based on hand-crafted prompts that is an inseparable framework. 1) To avoid forgetting general textual knowledge, we introduce the optimal transport as a textual regularization to finely ensure approximation with hand-crafted features and tuning textual features. 2) In order to continuously unleash the general ability of multiple hand-crafted prompts, we propose a similarity paradigm for natural alignment score and adversarial alignment score to improve model robustness for generalization. Both modules share a common objective in addressing generalization issues, aiming to maximize the generalization capability derived from multiple hand-crafted prompts. Four representative tasks (i.e., non-generalization few-shot learning, base-to-novel generalization, cross-dataset generalization, domain generalization) across 11 datasets demonstrate that SPTR outperforms existing prompt learning methods.

43. 【2502.14373】CrossVTON: Mimicking the Logic Reasoning on Cross-category Virtual Try-on guided by Tri-zone Priors

Link: https://arxiv.org/abs/2502.14373

Authors: Donghao Luo, Yujie Liang, Xu Peng, Xiaobin Hu, Boyuan Jiang, Chengming Xu, Taisong Jin, Chengjie Wang, Yanwei Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: challenging task, remarkable progress, progress in image-based, remains a challenging, robust fitting images

Abstract:Despite remarkable progress in image-based virtual try-on systems, generating realistic and robust fitting images for cross-category virtual try-on remains a challenging task. The primary difficulty arises from the absence of human-like reasoning, which involves addressing size mismatches between garments and models while recognizing and leveraging the distinct functionalities of various regions within the model images. To address this issue, we draw inspiration from human cognitive processes and disentangle the complex reasoning required for cross-category try-on into a structured framework. This framework systematically decomposes the model image into three distinct regions: try-on, reconstruction, and imagination zones. Each zone plays a specific role in accommodating the garment and facilitating realistic synthesis. To endow the model with robust reasoning capabilities for cross-category scenarios, we propose an iterative data constructor. This constructor encompasses diverse scenarios, including intra-category try-on, any-to-dress transformations (replacing any garment category with a dress), and dress-to-any transformations (replacing a dress with another garment category). Utilizing the generated dataset, we introduce a tri-zone priors generator that intelligently predicts the try-on, reconstruction, and imagination zones by analyzing how the input garment is expected to align with the model image. Guided by these tri-zone priors, our proposed method, CrossVTON, achieves state-of-the-art performance, surpassing existing baselines in both qualitative and quantitative evaluations. Notably, it demonstrates superior capability in handling cross-category virtual try-on, meeting the complex demands of real-world applications.

44. 【2502.14370】PPO-MI: Efficient Black-Box Model Inversion via Proximal Policy Optimization

Link: https://arxiv.org/abs/2502.14370

Authors: Xinpeng Shou

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant privacy risk, reconstruct private training, private training data, inversion attacks pose, pose a significant

Comments: 6 pages, submitting to ICML 2025

Abstract:Model inversion attacks pose a significant privacy risk by attempting to reconstruct private training data from trained models. Most of the existing methods either depend on gradient estimation or require white-box access to model parameters, which limits their applicability in practical scenarios. In this paper, we propose PPO-MI, a novel reinforcement learning-based framework for black-box model inversion attacks. Our approach formulates the inversion task as a Markov Decision Process, where an agent navigates the latent space of a generative model to reconstruct private training samples using only model predictions. By employing Proximal Policy Optimization (PPO) with a momentum-based state transition mechanism, along with a reward function balancing prediction accuracy and exploration, PPO-MI ensures efficient latent space exploration and high query efficiency. Extensive experiments illustrate that PPO-MI outperforms existing methods while requiring less attack knowledge, and that it is robust across various model architectures and datasets. These results underline its effectiveness and generalizability in practical black-box scenarios, raising important considerations for the privacy vulnerabilities of deployed machine learning models.

45. 【2502.14360】Weed Detection using Convolutional Neural Network

Link: https://arxiv.org/abs/2502.14360

Authors: Santosh Kumar Tripathi, Shivendra Pratap Singh, Devansh Sharma, Harshavardhan U Patekar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: convolutional neural networks, weed detection, neural networks, agricultural land, convolutional neural

Abstract:In this paper we use convolutional neural networks (CNNs) for weed detection in agricultural land. We specifically investigate the application of two CNN layer types, Conv2d and dilated Conv2d, for weed detection in crop fields. The suggested method extracts features from the input photos using pre-trained models, which are subsequently adjusted for weed detection. The experiment used a sizable dataset of 15336 segments (3249 of soil, 7376 of soybean, 3520 of grass and 1191 of broadleaf weeds), and its findings show that the suggested approach can detect weeds with an accuracy of 94%. This study has significant ramifications for lowering the usage of toxic herbicides and increasing the effectiveness of weed management in agriculture.

46. 【2502.14355】Triply Laplacian Scale Mixture Modeling for Seismic Data Noise Suppression

Link: https://arxiv.org/abs/2502.14355

Authors: Sirui Pan(1), Zhiyuan Zha(1), Shigang Wang(1), Yue Li(1), Zipei Fan(2), Gang Yan(3), Binh T. Nguyen(4), Bihan Wen(5), Ce Zhu(6) ((1) College of Communication Engineering, Jilin University, (2) School of Artificial Intelligence, Jilin University, (3) College of Computer Science and Technology, Jilin University, (4) Department of Computer Science, Faculty of Mathematics and Computer Science, University of Science, Vietnam National University, (5) School of Electrical and Electronic Engineering, Nanyang Technological University, (6) Glasgow College, University of Electronic Science and Technology of China)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Sparsity-based tensor recovery, shown great potential, seismic data noise, seismic data, data noise suppression

Abstract:Sparsity-based tensor recovery methods have shown great potential in suppressing seismic data noise. These methods exploit tensor sparsity measures capturing the low-dimensional structures inherent in seismic data tensors to remove noise by applying sparsity constraints through soft-thresholding or hard-thresholding operators. However, in these methods, considering that real seismic data are non-stationary and affected by noise, the variances of tensor coefficients are unknown and may be difficult to accurately estimate from the degraded seismic data, leading to undesirable noise suppression performance. In this paper, we propose a novel triply Laplacian scale mixture (TLSM) approach for seismic data noise suppression, which significantly improves the estimation accuracy of both the sparse tensor coefficients and hidden scalar parameters. To make the optimization problem manageable, an alternating direction method of multipliers (ADMM) algorithm is employed to solve the proposed TLSM-based seismic data noise suppression problem. Extensive experimental results on synthetic and field seismic data demonstrate that the proposed TLSM algorithm outperforms many state-of-the-art seismic data noise suppression methods in both quantitative and qualitative evaluations while providing exceptional computational efficiency.
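
The soft-thresholding and hard-thresholding operators that this abstract attributes to earlier sparsity-based methods have simple elementwise forms. The sketch below shows those generic operators only; it is not the TLSM model or its ADMM solver, and the threshold `tau` and the sample coefficients are hypothetical.

```python
import math


def soft_threshold(x, tau):
    """Shrink x toward zero by tau; values with |x| <= tau become zero."""
    return math.copysign(max(abs(x) - tau, 0.0), x)


def hard_threshold(x, tau):
    """Keep x unchanged if |x| > tau, otherwise set it to zero."""
    return x if abs(x) > tau else 0.0


if __name__ == "__main__":
    # Hypothetical transform coefficients: a few large ones plus small noisy ones.
    coeffs = [3.0, -0.4, 0.2, -2.5, 0.05]
    tau = 0.5
    print([soft_threshold(c, tau) for c in coeffs])
    print([hard_threshold(c, tau) for c in coeffs])
```

Both operators need a threshold tied to the coefficient variance, which is the quantity the abstract says is hard to estimate from degraded seismic data.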

47. 【2502.14351】SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images

Link: https://arxiv.org/abs/2502.14351

Authors: Yichi Zhang, Le Xue, Wenbo Zhang, Lanlan Li, Yuchen Liu, Chen Jiang, Yuan Cheng, Yuan Qi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Positron Emission Tomography, Positron Emission, Emission Tomography, monitoring treatment progress, modern medical diagnostics

Abstract:Positron Emission Tomography (PET) imaging plays a crucial role in modern medical diagnostics by revealing the metabolic processes within a patient's body, which is essential for quantification of therapy response and monitoring treatment progress. However, the segmentation of PET images presents unique challenges due to their lower contrast and less distinct boundaries compared to other structural medical modalities. Recent developments in segmentation foundation models have shown superior versatility across diverse natural image segmentation tasks. Despite the efforts of medical adaptations, these works primarily focus on structural medical images with detailed physiological structural information and exhibit poor generalization ability when adapted to molecular PET imaging. In this paper, we collect and construct PETS-5k, the largest PET segmentation dataset to date, comprising 5,731 three-dimensional whole-body PET images and encompassing over 1.3M 2D images. Based on the established dataset, we develop SegAnyPET, a modality-specific 3D foundation model for universal promptable segmentation from PET images. To address the challenge of discrepant annotation quality of PET images, we adopt a cross prompting confident learning (CPCL) strategy with an uncertainty-guided self-rectification process to robustly learn segmentation from high-quality labeled data and low-quality noisy labeled data. Experimental results demonstrate that SegAnyPET can correctly segment seen and unseen targets using only one or a few prompt points, outperforming state-of-the-art foundation models and task-specific fully supervised models with higher accuracy and strong generalization ability for universal segmentation. As the first foundation model for PET images, we believe that SegAnyPET will advance the applications to various downstream tasks for molecular imaging.

48. 【2502.14344】Towards Accurate Binary Spiking Neural Networks: Learning with Adaptive Gradient Modulation Mechanism

Link: https://arxiv.org/abs/2502.14344

Authors: Yu Liang, Wenjie Wei, Ammar Belatreche, Honglin Cao, Zijian Zhou, Shuai Wang, Malu Zhang, Yang Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Spiking Neural Networks, Binary Spiking Neural, Neural Networks, Spiking Neural, Binary Spiking

Comments: 9 pages, 8 figures, AAAI conference

Abstract:Binary Spiking Neural Networks (BSNNs) inherit the event-driven paradigm of SNNs, while also adopting the reduced storage burden of binarization techniques. These distinct advantages grant BSNNs lightweight and energy-efficient characteristics, rendering them ideal for deployment on resource-constrained edge devices. However, due to the binary synaptic weights and non-differentiable spike function, effectively training BSNNs remains an open question. In this paper, we conduct an in-depth analysis of the challenge for BSNN learning, namely the frequent weight sign flipping problem. To mitigate this issue, we propose an Adaptive Gradient Modulation Mechanism (AGMM), which is designed to reduce the frequency of weight sign flipping by adaptively adjusting the gradients during the learning process. The proposed AGMM can enable BSNNs to achieve faster convergence speed and higher accuracy, effectively narrowing the gap between BSNNs and their full-precision equivalents. We validate AGMM on both static and neuromorphic datasets, and results indicate that it achieves state-of-the-art results among BSNNs. This work substantially reduces storage demands and enhances SNNs' inherent energy efficiency, making them highly feasible for resource-constrained environments.
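
The weight sign flipping problem named in this abstract can be made concrete with a toy example. The sketch assumes the common setup in binarized networks where real-valued latent weights are binarized by their sign; the abstract does not spell out this detail, so treat it as an assumption. The uniform gradient scaling used in the demo is only a placeholder to show that smaller effective gradients produce fewer flips. It is not the AGMM rule, and all names and values are hypothetical.

```python
def binarize(weights):
    """Map latent real-valued weights to binary weights in {-1, +1} by sign."""
    return [1 if w >= 0 else -1 for w in weights]


def sgd_step_with_flip_count(latent, grads, lr, grad_scale=1.0):
    """Apply one SGD step to the latent weights and count binary sign flips.

    grad_scale is a placeholder knob for illustration, not the paper's mechanism.
    """
    updated = [w - lr * grad_scale * g for w, g in zip(latent, grads)]
    before, after = binarize(latent), binarize(updated)
    flips = sum(1 for b, a in zip(before, after) if b != a)
    return updated, flips


if __name__ == "__main__":
    # Latent weights close to zero flip easily under a plain gradient step.
    latent = [0.01, -0.5, 0.3]
    grads = [1.0, 1.0, -1.0]
    print(sgd_step_with_flip_count(latent, grads, lr=0.1)[1])
    print(sgd_step_with_flip_count(latent, grads, lr=0.1, grad_scale=0.05)[1])
```

Only the weight near zero changes sign here, which suggests why a mechanism that adjusts gradients per weight could stabilize training.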

49. 【2502.14332】A Collaborative Jade Recognition System for Mobile Devices Based on Lightweight and Large Models

Link: https://arxiv.org/abs/2502.14332

Authors: Zhenyu Wang, Wenjia Li, Pengyu Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: vision-based recognition applications, topic in research, widespread adoption, hot topic, mobile devices

Abstract:With the widespread adoption and development of mobile devices, vision-based recognition applications have become a hot topic in research. Jade, as an important cultural heritage and artistic item, has significant applications in fields such as jewelry identification and cultural relic preservation. However, existing jade recognition systems still face challenges in mobile implementation, such as limited computing resources, real-time requirements, and accuracy issues. To address these challenges, this paper proposes a jade recognition system based on size model collaboration, aiming to achieve efficient and accurate jade identification using mobile devices such as this http URL, we design a size model based on multi-scale image processing, extracting key visual information by analyzing jade's dimensions, shapes, and surface textures. Then, a collaborative multi-model classification framework is built by combining deep learning and traditional computer vision algorithms. This framework can effectively select and adjust models based on different jade characteristics, providing high accuracy results across various environments and this http URL results show that the proposed system can provide high recognition accuracy and fast processing time on mobile devices, while consuming relatively low computational resources. The system not only holds great application potential but also provides new ideas and technical support for the intelligent development of jade identification.

50. 【2502.14316】Textured 3D Regenerative Morphing with 3D Diffusion Prior

Link: https://arxiv.org/abs/2502.14316

Authors: Songlin Yang, Yushi Lan, Honghua Chen, Xingang Pan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: focusing on transitions, shape and texture, morphing creates smooth, morphing, plausible interpolation sequences

Abstract:Textured 3D morphing creates smooth and plausible interpolation sequences between two 3D objects, focusing on transitions in both shape and texture. This is important for creative applications like visual effects in filmmaking. Previous methods rely on establishing point-to-point correspondences and determining smooth deformation trajectories, which inherently restrict them to shape-only morphing on untextured, topologically aligned datasets. This restriction leads to labor-intensive preprocessing and poor generalization. To overcome these challenges, we propose a method for 3D regenerative morphing using a 3D diffusion prior. Unlike previous methods that depend on explicit correspondences and deformations, our method eliminates the additional need for obtaining correspondence and uses the 3D diffusion prior to generate morphing. Specifically, we introduce a 3D diffusion model and interpolate the source and target information at three levels: initial noise, model parameters, and condition features. We then explore an Attention Fusion strategy to generate more smooth morphing sequences. To further improve the plausibility of semantic interpolation and the generated 3D surfaces, we propose two strategies: (a) Token Reordering, where we match approximate tokens based on semantic analysis to guide implicit correspondences in the denoising process of the diffusion model, and (b) Low-Frequency Enhancement, where we enhance low-frequency signals in the tokens to improve the quality of generated surfaces. Experimental results show that our method achieves superior smoothness and plausibility in 3D morphing across diverse cross-category object pairs, offering a novel regenerative method for 3D morphing with textured representations.
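
The abstract says source and target information is interpolated at three levels (initial noise, model parameters, condition features) but does not state which interpolation rule is used at each level. The sketch below shows two generic options as an assumption: linear interpolation, and spherical interpolation, which is often preferred for Gaussian noise because it roughly preserves the vector norm. The function names are illustrative and the code assumes non-zero input vectors.

```python
import math


def lerp(a, b, t):
    """Linear interpolation between two equal-length vectors."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # Vectors are (anti-)parallel: fall back to linear interpolation.
        return lerp(a, b, t)
    weight_a = math.sin((1.0 - t) * omega) / math.sin(omega)
    weight_b = math.sin(t * omega) / math.sin(omega)
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


if __name__ == "__main__":
    source_noise = [1.0, 0.0]
    target_noise = [0.0, 1.0]
    # Five morphing steps from source to target.
    for step in range(5):
        print(slerp(source_noise, target_noise, step / 4))
```

A morphing sequence would evaluate such an interpolation at evenly spaced values of `t` and run the generative model once per step.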

51. 【2502.14314】ODVerse33: Is the New YOLO Version Always Better? A Multi Domain benchmark from YOLO v5 to v11

Link: https://arxiv.org/abs/2502.14314

Authors: Tianyou Jiang, Yang Zhong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: building real-time object, YOLO, YOLO versions, building real-time, Abstract

Comments: 18 pages, 4 figures, 7 tables

Abstract:You Only Look Once (YOLO) models have been widely used for building real-time object detectors across various domains. With the increasing frequency of new YOLO versions being released, key questions arise. Are the newer versions always better than their previous versions? What are the core innovations in each YOLO version and how do these changes translate into real-world performance gains? In this paper, we summarize the key innovations from YOLOv1 to YOLOv11, introduce a comprehensive benchmark called ODverse33, which includes 33 datasets spanning 11 diverse domains (Autonomous driving, Agricultural, Underwater, Medical, Videogame, Industrial, Aerial, Wildlife, Retail, Microscopic, and Security), and explore the practical impact of model improvements in real-world, multi-domain applications through extensive experimental results. We hope this study can provide some guidance to the extensive users of object detection models and give some references for future real-time object detector development.

52. 【2502.14282】PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC

Link: https://arxiv.org/abs/2502.14282

Authors: Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, Fei Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: MLLM-based GUI agents, MLLM-based GUI, complex interactive environment, compared to smartphones, interactive environment

Comments: 14 pages, 7 figures

Abstract:In the field of MLLM-based GUI agents, compared to smartphones, the PC scenario not only features a more complex interactive environment, but also involves more intricate intra- and inter-app workflows. To address these issues, we propose a hierarchical agent framework named PC-Agent. Specifically, from the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture that decomposes decision-making processes into Instruction-Subtask-Action levels. Within this architecture, three agents (i.e., Manager, Progress and Decision) are set up for instruction decomposition, progress tracking and step-by-step decision-making respectively. Additionally, a Reflection agent is adopted to enable timely bottom-up error feedback and adjustment. We also introduce a new benchmark PC-Eval with 25 real-world complex instructions. Empirical results on PC-Eval show that our PC-Agent achieves a 32% absolute improvement of task success rate over previous state-of-the-art methods. The code will be publicly available.

53. 【2502.14279】OrchardDepth: Precise Metric Depth Estimation of Orchard Scene from Monocular Camera Images

Link: https://arxiv.org/abs/2502.14279

Authors: Zhichao Zheng,Henry Williams,Bruce A MacDonald

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Monocular depth estimation, robotic perception, rudimentary task, task in robotic, Monocular depth

Comments: 10 pages, 5 figures, Australasian Conference on Robotics and Automation, ACRA, 2024

Click to view abstract

Abstract:Monocular depth estimation is a rudimentary task in robotic perception. Recently, with the development of more accurate and robust neural network models and different types of datasets, monocular depth estimation has improved significantly in performance and efficiency. However, most of the research in this area focuses on a narrow set of domains. In particular, most outdoor benchmarks cover urban environments and target autonomous driving, and these benchmarks differ greatly from the orchard/vineyard environment, so they are of little help for research in the primary industry. Therefore, we propose OrchardDepth, which fills the gap in monocular metric depth estimation for the orchard/vineyard environment. In addition, we present a new retraining method that improves the training result by monitoring the consistent regularization between dense depth maps and sparse points. Our method reduces the RMSE of depth estimation in the orchard environment from 1.5337 to 0.6738, demonstrating its validity.
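The RMSE values quoted in this abstract can be read against the usual definition of the metric. The snippet below is a minimal sketch of that definition, not the authors' evaluation code; the function name `depth_rmse` and the convention that a ground-truth depth of 0 marks a missing measurement are assumptions made for illustration.

```python
import math


def depth_rmse(pred, target):
    """Root-mean-square error over pixels with a valid ground-truth depth.

    Assumption: a ground-truth value <= 0 means "no measurement" and is skipped,
    which is a common convention for sparse depth labels.
    """
    sq_errors = [(p - t) ** 2 for p, t in zip(pred, target) if t > 0]
    if not sq_errors:
        raise ValueError("no valid ground-truth depth values")
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Toy flattened depth maps; the third pixel has no ground truth and is ignored.
pred = [2.0, 3.5, 5.0, 4.0]
target = [2.0, 3.0, 0.0, 5.0]
print(depth_rmse(pred, target))
```

Masking out invalid pixels matters when the labels are sparse points, because averaging over unlabeled pixels would otherwise distort the error.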

54. 【2502.14273】LLM-EvRep: Learning an LLM-Compatible Event Representation Using a Self-Supervised Framework

Link: https://arxiv.org/abs/2502.14273

Authors: Zongyou Yu,Qiang Qu,Qian Zhang,Nan Zhang,Xiaoming Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: demonstrated significant promise, existing approaches rely, event-driven visual content, Recent advancements, significant promise

Comments: 6 pages, 2 figures, Companion Proceedings of the ACM Web Conference 2025 (WWW Companion '25)

Click to view abstract

Abstract:Recent advancements in event-based recognition have demonstrated significant promise, yet most existing approaches rely on extensive training, limiting their adaptability for efficient processing of event-driven visual content. Meanwhile, large language models (LLMs) have exhibited remarkable zero-shot capabilities across diverse domains, but their application to event-based visual recognition remains largely unexplored. To bridge this gap, we propose LLM-EvGen, an event representation generator that produces LLM-compatible event representations, LLM-EvRep, thereby enhancing the performance of LLMs on event recognition tasks. The generator is trained using a self-supervised framework, aligning the generated representations with semantic consistency and structural fidelity. Comprehensive experiments were conducted on three datasets: N-ImageNet, N-Caltech101, and N-MNIST. The results demonstrate that our method, LLM-EvRep, outperforms the event-to-video method, E2VID, by 15.93%, 0.82%, and 50.21%, respectively, in recognition tasks when evaluated using GPT-4o.

55. 【2502.14267】Money Recognition for the Visually Impaired: A Case Study on Sri Lankan Banknotes

Link: https://arxiv.org/abs/2502.14267

Authors: Akshaan Bandara

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Sri Lankan currency, Lankan currency notes, financial transactions, Sri Lankan, security in financial

Comments:

Click to view abstract

Abstract:Currency note recognition is a critical accessibility need for blind individuals, as identifying banknotes accurately can impact their independence and security in financial transactions. Several traditional and technological initiatives have been taken to date. Nevertheless, these approaches are less user-friendly and have made it more challenging for blind people to identify banknotes. This research proposes a user-friendly stand-alone system for the identification of Sri Lankan currency notes. A custom-created dataset of images of Sri Lankan currency notes was used to fine-tune an EfficientDet model. The currency note recognition model achieved 0.9847 AP on the validation dataset and performs exceptionally well in real-world scenarios. The high accuracy and the intuitive interface have enabled blind individuals to quickly and accurately identify currency denominations, ultimately encouraging accessibility and independence.

56. 【2502.14247】Pandora3D: A Comprehensive Framework for High-Quality 3D Shape and Texture Generation

Link: https://arxiv.org/abs/2502.14247

Authors: Jiayu Yang,Taizhang Shang,Weixuan Sun,Xibin Song,Ziang Chen,Senbo Wang,Shenzhou Chen,Weizhe Liu,Hongdong Li,Pan Ji

Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: including single images, including single, text descriptions, presents a comprehensive, shape generation

Comments: Tencent XR 3D Gen

Click to view abstract

Abstract:This report presents a comprehensive framework for generating high-quality 3D shapes and textures from diverse input prompts, including single images, multi-view images, and text descriptions. The framework consists of 3D shape generation and texture generation. (1) The 3D shape generation pipeline employs a Variational Autoencoder (VAE) to encode implicit 3D geometries into a latent space and a diffusion network to generate latents conditioned on input prompts, with modifications to enhance model capacity. An alternative Artist-Created Mesh (AM) generation approach is also explored, yielding promising results for simpler geometries. (2) Texture generation involves a multi-stage process starting with frontal image generation, followed by multi-view image generation, RGB-to-PBR texture conversion, and high-resolution multi-view texture refinement. A consistency scheduler is plugged into every stage to enforce pixel-wise consistency among multi-view textures during inference, ensuring seamless integration. The pipeline demonstrates effective handling of diverse input formats, leveraging advanced neural architectures and novel methodologies to produce high-quality 3D content. This report details the system architecture, experimental results, and potential future directions to improve and expand the framework. The source code and pretrained weights are released at: this https URL.

57. 【2502.14235】OG-Gaussian: Occupancy Based Street Gaussians for Autonomous Driving

Link: https://arxiv.org/abs/2502.14235

Authors: Yedong Shen,Xinran Zhang,Yifan Duan,Shiqi Zhang,Heng Li,Yilong Wu,Jianmin Ji,Yanyong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: driving simulation environments, Accurate and realistic, autonomous driving simulation, simulation environments, enables the lifelike

Comments:

Click to view abstract

Abstract:Accurate and realistic 3D scene reconstruction enables the lifelike creation of autonomous driving simulation environments. With advancements in 3D Gaussian Splatting (3DGS), previous studies have applied it to reconstruct complex dynamic driving scenes. These methods typically require expensive LiDAR sensors and pre-annotated datasets of dynamic objects. To address these challenges, we propose OG-Gaussian, a novel approach that replaces LiDAR point clouds with Occupancy Grids (OGs) generated from surround-view camera images using an Occupancy Prediction Network (ONet). Our method leverages the semantic information in OGs to separate dynamic vehicles from the static street background, converting these grids into two distinct sets of initial point clouds for reconstructing both static and dynamic objects. Additionally, we estimate the trajectories and poses of dynamic objects through a learning-based approach, eliminating the need for complex manual annotations. Experiments on the Waymo Open dataset demonstrate that OG-Gaussian is on par with the current state-of-the-art in terms of reconstruction quality and rendering speed, achieving an average PSNR of 35.13 and a rendering speed of 143 FPS, while significantly reducing computational costs and economic overhead.
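For readers unfamiliar with the PSNR figure quoted in this abstract, the following is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). It is not taken from the paper; the function name `psnr` and the assumption that pixel values lie in [0, 1] are illustrative choices.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images.

    Assumption: both inputs are equal-length sequences of pixel intensities
    in the range [0, max_value].
    """
    if len(rendered) != len(reference):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy two-pixel "images" that differ by 0.1 at every pixel.
print(psnr([0.5, 0.5], [0.6, 0.4]))
```

Because the scale is logarithmic, each additional 10 dB corresponds to a tenfold reduction in mean squared error.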

58. 【2502.14226】Designing Parameter and Compute Efficient Diffusion Transformers using Distillation

Link: https://arxiv.org/abs/2502.14226

Authors: Vignesh Sundaresha

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Stable-Diffusion and SORA, Apple Vision Pro, video generation models, Diffusion Transformers, model parameters form

Comments: 4 pages

Click to view abstract

Abstract:Diffusion Transformers (DiTs) with billions of model parameters form the backbone of popular image and video generation models like DALL-E, Stable-Diffusion and SORA. Though these models are necessary in many low-latency applications like Augmented/Virtual Reality, they cannot be deployed on resource-constrained Edge devices (like Apple Vision Pro or Meta Ray-Ban glasses) due to their huge computational complexity. To overcome this, we turn to knowledge distillation and perform a thorough design-space exploration to achieve the best DiT for a given parameter size. In particular, we provide principles for how to choose design knobs such as depth, width, attention heads and distillation setup for a DiT. During the process, a three-way trade-off emerges between model performance, size and speed that is crucial for Edge implementation of diffusion. We also propose two distillation approaches, the Teaching Assistant (TA) method and the Multi-In-One (MI1) method, to perform feature distillation in the DiT context. Unlike existing solutions, we demonstrate and benchmark the efficacy of our approaches on practical Edge devices such as NVIDIA Jetson Orin Nano.

59. 【2502.14221】H3DE-Net: Efficient and Accurate 3D Landmark Detection in Medical Imaging

Link: https://arxiv.org/abs/2502.14221

Authors: Zhen Huang,Ronghao Xu,Xiaoqian Zhou,Yangbo Wei,Suhua Wang,Xiaoxin Sun,Han Li,Qingsong Yao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical imaging tasks, subsequent medical imaging, accurately detecting anatomical, critical task, imaging tasks

Comments:

Click to view abstract

Abstract:3D landmark detection is a critical task in medical image analysis, and accurately detecting anatomical landmarks is essential for subsequent medical imaging tasks. However, mainstream deep learning methods in this field struggle to simultaneously capture fine-grained local features and model global spatial relationships, while maintaining a balance between accuracy and computational efficiency. Local feature extraction requires capturing fine-grained anatomical details, while global modeling requires understanding the spatial relationships within complex anatomical structures. The high-dimensional nature of 3D volumes further exacerbates these challenges, as landmarks are sparsely distributed, leading to significant computational costs. Therefore, achieving efficient and precise 3D landmark detection remains a pressing challenge in medical image analysis. In this work, we propose a Hybrid 3D DEtection Net (H3DE-Net), a novel framework that combines CNNs for local feature extraction with a lightweight attention mechanism designed to efficiently capture global dependencies in 3D volumetric data. This mechanism employs a hierarchical routing strategy to reduce computational cost while maintaining global context modeling. To our knowledge, H3DE-Net is the first 3D landmark detection model that integrates such a lightweight attention mechanism with CNNs. Additionally, integrating multi-scale feature fusion further enhances detection accuracy and robustness. Experimental results on a public CT dataset demonstrate that H3DE-Net achieves state-of-the-art (SOTA) performance, significantly improving accuracy and robustness, particularly in scenarios with missing landmarks or complex anatomical variations. We have already open-sourced our project, including code, data and model weights.

60. 【2502.14214】Asymmetric Co-Training for Source-Free Few-Shot Domain Adaptation

Link: https://arxiv.org/abs/2502.14214

Authors: Gengxu Li,Yuan Wu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: unsupervised domain adaptation, Source-free unsupervised domain, traditional unsupervised domain, gained significant attention, Source-free unsupervised

Comments: 13 pages

Click to view abstract

Abstract:Source-free unsupervised domain adaptation (SFUDA) has gained significant attention as an alternative to traditional unsupervised domain adaptation (UDA), which relies on the constant availability of labeled source data. However, SFUDA approaches come with inherent limitations that are frequently overlooked. These challenges include performance degradation when the unlabeled target data fails to meet critical assumptions, such as having a closed-set label distribution identical to that of the source domain, or when sufficient unlabeled target data is unavailable, a common situation in real-world applications. To address these issues, we propose an asymmetric co-training (ACT) method specifically designed for the source-free few-shot domain adaptation (SFFSDA) scenario. SFFSDA presents a more practical alternative to SFUDA, as gathering a few labeled target instances is more feasible than acquiring large volumes of unlabeled target data in many real-world contexts. Our ACT method begins by employing a weak-strong augmentation to enhance data diversity. Then we use a two-step optimization process to train the target model. In the first step, we optimize the label smoothing cross-entropy loss, the entropy of the class-conditional distribution, and the reverse-entropy loss to bolster the model's discriminative ability while mitigating overfitting. The second step focuses on reducing redundancy in the output space by minimizing classifier determinacy disparity. Extensive experiments across four benchmarks demonstrate the superiority of our ACT approach, which outperforms state-of-the-art SFUDA methods and transfer learning techniques. Our findings suggest that adapting a source pre-trained model using only a small amount of labeled target data offers a practical and dependable solution. The code is available at this https URL.
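The label smoothing cross-entropy loss mentioned in this abstract can be illustrated with a short sketch. This is a generic textbook formulation, not the authors' implementation: the function name `label_smoothing_ce` and the choice of spreading the smoothing mass `eps` uniformly over all K classes (rather than over the K-1 non-target classes) are assumptions.

```python
import math


def label_smoothing_ce(logits, target, eps=0.1):
    """Cross-entropy against a smoothed one-hot target for a single example.

    Smoothed target (assumed variant): (1 - eps) on the true class plus
    eps / K spread uniformly over all K classes.
    """
    k = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    loss = 0.0
    for i, lp in enumerate(log_probs):
        weight = eps / k + ((1.0 - eps) if i == target else 0.0)
        loss -= weight * lp
    return loss


logits = [2.0, 0.5, -1.0]
print(label_smoothing_ce(logits, target=0, eps=0.0))  # plain cross-entropy
print(label_smoothing_ce(logits, target=0, eps=0.1))  # smoothed variant
```

With `eps=0` the function reduces to ordinary cross-entropy. A positive `eps` penalizes overconfident predictions, which is one common way to mitigate overfitting when only a few labeled examples are available.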

61. 【2502.14209】Spatial and Frequency Domain Adaptive Fusion Network for Image Deblurring

Link: https://arxiv.org/abs/2502.14209

Authors: Hu Gao,Depeng Dang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Image deblurring aims, latent sharp image, Image deblurring, sharp image, domain

Comments:

Click to view abstract

Abstract:Image deblurring aims to reconstruct a latent sharp image from its corresponding blurred one. Although existing methods have achieved good performance, most of them operate exclusively in either the spatial domain or the frequency domain, rarely exploring solutions that fuse both domains. In this paper, we propose a spatial-frequency domain adaptive fusion network (SFAFNet) to address this limitation. Specifically, we design a gated spatial-frequency domain feature fusion block (GSFFBlock), which consists of three key components: a spatial domain information module, a frequency domain information dynamic generation module (FDGM), and a gated fusion module (GFM). The spatial domain information module employs the NAFBlock to integrate local information. Meanwhile, in the FDGM, we design a learnable low-pass filter that dynamically decomposes features into separate frequency subbands, capturing the image-wide receptive field and enabling the adaptive exploration of global contextual information. Additionally, to facilitate information flow and the learning of complementary representations, in the GFM we present a gating mechanism (GATE) to re-weight spatial and frequency domain features, which are then fused through the cross-attention mechanism (CAM). Experimental results demonstrate that our SFAFNet performs favorably compared to state-of-the-art approaches on commonly used benchmarks.

62. 【2502.14195】Bridging Text and Vision: A Multi-View Text-Vision Registration Approach for Cross-Modal Place Recognition

Link: https://arxiv.org/abs/2502.14195

Authors: Tianyi Shang,Zhenyu Li,Pengjie Xu,Jinwei Qiao,Gang Chen,Zihan Ruan,Weijun Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Mobile robots necessitate, robots necessitate advanced, necessitate advanced natural, accurately identify locations, Mobile robots

Comments: 8 pages, 4 figures, conference

Click to view abstract

Abstract:Mobile robots necessitate advanced natural language understanding capabilities to accurately identify locations and perform tasks such as package delivery. However, traditional visual place recognition (VPR) methods rely solely on single-view visual information and cannot interpret human language descriptions. To overcome this challenge, we bridge text and vision by proposing a multiview (360° views of the surroundings) text-vision registration approach called Text4VPR for the place recognition task, which is the first method that exclusively utilizes textual descriptions to match a database of images. Text4VPR employs the frozen T5 language model to extract global textual embeddings. Additionally, it utilizes the Sinkhorn algorithm with a temperature coefficient to assign local tokens to their respective clusters, thereby aggregating visual descriptors from images. During the training stage, Text4VPR emphasizes the alignment between individual text-image pairs for precise textual description. In the inference stage, Text4VPR uses the Cascaded Cross-Attention Cosine Alignment (CCCA) to address the internal mismatch between text and image groups. Subsequently, Text4VPR performs precise place matching based on the descriptions of text-image groups. On Street360Loc, the first text-to-image VPR dataset we created, Text4VPR builds a robust baseline, achieving a leading top-1 accuracy of 57% and a leading top-10 accuracy of 92% within a 5-meter radius on the test set, which indicates that localization from textual descriptions to images is not only feasible but also holds significant potential for further advancement, as shown in Figure 1.
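The step in this abstract that assigns local tokens to clusters with the Sinkhorn algorithm and a temperature coefficient can be sketched as follows. This is a generic Sinkhorn normalization under stated assumptions, not Text4VPR's code: the function name `sinkhorn_assign`, the uniform marginals (each token carries mass 1, each cluster receives N/K), and the toy score matrix are all hypothetical.

```python
import math


def sinkhorn_assign(scores, temperature=0.1, n_iters=50):
    """Soft-assign N tokens to K clusters from an N x K score matrix.

    Scores are divided by the temperature (lower = sharper assignments), then
    columns and rows are alternately rescaled so that every cluster gets
    roughly N / K total mass and every token distributes a total mass of 1.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for stability
    p = [[math.exp((s - top) / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        col_sums = [sum(p[i][j] for i in range(n)) for j in range(k)]
        p = [[p[i][j] * (n / k) / col_sums[j] for j in range(k)] for i in range(n)]
        p = [[v / sum(row) for v in row] for row in p]
    return p


# Toy similarities between 4 local tokens and 2 cluster centres.
scores = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7], [0.1, 0.9]]
assignment = sinkhorn_assign(scores)
hard = [row.index(max(row)) for row in assignment]
print(hard)
```

Compared with a plain softmax over clusters, the column rescaling discourages all tokens from collapsing onto a single cluster, which is the usual motivation for using Sinkhorn in this kind of aggregation.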

63. 【2502.14191】Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models

Link: https://arxiv.org/abs/2502.14191

Authors: Michihiro Yasunaga,Luke Zettlemoyer,Marjan Ghazvininejad

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: assessing output quality, training vision-language models, Reward models play, play an essential, essential role

Comments: Dataset available at [this https URL](https://github.com/facebookresearch/multimodal_rewardbench)

Click to view abstract

Abstract:Reward models play an essential role in training vision-language models (VLMs) by assessing output quality to enable aligning with human preferences. Despite their importance, the research community lacks comprehensive open benchmarks for evaluating multimodal reward models in VLMs. To address this gap, we introduce Multimodal RewardBench, an expert-annotated benchmark covering six domains: general correctness, preference, knowledge, reasoning, safety, and visual question-answering. Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various VLMs. In evaluating a range of VLM judges, we find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy. Notably, most models struggle in the reasoning and safety domains. These findings suggest that Multimodal RewardBench offers a challenging testbed for advancing reward model development across multiple domains. We release the benchmark at this https URL.

64. 【2502.14190】Stereo Image Coding for Machines with Joint Visual Feature Compression

Link: https://arxiv.org/abs/2502.14190

Authors: Dengchao Jin,Jianjun Lei,Bo Peng,Zhaoqing Pan,Nam Ling,Qingming Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: achieved great success, stereo image fields, stereo image coding, image coding, stereo image

Comments:

Click to view abstract

Abstract:2D image coding for machines (ICM) has achieved great success in coding efficiency, while less effort has been devoted to stereo image fields. To promote the efficiency of stereo image compression (SIC) and intelligent analysis, stereo image coding for machines (SICM) is formulated and explored in this paper. More specifically, a machine vision-oriented stereo feature compression network (MVSFC-Net) is proposed for SICM, where the stereo visual features are effectively extracted, compressed, and transmitted for 3D visual tasks. To efficiently compress stereo visual features in MVSFC-Net, a stereo multi-scale feature compression (SMFC) module is designed to gradually transform sparse stereo multi-scale features into compact joint visual representations by removing spatial, inter-view, and cross-scale redundancies simultaneously. Experimental results show that the proposed MVSFC-Net obtains superior compression efficiency as well as 3D visual task performance when compared with the existing ICM anchors recommended by MPEG and the state-of-the-art SIC method.

65. 【2502.14184】Bayesian SegNet for Semantic Segmentation with Improved Interpretation of Microstructural Evolution During Irradiation of Materials

Link: https://arxiv.org/abs/2502.14184

Authors: Marjolein Oostrom,Alex Hagen,Nicole LaHaye,Karl Pazdernik

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: tritium-producing burnable absorber, burnable absorber rod, Understanding the relationship, Deep Convolutional Neural, absorber rod performance

Comments:

Click to view abstract

Abstract:Understanding the relationship between the evolution of microstructures of irradiated LiAlO2 pellets and tritium diffusion, retention and release could improve predictions of tritium-producing burnable absorber rod performance. Given expert-labeled segmented images of irradiated and unirradiated pellets, we trained Deep Convolutional Neural Networks to segment images into defect, grain, and boundary classes. Qualitative microstructural information was calculated from these segmented images to facilitate the comparison of unirradiated and irradiated pellets. We tested modifications to improve the sensitivity of the model, including incorporating meta-data into the model and utilizing uncertainty quantification. The predicted segmentation was similar to the expert-labeled segmentation for most methods of microstructural qualification, including pixel proportion, defect area, and defect density. Overall, the high performance metrics of the best models for both irradiated and unirradiated images show that utilizing neural network models is a viable alternative to expert-labeled images.

66. 【2502.14178】NeRF-3DTalker: Neural Radiance Field with 3D Prior Aided Audio Disentanglement for Talking Head Synthesis

Link: https://arxiv.org/abs/2502.14178

Authors: Xiaoxing Liu,Zhilei Liu,Chongke Bi

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: Talking head synthesis, Talking head, lip-synchronized talking head, Prior Aided Audio, Aided Audio Disentanglement

Comments: Accepted by ICASSP 2025

Click to view abstract

Abstract:Talking head synthesis aims to synthesize a lip-synchronized talking head video from audio. Recently, the capability of NeRF to enhance the realism and texture details of synthesized talking heads has attracted the attention of researchers. However, most current audio-based NeRF methods are exclusively concerned with the rendering of frontal faces. These methods are unable to generate clear talking heads in novel views. Another prevalent challenge in current 3D talking head synthesis is the difficulty in aligning acoustic and visual spaces, which often results in suboptimal lip-syncing of the generated talking heads. To address these issues, we propose Neural Radiance Field with 3D Prior Aided Audio Disentanglement for Talking Head Synthesis (NeRF-3DTalker). Specifically, the proposed method employs 3D prior information to synthesize clear talking heads with free views. Additionally, we propose a 3D Prior Aided Audio Disentanglement module, which is designed to disentangle the audio into two distinct categories: features related to 3D-aware speech movements and features related to speaking style. Moreover, to reposition the generated frames that are distant from the speaker's motion space in the real space, we have devised a local-global Standardized Space. This method normalizes the irregular positions in the generated frames from both global and local semantic perspectives. Comprehensive qualitative and quantitative experiments demonstrate that our NeRF-3DTalker outperforms state-of-the-art methods in synthesizing realistic talking head videos, exhibiting superior image quality and lip synchronization. Project page: this https URL.

67. 【2502.14168】Deep learning based infrared small object segmentation: Challenges and future directions

Link: https://arxiv.org/abs/2502.14168

Authors: Zhengeng Yang,Hongshan Yu,Jianjun Zhang,Qiang Tang,Ajmal Mian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: supporting unmanned systems, Infrared, unmanned systems, vehicles and drones, supporting unmanned

Comments: This is a submitted version of a paper accepted by Information Fusion. If you want a better reading experience, please refer to the final published version of Information Fusion

Click to view abstract

Abstract:Infrared sensing is a core method for supporting unmanned systems, such as autonomous vehicles and drones. Recently, infrared sensors have been widely deployed on mobile and stationary platforms for detection and classification of objects from long distances and across wide fields of view. Given its success in the vision image analysis domain, deep learning has also been applied for object recognition in infrared images. However, techniques that have proven successful in visible light perception face new challenges in the infrared domain. These challenges include extremely low signal-to-noise ratios in infrared images, very small and blurred objects of interest, and limited availability of labeled/unlabeled training data due to the specialized nature of infrared sensors. Numerous methods have been proposed in the literature for the detection and classification of small objects in infrared images, achieving varied levels of success. There is a need for a survey paper that critically analyzes existing techniques in this domain, identifies unsolved challenges and provides future research directions. This paper fills the gap and offers a concise and insightful review of deep learning-based methods. It also identifies the challenges faced by existing infrared object segmentation methods, provides a structured review of existing infrared perception methods from the perspective of these challenges, and highlights the motivations behind the various approaches. Finally, this review suggests promising future directions based on recent advancements within this domain.

68. 【2502.14156】Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration

Link: https://arxiv.org/abs/2502.14156

Authors: Katie Z Luo,Minh-Quan Dao,Zhenzhen Liu,Mark Campbell,Wei-Lun Chao,Kilian Q. Weinberger,Ezio Malis,Vincent Fremont,Bharath Hariharan,Mao Shan,Stewart Worrall,Julie Stephany Berrio Perez

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single-vehicle perception systems, promising solution, limitations of single-vehicle, Mixed Signals, solution to address

Comments:

Click to view abstract

Abstract:Vehicle-to-everything (V2X) collaborative perception has emerged as a promising solution to address the limitations of single-vehicle perception systems. However, existing V2X datasets are limited in scope, diversity, and quality. To address these gaps, we present Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes collected from three connected autonomous vehicles (CAVs) equipped with two different types of LiDAR sensors, plus a roadside unit with dual LiDARs. Our dataset provides precisely aligned point clouds and bounding box annotations across 10 classes, ensuring reliable data for perception training. We provide detailed statistical analysis on the quality of our dataset and extensively benchmark existing V2X methods on it. The Mixed Signals V2X Dataset is one of the highest-quality, large-scale datasets publicly available for V2X perception research. Details are available on the website: this https URL.

69. 【2502.14149】PitVQA++: Vector Matrix-Low-Rank Adaptation for Open-Ended Visual Question Answering in Pituitary Surgery

Link: https://arxiv.org/abs/2502.14149

Authors: Runlong He,Danyal Z. Khan,Evangelos B. Mazomenos,Hani J. Marcus,Danail Stoyanov,Matthew J. Clarkson,Mobarakol Islam

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: visual question answering, enhance intra-operative decision-making, advancing surgical education, Vision-Language Models, promote intuitive interactions

Comments: 9 pages

Click to view abstract

Abstract:Vision-Language Models (VLMs) in visual question answering (VQA) offer a unique opportunity to enhance intra-operative decision-making, promote intuitive interactions, and significantly advance surgical education. However, the development of VLMs for surgical VQA is challenging due to limited datasets and the risk of overfitting and catastrophic forgetting during full fine-tuning of pretrained weights. While parameter-efficient techniques like Low-Rank Adaptation (LoRA) and Matrix of Rank Adaptation (MoRA) address adaptation challenges, their uniform parameter distribution overlooks the feature hierarchy in deep networks, where earlier layers, which learn general features, require more parameters than later ones. This work introduces PitVQA++ with an open-ended PitVQA dataset and vector matrix-low-rank adaptation (Vector-MoLoRA), an innovative VLM fine-tuning approach for adapting GPT-2 to pituitary surgery. Open-Ended PitVQA comprises around 101,803 frames from 25 procedural videos with 745,972 question-answer sentence pairs, covering key surgical elements such as phase and step recognition, context understanding, tool detection, localization, and interaction recognition. Vector-MoLoRA incorporates the principles of LoRA and MoRA to develop a matrix-low-rank adaptation strategy that employs vector ranking to allocate more parameters to earlier layers, gradually reducing them in the later layers. Our approach, validated on the Open-Ended PitVQA and EndoVis18-VQA datasets, effectively mitigates catastrophic forgetting while significantly enhancing performance over recent baselines. Furthermore, our risk-coverage analysis highlights its enhanced reliability and trustworthiness in handling uncertain predictions. Our source code and dataset are available at this https URL.
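The idea of giving earlier layers more adapter parameters than later ones can be made concrete with a small sketch. The linear rank schedule below is a hypothetical illustration and is not the paper's Vector-MoLoRA ranking: `layer_ranks`, `r_max`, and `r_min` are invented names. The parameter count `r * (d_in + d_out)` is the standard cost of a rank-`r` LoRA pair of matrices (A of shape r x d_in, B of shape d_out x r).

```python
def layer_ranks(num_layers, r_max, r_min):
    """Hypothetical schedule: ranks decay linearly from r_max (first layer)
    to r_min (last layer)."""
    if num_layers == 1:
        return [r_max]
    step = (r_max - r_min) / (num_layers - 1)
    return [round(r_max - step * i) for i in range(num_layers)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)


ranks = layer_ranks(num_layers=4, r_max=16, r_min=4)
per_layer = [lora_param_count(768, 768, r) for r in ranks]  # 768 is an assumed width
print(ranks, per_layer, sum(per_layer))
```

Under a uniform schedule every layer would cost the same; a decaying schedule shifts the same kind of budget toward the early layers, which is the intuition the abstract describes.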

70. 【2502.14142】Token Adaptation via Side Graph Convolution for Temporally and Spatially Efficient Fine-tuning of 3D Point Cloud Transformers

Link: https://arxiv.org/abs/2502.14142

Authors: Takahiko Furuya

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: point cloud Transformers, Parameter-efficient fine-tuning, point cloud, point cloud analysis, cloud Transformers

Comments: Currently under review

Click to view abstract

Abstract:Parameter-efficient fine-tuning (PEFT) of pre-trained 3D point cloud Transformers has emerged as a promising technique for 3D point cloud analysis. While existing PEFT methods attempt to minimize the number of tunable parameters, they still suffer from high temporal and spatial computational costs during fine-tuning. This paper proposes a novel PEFT algorithm for 3D point cloud Transformers, called Side Token Adaptation on a neighborhood Graph (STAG), to achieve superior temporal and spatial efficiency. STAG employs a graph convolutional side network that operates in parallel with a frozen backbone Transformer to adapt tokens to downstream tasks. STAG's side network realizes high efficiency through three key components: a connection with the backbone that enables reduced gradient computation, a parameter-sharing framework, and efficient graph convolution. Furthermore, we present Point Cloud Classification 13 (PCC13), a new benchmark comprising diverse publicly available 3D point cloud datasets, enabling comprehensive evaluation of PEFT methods. Extensive experiments using multiple pre-trained models and PCC13 demonstrate the effectiveness of STAG. Specifically, STAG maintains classification accuracy comparable to existing methods while reducing tunable parameters to only 0.43M and achieving significant reductions in both computational time and memory consumption for fine-tuning. Code and benchmark will be available at: this https URL.

71. 【2502.14140】ModSkill: Physical Character Skill Modularization

Link: https://arxiv.org/abs/2502.14140

Authors: Yiming Huang,Zhiyang Dou,Lingjie Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

Keywords: controlling simulated characters, imitation learning algorithms, Human motion, generalize motor skills, posing challenges

Comments:

Abstract:Human motion is highly diverse and dynamic, posing challenges for imitation learning algorithms that aim to generalize motor skills for controlling simulated characters. Previous methods typically rely on a universal full-body controller for tracking reference motion (tracking-based model) or a unified full-body skill embedding space (skill embedding). However, these approaches often struggle to generalize and scale to larger motion datasets. In this work, we introduce a novel skill learning framework, ModSkill, that decouples complex full-body skills into compositional, modular skills for independent body parts. Our framework features a skill modularization attention layer that processes policy observations into modular skill embeddings that guide low-level controllers for each body part. We also propose an Active Skill Learning approach with Generative Adaptive Sampling, using large motion generation models to adaptively enhance policy learning in challenging tracking scenarios. Our results show that this modularized skill learning framework, enhanced by generative sampling, outperforms existing methods in precise full-body motion tracking and enables reusable skill embeddings for diverse goal-driven tasks.

72. 【2502.14129】GlossGau: Efficient Inverse Rendering for Glossy Surface with Anisotropic Spherical Gaussian

Link: https://arxiv.org/abs/2502.14129

Authors: Bang Du,Runfa Blark Li,Chen Du,Truong Nguyen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Neural Radiance Fields, calibrated photographs represents, Gaussian Splatting, graphics and vision, calibrated photographs

Comments:

Abstract:The reconstruction of 3D objects from calibrated photographs represents a fundamental yet intricate challenge in the domains of computer graphics and vision. Although neural reconstruction approaches based on Neural Radiance Fields (NeRF) have shown remarkable capabilities, their processing costs remain substantial. Recently, the advent of 3D Gaussian Splatting (3D-GS) has greatly improved training efficiency and enabled realistic rendering in real time. However, due to the limited ability of Spherical Harmonics (SH) to represent high-frequency information, 3D-GS falls short in reconstructing glossy objects. Researchers have turned to enhancing the specular expressiveness of 3D-GS through inverse rendering. Yet these methods often struggle to maintain training and rendering efficiency, undermining the benefits of Gaussian Splatting techniques. In this paper, we introduce GlossGau, an efficient inverse rendering framework that reconstructs scenes with glossy surfaces while maintaining training and rendering speeds comparable to vanilla 3D-GS. Specifically, we explicitly model the surface normals, Bidirectional Reflectance Distribution Function (BRDF) parameters, and incident lights, and use Anisotropic Spherical Gaussian (ASG) to approximate the per-Gaussian Normal Distribution Function under the microfacet model. We utilize 2D Gaussian Splatting (2D-GS) as foundational primitives and apply regularization to significantly alleviate the normal estimation challenge encountered in related works. Experiments demonstrate that GlossGau achieves competitive or superior reconstruction on datasets with glossy surfaces. Compared with previous GS-based works that address specular surfaces, our optimization time is considerably shorter.

73. 【2502.14125】Modular Prompt Learning Improves Vision-Language Models

Link: https://arxiv.org/abs/2502.14125

Authors: Zhenhan Huang,Tejaswini Pedapati,Pin-Yu Chen,Jianxi Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: interpret visual concepts, Pre-trained vision-language models, Prompt learning, inserted prompts, prompts

Comments: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing

Abstract:Pre-trained vision-language models are able to interpret visual concepts and language semantics. Prompt learning, a method of constructing prompts for text encoders or image encoders, elicits the potential of pre-trained models and readily adapts them to new scenarios. Compared to fine-tuning, prompt learning enables the model to achieve comparable or better performance using fewer trainable parameters. Besides, prompt learning freezes the pre-trained model and avoids the catastrophic forgetting issue of fine-tuning. Continuous prompts inserted into the input of every transformer layer (i.e. deep prompts) can improve the performance of pre-trained models on downstream tasks. For the $i$-th transformer layer, the inserted prompts replace the prompts previously inserted in the $(i-1)$-th layer. Although the self-attention mechanism contextualizes newly inserted prompts for the current layer and embeddings from the previous layer's output, removing all inserted prompts from the previous layer inevitably loses information contained in the continuous prompts. In this work, we propose Modular Prompt Learning (MPL), which is designed to promote the preservation of information contained in the inserted prompts. We evaluate the proposed method on base-to-new generalization and cross-dataset tasks. Averaged over 11 datasets, our method achieves a 0.7% performance gain on the base-to-new generalization task compared to the state-of-the-art method. The largest improvement on an individual dataset is 10.7% (EuroSAT dataset).
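
The replacement behaviour of deep prompts described in this abstract can be traced with a toy Python sketch. It shows only the baseline mechanism (each layer discards the previous layer's prompts and prepends its own), not the proposed MPL method. Tokens are plain strings, no attention is computed, and the function name `forward_with_deep_prompts` is hypothetical.

```python
def forward_with_deep_prompts(input_tokens, layer_prompts):
    """Return the token sequence each layer receives under deep prompting.

    Toy illustration: a real transformer layer would transform the
    embeddings; here they are passed through unchanged so that the
    prompt bookkeeping is easy to follow.
    """
    sequences = []
    embeddings = list(input_tokens)
    for prompts in layer_prompts:
        # Prompts from the previous layer are gone; only the new ones are prepended.
        sequence = list(prompts) + embeddings
        sequences.append(sequence)
        # Keep only the non-prompt positions as input to the next layer.
        embeddings = sequence[len(prompts):]
    return sequences


trace = forward_with_deep_prompts(
    ["x1", "x2"],
    [["p1_a", "p1_b"], ["p2_a", "p2_b"]],
)
for layer_index, sequence in enumerate(trace, start=1):
    print(f"layer {layer_index} sees: {sequence}")
```

In this trace the layer-1 prompts never reach layer 2 directly, which is the information loss the abstract refers to.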

74. 【2502.14113】Object-centric Binding in Contrastive Language-Image Pretraining

Link: https://arxiv.org/abs/2502.14113

Authors: Rim Assouel,Pietro Astolfi,Florian Bordes,Michal Drozdzal,Adriana Romero-Soriano

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent advances, associate visual information, vision language models, advances in vision, vision language

Comments:

Abstract:Recent advances in vision language models (VLM) have been driven by contrastive models such as CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard-negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two modalities. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model not only enhances the performance of CLIP-based models in multi-object compositional understanding but also paves the way towards more accurate and sample-efficient image-text matching of complex scenes.

75. 【2502.14099】Point Cloud Geometry Scalable Coding Using a Resolution and Quality-conditioned Latents Probability Estimator

Link: https://arxiv.org/abs/2502.14099

Authors: Daniele Mari,André F. R. Guarda,Nuno M. M. Rodrigues,Simone Milani,Fernando Pereira

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: users consume multimedia, consume multimedia content, current age, users consume, terms of network

Comments: Submitted to IEEE and currently under review

Abstract:In the current age, users consume multimedia content in very heterogeneous scenarios in terms of network, hardware, and display capabilities. A naive solution to this problem is to encode multiple independent streams, each covering a different possible requirement for the clients, with an obvious negative impact on both storage and computational requirements. These drawbacks can be avoided by using codecs that enable scalability, i.e., the ability to generate a progressive bitstream, containing a base layer followed by multiple enhancement layers, so that the same bitstream can be decoded to serve multiple reconstruction and visualization specifications. While scalable coding is a well-known and addressed feature in conventional image and video codecs, this paper focuses on a new and very different problem, notably the development of scalable coding solutions for deep learning-based Point Cloud (PC) coding. The peculiarities of this 3D representation make it hard to implement flexible solutions that do not compromise the other functionalities of the codec. This paper proposes a joint quality and resolution scalability scheme, named Scalable Resolution and Quality Hyperprior (SRQH), that, contrary to previous solutions, can model the relationship between latents obtained with models trained for different RD tradeoffs and/or at different resolutions. Experimental results obtained by integrating SRQH in the emerging JPEG Pleno learning-based PC coding standard show that SRQH allows decoding the PC at different qualities and resolutions with a single bitstream while incurring only a limited RD penalty and increase in complexity w.r.t. non-scalable JPEG PCC, which would require one bitstream per coding configuration.

76. 【2502.14092】Hybrid Visual Servoing of Tendon-driven Continuum Robots

Link: https://arxiv.org/abs/2502.14092

Authors: Rana Danesh,Farrokh Janabi-Sharifi,Farhad Aghili

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

Keywords: Hybrid Visual Servoing, Visual Servoing, Image-Based Visual Servoing, Learning-Based Visual Servoing, tendon-driven continuum robots

Comments:

Abstract:This paper introduces a novel Hybrid Visual Servoing (HVS) approach for controlling tendon-driven continuum robots (TDCRs). The HVS system combines Image-Based Visual Servoing (IBVS) with Deep Learning-Based Visual Servoing (DLBVS) to overcome the limitations of each method and improve overall performance. IBVS offers higher accuracy and faster convergence in feature-rich environments, while DLBVS enhances robustness against disturbances and offers a larger workspace. By enabling smooth transitions between IBVS and DLBVS, the proposed HVS ensures effective control in dynamic, unstructured environments. The effectiveness of this approach is validated through simulations and real-world experiments, demonstrating that HVS achieves reduced iteration time, faster convergence, lower final error, and smoother performance compared to DLBVS alone, while maintaining DLBVS's robustness in challenging conditions such as occlusions, lighting changes, actuator noise, and physical impacts.

77. 【2502.14088】Regression in EO: Are VLMs Up to the Challenge?

Link: https://arxiv.org/abs/2502.14088

Authors: Xizhe Xue,Xiao Xiang Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Earth Observation, remotely sensed information, Vision Language Models, sensed information, featuring multi-sensor

Comments:

Abstract:Earth Observation (EO) data encompass a vast range of remotely sensed information with multi-sensor and multi-temporal characteristics, playing an indispensable role in understanding our planet's dynamics. Recently, Vision Language Models (VLMs) have achieved remarkable success in perception and reasoning tasks, bringing new insights and opportunities to the EO field. However, their potential for EO applications, especially scientific regression-related applications, remains largely unexplored. This paper bridges that gap by systematically examining the challenges and opportunities of adapting VLMs for EO regression tasks. The discussion first contrasts the distinctive properties of EO data with conventional computer vision datasets, then identifies four core obstacles in applying VLMs to EO regression: 1) the absence of dedicated benchmarks, 2) the discrete-versus-continuous representation mismatch, 3) error accumulation, and 4) the suboptimal nature of text-centric training objectives for numerical tasks. Next, a series of methodological insights and potential subtle pitfalls are explored. Lastly, we offer some promising future directions for designing robust, domain-aware solutions. Our findings highlight the promise of VLMs for scientific regression in EO, setting the stage for more precise and interpretable modeling of critical environmental processes.

78. 【2502.14070】DiffExp: Efficient Exploration in Reward Fine-tuning for Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2502.14070

Authors: Daewon Chae,June Suk Choi,Jinkyu Kim,Kimin Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enhancing model performance, reward fine-tuning, reward fine-tuning methods, Fine-tuning, reward

Comments: AAAI 2025

Abstract:Fine-tuning text-to-image diffusion models to maximize rewards has proven effective for enhancing model performance. However, reward fine-tuning methods often suffer from slow convergence due to online sample generation. Therefore, obtaining diverse samples with strong reward signals is crucial for improving sample efficiency and overall performance. In this work, we introduce DiffExp, a simple yet effective exploration strategy for reward fine-tuning of text-to-image models. Our approach employs two key strategies: (a) dynamically adjusting the scale of classifier-free guidance to enhance sample diversity, and (b) randomly weighting phrases of the text prompt to exploit high-quality reward signals. We demonstrate that these strategies significantly enhance exploration during online sample generation, improving the sample efficiency of recent reward fine-tuning methods, such as DDPO and AlignProp.
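
Strategy (a) above adjusts the classifier-free guidance scale during sampling. The standard guidance combination is `eps = eps_uncond + scale * (eps_cond - eps_uncond)`. The Python sketch below pairs it with a hypothetical linear ramp. The schedule, the function names, and the bounds 1.0 and 7.5 are assumptions for illustration; the abstract does not say how DiffExp varies the scale.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: push the prediction away from the
    unconditional noise estimate toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def guidance_scale(step, total_steps, low=1.0, high=7.5):
    """Hypothetical schedule: ramp the scale linearly from `low` to `high`.

    A small scale early on keeps samples diverse; a larger scale later
    tightens prompt adherence. Illustrative only, not the paper's schedule.
    """
    fraction = step / max(total_steps - 1, 1)
    return low + fraction * (high - low)


# Toy two-dimensional noise predictions.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]
for step in range(3):
    scale = guidance_scale(step, total_steps=3)
    print(step, scale, cfg_combine(eps_uncond, eps_cond, scale))
```

With `scale == 1.0` the result equals the conditional prediction, which makes a convenient sanity check.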

79. 【2502.14068】A Racing Dataset and Baseline Model for Track Detection in Autonomous Racing

Link: https://arxiv.org/abs/2502.14068

Authors: Shreya Ghosh,Yi-Huan Chen,Ching-Hsiang Huang,Abu Shafin Mohammad Mahdee Jameel,Chien Chou Ho,Aly El Gamal,Samuel Labi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: downstream task, racing-related research, lack of publicly, Indy Autonomous Challenge, raw images

Comments: Currently Under Review

Abstract:A significant challenge in racing-related research is the lack of publicly available datasets containing raw images with corresponding annotations for the downstream task. In this paper, we introduce RoRaTrack, a novel dataset that contains annotated multi-camera image data from racing scenarios for track detection. The data is collected on a Dallara AV-21 at a racing circuit in Indiana, in collaboration with the Indy Autonomous Challenge (IAC). RoRaTrack addresses common problems such as blurriness due to high speed, color inversion from the camera, and absence of lane markings on the track. Consequently, we propose RaceGAN, a baseline model based on a Generative Adversarial Network (GAN) that effectively addresses these challenges. The proposed model demonstrates superior performance compared to current state-of-the-art machine learning models in track detection. The dataset and code for this work are available at this http URL.

80. 【2502.14064】Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging

Link: https://arxiv.org/abs/2502.14064

Authors: Shansong Wang,Mojtaba Safari,Qiang Li,Chih-Wei Chang,Richard LJ Qiu,Justin Roper,David S. Yu,Xiaofeng Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Vision foundation models, Vision foundation, diverse types, MRI, tasks

Comments:

Abstract:Vision foundation models (VFMs) are pre-trained on extensive image datasets to learn general representations for diverse types of data. These models can subsequently be fine-tuned for specific downstream tasks, significantly boosting performance across a broad range of applications. However, existing vision foundation models that claim to be applicable to various radiology tasks are mostly pre-trained on 3D computed tomography (CT), which benefits from the availability of extensive 3D CT databases. Significant differences between CT and magnetic resonance imaging (MRI) in imaging principles, signal characteristics, and data distribution may hinder their practical performance and versatility in MRI-specific applications. Here, we propose Triad, a vision foundation model for 3D MRI. Triad adopts a widely used autoencoder architecture to learn robust representations from 131,170 3D MRI volumes and uses organ-independent imaging descriptions to constrain the semantic distribution of the visual modality. The above pre-training dataset is called Triad-131K, which is currently the largest 3D MRI pre-training dataset. We evaluate Triad across three tasks, namely, organ/tumor segmentation, organ/cancer classification, and medical image registration, in two data modalities (within-domain and out-of-domain) settings using 25 downstream datasets. By initializing models with Triad's pre-trained weights, nnUNet-Triad improves segmentation performance by 6.88% compared to nnUNet-Scratch across 17 datasets. Swin-B-Triad achieves a 3.97% improvement over Swin-B-Scratch in classification tasks across five datasets. SwinUNETR-Triad improves by 4.00% compared to SwinUNETR-Scratch in registration tasks across two datasets. Our study demonstrates that pre-training can maximize performance when the data modalities and organs of upstream and downstream tasks are consistent.

81. 【2502.14063】PedDet: Adaptive Spectral Optimization for Multimodal Pedestrian Detection

Link: https://arxiv.org/abs/2502.14063

Authors: Rui Zhao,Zeyu Zhang,Yi Xu,Yi Yao,Yan Huang,Wenxin Zhang,Zirui Song,Xiuying Chen,Yang Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: intelligent transportation systems, made significant progress, critical challenges, insufficient fusion, complex scenarios

Comments:

Abstract:Pedestrian detection in intelligent transportation systems has made significant progress but faces two critical challenges: (1) insufficient fusion of complementary information between visible and infrared spectra, particularly in complex scenarios, and (2) sensitivity to illumination changes, such as low-light or overexposed conditions, leading to degraded performance. To address these issues, we propose PedDet, an adaptive spectral optimization complementarity framework specifically enhanced and optimized for multispectral pedestrian detection. PedDet introduces the Multi-scale Spectral Feature Perception Module (MSFPM) to adaptively fuse visible and infrared features, enhancing robustness and flexibility in feature extraction. Additionally, the Illumination Robustness Feature Decoupling Module (IRFDM) improves detection stability under varying lighting by decoupling pedestrian and background features. We further design a contrastive alignment to enhance intermodal feature discrimination. Experiments on LLVIP and MSDS datasets demonstrate that PedDet achieves state-of-the-art performance, improving the mAP by 6.6% with superior detection accuracy even in low-light conditions, marking a significant step forward for road safety. Code will be available at this https URL.

82. 【2502.14061】EfficientPose 6D: Scalable and Efficient 6D Object Pose Estimation

Link: https://arxiv.org/abs/2502.14061

Authors: Zixuan Fang,Thomas Pöllabauer,Tristan Wirth,Sarah Berkei,Volker Knauthe,Arjan Kuijper

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: industrial applications requiring, estimation remains critical, requiring real-time feedback, applications requiring real-time, accurate pose estimation

Comments:

Abstract:In industrial applications requiring real-time feedback, such as quality control and robotic manipulation, the demand for high-speed and accurate pose estimation remains critical. Despite advances improving speed and accuracy in pose estimation, finding a balance between computational efficiency and accuracy poses significant challenges in dynamic environments. Most current algorithms lack scalability in estimation time, especially for diverse datasets, and the state-of-the-art (SOTA) methods are often too slow. This study focuses on developing a fast and scalable set of pose estimators based on GDRNPP to meet or exceed current benchmarks in accuracy and robustness, particularly addressing the efficiency-accuracy trade-off essential in real-time scenarios. We propose the AMIS algorithm to tailor the utilized model according to an application-specific trade-off between inference time and accuracy. We further show the effectiveness of the AMIS-based model choice on four prominent benchmark datasets (LM-O, YCB-V, T-LESS, and ITODD).

83. 【2502.14044】Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data

Link: https://arxiv.org/abs/2502.14044

Authors: Yucheng Shi,Quanzheng Li,Jin Sun,Xiang Li,Ninghao Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Large multimodal models, shown impressive capabilities, Large multimodal, shown impressive, impressive capabilities

Comments: Accepted by ICLR 2025. Code: [this https URL](https://github.com/sycny/SelfSynthX)

Abstract:Large multimodal models (LMMs) have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific objectives and provide justifiable explanations for their predictions. To address this, we propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs using self-synthesized data. Specifically, visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are based on expert-defined concepts, carefully selected based on their alignment with the image content. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers for the next round of tuning. This iterative process of data synthesis and fine-tuning progressively improves the model's ability to generate accurate and reasonable explanations. Experimental results demonstrate the effectiveness of our method in improving both the accuracy and explainability of specialized visual classification tasks.

84. 【2502.14023】Dynamic Activation with Knowledge Distillation for Energy-Efficient Spiking NN Ensembles

Link: https://arxiv.org/abs/2502.14023

Authors: Orestis Konstantaropoulos,Theodoris Mallios,Maria Papadopouli

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

Keywords: high energy consumption, energy consumption makes, spiking neural networks, classification and decision-making, Spiking Neural Ensemble

Comments:

Abstract:While foundation AI models excel at tasks like classification and decision-making, their high energy consumption makes them unsuitable for energy-constrained applications. Inspired by the brain's efficiency, spiking neural networks (SNNs) have emerged as a viable alternative due to their event-driven nature and compatibility with neuromorphic chips. This work introduces a novel system that combines knowledge distillation and ensemble learning to bridge the performance gap between artificial neural networks (ANNs) and SNNs. A foundation AI model acts as a teacher network, guiding smaller student SNNs organized into an ensemble, called Spiking Neural Ensemble (SNE). SNE enables the disentanglement of the teacher's knowledge, allowing each student to specialize in predicting a distinct aspect of it, while processing the same input. The core innovation of SNE is the adaptive activation of a subset of SNN models of an ensemble, leveraging knowledge-distillation, enhanced with an informed-partitioning (disentanglement) of the teacher's feature space. By dynamically activating only a subset of these student SNNs, the system balances accuracy and energy efficiency, achieving substantial energy savings with minimal accuracy loss. Moreover, SNE is significantly more efficient than the teacher network, reducing computational requirements by up to 20x with only a 2% drop in accuracy on the CIFAR-10 dataset. This disentanglement procedure achieves an accuracy improvement of up to 2.4% on the CIFAR-10 dataset compared to other partitioning schemes. Finally, we comparatively analyze SNE performance under noisy conditions, demonstrating enhanced robustness compared to its ANN teacher. In summary, SNE offers a promising new direction for energy-constrained applications.

85. 【2502.14807】FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis

Link: https://arxiv.org/abs/2502.14807

Authors: Fadillah Maani,Numan Saeed,Tausifa Saleem,Zaid Farooq,Hussain Alasmawi,Werner Diehl,Ameera Mohammad,Gareth Waring,Saudabi Valappi,Leanne Bricker,Mohammad Yaqub

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: fetal ultrasound images, fetal ultrasound, ultrasound images, offering pre-trained models, increasingly effective

Comments:

Abstract:Foundation models are becoming increasingly effective in the medical domain, offering pre-trained models on large datasets that can be readily adapted for downstream tasks. Despite progress, fetal ultrasound images remain a challenging domain for foundation models due to their inherent complexity, often requiring substantial additional training and facing limitations due to the scarcity of paired multimodal data. To overcome these challenges, here we introduce FetalCLIP, a vision-language foundation model capable of generating universal representation of fetal ultrasound images. FetalCLIP was pre-trained using a multimodal learning approach on a diverse dataset of 210,035 fetal ultrasound images paired with text. This represents the largest paired dataset of its kind used for foundation model development to date. This unique training approach allows FetalCLIP to effectively learn the intricate anatomical features present in fetal ultrasound images, resulting in robust representations that can be used for a variety of downstream applications. In extensive benchmarking across a range of key fetal ultrasound applications, including classification, gestational age estimation, congenital heart defect (CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all baselines while demonstrating remarkable generalizability and strong performance even with limited labeled data. We plan to release the FetalCLIP model publicly for the benefit of the broader scientific community.

86. 【2502.14753】MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders

Link: https://arxiv.org/abs/2502.14753

Authors: Maya Varma,Ashwin Kumar,Rogier van der Sluijs,Sophie Ostmeier,Louis Blankemeier,Pierre Chambon,Christian Bluethgen,Jip Prince,Curtis Langlotz,Akshay Chaudhari

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: capture fine-grained features, Medical images, clinical decision-making, images, Medical

Comments:

Abstract:Medical images are acquired at high resolutions with large fields of view in order to capture fine-grained features necessary for clinical decision-making. Consequently, training deep learning models on medical images can incur large computational costs. In this work, we address the challenge of downsizing medical images in order to improve downstream computational efficiency while preserving clinically-relevant features. We introduce MedVAE, a family of six large-scale 2D and 3D autoencoders capable of encoding medical images as downsized latent representations and decoding latent representations back to high-resolution images. We train MedVAE autoencoders using a novel two-stage training approach with 1,052,730 medical images. Across diverse tasks obtained from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent representations in place of high-resolution images when training downstream models can lead to efficiency benefits (up to 70x improvement in throughput) while simultaneously preserving clinically-relevant features and (2) MedVAE can decode latent representations back to high-resolution images with high fidelity. Our work demonstrates that large-scale, generalizable autoencoders can help address critical efficiency challenges in the medical domain. Our code is available at this https URL.

87. 【2502.14584】Vision Foundation Models in Medical Image Analysis: Advances and Challenges

Link: https://arxiv.org/abs/2502.14584

Authors: Pengchen Liang,Bin Pu,Haishan Huang,Yiwei Li,Hualiang Wang,Weibo Ma,Qing Chang

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Foundation Models, Vision Transformers, Vision Foundation, sparked significant advances, medical image analysis

Comments: 17 pages, 1 figure

Abstract:The rapid development of Vision Foundation Models (VFMs), particularly Vision Transformers (ViT) and Segment Anything Model (SAM), has sparked significant advances in the field of medical image analysis. These models have demonstrated exceptional capabilities in capturing long-range dependencies and achieving high generalization in segmentation tasks. However, adapting these large models to medical image analysis presents several challenges, including domain differences between medical and natural images, the need for efficient model adaptation strategies, and the limitations of small-scale medical datasets. This paper reviews the state-of-the-art research on the adaptation of VFMs to medical image segmentation, focusing on the challenges of domain adaptation, model compression, and federated learning. We discuss the latest developments in adapter-based improvements, knowledge distillation techniques, and multi-scale contextual feature modeling, and propose future directions to overcome these bottlenecks. Our analysis highlights the potential of VFMs, along with emerging methodologies such as federated learning and model compression, to revolutionize medical image analysis and enhance clinical applications. The goal of this work is to provide a comprehensive overview of current approaches and suggest key areas for future research that can drive the next wave of innovation in medical image segmentation.

88. 【2502.14418】Role of the Pretraining and the Adaptation data sizes for low-resource real-time MRI video segmentation

Link: https://arxiv.org/abs/2502.14418

Authors: Masoud Thajudeen Tholan,Vinayaka Hegde,Chetan Sharma,Prasanta Kumar Ghosh

Categories: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Keywords: Real-time Magnetic Resonance, Magnetic Resonance Imaging, Real-time Magnetic, Resonance Imaging, Magnetic Resonance

Comments: Accepted to ICASSP 2025

Abstract:Real-time Magnetic Resonance Imaging (rtMRI) is frequently used in speech production studies as it provides a complete view of the vocal tract during articulation. This study investigates the effectiveness of rtMRI in analyzing vocal tract movements by employing the SegNet and UNet models for Air-Tissue Boundary (ATB) segmentation tasks. We pretrained a few base models using increasing numbers of subjects and videos to assess performance on two datasets. The first consists of unseen subjects with unseen videos from the same data source, on which we achieved Pixel-wise Classification Accuracy (PCA) and Dice Coefficient that were 0.33% and 0.91% better, respectively, than the matched condition. The second comprises unseen videos from a new data source, on which we obtained 99.63% and 98.09% (PCA and Dice Coefficient, respectively) of the matched condition performance. Here, matched condition performance refers to the performance of a model trained only on the test subjects, which was set as a benchmark for the other models. Our findings highlight the significance of fine-tuning and adapting models with limited data. Notably, we demonstrated that effective model adaptation can be achieved with as few as 15 rtMRI frames from any new dataset.
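
The two metrics reported in this abstract can be sketched for binary masks in Python, using the usual definitions: pixel-wise classification accuracy is the fraction of matching pixels, and the Dice coefficient is `2 * |A ∩ B| / (|A| + |B|)`. The function names and the tiny 2x2 masks are illustrative assumptions, not taken from the paper.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels where the predicted label equals the target label."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    matches = sum(1 for p, t in zip(flat_pred, flat_target) if p == t)
    return matches / len(flat_target)


def dice_coefficient(pred, target):
    """Dice overlap between two binary masks (1 = foreground, 0 = background)."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2 * intersection / total


# Toy masks: the prediction marks one extra foreground pixel.
pred_mask = [[1, 1], [0, 0]]
target_mask = [[1, 0], [0, 0]]
print("pixel accuracy:", pixel_accuracy(pred_mask, target_mask))
print("dice:", dice_coefficient(pred_mask, target_mask))
```

Pixel accuracy rewards correct background pixels as much as foreground ones, so on masks dominated by background it can stay high while Dice drops; reporting both gives a fuller picture.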

89. 【2502.14401】MedFuncta: Modality-Agnostic Representations Based on Efficient Neural Fields

Link: https://arxiv.org/abs/2502.14401

Authors: Paul Friedrich, Florentin Bieder, Philippe C. Cattin

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical image analysis, voxel-based data representations, focuses on grid, Recent research, image analysis

Comments: Code and Dataset: [this https URL](https://github.com/pfriedri/medfuncta)

Abstract:Recent research in medical image analysis with deep learning almost exclusively focuses on grid- or voxel-based data representations. We challenge this common choice by introducing MedFuncta, a modality-agnostic continuous data representation based on neural fields. We demonstrate how to scale neural fields from single instances to large datasets by exploiting redundancy in medical signals and by applying an efficient meta-learning approach with a context reduction scheme. We further address the spectral bias in commonly used SIREN activations, by introducing an $\omega_0$-schedule, improving reconstruction quality and convergence speed. We validate our proposed approach on a large variety of medical signals of different dimensions and modalities (1D: ECG; 2D: Chest X-ray, Retinal OCT, Fundus Camera, Dermatoscope, Colon Histopathology, Cell Microscopy; 3D: Brain MRI, Lung CT) and successfully demonstrate that we can solve relevant downstream tasks on these representations. We additionally release a large-scale dataset of 550k annotated neural fields to promote research in this direction.

90. 【2502.14363】Topology-Aware Wavelet Mamba for Airway Structure Segmentation in Postoperative Recurrent Nasopharyngeal Carcinoma CT Scans

Link: https://arxiv.org/abs/2502.14363

Authors: Haishan Huang, Pengchen Liang, Naier Lin, Luxi Wang, Bin Pu, Jianguo Chen, Qing Chang, Xia Shen, Guo Ran

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Nasopharyngeal carcinoma, limited mouth opening, radiotherapy and chemotherapy, joint stiffness, require re-surgery

Comments: 20 pages, 11 figures, 6 tables

Abstract:Nasopharyngeal carcinoma (NPC) patients often undergo radiotherapy and chemotherapy, which can lead to postoperative complications such as limited mouth opening and joint stiffness, particularly in recurrent cases that require re-surgery. These complications can affect airway function, making accurate postoperative airway risk assessment essential for managing patient care. Accurate segmentation of airway-related structures in postoperative CT scans is crucial for assessing these risks. This study introduces TopoWMamba (Topology-aware Wavelet Mamba), a novel segmentation model specifically designed to address the challenges of postoperative airway risk evaluation in recurrent NPC patients. TopoWMamba combines wavelet-based multi-scale feature extraction, state-space sequence modeling, and topology-aware modules to segment airway-related structures in CT scans robustly. By leveraging the Wavelet-based Mamba Block (WMB) for hierarchical frequency decomposition and the Snake Conv VSS (SCVSS) module to preserve anatomical continuity, TopoWMamba effectively captures both fine-grained boundaries and global structural context, crucial for accurate segmentation in complex postoperative scenarios. Through extensive testing on the NPCSegCT dataset, TopoWMamba achieves an average Dice score of 88.02%, outperforming existing models such as UNet, Attention UNet, and SwinUNet. Additionally, TopoWMamba is tested on the SegRap 2023 Challenge dataset, where it shows a significant improvement in trachea segmentation with a Dice score of 95.26%. The proposed model provides a strong foundation for automated segmentation, enabling more accurate postoperative airway risk evaluation.
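
The idea of hierarchical frequency decomposition mentioned in this abstract can be sketched with a plain multi-level Haar wavelet transform on a 1D signal. This is a generic illustration only, not the paper's Wavelet-based Mamba Block: the abstract does not say which wavelet family is used, and the function names below are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def haar_step(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One Haar level: split a signal into low-frequency and high-frequency halves."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    # Scaled pairwise sums keep the coarse shape; scaled differences keep local detail.
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal: Sequence[float], levels: int) -> Tuple[List[float], List[List[float]]]:
    """Apply haar_step repeatedly to the approximation, collecting details per level."""
    approx = list(signal)
    details: List[List[float]] = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)  # details[0] is the finest scale
    return approx, details


if __name__ == "__main__":
    coarse, bands = haar_decompose([4.0, 6.0, 10.0, 12.0], levels=2)
    print(coarse, bands)
```

Each level halves the resolution, so the detail bands go from fine to coarse; a segmentation model can then treat boundaries (fine bands) and global context (coarse approximation) separately.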

91. 【2502.14260】EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement

Link: https://arxiv.org/abs/2502.14260

Authors: Wenhui Zhu, Xuanzhao Dong, Xin Li, Yujian Xiong, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Zhangsihao Yang, Yi Su, Oana Dumitrascu, Yalin Wang

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: fundus image enhancement, achieved significant success, past decade, fundus image, presents a considerable

Abstract:Over the past decade, generative models have achieved significant success in fundus image enhancement. However, the evaluation of these models still presents a considerable challenge. A comprehensive evaluation benchmark for fundus image enhancement is indispensable for three main reasons: 1) The existing denoising metrics (e.g., PSNR, SSIM) are hard to extend to downstream real-world clinical research (e.g., vessel morphology consistency). 2) There is a lack of comprehensive evaluation for both paired and unpaired enhancement methods, along with the need for expert protocols to accurately assess clinical value. 3) An ideal evaluation system should provide insights to inform future developments of fundus image enhancement. To this end, we propose a novel comprehensive benchmark, EyeBench, to provide insights that align enhancement models with clinical needs, offering a foundation for future work to improve the clinical relevance and applicability of generative models for fundus image enhancement. EyeBench has three appealing properties: 1) Multi-dimensional clinical alignment downstream evaluation: In addition to evaluating the enhancement task, we provide several clinically significant downstream tasks for fundus images, including vessel segmentation, DR grading, denoising generalization, and lesion segmentation. 2) Medical expert-guided evaluation design: We introduce a novel dataset that promotes comprehensive and fair comparisons between paired and unpaired methods and includes a manual evaluation protocol by medical experts. 3) Valuable insights: Our benchmark study provides a comprehensive and rigorous evaluation of existing methods across different downstream tasks, assisting medical experts in making informed choices. Additionally, we offer further analysis of the challenges faced by existing methods. The code is available at this https URL.

92. 【2502.14090】MambaLiteSR: Image Super-Resolution with Low-Rank Mamba using Knowledge Distillation

Link: https://arxiv.org/abs/2502.14090

Authors: Romina Aalishah, Mozhgan Navardi, Tinoosh Mohsenin

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generative Artificial Intelligence, Artificial Intelligence, gained significant attention, Generative Artificial, recent years

Comments: Special Session: Generative AI on Edge, 26th International Symposium on Quality Electronic Design (ISQED'25)

Abstract:Generative Artificial Intelligence (AI) has gained significant attention in recent years, revolutionizing various applications across industries. Among these, advanced vision models for image super-resolution are in high demand, particularly for deployment on edge devices where real-time processing is crucial. However, deploying such models on edge devices is challenging due to limited computing power and memory. In this paper, we present MambaLiteSR, a novel lightweight image Super-Resolution (SR) model that utilizes the architecture of Vision Mamba. It integrates State Space Blocks and a reconstruction module for efficient feature extraction. To optimize efficiency without affecting performance, MambaLiteSR employs knowledge distillation to transfer key insights from a larger Mamba-based teacher model to a smaller student model via hyperparameter tuning. Through mathematical analysis of model parameters and their impact on PSNR, we identify key factors and adjust them accordingly. Our comprehensive evaluation shows that MambaLiteSR outperforms state-of-the-art edge SR methods by reducing power consumption while maintaining competitive PSNR and SSIM scores across benchmark datasets. It also reduces power usage during training via low-rank approximation. Moreover, MambaLiteSR reduces parameters with minimal performance loss, enabling efficient deployment of generative AI models on resource-constrained devices. Deployment on the embedded NVIDIA Jetson Orin Nano confirms the superior balance of MambaLiteSR size, latency, and efficiency. Experiments show that MambaLiteSR achieves performance comparable to both the baseline and other edge models while using 15% fewer parameters. It also improves power consumption by up to 58% compared to state-of-the-art SR edge models, all while maintaining low energy use during training.
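
PSNR, the quality metric this abstract reports, follows the standard formula PSNR = 10 · log10(MAX² / MSE). The sketch below assumes 8-bit grayscale images stored as nested lists; the function name and the `max_value` default are illustrative choices and are not taken from the paper.

```python
import math
from typing import Sequence

Image = Sequence[Sequence[float]]  # grayscale image as rows of pixel intensities


def psnr(reference: Image, restored: Image, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference and a restored image."""
    squared_error = 0.0
    count = 0
    for ref_row, out_row in zip(reference, restored):
        for r, s in zip(ref_row, out_row):
            squared_error += (r - s) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ground_truth = [[0, 255], [255, 0]]
    upscaled = [[0, 250], [255, 5]]
    print(f"PSNR: {psnr(ground_truth, upscaled):.2f} dB")
```

Because the scale is logarithmic, halving the mean squared error raises PSNR by only about 3 dB, which is worth keeping in mind when comparing models whose scores differ by fractions of a decibel.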

93. 【2502.13974】Segmentation-free integration of nuclei morphology and spatial transcriptomics for retinal images

Link: https://arxiv.org/abs/2502.13974

Authors: Eduard Chelebian, Pratiti Dasgupta, Zainalabedin Samadi, Carolina Wählby, Amjad Askary

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: SEgmentation-Free Integration, study introduces SEFI, spatial transcriptomics data, spatial transcriptomics, integrating morphological features

Abstract:This study introduces SEFI (SEgmentation-Free Integration), a novel method for integrating morphological features of cell nuclei with spatial transcriptomics data. Cell segmentation poses a significant challenge in the analysis of spatial transcriptomics data, as tissue-specific structural complexities and densely packed cells in certain regions make it difficult to develop a universal approach. SEFI addresses this by utilizing self-supervised learning to extract morphological features from fluorescent nuclear staining images, enhancing the clustering of gene expression data without requiring segmentation. We demonstrate SEFI on spatially resolved gene expression profiles of the developing retina, acquired using multiplexed single molecule Fluorescence In Situ Hybridization (smFISH). SEFI is publicly available at this https URL.