本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新454篇论文,其中:

  • 自然语言处理61
  • 信息检索7
  • 计算机视觉75

自然语言处理

1. 【2505.10557】MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

链接https://arxiv.org/abs/2505.10557

作者:Ke Wang,Junting Pan,Linda Wei,Aojun Zhou,Weikang Shi,Zimu Lu,Han Xiao,Yunqiao Yang,Houxing Ren,Mingjie Zhan,Hongsheng Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Natural language image-caption, training Large Multimodal, training Large, Large Multimodal Models, Natural language

备注: Accepted to ACL 2025 Findings

点击查看摘要

Abstract:Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with model-in-the-loop approach, resulting in an image-to-code model, FigCodifier and ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%. The dataset and models will be released at this https URL.

2. 【2505.10554】Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models

链接https://arxiv.org/abs/2505.10554

作者:Zhiyuan Hu,Yibo Wang,Hanze Dong,Yuhui Xu,Amrita Saha,Caiming Xiong,Bryan Hooi,Junnan Li

类目:Computation and Language (cs.CL)

关键词:Large reasoning models, Large reasoning, capacity for long, possess a latent, latent capacity

备注: In Progress

点击查看摘要

Abstract:Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs' reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental "aha moments". Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three stage-pipeline individual alignment, parameter-space merging, and domain-specific reinforcement learning, boosting performance by over 10\% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional 2\% average gain in the performance ceiling across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: this https URL

3. 【2505.10543】owards a Deeper Understanding of Reasoning Capabilities in Large Language Models

链接https://arxiv.org/abs/2505.10543

作者:Annie Wong,Thomas Bäck,Aske Plaat,Niki van Stein,Anna V. Kononova

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:large language models, environments remains unclear, language models, models demonstrate impressive, large language

备注

点击查看摘要

Abstract:While large language models demonstrate impressive performance on static benchmarks, the true potential of large language models as self-learning and reasoning agents in dynamic environments remains unclear. This study systematically evaluates the efficacy of self-reflection, heuristic mutation, and planning as prompting techniques to test the adaptive capabilities of agents. We conduct experiments with various open-source language models in dynamic environments and find that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, a too-long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Yet, we find that advanced reasoning methods yield highly variable outcomes: while capable of significantly improving performance when reasoning and decision-making align, they also introduce instability and can lead to big performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in crucial areas such as planning, reasoning, and spatial coordination, suggesting that current-generation large language models still suffer fundamental shortcomings that may not be fully overcome through self-reflective prompting alone. Reasoning is a multi-faceted task, and while reasoning methods like Chain of thought improves multi-step reasoning on math word problems, our findings using dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.

4. 【2505.10527】WorldPM: Scaling Human Preference Modeling

链接https://arxiv.org/abs/2505.10527

作者:Binghai Wang,Runji Lin,Keming Lu,Le Yu,Zhenru Zhang,Fei Huang,Chujie Zheng,Kai Dang,Yang Fan,Xingzhang Ren,An Yang,Binyuan Hui,Dayiheng Liu,Tao Gui,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang,Bowen Yu,Jingren Zhou,Junyang Lin

类目:Computation and Language (cs.CL)

关键词:similar laws exist, World Preference Modeling, test loss scales, power law, similar laws

备注

点击查看摘要

Abstract:Motivated by scaling laws in language modeling that demonstrate how test loss scales as a power law with model and dataset sizes, we find that similar laws exist in preference modeling. We propose World Preference Modeling$ (WorldPM) to emphasize this scaling potential, where World Preference embodies a unified representation of human preferences. In this paper, we collect preference data from public forums covering diverse user communities, and conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters. We observe distinct patterns across different evaluation metrics: (1) Adversarial metrics (ability to identify deceptive features) consistently scale up with increased training data and base model size; (2) Objective metrics (objective knowledge with well-defined answers) show emergent behavior in larger language models, highlighting WorldPM's scalability potential; (3) Subjective metrics (subjective preferences from a limited number of humans or AI) do not demonstrate scaling trends. Further experiments validate the effectiveness of WorldPM as a foundation for preference fine-tuning. Through evaluations on 7 benchmarks with 20 subtasks, we find that WorldPM broadly improves the generalization performance across human preference datasets of varying sizes (7K, 100K and 800K samples), with performance gains exceeding 5% on many key subtasks. Integrating WorldPM into our internal RLHF pipeline, we observe significant improvements on both in-house and public evaluation sets, with notable gains of 4% to 8% in our in-house evaluations.

5. 【2505.10526】MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models

链接https://arxiv.org/abs/2505.10526

作者:Mugilan Ganesan,Shane Segal,Ankur Aggarwal,Nish Sinnadurai,Sean Lie,Vithursan Thangarasa

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Speculative decoding significantly, decoding significantly accelerates, significantly accelerates language, model verifies simultaneously, propose multiple tokens

备注: Main paper: 11 pp., 4 figs., 3 tabs.; Supplementary: 2 pp

点击查看摘要

Abstract:Speculative decoding significantly accelerates language model inference by enabling a lightweight draft model to propose multiple tokens that a larger target model verifies simultaneously. However, applying this technique to vision-language models (VLMs) presents two fundamental challenges: small language models that could serve as efficient drafters lack the architectural components to process visual inputs, and their token predictions fail to match those of VLM target models that consider visual context. We introduce Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (MASSV), which transforms existing small language models into effective multimodal drafters through a two-phase approach. MASSV first connects the target VLM's vision encoder to the draft model via a lightweight trainable projector, then applies self-distilled visual instruction tuning using responses generated by the target VLM to align token predictions. Comprehensive experiments across the Qwen2.5-VL and Gemma3 model families demonstrate that MASSV increases accepted length by up to 30% and delivers end-to-end inference speedups of up to 1.46x on visually-grounded tasks. MASSV provides a scalable, architecture-compatible method for accelerating both current and future VLMs.

6. 【2505.10518】Multi-Token Prediction Needs Registers

链接https://arxiv.org/abs/2505.10518

作者:Anastasios Gerontopoulos,Spyros Gidaris,Nikos Komodakis

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Multi-token prediction, consistently generalized, improving language model, fine-tuning, supervised fine-tuning

备注

点击查看摘要

Abstract:Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes--ensuring compatibility with off-the-shelf pretrained language models--and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: this https URL.

7. 【2505.10507】he Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks

链接https://arxiv.org/abs/2505.10507

作者:Benedikt Ebing,Goran Glavaš

类目:Computation and Language (cs.CL)

关键词:target language data, source language data, noisy target language, noisy source language, language data translated

备注

点击查看摘要

Abstract:Translation-based strategies for cross-lingual transfer XLT such as translate-train -- training on noisy target language data translated from the source language -- and translate-test -- evaluating on noisy source language data translated from the target language -- are competitive XLT baselines. In XLT for token classification tasks, however, these strategies include label projection, the challenging step of mapping the labels from each token in the original sentence to its counterpart(s) in the translation. Although word aligners (WAs) are commonly used for label projection, the low-level design decisions for applying them to translation-based XLT have not been systematically investigated. Moreover, recent marker-based methods, which project labeled spans by inserting tags around them before (or after) translation, claim to outperform WAs in label projection for XLT. In this work, we revisit WAs for label projection, systematically investigating the effects of low-level design decisions on token-level XLT: (i) the algorithm for projecting labels between (multi-)token spans, (ii) filtering strategies to reduce the number of noisily mapped labels, and (iii) the pre-tokenization of the translated sentences. We find that all of these substantially impact translation-based XLT performance and show that, with optimized choices, XLT with WA offers performance at least comparable to that of marker-based methods. We then introduce a new projection strategy that ensembles translate-train and translate-test predictions and demonstrate that it substantially outperforms the marker-based projection. Crucially, we show that our proposed ensembling also reduces sensitivity to low-level WA design choices, resulting in more robust XLT for token classification tasks.

8. 【2505.10495】RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs

链接https://arxiv.org/abs/2505.10495

作者:Vibha Belavadi,Tushar Vatsa,Dewang Sultania,Suhas Suresha,Ishita Verma,Cheng Chen,Tracy Holloway King,Michael Friedrich

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:addresses fine-tuning Large, Large Language Models, fine-tuning Large Language, Large Language, paper addresses fine-tuning

备注: Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing

点击查看摘要

Abstract:This paper addresses fine-tuning Large Language Models (LLMs) for function calling tasks when real user interaction data is unavailable. In digital content creation tools, where users express their needs through natural language queries that must be mapped to API calls, the lack of real-world task-specific data and privacy constraints for training on it necessitate synthetic data generation. Existing approaches to synthetic data generation fall short in diversity and complexity, failing to replicate real-world data distributions and leading to suboptimal performance after LLM fine-tuning. We present a novel router-based architecture that leverages domain resources like content metadata and structured knowledge graphs, along with text-to-text and vision-to-text language models to generate high-quality synthetic training data. Our architecture's flexible routing mechanism enables synthetic data generation that matches observed real-world distributions, addressing a fundamental limitation of traditional approaches. Evaluation on a comprehensive set of real user queries demonstrates significant improvements in both function classification accuracy and API parameter selection. Models fine-tuned with our synthetic data consistently outperform traditional approaches, establishing new benchmarks for function calling tasks.

9. 【2505.10494】Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective

链接https://arxiv.org/abs/2505.10494

作者:Yutao Mou,Xiao Deng,Yuxiao Luo,Shikun Zhang,Wei Ye

类目:Computation and Language (cs.CL)

关键词:coding assistant applications, assistant applications driven, LLM code security, large language models, Code security

备注: Accepted by ACL2025 Main Conference

点击查看摘要

Abstract:Code security and usability are both essential for various coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on single evaluation task and paradigm, such as code completion and generation, lacking comprehensive assessment across dimensions like secure code generation, vulnerability repair and discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering various tasks such as code completion, vulnerability repair, vulnerability detection and classification, for comprehensive evaluation of LLM code security. Besides, we developed VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities in a more efficient and reliable way. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable codes well, they still tend to generate insecure codes and struggle with recognizing specific vulnerability types and performing repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.

10. 【2505.10493】CL-RAG: Bridging the Gap in Retrieval-Augmented Generation with Curriculum Learning

链接https://arxiv.org/abs/2505.10493

作者:Shaohan Wang,Licheng Zhang,Zheren Fu,Zhendong Mao

类目:Computation and Language (cs.CL)

关键词:Retrieval-Augmented Generation, RAG system, large language models, RAG, enhance the capabilities

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods focus on optimizing the retriever or generator in the RAG system by directly utilizing the top-k retrieved documents. However, the documents effectiveness are various significantly across user queries, i.e. some documents provide valuable knowledge while others totally lack critical information. It hinders the retriever and generator's adaptation during training. Inspired by human cognitive learning, curriculum learning trains models using samples progressing from easy to difficult, thus enhancing their generalization ability, and we integrate this effective paradigm to the training of the RAG system. In this paper, we propose a multi-stage Curriculum Learning based RAG system training framework, named CL-RAG. We first construct training data with multiple difficulty levels for the retriever and generator separately through sample evolution. Then, we train the model in stages based on the curriculum learning approach, thereby optimizing the overall performance and generalization of the RAG system more effectively. Our CL-RAG framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.

11. 【2505.10475】Parallel Scaling Law for Language Models

链接https://arxiv.org/abs/2505.10475

作者:Mouxiang Chen,Binyuan Hui,Zeyu Cui,Jiaxi Yang,Dayiheng Liu,Jianling Sun,Junyang Lin,Zhongxin Liu

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:commonly believed, commit a significant, significant space, scaling, scaling language models

备注

点击查看摘要

Abstract:It is commonly believed that scaling language models should commit a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce the third and more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference time. We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with $P$ parallel streams is similar to scaling the parameters by $O(\log P)$ while showing superior inference efficiency. For example, ParScale can use up to 22$\times$ less memory increase and 6$\times$ less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective for the role of computation in machine learning.

12. 【2505.10465】Superposition Yields Robust Neural Scaling

链接https://arxiv.org/abs/2505.10465

作者:Yizhou liu,Ziming Liu,Jeff Gore

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:larger models perform, today large language, large language models, model, model size

备注: 30 pages, 23 figures

点击查看摘要

Abstract:The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law -- the finding that loss decreases as a power law with model size -- remains unclear. Starting from two empirical principles -- that LLMs represent more things than the model dimensions (widths) they have (i.e., representations are superposed), and that words or concepts in language occur with varying frequencies -- we constructed a toy model to study the loss scaling with model size. We found that when superposition is weak, meaning only the most frequent features are represented without interference, the scaling of loss with model size depends on the underlying feature frequency; if feature frequencies follow a power law, so does the loss. In contrast, under strong superposition, where all features are represented but overlap with each other, the loss becomes inversely proportional to the model dimension across a wide range of feature frequency distributions. This robust scaling behavior is explained geometrically: when many more vectors are packed into a lower dimensional space, the interference (squared overlaps) between vectors scales inversely with that dimension. We then analyzed four families of open-sourced LLMs and found that they exhibit strong superposition and quantitatively match the predictions of our toy model. The Chinchilla scaling law turned out to also agree with our results. We conclude that representation superposition is an important mechanism underlying the observed neural scaling laws. We anticipate that these insights will inspire new training strategies and model architectures to achieve better performance with less computation and fewer parameters.

13. 【2505.10446】Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models

链接https://arxiv.org/abs/2505.10446

作者:Zemin Huang,Zhiyang Chen,Zijun Wang,Tiancheng Li,Guo-Jun Qi

类目:Computation and Language (cs.CL)

关键词:Chain of Lateral, diffusion language models, Lateral Thought, outcome-based Reinforcement Learning, Diffusion Chain

备注

点击查看摘要

Abstract:We introduce the \emph{Diffusion Chain of Lateral Thought (DCoLT)}, a reasoning framework for diffusion language models. DCoLT treats each intermediate step in the reverse diffusion process as a latent "thinking" action and optimizes the entire reasoning trajectory to maximize the reward on the correctness of the final answer with outcome-based Reinforcement Learning (RL). Unlike traditional Chain-of-Thought (CoT) methods that follow a causal, linear thinking process, DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We implement DCoLT on two representative Diffusion Language Models (DLMs). First, we choose SEDD as a representative continuous-time discrete diffusion model, where its concrete score derives a probabilistic policy to maximize the RL reward over the entire sequence of intermediate diffusion steps. We further consider the discrete-time masked diffusion language model -- LLaDA, and find that the order to predict and unmask tokens plays an essential role to optimize its RL action resulting from the ranking-based Unmasking Policy Module (UPM) defined by the Plackett-Luce model. Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9.8%, +5.7%, +11.4%, +19.5% on GSM8K, MATH, MBPP, and HumanEval.

14. 【2505.10413】Hierarchical Document Refinement for Long-context Retrieval-augmented Generation

链接https://arxiv.org/abs/2505.10413

作者:Jiajie Jin,Xiaoxi Li,Guanting Dong,Yuyao Zhang,Yutao Zhu,Yongkang Wu,Zhonghua Li,Qi Ye,Zhicheng Dou

类目:Computation and Language (cs.CL)

关键词:encounter long-context input, higher inference costs, long-context input scenarios, encounter long-context, long-context input

备注

点击查看摘要

Abstract:Real-world RAG applications often encounter long-context input scenarios, where redundant information and noise results in higher inference costs and reduced performance. To address these challenges, we propose LongRefiner, an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents. LongRefiner employs dual-level query analysis, hierarchical document structuring, and adaptive refinement through multi-task learning on a single foundation model. Experiments on seven QA datasets demonstrate that LongRefiner achieves competitive performance in various scenarios while using 10x fewer computational costs and latency compared to the best baseline. Further analysis validates that LongRefiner is scalable, efficient, and effective, providing practical insights for real-world long-text RAG applications. Our code is available at this https URL.

15. 【2505.10409】Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation

链接https://arxiv.org/abs/2505.10409

作者:Yue Guo,Jae Ho Sohn,Gondy Leroy,Trevor Cohen

类目:Computation and Language (cs.CL)

关键词:Plain language summaries, facilitating effective communication, making complex medical, medical information easier, complex medical information

备注

点击查看摘要

Abstract:Plain language summaries (PLSs) are essential for facilitating effective communication between clinicians and patients by making complex medical information easier for laypeople to understand and act upon. Large language models (LLMs) have recently shown promise in automating PLS generation, but their effectiveness in supporting health information comprehension remains unclear. Prior evaluations have generally relied on automated scores that do not measure understandability directly, or subjective Likert-scale ratings from convenience samples with limited generalizability. To address these gaps, we conducted a large-scale crowdsourced evaluation of LLM-generated PLSs using Amazon Mechanical Turk with 150 participants. We assessed PLS quality through subjective Likert-scale ratings focusing on simplicity, informativeness, coherence, and faithfulness; and objective multiple-choice comprehension and recall measures of reader understanding. Additionally, we examined the alignment between 10 automated evaluation metrics and human judgments. Our findings indicate that while LLMs can generate PLSs that appear indistinguishable from human-written ones in subjective evaluations, human-written PLSs lead to significantly better comprehension. Furthermore, automated evaluation metrics fail to reflect human judgment, calling into question their suitability for evaluating PLSs. This is the first study to systematically evaluate LLM-generated PLSs based on both reader preferences and comprehension outcomes. Our findings highlight the need for evaluation frameworks that move beyond surface-level quality and for generation methods that explicitly optimize for layperson comprehension.

16. 【2505.10402】Rethinking Repetition Problems of LLMs in Code Generation

链接https://arxiv.org/abs/2505.10402

作者:Yihong Dong,Yuchen Liu,Xue Jiang,Zhi Jin,Ge Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

关键词:neural language models, code generation, repetition, code, language models

备注: Accepted to ACL 2025 (main)

点击查看摘要

Abstract:With the advent of neural language models, the performance of code generation has been significantly boosted. However, the problem of repetitions during the generation process continues to linger. Previous work has primarily focused on content repetition, which is merely a fraction of the broader repetition problem in code generation. A more prevalent and challenging problem is structural repetition. In structural repetition, the repeated code appears in various patterns but possesses a fixed structure, which can be inherently reflected in grammar. In this paper, we formally define structural repetition and propose an efficient decoding approach called RPG, which stands for Repetition Penalization based on Grammar, to alleviate the repetition problems in code generation for LLMs. Specifically, RPG first leverages grammar rules to identify repetition problems during code generation, and then strategically decays the likelihood of critical tokens that contribute to repetitions, thereby mitigating them in code generation. To facilitate this study, we construct a new dataset CodeRepetEval to comprehensively evaluate approaches for mitigating the repetition problems in code generation. Extensive experimental results demonstrate that RPG substantially outperforms the best-performing baselines on CodeRepetEval dataset as well as HumanEval and MBPP benchmarks, effectively reducing repetitions and enhancing the quality of generated code.

17. 【2505.10389】Multi-domain Multilingual Sentiment Analysis in Industry: Predicting Aspect-based Opinion Quadruples

链接https://arxiv.org/abs/2505.10389

作者:Benjamin White,Anastasia Shimorina

类目:Computation and Language (cs.CL)

关键词:aspect-based sentiment analysis, paper explores, explores the design, sentiment analysis system, large language models

备注

点击查看摘要

Abstract:This paper explores the design of an aspect-based sentiment analysis system using large language models (LLMs) for real-world use. We focus on quadruple opinion extraction -- identifying aspect categories, sentiment polarity, targets, and opinion expressions from text data across different domains and languages. Using internal datasets, we investigate whether a single fine-tuned model can effectively handle multiple domain-specific taxonomies simultaneously. We demonstrate that a combined multi-domain model achieves performance comparable to specialized single-domain models while reducing operational complexity. We also share lessons learned for handling non-extractive predictions and evaluating various failure modes when developing LLM-based systems for structured prediction tasks.

18. 【2505.10356】Coherent Language Reconstruction from Brain Recordings with Flexible Multi-Modal Input Stimuli

链接https://arxiv.org/abs/2505.10356

作者:Chunyu Ye,Shaonan Wang

类目:Computation and Language (cs.CL)

关键词:activity offers valuable, offers valuable insights, enables promising applications, brain activity offers, brain-computer interaction

备注

点击查看摘要

Abstract:Decoding thoughts from brain activity offers valuable insights into human cognition and enables promising applications in brain-computer interaction. While prior studies have explored language reconstruction from fMRI data, they are typically limited to single-modality inputs such as images or audio. In contrast, human thought is inherently multimodal. To bridge this gap, we propose a unified and flexible framework for reconstructing coherent language from brain recordings elicited by diverse input modalities-visual, auditory, and textual. Our approach leverages visual-language models (VLMs), using modality-specific experts to jointly interpret information across modalities. Experiments demonstrate that our method achieves performance comparable to state-of-the-art systems while remaining adaptable and extensible. This work advances toward more ecologically valid and generalizable mind decoding.

19. 【2505.10354】LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations

链接https://arxiv.org/abs/2505.10354

作者:Yile Wang,Zhanyu Shen,Hui Huang

类目:Computation and Language (cs.CL)

关键词:natural language processing, field of natural, interpretable text embeddings, embeddings, interpretable

备注: ACL 2025 Findings

点击查看摘要

Abstract:Semantic text representation is a fundamental task in the field of natural language processing. Existing text embedding (e.g., SimCSE and LLM2Vec) have demonstrated excellent performance, but the values of each dimension are difficult to trace and interpret. Bag-of-words, as classic sparse interpretable embeddings, suffers from poor performance. Recently, Benara et al. (2024) propose interpretable text embeddings using large language models, which forms "0/1" embeddings based on responses to a series of questions. These interpretable text embeddings are typically high-dimensional (larger than 10,000). In this work, we propose Low-dimensional (lower than 500) Dense and Interpretable text embeddings with Relative representations (LDIR). The numerical values of its dimensions indicate semantic relatedness to different anchor texts through farthest point sampling, offering both semantic representation as well as a certain level of traceability and interpretability. We validate LDIR on multiple semantic textual similarity, retrieval, and clustering tasks. Extensive experimental results show that LDIR performs close to the black-box baseline models and outperforms the interpretable embeddings baselines with much fewer dimensions. Code is available at this https URL.

20. 【2505.10320】J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

链接https://arxiv.org/abs/2505.10320

作者:Chenxi Whitehouse,Tianlu Wang,Ping Yu,Xian Li,Jason Weston,Ilia Kulikov,Swarnadeep Saha

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:core solution, models, training, Abstract, judgment

备注: 10 pages, 8 tables, 11 figures

点击查看摘要

Abstract:The progress of AI is bottlenecked by the quality of evaluation, and powerful LLM-as-a-Judge models have proved to be a core solution. Improved judgment ability is enabled by stronger chain-of-thought reasoning, motivating the need to find the best recipes for training such models to think. In this work we introduce J1, a reinforcement learning approach to training such models. Our method converts both verifiable and non-verifiable prompts to judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. In particular, our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model. We provide analysis and ablations comparing Pairwise-J1 vs Pointwise-J1 models, offline vs online training recipes, reward strategies, seed prompts, and variations in thought length and content. We find that our models make better judgments by learning to outline evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.

21. 【2505.10292】StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation

链接https://arxiv.org/abs/2505.10292

作者:Daniel A. P. Oliveira,David Martins de Matos

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:storytelling systems struggle, Visual storytelling systems, frequently leading, maintain character identity, storytelling systems

备注: 31 pages, 14 figures

点击查看摘要

Abstract:Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed through grounding of characters, objects, and other entities on the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. Evaluation demonstrates a reduction from 4.06 to 3.56 (-12.3%) hallucinations on average per story when compared to a non-fine-tuned model.

22. 【2505.10282】From Questions to Clinical Recommendations: Large Language Models Driving Evidence-Based Clinical Decision Making

链接https://arxiv.org/abs/2505.10282

作者:Dubai Li,Nan Jiang,Kangping Huang,Ruiqi Tu,Shuyu Ouyang,Huayu Yu,Lin Qiao,Chen Yu,Tianshu Zhou,Danyang Tong,Qian Wang,Mengtao Li,Xiaofeng Zeng,Yu Tian,Xinping Tian,Jingsong Li

类目:Computation and Language (cs.CL)

关键词:reliable scientific foundations, derived from rigorous, data analysis, Clinical, Quicker

备注

点击查看摘要

Abstract:Clinical evidence, derived from rigorous research and data analysis, provides healthcare professionals with reliable scientific foundations for informed decision-making. Integrating clinical evidence into real-time practice is challenging due to the enormous workload, complex professional processes, and time constraints. This highlights the need for tools that automate evidence synthesis to support more efficient and accurate decision making in clinical settings. This study introduces Quicker, an evidence-based clinical decision support system powered by large language models (LLMs), designed to automate evidence synthesis and generate clinical recommendations modeled after standard clinical guideline development processes. Quicker implements a fully automated chain that covers all phases, from questions to clinical recommendations, and further enables customized decision-making through integrated tools and interactive user interfaces. To evaluate Quicker's capabilities, we developed the Q2CRBench-3 benchmark dataset, based on clinical guideline development records for three different diseases. Experimental results highlighted Quicker's strong performance, with fine-grained question decomposition tailored to user preferences, retrieval sensitivities comparable to human experts, and literature screening performance approaching comprehensive inclusion of relevant studies. In addition, Quicker-assisted evidence assessment effectively supported human reviewers, while Quicker's recommendations were more comprehensive and logically coherent than those of clinicians. In system-level testing, collaboration between a single reviewer and Quicker reduced the time required for recommendation development to 20-40 minutes. In general, our findings affirm the potential of Quicker to help physicians make quicker and more reliable evidence-based clinical decisions.

23. 【2505.10261】he Evolving Landscape of Generative Large Language Models and Traditional Natural Language Processing in Medicine

链接https://arxiv.org/abs/2505.10261

作者:Rui Yang,Huitao Li,Matthew Yu Heng Wong,Yuhe Ke,Xin Li,Kunyu Yu,Jingchi Liao,Jonathan Chong Kai Liew,Sabarinath Vinod Nair,Jasmine Chiat Ling Ong,Irene Li,Douglas Teodoro,Chuan Hong,Daniel Shu Wei Ting,Nan Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Natural language processing, large language models, Natural language, applied to medicine, prominent recently

备注

点击查看摘要

Abstract:Natural language processing (NLP) has been traditionally applied to medicine, and generative large language models (LLMs) have become prominent recently. However, the differences between them across different medical tasks remain underexplored. We analyzed 19,123 studies, finding that generative LLMs demonstrate advantages in open-ended tasks, while traditional NLP dominates in information extraction and analysis tasks. As these technologies advance, ethical use of them is essential to ensure their potential in medical applications.

24. 【2505.10260】Comparing LLM Text Annotation Skills: A Study on Human Rights Violations in Social Media Data

链接https://arxiv.org/abs/2505.10260

作者:Poli Apollinaire Nemkova,Solomon Ubani,Mark V. Albert

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:increasingly sophisticated natural, demonstrated remarkable potential, requiring nuanced textual, including tasks requiring, natural language processing

备注

点击查看摘要

Abstract:In the era of increasingly sophisticated natural language processing (NLP) systems, large language models (LLMs) have demonstrated remarkable potential for diverse applications, including tasks requiring nuanced textual understanding and contextual reasoning. This study investigates the capabilities of multiple state-of-the-art LLMs - GPT-3.5, GPT-4, LLAMA3, Mistral 7B, and Claude-2 - for zero-shot and few-shot annotation of a complex textual dataset comprising social media posts in Russian and Ukrainian. Specifically, the focus is on the binary classification task of identifying references to human rights violations within the dataset. To evaluate the effectiveness of these models, their annotations are compared against a gold standard set of human double-annotated labels across 1000 samples. The analysis includes assessing annotation performance under different prompting conditions, with prompts provided in both English and Russian. Additionally, the study explores the unique patterns of errors and disagreements exhibited by each model, offering insights into their strengths, limitations, and cross-linguistic adaptability. By juxtaposing LLM outputs with human annotations, this research contributes to understanding the reliability and applicability of LLMs for sensitive, domain-specific tasks in multilingual contexts. It also sheds light on how language models handle inherently subjective and context-dependent judgments, a critical consideration for their deployment in real-world scenarios.

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2505.10260 [cs.CL]

(or
arXiv:2505.10260v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2505.10260

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
25. 【2505.10231】On the Interplay of Human-AI Alignment,Fairness, and Performance Trade-offs in Medical Imaging

链接https://arxiv.org/abs/2505.10231

作者:Haozhe Luo,Ziyu Zhou,Zixin Shu,Aurélie Pahud de Mortanges,Robert Berke,Mauricio Reyes

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Deep neural networks, neural networks excel, Deep neural, prone to biases, demographic groups

备注

点击查看摘要

Abstract:Deep neural networks excel in medical imaging but remain prone to biases, leading to fairness gaps across demographic groups. We provide the first systematic exploration of Human-AI alignment and fairness in this domain. Our results show that incorporating human insights consistently reduces fairness gaps and enhances out-of-domain generalization, though excessive alignment can introduce performance trade-offs, emphasizing the need for calibrated strategies. These findings highlight Human-AI alignment as a promising approach for developing fair, robust, and generalizable medical AI systems, striking a balance between expert guidance and automated efficiency. Our code is available at this https URL.

26. 【2505.10222】ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention

链接https://arxiv.org/abs/2505.10222

作者:Jintian Shao,Hongyi Huang,Jiayi Wu,Beiwen Zhang,ZhiYu Wu,You Shan,MingKai Zheng

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Transformer models rely, capture token dependencies, Transformer models, rely on self-attention, self-attention to capture

备注

点击查看摘要

Abstract:Transformer models rely on self-attention to capture token dependencies but face challenges in effectively integrating positional information while allowing multi-head attention (MHA) flexibility. Prior methods often model semantic and positional differences disparately or apply uniform positional adjustments across heads, potentially limiting representational capacity. This paper introduces ComplexFormer, featuring Complex Multi-Head Attention-CMHA. CMHA empowers each head to independently model semantic and positional differences unified within the complex plane, representing interactions as rotations and scaling. ComplexFormer incorporates two key improvements: (1) a per-head Euler transformation, converting real-valued query/key projections into polar-form complex vectors for head-specific complex subspace operation; and (2) a per-head adaptive differential rotation mechanism, exp[i(Adapt(ASmn,i) + Delta(Pmn),i)], allowing each head to learn distinct strategies for integrating semantic angle differences (ASmn,i) with relative positional encodings (Delta(Pmn),i). Extensive experiments on language modeling, text generation, code generation, and mathematical reasoning show ComplexFormer achieves superior performance, significantly lower generation perplexity , and improved long-context coherence compared to strong baselines like RoPE-Transformers. ComplexFormer demonstrates strong parameter efficiency, offering a more expressive, adaptable attention mechanism.

27. 【2505.10218】RAIDEN-R1: Improving Role-awareness of LLMs via GRPO with Verifiable Reward

链接https://arxiv.org/abs/2505.10218

作者:Zongsheng Wang,Kaili Sun,Bowen Wu,Qun Yu,Ying Li,Baoxun Wang

类目:Computation and Language (cs.CL)

关键词:Role-playing conversational agents, face persistent challenges, Role-playing conversational, conversational agents, face persistent

备注

点击查看摘要

Abstract:Role-playing conversational agents (RPCAs) face persistent challenges in maintaining role consistency. To address this, we propose RAIDEN-R1, a novel reinforcement learning framework that integrates Verifiable Role-Awareness Reward (VRAR). The method introduces both singular and multi-term mining strategies to generate quantifiable rewards by assessing role-specific keys. Additionally, we construct a high-quality, role-aware Chain-of-Thought dataset through multi-LLM collaboration, and implement experiments to enhance reasoning coherence. Experiments on the RAIDEN benchmark demonstrate RAIDEN-R1's superiority: our 14B-GRPO model achieves 88.04% and 88.65% accuracy on Script-Based Knowledge and Conversation Memory metrics, respectively, outperforming baseline models while maintaining robustness. Case analyses further reveal the model's enhanced ability to resolve conflicting contextual cues and sustain first-person narrative consistency. This work bridges the non-quantifiability gap in RPCA training and provides insights into role-aware reasoning patterns, advancing the development of RPCAs.

28. 【2505.10202】VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits

链接https://arxiv.org/abs/2505.10202

作者:Jintian Shao,Hongyi Huang,Jiayi Wu,YiMing Cheng,ZhiYu Wu,You Shan,MingKai Zheng

类目:Computation and Language (cs.CL)

关键词:achieved remarkable success, face significant computational, Large Language Models, memory challenges, achieved remarkable

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success but face significant computational and memory challenges, particularly due to their extensive output vocabularies. The final linear projection layer, mapping hidden states to vocabulary-sized logits, often constitutes a substantial portion of the model's parameters and computational cost during inference. Existing methods like adaptive softmax or hierarchical softmax introduce structural complexities. In this paper, we propose VQ-Logits, a novel approach that leverages Vector Quantization (VQ) to drastically reduce the parameter count and computational load of the LLM output layer. VQ-Logits replaces the large V * dmodel output embedding matrix with a small, shared codebook of K embedding vectors (K V ). Each token in the vocabulary is mapped to one of these K codebook vectors. The LLM predicts logits over this compact codebook, which are then efficiently "scattered" to the full vocabulary space using the learned or preassigned mapping. We demonstrate through extensive experiments on standard language modeling benchmarks (e.g., WikiText-103, C4) that VQ-Logits can achieve up to 99% parameter reduction in the output layer and 6x speedup in logit computation, with only a marginal 4% increase in perplexity compared to full softmax baselines. We further provide detailed ablation studies on codebook size, initialization, and learning strategies, showcasing the robustness and effectiveness of our approach.

29. 【2505.10185】he CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

链接https://arxiv.org/abs/2505.10185

作者:Seongyun Lee,Seungone Kim,Minju Seo,Yongrae Jo,Dongyoung Go,Hyeonbin Hwang,Jinho Park,Xiang Yue,Sean Welleck,Graham Neubig,Moontae Lee,Minjoon Seo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:capabilities remains limited, modern large language, reasoning strategies underlying, large language models, remains limited

备注: Work in progress

点击查看摘要

Abstract:Long chain-of-thought (CoT) is an essential ingredient in effective usage of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limited. While some prior works have attempted to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. In this work, we introduce the CoT Encyclopedia, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that this framework produces more interpretable and comprehensive analyses than existing methods. Moreover, we demonstrate that this understanding enables performance gains: we can predict which strategy a model is likely to use and guide it toward more effective alternatives. Finally, we provide practical insights, such as that training data format (e.g., free-form vs. multiple-choice) has a far greater impact on reasoning behavior than data domain, underscoring the importance of format-aware model design.

30. 【2505.10182】Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning

链接https://arxiv.org/abs/2505.10182

作者:Yoichi Ishibashi,Taro Yano,Masafumi Oyamada

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, demonstrated significant improvements, Language Models, reinforcement learning

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated significant improvements in reasoning capabilities through supervised fine-tuning and reinforcement learning. However, when training reasoning models, these approaches are primarily applicable to specific domains such as mathematics and programming, which imposes fundamental constraints on the breadth and scalability of training data. In contrast, continual pretraining (CPT) offers the advantage of not requiring task-specific signals. Nevertheless, how to effectively synthesize training data for reasoning and how such data affect a wide range of domains remain largely unexplored. This study provides a detailed evaluation of Reasoning CPT, a form of CPT that uses synthetic data to reconstruct the hidden thought processes underlying texts, based on the premise that texts are the result of the author's thinking process. Specifically, we apply Reasoning CPT to Gemma2-9B using synthetic data with hidden thoughts derived from STEM and Law corpora, and compare it to standard CPT on the MMLU benchmark. Our analysis reveals that Reasoning CPT consistently improves performance across all evaluated domains. Notably, reasoning skills acquired in one domain transfer effectively to others; the performance gap with conventional methods widens as problem difficulty increases, with gains of up to 8 points on the most challenging problems. Furthermore, models trained with hidden thoughts learn to adjust the depth of their reasoning according to problem difficulty.

31. 【2505.10143】GE-Chat: A Graph Enhanced RAG Framework for Evidential Response Generation of LLMs

链接https://arxiv.org/abs/2505.10143

作者:Longchao Da,Parth Mitesh Shah,Kuan-Ru Liou,Jiaxing Zhang,Hua Wei

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, human decision-making processes, Language Models, decision-making processes

备注: 5 pages, 4 figures, accepted to IJCAI2025 demo track

点击查看摘要

Abstract:Large Language Models are now key assistants in human decision-making processes. However, a common note always seems to follow: "LLMs can make mistakes. Be careful with important info." This points to the reality that not all outputs from LLMs are dependable, and users must evaluate them manually. The challenge deepens as hallucinated responses, often presented with seemingly plausible explanations, create complications and raise trust issues among users. To tackle such issue, this paper proposes GE-Chat, a knowledge Graph enhanced retrieval-augmented generation framework to provide Evidence-based response generation. Specifically, when the user uploads a material document, a knowledge graph will be created, which helps construct a retrieval-augmented agent, enhancing the agent's responses with additional knowledge beyond its training corpus. Then we leverage Chain-of-Thought (CoT) logic generation, n-hop sub-graph searching, and entailment-based sentence generation to realize accurate evidence retrieval. We demonstrate that our method improves the existing models' performance in terms of identifying the exact evidence in a free-form context, providing a reliable way to examine the resources of LLM's conclusion and help with the judgment of the trustworthiness.

32. 【2505.10118】Why 1 + 1 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

链接https://arxiv.org/abs/2505.10118

作者:Yangfu Li,Hongjian Zhan,Tianyi Chen,Qi Liu,Yue Lu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:methods target prompt, target prompt alignment, varying relative importance, Existing visual token, pruning methods target

备注: 31 pages,9 figures,conference

点击查看摘要

Abstract:Existing visual token pruning methods target prompt alignment and visual preservation with static strategies, overlooking the varying relative importance of these objectives across tasks, which leads to inconsistent performance. To address this, we derive the first closed-form error bound for visual token pruning based on the Hausdorff distance, uniformly characterizing the contributions of both objectives. Moreover, leveraging $\epsilon$-covering theory, we reveal an intrinsic trade-off between these objectives and quantify their optimal attainment levels under a fixed budget. To practically handle this trade-off, we propose Multi-Objective Balanced Covering (MoB), which reformulates visual token pruning as a bi-objective covering problem. In this framework, the attainment trade-off reduces to budget allocation via greedy radius trading. MoB offers a provable performance bound and linear scalability with respect to the number of input visual tokens, enabling adaptation to challenging pruning scenarios. Extensive experiments show that MoB preserves 96.4% of performance for LLaVA-1.5-7B using only 11.1% of the original visual tokens and accelerates LLaVA-Next-7B by 1.3-1.5$\times$ with negligible performance loss. Additionally, evaluations on Qwen2-VL and Video-LLaVA confirm that MoB integrates seamlessly into advanced MLLMs and diverse vision-language tasks.

33. 【2505.10117】Learning Virtual Machine Scheduling in Cloud Computing through Language Agents

链接https://arxiv.org/abs/2505.10117

作者:JieHao Wu,Ziwei Wang,Junjie Sheng,Wenhao Li,Xiangfei Wang,Jun Luo

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Multidimensional Bin Packing, Online Dynamic Multidimensional, Dynamic Multidimensional Bin, Bin Packing, typical Online Dynamic

备注

点击查看摘要

Abstract:In cloud services, virtual machine (VM) scheduling is a typical Online Dynamic Multidimensional Bin Packing (ODMBP) problem, characterized by large-scale complexity and fluctuating demands. Traditional optimization methods struggle to adapt to real-time changes, domain-expert-designed heuristic approaches suffer from rigid strategies, and existing learning-based methods often lack generalizability and interpretability. To address these limitations, this paper proposes a hierarchical language agent framework named MiCo, which provides a large language model (LLM)-driven heuristic design paradigm for solving ODMBP. Specifically, ODMBP is formulated as a Semi-Markov Decision Process with Options (SMDP-Option), enabling dynamic scheduling through a two-stage architecture, i.e., Option Miner and Option Composer. Option Miner utilizes LLMs to discover diverse and useful non-context-aware strategies by interacting with constructed environments. Option Composer employs LLMs to discover a composing strategy that integrates the non-context-aware strategies with the contextual ones. Extensive experiments on real-world enterprise datasets demonstrate that MiCo achieves a 96.9\% competitive ratio in large-scale scenarios involving more than 10,000 virtual machines. It maintains high performance even under nonstationary request flows and diverse configurations, thus validating its effectiveness in complex and large-scale cloud environments.

34. 【2505.10113】What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs

链接https://arxiv.org/abs/2505.10113

作者:Xinlan Yan,Di Wu,Yibin Lei,Christof Monz,Iacer Calixto

类目:Computation and Language (cs.CL)

关键词:English medical question-answering, benchmarking large language, large language models, fine-grained clinical specialties, dataset for benchmarking

备注

点击查看摘要

Abstract:In this paper, we introduce S-MedQA, an English medical question-answering (QA) dataset for benchmarking large language models in fine-grained clinical specialties. We use S-MedQA to check the applicability of a popular hypothesis related to knowledge injection in the knowledge-intense scenario of medical QA, and show that: 1) training on data from a speciality does not necessarily lead to best performance on that specialty and 2) regardless of the specialty fine-tuned on, token probabilities of clinically relevant terms for all specialties increase consistently. Thus, we believe improvement gains come mostly from domain shifting (e.g., general to medical) rather than knowledge injection and suggest rethinking the role of fine-tuning data in the medical domain. We release S-MedQA and all code needed to reproduce all our experiments to the research community.

35. 【2505.10093】From Text to Network: Constructing a Knowledge Graph of Taiwan-Based China Studies Using Generative AI

链接https://arxiv.org/abs/2505.10093

作者:Hsuan-Lei Shao

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Taiwanese China Studies, Taiwanese China, Mainland China, unique geopolitical position, long standing academic

备注: 4 pages, 4 figures

点击查看摘要

Abstract:Taiwanese China Studies (CS) has developed into a rich, interdisciplinary research field shaped by the unique geopolitical position and long standing academic engagement with Mainland China. This study responds to the growing need to systematically revisit and reorganize decades of Taiwan based CS scholarship by proposing an AI assisted approach that transforms unstructured academic texts into structured, interactive knowledge representations. We apply generative AI (GAI) techniques and large language models (LLMs) to extract and standardize entity relation triples from 1,367 peer reviewed CS articles published between 1996 and 2019. These triples are then visualized through a lightweight this http URL based system, forming the foundation of a domain specific knowledge graph and vector database for the field. This infrastructure allows users to explore conceptual nodes and semantic relationships across the corpus, revealing previously uncharted intellectual trajectories, thematic clusters, and research gaps. By decomposing textual content into graph structured knowledge units, our system enables a paradigm shift from linear text consumption to network based knowledge navigation. In doing so, it enhances scholarly access to CS literature while offering a scalable, data driven alternative to traditional ontology construction. This work not only demonstrates how generative AI can augment area studies and digital humanities but also highlights its potential to support a reimagined scholarly infrastructure for regional knowledge systems.

36. 【2505.10089】XRAG: Cross-lingual Retrieval-Augmented Generation

链接https://arxiv.org/abs/2505.10089

作者:Wei Liu,Sony Trenous,Leonardo F. R. Ribeiro,Bill Byrne,Felix Hieber

类目:Computation and Language (cs.CL)

关键词:propose XRAG, designed to evaluate, cross-lingual Retrieval-Augmented Generation, XRAG, Retrieval-Augmented Generation

备注

点击查看摘要

Abstract:We propose XRAG, a novel benchmark designed to evaluate the generation abilities of LLMs in cross-lingual Retrieval-Augmented Generation (RAG) settings where the user language does not match the retrieval results. XRAG is constructed from recent news articles to ensure that its questions require external knowledge to be answered. It covers the real-world scenarios of monolingual and multilingual retrieval, and provides relevancy annotations for each retrieved document. Our novel dataset construction pipeline results in questions that require complex reasoning, as evidenced by the significant gap between human and LLM performance. Consequently, XRAG serves as a valuable benchmark for studying LLM reasoning abilities, even before considering the additional cross-lingual complexity. Experimental results on five LLMs uncover two previously unreported challenges in cross-lingual RAG: 1) in the monolingual retrieval setting, all evaluated models struggle with response language correctness; 2) in the multilingual retrieval setting, the main challenge lies in reasoning over retrieved information across languages rather than generation of non-English text.

37. 【2505.10081】Designing and Contextualising Probes for African Languages

链接https://arxiv.org/abs/2505.10081

作者:Wisdom Aduah,Francois Meyer

类目:Computation and Language (cs.CL)

关键词:Pretrained language models, advances remain unclear, African languages, Pretrained language, African

备注

点击查看摘要

Abstract:Pretrained language models (PLMs) for African languages are continually improving, but the reasons behind these advances remain unclear. This paper presents the first systematic investigation into probing PLMs for linguistic knowledge about African languages. We train layer-wise probes for six typologically diverse African languages to analyse how linguistic features are distributed. We also design control tasks, a way to interpret probe performance, for the MasakhaPOS dataset. We find PLMs adapted for African languages to encode more linguistic information about target languages than massively multilingual PLMs. Our results reaffirm previous findings that token-level syntactic information concentrates in middle-to-last layers, while sentence-level semantic information is distributed across all layers. Through control tasks and probing baselines, we confirm that performance reflects the internal knowledge of PLMs rather than probe memorisation. Our study applies established interpretability techniques to African-language PLMs. In doing so, we highlight the internal mechanisms underlying the success of strategies like active learning and multilingual adaptation.

38. 【2505.10066】Dark LLMs: The Growing Threat of Unaligned AI Models

链接https://arxiv.org/abs/2505.10066

作者:Michael Fire,Yitzhak Elbazis,Adi Wasenstein,Lior Rokach

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, rapidly reshape modern, reshape modern life, rapidly reshape

备注

点击查看摘要

Abstract:Large Language Models (LLMs) rapidly reshape modern life, advancing fields from healthcare to education and beyond. However, alongside their remarkable capabilities lies a significant threat: the susceptibility of these models to jailbreaking. The fundamental vulnerability of LLMs to jailbreak attacks stems from the very data they learn from. As long as this training data includes unfiltered, problematic, or 'dark' content, the models can inherently learn undesirable patterns or weaknesses that allow users to circumvent their intended safety controls. Our research identifies the growing threat posed by dark LLMs models deliberately designed without ethical guardrails or modified through jailbreak techniques. In our research, we uncovered a universal jailbreak attack that effectively compromises multiple state-of-the-art models, enabling them to answer almost any question and produce harmful outputs upon request. The main idea of our attack was published online over seven months ago. However, many of the tested LLMs were still vulnerable to this attack. Despite our responsible disclosure efforts, responses from major LLM providers were often inadequate, highlighting a concerning gap in industry practices regarding AI safety. As model training becomes more accessible and cheaper, and as open-source LLMs proliferate, the risk of widespread misuse escalates. Without decisive intervention, LLMs may continue democratizing access to dangerous knowledge, posing greater risks than anticipated.

39. 【2505.10063】CAFE: Retrieval Head-based Coarse-to-Fine Information Seeking to Enhance Multi-Document QA Capability

链接https://arxiv.org/abs/2505.10063

作者:Han Peng,Jinhao Jiang,Zican Dong,Wayne Xin Zhao,Lei Fang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Advancements in Large, input context length, Large Language, Language Models

备注

点击查看摘要

Abstract:Advancements in Large Language Models (LLMs) have extended their input context length, yet they still struggle with retrieval and reasoning in long-context inputs. Existing methods propose to utilize the prompt strategy and retrieval head to alleviate this limitation. However, they still face challenges in balancing retrieval precision and recall, impacting their efficacy in answering questions. To address this, we introduce $\textbf{CAFE}$, a two-stage coarse-to-fine method to enhance multi-document question-answering capacities. By gradually eliminating the negative impacts of background and distracting documents, CAFE makes the responses more reliant on the evidence documents. Initially, a coarse-grained filtering method leverages retrieval heads to identify and rank relevant documents. Then, a fine-grained steering method guides attention to the most relevant content. Experiments across benchmarks show CAFE outperforms baselines, achieving up to 22.1% and 13.7% SubEM improvement over SFT and RAG methods on the Mistral model, respectively.

40. 【2505.10013】DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs

链接https://arxiv.org/abs/2505.10013

作者:Lake Yin,Fan Huang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, past few years, training data

备注: 7 pages, 1 figure

点击查看摘要

Abstract:As Large Language Models (LLMs) have risen in prominence over the past few years, there has been concern over the potential biases in LLMs inherited from the training data. Previous studies have examined how LLMs exhibit implicit bias, such as when response generation changes when different social contexts are introduced. We argue that this implicit bias is not only an ethical, but also a technical issue, as it reveals an inability of LLMs to accommodate extraneous information. However, unlike other measures of LLM intelligence, there are no standard methods to benchmark this specific subset of LLM bias. To bridge this gap, we developed a method for calculating an easily interpretable benchmark, DIF (Demographic Implicit Fairness), by evaluating preexisting LLM logic and math problem datasets with sociodemographic personas. We demonstrate that this method can statistically validate the presence of implicit bias in LLM behavior and find an inverse trend between question answering accuracy and implicit bias, supporting our argument.

41. 【2505.09949】Advanced Crash Causation Analysis for Freeway Safety: A Large Language Model Approach to Identifying Key Contributing Factors

链接https://arxiv.org/abs/2505.09949

作者:Ahmed S. Abdelrahman,Mohamed Abdel-Aty,Samgyu Yang,Abdulrahman Faden

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Applications (stat.AP)

关键词:severity is essential, strategies to mitigate, mitigate their severity, traffic, traffic safety

备注

点击查看摘要

Abstract:Understanding the factors contributing to traffic crashes and developing strategies to mitigate their severity is essential. Traditional statistical methods and machine learning models often struggle to capture the complex interactions between various factors and the unique characteristics of each crash. This research leverages large language model (LLM) to analyze freeway crash data and provide crash causation analysis accordingly. By compiling 226 traffic safety studies related to freeway crashes, a training dataset encompassing environmental, driver, traffic, and geometric design factors was created. The Llama3 8B model was fine-tuned using QLoRA to enhance its understanding of freeway crashes and their contributing factors, as covered in these studies. The fine-tuned Llama3 8B model was then used to identify crash causation without pre-labeled data through zero-shot classification, providing comprehensive explanations to ensure that the identified causes were reasonable and aligned with existing research. Results demonstrate that LLMs effectively identify primary crash causes such as alcohol-impaired driving, speeding, aggressive driving, and driver inattention. Incorporating event data, such as road maintenance, offers more profound insights. The model's practical applicability and potential to improve traffic safety measures were validated by a high level of agreement among researchers in the field of traffic safety, as reflected in questionnaire results with 88.89%. This research highlights the complex nature of traffic crashes and how LLMs can be used for comprehensive analysis of crash causation and other contributing factors. Moreover, it provides valuable insights and potential countermeasures to aid planners and policymakers in developing more effective and efficient traffic safety practices.

42. 【2505.09945】Personalizing Large Language Models using Retrieval Augmented Generation and Knowledge Graph

链接https://arxiv.org/abs/2505.09945

作者:Deeksha Prahlad,Chanhee Lee,Dongha Kim,Hokeun Kim

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:large language models, allowed numerous applications, language models, numerous applications, conversational assistants

备注: To appear in the Companion Proceedings of the ACM Web Conference 2025 (WWW Companion '25)

点击查看摘要

Abstract:The advent of large language models (LLMs) has allowed numerous applications, including the generation of queried responses, to be leveraged in chatbots and other conversational assistants. Being trained on a plethora of data, LLMs often undergo high levels of over-fitting, resulting in the generation of extra and incorrect data, thus causing hallucinations in output generation. One of the root causes of such problems is the lack of timely, factual, and personalized information fed to the LLM. In this paper, we propose an approach to address these problems by introducing retrieval augmented generation (RAG) using knowledge graphs (KGs) to assist the LLM in personalized response generation tailored to the users. KGs have the advantage of storing continuously updated factual information in a structured way. While our KGs can be used for a variety of frequently updated personal data, such as calendar, contact, and location data, we focus on calendar data in this paper. Our experimental results show that our approach works significantly better in understanding personal information and generating accurate responses compared to the baseline LLMs using personal data as text inputs, with a moderate reduction in response time.

43. 【2505.09930】Rethinking Prompt Optimizers: From Prompt Merits to Optimization

链接https://arxiv.org/abs/2505.09930

作者:Zixiao Zhu,Hanzhang Zhou,Zijian Feng,Tianjiao Li,Chua Jia Jim Deryl,Mak Lee Onn,Gee Wah Ng,Kezhi Mao

类目:Computation and Language (cs.CL)

关键词:enabling performance improvements, fine-tuning large language, altering model weights, large language models, offers a practical

备注: 20 pages, 14 figures

点击查看摘要

Abstract:Prompt optimization (PO) offers a practical alternative to fine-tuning large language models (LLMs), enabling performance improvements without altering model weights. Existing methods typically rely on advanced, large-scale LLMs like GPT-4 to generate optimized prompts. However, due to limited downward compatibility, verbose, instruction-heavy prompts from advanced LLMs can overwhelm lightweight inference models and degrade response quality. In this work, we rethink prompt optimization through the lens of interpretable design. We first identify a set of model-agnostic prompt quality merits and empirically validate their effectiveness in enhancing prompt and response quality. We then introduce MePO, a merit-guided, lightweight, and locally deployable prompt optimizer trained on our preference dataset built from merit-aligned prompts generated by a lightweight LLM. Unlike prior work, MePO avoids online optimization reliance, reduces cost and privacy concerns, and, by learning clear, interpretable merits, generalizes effectively to both large-scale and lightweight inference models. Experiments demonstrate that MePO achieves better results across diverse tasks and model types, offering a scalable and robust solution for real-world deployment. Our model and dataset are available at: this https URL

44. 【2505.09924】From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models

链接https://arxiv.org/abs/2505.09924

作者:Yidan Wang,Yubing Ren,Yanan Cao,Binxing Fang

类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR)

关键词:Large Language Models, Large Language, rise of Large, Language Models, promising solution

备注

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) has heightened concerns about the misuse of AI-generated text, making watermarking a promising solution. Mainstream watermarking schemes for LLMs fall into two categories: logits-based and sampling-based. However, current schemes entail trade-offs among robustness, text quality, and security. To mitigate this, we integrate logits-based and sampling-based schemes, harnessing their respective strengths to achieve synergy. In this paper, we propose a versatile symbiotic watermarking framework with three strategies: serial, parallel, and hybrid. The hybrid framework adaptively embeds watermarks using token entropy and semantic entropy, optimizing the balance between detectability, robustness, text quality, and security. Furthermore, we validate our approach through comprehensive experiments on various datasets and models. Experimental results indicate that our method outperforms existing baselines and achieves state-of-the-art (SOTA) performance. We believe this framework provides novel insights into diverse watermarking paradigms. Our code is available at \href{this https URL}{this https URL}.

45. 【2505.09921】PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization

链接https://arxiv.org/abs/2505.09921

作者:Yidan Wang,Yanan Cao,Yubing Ren,Fang Fang,Zheng Lin,Binxing Fang

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, pose inherent privacy, domains but pose

备注

点击查看摘要

Abstract:Large Language Models (LLMs) excel in various domains but pose inherent privacy risks. Existing methods to evaluate privacy leakage in LLMs often use memorized prefixes or simple instructions to extract data, both of which well-alignment models can easily block. Meanwhile, Jailbreak attacks bypass LLM safety mechanisms to generate harmful content, but their role in privacy scenarios remains underexplored. In this paper, we examine the effectiveness of jailbreak attacks in extracting sensitive information, bridging privacy leakage and jailbreak attacks in LLMs. Moreover, we propose PIG, a novel framework targeting Personally Identifiable Information (PII) and addressing the limitations of current jailbreak methods. Specifically, PIG identifies PII entities and their types in privacy queries, uses in-context learning to build a privacy context, and iteratively updates it with three gradient-based strategies to elicit target PII. We evaluate PIG and existing jailbreak methods using two privacy-related datasets. Experiments on four white-box and two black-box LLMs show that PIG outperforms baseline methods and achieves state-of-the-art (SoTA) results. The results underscore significant privacy risks in LLMs, emphasizing the need for stronger safeguards. Our code is availble at \href{this https URL}{this https URL}.

46. 【2505.09902】Crossing Borders Without Crossing Boundaries: How Sociolinguistic Awareness Can Optimize User Engagement with Localized Spanish AI Models Across Hispanophone Countries

链接https://arxiv.org/abs/2505.09902

作者:Martin Capdevila,Esteban Villa Turek,Ellen Karina Chumbe Fernandez,Luis Felipe Polo Galvez,Luis Cadavid,Andrea Marroquin,Rebeca Vargas Quesada,Johanna Crew,Nicole Vallejo Galarraga,Christopher Rodriguez,Diego Gutierrez,Radhi Datla

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, America and Spain, Latin America, Large

备注

点击查看摘要

Abstract:Large language models are, by definition, based on language. In an effort to underscore the critical need for regional localized models, this paper examines primary differences between variants of written Spanish across Latin America and Spain, with an in-depth sociocultural and linguistic contextualization therein. We argue that these differences effectively constitute significant gaps in the quotidian use of Spanish among dialectal groups by creating sociolinguistic dissonances, to the extent that locale-sensitive AI models would play a pivotal role in bridging these divides. In doing so, this approach informs better and more efficient localization strategies that also serve to more adequately meet inclusivity goals, while securing sustainable active daily user growth in a major low-risk investment geographic area. Therefore, implementing at least the proposed five sub variants of Spanish addresses two lines of action: to foment user trust and reliance on AI language models while also demonstrating a level of cultural, historical, and sociolinguistic awareness that reflects positively on any internationalization strategy.

47. 【2505.09901】Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Tasks

链接https://arxiv.org/abs/2505.09901

作者:Ziyuan Zhang,Darcy Wang,Ningyuan Chen,Rodrigo Mansur,Vahid Sarhangian

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:Large language models, Large language, simulate or automate, automate human behavior, LLMs

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used to simulate or automate human behavior in complex sequential decision-making tasks. A natural question is then whether LLMs exhibit similar decision-making behavior to humans, and can achieve comparable (or superior) performance. In this work, we focus on the exploration-exploitation (EE) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty. We employ canonical multi-armed bandit (MAB) tasks introduced in the cognitive science and psychiatry literature to conduct a comparative study of the EE strategies of LLMs, humans, and MAB algorithms. We use interpretable choice models to capture the EE strategies of the agents and investigate how explicit reasoning, through both prompting strategies and reasoning-enhanced models, shapes LLM decision-making. We find that reasoning shifts LLMs toward more human-like behavior, characterized by a mix of random and directed exploration. In simple stationary tasks, reasoning-enabled LLMs exhibit similar levels of random and directed exploration compared to humans. However, in more complex, non-stationary environments, LLMs struggle to match human adaptability, particularly in effective directed exploration, despite achieving similar regret in certain scenarios. Our findings highlight both the promise and limits of LLMs as simulators of human behavior and tools for automated decision-making and point to potential areas of improvements.

48. 【2505.09855】Predictability Shapes Adaptation: An Evolutionary Perspective on Modes of Learning in Transformers

链接https://arxiv.org/abs/2505.09855

作者:Alexander Y. Ku,Thomas L. Griffiths,Stephanie C.Y. Chan

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Transformer models learn, weight modification, model weights, models learn, ICL

备注

点击查看摘要

Abstract:Transformer models learn in two distinct modes: in-weights learning (IWL), encoding knowledge into model weights, and in-context learning (ICL), adapting flexibly to context without weight modification. To better understand the interplay between these learning modes, we draw inspiration from evolutionary biology's analogous adaptive strategies: genetic encoding (akin to IWL, adapting over generations and fixed within an individual's lifetime) and phenotypic plasticity (akin to ICL, enabling flexible behavioral responses to environmental cues). In evolutionary biology, environmental predictability dictates the balance between these strategies: stability favors genetic encoding, while reliable predictive cues promote phenotypic plasticity. We experimentally operationalize these dimensions of predictability and systematically investigate their influence on the ICL/IWL balance in Transformers. Using regression and classification tasks, we show that high environmental stability decisively favors IWL, as predicted, with a sharp transition at maximal stability. Conversely, high cue reliability enhances ICL efficacy, particularly when stability is low. Furthermore, learning dynamics reveal task-contingent temporal evolution: while a canonical ICL-to-IWL shift occurs in some settings (e.g., classification with many classes), we demonstrate that scenarios with easier IWL (e.g., fewer classes) or slower ICL acquisition (e.g., regression) can exhibit an initial IWL phase later yielding to ICL dominance. These findings support a relative-cost hypothesis for explaining these learning mode transitions, establishing predictability as a critical factor governing adaptive strategies in Transformers, and offering novel insights for understanding ICL and guiding training methodologies.

49. 【2505.09852】Do Large Language Models Know Conflict? Investigating Parametric vs. Non-Parametric Knowledge of LLMs for Conflict Forecasting

链接https://arxiv.org/abs/2505.09852

作者:Apollinaire Poli Nemkova,Sarath Chandra Lingareddy,Sagnik Ray Choudhury,Mark V. Albert

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, natural language tasks, Large Language, language tasks, natural language

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have shown impressive performance across natural language tasks, but their ability to forecast violent conflict remains underexplored. We investigate whether LLMs possess meaningful parametric knowledge-encoded in their pretrained weights-to predict conflict escalation and fatalities without external data. This is critical for early warning systems, humanitarian planning, and policy-making. We compare this parametric knowledge with non-parametric capabilities, where LLMs access structured and unstructured context from conflict datasets (e.g., ACLED, GDELT) and recent news reports via Retrieval-Augmented Generation (RAG). Incorporating external information could enhance model performance by providing up-to-date context otherwise missing from pretrained weights. Our two-part evaluation framework spans 2020-2024 across conflict-prone regions in the Horn of Africa and the Middle East. In the parametric setting, LLMs predict conflict trends and fatalities relying only on pretrained knowledge. In the non-parametric setting, models receive summaries of recent conflict events, indicators, and geopolitical developments. We compare predicted conflict trend labels (e.g., Escalate, Stable Conflict, De-escalate, Peace) and fatalities against historical data. Our findings highlight the strengths and limitations of LLMs for conflict forecasting and the benefits of augmenting them with structured external knowledge.

50. 【2505.09825】KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning

链接https://arxiv.org/abs/2505.09825

作者:Peiqi Sui,Juan Diego Rodriguez,Philippe Laban,Dean Murphy,Joseph P. Dexter,Richard Jean So,Samuel Baker,Pramit Chaudhuri

类目:Computation and Language (cs.CL)

关键词:close reading, tens of millions, millions of essays, essays are written, written and graded

备注

点击查看摘要

Abstract:Each year, tens of millions of essays are written and graded in college-level English courses. Students are asked to analyze literary and cultural texts through a process known as close reading, in which they gather textual details to formulate evidence-based arguments. Despite being viewed as a basis for critical thinking and widely adopted as a required element of university coursework, close reading has never been evaluated on large language models (LLMs), and multi-discipline benchmarks like MMLU do not include literature as a subject. To fill this gap, we present KRISTEVA, the first close reading benchmark for evaluating interpretive reasoning, consisting of 1331 multiple-choice questions adapted from classroom data. With KRISTEVA, we propose three progressively more difficult sets of tasks to approximate different elements of the close reading process, which we use to test how well LLMs may seem to understand and reason about literary works: 1) extracting stylistic features, 2) retrieving relevant contextual information from parametric knowledge, and 3) multi-hop reasoning between style and external contexts. Our baseline results find that, while state-of-the-art LLMs possess some college-level close reading competency (accuracy 49.7% - 69.7%), their performances still trail those of experienced human evaluators on 10 out of our 11 tasks.

51. 【2505.09807】Exploring the generalization of LLM truth directions on conversational formats

链接https://arxiv.org/abs/2505.09807

作者:Timour Ichmoukhamedov,David Martens

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:recent works argue, universal truth direction, true and false, false statements, statements are linearly

备注

点击查看摘要

Abstract:Several recent works argue that LLMs have a universal truth direction where true and false statements are linearly separable in the activation space of the model. It has been demonstrated that linear probes trained on a single hidden state of the model already generalize across a range of topics and might even be used for lie detection in LLM conversations. In this work we explore how this truth direction generalizes between various conversational formats. We find good generalization between short conversations that end on a lie, but poor generalization to longer formats where the lie appears earlier in the input prompt. We propose a solution that significantly improves this type of generalization by adding a fixed key phrase at the end of each conversation. Our results highlight the challenges towards reliable LLM lie detectors that generalize to new settings.

52. 【2505.09794】Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques

链接https://arxiv.org/abs/2505.09794

作者:J. Moreno-Casanova,J.M. Auñón,A. Mártinez-Pérez,M.E. Pérez-Martínez,M.E. Gas-López

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Natural Language Processing, Research projects, Health Research Institute, entities, manual extraction

备注

点击查看摘要

Abstract:Research projects, including those focused on cancer, rely on the manual extraction of information from clinical reports. This process is time-consuming and prone to errors, limiting the efficiency of data-driven approaches in healthcare. To address these challenges, Natural Language Processing (NLP) offers an alternative for automating the extraction of relevant data from electronic health records (EHRs). In this study, we focus on lung and breast cancer due to their high incidence and the significant impact they have on public health. Early detection and effective data management in both types of cancer are crucial for improving patient outcomes. To enhance the accuracy and efficiency of data extraction, we utilized GMV's NLP tool uQuery, which excels at identifying relevant entities in clinical texts and converting them into standardized formats such as SNOMED and OMOP. uQuery not only detects and classifies entities but also associates them with contextual information, including negated entities, temporal aspects, and patient-related details. In this work, we explore the use of NLP techniques, specifically Named Entity Recognition (NER), to automatically identify and extract key clinical information from EHRs related to these two cancers. A dataset from Health Research Institute Hospital La Fe (IIS La Fe), comprising 200 annotated breast cancer and 400 lung cancer reports, was used, with eight clinical entities manually labeled using the Doccano platform. To perform NER, we fine-tuned the bsc-bio-ehr-en3 model, a RoBERTa-based biomedical linguistic model pre-trained in Spanish. Fine-tuning was performed using the Transformers architecture, enabling accurate recognition of clinical entities in these cancer types. Our results demonstrate strong overall performance, particularly in identifying entities like MET and PAT, although challenges remain with less frequent entities like EVOL.

53. 【2505.09777】A Survey on Large Language Models in Multimodal Recommender Systems

链接https://arxiv.org/abs/2505.09777

作者:Alejo Lopez-Avila,Jinhua Du

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:integrate heterogeneous user, Multimodal recommender systems, enhance recommendation performance, recommender systems, integrate heterogeneous

备注: 30 pages, 6 figures

点击查看摘要

Abstract:Multimodal recommender systems (MRS) integrate heterogeneous user and item data, such as text, images, and structured information, to enhance recommendation performance. The emergence of large language models (LLMs) introduces new opportunities for MRS by enabling semantic reasoning, in-context learning, and dynamic input handling. Compared to earlier pre-trained language models (PLMs), LLMs offer greater flexibility and generalisation capabilities but also introduce challenges related to scalability and model accessibility. This survey presents a comprehensive review of recent work at the intersection of LLMs and MRS, focusing on prompting strategies, fine-tuning methods, and data adaptation techniques. We propose a novel taxonomy to characterise integration patterns, identify transferable techniques from related recommendation domains, provide an overview of evaluation metrics and datasets, and point to possible future directions. We aim to clarify the emerging role of LLMs in multimodal recommendation and support future research in this rapidly evolving field.

54. 【2505.09738】Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

链接https://arxiv.org/abs/2505.09738

作者:Shaurya Sharthak,Vinayak Pahalwan,Adithya Kamath,Adarsh Shirawalmath

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:fixed tokenization schemes, Pretrained language models, Pretrained language, tokenization schemes, performance limitations

备注

点击查看摘要

Abstract:Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents significant challenges. standard methods to overcome this often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and still may not fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, Tokenadapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning for multi-word Supertokens to enhance compression and reduce fragmentation. Tokenadapt initializes new unique token embeddings via a hybrid heuristic that combines two methods: a local estimate based on subword decomposition using the old tokenizer, and a global estimate utilizing the top-k semantically similar tokens from the original vocabulary. This methodology aims to preserve semantics while significantly minimizing retraining requirements. Empirical investigations validate both contributions: the transplantation heuristic successfully initializes unique tokens, markedly outperforming conventional baselines and sophisticated methods including Transtokenizer and ReTok, while our Supertokens achieve notable compression gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid initialization consistently yields lower perplexity ratios compared to both ReTok and TransTokenizer baselines across different base models and newly trained target tokenizers. TokenAdapt typically reduced the overall perplexity ratio significantly compared to ReTok, yielding at least a 2-fold improvement in these aggregate scores.

55. 【2505.09724】An AI-Powered Research Assistant in the Lab: A Practical Guide for Text Analysis Through Iterative Collaboration with LLMs

链接https://arxiv.org/abs/2505.09724

作者:Gino Carmona-Díaz,William Jiménez-Leal,María Alejandra Grisales,Chandra Sripada,Santiago Amaya,Michael Inzlicht,Juan Pablo Bermúdez

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

关键词:social media posts, labor-intensive process highly, process highly susceptible, open-ended responses, susceptible to bias

备注: 31 pages, 1 figure

点击查看摘要

Abstract:Analyzing texts such as open-ended responses, headlines, or social media posts is a time- and labor-intensive process highly susceptible to bias. LLMs are promising tools for text analysis, using either a predefined (top-down) or a data-driven (bottom-up) taxonomy, without sacrificing quality. Here we present a step-by-step tutorial to efficiently develop, test, and apply taxonomies for analyzing unstructured data through an iterative and collaborative process between researchers and LLMs. Using personal goals provided by participants as an example, we demonstrate how to write prompts to review datasets and generate a taxonomy of life domains, evaluate and refine the taxonomy through prompt and direct modifications, test the taxonomy and assess intercoder agreements, and apply the taxonomy to categorize an entire dataset with high intercoder reliability. We discuss the possibilities and limitations of using LLMs for text analysis.

56. 【2505.09701】VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts

链接https://arxiv.org/abs/2505.09701

作者:Xin Liu,Lechen Zhang,Sheza Munir,Yiyang Gu,Lu Wang

类目:Computation and Language (cs.CL)

关键词:Large language models, remains challenging due, Large language, factuality remains challenging, complex inter-sentence dependencies

备注

点击查看摘要

Abstract:Large language models (LLMs) excel at generating long-form responses, but evaluating their factuality remains challenging due to complex inter-sentence dependencies within the generated facts. Prior solutions predominantly follow a decompose-decontextualize-verify pipeline but often fail to capture essential context and miss key relational facts. In this paper, we introduce VeriFact, a factuality evaluation framework designed to enhance fact extraction by identifying and resolving incomplete and missing facts to support more accurate verification results. Moreover, we introduce FactRBench , a benchmark that evaluates both precision and recall in long-form model responses, whereas prior work primarily focuses on precision. FactRBench provides reference fact sets from advanced LLMs and human-written answers, enabling recall assessment. Empirical evaluations show that VeriFact significantly enhances fact completeness and preserves complex facts with critical relational information, resulting in more accurate factuality evaluation. Benchmarking various open- and close-weight LLMs on FactRBench indicate that larger models within same model family improve precision and recall, but high precision does not always correlate with high recall, underscoring the importance of comprehensive factuality assessment.

57. 【2505.09666】System Prompt Optimization with Meta-Learning

链接https://arxiv.org/abs/2505.09666

作者:Yumin Choi,Jinheon Baek,Sung Ju Hwang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, Language Models, shown remarkable capabilities, user prompts

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.

58. 【2505.09665】ales of the 2025 Los Angeles Fire: Hotwash for Public Health Concerns in Reddit via LLM-Enhanced Topic Modeling

链接https://arxiv.org/abs/2505.09665

作者:Sulong Zhou,Qunying Huang,Shaoheng Zhou,Yun Hang,Xinyue Ye,Aodong Mei,Kathryn Phung,Yuning Ye,Uma Govindswamy,Zehan Li

类目:ocial and Information Networks (cs.SI); Computation and Language (cs.CL)

关键词:Los Angeles wildfires, recent years, severe in recent, latent topics, health

备注

点击查看摘要

Abstract:Wildfires have become increasingly frequent, irregular, and severe in recent years. Understanding how affected populations perceive and respond during wildfire crises is critical for timely and empathetic disaster response. Social media platforms offer a crowd-sourced channel to capture evolving public discourse, providing hyperlocal information and insight into public sentiment. This study analyzes Reddit discourse during the 2025 Los Angeles wildfires, spanning from the onset of the disaster to full containment. We collect 385 posts and 114,879 comments related to the Palisades and Eaton fires. We adopt topic modeling methods to identify the latent topics, enhanced by large language models (LLMs) and human-in-the-loop (HITL) refinement. Furthermore, we develop a hierarchical framework to categorize latent topics, consisting of two main categories, Situational Awareness (SA) and Crisis Narratives (CN). The volume of SA category closely aligns with real-world fire progressions, peaking within the first 2-5 days as the fires reach the maximum extent. The most frequent co-occurring category set of public health and safety, loss and damage, and emergency resources expands on a wide range of health-related latent topics, including environmental health, occupational health, and one health. Grief signals and mental health risks consistently accounted for 60 percentage and 40 percentage of CN instances, respectively, with the highest total volume occurring at night. This study contributes the first annotated social media dataset on the 2025 LA fires, and introduces a scalable multi-layer framework that leverages topic modeling for crisis discourse analysis. By identifying persistent public health concerns, our results can inform more empathetic and adaptive strategies for disaster response, public health communication, and future research in comparable climate-related disaster events.

59. 【2505.09662】Large Language Models Are More Persuasive Than Incentivized Human Persuaders

链接https://arxiv.org/abs/2505.09662

作者:Philipp Schoenegger,Francesco Salvi,Jiacheng Liu,Xiaoli Nan,Ramit Debnath,Barbara Fasolo,Evelina Leivada,Gabriel Recchia,Fritz Günther,Ali Zarifhonarvar,Joe Kwon,Zahoor Ul Islam,Marco Dehnert,Daryl Y. H. Lee,Madeline G. Reinecke,David G. Kamper,Mert Kobaş,Adam Sandford,Jonas Kgomo,Luke Hewitt,Shreya Kapoor,Kerem Oktar,Eyup Engin Kucuk,Bo Feng,Cameron R. Jones,Izzy Gainsburg,Sebastian Olschewski,Nora Heinzelmann,Francisco Cruz,Ben M. Tappin,Tao Ma,Peter S. Park,Rayan Onyonka,Arthur Hjorth,Peter Slattery,Qingcheng Zeng,Lennart Finke,Igor Grossmann,Alessandro Salatiello,Ezra Karger

类目:Computation and Language (cs.CL)

关键词:Claude Sonnet, large language model, frontier large language, real-time conversational quiz, conversational quiz setting

备注

点击查看摘要

Abstract:We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz setting. In this preregistered, large-scale incentivized experiment, participants (quiz takers) completed an online quiz where persuaders (either humans or LLMs) attempted to persuade quiz takers toward correct or incorrect answers. We find that LLM persuaders achieved significantly higher compliance with their directional persuasion attempts than incentivized human persuaders, demonstrating superior persuasive capabilities in both truthful (toward correct answers) and deceptive (toward incorrect answers) contexts. We also find that LLM persuaders significantly increased quiz takers' accuracy, leading to higher earnings, when steering quiz takers toward correct answers, and significantly decreased their accuracy, leading to lower earnings, when steering them toward incorrect answers. Overall, our findings suggest that AI's persuasion capabilities already exceed those of humans that have real-money bonuses tied to performance. Our findings of increasingly capable AI persuaders thus underscore the urgency of emerging alignment and governance frameworks.

60. 【2505.09655】DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models

链接https://arxiv.org/abs/2505.09655

作者:Xiwen Chen,Wenhui Zhu,Peijie Qiu,Xuanzhao Dong,Hao Wang,Haiyu Wu,Huayu Li,Aristeidis Sotiras,Yalin Wang,Abolfazl Razi

类目:Computation and Language (cs.CL)

关键词:Relative Policy Optimization, Group Relative Policy, Policy Optimization, Group Relative, Relative Policy

备注

点击查看摘要

Abstract:Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose $\textit{Diversity-aware Reward Adjustment}$ (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR.~GRPO, resulting in $\textit{DRA-GRPO}$ and $\textit{DGA-DR.~GRPO}$. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at this https URL.

61. 【2505.09649】Next Word Suggestion using Graph Neural Network

链接https://arxiv.org/abs/2505.09649

作者:Abisha Thapa Magar,Anup Shakya

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Natural Language Processing, task in Natural, Language Processing, Natural Language, Language Modeling

备注

点击查看摘要

Abstract:Language Modeling is a prevalent task in Natural Language Processing. The currently existing most recent and most successful language models often tend to build a massive model with billions of parameters, feed in a tremendous amount of text data, and train with enormous computation resources which require millions of dollars. In this project, we aim to address an important sub-task in language modeling, i.e., context embedding. We propose an approach to exploit the Graph Convolution operation in GNNs to encode the context and use it in coalition with LSTMs to predict the next word given a local context of preceding words. We test this on the custom Wikipedia text corpus using a very limited amount of resources and show that this approach works fairly well to predict the next word.

信息检索

1. 【2505.10212】Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M

链接https://arxiv.org/abs/2505.10212

作者:Dario Di Palma,Felice Antonio Merra,Maurizio Sfilio,Vito Walter Anelli,Fedelucio Narducci,Tommaso Di Noia

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Large Language Models, remarkable natural language, natural language understanding, Large Language, natural language

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have become increasingly central to recommendation scenarios due to their remarkable natural language understanding and generation capabilities. Although significant research has explored the use of LLMs for various recommendation tasks, little effort has been dedicated to verifying whether they have memorized public recommendation dataset as part of their training data. This is undesirable because memorization reduces the generalizability of research findings, as benchmarking on memorized datasets does not guarantee generalization to unseen datasets. Furthermore, memorization can amplify biases, for example, some popular items may be recommended more frequently than others. In this work, we investigate whether LLMs have memorized public recommendation datasets. Specifically, we examine two model families (GPT and Llama) across multiple sizes, focusing on one of the most widely used dataset in recommender systems: MovieLens-1M. First, we define dataset memorization as the extent to which item attributes, user profiles, and user-item interactions can be retrieved by prompting the LLMs. Second, we analyze the impact of memorization on recommendation performance. Lastly, we examine whether memorization varies across model families and model sizes. Our results reveal that all models exhibit some degree of memorization of MovieLens-1M, and that recommendation performance is related to the extent of memorization. We have made all the code publicly available at: this https URL

Subjects:

Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2505.10212 [cs.IR]

(or
arXiv:2505.10212v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2505.10212

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Related DOI:

https://doi.org/10.1145/3726302.3730178

Focus to learn more

            DOI(s) linking to related resources</p>
2. 【2505.10043】Boosting Text-to-Chart Retrieval through Training with Synthesized Semantic Insights

链接https://arxiv.org/abs/2505.10043

作者:Yifan Wu,Lutao Yan,Yizhang Zhu,Yinan Mei,Jiannan Wang,Nan Tang,Yuyu Luo

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Business Intelligence, important for Business, find relevant charts, increasingly important, find relevant

备注

点击查看摘要

Abstract:Charts are crucial for data analysis and this http URL-to-chart retrieval systems have become increasingly important for Business Intelligence (BI), where users need to find relevant charts that match their analytical needs. These needs can be categorized into precise queries that are well-specified and fuzzy queries that are more exploratory -- both require understanding the semantics and context of the charts. However, existing text-to-chart retrieval solutions often fail to capture the semantic content and contextual information of charts, primarily due to the lack of comprehensive metadata (or semantic insights). To address this limitation, we propose a training data development pipeline that automatically synthesizes hierarchical semantic insights for charts, covering visual patterns (visual-oriented), statistical properties (statistics-oriented), and practical applications (task-oriented), which produces 207,498 semantic insights for 69,166 charts. Based on these, we train a CLIP-based model named ChartFinder to learn better representations of charts for text-to-chart retrieval. Our method leverages rich semantic insights during the training phase to develop a model that understands both visual and semantic aspects of this http URL evaluate text-to-chart retrieval performance, we curate the first benchmark, CRBench, for this task with 21,862 charts and 326 text queries from real-world BI applications, with ground-truth labels verified by the crowd this http URL show that ChartFinder significantly outperforms existing methods in text-to-chart retrieval tasks across various settings. For precise queries, ChartFinder achieves up to 66.9% NDCG@10, which is 11.58% higher than state-of-the-art models. In fuzzy query tasks, our method also demonstrates consistent improvements, with an average increase of 5% across nearly all metrics.

3. 【2505.09861】LiDDA: Data Driven Attribution at LinkedIn

链接https://arxiv.org/abs/2505.09861

作者:John Bencina,Erkut Aykutlug,Yue Chen,Zerui Zhang,Stephanie Sorenson,Shao Tang,Changshuai Wei

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Methodology (stat.ME)

关键词:Data Driven Attribution, assigns conversion credits, causal patterns learned, Driven Attribution, marketing interactions based

备注

点击查看摘要

Abstract:Data Driven Attribution, which assigns conversion credits to marketing interactions based on causal patterns learned from data, is the foundation of modern marketing intelligence and vital to any marketing businesses and advertising platform. In this paper, we introduce a unified transformer-based attribution approach that can handle member-level data, aggregate-level data, and integration of external macro factors. We detail the large scale implementation of the approach at LinkedIn, showcasing significant impact. We also share learning and insights that are broadly applicable to the marketing and ad tech fields.

4. 【2505.09847】Causal Predictive Optimization and Generation for Business AI

链接https://arxiv.org/abs/2505.09847

作者:Liyang Zhao,Olurotimi Seton,Himadeep Reddy Reddivari,Suvendu Jena,Shadow Zhao,Rachit Kumar,Changshuai Wei

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (stat.ML)

关键词:functions converting leads, involves sales functions, sales functions converting, process involves sales, sales process involves

备注

点击查看摘要

Abstract:The sales process involves sales functions converting leads or opportunities to customers and selling more products to existing customers. The optimization of the sales process thus is key to success of any B2B business. In this work, we introduce a principled approach to sales optimization and business AI, namely the Causal Predictive Optimization and Generation, which includes three layers: 1) prediction layer with causal ML 2) optimization layer with constraint optimization and contextual bandit 3) serving layer with Generative AI and feedback-loop for system enhancement. We detail the implementation and deployment of the system in LinkedIn, showcasing significant wins over legacy systems and sharing learning and insight broadly applicable to this field.

5. 【2505.09795】Beyond Pairwise Learning-To-Rank At Airbnb

链接https://arxiv.org/abs/2505.09795

作者:Malay Haldar,Daochen Zha,Huiji Gao,Liwei He,Sanjeev Katariya

类目:Information Retrieval (cs.IR)

关键词:sort items accurately, logical consistency, handle a large, items, sort items

备注

点击查看摘要

Abstract:There are three fundamental asks from a ranking algorithm: it should scale to handle a large number of items, sort items accurately by their utility, and impose a total order on the items for logical consistency. But here's the catch-no algorithm can achieve all three at the same time. We call this limitation the SAT theorem for ranking algorithms. Given the dilemma, how can we design a practical system that meets user needs? Our current work at Airbnb provides an answer, with a working solution deployed at scale. We start with pairwise learning-to-rank (LTR) models-the bedrock of search ranking tech stacks today. They scale linearly with the number of items ranked and perform strongly on metrics like NDCG by learning from pairwise comparisons. They are at a sweet spot of performance vs. cost, making them an ideal choice for several industrial applications. However, they have a drawback-by ignoring interactions between items, they compromise on accuracy. To improve accuracy, we create a "true" pairwise LTR model-one that captures interactions between items during pairwise comparisons. But accuracy comes at the expense of scalability and total order, and we discuss strategies to counter these challenges. For greater accuracy, we take each item in the search result, and compare it against the rest of the items along two dimensions: (1) Superiority: How strongly do searchers prefer the given item over the remaining ones? (2) Similarity: How similar is the given item to all the other items? This forms the basis of our "all-pairwise" LTR framework, which factors in interactions across all items at once. Looking at items on the search result page all together-superiority and similarity combined-gives us a deeper understanding of what searchers truly want. We quantify the resulting improvements in searcher experience through offline and online experiments at Airbnb.

6. 【2505.09777】A Survey on Large Language Models in Multimodal Recommender Systems

链接https://arxiv.org/abs/2505.09777

作者:Alejo Lopez-Avila,Jinhua Du

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:integrate heterogeneous user, Multimodal recommender systems, enhance recommendation performance, recommender systems, integrate heterogeneous

备注: 30 pages, 6 figures

点击查看摘要

Abstract:Multimodal recommender systems (MRS) integrate heterogeneous user and item data, such as text, images, and structured information, to enhance recommendation performance. The emergence of large language models (LLMs) introduces new opportunities for MRS by enabling semantic reasoning, in-context learning, and dynamic input handling. Compared to earlier pre-trained language models (PLMs), LLMs offer greater flexibility and generalisation capabilities but also introduce challenges related to scalability and model accessibility. This survey presents a comprehensive review of recent work at the intersection of LLMs and MRS, focusing on prompting strategies, fine-tuning methods, and data adaptation techniques. We propose a novel taxonomy to characterise integration patterns, identify transferable techniques from related recommendation domains, provide an overview of evaluation metrics and datasets, and point to possible future directions. We aim to clarify the emerging role of LLMs in multimodal recommendation and support future research in this rapidly evolving field.

7. 【2505.09776】he Impact of International Collaborations with Highly Publishing Countries in Computer Science

链接https://arxiv.org/abs/2505.09776

作者:Alberto Gomez Espes,Michael Faerber,Adam Jatowt

类目:Information Retrieval (cs.IR)

关键词:European Union, United States, paper analyzes international, Computer Science, High Development Index

备注

点击查看摘要

Abstract:This paper analyzes international collaborations in Computer Science, focusing on three major players: China, the European Union, and the United States. Drawing from a comprehensive literature review, we examine collaboration patterns, research impact, retraction rates, and the role of the Development Index in shaping research outcomes. Our findings show that while China, the EU, and the US lead global research efforts, other regions are narrowing the gap in publication volume. Collaborations involving these key regions tend to have lower retraction rates, reflecting stronger adherence to scientific standards. We also find that countries with a Very High Development Index contribute to research with higher citation rates and fewer retractions. Overall, this study highlights the value of international collaboration and the importance of inclusive, ethical practices in advancing global research in Computer Science.

计算机视觉

1. 【2505.10566】3D-Fixup: Advancing Photo Editing with 3D Priors

链接https://arxiv.org/abs/2505.10566

作者:Yen-Chi Cheng,Krishna Kumar Singh,Jae Shin Yoon,Alex Schwing,Liangyan Gui,Matheus Gadelha,Paul Guerrero,Nanxuan Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:significant advances, advances in modeling, modeling image priors, image editing remains, editing remains challenging

备注: SIGGRAPH 2025. Project page: [this https URL](https://3dfixup.github.io/)

点击查看摘要

Abstract:Despite significant advances in modeling image priors via diffusion models, 3D-aware image editing remains challenging, in part because the object is only specified via a single image. To tackle this challenge, we propose 3D-Fixup, a new framework for editing 2D images guided by learned 3D priors. The framework supports difficult editing situations such as object translation and 3D rotation. To achieve this, we leverage a training-based approach that harnesses the generative power of diffusion models. As video data naturally encodes real-world physical dynamics, we turn to video data for generating training data pairs, i.e., a source and a target frame. Rather than relying solely on a single trained model to infer transformations between source and target frames, we incorporate 3D guidance from an Image-to-3D model, which bridges this challenging task by explicitly projecting 2D information into 3D space. We design a data generation pipeline to ensure high-quality 3D guidance throughout training. Results show that by integrating these 3D priors, 3D-Fixup effectively supports complex, identity coherent 3D-aware edits, achieving high-quality results and advancing the application of diffusion models in realistic image manipulation. The code is provided at this https URL

2. 【2505.10565】Depth Anything with Any Prior

链接https://arxiv.org/abs/2505.10565

作者:Zehan Wang,Siyu Chen,Lihe Yang,Jialei Wang,Ziang Zhang,Hengshuang Zhao,Zhou Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:complete geometric structures, precise metric information, work presents Prior, generating accurate, Depth

备注: Home page: [this https URL](https://prior-depth-anything.github.io/)

点击查看摘要

Abstract:This work presents Prior Depth Anything, a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction, generating accurate, dense, and detailed metric depth maps for any scene. To this end, we design a coarse-to-fine pipeline to progressively integrate the two complementary depth sources. First, we introduce pixel-level metric alignment and distance-aware weighting to pre-fill diverse metric priors by explicitly using depth prediction. It effectively narrows the domain gap between prior patterns, enhancing generalization across varying scenarios. Second, we develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. By conditioning on the normalized pre-filled prior and prediction, the model further implicitly merges the two complementary depth sources. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets, matching or even surpassing previous task-specific methods. More importantly, it performs well on challenging, unseen mixed priors and enables test-time improvements by switching prediction models, providing a flexible accuracy-efficiency trade-off while evolving with advancements in MDE models.

3. 【2505.10562】End-to-End Vision Tokenizer Tuning

链接https://arxiv.org/abs/2505.10562

作者:Wenxuan Wang,Fan Zhang,Yufeng Cui,Haiwen Diao,Zhuoyan Luo,Huchuan Lu,Jing Liu,Xinlong Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:visual question answering, vision tokenization isolates, vision tokenizer, vision, vision tokenization

备注

点击查看摘要

Abstract:Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize well across various tasks, e.g., image generation and visual question answering. The vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks requiring varied representations and semantics. This decoupled paradigm introduces a critical misalignment: The loss of the vision tokenization can be the representation bottleneck for target tasks. For example, errors in tokenizing text in a given image lead to poor results when recognizing or generating them. To address this, we propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks. Unlike prior autoregressive models that use only discrete indices from a frozen vision tokenizer, ETT leverages the visual embeddings of the tokenizer codebook, and optimizes the vision tokenizers end-to-end with both reconstruction and caption objectives. ETT can be seamlessly integrated into existing training pipelines with minimal architecture modifications. Our ETT is simple to implement and integrate, without the need to adjust the original codebooks or architectures of the employed large language models. Extensive experiments demonstrate that our proposed end-to-end vision tokenizer tuning unlocks significant performance gains, i.e., 2-6% for multimodal understanding and visual generation tasks compared to frozen tokenizer baselines, while preserving the original reconstruction capability. We hope this very simple and strong method can empower multimodal foundation models besides image generation and understanding.

4. 【2505.10558】Style Customization of Text-to-Vector Generation with Image Diffusion Priors

链接https://arxiv.org/abs/2505.10558

作者:Peiying Zhang,Nanxuan Zhao,Jing Liao

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:Scalable Vector Graphics, well-organized layer structure, Scalable Vector, Vector Graphics, layer structure

备注: Accepted by SIGGRAPH 2025 (Conference Paper). Project page: [this https URL](https://customsvg.github.io)

点击查看摘要

Abstract:Scalable Vector Graphics (SVGs) are highly favored by designers due to their resolution independence and well-organized layer structure. Although existing text-to-vector (T2V) generation methods can create SVGs from text prompts, they often overlook an important need in practical applications: style customization, which is vital for producing a collection of vector graphics with consistent visual appearance and coherent aesthetics. Extending existing T2V methods for style customization poses certain challenges. Optimization-based T2V models can utilize the priors of text-to-image (T2I) models for customization, but struggle with maintaining structural regularity. On the other hand, feed-forward T2V models can ensure structural regularity, yet they encounter difficulties in disentangling content and style due to limited SVG training data. To address these challenges, we propose a novel two-stage style customization pipeline for SVG generation, making use of the advantages of both feed-forward T2V models and T2I image priors. In the first stage, we train a T2V diffusion model with a path-level representation to ensure the structural regularity of SVGs while preserving diverse expressive capabilities. In the second stage, we customize the T2V diffusion model to different styles by distilling customized T2I models. By integrating these techniques, our pipeline can generate high-quality and diverse SVGs in custom styles based on text prompts in an efficient feed-forward manner. The effectiveness of our method has been validated through extensive experiments. The project page is this https URL.

Comments:
Accepted by SIGGRAPH 2025 (Conference Paper). Project page: this https URL

Subjects:

Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2505.10558 [cs.GR]

(or
arXiv:2505.10558v1 [cs.GR] for this version)

https://doi.org/10.48550/arXiv.2505.10558

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
5. 【2505.10557】MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

链接https://arxiv.org/abs/2505.10557

作者:Ke Wang,Junting Pan,Linda Wei,Aojun Zhou,Weikang Shi,Zimu Lu,Han Xiao,Yunqiao Yang,Houxing Ren,Mingjie Zhan,Hongsheng Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Natural language image-caption, training Large Multimodal, training Large, Large Multimodal Models, Natural language

备注: Accepted to ACL 2025 Findings

点击查看摘要

Abstract:Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with model-in-the-loop approach, resulting in an image-to-code model, FigCodifier and ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%. The dataset and models will be released at this https URL.

6. 【2505.10551】Does Feasibility Matter? Understanding the Impact of Feasibility on Synthetic Training Data

链接https://arxiv.org/abs/2505.10551

作者:Yiwen Liu,Jessica Bader,Jae Myung Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:data achieve progressively, photorealistic diffusion models, progressively better results, development of photorealistic, trained in part

备注: CVPRW 2025

点击查看摘要

Abstract:With the development of photorealistic diffusion models, models trained in part or fully on synthetic data achieve progressively better results. However, diffusion models still routinely generate images that would not exist in reality, such as a dog floating above the ground or with unrealistic texture artifacts. We define the concept of feasibility as whether attributes in a synthetic image could realistically exist in the real-world domain; synthetic images containing attributes that violate this criterion are considered infeasible. Intuitively, infeasible images are typically considered out-of-distribution; thus, training on such images is expected to hinder a model's ability to generalize to real-world data, and they should therefore be excluded from the training set whenever possible. However, does feasibility really matter? In this paper, we investigate whether enforcing feasibility is necessary when generating synthetic training data for CLIP-based classifiers, focusing on three target attributes: background, color, and texture. We introduce VariReal, a pipeline that minimally edits a given source image to include feasible or infeasible attributes given by the textual prompt generated by a large language model. Our experiments show that feasibility minimally affects LoRA-fine-tuned CLIP performance, with mostly less than 0.3% difference in top-1 accuracy across three fine-grained datasets. Also, the attribute matters on whether the feasible/infeasible images adversarially influence the classification performance. Finally, mixing feasible and infeasible images in training datasets does not significantly impact performance compared to using purely feasible or infeasible datasets.

7. 【2505.10541】Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis

链接https://arxiv.org/abs/2505.10541

作者:Pengfei Wang,Guohai Xu,Weinong Wang,Junjie Yang,Jie Lou,Yunhua Xue

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, comprehend multi-image information

备注

点击查看摘要

Abstract:Recent advancements have enhanced the capability of Multimodal Large Language Models (MLLMs) to comprehend multi-image information. However, existing benchmarks primarily evaluate answer correctness, overlooking whether models genuinely comprehend the visual input. To address this, we define implicit visual misunderstanding (IVM), where MLLMs provide correct answers without fully comprehending the visual input. Through our analysis, we decouple the visual and textual modalities within the causal attention module, revealing that attention distribution increasingly converges on the image associated with the correct answer as the network layers deepen. This insight leads to the introduction of a scale-agnostic metric, \textit{attention accuracy}, and a novel benchmark for quantifying IVMs. Attention accuracy directly evaluates the model's visual understanding via internal mechanisms, remaining robust to positional biases for more reliable assessments. Furthermore, we extend our approach to finer granularities and demonstrate its effectiveness in unimodal scenarios, underscoring its versatility and generalizability.

8. 【2505.10533】Enhancing Multi-Image Question Answering via Submodular Subset Selection

链接https://arxiv.org/abs/2505.10533

作者:Aaryan Sharma,Shivansh Gupta,Samar Agarwal,Vishak Prasad C.,Ganesh Ramakrishnan

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Question Answering scenario, Image Question Answering, Multiple Image Question, Question Answering, Answering scenario

备注

点击查看摘要

Abstract:Large multimodal models (LMMs) have achieved high performance in vision-language tasks involving single image but they struggle when presented with a collection of multiple images (Multiple Image Question Answering scenario). These tasks, which involve reasoning over large number of images, present issues in scalability (with increasing number of images) and retrieval performance. In this work, we propose an enhancement for retriever framework introduced in MIRAGE model using submodular subset selection techniques. Our method leverages query-aware submodular functions, such as GraphCut, to pre-select a subset of semantically relevant images before main retrieval component. We demonstrate that using anchor-based queries and augmenting the data improves submodular-retriever pipeline effectiveness, particularly in large haystack sizes.

9. 【2505.10526】MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models

链接https://arxiv.org/abs/2505.10526

作者:Mugilan Ganesan,Shane Segal,Ankur Aggarwal,Nish Sinnadurai,Sean Lie,Vithursan Thangarasa

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Speculative decoding significantly, decoding significantly accelerates, significantly accelerates language, model verifies simultaneously, propose multiple tokens

备注: Main paper: 11 pp., 4 figs., 3 tabs.; Supplementary: 2 pp

点击查看摘要

Abstract:Speculative decoding significantly accelerates language model inference by enabling a lightweight draft model to propose multiple tokens that a larger target model verifies simultaneously. However, applying this technique to vision-language models (VLMs) presents two fundamental challenges: small language models that could serve as efficient drafters lack the architectural components to process visual inputs, and their token predictions fail to match those of VLM target models that consider visual context. We introduce Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (MASSV), which transforms existing small language models into effective multimodal drafters through a two-phase approach. MASSV first connects the target VLM's vision encoder to the draft model via a lightweight trainable projector, then applies self-distilled visual instruction tuning using responses generated by the target VLM to align token predictions. Comprehensive experiments across the Qwen2.5-VL and Gemma3 model families demonstrate that MASSV increases accepted length by up to 30% and delivers end-to-end inference speedups of up to 1.46x on visually-grounded tasks. MASSV provides a scalable, architecture-compatible method for accelerating both current and future VLMs.

10. 【2505.10518】Multi-Token Prediction Needs Registers

链接https://arxiv.org/abs/2505.10518

作者:Anastasios Gerontopoulos,Spyros Gidaris,Nikos Komodakis

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Multi-token prediction, consistently generalized, improving language model, fine-tuning, supervised fine-tuning

备注

点击查看摘要

Abstract:Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes--ensuring compatibility with off-the-shelf pretrained language models--and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: this https URL.

11. 【2505.10497】MorphGuard: Morph Specific Margin Loss for Enhancing Robustness to Face Morphing Attacks

链接https://arxiv.org/abs/2505.10497

作者:Iurii Medvedev,Nuno Goncalves

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:requiring secure authentication, applications requiring secure, deep learning techniques, face morphing attacks, learning techniques

备注

点击查看摘要

Abstract:Face recognition has evolved significantly with the advancement of deep learning techniques, enabling its widespread adoption in various applications requiring secure authentication. However, this progress has also increased its exposure to presentation attacks, including face morphing, which poses a serious security threat by allowing one identity to impersonate another. Therefore, modern face recognition systems must be robust against such attacks. In this work, we propose a novel approach for training deep networks for face recognition with enhanced robustness to face morphing attacks. Our method modifies the classification task by introducing a dual-branch classification strategy that effectively handles the ambiguity in the labeling of face morphs. This adaptation allows the model to incorporate morph images into the training process, improving its ability to distinguish them from bona fide samples. Our strategy has been validated on public benchmarks, demonstrating its effectiveness in enhancing robustness against face morphing attacks. Furthermore, our approach is universally applicable and can be integrated into existing face recognition training pipelines to improve classification-based recognition methods.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2505.10497 [cs.CV]

(or
arXiv:2505.10497v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2505.10497

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
12. 【2505.10496】CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs

链接https://arxiv.org/abs/2505.10496

作者:Raman Dutt,Pedro Sanchez,Yongchen Yao,Steven McDonagh,Sotirios A. Tsaftaris,Timothy Hospedales

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:simultaneously assesses fidelity, rigorous and multifaceted, simultaneously assesses, synthetic chest radiograph, multifaceted evaluation framework

备注

点击查看摘要

Abstract:We introduce CheXGenBench, a rigorous and multifaceted evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and clinical utility across state-of-the-art text-to-image generative models. Despite rapid advancements in generative AI for real-world imagery, medical domain evaluations have been hindered by methodological inconsistencies, outdated architectural comparisons, and disconnected assessment criteria that rarely address the practical clinical value of synthetic samples. CheXGenBench overcomes these limitations through standardised data partitioning and a unified evaluation protocol comprising over 20 quantitative metrics that systematically analyse generation quality, potential privacy vulnerabilities, and downstream clinical applicability across 11 leading text-to-image architectures. Our results reveal critical inefficiencies in the existing evaluation protocols, particularly in assessing generative fidelity, leading to inconsistent and uninformative comparisons. Our framework establishes a standardised benchmark for the medical AI community, enabling objective and reproducible comparisons while facilitating seamless integration of both existing and future generative models. Additionally, we release a high-quality, synthetic dataset, SynthCheX-75K, comprising 75K radiographs generated by the top-performing model (Sana 0.6B) in our benchmark to support further research in this critical domain. Through CheXGenBench, we establish a new state-of-the-art and release our framework, models, and SynthCheX-75K dataset at this https URL

13. 【2505.10483】UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation

链接https://arxiv.org/abs/2505.10483

作者:Yi Li,Haonan Wang,Qixiang Zhang,Boyu Xiao,Chenchang Hu,Hualiang Wang,Xiaomeng Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:rapidly attracting attention, enhance instruction-following capabilities, minimizing model redundancy, rapidly attracting, attracting attention

备注: UniEval is the first evaluation framework designed for unified multimodal models, including a holistic benchmark UniBench and the UniScore metric

点击查看摘要

Abstract:The emergence of unified multimodal understanding and generation models is rapidly attracting attention because of their ability to enhance instruction-following capabilities while minimizing model redundancy. However, there is a lack of a unified evaluation framework for these models, which would enable an elegant, simplified, and overall evaluation. Current models conduct evaluations on multiple task-specific benchmarks, but there are significant limitations, such as the lack of overall results, errors from extra evaluation models, reliance on extensive labeled images, benchmarks that lack diversity, and metrics with limited capacity for instruction-following evaluation. To tackle these challenges, we introduce UniEval, the first evaluation framework designed for unified multimodal models without extra models, images, or annotations. This facilitates a simplified and unified evaluation process. The UniEval framework contains a holistic benchmark, UniBench (supports both unified and visual generation models), along with the corresponding UniScore metric. UniBench includes 81 fine-grained tags contributing to high diversity. Experimental results indicate that UniBench is more challenging than existing benchmarks, and UniScore aligns closely with human evaluations, surpassing current metrics. Moreover, we extensively evaluated SoTA unified and visual generation models, uncovering new insights into Univeral's unique values.

14. 【2505.10481】Logos as a Well-Tempered Pre-train for Sign Language Recognition

链接https://arxiv.org/abs/2505.10481

作者:Ilya Ovodov,Petr Surovtsev,Karina Kvanchiani,Alexander Kapitanov,Alexander Nagaev

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:sign language recognition, isolated sign language, Russian Sign Language, paper examines, examines two aspects

备注

点击查看摘要

Abstract:This paper examines two aspects of the isolated sign language recognition (ISLR) task. First, despite the availability of a number of datasets, the amount of data for most individual sign languages is limited. It poses the challenge of cross-language ISLR model training, including transfer learning. Second, similar signs can have different semantic meanings. It leads to ambiguity in dataset labeling and raises the question of the best policy for annotating such signs. To address these issues, this study presents Logos, a novel Russian Sign Language (RSL) dataset, the most extensive ISLR dataset by the number of signers and one of the largest available datasets while also the largest RSL dataset in size and vocabulary. It is shown that a model, pre-trained on the Logos dataset can be used as a universal encoder for other language SLR tasks, including few-shot learning. We explore cross-language transfer learning approaches and find that joint training using multiple classification heads benefits accuracy for the target lowresource datasets the most. The key feature of the Logos dataset is explicitly annotated visually similar sign groups. We show that explicitly labeling visually similar signs improves trained model quality as a visual encoder for downstream tasks. Based on the proposed contributions, we outperform current state-of-the-art results for the WLASL dataset and get competitive results for the AUTSL dataset, with a single stream model processing solely RGB video. The source code, dataset, and pre-trained models are publicly available.

15. 【2505.10473】Consistent Quantity-Quality Control across Scenes for Deployment-Aware Gaussian Splatting

链接https://arxiv.org/abs/2505.10473

作者:Fengdi Zhang,Hongkun Cao,Ruqi Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:preserving high rendering, computational costs, seeks to minimize, introducing an inherent, reduce storage

备注

点击查看摘要

Abstract:To reduce storage and computational costs, 3D Gaussian splatting (3DGS) seeks to minimize the number of Gaussians used while preserving high rendering quality, introducing an inherent trade-off between Gaussian quantity and rendering quality. Existing methods strive for better quantity-quality performance, but lack the ability for users to intuitively adjust this trade-off to suit practical needs such as model deployment under diverse hardware and communication constraints. Here, we present ControlGS, a 3DGS optimization method that achieves semantically meaningful and cross-scene consistent quantity-quality control while maintaining strong quantity-quality performance. Through a single training run using a fixed setup and a user-specified hyperparameter reflecting quantity-quality preference, ControlGS can automatically find desirable quantity-quality trade-off points across diverse scenes, from compact objects to large outdoor scenes. It also outperforms baselines by achieving higher rendering quality with fewer Gaussians, and supports a broad adjustment range with stepless control over the trade-off.

16. 【2505.10457】SEAL: Searching Expandable Architectures for Incremental Learning

链接https://arxiv.org/abs/2505.10457

作者:Matteo Gambella,Vicente Javier Castro Solar,Manuel Roveri

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:machine learning paradigm, Deep Neural Networks, sequential stream, Neural Architecture Search, learning

备注: 8 pages, 5 figures

点击查看摘要

Abstract:Incremental learning is a machine learning paradigm where a model learns from a sequential stream of tasks. This setting poses a key challenge: balancing plasticity (learning new tasks) and stability (preserving past knowledge). Neural Architecture Search (NAS), a branch of AutoML, automates the design of the architecture of Deep Neural Networks and has shown success in static settings. However, existing NAS-based approaches to incremental learning often rely on expanding the model at every task, making them impractical in resource-constrained environments. In this work, we introduce SEAL, a NAS-based framework tailored for data-incremental learning, a scenario where disjoint data samples arrive sequentially and are not stored for future access. SEAL adapts the model structure dynamically by expanding it only when necessary, based on a capacity estimation metric. Stability is preserved through cross-distillation training after each expansion step. The NAS component jointly searches for both the architecture and the optimal expansion policy. Experiments across multiple benchmarks demonstrate that SEAL effectively reduces forgetting and enhances accuracy while maintaining a lower model size compared to prior methods. These results highlight the promise of combining NAS and selective expansion for efficient, adaptive learning in incremental scenarios.

17. 【2505.10453】Vision language models have difficulty recognizing virtual objects

链接https://arxiv.org/abs/2505.10453

作者:Tyler Tran,Sangeet Khemlani,J.G. Trafton

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Vision language models, process multimodal input, Vision language, vision encoders, language models

备注

点击查看摘要

Abstract:Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question about how well they comprehend the visuospatial properties of scenes depicted in the images they process. We argue that descriptions of virtual objects -- objects that are not visually represented in an image -- can help test scene comprehension in these AI systems. For example, an image that depicts a person standing under a tree can be paired with the following prompt: imagine that a kite is stuck in the tree. VLMs that comprehend the scene should update their representations and reason sensibly about the spatial relations between all three objects. We describe systematic evaluations of state-of-the-art VLMs and show that their ability to process virtual objects is inadequate.

18. 【2505.10441】PIF: Anomaly detection via preference embedding

链接https://arxiv.org/abs/2505.10441

作者:Filippo Leveni,Luca Magri,Giacomo Boracchi,Cesare Alippi

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

关键词:structured patterns, address the problem, problem of detecting, detecting anomalies, anomalies with respect

备注: Accepted at International Conference on Pattern Recognition (ICPR 2020)

点击查看摘要

Abstract:We address the problem of detecting anomalies with respect to structured patterns. To this end, we conceive a novel anomaly detection method called PIF, that combines the advantages of adaptive isolation methods with the flexibility of preference embedding. Specifically, we propose to embed the data in a high dimensional space where an efficient tree-based method, PI-Forest, is employed to compute an anomaly score. Experiments on synthetic and real datasets demonstrate that PIF favorably compares with state-of-the-art anomaly detection techniques, and confirm that PI-Forest is better at measuring arbitrary distances and isolate points in the preference space.

19. 【2505.10420】Learned Lightweight Smartphone ISP with Unpaired Data

链接https://arxiv.org/abs/2505.10420

作者:Andrei Arhire,Radu Timofte

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Image Signal Processor, Signal Processor, modern smartphone cameras, smartphone cameras responsible, Image Signal

备注: Accepted at CVPRW 2025

点击查看摘要

Abstract:The Image Signal Processor (ISP) is a fundamental component in modern smartphone cameras responsible for conversion of RAW sensor image data to RGB images with a strong focus on perceptual quality. Recent work highlights the potential of deep learning approaches and their ability to capture details with a quality increasingly close to that of professional cameras. A difficult and costly step when developing a learned ISP is the acquisition of pixel-wise aligned paired data that maps the raw captured by a smartphone camera sensor to high-quality reference images. In this work, we address this challenge by proposing a novel training method for a learnable ISP that eliminates the need for direct correspondences between raw images and ground-truth data with matching content. Our unpaired approach employs a multi-term loss function guided by adversarial training with multiple discriminators processing feature maps from pre-trained networks to maintain content structure while learning color and texture characteristics from the target RGB dataset. Using lightweight neural network architectures suitable for mobile devices as backbones, we evaluated our method on the Zurich RAW to RGB and Fujifilm UltraISP datasets. Compared to paired training methods, our unpaired learning strategy shows strong potential and achieves high fidelity across multiple evaluation metrics. The code and pre-trained models are available at this https URL .

20. 【2505.10352】SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity

链接https://arxiv.org/abs/2505.10352

作者:Shihao Zou,Qingfeng Li,Wei Ji,Jingjing Li,Yongkui Yang,Guoqi Li,Chao Dong

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Spiking Neural Networks, Artificial Neural Networks, Neural Networks, Spiking Neural, Artificial Neural

备注: Accepted by ICML 2025

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs' efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15\% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving $\times 16$, $\times 10$ and $\times 5$ improvements on the three tasks. this https URL

21. 【2505.10351】A Unified and Scalable Membership Inference Method for Visual Self-supervised Encoder via Part-aware Capability

链接https://arxiv.org/abs/2505.10351

作者:Jie Zhu,Jirong Zha,Ding Li,Leye Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:learning shows promise, significant privacy concerns, harnessing extensive unlabeled, confronts significant privacy, Self-supervised learning shows

备注: An extension of our ACM CCS2024 conference paper ( [arXiv:2404.02462](https://arxiv.org/abs/2404.02462) ). We show the impacts of scaling from both data and model aspects on membership inference for self-supervised visual encoders

点击查看摘要

Abstract:Self-supervised learning shows promise in harnessing extensive unlabeled data, but it also confronts significant privacy concerns, especially in vision. In this paper, we perform membership inference on visual self-supervised models in a more realistic setting: self-supervised training method and details are unknown for an adversary when attacking as he usually faces a black-box system in practice. In this setting, considering that self-supervised model could be trained by completely different self-supervised paradigms, e.g., masked image modeling and contrastive learning, with complex training details, we propose a unified membership inference method called PartCrop. It is motivated by the shared part-aware capability among models and stronger part response on the training data. Specifically, PartCrop crops parts of objects in an image to query responses within the image in representation space. We conduct extensive attacks on self-supervised models with different training protocols and structures using three widely used image datasets. The results verify the effectiveness and generalization of PartCrop. Moreover, to defend against PartCrop, we evaluate two common approaches, i.e., early stop and differential privacy, and propose a tailored method called shrinking crop scale range. The defense experiments indicate that all of them are effective. Finally, besides prototype testing on toy visual encoders and small-scale image datasets, we quantitatively study the impacts of scaling from both data and model aspects in a realistic scenario and propose a scalable PartCrop-v2 by introducing two structural improvements to PartCrop. Our code is at this https URL.

22. 【2505.10312】SOS: A Shuffle Order Strategy for Data Augmentation in Industrial Human Activity Recognition

链接https://arxiv.org/abs/2505.10312

作者:Anh Tuan Ha,Hoang Khang Phan,Thai Minh Tien Ngo,Anh Phan Truong,Nhat Tan Le

类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

关键词:obtaining high quality, Human Activity Recognition, Generative Adversarial Networks, persistent challenge due, realm of Human

备注

点击查看摘要

Abstract:In the realm of Human Activity Recognition (HAR), obtaining high quality and variance data is still a persistent challenge due to high costs and the inherent variability of real-world activities. This study introduces a generation dataset by deep learning approaches (Attention Autoencoder and conditional Generative Adversarial Networks). Another problem that data heterogeneity is a critical challenge, one of the solutions is to shuffle the data to homogenize the distribution. Experimental results demonstrate that the random sequence strategy significantly improves classification performance, achieving an accuracy of up to 0.70 $\pm$ 0.03 and a macro F1 score of 0.64 $\pm$ 0.01. For that, disrupting temporal dependencies through random sequence reordering compels the model to focus on instantaneous recognition, thereby improving robustness against activity transitions. This approach not only broadens the effective training dataset but also offers promising avenues for enhancing HAR systems in complex, real-world scenarios.

23. 【2505.10294】MIPHEI-ViT: Multiplex Immunofluorescence Prediction from HE Images using ViT Foundation Models

链接https://arxiv.org/abs/2505.10294

作者:Guillaume Balezo,Roger Trullo,Albert Pla Planas,Etienne Decenciere,Thomas Walter

类目:Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)

关键词:Hematoxylin and Eosin, staining routinely acquired, visualize cell morphology, Multiplex Immunofluorescence Prediction, multiplex immunofluorescence

备注

点击查看摘要

Abstract:Histopathological analysis is a cornerstone of cancer diagnosis, with Hematoxylin and Eosin (HE) staining routinely acquired for every patient to visualize cell morphology and tissue architecture. On the other hand, multiplex immunofluorescence (mIF) enables more precise cell type identification via proteomic markers, but has yet to achieve widespread clinical adoption due to cost and logistical constraints. To bridge this gap, we introduce MIPHEI (Multiplex Immunofluorescence Prediction from HE), a U-Net-inspired architecture that integrates state-of-the-art ViT foundation models as encoders to predict mIF signals from HE images. MIPHEI targets a comprehensive panel of markers spanning nuclear content, immune lineages (T cells, B cells, myeloid), epithelium, stroma, vasculature, and proliferation. We train our model using the publicly available ORION dataset of restained HE and mIF images from colorectal cancer tissue, and validate it on two independent datasets. MIPHEI achieves accurate cell-type classification from HE alone, with F1 scores of 0.88 for Pan-CK, 0.57 for CD3e, 0.56 for SMA, 0.36 for CD68, and 0.30 for CD20, substantially outperforming both a state-of-the-art baseline and a random classifier for most markers. Our results indicate that our model effectively captures the complex relationships between nuclear morphologies in their tissue context, as visible in HE images and molecular markers defining specific cell types. MIPHEI offers a promising step toward enabling cell-type-aware analysis of large-scale HE datasets, in view of uncovering relationships between spatial cellular organization and patient outcomes.

24. 【2505.10292】StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation

链接https://arxiv.org/abs/2505.10292

作者:Daniel A. P. Oliveira,David Martins de Matos

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:storytelling systems struggle, Visual storytelling systems, frequently leading, maintain character identity, storytelling systems

备注: 31 pages, 14 figures

点击查看摘要

Abstract:Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed through grounding of characters, objects, and other entities on the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. Evaluation demonstrates a reduction from 4.06 to 3.56 (-12.3%) hallucinations on average per story when compared to a non-fine-tuned model.

25. 【2505.10289】MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning

链接https://arxiv.org/abs/2505.10289

作者:Yue Wang,Shuai Xu,Xuelin Zhu,Yicong Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Compositional Zero-Shot Learning, recognize unseen state-object, Compositional Zero-Shot, Zero-Shot Learning, unseen state-object combinations

备注: 9 pages, 5 figures

点击查看摘要

Abstract:Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies basically rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architectural and training paradigm. To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP's visual encoder. Specifically, we design two self-adaptive aggregators to extract local information from low-level visual features and integrate global information from high-level visual features, respectively. These key information are progressively incorporated into textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model's perception capability for fine-grained local visual information. Additionally, MSCI dynamically adjusts the attention weights between global and local visual information based on different combinations, as well as different elements within the same combination, allowing it to flexibly adapt to diverse scenarios. Experiments on three widely used datasets fully validate the effectiveness and superiority of the proposed model. Data and code are available at this https URL.

26. 【2505.10281】MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting

链接https://arxiv.org/abs/2505.10281

作者:Mengqiu Xu,Kaixin Chen,Heng Guo,Yixiang Huang,Ming Wu,Zhenwei Shi,Chuang Zhang,Jun Guo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Deep learning approaches, Deep learning, demonstrating significant scientific, outperformed traditional methods, demonstrating significant

备注

点击查看摘要

Abstract:Deep learning approaches for marine fog detection and forecasting have outperformed traditional methods, demonstrating significant scientific and practical importance. However, the limited availability of open-source datasets remains a major challenge. Existing datasets, often focused on a single region or satellite, restrict the ability to evaluate model performance across diverse conditions and hinder the exploration of intrinsic marine fog characteristics. To address these limitations, we introduce \textbf{MFogHub}, the first multi-regional and multi-satellite dataset to integrate annotated marine fog observations from 15 coastal fog-prone regions and six geostationary satellites, comprising over 68,000 high-resolution samples. By encompassing diverse regions and satellite perspectives, MFogHub facilitates rigorous evaluation of both detection and forecasting methods under varying conditions. Extensive experiments with 16 baseline models demonstrate that MFogHub can reveal generalization fluctuations due to regional and satellite discrepancy, while also serving as a valuable resource for the development of targeted and scalable fog prediction techniques. Through MFogHub, we aim to advance both the practical monitoring and scientific understanding of marine fog dynamics on a global scale. The dataset and code are at \href{this https URL}{this https URL}.

27. 【2505.10271】RainPro-8: An Efficient Deep Learning Model to Estimate Rainfall Probabilities Over 8 Hours

链接https://arxiv.org/abs/2505.10271

作者:Rafael Pablos Sarabia,Joachim Nyborg,Morten Birk,Jeppe Liborius Sjørup,Anders Lillevang Vesterholt,Ira Assent

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:radar-only deep learning, deep learning model, forecast lead times, deep learning, short forecast lead

备注

点击查看摘要

Abstract:We present a deep learning model for high-resolution probabilistic precipitation forecasting over an 8-hour horizon in Europe, overcoming the limitations of radar-only deep learning models with short forecast lead times. Our model efficiently integrates multiple data sources - including radar, satellite, and physics-based numerical weather prediction (NWP) - while capturing long-range interactions, resulting in accurate forecasts with robust uncertainty quantification through consistent probabilistic maps. Featuring a compact architecture, it enables more efficient training and faster inference than existing models. Extensive experiments demonstrate that our model surpasses current operational NWP systems, extrapolation-based methods, and deep-learning nowcasting models, setting a new standard for high-resolution precipitation forecasting in Europe, ensuring a balance between accuracy, interpretability, and computational efficiency.

28. 【2505.10267】HandReader: Advanced Techniques for Efficient Fingerspelling Recognition

链接https://arxiv.org/abs/2505.10267

作者:Pavel Korotaev,Petr Surovtsev,Alexander Kapitanov,Karina Kvanchiani,Aleksandr Nagaev

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Sign Language, fast hand movements, component of Sign, allowing the interpretation, characterized by fast

备注: [this https URL](https://github.com/ai-forever/handreader)

点击查看摘要

Abstract:Fingerspelling is a significant component of Sign Language (SL), allowing the interpretation of proper names, characterized by fast hand movements during signing. Although previous works on fingerspelling recognition have focused on processing the temporal dimension of videos, there remains room for improving the accuracy of these approaches. This paper introduces HandReader, a group of three architectures designed to address the fingerspelling recognition task. HandReader$_{RGB}$ employs the novel Temporal Shift-Adaptive Module (TSAM) to process RGB features from videos of varying lengths while preserving important sequential information. HandReader$_{KP}$ is built on the proposed Temporal Pose Encoder (TPE) operated on keypoints as tensors. Such keypoints composition in a batch allows the encoder to pass them through 2D and 3D convolution layers, utilizing temporal and spatial information and accumulating keypoints coordinates. We also introduce HandReader_RGB+KP - architecture with a joint encoder to benefit from RGB and keypoint modalities. Each HandReader model possesses distinct advantages and achieves state-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets. Moreover, the models demonstrate high performance on the first open dataset for Russian fingerspelling, Znaki, presented in this paper. The Znaki dataset and HandReader pre-trained models are publicly available.

29. 【2505.10258】Inferring Driving Maps by Deep Learning-based Trail Map Extraction

链接https://arxiv.org/abs/2505.10258

作者:Michael Hubbertz,Pascal Colling,Qi Han,Tobias Meisen

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:accurate environmental information, maps offer extensive, accurate environmental, environmental information, crucial and essential

备注: This paper was accepted at the CVPR WAD 2025 Workshop

点击查看摘要

Abstract:High-definition (HD) maps offer extensive and accurate environmental information about the driving scene, making them a crucial and essential element for planning within autonomous driving systems. To avoid extensive efforts from manual labeling, methods for automating the map creation have emerged. Recent trends have moved from offline mapping to online mapping, ensuring availability and actuality of the utilized maps. While the performance has increased in recent years, online mapping still faces challenges regarding temporal consistency, sensor occlusion, runtime, and generalization. We propose a novel offline mapping approach that integrates trails - informal routes used by drivers - into the map creation process. Our method aggregates trail data from the ego vehicle and other traffic participants to construct a comprehensive global map using transformer-based deep learning models. Unlike traditional offline mapping, our approach enables continuous updates while remaining sensor-agnostic, facilitating efficient data transfer. Our method demonstrates superior performance compared to state-of-the-art online mapping approaches, achieving improved generalization to previously unseen environments and sensor configurations. We validate our approach on two benchmark datasets, highlighting its robustness and applicability in autonomous driving systems.

30. 【2505.10257】Sage Deer: A Super-Aligned Driving Generalist Is Your Copilot

链接https://arxiv.org/abs/2505.10257

作者:Hao Lu,Jiaqi Tang,Jiyao Wang,Yunfan LU,Xu Cao,Qingyong Hu,Yin Wang,Yuting Zhang,Tianxin Xie,Yunpeng Zhang,Yong Chen,Jiayu.Gao,Bin Huang,Dengbo He,Shuiguang Deng,Hao Chen,Ying-Cong Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:intelligent driving cockpit, intelligent driving, users' comfort, important part, match different users'

备注

点击查看摘要

Abstract:The intelligent driving cockpit, an important part of intelligent driving, needs to match different users' comfort, interaction, and safety needs. This paper aims to build a Super-Aligned and GEneralist DRiving agent, SAGE DeeR. Sage Deer achieves three highlights: (1) Super alignment: It achieves different reactions according to different people's preferences and biases. (2) Generalist: It can understand the multi-view and multi-mode inputs to reason the user's physiological indicators, facial emotions, hand movements, body movements, driving scenarios, and behavioral decisions. (3) Self-Eliciting: It can elicit implicit thought chains in the language space to further increase generalist and super-aligned abilities. Besides, we collected multiple data sets and built a large-scale benchmark. This benchmark measures the deer's perceptual decision-making ability and the super alignment's accuracy.

31. 【2505.10250】ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization

链接https://arxiv.org/abs/2505.10250

作者:Wenhao Shen,Wanqi Yin,Xiaofeng Yang,Cheng Chen,Chaoyue Song,Zhongang Cai,Lei Yang,Hao Wang,Guosheng Lin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:inherently ill-posed due, Human mesh recovery, Human mesh, ambiguity and occlusions, inherently ill-posed

备注: Accepted by ICML 2025. Code: [this https URL](https://github.com/shenwenhao01/ADHMR)

点击查看摘要

Abstract:Human mesh recovery (HMR) from a single image is inherently ill-posed due to depth ambiguity and occlusions. Probabilistic methods have tried to solve this by generating numerous plausible 3D human mesh predictions, but they often exhibit misalignment with 2D image observations and weak robustness to in-the-wild images. To address these issues, we propose ADHMR, a framework that Aligns a Diffusion-based HMR model in a preference optimization manner. First, we train a human mesh prediction assessment model, HMR-Scorer, capable of evaluating predictions even for in-the-wild images without 3D annotations. We then use HMR-Scorer to create a preference dataset, where each input image has a pair of winner and loser mesh predictions. This dataset is used to finetune the base model using direct preference optimization. Moreover, HMR-Scorer also helps improve existing HMR models by data cleaning, even with fewer training samples. Extensive experiments show that ADHMR outperforms current state-of-the-art methods. Code is available at: this https URL.

32. 【2505.10238】MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

链接https://arxiv.org/abs/2505.10238

作者:Yanbo Ding

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:developed rapidly due, Human image animation, gained increasing attention, image animation, Human image

备注

点击查看摘要

Abstract:Human image animation has gained increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information for open-world animation. To tackle this problem, we propose MTVCrafter (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for human image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatio-temporal cues and avoid strict pixel-level alignment between pose image and character, enabling more flexible and disentangled control. Then, we introduce MV-DiT (Motion-aware Video DiT). By designing unique motion attention with 4D positional encodings, MV-DiT can effectively leverage motion tokens as 4D compact yet expressive context for human image animation in the complex 3D world. Hence, it marks a significant step forward in this field and opens a new direction for pose-guided human video generation. Experiments show that our MTVCrafter achieves state-of-the-art results with an FID-VID of 6.98, surpassing the second-best by 65%. Powered by robust motion tokens, MTVCrafter also generalizes well to diverse open-world characters (single/multiple, full/half-body) across various styles and scenarios. Our video demos and code are provided in the supplementary material and at this anonymous GitHub link: this https URL.

33. 【2505.10231】On the Interplay of Human-AI Alignment,Fairness, and Performance Trade-offs in Medical Imaging

链接https://arxiv.org/abs/2505.10231

作者:Haozhe Luo,Ziyu Zhou,Zixin Shu,Aurélie Pahud de Mortanges,Robert Berke,Mauricio Reyes

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Deep neural networks, neural networks excel, Deep neural, prone to biases, demographic groups

备注

点击查看摘要

Abstract:Deep neural networks excel in medical imaging but remain prone to biases, leading to fairness gaps across demographic groups. We provide the first systematic exploration of Human-AI alignment and fairness in this domain. Our results show that incorporating human insights consistently reduces fairness gaps and enhances out-of-domain generalization, though excessive alignment can introduce performance trade-offs, emphasizing the need for calibrated strategies. These findings highlight Human-AI alignment as a promising approach for developing fair, robust, and generalizable medical AI systems, striking a balance between expert guidance and automated efficiency. Our code is available at this https URL.

34. 【2505.10223】Data-Agnostic Augmentations for Unknown Variations: Out-of-Distribution Generalisation in MRI Segmentation

链接https://arxiv.org/abs/2505.10223

作者:Puru Vaish,Felix Meister,Tobias Heimann,Christoph Brune,Jelmer M. Wolterink

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:clinical settings due, real-world clinical settings, curated datasets, leading to performance, trained on curated

备注: Accepted at MIDL 2025

点击查看摘要

Abstract:Medical image segmentation models are often trained on curated datasets, leading to performance degradation when deployed in real-world clinical settings due to mismatches between training and test distributions. While data augmentation techniques are widely used to address these challenges, traditional visually consistent augmentation strategies lack the robustness needed for diverse real-world scenarios. In this work, we systematically evaluate alternative augmentation strategies, focusing on MixUp and Auxiliary Fourier Augmentation. These methods mitigate the effects of multiple variations without explicitly targeting specific sources of distribution shifts. We demonstrate how these techniques significantly improve out-of-distribution generalization and robustness to imaging variations across a wide range of transformations in cardiac cine MRI and prostate MRI segmentation. We quantitatively find that these augmentation methods enhance learned feature representations by promoting separability and compactness. Additionally, we highlight how their integration into nnU-Net training pipelines provides an easy-to-implement, effective solution for enhancing the reliability of medical segmentation models in real-world applications.

35. 【2505.10205】VolE: A Point-cloud Framework for Food 3D Reconstruction and Volume Estimation

链接https://arxiv.org/abs/2505.10205

作者:Umair Haroon,Ahmad AlMughrabi,Thanasis Zoumpekas,Ricardo Marques,Petia Radeva

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:gathering sensor-oriented information, health monitoring applications, leveraging single-purpose hardware, Accurate food volume, medical nutrition management

备注

点击查看摘要

Abstract:Accurate food volume estimation is crucial for medical nutrition management and health monitoring applications, but current food volume estimation methods are often limited by mononuclear data, leveraging single-purpose hardware such as 3D scanners, gathering sensor-oriented information such as depth information, or relying on camera calibration using a reference object. In this paper, we present VolE, a novel framework that leverages mobile device-driven 3D reconstruction to estimate food volume. VolE captures images and camera locations in free motion to generate precise 3D models, thanks to AR-capable mobile devices. To achieve real-world measurement, VolE is a reference- and depth-free framework that leverages food video segmentation for food mask generation. We also introduce a new food dataset encompassing the challenging scenarios absent in the previous benchmarks. Our experiments demonstrate that VolE outperforms the existing volume estimation techniques across multiple datasets by achieving 2.22 % MAPE, highlighting its superior performance in food volume estimation.

36. 【2505.10169】Modeling Saliency Dataset Bias

链接https://arxiv.org/abs/2505.10169

作者:Matthias Kümmerer,Harneet Khanuja,Matthias Bethge

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:approaching gold standard, Recent advances, gold standard performance, standard performance levels, image-based saliency prediction

备注

点击查看摘要

Abstract:Recent advances in image-based saliency prediction are approaching gold standard performance levels on existing benchmarks. Despite this success, we show that predicting fixations across multiple saliency datasets remains challenging due to dataset bias. We find a significant performance drop (around 40%) when models trained on one dataset are applied to another. Surprisingly, increasing dataset diversity does not resolve this inter-dataset gap, with close to 60% attributed to dataset-specific biases. To address this remaining generalization gap, we propose a novel architecture extending a mostly dataset-agnostic encoder-decoder structure with fewer than 20 dataset-specific parameters that govern interpretable mechanisms such as multi-scale structure, center bias, and fixation spread. Adapting only these parameters to new data accounts for more than 75% of the generalization gap, with a large fraction of the improvement achieved with as few as 50 samples. Our model sets a new state-of-the-art on all three datasets of the MIT/Tuebingen Saliency Benchmark (MIT300, CAT2000, and COCO-Freeview), even when purely generalizing from unrelated datasets, but with a substantial boost when adapting to the respective training datasets. The model also provides valuable insights into spatial saliency properties, revealing complex multi-scale effects that combine both absolute and relative sizes.

37. 【2505.10152】Multi-Source Collaborative Style Augmentation and Domain-Invariant Learning for Federated Domain Generalization

链接https://arxiv.org/abs/2505.10152

作者:Yikang Wei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Collaborative Style Augmentation, Federated domain generalization, style augmentation, domain generalization, Multi-source Collaborative Style

备注: IJCAI 2025

点击查看摘要

Abstract:Federated domain generalization aims to learn a generalizable model from multiple decentralized source domains for deploying on the unseen target domain. The style augmentation methods have achieved great progress on domain generalization. However, the existing style augmentation methods either explore the data styles within isolated source domain or interpolate the style information across existing source domains under the data decentralization scenario, which leads to limited style space. To address this issue, we propose a Multi-source Collaborative Style Augmentation and Domain-invariant learning method (MCSAD) for federated domain generalization. Specifically, we propose a multi-source collaborative style augmentation module to generate data in the broader style space. Furthermore, we conduct domain-invariant learning between the original data and augmented data by cross-domain feature alignment within the same class and classes relation ensemble distillation between different classes to learn a domain-invariant model. By alternatively conducting collaborative style augmentation and domain-invariant learning, the model can generalize well on unseen target domain. Extensive experiments on multiple domain generalization datasets indicate that our method significantly outperforms the state-of-the-art federated domain generalization methods.

38. 【2505.10144】VRSplat: Fast and Robust Gaussian Splatting for Virtual Reality

链接https://arxiv.org/abs/2505.10144

作者:Xuechang Tu,Lukas Radl,Michael Steiner,Markus Steinberger,Bernhard Kerbl,Fernando de la Torre

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:providing exceptional performance, software-based GPU rasterization, Gaussian Splatting, efficient software-based GPU, novel-view synthesis

备注: I3D'25 (PACMCGIT); Project Page: [this https URL](https://cekavis.site/VRSplat/)

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has rapidly become a leading technique for novel-view synthesis, providing exceptional performance through efficient software-based GPU rasterization. Its versatility enables real-time applications, including on mobile and lower-powered devices. However, 3DGS faces key challenges in virtual reality (VR): (1) temporal artifacts, such as popping during head movements, (2) projection-based distortions that result in disturbing and view-inconsistent floaters, and (3) reduced framerates when rendering large numbers of Gaussians, falling below the critical threshold for VR. Compared to desktop environments, these issues are drastically amplified by large field-of-view, constant head movements, and high resolution of head-mounted displays (HMDs). In this work, we introduce VRSplat: we combine and extend several recent advancements in 3DGS to address challenges of VR holistically. We show how the ideas of Mini-Splatting, StopThePop, and Optimal Projection can complement each other, by modifying the individual techniques and core 3DGS rasterizer. Additionally, we propose an efficient foveated rasterizer that handles focus and peripheral areas in a single GPU launch, avoiding redundant computations and improving GPU utilization. Our method also incorporates a fine-tuning step that optimizes Gaussian parameters based on StopThePop depth evaluations and Optimal Projection. We validate our method through a controlled user study with 25 participants, showing a strong preference for VRSplat over other configurations of Mini-Splatting. VRSplat is the first, systematically evaluated 3DGS approach capable of supporting modern VR applications, achieving 72+ FPS while eliminating popping and stereo-disrupting floaters.

39. 【2505.10124】IMITATE: Image Registration with Context for unknown time frame recovery

链接https://arxiv.org/abs/2505.10124

作者:Ziad Kheil,Lucas Robinet,Laurent Risser,Soleakhena Ken

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:registration formalism dedicated, unknown condition-related images, estimation of unknown, unknown condition-related, image registration formalism

备注: IEEE ISBI 2025

点击查看摘要

Abstract:In this paper, we formulate a novel image registration formalism dedicated to the estimation of unknown condition-related images, based on two or more known images and their associated conditions. We show how to practically model this formalism by using a new conditional U-Net architecture, which fully takes into account the conditional information and does not need any fixed image. Our formalism is then applied to image moving tumors for radiotherapy treatment at different breathing amplitude using 4D-CT (3D+t) scans in thoracoabdominal regions. This driving application is particularly complex as it requires to stitch a collection of sequential 2D slices into several 3D volumes at different organ positions. Movement interpolation with standard methods then generates well known reconstruction artefacts in the assembled volumes due to irregular patient breathing, hysteresis and poor correlation of breathing signal to internal motion. Results obtained on 4D-CT clinical data showcase artefact-free volumes achieved through real-time latencies. The code is publicly available at this https URL .

40. 【2505.10118】Why 1 + 1 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

链接https://arxiv.org/abs/2505.10118

作者:Yangfu Li,Hongjian Zhan,Tianyi Chen,Qi Liu,Yue Lu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:methods target prompt, target prompt alignment, varying relative importance, Existing visual token, pruning methods target

备注: 31 pages,9 figures,conference

点击查看摘要

Abstract:Existing visual token pruning methods target prompt alignment and visual preservation with static strategies, overlooking the varying relative importance of these objectives across tasks, which leads to inconsistent performance. To address this, we derive the first closed-form error bound for visual token pruning based on the Hausdorff distance, uniformly characterizing the contributions of both objectives. Moreover, leveraging $\epsilon$-covering theory, we reveal an intrinsic trade-off between these objectives and quantify their optimal attainment levels under a fixed budget. To practically handle this trade-off, we propose Multi-Objective Balanced Covering (MoB), which reformulates visual token pruning as a bi-objective covering problem. In this framework, the attainment trade-off reduces to budget allocation via greedy radius trading. MoB offers a provable performance bound and linear scalability with respect to the number of input visual tokens, enabling adaptation to challenging pruning scenarios. Extensive experiments show that MoB preserves 96.4% of performance for LLaVA-1.5-7B using only 11.1% of the original visual tokens and accelerates LLaVA-Next-7B by 1.3-1.5$\times$ with negligible performance loss. Additionally, evaluations on Qwen2-VL and Video-LLaVA confirm that MoB integrates seamlessly into advanced MLLMs and diverse vision-language tasks.

41. 【2505.10088】MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models

链接https://arxiv.org/abs/2505.10088

作者:Yuncheng Guo,Xiaodong Gu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large-scale pre-trained Vision-Language, advanced transfer learning, pre-trained Vision-Language Models, Large-scale pre-trained, significantly advanced transfer

备注: Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract appearing here is slightly shorter than that in the PDF file

点击查看摘要

Abstract:Large-scale pre-trained Vision-Language Models (VLMs) have significantly advanced transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, undermining their ability to generalize to new tasks. To address this, we propose Multi-Modal Representation Learning (MMRL), which introduces a shared, learnable, modality-agnostic representation space. MMRL generates space tokens projected into both text and image encoders as representation tokens, enabling more effective cross-modal interactions. Unlike prior methods that mainly optimize class token features, MMRL inserts representation tokens into higher encoder layers--where task-specific features are more prominent--while preserving general knowledge in the lower layers. During training, both class and representation features are jointly optimized: a trainable projection layer is applied to representation tokens for task adaptation, while the projection layer for class token remains frozen to retain pre-trained knowledge. To further promote generalization, we introduce a regularization term aligning class and text features with the frozen VLM's zero-shot features. At inference, a decoupling strategy uses both class and representation features for base tasks, but only class features for novel tasks due to their stronger generalization. Building upon this, we propose MMRL++, a parameter-efficient and interaction-aware extension that significantly reduces trainable parameters and enhances intra-modal interactions--particularly across the layers of representation tokens--allowing gradient sharing and instance-specific information to propagate more effectively through the network. Extensive experiments on 15 datasets demonstrate that MMRL and MMRL++ consistently outperform state-of-the-art methods, achieving a strong balance between task-specific adaptation and generalization.

42. 【2505.10075】FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation

链接https://arxiv.org/abs/2505.10075

作者:Jun Guo,Xiaojian Ma,Yikai Wang,Min Yang,Huaping Liu,Qing Li

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:paper investigates training, world models, paper investigates, investigates training, observations by conditioning

备注: Project page: see [this https URL](https://sharinka0715.github.io/FlowDreamer/)

点击查看摘要

Abstract:This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as explicit motion representations. FlowDreamer first predicts 3D scene flow from past frame and action conditions with a U-Net, and then a diffusion model will predict the future frame utilizing the scene flow. FlowDreamer is trained end-to-end despite its modularized nature. We conduct experiments on 4 different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer achieves better performance compared to other baseline RGB-D world models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate in various robot manipulation domains.

43. 【2505.10072】oonifyGB: StyleGAN-based Gaussian Blendshapes for 3D Stylized Head Avatars

链接https://arxiv.org/abs/2505.10072

作者:Rui-Yang Ju,Sheng-Yen Huang,Yi-Ping Hung

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian blendshapes, animatable head avatars, enabled the real-time, real-time reconstruction, reconstruction of animatable

备注

点击查看摘要

Abstract:The introduction of 3D Gaussian blendshapes has enabled the real-time reconstruction of animatable head avatars from monocular video. Toonify, a StyleGAN-based framework, has become widely used for facial image stylization. To extend Toonify for synthesizing diverse stylized 3D head avatars using Gaussian blendshapes, we propose an efficient two-stage framework, ToonifyGB. In Stage 1 (stylized video generation), we employ an improved StyleGAN to generate the stylized video from the input video frames, which addresses the limitation of cropping aligned faces at a fixed resolution as preprocessing for normal StyleGAN. This process provides a more stable video, which enables Gaussian blendshapes to better capture the high-frequency details of the video frames, and efficiently generate high-quality animation in the next stage. In Stage 2 (Gaussian blendshapes synthesis), we learn a stylized neutral head model and a set of expression blendshapes from the generated video. By combining the neutral head model with expression blendshapes, ToonifyGB can efficiently render stylized avatars with arbitrary expressions. We validate the effectiveness of ToonifyGB on the benchmark dataset using two styles: Arcane and Pixar.

44. 【2505.10055】PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language

链接https://arxiv.org/abs/2505.10055

作者:Ijazul Haq,Yingjie Zhang,Irfan Ali Khan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Optical Character Recognition, Large Multimodal Models, Character Recognition, Large Multimodal, Optical Character

备注

点击查看摘要

Abstract:This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels, suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek's Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results demonstrate that Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out. This work provides an insightful assessment of the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research not only in Pashto OCR but also for other similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at this https URL.

45. 【2505.10049】Advances in Radiance Field for Dynamic Scene: From Neural Field to Gaussian Field

链接https://arxiv.org/abs/2505.10049

作者:Jinlong Fan,Xuepu Zeng,Jing Zhang,Mingming Gong,Yuxiang Yang,Dacheng Tao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:undergone transformative advances, Dynamic scene representation, Dynamic scene, dynamic scene reconstruction, Gaussian splatting techniques

备注

点击查看摘要

Abstract:Dynamic scene representation and reconstruction have undergone transformative advances in recent years, catalyzed by breakthroughs in neural radiance fields and 3D Gaussian splatting techniques. While initially developed for static environments, these methodologies have rapidly evolved to address the complexities inherent in 4D dynamic scenes through an expansive body of research. Coupled with innovations in differentiable volumetric rendering, these approaches have significantly enhanced the quality of motion representation and dynamic scene reconstruction, thereby garnering substantial attention from the computer vision and graphics communities. This survey presents a systematic analysis of over 200 papers focused on dynamic scene representation using radiance field, spanning the spectrum from implicit neural representations to explicit Gaussian primitives. We categorize and evaluate these works through multiple critical lenses: motion representation paradigms, reconstruction techniques for varied scene dynamics, auxiliary information integration strategies, and regularization approaches that ensure temporal consistency and physical plausibility. We organize diverse methodological approaches under a unified representational framework, concluding with a critical examination of persistent challenges and promising research directions. By providing this comprehensive overview, we aim to establish a definitive reference for researchers entering this rapidly evolving field while offering experienced practitioners a systematic understanding of both conceptual principles and practical frontiers in dynamic scene reconstruction.

46. 【2505.10046】Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

链接https://arxiv.org/abs/2505.10046

作者:Bingda Tang,Boyang Zheng,Xichen Pan,Sayak Paul,Saining Xie

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:large language models, understudied design space, design space related, language models, diffusion transformers

备注

点击查看摘要

Abstract:This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines for future research in multi-modal generation.

47. 【2505.10030】DeepSeqCoco: A Robust Mobile Friendly Deep Learning Model for Detection of Diseases in Cocos nucifera

链接https://arxiv.org/abs/2505.10030

作者:Miit Daga,Dhriti Parikh,Swarna Priya Ramu

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:conventional farming practices, farming practices restrict, practices restrict early, restrict early diagnosis, agricultural yield

备注: This paper is accepted for publication in IEEE Access journal and is currently pending revisions before publication

点击查看摘要

Abstract:Coconut tree diseases are a serious risk to agricultural yield, particularly in developing countries where conventional farming practices restrict early diagnosis and intervention. Current disease identification methods are manual, labor-intensive, and non-scalable. In response to these limitations, we come up with DeepSeqCoco, a deep learning based model for accurate and automatic disease identification from coconut tree images. The model was tested under various optimizer settings, such as SGD, Adam, and hybrid configurations, to identify the optimal balance between accuracy, minimization of loss, and computational cost. Results from experiments indicate that DeepSeqCoco can achieve as much as 99.5% accuracy (achieving up to 5% higher accuracy than existing models) with the hybrid SGD-Adam showing the lowest validation loss of 2.81%. It also shows a drop of up to 18% in training time and up to 85% in prediction time for input images. The results point out the promise of the model to improve precision agriculture through an AI-based, scalable, and efficient disease monitoring system.

48. 【2505.10027】ORL-LDM: Offline Reinforcement Learning Guided Latent Diffusion Model Super-Resolution Reconstruction

链接https://arxiv.org/abs/2505.10027

作者:Shijie Lyu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:remote sensing technology, practical significance, super-resolution image reconstruction, rapid advancement, great research

备注: Accepted by the 4th International Conference on Computing Innovation and Applied Physics (CONF-CIAP 2025), and will be published in EAI Community Research Series-CORE or Theoretical and Natural Science (TNS)

点击查看摘要

Abstract:With the rapid advancement of remote sensing technology, super-resolution image reconstruction is of great research and practical significance. Existing deep learning methods have made progress but still face limitations in handling complex scenes and preserving image details. This paper proposes a reinforcement learning-based latent diffusion model (LDM) fine-tuning method for remote sensing image super-resolution. The method constructs a reinforcement learning environment with states, actions, and rewards, optimizing decision objectives through proximal policy optimization (PPO) during the reverse denoising process of the LDM model. Experiments on the RESISC45 dataset show significant improvements over the baseline model in PSNR, SSIM, and LPIPS, with PSNR increasing by 3-4dB, SSIM improving by 0.08-0.11, and LPIPS reducing by 0.06-0.10, particularly in structured and complex natural scenes. The results demonstrate the method's effectiveness in enhancing super-resolution quality and adaptability across scenes.

49. 【2505.10016】Application of YOLOv8 in monocular downward multiple Car Target detection

链接https://arxiv.org/abs/2505.10016

作者:Shijie Lyu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:car driving methods, progressively transforming traditional, transforming traditional car, traditional car driving, modern transportation

备注: Accepted by the 5th International Conference on Signal Processing and Machine Learning (CONF-SPML 2025), to appear in Applied and Computational Engineering

点击查看摘要

Abstract:Autonomous driving technology is progressively transforming traditional car driving methods, marking a significant milestone in modern transportation. Object detection serves as a cornerstone of autonomous systems, playing a vital role in enhancing driving safety, enabling autonomous functionality, improving traffic efficiency, and facilitating effective emergency responses. However, current technologies such as radar for environmental perception, cameras for road perception, and vehicle sensor networks face notable challenges, including high costs, vulnerability to weather and lighting conditions, and limited this http URL address these limitations, this paper presents an improved autonomous target detection network based on YOLOv8. By integrating structural reparameterization technology, a bidirectional pyramid structure network model, and a novel detection pipeline into the YOLOv8 framework, the proposed approach achieves highly efficient and precise detection of multi-scale, small, and remote objects. Experimental results demonstrate that the enhanced model can effectively detect both large and small objects with a detection accuracy of 65%, showcasing significant advancements over traditional this http URL improved model holds substantial potential for real-world applications and is well-suited for autonomous driving competitions, such as the Formula Student Autonomous China (FSAC), particularly excelling in scenarios involving single-target and small-object detection.

50. 【2505.09998】From Air to Wear: Personalized 3D Digital Fashion with AR/VR Immersive 3D Sketching

链接https://arxiv.org/abs/2505.09998

作者:Ying Zang,Yuanqi Hu,Xinyu Chen,Yuxia Xu,Suhui Wang,Chunan Yu,Lanyun Zhu,Deyi Ji,Xin Xu,Tianrun Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:people increasingly seek, immersive consumer electronics, smart devices, people increasingly, era of immersive

备注: 8 pages, 5 figures

点击查看摘要

Abstract:In the era of immersive consumer electronics, such as AR/VR headsets and smart devices, people increasingly seek ways to express their identity through virtual fashion. However, existing 3D garment design tools remain inaccessible to everyday users due to steep technical barriers and limited data. In this work, we introduce a 3D sketch-driven 3D garment generation framework that empowers ordinary users - even those without design experience - to create high-quality digital clothing through simple 3D sketches in AR/VR environments. By combining a conditional diffusion model, a sketch encoder trained in a shared latent space, and an adaptive curriculum learning strategy, our system interprets imprecise, free-hand input and produces realistic, personalized garments. To address the scarcity of training data, we also introduce KO3DClothes, a new dataset of paired 3D garments and user-created sketches. Extensive experiments and user studies confirm that our method significantly outperforms existing baselines in both fidelity and usability, demonstrating its promise for democratized fashion design on next-generation consumer platforms.

51. 【2505.09997】Descriptive Image-Text Matching with Graded Contextual Similarity

链接https://arxiv.org/abs/2505.09997

作者:Jinhyun Jang,Jiyeong Lee,Kwanghoon Sohn

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Image-text matching aims, visual and textual, textual data, data by learning, Image-text matching

备注

点击查看摘要

Abstract:Image-text matching aims to build correspondences between visual and textual data by learning their pairwise similarities. Most existing approaches have adopted sparse binary supervision, indicating whether a pair of images and sentences matches or not. However, such sparse supervision covers a limited subset of image-text relationships, neglecting their inherent many-to-many correspondences; an image can be described in numerous texts at different descriptive levels. Moreover, existing approaches overlook the implicit connections from general to specific descriptions, which form the underlying rationale for the many-to-many relationships between vision and language. In this work, we propose descriptive image-text matching, called DITM, to learn the graded contextual similarity between image and text by exploring the descriptive flexibility of language. We formulate the descriptiveness score of each sentence with cumulative term frequency-inverse document frequency (TF-IDF) to balance the pairwise similarity according to the keywords in the sentence. Our method leverages sentence descriptiveness to learn robust image-text matching in two key ways: (1) to refine the false negative labeling, dynamically relaxing the connectivity between positive and negative pairs, and (2) to build more precise matching, aligning a set of relevant sentences in a generic-to-specific order. By moving beyond rigid binary supervision, DITM enhances the discovery of both optimal matches and potential positive pairs. Extensive experiments on MS-COCO, Flickr30K, and CxC datasets demonstrate the effectiveness of our method in representing complex image-text relationships compared to state-of-the-art approaches. In addition, DITM enhances the hierarchical reasoning ability of the model, supported by the extensive analysis on HierarCaps benchmark.

52. 【2505.09990】PointArena: Probing Multimodal Grounding Through Language-Guided Pointing

链接https://arxiv.org/abs/2505.09990

作者:Long Cheng,Jiafei Duan,Yi Ru Wang,Haoquan Fang,Boyang Li,Yushan Huang,Elvis Wang,Ainaz Eftekhar,Jason Lee,Wentao Yuan,Rose Hendrix,Noah A. Smith,Fei Xia,Dieter Fox,Ranjay Krishna

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:applications spanning robotics, assistive technologies, visual contexts, fundamental and intuitive, intuitive mechanism

备注: 10 Pages, Dataset and code: [this https URL](https://pointarena.github.io/)

点击查看摘要

Abstract:Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to support pointing capabilities, existing benchmarks typically focus only on referential object localization tasks. We introduce PointArena, a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios. PointArena comprises three components: (1) Point-Bench, a curated dataset containing approximately 1,000 pointing tasks across five reasoning categories; (2) Point-Battle, an interactive, web-based arena facilitating blind, pairwise model comparisons, which has already gathered over 4,500 anonymized votes; and (3) Point-Act, a real-world robotic manipulation system allowing users to directly evaluate multimodal model pointing capabilities in practical settings. We conducted extensive evaluations of both state-of-the-art open-source and proprietary multimodal models. Results indicate that Molmo-72B consistently outperforms other models, though proprietary models increasingly demonstrate comparable performance. Additionally, we find that supervised training specifically targeting pointing tasks significantly enhances model performance. Across our multi-stage evaluation pipeline, we also observe strong correlations, underscoring the critical role of precise pointing capabilities in enabling multimodal models to effectively bridge abstract reasoning with concrete, real-world actions. Project page: this https URL

53. 【2505.09986】High Quality Underwater Image Compression with Adaptive Correction and Codebook-based Augmentation

链接https://arxiv.org/abs/2505.09986

作者:Yimin Zhou,Yichong Xia,Sicheng Pan,Bin Chen,Baoyi An,Haoqian Wang,Zhi Wang,Yaowei Wang,Zikun Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:driving extensive research, marine environments, transmission and storage, increasing exploration, exploration and exploitation

备注

点击查看摘要

Abstract:With the increasing exploration and exploitation of the underwater world, underwater images have become a critical medium for human interaction with marine environments, driving extensive research into their efficient transmission and storage. However, contemporary underwater image compression algorithms fail to fully leverage the unique characteristics distinguishing underwater scenes from terrestrial images, resulting in suboptimal performance. To address this limitation, we introduce HQUIC, designed to exploit underwater-image-specific features for enhanced compression efficiency. HQUIC employs an ALTC module to adaptively predict the attenuation coefficients and global light information of the images, which effectively mitigates the issues caused by the differences in lighting and tone existing in underwater images. Subsequently, HQUIC employs a codebook as an auxiliary branch to extract the common objects within underwater images and enhances the performance of the main branch. Furthermore, HQUIC dynamically weights multi-scale frequency components, prioritizing information critical for distortion quality while discarding redundant details. Extensive evaluations on diverse underwater datasets demonstrate that HQUIC outperforms state-of-the-art compression methods.

54. 【2505.09971】APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds

链接https://arxiv.org/abs/2505.09971

作者:Yuan Gao,Shaobo Xia,Sheng Nie,Cheng Wang,Xiaohuan Xi,Bisheng Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Airborne laser scanning, Airborne laser, ALS point cloud, ALS point, scene understanding

备注: 18 pages,12 figures

点击查看摘要

Abstract:Airborne laser scanning (ALS) point cloud segmentation is a fundamental task for large-scale 3D scene understanding. In real-world applications, models are typically fixed after training. However, domain shifts caused by changes in the environment, sensor types, or sensor degradation often lead to a decline in model performance. Continuous Test-Time Adaptation (CTTA) offers a solution by adapting a source-pretrained model to evolving, unlabeled target domains. Despite its potential, research on ALS point clouds remains limited, facing challenges such as the absence of standardized datasets and the risk of catastrophic forgetting and error accumulation during prolonged adaptation. To tackle these challenges, we propose APCoTTA, the first CTTA method tailored for ALS point cloud semantic segmentation. We propose a dynamic trainable layer selection module. This module utilizes gradient information to select low-confidence layers for training, and the remaining layers are kept frozen, mitigating catastrophic forgetting. To further reduce error accumulation, we propose an entropy-based consistency loss. By losing such samples based on entropy, we apply consistency loss only to the reliable samples, enhancing model stability. In addition, we propose a random parameter interpolation mechanism, which randomly blends parameters from the selected trainable layers with those of the source model. This approach helps balance target adaptation and source knowledge retention, further alleviating forgetting. Finally, we construct two benchmarks, ISPRSC and H3DC, to address the lack of CTTA benchmarks for ALS point cloud segmentation. Experimental results demonstrate that APCoTTA achieves the best performance on two benchmarks, with mIoU improvements of approximately 9% and 14% over direct inference. The new benchmarks and code are available at this https URL.

55. 【2505.09967】KFNet: Learning Texture Key Factor Driven Feature for Facial Expression Recognition

链接https://arxiv.org/abs/2505.09967

作者:Liqian Deng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:challenging task due, Facial expression recognition, Key Driver Factors, Texture Key Driver, expression recognition

备注

点击查看摘要

Abstract:Facial expression recognition (FER) in the wild remains a challenging task due to the subtle and localized nature of expression-related features, as well as the complex variations in facial appearance. In this paper, we introduce a novel framework that explicitly focuses on Texture Key Driver Factors (TKDF), localized texture regions that exhibit strong discriminative power across emotional categories. By carefully observing facial image patterns, we identify that certain texture cues, such as micro-changes in skin around the brows, eyes, and mouth, serve as primary indicators of emotional dynamics. To effectively capture and leverage these cues, we propose a FER architecture comprising a Texture-Aware Feature Extractor (TAFE) and Dual Contextual Information Filtering (DCIF). TAFE employs a ResNet-based backbone enhanced with multi-branch attention to extract fine-grained texture representations, while DCIF refines these features by filtering context through adaptive pooling and attention mechanisms. Experimental results on RAF-DB and KDEF datasets demonstrate that our method achieves state-of-the-art performance, verifying the effectiveness and robustness of incorporating TKDFs into FER pipelines.

56. 【2505.09965】MambaControl: Anatomy Graph-Enhanced Mamba ControlNet with Fourier Refinement for Diffusion-Based Disease Trajectory Prediction

链接https://arxiv.org/abs/2505.09965

作者:Hao Yang,Tao Tan,Shuai Tan,Weiqin Yang,Kunyan Cai,Calvin Chen,Yue Sun

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:precision medicine requires, medicine requires capturing, requires capturing complex, capturing complex spatio-temporal, complex spatio-temporal dynamics

备注

点击查看摘要

Abstract:Modelling disease progression in precision medicine requires capturing complex spatio-temporal dynamics while preserving anatomical integrity. Existing methods often struggle with longitudinal dependencies and structural consistency in progressive disorders. To address these limitations, we introduce MambaControl, a novel framework that integrates selective state-space modelling with diffusion processes for high-fidelity prediction of medical image trajectories. To better capture subtle structural changes over time while maintaining anatomical consistency, MambaControl combines Mamba-based long-range modelling with graph-guided anatomical control to more effectively represent anatomical correlations. Furthermore, we introduce Fourier-enhanced spectral graph representations to capture spatial coherence and multiscale detail, enabling MambaControl to achieve state-of-the-art performance in Alzheimer's disease prediction. Quantitative and regional evaluations demonstrate improved progression prediction quality and anatomical fidelity, highlighting its potential for personalised prognosis and clinical decision support.

57. 【2505.09943】CSPENet: Contour-Aware and Saliency Priors Embedding Network for Infrared Small Target Detection

链接https://arxiv.org/abs/2505.09943

作者:Jiakun Deng,Kexuan Li,Xingye Cui,Jiaxuan Li,Chang Long,Tian Pu,Zhenming Peng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Infrared small target, Infrared small, plays a critical, military applications, critical role

备注

点击查看摘要

Abstract:Infrared small target detection (ISTD) plays a critical role in a wide range of civilian and military applications. Existing methods suffer from deficiencies in the localization of dim targets and the perception of contour information under dense clutter environments, severely limiting their detection performance. To tackle these issues, we propose a contour-aware and saliency priors embedding network (CSPENet) for ISTD. We first design a surround-convergent prior extraction module (SCPEM) that effectively captures the intrinsic characteristic of target contour pixel gradients converging toward their center. This module concurrently extracts two collaborative priors: a boosted saliency prior for accurate target localization and multi-scale structural priors for comprehensively enriching contour detail representation. Building upon this, we propose a dual-branch priors embedding architecture (DBPEA) that establishes differentiated feature fusion pathways, embedding these two priors at optimal network positions to achieve performance enhancement. Finally, we develop an attention-guided feature enhancement module (AGFEM) to refine feature representations and improve saliency estimation accuracy. Experimental results on public datasets NUDT-SIRST, IRSTD-1k, and NUAA-SIRST demonstrate that our CSPENet outperforms other state-of-the-art methods in detection performance. The code is available at this https URL.

58. 【2505.09939】Non-Registration Change Detection: A Novel Change Detection Task and Benchmark Dataset

链接https://arxiv.org/abs/2505.09939

作者:Zhe Shan,Lei Zhou,Liu Mao,Shaofan Chen,Chuanqiu Ren,Xia Xie

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:remote sensing change, non-registration change detection, change detection task, change detection, sensing change detection

备注: Accepted to IGARSS 2025

点击查看摘要

Abstract:In this study, we propose a novel remote sensing change detection task, non-registration change detection, to address the increasing number of emergencies such as natural disasters, anthropogenic accidents, and military strikes. First, in light of the limited discourse on the issue of non-registration change detection, we systematically propose eight scenarios that could arise in the real world and potentially contribute to the occurrence of non-registration problems. Second, we develop distinct image transformation schemes tailored to various scenarios to convert the available registration change detection dataset into a non-registration version. Finally, we demonstrate that non-registration change detection can cause catastrophic damage to the state-of-the-art methods. Our code and dataset are available at this https URL.

59. 【2505.09935】VRU-CIPI: Crossing Intention Prediction at Intersections for Improving Vulnerable Road Users Safety

链接https://arxiv.org/abs/2505.09935

作者:Ahmed S. Abdelrahman,Mohamed Abdel-Aty,Quoc Dai Tran

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Vulnerable Road Users, human behavior in-thewild, road users, remains crucial, crucial for enhancing

备注

点击查看摘要

Abstract:Understanding and predicting human behavior in-thewild, particularly at urban intersections, remains crucial for enhancing interaction safety between road users. Among the most critical behaviors are crossing intentions of Vulnerable Road Users (VRUs), where misinterpretation may result in dangerous conflicts with oncoming vehicles. In this work, we propose the VRU-CIPI framework with a sequential attention-based model designed to predict VRU crossing intentions at intersections. VRU-CIPI employs Gated Recurrent Unit (GRU) to capture temporal dynamics in VRU movements, combined with a multi-head Transformer self-attention mechanism to encode contextual and spatial dependencies critical for predicting crossing direction. Evaluated on UCF-VRU dataset, our proposed achieves state-of-the-art performance with an accuracy of 96.45% and achieving real-time inference speed reaching 33 frames per second. Furthermore, by integrating with Infrastructure-to-Vehicles (I2V) communication, our approach can proactively enhance intersection safety through timely activation of crossing signals and providing early warnings to connected vehicles, ensuring smoother and safer interactions for all road users.

60. 【2505.09927】DDFP: Data-dependent Frequency Prompt for Source Free Domain Adaptation of Medical Image Segmentation

链接https://arxiv.org/abs/2505.09927

作者:Siqi Yin,Shaolei Liu,Manning Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:performance degradation caused, Domain, target domain, model performance degradation, target

备注

点击查看摘要

Abstract:Domain adaptation addresses the challenge of model performance degradation caused by domain gaps. In the typical setup for unsupervised domain adaptation, labeled data from a source domain and unlabeled data from a target domain are used to train a target model. However, access to labeled source domain data, particularly in medical datasets, can be restricted due to privacy policies. As a result, research has increasingly shifted to source-free domain adaptation (SFDA), which requires only a pretrained model from the source domain and unlabeled data from the target domain data for adaptation. Existing SFDA methods often rely on domain-specific image style translation and self-supervision techniques to bridge the domain gap and train the target domain model. However, the quality of domain-specific style-translated images and pseudo-labels produced by these methods still leaves room for improvement. Moreover, training the entire model during adaptation can be inefficient under limited supervision. In this paper, we propose a novel SFDA framework to address these challenges. Specifically, to effectively mitigate the impact of domain gap in the initial training phase, we introduce preadaptation to generate a preadapted model, which serves as an initialization of target model and allows for the generation of high-quality enhanced pseudo-labels without introducing extra parameters. Additionally, we propose a data-dependent frequency prompt to more effectively translate target domain images into a source-like style. To further enhance adaptation, we employ a style-related layer fine-tuning strategy, specifically designed for SFDA, to train the target model using the prompted target domain images and pseudo-labels. Extensive experiments on cross-modality abdominal and cardiac SFDA segmentation tasks demonstrate that our proposed method outperforms existing state-of-the-art methods.

61. 【2505.09926】AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection

链接https://arxiv.org/abs/2505.09926

作者:Bin-Bin Gao,Yue Zhu,Jiangtao Yan,Yuezhi Cai,Weixi Zhang,Meng Wang,Jun Liu,Yong Liu,Lei Wang,Chengjie Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Universal visual anomaly, unseen vision domains, open scenarios, Universal visual, aims to identify

备注: 27 pages, 15 figures, 22 tables

点击查看摘要

Abstract:Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle with designing prompt templates, complex token interactions, or requiring additional fine-tuning, resulting in limited flexibility. In this work, we present a simple yet effective method called AdaptCLIP based on two key insights. First, adaptive visual and textual representations should be learned alternately rather than jointly. Second, comparative learning between query and normal image prompt should incorporate both contextual and aligned residual features, rather than relying solely on residual features. AdaptCLIP treats CLIP models as a foundational service, adding only three simple adapters, visual adapter, textual adapter, and prompt-query adapter, at its input or output ends. AdaptCLIP supports zero-/few-shot generalization across domains and possesses a training-free manner on target domains once trained on a base dataset. AdaptCLIP achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing competitive methods. We will make the code and model of AdaptCLIP available at this https URL.

62. 【2505.09915】Large-Scale Gaussian Splatting SLAM

链接https://arxiv.org/abs/2505.09915

作者:Zhe Xin,Chenyang Wu,Penghui Huang,Yanyong Zhang,Yinian Mao,Guoquan Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Neural Radiance Fields, Radiance Fields, developed Neural Radiance, recently developed Neural, Neural Radiance

备注

点击查看摘要

Abstract:The recently developed Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown encouraging and impressive results for visual SLAM. However, most representative methods require RGBD sensors and are only available for indoor environments. The robustness of reconstruction in large-scale outdoor scenarios remains unexplored. This paper introduces a large-scale 3DGS-based visual SLAM with stereo cameras, termed LSG-SLAM. The proposed LSG-SLAM employs a multi-modality strategy to estimate prior poses under large view changes. In tracking, we introduce feature-alignment warping constraints to alleviate the adverse effects of appearance similarity in rendering losses. For the scalability of large-scale scenarios, we introduce continuous Gaussian Splatting submaps to tackle unbounded scenes with limited memory. Loops are detected between GS submaps by place recognition and the relative pose between looped keyframes is optimized utilizing rendering and feature warping losses. After the global optimization of camera poses and Gaussian points, a structure refinement module enhances the reconstruction quality. With extensive evaluations on the EuRoc and KITTI datasets, LSG-SLAM achieves superior performance over existing Neural, 3DGS-based, and even traditional approaches. Project page: this https URL.

63. 【2505.09859】Few-Shot Learning of Visual Compositional Concepts through Probabilistic Schema Induction

链接https://arxiv.org/abs/2505.09859

作者:Andrew Jun Lee,Taylor Webb,Trevor Bihl,Keith Holyoak,Hongjing Lu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Probabilistic Schema Induction, structured representations, human cognition, ability to learn, hallmark of human

备注: Lee, A. J., Webb, T., Bihl, T., Holyoak, K. J., Lu, H. (2025). Few-shot learning of visual compositional concepts through probabilistic schema induction. In A. Ruggeri, D. Barner, C. Walker, N. Bramley (Eds.), Proceedings of the 47th Annual Conference of the Cognitive Science Society. Cognitive Science Society

点击查看摘要

Abstract:The ability to learn new visual concepts from limited examples is a hallmark of human cognition. While traditional category learning models represent each example as an unstructured feature vector, compositional concept learning is thought to depend on (1) structured representations of examples (e.g., directed graphs consisting of objects and their relations) and (2) the identification of shared relational structure across examples through analogical mapping. Here, we introduce Probabilistic Schema Induction (PSI), a prototype model that employs deep learning to perform analogical mapping over structured representations of only a handful of examples, forming a compositional concept called a schema. In doing so, PSI relies on a novel conception of similarity that weighs object-level similarity and relational similarity, as well as a mechanism for amplifying relations relevant to classification, analogous to selective attention parameters in traditional models. We show that PSI produces human-like learning performance and outperforms two controls: a prototype model that uses unstructured feature vectors extracted from a deep learning model, and a variant of PSI with weaker structured representations. Notably, we find that PSI's human-like performance is driven by an adaptive strategy that increases relational similarity over object-level similarity and upweights the contribution of relations that distinguish classes. These findings suggest that structured representations and analogical mapping are critical to modeling rapid human-like learning of compositional visual concepts, and demonstrate how deep learning can be leveraged to create psychological models.

64. 【2505.09858】Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models

链接https://arxiv.org/abs/2505.09858

作者:Danush Kumar Venkatesh,Isabel Funke,Micha Pfeiffer,Fiona Kolbinger,Hanna Maria Schmeiser,Juergen Weitz,Marius Distler,Stefanie Speidel

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Computer-assisted interventions, improve intra-operative guidance, deep learning methods, interventions can improve, deep learning

备注: Early accept at MICCAI 2025

点击查看摘要

Abstract:Computer-assisted interventions can improve intra-operative guidance, particularly through deep learning methods that harness the spatiotemporal information in surgical videos. However, the severe data imbalance often found in surgical video datasets hinders the development of high-performing models. In this work, we aim to overcome the data imbalance by synthesizing surgical videos. We propose a unique two-stage, text-conditioned diffusion-based method to generate high-fidelity surgical videos for under-represented classes. Our approach conditions the generation process on text prompts and decouples spatial and temporal modeling by utilizing a 2D latent diffusion model to capture spatial content and then integrating temporal attention layers to ensure temporal consistency. Furthermore, we introduce a rejection sampling strategy to select the most suitable synthetic samples, effectively augmenting existing datasets to address class imbalance. We evaluate our method on two downstream tasks-surgical action recognition and intra-operative event prediction-demonstrating that incorporating synthetic videos from our approach substantially enhances model performance. We open-source our implementation at this https URL.

65. 【2505.09829】BoundarySeg:An Embarrassingly Simple Method To Boost Medical Image Segmentation Performance for Low Data Regimes

链接https://arxiv.org/abs/2505.09829

作者:Tushar Kataria,Shireen Y. Elhabian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Obtaining large-scale medical, stringent privacy regulations, Obtaining large-scale, data protection policies, protection policies

备注

点击查看摘要

Abstract:Obtaining large-scale medical data, annotated or unannotated, is challenging due to stringent privacy regulations and data protection policies. In addition, annotating medical images requires that domain experts manually delineate anatomical structures, making the process both time-consuming and costly. As a result, semi-supervised methods have gained popularity for reducing annotation costs. However, the performance of semi-supervised methods is heavily dependent on the availability of unannotated data, and their effectiveness declines when such data are scarce or absent. To overcome this limitation, we propose a simple, yet effective and computationally efficient approach for medical image segmentation that leverages only existing annotations. We propose BoundarySeg , a multi-task framework that incorporates organ boundary prediction as an auxiliary task to full organ segmentation, leveraging consistency between the two task predictions to provide additional supervision. This strategy improves segmentation accuracy, especially in low data regimes, allowing our method to achieve performance comparable to or exceeding state-of-the-art semi supervised approaches all without relying on unannotated data or increasing computational demands. Code will be released upon acceptance.

66. 【2505.09827】Dyadic Mamba: Long-term Dyadic Human Motion Synthesis

链接https://arxiv.org/abs/2505.09827

作者:Julian Tanke,Takashi Shibuya,Kengo Uchida,Koichi Saito,Yuki Mitsufuji

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Generating realistic dyadic, presents significant challenges, exceed typical training, Generating realistic, dyadic human motion

备注: CVPR 2025 HuMoGen Workshop

点击查看摘要

Abstract:Generating realistic dyadic human motion from text descriptions presents significant challenges, particularly for extended interactions that exceed typical training sequence lengths. While recent transformer-based approaches have shown promising results for short-term dyadic motion synthesis, they struggle with longer sequences due to inherent limitations in positional encoding schemes. In this paper, we introduce Dyadic Mamba, a novel approach that leverages State-Space Models (SSMs) to generate high-quality dyadic human motion of arbitrary length. Our method employs a simple yet effective architecture that facilitates information flow between individual motion sequences through concatenation, eliminating the need for complex cross-attention mechanisms. We demonstrate that Dyadic Mamba achieves competitive performance on standard short-term benchmarks while significantly outperforming transformer-based approaches on longer sequences. Additionally, we propose a new benchmark for evaluating long-term motion synthesis quality, providing a standardized framework for future research. Our results demonstrate that SSM-based architectures offer a promising direction for addressing the challenging task of long-term dyadic human motion synthesis from text descriptions.

67. 【2505.09819】Visual Feedback of Pattern Separability Improves Myoelectric Decoding Performance of Upper Limb Prostheses

链接https://arxiv.org/abs/2505.09819

作者:Ruichen Yang,György M. Lévay,Christopher L. Hunt,Dániel Czeiner,Megan C. Hodgson,Damini Agarwal,Rahul R. Kaliki,Nitish V. Thakor

类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)

关键词:upper limb myoelectric, limb myoelectric prostheses, upper limb, translate electromyography, EMG patterns

备注

点击查看摘要

Abstract:State-of-the-art upper limb myoelectric prostheses often use pattern recognition (PR) control systems that translate electromyography (EMG) signals into desired movements. As prosthesis movement complexity increases, users often struggle to produce sufficiently distinct EMG patterns for reliable classification. Existing training typically involves heuristic, trial-and-error user adjustments to static decoder boundaries. Goal: We introduce the Reviewer, a 3D visual interface projecting EMG signals directly into the decoder's classification space, providing intuitive, real-time insight into PR algorithm behavior. This structured feedback reduces cognitive load and fosters mutual, data-driven adaptation between user-generated EMG patterns and decoder boundaries. Methods: A 10-session study with 12 able-bodied participants compared PR performance after motor-based training and updating using the Reviewer versus conventional virtual arm visualization. Performance was assessed using a Fitts law task that involved the aperture of the cursor and the control of orientation. Results: Participants trained with the Reviewer achieved higher completion rates, reduced overshoot, and improved path efficiency and throughput compared to the standard visualization group. Significance: The Reviewer introduces decoder-informed motor training, facilitating immediate and consistent PR-based myoelectric control improvements. By iteratively refining control through real-time feedback, this approach reduces reliance on trial-and-error recalibration, enabling a more adaptive, self-correcting training framework. Conclusion: The 3D visual feedback significantly improves PR control in novice operators through structured training, enabling feedback-driven adaptation and reducing reliance on extensive heuristic adjustments.

68. 【2505.09746】A Computational Pipeline for Advanced Analysis of 4D Flow MRI in the Left Atrium

链接https://arxiv.org/abs/2505.09746

作者:Xabier Morales,Ayah Elsayed,Debbie Zhao,Filip Loncaric,Ainhoa Aguado,Mireia Masias,Gina Quill,Marc Ramos,Ada Doltra,Ana Garcia,Marta Sitges,David Marlevi,Alistair Young,Martyn Nash,Bart Bijnens,Oscar Camara

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:left ventricular filling, modulating left ventricular, Flow MRI, conventional ultrasound analysis, Flow MRI make

备注

点击查看摘要

Abstract:The left atrium (LA) plays a pivotal role in modulating left ventricular filling, but our comprehension of its hemodynamics is significantly limited by the constraints of conventional ultrasound analysis. 4D flow magnetic resonance imaging (4D Flow MRI) holds promise for enhancing our understanding of atrial hemodynamics. However, the low velocities within the LA and the limited spatial resolution of 4D Flow MRI make analyzing this chamber challenging. Furthermore, the absence of dedicated computational frameworks, combined with diverse acquisition protocols and vendors, complicates gathering large cohorts for studying the prognostic value of hemodynamic parameters provided by 4D Flow MRI. In this study, we introduce the first open-source computational framework tailored for the analysis of 4D Flow MRI in the LA, enabling comprehensive qualitative and quantitative analysis of advanced hemodynamic parameters. Our framework proves robust to data from different centers of varying quality, producing high-accuracy automated segmentations (Dice $$ 0.9 and Hausdorff 95 $$ 3 mm), even with limited training data. Additionally, we conducted the first comprehensive assessment of energy, vorticity, and pressure parameters in the LA across a spectrum of disorders to investigate their potential as prognostic biomarkers.

69. 【2306.07615】UOD: Universal One-shot Detection of Anatomical Landmarks

链接https://arxiv.org/abs/2306.07615

作者:Heqin Zhu,Quan Quan,Qingsong Yao,Zaiyi Liu,S. Kevin Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:achieves great success, label-efficient training process, gains much attention, attention and achieves, achieves great

备注: Eealy accepted by MICCAI 2023. 11pages, 4 figures, 2 tables. arXiv admin note: text overlap with [arXiv:2203.06433](https://arxiv.org/abs/2203.06433)

点击查看摘要

Abstract:One-shot medical landmark detection gains much attention and achieves great success for its label-efficient training process. However, existing one-shot learning methods are highly specialized in a single domain and suffer domain preference heavily in the situation of multi-domain unlabeled data. Moreover, one-shot learning is not robust that it faces performance drop when annotating a sub-optimal image. To tackle these issues, we resort to developing a domain-adaptive one-shot landmark detection framework for handling multi-domain medical images, named Universal One-shot Detection (UOD). UOD consists of two stages and two corresponding universal models which are designed as combinations of domain-specific modules and domain-shared modules. In the first stage, a domain-adaptive convolution model is self-supervised learned to generate pseudo landmark labels. In the second stage, we design a domain-adaptive transformer to eliminate domain preference and build the global context for multi-domain data. Even though only one annotated sample from each domain is available for training, the domain-shared modules help UOD aggregate all one-shot samples to detect more robust and accurate landmarks. We investigated both qualitatively and quantitatively the proposed UOD on three widely-used public X-ray datasets in different anatomical domains (i.e., head, hand, chest) and obtained state-of-the-art performances in each domain. The code is available at this https URL.

70. 【2505.10492】Multi-contrast laser endoscopy for in vivo gastrointestinal imaging

链接https://arxiv.org/abs/2505.10492

作者:Taylor L. Bobrow,Mayank Golhar,Suchapa Arayakarnkul,Anthony A. Song,Saowanee Ngamruengphong,Nicholas J. Durr

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)

关键词:clinical gold standard, gold standard, standard for detecting, detecting diseases, MLE

备注

点击查看摘要

Abstract:White light endoscopy is the clinical gold standard for detecting diseases in the gastrointestinal tract. Most applications involve identifying visual abnormalities in tissue color, texture, and shape. Unfortunately, the contrast of these features is often subtle, causing many clinically relevant cases to go undetected. To overcome this challenge, we introduce Multi-contrast Laser Endoscopy (MLE): a platform for widefield clinical imaging with rapidly tunable spectral, coherent, and directional illumination. We demonstrate three capabilities of MLE: enhancing tissue chromophore contrast with multispectral diffuse reflectance, quantifying blood flow using laser speckle contrast imaging, and characterizing mucosal topography using photometric stereo. We validate MLE with benchtop models, then demonstrate MLE in vivo during clinical colonoscopies. MLE images from 31 polyps demonstrate an approximate three-fold improvement in contrast and a five-fold improvement in color difference compared to white light and narrow band imaging. With the ability to reveal multiple complementary types of tissue contrast while seamlessly integrating into the clinical environment, MLE shows promise as an investigative tool to improve gastrointestinal imaging.

71. 【2505.10464】HWA-UNETR: Hierarchical Window Aggregate UNETR for 3D Multimodal Gastric Lesion Segmentation

链接https://arxiv.org/abs/2505.10464

作者:Jiaming Liang,Lihuan Dai,Xiaoqi Sheng,Xiangguang Chen,Chun Yao,Guihua Tao,Qibin Leng,Honming Cai,Xi Zhong

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:faces significant challenges, segmentation faces significant, Multimodal medical image, gastric cancer lesion, cancer lesion analysis

备注: This work has been provisionally accepted for MICCAI 2025

点击查看摘要

Abstract:Multimodal medical image segmentation faces significant challenges in the context of gastric cancer lesion analysis. This clinical context is defined by the scarcity of independent multimodal datasets and the imperative to amalgamate inherently misaligned modalities. As a result, algorithms are constrained to train on approximate data and depend on application migration, leading to substantial resource expenditure and a potential decline in analysis accuracy. To address those challenges, we have made two major contributions: First, we publicly disseminate the GCM 2025 dataset, which serves as the first large-scale, open-source collection of gastric cancer multimodal MRI scans, featuring professionally annotated FS-T2W, CE-T1W, and ADC images from 500 patients. Second, we introduce HWA-UNETR, a novel 3D segmentation framework that employs an original HWA block with learnable window aggregation layers to establish dynamic feature correspondences between different modalities' anatomical structures, and leverages the innovative tri-orientated fusion mamba mechanism for context modeling and capturing long-range spatial dependencies. Extensive experiments on our GCM 2025 dataset and the publicly BraTS 2021 dataset validate the performance of our framework, demonstrating that the new approach surpasses existing methods by up to 1.68\% in the Dice score while maintaining solid robustness. The dataset and code are public via this https URL.

72. 【2505.10405】Visual Fidelity Index for Generative Semantic Communications with Critical Information Embedding

链接https://arxiv.org/abs/2505.10405

作者:Jianhao Huang,Qunsong Zeng,Kaibin Huang

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:reduces communication costs, large artificial intelligence, transmitting low-dimensional prompts, Generative semantic communication, reduces communication

备注

点击查看摘要

Abstract:Generative semantic communication (Gen-SemCom) with large artificial intelligence (AI) model promises a transformative paradigm for 6G networks, which reduces communication costs by transmitting low-dimensional prompts rather than raw data. However, purely prompt-driven generation loses fine-grained visual details. Additionally, there is a lack of systematic metrics to evaluate the performance of Gen-SemCom systems. To address these issues, we develop a hybrid Gen-SemCom system with a critical information embedding (CIE) framework, where both text prompts and semantically critical features are extracted for transmissions. First, a novel approach of semantic filtering is proposed to select and transmit the semantically critical features of images relevant to semantic label. By integrating the text prompt and critical features, the receiver reconstructs high-fidelity images using a diffusion-based generative model. Next, we propose the generative visual information fidelity (GVIF) metric to evaluate the visual quality of the generated image. By characterizing the statistical models of image features, the GVIF metric quantifies the mutual information between the distorted features and their original counterparts. By maximizing the GVIF metric, we design a channel-adaptive Gen-SemCom system that adaptively control the volume of features and compression rate according to the channel state. Experimental results validate the GVIF metric's sensitivity to visual fidelity, correlating with both the PSNR and critical information volume. In addition, the optimized system achieves superior performance over benchmarking schemes in terms of higher PSNR and lower FID scores.

73. 【2505.09985】Ordered-subsets Multi-diffusion Model for Sparse-view CT Reconstruction

链接https://arxiv.org/abs/2505.09985

作者:Pengfei Yu,Bin Huang,Minghui Zhang,Weiwen Wu,Shaoyu Wang,Qiegen Liu

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:shown significant promise, Score-based diffusion models, Score-based diffusion, shown significant, significant promise

备注

点击查看摘要

Abstract:Score-based diffusion models have shown significant promise in the field of sparse-view CT reconstruction. However, the projection dataset is large and riddled with redundancy. Consequently, applying the diffusion model to unprocessed data results in lower learning effectiveness and higher learning difficulty, frequently leading to reconstructed images that lack fine details. To address these issues, we propose the ordered-subsets multi-diffusion model (OSMM) for sparse-view CT reconstruction. The OSMM innovatively divides the CT projection data into equal subsets and employs multi-subsets diffusion model (MSDM) to learn from each subset independently. This targeted learning approach reduces complexity and enhances the reconstruction of fine details. Furthermore, the integration of one-whole diffusion model (OWDM) with complete sinogram data acts as a global information constraint, which can reduce the possibility of generating erroneous or inconsistent sinogram information. Moreover, the OSMM's unsupervised learning framework provides strong robustness and generalizability, adapting seamlessly to varying sparsity levels of CT sinograms. This ensures consistent and reliable performance across different clinical scenarios. Experimental results demonstrate that OSMM outperforms traditional diffusion models in terms of image quality and noise resilience, offering a powerful and versatile solution for advanced CT imaging in sparse-view scenarios.

74. 【2505.09831】ImplicitStainer: Data-Efficient Medical Image Translation for Virtual Antibody-based Tissue Staining Using Local Implicit Functions

链接https://arxiv.org/abs/2505.09831

作者:Tushar Kataria,Beatrice Knudsen,Shireen Y. Elhabian

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Hematoxylin and eosin, diagnosis in pathology, gold standard, standard for microscopic, microscopic diagnosis

备注

点击查看摘要

Abstract:Hematoxylin and eosin (HE) staining is a gold standard for microscopic diagnosis in pathology. However, HE staining does not capture all the diagnostic information that may be needed. To obtain additional molecular information, immunohistochemical (IHC) stains highlight proteins that mark specific cell types, such as CD3 for T-cells or CK8/18 for epithelial cells. While IHC stains are vital for prognosis and treatment guidance, they are typically only available at specialized centers and time consuming to acquire, leading to treatment delays for patients. Virtual staining, enabled by deep learning-based image translation models, provides a promising alternative by computationally generating IHC stains from HE stained images. Although many GAN and diffusion based image to image (I2I) translation methods have been used for virtual staining, these models treat image patches as independent data points, which results in increased and more diverse data requirements for effective generation. We present ImplicitStainer, a novel approach that leverages local implicit functions to improve image translation, specifically virtual staining performance, by focusing on pixel-level predictions. This method enhances robustness to variations in dataset sizes, delivering high-quality results even with limited data. We validate our approach on two datasets using a comprehensive set of metrics and benchmark it against over fifteen state-of-the-art GAN- and diffusion based models. Full Code and models trained will be released publicly via Github upon acceptance.

75. 【2505.09630】Generative diffusion model surrogates for mechanistic agent-based biological models

链接https://arxiv.org/abs/2505.09630

作者:Tien Comlekoglu,J. Quetzalcóatl Toledo-Marín,Douglas W. DeSimone,Shayn M. Peirce,Geoffrey Fox,James A. Glazier

类目:Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Performance (cs.PF)

关键词:single-cell resolution, organism-scale biology, biology at single-cell, Mechanistic, models

备注

点击查看摘要

Abstract:Mechanistic, multicellular, agent-based models are commonly used to investigate tissue, organ, and organism-scale biology at single-cell resolution. The Cellular-Potts Model (CPM) is a powerful and popular framework for developing and interrogating these models. CPMs become computationally expensive at large space- and time- scales making application and investigation of developed models difficult. Surrogate models may allow for the accelerated evaluation of CPMs of complex biological systems. However, the stochastic nature of these models means each set of parameters may give rise to different model configurations, complicating surrogate model development. In this work, we leverage denoising diffusion probabilistic models to train a generative AI surrogate of a CPM used to investigate \textit{in vitro} vasculogenesis. We describe the use of an image classifier to learn the characteristics that define unique areas of a 2-dimensional parameter space. We then apply this classifier to aid in surrogate model selection and verification. Our CPM model surrogate generates model configurations 20,000 timesteps ahead of a reference configuration and demonstrates approximately a 22x reduction in computational time as compared to native code execution. Our work represents a step towards the implementation of DDPMs to develop digital twins of stochastic biological systems.