本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新419篇论文,其中:
- 自然语言处理44篇
- 信息检索11篇
- 计算机视觉87篇
自然语言处理
1. 【2504.17768】he Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
链接:https://arxiv.org/abs/2504.17768
作者:Piotr Nawrot,Robert Li,Renjie Huang,Sebastian Ruder,Kelly Marchisio,Edoardo M. Ponti
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:studies remain unexplored, systematic scaling studies, scaling studies remain, extend long-context capabilities, Sparse attention
备注:
点击查看摘要
Abstract:Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy trade-offs, and systematic scaling studies remain unexplored. To address this gap, we perform a careful comparison of training-free sparse attention methods at varying model scales, sequence lengths, and sparsity levels on a diverse collection of long-sequence tasks-including novel ones that rely on natural language while remaining controllable and easy to evaluate. Based on our experiments, we report a series of key findings: 1) an isoFLOPS analysis reveals that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. 2) The level of sparsity attainable while statistically guaranteeing accuracy preservation is higher during decoding than prefilling, and correlates with model size in the former. 3) There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios. Even moderate sparsity levels often result in significant performance degradation on at least one task, highlighting that sparse attention is not a universal solution. 4) We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments. Through these insights, we demonstrate that sparse attention is a key tool to enhance the capabilities of Transformer LLMs for processing longer sequences, but requires careful evaluation of trade-offs for performance-sensitive applications.
2. 【2504.17753】Conversational Assistants to support Heart Failure Patients: comparing a Neurosymbolic Architecture with ChatGPT
链接:https://arxiv.org/abs/2504.17753
作者:Anuja Tayal,Devika Salunke,Barbara Di Eugenio,Paula Allen-Meares,Eulalia Puig Abril,Olga Garcia,Carolyn Dickens,Andrew Boyd
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Language Models, Large Language, capabilities of Large, including in healthcare
备注:
点击查看摘要
Abstract:Conversational assistants are becoming more and more popular, including in healthcare, partly because of the availability and capabilities of Large Language Models. There is a need for controlled, probing evaluations with real stakeholders which can highlight advantages and disadvantages of more traditional architectures and those based on generative AI. We present a within-group user study to compare two versions of a conversational assistant that allows heart failure patients to ask about salt content in food. One version of the system was developed in-house with a neurosymbolic architecture, and one is based on ChatGPT. The evaluation shows that the in-house system is more accurate, completes more tasks and is less verbose than the one based on ChatGPT; on the other hand, the one based on ChatGPT makes fewer speech errors and requires fewer clarifications to complete the task. Patients show no preference for one over the other.
3. 【2504.17720】Multilingual Performance Biases of Large Language Models in Education
链接:https://arxiv.org/abs/2504.17720
作者:Vansh Gupta,Sankalan Pal Chowdhury,Vilém Zouhar,Donya Rooein,Mrinmaya Sachan
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, Large language, increasingly being adopted, remain primarily English-centric, Large
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly being adopted in educational settings. These applications expand beyond English, though current LLMs remain primarily English-centric. In this work, we ascertain if their use in education settings in non-English languages is warranted. We evaluated the performance of popular LLMs on four educational tasks: identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations in six languages (Hindi, Arabic, Farsi, Telugu, Ukrainian, Czech) in addition to English. We find that the performance on these tasks somewhat corresponds to the amount of language represented in training data, with lower-resource languages having poorer task performance. Although the models perform reasonably well in most languages, the frequent performance drop from English is significant. Thus, we recommend that practitioners first verify that the LLM works well in the target language for their educational task before deployment.
4. 【2504.17704】Safety in Large Reasoning Models: A Survey
链接:https://arxiv.org/abs/2504.17704
作者:Cheng Wang,Yue Liu,Baolong Li,Duzhen Zhang,Zhongzhi Li,Junfeng Fang
类目:Computation and Language (cs.CL)
关键词:Large Reasoning Models, advanced reasoning capabilities, exhibited extraordinary prowess, Large Reasoning, advanced reasoning
备注:
点击查看摘要
Abstract:Large Reasoning Models (LRMs) have exhibited extraordinary prowess in tasks like mathematics and coding, leveraging their advanced reasoning capabilities. Nevertheless, as these capabilities progress, significant concerns regarding their vulnerabilities and safety have arisen, which can pose challenges to their deployment and application in real-world settings. This paper presents a comprehensive survey of LRMs, meticulously exploring and summarizing the newly emerged safety risks, attacks, and defense strategies. By organizing these elements into a detailed taxonomy, this work aims to offer a clear and structured understanding of the current safety landscape of LRMs, facilitating future research and development to enhance the security and reliability of these powerful models.
5. 【2504.17685】Ensemble Bayesian Inference: Leveraging Small Language Models to Achieve LLM-level Accuracy in Profile Matching Tasks
链接:https://arxiv.org/abs/2504.17685
作者:Haru-Tada Sato,Fuka Matsuzaki,Jun-ichiro Takahashi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:achieve accuracy comparable, proprietary large language, Ensemble Bayesian Inference, study explores, explores the potential
备注: 13 pages, 2 figures
点击查看摘要
Abstract:This study explores the potential of small language model(SLM) ensembles to achieve accuracy comparable to proprietary large language models (LLMs). We propose Ensemble Bayesian Inference (EBI), a novel approach that applies Bayesian estimation to combine judgments from multiple SLMs, allowing them to exceed the performance limitations of individual models. Our experiments on diverse tasks(aptitude assessments and consumer profile analysis in both Japanese and English) demonstrate EBI's effectiveness. Notably, we analyze cases where incorporating models with negative Lift values into ensembles improves overall performance, and we examine the method's efficacy across different languages. These findings suggest new possibilities for constructing high-performance AI systems with limited computational resources and for effectively utilizing models with individually lower performance. Building on existing research on LLM performance evaluation, ensemble methods, and open-source LLM utilization, we discuss the novelty and significance of our approach.
6. 【2504.17674】Energy Considerations of Large Language Model Inference and Efficiency Optimizations
链接:https://arxiv.org/abs/2504.17674
作者:Jared Fernandez,Clara Na,Vashisth Tiwari,Yonatan Bisk,Sasha Luccioni,Emma Strubell
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:environmental costs continue, Natural Language Processing, large language models, continue to rise, computational and environmental
备注: 16 pages
点击查看摘要
Abstract:As large language models (LLMs) scale in size and adoption, their computational and environmental costs continue to rise. Prior benchmarking efforts have primarily focused on latency reduction in idealized settings, often overlooking the diverse real-world inference workloads that shape energy use. In this work, we systematically analyze the energy implications of common inference efficiency optimizations across diverse Natural Language Processing (NLP) and generative Artificial Intelligence (AI) workloads, including conversational AI and code generation. We introduce a modeling approach that approximates real-world LLM workflows through a binning strategy for input-output token distributions and batch size variations. Our empirical analysis spans software frameworks, decoding strategies, GPU architectures, online and offline serving settings, and model parallelism configurations. We show that the effectiveness of inference optimizations is highly sensitive to workload geometry, software stack, and hardware accelerators, demonstrating that naive energy estimates based on FLOPs or theoretical GPU utilization significantly underestimate real-world energy consumption. Our findings reveal that the proper application of relevant inference efficiency optimizations can reduce total energy use by up to 73% from unoptimized baselines. These insights provide a foundation for sustainable LLM deployment and inform energy-efficient design strategies for future AI infrastructure.
7. 【2504.17671】Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction
链接:https://arxiv.org/abs/2504.17671
作者:Yuanchang Ye,Weiyan Wen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Visual Question Answering, Large Vision-Language Models, Question Answering, Visual Question, Vision-Language Models
备注:
点击查看摘要
Abstract:This study addresses the critical challenge of hallucination mitigation in Large Vision-Language Models (LVLMs) for Visual Question Answering (VQA) tasks through a Split Conformal Prediction (SCP) framework. While LVLMs excel in multi-modal reasoning, their outputs often exhibit hallucinated content with high confidence, posing risks in safety-critical applications. We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification. By partitioning data into calibration and test sets, the framework computes nonconformity scores to construct prediction sets with statistical guarantees under user-defined risk levels ($\alpha$). Key innovations include: (1) rigorous control of \textbf{marginal coverage} to ensure empirical error rates remain strictly below $\alpha$; (2) dynamic adjustment of prediction set sizes inversely with $\alpha$, filtering low-confidence outputs; (3) elimination of prior distribution assumptions and retraining requirements. Evaluations on benchmarks (ScienceQA, MMMU) with eight LVLMs demonstrate that SCP enforces theoretical guarantees across all $\alpha$ values. The framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains. This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
8. 【2504.17665】Evaluating Grounded Reasoning by Code-Assisted Large Language Models for Mathematics
链接:https://arxiv.org/abs/2504.17665
作者:Zena Al-Khalili,Nick Howell,Dietrich Klakow
类目:Computation and Language (cs.CL)
关键词:Assisting LLMs, code generation improved, reasoning tasks, Assisting, math reasoning tasks
备注:
点击查看摘要
Abstract:Assisting LLMs with code generation improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous evaluation of their generated programs. In this work, we bridge this gap by conducting an in-depth analysis of code-assisted LLMs' generated programs in response to math reasoning tasks. Our evaluation focuses on the extent to which LLMs ground their programs to math rules, and how that affects their end performance. For this purpose, we assess the generations of five different LLMs, on two different math datasets, both manually and automatically. Our results reveal that the distribution of grounding depends on LLMs' capabilities and the difficulty of math problems. Furthermore, mathematical grounding is more effective for closed-source models, while open-source models fail to employ math rules in their solutions correctly. On MATH500, the percentage of grounded programs decreased to half, while the ungrounded generations doubled in comparison to ASDiv grade-school problems. Our work highlights the need for in-depth evaluation beyond execution accuracy metrics, toward a better understanding of code-assisted LLMs' capabilities and limits in the math domain.
9. 【2504.17653】owards a comprehensive taxonomy of online abusive language informed by machine leaning
链接:https://arxiv.org/abs/2504.17653
作者:Samaneh Hosseini Moghaddam,Kelly Lyons,Cheryl Regehr,Vivek Goel,Kaitlyn Regehr
类目:Computation and Language (cs.CL)
关键词:posed significant risks, individuals and communities, communications has posed, posed significant, significant risks
备注:
点击查看摘要
Abstract:The proliferation of abusive language in online communications has posed significant risks to the health and wellbeing of individuals and communities. The growing concern regarding online abuse and its consequences necessitates methods for identifying and mitigating harmful content and facilitating continuous monitoring, moderation, and early intervention. This paper presents a taxonomy for distinguishing key characteristics of abusive language within online text. Our approach uses a systematic method for taxonomy development, integrating classification systems of 18 existing multi-label datasets to capture key characteristics relevant to online abusive language classification. The resulting taxonomy is hierarchical and faceted, comprising 5 categories and 17 dimensions. It classifies various facets of online abuse, including context, target, intensity, directness, and theme of abuse. This shared understanding can lead to more cohesive efforts, facilitate knowledge exchange, and accelerate progress in the field of online abuse detection and mitigation among researchers, policy makers, online platform owners, and other stakeholders.
10. 【2504.17574】RAGAT-Mind: A Multi-Granular Modeling Approach for Rumor Detection Based on MindSpore
链接:https://arxiv.org/abs/2504.17574
作者:Zhenkai Qin,Guifang Yang,Dongze Wu
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:social media platforms, natural language processing, false information continues, effective rumor detection, Chinese rumor detection
备注:
点击查看摘要
Abstract:As false information continues to proliferate across social media platforms, effective rumor detection has emerged as a pressing challenge in natural language processing. This paper proposes RAGAT-Mind, a multi-granular modeling approach for Chinese rumor detection, built upon the MindSpore deep learning framework. The model integrates TextCNN for local semantic extraction, bidirectional GRU for sequential context learning, Multi-Head Self-Attention for global dependency focusing, and Bidirectional Graph Convolutional Networks (BiGCN) for structural representation of word co-occurrence graphs. Experiments on the Weibo1-Rumor dataset demonstrate that RAGAT-Mind achieves superior classification performance, attaining 99.2% accuracy and a macro-F1 score of 0.9919. The results validate the effectiveness of combining hierarchical linguistic features with graph-based semantic structures. Furthermore, the model exhibits strong generalization and interpretability, highlighting its practical value for real-world rumor detection applications.
11. 【2504.17565】DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training
链接:https://arxiv.org/abs/2504.17565
作者:Xiaoyu Tian,Sitong Zhao,Haotian Wang,Shuaiting Chen,Yiping Peng,Yunjie Ji,Han Zhao,Xiangang Li
类目:Computation and Language (cs.CL)
关键词:recently achieved remarkable, large language models, achieved remarkable performance, complex reasoning benchmarks, large language
备注:
点击查看摘要
Abstract:Although large language models (LLMs) have recently achieved remarkable performance on various complex reasoning benchmarks, the academic community still lacks an in-depth understanding of base model training processes and data quality. To address this, we construct a large-scale, difficulty-graded reasoning dataset containing approximately 3.34 million unique queries of varying difficulty levels and about 40 million distilled responses generated by multiple models over several passes. Leveraging pass rate and Coefficient of Variation (CV), we precisely select the most valuable training data to enhance reasoning capability. Notably, we observe a training pattern shift, indicating that reasoning-focused training based on base models requires higher learning rates for effective training. Using this carefully selected data, we significantly improve the reasoning capabilities of the base model, achieving a pass rate of 79.2\% on the AIME2024 mathematical reasoning benchmark. This result surpasses most current distilled models and closely approaches state-of-the-art performance. We provide detailed descriptions of our data processing, difficulty assessment, and training methodology, and have publicly released all datasets and methods to promote rapid progress in open-source long-reasoning LLMs. The dataset is available at: this https URL
12. 【2504.17562】When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars
链接:https://arxiv.org/abs/2504.17562
作者:Rei Higuchi,Ryotaro Kawata,Naoki Nishikawa,Kazusato Oko,Shoichiro Yamaguchi,Sosuke Kobayashi,Seiya Tokui,Kohei Hayashi,Daisuke Okanohara,Taiji Suzuki
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:acquire latent semantics, latent semantics, key properties, properties that determines, model performance
备注:
点击查看摘要
Abstract:The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginning of texts in the pre-training data, making it easier for the model to access latent semantics before observing the entire text. Previous studies have reported that this technique actually improves the performance of trained models in downstream tasks; however, this improvement has been observed only in specific downstream tasks, without consistent enhancement in average next-token prediction loss. To understand this phenomenon, we closely investigate how prepending metadata during pre-training affects model performance by examining its behavior using artificial data. Interestingly, we found that this approach produces both positive and negative effects on the downstream tasks. We demonstrate that the effectiveness of the approach depends on whether latent semantics can be inferred from the downstream task's prompt. Specifically, through investigations using data generated by probabilistic context-free grammars, we show that training with metadata helps improve model's performance when the given context is long enough to infer the latent semantics. In contrast, the technique negatively impacts performance when the context lacks the necessary information to make an accurate posterior inference.
13. 【2504.17550】HalluLens: LLM Hallucination Benchmark
链接:https://arxiv.org/abs/2504.17550
作者:Yejin Bang,Ziwei Ji,Alan Schelten,Anthony Hartshorn,Tara Fowler,Cheng Zhang,Nicola Cancedda,Pascale Fung
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, Large language, language models, generate responses, responses that deviate
备注: 42 pages
点击查看摘要
Abstract:Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination." These hallucinations undermine user trust and hinder the adoption of generative AI systems. Addressing hallucinations is essential for the advancement of LLMs. This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks, built upon clear taxonomy of hallucination. A major challenge in benchmarking hallucinations is the lack of a unified framework due to inconsistent definitions and categorizations. We disentangle LLM hallucination from "factuality," proposing a clear taxonomy that distinguishes between extrinsic and intrinsic hallucinations, to promote consistency and facilitate research. Extrinsic hallucinations, where the generated content is not consistent with the training data, are increasingly important as LLMs evolve. Our benchmark includes dynamic test set generation to mitigate data leakage and ensure robustness against such leakage. We also analyze existing benchmarks, highlighting their limitations and saturation. The work aims to: (1) establish a clear taxonomy of hallucinations, (2) introduce new extrinsic hallucination tasks, with data that can be dynamically regenerated to prevent saturation by leakage, (3) provide a comprehensive analysis of existing benchmarks, distinguishing them from factuality evaluations.
14. 【2504.17480】Unified Attacks to Large Language Model Watermarks: Spoofing and Scrubbing in Unauthorized Knowledge Distillation
链接:https://arxiv.org/abs/2504.17480
作者:Xin Yi,Shunfan Zhengc,Linlin Wanga,Xiaoling Wang,Liang He
类目:Computation and Language (cs.CL)
关键词:protecting intellectual property, unauthorized knowledge distillation, knowledge distillation, large language models, unauthorized knowledge
备注:
点击查看摘要
Abstract:Watermarking has emerged as a critical technique for combating misinformation and protecting intellectual property in large language models (LLMs). A recent discovery, termed watermark radioactivity, reveals that watermarks embedded in teacher models can be inherited by student models through knowledge distillation. On the positive side, this inheritance allows for the detection of unauthorized knowledge distillation by identifying watermark traces in student models. However, the robustness of watermarks against scrubbing attacks and their unforgeability in the face of spoofing attacks under unauthorized knowledge distillation remain largely unexplored. Existing watermark attack methods either assume access to model internals or fail to simultaneously support both scrubbing and spoofing attacks. In this work, we propose Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified framework that enables bidirectional attacks under unauthorized knowledge distillation. Our approach employs contrastive decoding to extract corrupted or amplified watermark texts via comparing outputs from the student model and weakly watermarked references, followed by bidirectional distillation to train new student models capable of watermark removal and watermark forgery, respectively. Extensive experiments show that CDG-KD effectively performs attacks while preserving the general performance of the distilled model. Our findings underscore critical need for developing watermarking schemes that are robust and unforgeable.
15. 【2504.17449】HMI: Hierarchical Knowledge Management for Efficient Multi-Tenant Inference in Pretrained Language Models
链接:https://arxiv.org/abs/2504.17449
作者:Jun Zhang,Jue Wang,Huan Li,Lidan Shou,Ke Chen,Gang Chen,Qin Xie,Guiming Xie,Xuejian Gong
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:require dedicated hardware, significant computational demands, pretrained language models, multi-tenant environments, dedicated hardware
备注: Accepted by VLDBJ 2025
点击查看摘要
Abstract:The significant computational demands of pretrained language models (PLMs), which often require dedicated hardware, present a substantial challenge in serving them efficiently, especially in multi-tenant environments. To address this, we introduce HMI, a Hierarchical knowledge management-based Multi-tenant Inference system, designed to manage tenants with distinct PLMs resource-efficiently. Our approach is three-fold: Firstly, we categorize PLM knowledge into general, domain-specific, and task-specific. Leveraging insights on knowledge acquisition across different model layers, we construct hierarchical PLMs (hPLMs) by extracting and storing knowledge at different levels, significantly reducing GPU memory usage per tenant. Secondly, we establish hierarchical knowledge management for hPLMs generated by various tenants in HMI. We manage domain-specific knowledge with acceptable storage increases by constructing and updating domain-specific knowledge trees based on frequency. We manage task-specific knowledge within limited GPU memory through parameter swapping. Finally, we propose system optimizations to enhance resource utilization and inference throughput. These include fine-grained pipelining via hierarchical knowledge prefetching to overlap CPU and I/O operations with GPU computations, and optimizing parallel implementations with batched matrix multiplications. Our experimental results demonstrate that the proposed HMI can efficiently serve up to 10,000 hPLMs (hBERTs and hGPTs) on a single GPU, with only a negligible compromise in accuracy.
16. 【2504.17445】Creating Targeted, Interpretable Topic Models with LLM-Generated Text Augmentation
链接:https://arxiv.org/abs/2504.17445
作者:Anna Lieb,Maneesh Arora,Eni Mustafaraj
类目:Computation and Language (cs.CL)
关键词:Unsupervised machine learning, machine learning techniques, identify latent patterns, Unsupervised machine, unstructured text data
备注: Presented at IC2S2 2024 in Philadelphia, USA
点击查看摘要
Abstract:Unsupervised machine learning techniques, such as topic modeling and clustering, are often used to identify latent patterns in unstructured text data in fields such as political science and sociology. These methods overcome common concerns about reproducibility and costliness involved in the labor-intensive process of human qualitative analysis. However, two major limitations of topic models are their interpretability and their practicality for answering targeted, domain-specific social science research questions. In this work, we investigate opportunities for using LLM-generated text augmentation to improve the usefulness of topic modeling output. We use a political science case study to evaluate our results in a domain-specific application, and find that topic modeling using GPT-4 augmentations creates highly interpretable categories that can be used to investigate domain-specific research questions with minimal human guidance.
17. 【2504.17390】PicPersona-TOD : A Dataset for Personalizing Utterance Style in Task-Oriented Dialogue with Image Persona
链接:https://arxiv.org/abs/2504.17390
作者:Jihyun Lee,Yejin Jeon,Seungyeon Seo,Gary Geunbae Lee
类目:Computation and Language (cs.CL)
关键词:users' personal attributes, natural language interactions, fulfill user requests, produce generic, personal attributes
备注: Accepted in NAACL 2025 main
点击查看摘要
Abstract:Task-Oriented Dialogue (TOD) systems are designed to fulfill user requests through natural language interactions, yet existing systems often produce generic, monotonic responses that lack individuality and fail to adapt to users' personal attributes. To address this, we introduce PicPersona-TOD, a novel dataset that incorporates user images as part of the persona, enabling personalized responses tailored to user-specific factors such as age or emotional context. This is facilitated by first impressions, dialogue policy-guided prompting, and the use of external knowledge to reduce hallucinations. Human evaluations confirm that our dataset enhances user experience, with personalized responses contributing to a more engaging interaction. Additionally, we introduce a new NLG model, Pictor, which not only personalizes responses, but also demonstrates robust performance across unseen domains this https URL.
18. 【2504.17366】LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams
链接:https://arxiv.org/abs/2504.17366
作者:Yongxuan Wu,Runyu Chen,Peiyu Liu,Hongjin Qian
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:uneven information density, poses significant challenges, natural language processing, real-world dialogues characterized, understanding poses significant
备注:
点击查看摘要
Abstract:Long-context understanding poses significant challenges in natural language processing, particularly for real-world dialogues characterized by speech-based elements, high redundancy, and uneven information density. Although large language models (LLMs) achieve impressive results on existing benchmarks, these datasets fail to reflect the complexities of such texts, limiting their applicability to practical scenarios. To bridge this gap, we construct the first spoken long-text dataset, derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-world scenarios. We construct tasks in three categories: retrieval-dependent, reasoning-dependent, and hybrid. We then evaluate both popular LLMs and specialized methods to assess their ability to understand long-contexts in these tasks. Our results show that current methods exhibit strong task-specific preferences and perform poorly on highly redundant inputs, with no single method consistently outperforming others. We propose a new baseline that better handles redundancy in spoken text and achieves strong performance across tasks. Our findings highlight key limitations of current methods and suggest future directions for improving long-context understanding. Finally, our benchmark fills a gap in evaluating long-context spoken language understanding and provides a practical foundation for developing real-world e-commerce systems. The code and benchmark are available at this https URL.
19. 【2504.17365】meSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation
链接:https://arxiv.org/abs/2504.17365
作者:Ling You,Wenxuan Huang,Xinni Xie,Xiangyi Wei,Bangyan Li,Shaohui Lin,Yang Li,Changbo Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:popular sporting event, distinctive highlight moments, globally popular sporting, Multimodal Large Language, Large Language Models
备注:
点击查看摘要
Abstract:Soccer is a globally popular sporting event, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding, soccer commentary generation often requires precise temporal localization and semantically rich descriptions over long-form video. However, existing soccer MLLMs often rely on the temporal a priori for caption generation, so they cannot process the soccer video end-to-end. While some traditional approaches follow a two-step paradigm that is complex and fails to capture the global context to achieve suboptimal performance. To solve the above issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and incorporates complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that our TimeSoccer achieves State-of-The-Art (SoTA) performance on the SDVC task in an end-to-end form, generating high-quality commentary with accurate temporal alignment and strong semantic relevance.
20. 【2504.17360】PatientDx: Merging Large Language Models for Protecting Data-Privacy in Healthcare
链接:https://arxiv.org/abs/2504.17360
作者:Jose G. Moreno(IRIT-IRIS),Jesus Lovon(IRIT-IRIS),M'Rick Robin-Charlet(UT3),Christine Damase-Michel,Lynda Tamine(IRIT-IRIS)
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, default practice, practice for improving
备注:
点击查看摘要
Abstract:Fine-tuning of Large Language Models (LLMs) has become the default practice for improving model performance on a given task. However, performance improvement comes at the cost of training on vast amounts of annotated data which could be sensitive leading to significant data privacy concerns. In particular, the healthcare domain is one of the most sensitive domains exposed to data privacy issues. In this paper, we present PatientDx, a framework of model merging that allows the design of effective LLMs for health-predictive tasks without requiring fine-tuning nor adaptation on patient data. Our proposal is based on recently proposed techniques known as merging of LLMs and aims to optimize a building block merging strategy. PatientDx uses a pivotal model adapted to numerical reasoning and tunes hyperparameters on examples based on a performance metric but without training of the LLM on these data. Experiments using the mortality tasks of the MIMIC-IV dataset show improvements up to 7% in terms of AUROC when compared to initial models. Additionally, we confirm that when compared to fine-tuned models, our proposal is less prone to data leak problems without hurting performance. Finally, we qualitatively show the capabilities of our proposal through a case study. Our best model is publicly available at this https URL Jgmorenof/mistral\_merged\_0\_4.
21. 【2504.17353】M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction
链接:https://arxiv.org/abs/2504.17353
作者:Chengguang Gan,Sunbowen Lee,Zhixi Cai,Yanbin Wei,Lei Zheng,Yunhao Liang,Shiwen Ni,Tatsunori Mori
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Mutual Reinforcement Effect, Reinforcement Effect, Mutual Reinforcement, MRE, Multimodal Mutual Reinforcement
备注:
点击查看摘要
Abstract:Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection of information extraction and model interpretability. MRE aims to leverage the mutual understanding between tasks of different granularities, enhancing the performance of both coarse-grained and fine-grained tasks through joint modeling. While MRE has been explored and validated in the textual domain, its applicability to visual and multimodal domains remains unexplored. In this work, we extend MRE to the multimodal information extraction domain for the first time. Specifically, we introduce a new task: Multimodal Mutual Reinforcement Effect (M-MRE), and construct a corresponding dataset to support this task. To address the challenges posed by M-MRE, we further propose a Prompt Format Adapter (PFA) that is fully compatible with various Large Vision-Language Models (LVLMs). Experimental results demonstrate that MRE can also be observed in the M-MRE task, a multimodal text-image understanding scenario. This provides strong evidence that MRE facilitates mutual gains across three interrelated tasks, confirming its generalizability beyond the textual domain.
22. 【2504.17332】Bridging Cognition and Emotion: Empathy-Driven Multimodal Misinformation Detection
链接:https://arxiv.org/abs/2504.17332
作者:Zihan Wang,Lu Yuan,Zhengxuan Zhang,Qing Zhao
类目:Computation and Language (cs.CL)
关键词:digital era, social media, information dissemination, major conduit, conduit for information
备注:
点击查看摘要
Abstract:In the digital era, social media has become a major conduit for information dissemination, yet it also facilitates the rapid spread of misinformation. Traditional misinformation detection methods primarily focus on surface-level features, overlooking the crucial roles of human empathy in the propagation process. To address this gap, we propose the Dual-Aspect Empathy Framework (DAE), which integrates cognitive and emotional empathy to analyze misinformation from both the creator and reader perspectives. By examining creators' cognitive strategies and emotional appeals, as well as simulating readers' cognitive judgments and emotional responses using Large Language Models (LLMs), DAE offers a more comprehensive and human-centric approach to misinformation detection. Moreover, we further introduce an empathy-aware filtering mechanism to enhance response authenticity and diversity. Experimental results on benchmark datasets demonstrate that DAE outperforms existing methods, providing a novel paradigm for multimodal misinformation detection.
23. 【2504.17311】FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
链接:https://arxiv.org/abs/2504.17311
作者:Yulia Otmakhova,Hung Thinh Truong,Rahmad Mahendra,Zenan Zhai,Rongxin Zhu,Daniel Beck,Jey Han Lau
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:tasK-agnostic robustness Evaluation, task-agnostic framework, Framework for LingUistically-driven, framework for assessing, robustness Evaluation
备注:
点击查看摘要
Abstract:We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a task-agnostic framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels - from orthography to dialect and style varieties - and leverages large language models (LLMs) with human validation to generate modifications. We demonstrate FLUKE's utility by evaluating both fine-tuned models and LLMs across four diverse NLP tasks, and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) while LLMs have better overall robustness compared to fine-tuned models, they still exhibit significant brittleness to certain linguistic variations; (3) all models show substantial vulnerability to negation modifications across most tasks. These findings highlight the importance of systematic robustness testing for understanding model behaviors.
24. 【2504.17309】CoheMark: A Novel Sentence-Level Watermark for Enhanced Text Quality
链接:https://arxiv.org/abs/2504.17309
作者:Junyan Zhang,Shuliang Liu,Aiwei Liu,Yubo Gao,Jungang Li,Xiaojie Gu,Xuming Hu
类目:Computation and Language (cs.CL)
关键词:large language models, language models, Sentence-level watermarking, trace the usage, usage of content
备注: Published at the 1st workshop on GenAI Watermarking, collocated with ICLR 2025
点击查看摘要
Abstract:Watermarking technology is a method used to trace the usage of content generated by large language models. Sentence-level watermarking aids in preserving the semantic integrity within individual sentences while maintaining greater robustness. However, many existing sentence-level watermarking techniques depend on arbitrary segmentation or generation processes to embed watermarks, which can limit the availability of appropriate sentences. This limitation, in turn, compromises the quality of the generated response. To address the challenge of balancing high text quality with robust watermark detection, we propose CoheMark, an advanced sentence-level watermarking technique that exploits the cohesive relationships between sentences for better logical fluency. The core methodology of CoheMark involves selecting sentences through trained fuzzy c-means clustering and applying specific next sentence selection criteria. Experimental evaluations demonstrate that CoheMark achieves strong watermark strength while exerting minimal impact on text quality.
25. 【2504.17279】Evaluating and Mitigating Bias in AI-Based Medical Text Generation
链接:https://arxiv.org/abs/2504.17279
作者:Xiuying Chen,Tairan Wang,Juexiao Zhou,Zirui Song,Xin Gao,Xiangliang Zhang
类目:Computation and Language (cs.CL)
关键词:increasingly achieved expert-level, Artificial intelligence, achieved expert-level performance, text generation, increasingly achieved
备注: 12 pages, 8 figures, published in Nature Computational Science
点击查看摘要
Abstract:Artificial intelligence (AI) systems, particularly those based on deep learning models, have increasingly achieved expert-level performance in medical applications. However, there is growing concern that such AI systems may reflect and amplify human bias, and reduce the quality of their performance in historically under-served populations. The fairness issue has attracted considerable research interest in the medical imaging classification field, yet it remains understudied in the text generation domain. In this study, we investigate the fairness problem in text generation within the medical field and observe significant performance discrepancies across different races, sexes, and age groups, including intersectional groups, various model scales, and different evaluation metrics. To mitigate this fairness issue, we propose an algorithm that selectively optimizes those underperformed groups to reduce bias. The selection rules take into account not only word-level accuracy but also the pathology accuracy to the target reference, while ensuring that the entire process remains fully differentiable for effective model training. Our evaluations across multiple backbones, datasets, and modalities demonstrate that our proposed algorithm enhances fairness in text generation without compromising overall performance. Specifically, the disparities among various groups across different metrics were diminished by more than 30% with our algorithm, while the relative change in text generation accuracy was typically within 2%. By reducing the bias generated by deep learning models, our proposed approach can potentially alleviate concerns about the fairness and reliability of text generation diagnosis in medical domain. Our code is publicly available to facilitate further research at this https URL.
Comments:
12 pages, 8 figures, published in Nature Computational Science
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2504.17279 [cs.CL]
(or
arXiv:2504.17279v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2504.17279
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Journalreference:
Nature Computational Science 2025
26. 【2504.17264】JurisCTC: Enhancing Legal Judgment Prediction via Cross-Domain Transfer and Contrastive Learning
链接:https://arxiv.org/abs/2504.17264
作者:Zhaolu Kang,Hongtian Cai,Xiangyang Ji,Jinzhe Li,Nanfei Gu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Unsupervised Domain Adaptation, Natural Language Processing, gained significant attention, field of Natural, Language Processing
备注: Accepted in International Joint Conference on Neural Networks (IJCNN) 2025
点击查看摘要
Abstract:In recent years, Unsupervised Domain Adaptation (UDA) has gained significant attention in the field of Natural Language Processing (NLP) owing to its ability to enhance model generalization across diverse domains. However, its application for knowledge transfer between distinct legal domains remains largely unexplored. To address the challenges posed by lengthy and complex legal texts and the limited availability of large-scale annotated datasets, we propose JurisCTC, a novel model designed to improve the accuracy of Legal Judgment Prediction (LJP) tasks. Unlike existing approaches, JurisCTC facilitates effective knowledge transfer across various legal domains and employs contrastive learning to distinguish samples from different domains. Specifically, for the LJP task, we enable knowledge transfer between civil and criminal law domains. Compared to other models and specific large language models (LLMs), JurisCTC demonstrates notable advancements, achieving peak accuracies of 76.59% and 78.83%, respectively.
27. 【2504.17252】Low-Resource Neural Machine Translation Using Recurrent Neural Networks and Transfer Learning: A Case Study on English-to-Igbo
链接:https://arxiv.org/abs/2504.17252
作者:Ocheme Anthony Ekle,Biswarup Das
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:develop Neural Machine, West Africa, Nigeria and West, African language spoken, Neural Machine Translation
备注: 25 pages, 14 combined figures (19 total), includes horizontal layouts. Submitted to arXiv for open access
点击查看摘要
Abstract:In this study, we develop Neural Machine Translation (NMT) and Transformer-based transfer learning models for English-to-Igbo translation - a low-resource African language spoken by over 40 million people across Nigeria and West Africa. Our models are trained on a curated and benchmarked dataset compiled from Bible corpora, local news, Wikipedia articles, and Common Crawl, all verified by native language experts. We leverage Recurrent Neural Network (RNN) architectures, including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), enhanced with attention mechanisms to improve translation accuracy. To further enhance performance, we apply transfer learning using MarianNMT pre-trained models within the SimpleTransformers framework. Our RNN-based system achieves competitive results, closely matching existing English-Igbo benchmarks. With transfer learning, we observe a performance gain of +4.83 BLEU points, reaching an estimated translation accuracy of 70%. These findings highlight the effectiveness of combining RNNs with transfer learning to address the performance gap in low-resource language translation tasks.
28. 【2504.17238】Crisp: Cognitive Restructuring of Negative Thoughts through Multi-turn Supportive Dialogues
链接:https://arxiv.org/abs/2504.17238
作者:Jinfeng Zhou,Yuxuan Chen,Jianing Yin,Yongkang Huang,Yihan Shi,Xikun Zhang,Libiao Peng,Rongsheng Zhang,Tangjie Lv,Zhipeng Hu,Hongning Wang,Minlie Huang
类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:mental health challenges, psychotherapeutic process aimed, Cognitive Restructuring, individual negative thoughts, arising from mental
备注:
点击查看摘要
Abstract:Cognitive Restructuring (CR) is a psychotherapeutic process aimed at identifying and restructuring an individual's negative thoughts, arising from mental health challenges, into more helpful and positive ones via multi-turn dialogues. Clinician shortage and stigma urge the development of human-LLM interactive psychotherapy for CR. Yet, existing efforts implement CR via simple text rewriting, fixed-pattern dialogues, or a one-shot CR workflow, failing to align with the psychotherapeutic process for effective CR. To address this gap, we propose CRDial, a novel framework for CR, which creates multi-turn dialogues with specifically designed identification and restructuring stages of negative thoughts, integrates sentence-level supportive conversation strategies, and adopts a multi-channel loop mechanism to enable iterative CR. With CRDial, we distill Crisp, a large-scale and high-quality bilingual dialogue dataset, from LLM. We then train Crispers, Crisp-based conversational LLMs for CR, at 7B and 14B scales. Extensive human studies show the superiority of Crispers in pointwise, pairwise, and intervention evaluations.
29. 【2504.17220】Does Knowledge Distillation Matter for Large Language Model based Bundle Generation?
链接:https://arxiv.org/abs/2504.17220
作者:Kaidong Feng,Zhu Sun,Jie Yang,Hui Fang,Xinghua Qu,Wenyuan Liu
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:bundle generation, bundle generation performance, increasingly explored, reasoning capabilities, knowledge
备注:
点击查看摘要
Abstract:LLMs are increasingly explored for bundle generation, thanks to their reasoning capabilities and knowledge. However, deploying large-scale LLMs introduces significant efficiency challenges, primarily high computational costs during fine-tuning and inference due to their massive parameterization. Knowledge distillation (KD) offers a promising solution, transferring expertise from large teacher models to compact student models. This study systematically investigates knowledge distillation approaches for bundle generation, aiming to minimize computational demands while preserving performance. We explore three critical research questions: (1) how does the format of KD impact bundle generation performance? (2) to what extent does the quantity of distilled knowledge influence performance? and (3) how do different ways of utilizing the distilled knowledge affect performance? We propose a comprehensive KD framework that (i) progressively extracts knowledge (patterns, rules, deep thoughts); (ii) captures varying quantities of distilled knowledge through different strategies; and (iii) exploits complementary LLM adaptation techniques (in-context learning, supervised fine-tuning, combination) to leverage distilled knowledge in small student models for domain-specific adaptation and enhanced efficiency. Extensive experiments provide valuable insights into how knowledge format, quantity, and utilization methodologies collectively shape LLM-based bundle generation performance, exhibiting KD's significant potential for more efficient yet effective LLM-based bundle generation.
30. 【2504.17200】A RAG-Based Multi-Agent LLM System for Natural Hazard Resilience and Adaptation
链接:https://arxiv.org/abs/2504.17200
作者:Yangxinyu Xie,Bowen Jiang,Tanwi Mallick,Joshua David Bergerson,John K. Hutchison,Duane R. Verner,Jordan Branham,M. Ross Alexander,Robert B. Ross,Yan Feng,Leslie-Anne Levy,Weijie Su,Camillo J. Taylor
类目:Computation and Language (cs.CL)
关键词:Large language models, addressing pressing societal, pressing societal challenges, Large language, transformational capability
备注:
点击查看摘要
Abstract:Large language models (LLMs) are a transformational capability at the frontier of artificial intelligence and machine learning that can support decision-makers in addressing pressing societal challenges such as extreme natural hazard events. As generalized models, LLMs often struggle to provide context-specific information, particularly in areas requiring specialized knowledge. In this work we propose a retrieval-augmented generation (RAG)-based multi-agent LLM system to support analysis and decision-making in the context of natural hazards and extreme weather events. As a proof of concept, we present WildfireGPT, a specialized system focused on wildfire hazards. The architecture employs a user-centered, multi-agent design to deliver tailored risk insights across diverse stakeholder groups. By integrating natural hazard and extreme weather projection data, observational datasets, and scientific literature through an RAG framework, the system ensures both the accuracy and contextual relevance of the information it provides. Evaluation across ten expert-led case studies demonstrates that WildfireGPT significantly outperforms existing LLM-based solutions for decision support.
31. 【2504.17192】Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
链接:https://arxiv.org/abs/2504.17192
作者:Minju Seo,Jinheon Baek,Seongyun Lee,Sung Ju Hwang
类目:Computation and Language (cs.CL)
关键词:Large Language Models, recent Large Language, machine learning research, making it slow, prior work
备注:
点击查看摘要
Abstract:Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, specifically from the original paper authors, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins.
32. 【2504.17137】MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation
链接:https://arxiv.org/abs/2504.17137
作者:Chanhee Park,Hyeonseok Moon,Chanjun Park,Heuiseok Lim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, capabilities of Large, Language Models, external knowledge
备注: Accepted to NAACL2025 Findings
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) has gained prominence as an effective method for enhancing the generative capabilities of Large Language Models (LLMs) through the incorporation of external knowledge. However, the evaluation of RAG systems remains a challenge, due to the intricate interplay between retrieval and generation components. This limitation has resulted in a scarcity of benchmarks that facilitate a detailed, component-specific assessment. In this work, we present MIRAGE, a Question Answering dataset specifically designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped to a retrieval pool of 37,800 entries, enabling an efficient and precise evaluation of both retrieval and generation tasks. We also introduce novel evaluation metrics aimed at measuring RAG adaptability, encompassing dimensions such as noise vulnerability, context acceptability, context insensitivity, and context misinterpretation. Through comprehensive experiments across various retriever-LLM configurations, we provide new insights into the optimal alignment of model pairs and the nuanced dynamics within RAG systems. The dataset and evaluation code are publicly available, allowing for seamless integration and customization in diverse research settings\footnote{The MIRAGE code and data are available at this https URL.
33. 【2504.17130】Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
链接:https://arxiv.org/abs/2504.17130
作者:Hannah Cyberey,David Evans
类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
关键词:Large language models, Large language, access information, Large, language models
备注:
点击查看摘要
Abstract:Large language models (LLMs) have transformed the way we access information. These models are often tuned to refuse to comply with requests that are considered harmful and to produce responses that better align with the preferences of those who control the models. To understand how this "censorship" works. We use representation engineering techniques to study open-weights safety-tuned models. We present a method for finding a refusal--compliance vector that detects and controls the level of censorship in model outputs. We also analyze recent reasoning LLMs, distilled from DeepSeek-R1, and uncover an additional dimension of censorship through "thought suppression". We show a similar approach can be used to find a vector that suppresses the model's reasoning process, allowing us to remove censorship by applying the negative multiples of this vector
34. 【2504.17119】he Rise of Small Language Models in Healthcare: A Comprehensive Survey
链接:https://arxiv.org/abs/2504.17119
作者:Muskan Garg,Shaina Raza,Shebuti Rayana,Xingyi Liu,Sunghwan Sohn
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:clinically viable solution, next-generation healthcare informatics, large language models, small language models, healthcare applications driven
备注: 35 pages, 7 tables, 5 figures
点击查看摘要
Abstract:Despite substantial progress in healthcare applications driven by large language models (LLMs), growing concerns around data privacy, and limited resources; the small language models (SLMs) offer a scalable and clinically viable solution for efficient performance in resource-constrained environments for next-generation healthcare informatics. Our comprehensive survey presents a taxonomic framework to identify and categorize them for healthcare professionals and informaticians. The timeline of healthcare SLM contributions establishes a foundational framework for analyzing models across three dimensions: NLP tasks, stakeholder roles, and the continuum of care. We present a taxonomic framework to identify the architectural foundations for building models from scratch; adapting SLMs to clinical precision through prompting, instruction fine-tuning, and reasoning; and accessibility and sustainability through compression techniques. Our primary objective is to offer a comprehensive survey for healthcare professionals, introducing recent innovations in model optimization and equipping them with curated resources to support future research and development in the field. Aiming to showcase the groundbreaking advancements in SLMs for healthcare, we present a comprehensive compilation of experimental results across widely studied NLP tasks in healthcare to highlight the transformative potential of SLMs in healthcare. The updated repository is available at Github
35. 【2504.17091】Co-CoT: A Prompt-Based Framework for Collaborative Chain-of-Thought Reasoning
链接:https://arxiv.org/abs/2504.17091
作者:Seunghyun Yoo
类目:Computation and Language (cs.CL)
关键词:users' critical thinking, undermining users' critical, reflective thinking, opportunities for deep, significantly diminished
备注: 5 page
点击查看摘要
Abstract:Due to the proliferation of short-form content and the rapid adoption of AI, opportunities for deep, reflective thinking have significantly diminished, undermining users' critical thinking and reducing engagement with the reasoning behind AI-generated outputs. To address this issue, we propose an Interactive Chain-of-Thought (CoT) Framework that enhances human-centered explainability and responsible AI usage by making the model's inference process transparent, modular, and user-editable. The framework decomposes reasoning into clearly defined blocks that users can inspect, modify, and re-execute, encouraging active cognitive engagement rather than passive consumption. It further integrates a lightweight edit-adaptation mechanism inspired by preference learning, allowing the system to align with diverse cognitive styles and user intentions. Ethical transparency is ensured through explicit metadata disclosure, built-in bias checkpoint functionality, and privacy-preserving safeguards. This work outlines the design principles and architecture necessary to promote critical engagement, responsible interaction, and inclusive adaptation in AI systems aimed at addressing complex societal challenges.
36. 【2504.17083】How Individual Traits and Language Styles Shape Preferences In Open-ended User-LLM Interaction: A Preliminary Study
链接:https://arxiv.org/abs/2504.17083
作者:Rendi Chevi,Kentaro Inui,Thamar Solorio,Alham Fikri Aji
类目:Computation and Language (cs.CL)
关键词:LLM responses, LLM, inaccurate LLM responses, LLM language style, makes an interaction
备注: Accepted at GenAICHI 2025 @ ACM CHI 2025
点击查看摘要
Abstract:What makes an interaction with the LLM more preferable for the user? While it is intuitive to assume that information accuracy in the LLM's responses would be one of the influential variables, recent studies have found that inaccurate LLM's responses could still be preferable when they are perceived to be more authoritative, certain, well-articulated, or simply verbose. These variables interestingly fall under the broader category of language style, implying that the style in the LLM's responses might meaningfully influence users' preferences. This hypothesized dynamic could have double-edged consequences: enhancing the overall user experience while simultaneously increasing their susceptibility to risks such as LLM's misinformation or hallucinations. In this short paper, we present our preliminary studies in exploring this subject. Through a series of exploratory and experimental user studies, we found that LLM's language style does indeed influence user's preferences, but how and which language styles influence the preference varied across different user populations, and more interestingly, moderated by the user's very own individual traits. As a preliminary work, the findings in our studies should be interpreted with caution, particularly given the limitations in our samples, which still need wider demographic diversity and larger sample sizes. Our future directions will first aim to address these limitations, which would enable a more comprehensive joint effect analysis between the language style, individual traits, and preferences, and further investigate the potential causal relationship between and beyond these variables.
37. 【2504.17075】Agree to Disagree? A Meta-Evaluation of LLM Misgendering
链接:https://arxiv.org/abs/2504.17075
作者:Arjun Subramonian,Vagrant Gautam,Preethi Seshadri,Dietrich Klakow,Kai-Wei Chang,Yizhou Sun
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:including probability-based evaluations, Numerous methods, measure LLM misgendering, including probability-based, templatic sentences
备注: Work in progress
点击查看摘要
Abstract:Numerous methods have been proposed to measure LLM misgendering, including probability-based evaluations (e.g., automatically with templatic sentences) and generation-based evaluations (e.g., with automatic heuristics or human validation). However, it has gone unexamined whether these evaluation methods have convergent validity, that is, whether their results align. Therefore, we conduct a systematic meta-evaluation of these methods across three existing datasets for LLM misgendering. We propose a method to transform each dataset to enable parallel probability- and generation-based evaluation. Then, by automatically evaluating a suite of 6 models from 3 families, we find that these methods can disagree with each other at the instance, dataset, and model levels, conflicting on 20.2% of evaluation instances. Finally, with a human evaluation of 2400 LLM generations, we show that misgendering behaviour is complex and goes far beyond pronouns, which automatic evaluations are not currently designed to capture, suggesting essential disagreement with human evaluations. Based on our findings, we provide recommendations for future evaluations of LLM misgendering. Our results are also more widely relevant, as they call into question broader methodological conventions in LLM evaluation, which often assume that different evaluation methods agree.
38. 【2504.17052】Do Words Reflect Beliefs? Evaluating Belief Depth in Large Language Models
链接:https://arxiv.org/abs/2504.17052
作者:Shariar Kabir,Kevin Esterling,Yue Dong
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, shaping political discourse, increasingly shaping political
备注: 20 pages, 9 figures
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly shaping political discourse, yet their responses often display inconsistency when subjected to scrutiny. While prior research has primarily categorized LLM outputs as left- or right-leaning to assess their political stances, a critical question remains: Do these responses reflect genuine internal beliefs or merely surface-level alignment with training data? To address this, we propose a novel framework for evaluating belief depth by analyzing (1) argumentative consistency and (2) uncertainty quantification. We evaluate 12 LLMs on 19 economic policies from the Political Compass Test, challenging their belief stability with both supportive and opposing arguments. Our analysis reveals that LLMs exhibit topic-specific belief stability rather than a uniform ideological stance. Notably, up to 95% of left-leaning models' responses and 89% of right-leaning models' responses remain consistent under the challenge, enabling semantic entropy to achieve high accuracy (AUROC=0.78), effectively distinguishing between surface-level alignment from genuine belief. These findings call into question the assumption that LLMs maintain stable, human-like political ideologies, emphasizing the importance of conducting topic-specific reliability assessments for real-world applications.
39. 【2504.17038】SCALAR: A Part-of-speech Tagger for Identifiers
链接:https://arxiv.org/abs/2504.17038
作者:Christian D. Newman,Brandon Scholten,Sophia Testa,Joshua A. C. Behler,Syreen Banabilah,Michael L. Collard,Michael J. Decker,Mohamed Wiem Mkaouer,Marcos Zampieri,Eman Abdullah AlOmar,Reem Alsuhaibani,Anthony Peruma,Jonathan I. Maletic
类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)
关键词:Lexical Annotation Runtime, Source Code Analysis, Annotation Runtime, Analysis and Lexical, Lexical Annotation
备注:
点击查看摘要
Abstract:The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequence (grammar pattern). SCALAR's internal model is trained using scikit-learn's GradientBoostingClassifier in conjunction with a manually-curated oracle of identifier names and their grammar patterns. This specializes the tagger to recognize the unique structure of the natural language used by developers to create all types of identifiers (e.g., function names, variable names etc.). SCALAR's output is compared with a previous version of the tagger, as well as a modern off-the-shelf part-of-speech tagger to show how it improves upon other taggers' output for annotating identifiers. The code is available on Github
40. 【2504.17025】Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
链接:https://arxiv.org/abs/2504.17025
作者:Luca Moroni,Giovanni Puccetti,Pere-Lluis Huguet Cabot,Andrei Stefan Bejgu,Edoardo Barba,Alessio Miaschi,Felice Dell'Orletta,Andrea Esuli,Roberto Navigli
类目:Computation and Language (cs.CL)
关键词:pretrained Large Language, Large Language Models, pretrained Large, Large Language, increasing steadily
备注:
点击查看摘要
Abstract:The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7b-v0.1, reducing token fertility by 25\%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.
41. 【2504.17004】(Im)possibility of Automated Hallucination Detection in Large Language Models
链接:https://arxiv.org/abs/2504.17004
作者:Amin Karbasi,Omar Montasser,John Sous,Grigoris Velegkas
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
关键词:hallucination detection, language, language identification, hallucination, detection
备注:
点击查看摘要
Abstract:Is automated hallucination detection possible? In this work, we introduce a theoretical framework to analyze the feasibility of automatically detecting hallucinations produced by large language models (LLMs). Inspired by the classical Gold-Angluin framework for language identification and its recent adaptation to language generation by Kleinberg and Mullainathan, we investigate whether an algorithm, trained on examples drawn from an unknown target language $K$ (selected from a countable collection) and given access to an LLM, can reliably determine whether the LLM's outputs are correct or constitute hallucinations. First, we establish an equivalence between hallucination detection and the classical task of language identification. We prove that any hallucination detection method can be converted into a language identification method, and conversely, algorithms solving language identification can be adapted for hallucination detection. Given the inherent difficulty of language identification, this implies that hallucination detection is fundamentally impossible for most language collections if the detector is trained using only correct examples from the target language. Second, we show that the use of expert-labeled feedback, i.e., training the detector with both positive examples (correct statements) and negative examples (explicitly labeled incorrect statements), dramatically changes this conclusion. Under this enriched training regime, automated hallucination detection becomes possible for all countable language collections. These results highlight the essential role of expert-labeled examples in training hallucination detectors and provide theoretical support for feedback-based methods, such as reinforcement learning with human feedback (RLHF), which have proven critical for reliable LLM deployment.
Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Cite as:
arXiv:2504.17004 [cs.LG]
(or
arXiv:2504.17004v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2504.17004
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Grigoris Velegkas [view email] [v1]
Wed, 23 Apr 2025 18:00:07 UTC (126 KB)
42. 【2504.16977】okenization Matters: Improving Zero-Shot NER for Indic Languages
链接:https://arxiv.org/abs/2504.16977
作者:Priyaranjan Pattnayak,Hitesh Laxmichand Patel,Amit Agarwal
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Natural Language Processing, low resource Indic, resource Indic languages, resource Indic, Byte Pair Encoding
备注:
点击查看摘要
Abstract:Tokenization is a critical component of Natural Language Processing (NLP), especially for low resource languages, where subword segmentation influences vocabulary structure and downstream task accuracy. Although Byte Pair Encoding (BPE) is a standard tokenization method in multilingual language models, its suitability for Named Entity Recognition (NER) in low resource Indic languages remains underexplored due to its limitations in handling morphological complexity. In this work, we systematically compare BPE, SentencePiece, and Character Level tokenization strategies using IndicBERT for NER tasks in low resource Indic languages like Assamese, Bengali, Marathi, and Odia, as well as extremely low resource Indic languages like Santali, Manipuri, and Sindhi. We assess both intrinsic linguistic properties tokenization efficiency, out of vocabulary (OOV) rates, and morphological preservation as well as extrinsic downstream performance, including fine tuning and zero shot cross lingual transfer. Our experiments show that SentencePiece is a consistently better performing approach than BPE for NER in low resource Indic Languages, particularly in zero shot cross lingual settings, as it better preserves entity consistency. While BPE provides the most compact tokenization form, it is not capable of generalization because it misclassifies or even fails to recognize entity labels when tested on unseen languages. In contrast, SentencePiece constitutes a better linguistic structural preservation model, benefiting extremely low resource and morphologically rich Indic languages, such as Santali and Manipuri, for superior entity recognition, as well as high generalization across scripts, such as Sindhi, written in Arabic. The results point to SentencePiece as the more effective tokenization strategy for NER within multilingual and low resource Indic NLP applications.
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2504.16977 [cs.CL]
(or
arXiv:2504.16977v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2504.16977
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
43. 【2504.16956】Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity
链接:https://arxiv.org/abs/2504.16956
作者:Cong Qi,Hanzhang Fang,Tianxing Hu,Siqi Jiang,Wei Zhi
类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Genomics (q-bio.GN)
关键词:Single-cell RNA sequencing, RNA sequencing, major computational challenges, poses major computational, enables high-resolution analysis
备注:
点击查看摘要
Abstract:Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, but its complexity, which is marked by high dimensionality, sparsity, and batch effects, which poses major computational challenges. Transformer-based models have made significant advances in this domain but are often limited by their quadratic complexity and suboptimal handling of long-range dependencies. In this work, we introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling. Leveraging the Bi-Mamba architecture, GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines. The model is pretrained on nearly 30 million cells and incorporates biologically informed objectives, including pathway-aware contrastive loss and rank-based gene encoding. We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness. These results position GeneMamba as a practical and powerful alternative to transformer-based methods, advancing the development of biologically grounded, scalable tools for large-scale single-cell data analysis.
44. 【2504.16939】A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions
链接:https://arxiv.org/abs/2504.16939
作者:Emre Can Acikgoz,Cheng Qian,Hongru Wang,Vardhan Dongre,Xiusi Chen,Heng Ji,Dilek Hakkani-Tür,Gokhan Tur
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Language Models, Large Language, sophisticated agents capable, traditional dialogue systems
备注:
点击查看摘要
Abstract:Recent advances in Large Language Models (LLMs) have propelled conversational AI from traditional dialogue systems into sophisticated agents capable of autonomous actions, contextual awareness, and multi-turn interactions with users. Yet, fundamental questions about their capabilities, limitations, and paths forward remain open. This survey paper presents a desideratum for next-generation Conversational Agents - what has been achieved, what challenges persist, and what must be done for more scalable systems that approach human-level intelligence. To that end, we systematically analyze LLM-driven Conversational Agents by organizing their capabilities into three primary dimensions: (i) Reasoning - logical, systematic thinking inspired by human intelligence for decision making, (ii) Monitor - encompassing self-awareness and user interaction monitoring, and (iii) Control - focusing on tool utilization and policy following. Building upon this, we introduce a novel taxonomy by classifying recent work on Conversational Agents around our proposed desideratum. We identify critical research gaps and outline key directions, including realistic evaluations, long-term multi-turn reasoning skills, self-evolution capabilities, collaborative and multi-agent task completion, personalization, and proactivity. This work aims to provide a structured foundation, highlight existing limitations, and offer insights into potential future research directions for Conversational Agents, ultimately advancing progress toward Artificial General Intelligence (AGI). We maintain a curated repository of papers at: this https URL.
信息检索
1. 【2504.17699】Quadratic Interest Network for Multimodal Click-Through Rate Prediction
链接:https://arxiv.org/abs/2504.17699
作者:Honghao Li,Hanwei Li,Jing Zhang,Yi Zhang,Ziniu Yu,Lei Sang,Yiwen Zhang
类目:Information Retrieval (cs.IR)
关键词:Multimodal CTR Prediction, Multimodal click-through rate, industrial recommender systems, CTR Prediction, CTR Prediction Challenge
备注:
点击查看摘要
Abstract:Multimodal click-through rate (CTR) prediction is a key technique in industrial recommender systems. It leverages heterogeneous modalities such as text, images, and behavioral logs to capture high-order feature interactions between users and items, thereby enhancing the system's understanding of user interests and its ability to predict click behavior. The primary challenge in this field lies in effectively utilizing the rich semantic information from multiple modalities while satisfying the low-latency requirements of online inference in real-world applications. To foster progress in this area, the Multimodal CTR Prediction Challenge Track of the WWW 2025 EReL@MIR Workshop formulates the problem into two tasks: (1) Task 1 of Multimodal Item Embedding: this task aims to explore multimodal information extraction and item representation learning methods that enhance recommendation tasks; and (2) Task 2 of Multimodal CTR Prediction: this task aims to explore what multimodal recommendation model can effectively leverage multimodal embedding features and achieve better performance. In this paper, we propose a novel model for Task 2, named Quadratic Interest Network (QIN) for Multimodal CTR Prediction. Specifically, QIN employs adaptive sparse target attention to extract multimodal user behavior features, and leverages Quadratic Neural Networks to capture high-order feature interactions. As a result, QIN achieved an AUC of 0.9798 on the leaderboard and ranked second in the competition. The model code, training logs, hyperparameter configurations, and checkpoints are available at this https URL.
2. 【2504.17547】A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task
链接:https://arxiv.org/abs/2504.17547
作者:Jiaqi Deng,Zonghan Wu,Huan Huo,Guandong Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
关键词:Vision Question Answering, Knowledge-based Vision Question, general Vision Question, Question Answering, Vision Question
备注: 20 pages, 5 figures, 4 tables
点击查看摘要
Abstract:Knowledge-based Vision Question Answering (KB-VQA) extends general Vision Question Answering (VQA) by not only requiring the understanding of visual and textual inputs but also extensive range of knowledge, enabling significant advancements across various real-world applications. KB-VQA introduces unique challenges, including the alignment of heterogeneous information from diverse modalities and sources, the retrieval of relevant knowledge from noisy or large-scale repositories, and the execution of complex reasoning to infer answers from the combined context. With the advancement of Large Language Models (LLMs), KB-VQA systems have also undergone a notable transformation, where LLMs serve as powerful knowledge repositories, retrieval-augmented generators and strong reasoners. Despite substantial progress, no comprehensive survey currently exists that systematically organizes and reviews the existing KB-VQA methods. This survey aims to fill this gap by establishing a structured taxonomy of KB-VQA approaches, and categorizing the systems into main stages: knowledge representation, knowledge retrieval, and knowledge reasoning. By exploring various knowledge integration techniques and identifying persistent challenges, this work also outlines promising future research directions, providing a foundation for advancing KB-VQA models and their applications.
3. 【2504.17529】IRA: Adaptive Interest-aware Representation and Alignment for Personalized Multi-interest Retrieval
链接:https://arxiv.org/abs/2504.17529
作者:Youngjune Lee,Haeyu Jeong,Changgeon Lim,Jeong Choi,Hongjun Lim,Hangon Kim,Jiyoon Kwon,Saehun Kim
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:require dynamic personalized, Online community platforms, platforms require dynamic, dynamic personalized retrieval, Online community
备注: Accepted to SIGIR 2025 Industry Track. First two authors contributed equally
点击查看摘要
Abstract:Online community platforms require dynamic personalized retrieval and recommendation that can continuously adapt to evolving user interests and new documents. However, optimizing models to handle such changes in real-time remains a major challenge in large-scale industrial settings. To address this, we propose the Interest-aware Representation and Alignment (IRA) framework, an efficient and scalable approach that dynamically adapts to new interactions through a cumulative structure. IRA leverages two key mechanisms: (1) Interest Units that capture diverse user interests as contextual texts, while reinforcing or fading over time through cumulative updates, and (2) a retrieval process that measures the relevance between Interest Units and documents based solely on semantic relationships, eliminating dependence on click signals to mitigate temporal biases. By integrating cumulative Interest Unit updates with the retrieval process, IRA continuously adapts to evolving user preferences, ensuring robust and fine-grained personalization without being constrained by past training distributions. We validate the effectiveness of IRA through extensive experiments on real-world datasets, including its deployment in the Home Section of NAVER's CAFE, South Korea's leading community platform.
4. 【2504.17519】Replication and Exploration of Generative Retrieval over Dynamic Corpora
链接:https://arxiv.org/abs/2504.17519
作者:Zhen Zhang,Xinyu Ma,Weiwei Sun,Pengjie Ren,Zhumin Chen,Shuaiqiang Wang,Dawei Yin,Maarten de Rijke,Zhaochun Ren
类目:Information Retrieval (cs.IR)
关键词:dynamic corpora, Generative retrieval, dynamic, promising paradigm, paradigm in information
备注: Accepted at SIGIR 2025 (Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)
点击查看摘要
Abstract:Generative retrieval (GR) has emerged as a promising paradigm in information retrieval (IR). However, most existing GR models are developed and evaluated using a static document collection, and their performance in dynamic corpora where document collections evolve continuously is rarely studied. In this paper, we first reproduce and systematically evaluate various representative GR approaches over dynamic corpora. Through extensive experiments, we reveal that existing GR models with \textit{text-based} docids show superior generalization to unseen documents. We observe that the more fine-grained the docid design in the GR model, the better its performance over dynamic corpora, surpassing BM25 and even being comparable to dense retrieval methods. While GR models with \textit{numeric-based} docids show high efficiency, their performance drops significantly over dynamic corpora. Furthermore, our experiments find that the underperformance of numeric-based docids is partly due to their excessive tendency toward the initial document set, which likely results from overfitting on the training set. We then conduct an in-depth analysis of the best-performing GR methods. We identify three critical advantages of text-based docids in dynamic corpora: (i) Semantic alignment with language models' pretrained knowledge, (ii) Fine-grained docid design, and (iii) High lexical diversity. Building on these insights, we finally propose a novel multi-docid design that leverages both the efficiency of numeric-based docids and the effectiveness of text-based docids, achieving improved performance in dynamic corpus without requiring additional retraining. Our work offers empirical evidence for advancing GR methods over dynamic corpora and paves the way for developing more generalized yet efficient GR models in real-world search engines.
5. 【2504.17454】Adaptive Orchestration of Modular Generative Information Access Systems
链接:https://arxiv.org/abs/2504.17454
作者:Mohanna Hoveyda,Harrie Oosterhuis,Arjen P. de Vries,Maarten de Rijke,Faegheh Hasibi
类目:Information Retrieval (cs.IR)
关键词:large language models, Advancements in large, large language, driven the emergence, emergence of complex
备注: Accepted at SIGIR 2025 Perspective Paper Track
点击查看摘要
Abstract:Advancements in large language models (LLMs) have driven the emergence of complex new systems to provide access to information, that we will collectively refer to as modular generative information access (GenIA) systems. They integrate a broad and evolving range of specialized components, including LLMs, retrieval models, and a heterogeneous set of sources and tools. While modularity offers flexibility, it also raises critical challenges: How can we systematically characterize the space of possible modules and their interactions? How can we automate and optimize interactions among these heterogeneous components? And, how do we enable this modular system to dynamically adapt to varying user query requirements and evolving module capabilities? In this perspective paper, we argue that the architecture of future modular generative information access systems will not just assemble powerful components, but enable a self-organizing system through real-time adaptive orchestration -- where components' interactions are dynamically configured for each user input, maximizing information relevance while minimizing computational overhead. We give provisional answers to the questions raised above with a roadmap that depicts the key principles and methods for designing such an adaptive modular system. We identify pressing challenges, and propose avenues for addressing them in the years ahead. This perspective urges the IR community to rethink modular system designs for developing adaptive, self-optimizing, and future-ready architectures that evolve alongside their rapidly advancing underlying technologies.
6. 【2504.17427】Beyond Whole Dialogue Modeling: Contextual Disentanglement for Conversational Recommendation
链接:https://arxiv.org/abs/2504.17427
作者:Guojia An,Jie Zou,Jiwei Wei,Chaoning Zhang,Fuming Sun,Yang Yang
类目:Information Retrieval (cs.IR)
关键词:provide personalized recommendations, aim to provide, provide personalized, analyzing and utilizing, recommender systems aim
备注:
点击查看摘要
Abstract:Conversational recommender systems aim to provide personalized recommendations by analyzing and utilizing contextual information related to dialogue. However, existing methods typically model the dialogue context as a whole, neglecting the inherent complexity and entanglement within the dialogue. Specifically, a dialogue comprises both focus information and background information, which mutually influence each other. Current methods tend to model these two types of information mixedly, leading to misinterpretation of users' actual needs, thereby lowering the accuracy of recommendations. To address this issue, this paper proposes a novel model to introduce contextual disentanglement for improving conversational recommender systems, named DisenCRS. The proposed model DisenCRS employs a dual disentanglement framework, including self-supervised contrastive disentanglement and counterfactual inference disentanglement, to effectively distinguish focus information and background information from the dialogue context under unsupervised conditions. Moreover, we design an adaptive prompt learning module to automatically select the most suitable prompt based on the specific dialogue context, fully leveraging the power of large language models. Experimental results on two widely used public datasets demonstrate that DisenCRS significantly outperforms existing conversational recommendation models, achieving superior performance on both item recommendation and response generation tasks.
7. 【2504.17349】DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition
链接:https://arxiv.org/abs/2504.17349
作者:Yiyan Xu,Wuqiang Zheng,Wenjie Wang,Fengbin Zhu,Xinting Hu,Yang Zhang,Fuli Feng,Tat-Seng Chua
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:multimodal content creation, Personalized image generation, Large Multimodal Models, content creation, promising direction
备注:
点击查看摘要
Abstract:Personalized image generation has emerged as a promising direction in multimodal content creation. It aims to synthesize images tailored to individual style preferences (e.g., color schemes, character appearances, layout) and semantic intentions (e.g., emotion, action, scene contexts) by leveraging user-interacted history images and multimodal instructions. Despite notable progress, existing methods -- whether based on diffusion models, large language models, or Large Multimodal Models (LMMs) -- struggle to accurately capture and fuse user style preferences and semantic intentions. In particular, the state-of-the-art LMM-based method suffers from the entanglement of visual features, leading to Guidance Collapse, where the generated images fail to preserve user-preferred styles or reflect the specified semantics. To address these limitations, we introduce DRC, a novel personalized image generation framework that enhances LMMs through Disentangled Representation Composition. DRC explicitly extracts user style preferences and semantic intentions from history images and the reference image, respectively, to form user-specific latent instructions that guide image generation within LMMs. Specifically, it involves two critical learning stages: 1) Disentanglement learning, which employs a dual-tower disentangler to explicitly separate style and semantic features, optimized via a reconstruction-driven paradigm with difficulty-aware importance sampling; and 2) Personalized modeling, which applies semantic-preserving augmentations to effectively adapt the disentangled representations for robust personalized generation. Extensive experiments on two benchmarks demonstrate that DRC shows competitive performance while effectively mitigating the guidance collapse issue, underscoring the importance of disentangled representation learning for controllable and effective personalized image generation.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Cite as:
arXiv:2504.17349 [cs.CV]
(or
arXiv:2504.17349v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2504.17349
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
8. 【2504.17334】DataScout: Automatic Data Fact Retrieval for Statement Augmentation with an LLM-Based Agent
链接:https://arxiv.org/abs/2504.17334
作者:Chuer Chen,Yuqi Liu,Danqing Shi,Shixiong Cao,Nan Cao
类目:Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
关键词:story typically integrates, data story typically, typically integrates data, integrates data facts, objective narrative
备注:
点击查看摘要
Abstract:A data story typically integrates data facts from multiple perspectives and stances to construct a comprehensive and objective narrative. However, retrieving these facts demands time for data search and challenges the creator's analytical skills. In this work, we introduce DataScout, an interactive system that automatically performs reasoning and stance-based data facts retrieval to augment the user's statement. Particularly, DataScout leverages an LLM-based agent to construct a retrieval tree, enabling collaborative control of its expansion between users and the agent. The interface visualizes the retrieval tree as a mind map that eases users to intuitively steer the retrieval direction and effectively engage in reasoning and analysis. We evaluate the proposed system through case studies and in-depth expert interviews. Our evaluation demonstrates that DataScout can effectively retrieve multifaceted data facts from different stances, helping users verify their statements and enhance the credibility of their stories.
9. 【2504.17304】You Are What You Bought: Generating Customer Personas for E-commerce Applications
链接:https://arxiv.org/abs/2504.17304
作者:Yimin Shi,Yang Fei,Shiqi Zhang,Haixun Wang,Xiaokui Xiao
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:customer, customer persona, GPLR, DOI, Focus to learn
备注: SIGIR 2025
点击查看摘要
Abstract:In e-commerce, user representations are essential for various applications. Existing methods often use deep learning techniques to convert customer behaviors into implicit embeddings. However, these embeddings are difficult to understand and integrate with external knowledge, limiting the effectiveness of applications such as customer segmentation, search navigation, and product recommendations. To address this, our paper introduces the concept of the customer persona. Condensed from a customer's numerous purchasing histories, a customer persona provides a multi-faceted and human-readable characterization of specific purchase behaviors and preferences, such as Busy Parents or Bargain Hunters. This work then focuses on representing each customer by multiple personas from a predefined set, achieving readable and informative explicit user representations. To this end, we propose an effective and efficient solution GPLR. To ensure effectiveness, GPLR leverages pre-trained LLMs to infer personas for customers. To reduce overhead, GPLR applies LLM-based labeling to only a fraction of users and utilizes a random walk technique to predict personas for the remaining customers. We further propose RevAff, which provides an absolute error $\epsilon$ guarantee while improving the time complexity of the exact solution by a factor of at least $O(\frac{\epsilon\cdot|E|N}{|E|+N\log N})$, where $N$ represents the number of customers and products, and $E$ represents the interactions between them. We evaluate the performance of our persona-based representation in terms of accuracy and robustness for recommendation and customer segmentation tasks using three real-world e-commerce datasets. Most notably, we find that integrating customer persona representations improves the state-of-the-art graph convolution-based recommendation model by up to 12% in terms of NDCG@K and F1-Score@K.
Comments:
SIGIR 2025
Subjects:
Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2504.17304 [cs.IR]
(or
arXiv:2504.17304v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2504.17304
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Related DOI:
https://doi.org/10.1145/3726302.3730118
Focus to learn more
DOI(s) linking to related resources</p>
10. 【2504.17220】Does Knowledge Distillation Matter for Large Language Model based Bundle Generation?
链接:https://arxiv.org/abs/2504.17220
作者:Kaidong Feng,Zhu Sun,Jie Yang,Hui Fang,Xinghua Qu,Wenyuan Liu
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:bundle generation, bundle generation performance, increasingly explored, reasoning capabilities, knowledge
备注:
点击查看摘要
Abstract:LLMs are increasingly explored for bundle generation, thanks to their reasoning capabilities and knowledge. However, deploying large-scale LLMs introduces significant efficiency challenges, primarily high computational costs during fine-tuning and inference due to their massive parameterization. Knowledge distillation (KD) offers a promising solution, transferring expertise from large teacher models to compact student models. This study systematically investigates knowledge distillation approaches for bundle generation, aiming to minimize computational demands while preserving performance. We explore three critical research questions: (1) how does the format of KD impact bundle generation performance? (2) to what extent does the quantity of distilled knowledge influence performance? and (3) how do different ways of utilizing the distilled knowledge affect performance? We propose a comprehensive KD framework that (i) progressively extracts knowledge (patterns, rules, deep thoughts); (ii) captures varying quantities of distilled knowledge through different strategies; and (iii) exploits complementary LLM adaptation techniques (in-context learning, supervised fine-tuning, combination) to leverage distilled knowledge in small student models for domain-specific adaptation and enhanced efficiency. Extensive experiments provide valuable insights into how knowledge format, quantity, and utilization methodologies collectively shape LLM-based bundle generation performance, exhibiting KD's significant potential for more efficient yet effective LLM-based bundle generation.
11. 【2504.17045】Dynamic Superblock Pruning for Fast Learned Sparse Retrieval
链接:https://arxiv.org/abs/2504.17045
作者:Parker Carlson,Wentai Xie,Shanxiu He,Tao Yang
类目:Information Retrieval (cs.IR)
关键词:learned sparse representations, paper proposes superblock, proposes superblock pruning, top-k online document, online document retrieval
备注: 6 pages, 3 figures, SIGIR 25
点击查看摘要
Abstract:This paper proposes superblock pruning (SP) during top-k online document retrieval for learned sparse representations. SP structures the sparse index as a set of superblocks on a sequence of document blocks and conducts a superblock-level selection to decide if some superblocks can be pruned before visiting their child blocks. SP generalizes the previous flat block or cluster-based pruning, allowing the early detection of groups of documents that cannot or are less likely to appear in the final top-k list. SP can accelerate sparse retrieval in a rank-safe or approximate manner under a high-relevance competitiveness constraint. Our experiments show that the proposed scheme significantly outperforms state-of-the-art baselines on MS MARCO passages on a single-threaded CPU.
计算机视觉
1. 【2504.17791】LiDPM: Rethinking Point Diffusion for Lidar Scene Completion
链接:https://arxiv.org/abs/2504.17791
作者:Tetiana Martyniuk,Gilles Puy,Alexandre Boulch,Renaud Marlet,Raoul de Charette
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Training diffusion models, generating fine-grained details, Training diffusion, field of view, diffusion models
备注: Accepted to IEEE IV 2025
点击查看摘要
Abstract:Training diffusion models that work directly on lidar points at the scale of outdoor scenes is challenging due to the difficulty of generating fine-grained details from white noise over a broad field of view. The latest works addressing scene completion with diffusion models tackle this problem by reformulating the original DDPM as a local diffusion process. It contrasts with the common practice of operating at the level of objects, where vanilla DDPMs are currently used. In this work, we close the gap between these two lines of work. We identify approximations in the local diffusion formulation, show that they are not required to operate at the scene level, and that a vanilla DDPM with a well-chosen starting point is enough for completion. Finally, we demonstrate that our method, LiDPM, leads to better results in scene completion on SemanticKITTI. The project page is this https URL .
2. 【2504.17789】oken-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models
链接:https://arxiv.org/abs/2504.17789
作者:Xu Ma,Peize Sun,Haoyu Ma,Hao Tang,Chih-Yao Ma,Jialiang Wang,Kunpeng Li,Xiaoliang Dai,Yujun Shi,Xuan Ju,Yushi Hu,Artsiom Sanakoyeu,Felix Juefei-Xu,Ji Hou,Junjiao Tian,Tao Xu,Tingbo Hou,Yen-Cheng Liu,Zecheng He,Zijian He,Matt Feiszli,Peizhao Zhang,Peter Vajda,Sam Tsai,Yun Fu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:competitive than Diffusion-based, Diffusion-based models, long dominant, increasingly applied, considered less competitive
备注:
点击查看摘要
Abstract:Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than Diffusion-based models. A primary limitation is the substantial number of image tokens required for AR models, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along channel dimension to decrease the input token number, and token-unshuffle, which untangles the inferred tokens after Transformer blocks to restore the spatial arrangement for output. Jointly training with textual prompts, our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token prediction way while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text-alignment, visual flaw, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
3. 【2504.17788】Dynamic Camera Poses and Where to Find Them
链接:https://arxiv.org/abs/2504.17788
作者:Chris Rockwell,Joseph Tung,Tsung-Yi Lin,Ming-Yu Liu,David F. Fouhey,Chen-Hsuan Lin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:dynamic Internet videos, Internet videos, annotating dynamic Internet, realistic video generation, dynamic Internet
备注: Accepted to CVPR 2025. Project Page: [this https URL](https://research.nvidia.com/labs/dir/dynpose-100k)
点击查看摘要
Abstract:Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos present significant challenges even for state-of-theart methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. Our collection pipeline addresses filtering using a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion to achieve improvements over the state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.
4. 【2504.17787】he Fourth Monocular Depth Estimation Challenge
链接:https://arxiv.org/abs/2504.17787
作者:Anton Obukhov,Matteo Poggi,Fabio Tosi,Ripudaman Singh Arora,Jaime Spencer,Chris Russell,Simon Hadfield,Richard Bowden,Shuaihang Wang,Zhenxin Ma,Weijie Chen,Baobei Xu,Fengyu Sun,Di Xie,Jiang Zhu,Mykola Lavreniuk,Haining Guan,Qun Wu,Yupei Zeng,Chao Lu,Huanran Wang,Guangyuan Zhou,Haotian Zhang,Jianxiong Wang,Qiang Rao,Chunjie Wang,Xiao Liu,Zhiqiang Lou,Hualie Jiang,Yihao Chen,Rui Xu,Minglang Tan,Zihan Qin,Yifan Mao,Jiayang Liu,Jialei Xu,Yifan Yang,Wenbo Zhao,Junjun Jiang,Xianming Liu,Mingshuai Zhao,Anlong Ming,Wu Chen,Feng Xue,Mengying Yu,Shida Gao,Xiangfeng Wang,Gbenga Omotara,Ramy Farag,Jacket Demby,Seyed Mohamad Ali Tousi,Guilherme N DeSouza,Tuan-Anh Yang,Minh-Quang Nguyen,Thien-Phuc Tran,Albert Luginov,Muhammad Shahzad
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Monocular Depth Estimation, Depth Estimation Challenge, dataset featuring challenging, featuring challenging environments, Monocular Depth
备注: To appear in CVPRW2025
点击查看摘要
Abstract:This paper presents the results of the fourth edition of the Monocular Depth Estimation Challenge (MDEC), which focuses on zero-shot generalization to the SYNS-Patches benchmark, a dataset featuring challenging environments in both natural and indoor settings. In this edition, we revised the evaluation protocol to use least-squares alignment with two degrees of freedom to support disparity and affine-invariant predictions. We also revised the baselines and included popular off-the-shelf methods: Depth Anything v2 and Marigold. The challenge received a total of 24 submissions that outperformed the baselines on the test set; 10 of these included a report describing their approach, with most leading methods relying on affine-invariant predictions. The challenge winners improved the 3D F-Score over the previous edition's best result, raising it from 22.58% to 23.05%.
5. 【2504.17761】Step1X-Edit: A Practical Framework for General Image Editing
链接:https://arxiv.org/abs/2504.17761
作者:Shiyu Liu,Yucheng Han,Peng Xing,Fukun Yin,Rui Wang,Wei Cheng,Jiaqi Liao,Yingming Wang,Honghao Fu,Chunrui Han,Guopeng Li,Yuang Peng,Quan Sun,Jingwei Wu,Yan Cai,Zheng Ge,Ranchen Ming,Lei Xia,Xianfang Zeng,Yibo Zhu,Binxing Jiao,Xiangyu Zhang,Gang Yu,Daxin Jiang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:rapid development, image editing, witnessed remarkable, remarkable and rapid, image
备注: code: [this https URL](https://github.com/stepfun-ai/Step1X-Edit)
点击查看摘要
Abstract:In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between the open-source algorithm with these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide comparable performance against the closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt the Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding has been extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop the GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.
6. 【2504.17735】EgoCHARM: Resource-Efficient Hierarchical Activity Recognition using an Egocentric IMU Sensor
链接:https://arxiv.org/abs/2504.17735
作者:Akhil Padmanabha,Saravanan Govindarajan,Hwanmun Kim,Sergio Ortiz,Rahul Rajan,Doruk Senkal,Sneha Kadetotad
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Human activity recognition, low level, low level activity, level activity recognition, Human activity
备注:
点击查看摘要
Abstract:Human activity recognition (HAR) on smartglasses has various use cases, including health/fitness tracking and input for context-aware AI assistants. However, current approaches for egocentric activity recognition suffer from low performance or are resource-intensive. In this work, we introduce a resource (memory, compute, power, sample) efficient machine learning algorithm, EgoCHARM, for recognizing both high level and low level activities using a single egocentric (head-mounted) Inertial Measurement Unit (IMU). Our hierarchical algorithm employs a semi-supervised learning strategy, requiring primarily high level activity labels for training, to learn generalizable low level motion embeddings that can be effectively utilized for low level activity recognition. We evaluate our method on 9 high level and 3 low level activities achieving 0.826 and 0.855 F1 scores on high level and low level activity recognition respectively, with just 63k high level and 22k low level model parameters, allowing the low level encoder to be deployed directly on current IMU chips with compute. Lastly, we present results and insights from a sensitivity analysis and highlight the opportunities and limitations of activity recognition using egocentric IMUs.
7. 【2504.17732】DPMambaIR:All-in-One Image Restoration via Degradation-Aware Prompt State Space Model
链接:https://arxiv.org/abs/2504.17732
作者:Zhanwen Liu,Sai Zhou,Yuchao Dai,Yang Wang,Yisheng An,Xiangmo Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:significantly reducing training, reducing training costs, deployment complexity compared, address multiple image, design dedicated models
备注:
点击查看摘要
Abstract:All-in-One image restoration aims to address multiple image degradation problems using a single model, significantly reducing training costs and deployment complexity compared to traditional methods that design dedicated models for each degradation type. Existing approaches typically rely on Degradation-specific models or coarse-grained degradation prompts to guide image restoration. However, they lack fine-grained modeling of degradation information and face limitations in balancing multi-task conflicts. To overcome these limitations, we propose DPMambaIR, a novel All-in-One image restoration framework. By integrating a Degradation-Aware Prompt State Space Model (DP-SSM) and a High-Frequency Enhancement Block (HEB), DPMambaIR enables fine-grained modeling of complex degradation information and efficient global integration, while mitigating the loss of high-frequency details caused by task competition. Specifically, the DP-SSM utilizes a pre-trained degradation extractor to capture fine-grained degradation features and dynamically incorporates them into the state space modeling process, enhancing the model's adaptability to diverse degradation types. Concurrently, the HEB supplements high-frequency information, effectively addressing the loss of critical details, such as edges and textures, in multi-task image restoration scenarios. Extensive experiments on a mixed dataset containing seven degradation types show that DPMambaIR achieves the best performance, with 27.69dB and 0.893 in PSNR and SSIM, respectively. These results highlight the potential and superiority of DPMambaIR as a unified solution for All-in-One image restoration.
8. 【2504.17728】CasualHDRSplat: Robust High Dynamic Range 3D Gaussian Splatting from Casually Captured Videos
链接:https://arxiv.org/abs/2504.17728
作者:Shucheng Gong,Lingzhe Zhao,Wenpu Li,Hong Xie,Yin Zhang,Shiyu Zhao,Peidong Liu
类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Gaussian Splatting, neural radiance field, garnered widespread attention, widespread attention due, photo-realistic novel view
备注: Source Code: [this https URL](https://github.com/WU-CVGL/CasualHDRSplat)
点击查看摘要
Abstract:Recently, photo-realistic novel view synthesis from multi-view images, such as neural radiance field (NeRF) and 3D Gaussian Splatting (3DGS), have garnered widespread attention due to their superior performance. However, most works rely on low dynamic range (LDR) images, which limits the capturing of richer scene details. Some prior works have focused on high dynamic range (HDR) scene reconstruction, typically require capturing of multi-view sharp images with different exposure times at fixed camera positions during exposure times, which is time-consuming and challenging in practice. For a more flexible data acquisition, we propose a one-stage method: \textbf{CasualHDRSplat} to easily and robustly reconstruct the 3D HDR scene from casually captured videos with auto-exposure enabled, even in the presence of severe motion blur and varying unknown exposure time. \textbf{CasualHDRSplat} contains a unified differentiable physical imaging model which first applies continuous-time trajectory constraint to imaging process so that we can jointly optimize exposure time, camera response function (CRF), camera poses, and sharp 3D HDR scene. Extensive experiments demonstrate that our approach outperforms existing methods in terms of robustness and rendering quality. Our source code will be available at this https URL
9. 【2504.17712】Generative Fields: Uncovering Hierarchical Feature Control for StyleGAN via Inverted Receptive Fields
链接:https://arxiv.org/abs/2504.17712
作者:Zhuo He,Paul Henderson,Nicolas Pugeault
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:synthesize highly-realistic faces, latent space, random noise, demonstrated the ability, ability of GANs
备注:
点击查看摘要
Abstract:StyleGAN has demonstrated the ability of GANs to synthesize highly-realistic faces of imaginary people from random noise. One limitation of GAN-based image generation is the difficulty of controlling the features of the generated image, due to the strong entanglement of the low-dimensional latent space. Previous work that aimed to control StyleGAN with image or text prompts modulated sampling in W latent space, which is more expressive than Z latent space. However, W space still has restricted expressivity since it does not control the feature synthesis directly; also the feature embedding in W space requires a pre-training process to reconstruct the style signal, limiting its application. This paper introduces the concept of "generative fields" to explain the hierarchical feature synthesis in StyleGAN, inspired by the receptive fields of convolution neural networks (CNNs). Additionally, we propose a new image editing pipeline for StyleGAN using generative field theory and the channel-wise style latent space S, utilizing the intrinsic structural feature of CNNs to achieve disentangled control of feature synthesis at synthesis time.
10. 【2504.17696】Hierarchical and Multimodal Data for Daily Activity Understanding
链接:https://arxiv.org/abs/2504.17696
作者:Ghazal Kaviani,Yavuz Yarici,Seulgi Kim,Mohit Prabhushankar,Ghassan AlRegib,Mashhour Solh,Ameya Patil
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Daily Activity Recordings, Daily Activity, understand human activities, hierarchically annotated dataset, real-world settings
备注:
点击查看摘要
Abstract:Daily Activity Recordings for Artificial Intelligence (DARai, pronounced "Dahr-ree") is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and gaze tracker. To capture the complexity in human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The dataset annotations and recordings are designed so that 22.7% of L2 actions are shared between L1 activities and 14.2% of L3 procedures are shared between L2 actions. The overlap and unscripted nature of DARai allows counterfactual activities in the dataset. Experiments with various machine learning models showcase the value of DARai in uncovering important challenges in human-centered applications. Specifically, we conduct unimodal and multimodal sensor fusion experiments for recognition, temporal localization, and future action anticipation across all hierarchical annotation levels. To highlight the limitations of individual sensors, we also conduct domain-variant experiments that are enabled by DARai's multi-sensor and counterfactual activity design setup. The code, documentation, and dataset are available at the dedicated DARai website: this https URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2504.17696 [cs.CV]
(or
arXiv:2504.17696v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2504.17696
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Ghazal Kaviani [view email] [v1]
Thu, 24 Apr 2025 16:04:00 UTC (40,427 KB)
11. 【2504.17695】PICO: Reconstructing 3D People In Contact with Objects
链接:https://arxiv.org/abs/2504.17695
作者:Alpár Cseke,Shashank Tripathi,Sai Kumar Dwivedi,Arjun Lakshmipathy,Agniv Chatterjee,Michael J. Black,Dimitrios Tzionas
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:object, single color images, Recovering, depth ambiguities, contact
备注: Accepted in CVPR'25. Project Page: [this https URL](https://pico.is.tue.mpg.de)
点击查看摘要
Abstract:Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact on both body and object meshes. To this end, we use images from the recent DAMON dataset that are paired with contacts, but these contacts are only annotated on a canonical 3D body. In contrast, we seek contact labels on both the body and the object. To infer these given an image, we retrieve an appropriate 3D object mesh from a database by leveraging vision foundation models. Then, we project DAMON's body contact patches onto the object via a novel method needing only 2 clicks per patch. This minimal human input establishes rich contact correspondences between bodies and objects. (2) We exploit our new dataset of contact correspondences in a novel render-and-compare fitting method, called PICO-fit, to recover 3D body and object meshes in interaction. PICO-fit infers contact for the SMPL-X body, retrieves a likely 3D object mesh and contact from PICO-db for that object, and uses the contact to iteratively fit the 3D body and object meshes to image evidence via optimization. Uniquely, PICO-fit works well for many object categories that no existing method can tackle. This is crucial to enable HOI understanding to scale in the wild. Our data and code are available at this https URL.
12. 【2504.17693】BIM-Constrained Optimization for Accurate Localization and Deviation Correction in Construction Monitoring
链接:https://arxiv.org/abs/2504.17693
作者:Asier Bikandi,Muhammad Shaheer,Hriday Bavle,Jayan Jevanesan,Holger Voos,Jose Luis Sanchez-Lopez
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Augmented reality, real-time environmental tracking, visualize architectural elements, construction monitoring rely, monitoring rely
备注:
点击查看摘要
Abstract:Augmented reality (AR) applications for construction monitoring rely on real-time environmental tracking to visualize architectural elements. However, construction sites present significant challenges for traditional tracking methods due to featureless surfaces, dynamic changes, and drift accumulation, leading to misalignment between digital models and the physical world. This paper proposes a BIM-aware drift correction method to address these challenges. Instead of relying solely on SLAM-based localization, we align ``as-built" detected planes from the real-world environment with ``as-planned" architectural planes in BIM. Our method performs robust plane matching and computes a transformation (TF) between SLAM (S) and BIM (B) origin frames using optimization techniques, minimizing drift over time. By incorporating BIM as prior structural knowledge, we can achieve improved long-term localization and enhanced AR visualization accuracy in noisy construction environments. The method is evaluated through real-world experiments, showing significant reductions in drift-induced errors and optimized alignment consistency. On average, our system achieves a reduction of 52.24% in angular deviations and a reduction of 60.8% in the distance error of the matched walls compared to the initial manual alignment by the user.
13. 【2504.17670】DiMeR: Disentangled Mesh Reconstruction Model
链接:https://arxiv.org/abs/2504.17670
作者:Lutao Jiang,Jiantao Lin,Kanghao Chen,Wenhang Ge,Xin Yang,Yifan Jiang,Yuanhuiyi Lyu,Xu Zheng,Yingcong Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Reconstruction Model, achieved remarkable success, gained significant attention, Large Reconstruction, advent of large-scale
备注: Project Page: [this https URL](https://lutao2021.github.io/DiMeR_page/)
点击查看摘要
Abstract:With the advent of large-scale 3D datasets, feed-forward 3D generative models, such as the Large Reconstruction Model (LRM), have gained significant attention and achieved remarkable success. However, we observe that RGB images often lead to conflicting training objectives and lack the necessary clarity for geometry reconstruction. In this paper, we revisit the inductive biases associated with mesh reconstruction and introduce DiMeR, a novel disentangled dual-stream feed-forward model for sparse-view mesh reconstruction. The key idea is to disentangle both the input and framework into geometry and texture parts, thereby reducing the training difficulty for each part according to the Principle of Occam's Razor. Given that normal maps are strictly consistent with geometry and accurately capture surface variations, we utilize normal maps as exclusive input for the geometry branch to reduce the complexity between the network's input and output. Moreover, we improve the mesh extraction algorithm to introduce 3D ground truth supervision. As for texture branch, we use RGB images as input to obtain the textured mesh. Overall, DiMeR demonstrates robust capabilities across various tasks, including sparse-view reconstruction, single-image-to-3D, and text-to-3D. Numerous experiments show that DiMeR significantly outperforms previous methods, achieving over 30% improvement in Chamfer Distance on the GSO and OmniObject3D dataset.
14. 【2504.17655】Aerial Image Classification in Scarce and Unconstrained Environments via Conformal Prediction
链接:https://arxiv.org/abs/2504.17655
作者:Farhad Pourkamali-Anaraki
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
关键词:challenging aerial image, aerial image dataset, image dataset featuring, dataset featuring diverse, featuring diverse events
备注: 17 pages, 5 figures, and 2 tables
点击查看摘要
Abstract:This paper presents a comprehensive empirical analysis of conformal prediction methods on a challenging aerial image dataset featuring diverse events in unconstrained environments. Conformal prediction is a powerful post-hoc technique that takes the output of any classifier and transforms it into a set of likely labels, providing a statistical guarantee on the coverage of the true label. Unlike evaluations on standard benchmarks, our study addresses the complexities of data-scarce and highly variable real-world settings. We investigate the effectiveness of leveraging pretrained models (MobileNet, DenseNet, and ResNet), fine-tuned with limited labeled data, to generate informative prediction sets. To further evaluate the impact of calibration, we consider two parallel pipelines (with and without temperature scaling) and assess performance using two key metrics: empirical coverage and average prediction set size. This setup allows us to systematically examine how calibration choices influence the trade-off between reliability and efficiency. Our findings demonstrate that even with relatively small labeled samples and simple nonconformity scores, conformal prediction can yield valuable uncertainty estimates for complex tasks. Moreover, our analysis reveals that while temperature scaling is often employed for calibration, it does not consistently lead to smaller prediction sets, underscoring the importance of careful consideration in its application. Furthermore, our results highlight the significant potential of model compression techniques within the conformal prediction pipeline for deployment in resource-constrained environments. Based on our observations, we advocate for future research to delve into the impact of noisy or ambiguous labels on conformal prediction performance and to explore effective model reduction strategies.
15. 【2504.17643】CLIPSE -- a minimalistic CLIP-based image search engine for research
链接:https://arxiv.org/abs/2504.17643
作者:Steve Göring
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:self-hosted image search, image search engine, application of research, search engine, main application
备注:
点击查看摘要
Abstract:A brief overview of CLIPSE, a self-hosted image search engine with the main application of research, is provided. In general, CLIPSE uses CLIP embeddings to process the images and also the text queries. The overall framework is designed with simplicity to enable easy extension and usage. Two benchmark scenarios are described and evaluated, covering indexing and querying time. It is shown that CLIPSE is capable of handling smaller datasets; for larger datasets, a distributed approach with several instances should be considered.
16. 【2504.17636】A Guide to Structureless Visual Localization
链接:https://arxiv.org/abs/2504.17636
作者:Vojtech Panek,Qunjie Zhou,Yaqing Ding,Sérgio Agostinho,Zuzana Kukelova,Torsten Sattler,Laura Leal-Taixé
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Visual localization algorithms, mixed reality systems, including self-driving cars, Visual localization, including self-driving
备注:
点击查看摘要
Abstract:Visual localization algorithms, i.e., methods that estimate the camera pose of a query image in a known scene, are core components of many applications, including self-driving cars and augmented / mixed reality systems. State-of-the-art visual localization algorithms are structure-based, i.e., they store a 3D model of the scene and use 2D-3D correspondences between the query image and 3D points in the model for camera pose estimation. While such approaches are highly accurate, they are also rather inflexible when it comes to adjusting the underlying 3D model after changes in the scene. Structureless localization approaches represent the scene as a database of images with known poses and thus offer a much more flexible representation that can be easily updated by adding or removing images. Although there is a large amount of literature on structure-based approaches, there is significantly less work on structureless methods. Hence, this paper is dedicated to providing the, to the best of our knowledge, first comprehensive discussion and comparison of structureless methods. Extensive experiments show that approaches that use a higher degree of classical geometric reasoning generally achieve higher pose accuracy. In particular, approaches based on classical absolute or semi-generalized relative pose estimation outperform very recent methods based on pose regression by a wide margin. Compared with state-of-the-art structure-based approaches, the flexibility of structureless methods comes at the cost of (slightly) lower pose accuracy, indicating an interesting direction for future work.
17. 【2504.17626】Improving Open-World Object Localization by Discovering Background
链接:https://arxiv.org/abs/2504.17626
作者:Ashish Singh,Michael J. Jones,Kuan-Chuan Peng,Anoop Cherian,Moitreya Chatterjee,Erik Learned-Miller
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:bounding box information, unseen classes, open-world setting, bounding box, limited number
备注:
点击查看摘要
Abstract:Our work addresses the problem of learning to localize objects in an open-world setting, i.e., given the bounding box information of a limited number of object classes during training, the goal is to localize all objects, belonging to both the training and unseen classes in an image, during inference. Towards this end, recent work in this area has focused on improving the characterization of objects either explicitly by proposing new objective functions (localization quality) or implicitly using object-centric auxiliary-information, such as depth information, pixel/region affinity map etc. In this work, we address this problem by incorporating background information to guide the learning of the notion of objectness. Specifically, we propose a novel framework to discover background regions in an image and train an object proposal network to not detect any objects in these regions. We formulate the background discovery task as that of identifying image regions that are not discriminative, i.e., those that are redundant and constitute low information content. We conduct experiments on standard benchmarks to showcase the effectiveness of our proposed approach and observe significant improvements over the previous state-of-the-art approaches for this task.
18. 【2504.17619】Enhancing CNNs robustness to occlusions with bioinspired filters for border completion
链接:https://arxiv.org/abs/2504.17619
作者:Catarina P. Coutinho,Aneeqa Merhab,Janko Petkovic,Ferdinando Zanchetta,Rita Fioresi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:visual cortex mechanism, define custom filters, filters for CNNs, exploit the mathematical, mathematical modeling
备注: Submitted to the 7th International Conference on Geometric Science of Information
点击查看摘要
Abstract:We exploit the mathematical modeling of the visual cortex mechanism for border completion to define custom filters for CNNs. We see a consistent improvement in performance, particularly in accuracy, when our modified LeNet 5 is tested with occluded MNIST images.
19. 【2504.17618】he effects of Hessian eigenvalue spectral density type on the applicability of Hessian analysis to generalization capability assessment of neural networks
链接:https://arxiv.org/abs/2504.17618
作者:Nikita Gabdullin
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:HESD, neural network, essential information, previously proposed generalization, Hessian
备注: 11 pages, 10 figures, 4 tables, 4 equations
点击查看摘要
Abstract:Hessians of neural network (NN) contain essential information about the curvature of NN loss landscapes which can be used to estimate NN generalization capabilities. We have previously proposed generalization criteria that rely on the observation that Hessian eigenvalue spectral density (HESD) behaves similarly for a wide class of NNs. This paper further studies their applicability by investigating factors that can result in different types of HESD. We conduct a wide range of experiments showing that HESD mainly has positive eigenvalues (MP-HESD) for NN training and fine-tuning with various optimizers on different datasets with different preprocessing and augmentation procedures. We also show that mainly negative HESD (MN-HESD) is a consequence of external gradient manipulation, indicating that the previously proposed Hessian analysis methodology cannot be applied in such cases. We also propose criteria and corresponding conditions to determine HESD type and estimate NN generalization potential. These HESD types and previously proposed generalization criteria are combined into a unified HESD analysis methodology. Finally, we discuss how HESD changes during training, and show the occurrence of quasi-singular (QS) HESD and its influence on the proposed methodology and on the conventional assumptions about the relation between Hessian eigenvalues and NN loss landscape curvature.
20. 【2504.17609】STCL:Curriculum learning Strategies for deep learning image steganography models
链接:https://arxiv.org/abs/2504.17609
作者:Fengchun Liu,Tong Zhang,Chunying Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
关键词:Steganography Curriculum Learning, Curriculum Learning training, deep learning image, Curriculum Learning, deep learning
备注:
点击查看摘要
Abstract:Aiming at the problems of poor quality of steganographic images and slow network convergence of image steganography models based on deep learning, this paper proposes a Steganography Curriculum Learning training strategy (STCL) for deep learning image steganography models. So that only easy images are selected for training when the model has poor fitting ability at the initial stage, and gradually expand to more difficult images, the strategy includes a difficulty evaluation strategy based on the teacher model and an knee point-based training scheduling strategy. Firstly, multiple teacher models are trained, and the consistency of the quality of steganographic images under multiple teacher models is used as the difficulty score to construct the training subsets from easy to difficult. Secondly, a training control strategy based on knee points is proposed to reduce the possibility of overfitting on small training sets and accelerate the training process. Experimental results on three large public datasets, ALASKA2, VOC2012 and ImageNet, show that the proposed image steganography scheme is able to improve the model performance under multiple algorithmic frameworks, which not only has a high PSNR, SSIM score, and decoding accuracy, but also the steganographic images generated by the model under the training of the STCL strategy have a low steganography analysis scores. You can find our code at \href{this https URL}{this https URL}.
21. 【2504.17595】RGB-D Tracking via Hierarchical Modality Aggregation and Distribution Network
链接:https://arxiv.org/abs/2504.17595
作者:Boyue Xu,Yi Xu,Ruichao Hou,Jia Bei,Tongwei Ren,Gangshan Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:advancing RGB-Depth, integration of dual-modal, pivotal in advancing, Hierarchical Modality Aggregation, HMAD
备注:
点击查看摘要
Abstract:The integration of dual-modal features has been pivotal in advancing RGB-Depth (RGB-D) tracking. However, current trackers are less efficient and focus solely on single-level features, resulting in weaker robustness in fusion and slower speeds that fail to meet the demands of real-world applications. In this paper, we introduce a novel network, denoted as HMAD (Hierarchical Modality Aggregation and Distribution), which addresses these challenges. HMAD leverages the distinct feature representation strengths of RGB and depth modalities, giving prominence to a hierarchical approach for feature distribution and fusion, thereby enhancing the robustness of RGB-D tracking. Experimental results on various RGB-D datasets demonstrate that HMAD achieves state-of-the-art performance. Moreover, real-world experiments further validate HMAD's capacity to effectively handle a spectrum of tracking challenges in real-time scenarios.
22. 【2504.17594】amper-evident Image using JPEG Fixed Points
链接:https://arxiv.org/abs/2504.17594
作者:Zhaofeng Si,Siwei Lyu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:repeating JPEG compression, JPEG compression, decades ago, change anymore, repeating JPEG
备注: 6 pages, 6 figures
点击查看摘要
Abstract:An intriguing phenomenon about JPEG compression has been observed since two decades ago- after repeating JPEG compression and decompression, it leads to a stable image that does not change anymore, which is a fixed point. In this work, we prove the existence of fixed points in the essential JPEG procedures. We analyze JPEG compression and decompression processes, revealing the existence of fixed points that can be reached within a few iterations. These fixed points are diverse and preserve the image's visual quality, ensuring minimal distortion. This result is used to develop a method to create a tamper-evident image from the original authentic image, which can expose tampering operations by showing deviations from the fixed point image.
23. 【2504.17582】Occlusion-Aware Self-Supervised Monocular Depth Estimation for Weak-Texture Endoscopic Images
链接:https://arxiv.org/abs/2504.17582
作者:Zebo Huang,Yinghui Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:estimation network tailored, monocular images, aiming to infer, network tailored, gastrointestinal tract
备注:
点击查看摘要
Abstract:We propose a self-supervised monocular depth estimation network tailored for endoscopic scenes, aiming to infer depth within the gastrointestinal tract from monocular images. Existing methods, though accurate, typically assume consistent illumination, which is often violated due to dynamic lighting and occlusions caused by GI motility. These variations lead to incorrect geometric interpretations and unreliable self-supervised signals, degrading depth reconstruction quality. To address this, we introduce an occlusion-aware self-supervised framework. First, we incorporate an occlusion mask for data augmentation, generating pseudo-labels by simulating viewpoint-dependent occlusion scenarios. This enhances the model's ability to learn robust depth features under partial visibility. Second, we leverage semantic segmentation guided by non-negative matrix factorization, clustering convolutional activations to generate pseudo-labels in texture-deprived regions, thereby improving segmentation accuracy and mitigating information loss from lighting changes. Experimental results on the SCARED dataset show that our method achieves state-of-the-art performance in self-supervised depth estimation. Additionally, evaluations on the Endo-SLAM and SERV-CT datasets demonstrate strong generalization across diverse endoscopic environments.
24. 【2504.17551】Unsupervised Urban Land Use Mapping with Street View Contrastive Clustering and a Geographical Prior
链接:https://arxiv.org/abs/2504.17551
作者:Lin Che,Yizi Chen,Tanhua Jin,Martin Raubal,Konrad Schindler,Peter Kiefer
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:resource management, environmental monitoring, street view images, street view, Urban
备注: 11 pages, 7 figures, preprint version
点击查看摘要
Abstract:Urban land use classification and mapping are critical for urban planning, resource management, and environmental monitoring. Existing remote sensing techniques often lack precision in complex urban environments due to the absence of ground-level details. Unlike aerial perspectives, street view images provide a ground-level view that captures more human and social activities relevant to land use in complex urban scenes. Existing street view-based methods primarily rely on supervised classification, which is challenged by the scarcity of high-quality labeled data and the difficulty of generalizing across diverse urban landscapes. This study introduces an unsupervised contrastive clustering model for street view images with a built-in geographical prior, to enhance clustering performance. When combined with a simple visual assignment of the clusters, our approach offers a flexible and customizable solution to land use mapping, tailored to the specific needs of urban planners. We experimentally show that our method can generate land use maps from geotagged street view image datasets of two cities. As our methodology relies on the universal spatial coherence of geospatial data ("Tobler's law"), it can be adapted to various settings where street view images are available, to enable scalable, unsupervised land use mapping and updating. The code will be available at this https URL.
25. 【2504.17547】A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task
链接:https://arxiv.org/abs/2504.17547
作者:Jiaqi Deng,Zonghan Wu,Huan Huo,Guandong Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
关键词:Vision Question Answering, Knowledge-based Vision Question, general Vision Question, Question Answering, Vision Question
备注: 20 pages, 5 figures, 4 tables
点击查看摘要
Abstract:Knowledge-based Vision Question Answering (KB-VQA) extends general Vision Question Answering (VQA) by not only requiring the understanding of visual and textual inputs but also extensive range of knowledge, enabling significant advancements across various real-world applications. KB-VQA introduces unique challenges, including the alignment of heterogeneous information from diverse modalities and sources, the retrieval of relevant knowledge from noisy or large-scale repositories, and the execution of complex reasoning to infer answers from the combined context. With the advancement of Large Language Models (LLMs), KB-VQA systems have also undergone a notable transformation, where LLMs serve as powerful knowledge repositories, retrieval-augmented generators and strong reasoners. Despite substantial progress, no comprehensive survey currently exists that systematically organizes and reviews the existing KB-VQA methods. This survey aims to fill this gap by establishing a structured taxonomy of KB-VQA approaches, and categorizing the systems into main stages: knowledge representation, knowledge retrieval, and knowledge reasoning. By exploring various knowledge integration techniques and identifying persistent challenges, this work also outlines promising future research directions, providing a foundation for advancing KB-VQA models and their applications.
26. 【2504.17545】When Gaussian Meets Surfel: Ultra-fast High-fidelity Radiance Field Rendering
链接:https://arxiv.org/abs/2504.17545
作者:Keyang Ye,Tianjia Shao,Kun Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:introduce Gaussian-enhanced Surfels, surfels supplement fine-scale, view-dependent colors represent, introduce Gaussian-enhanced, fine-scale appearance details
备注:
点击查看摘要
Abstract:We introduce Gaussian-enhanced Surfels (GESs), a bi-scale representation for radiance field rendering, wherein a set of 2D opaque surfels with view-dependent colors represent the coarse-scale geometry and appearance of scenes, and a few 3D Gaussians surrounding the surfels supplement fine-scale appearance details. The rendering with GESs consists of two passes -- surfels are first rasterized through a standard graphics pipeline to produce depth and color maps, and then Gaussians are splatted with depth testing and color accumulation on each pixel order independently. The optimization of GESs from multi-view images is performed through an elaborate coarse-to-fine procedure, faithfully capturing rich scene appearance. The entirely sorting-free rendering of GESs not only achieves very fast rates, but also produces view-consistent images, successfully avoiding popping artifacts under view changes. The basic GES representation can be easily extended to achieve anti-aliasing in rendering (Mip-GES), boosted rendering speeds (Speedy-GES) and compact storage (Compact-GES), and reconstruct better scene geometries by replacing 3D Gaussians with 2D Gaussians (2D-GES). Experimental results show that GESs advance the state-of-the-arts as a compelling representation for ultra-fast high-fidelity radiance field rendering.
27. 【2504.17540】An Explainable Nature-Inspired Framework for Monkeypox Diagnosis: Xception Features Combined with NGBoost and African Vultures Optimization Algorithm
链接:https://arxiv.org/abs/2504.17540
作者:Ahmadreza Shateri,Negar Nourani,Morteza Dorrigiv,Hamid Nasiri
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
关键词:public health concerns, recent global spread, raised significant public, significant public health, historically been prevalent
备注:
点击查看摘要
Abstract:The recent global spread of monkeypox, particularly in regions where it has not historically been prevalent, has raised significant public health concerns. Early and accurate diagnosis is critical for effective disease management and control. In response, this study proposes a novel deep learning-based framework for the automated detection of monkeypox from skin lesion images, leveraging the power of transfer learning, dimensionality reduction, and advanced machine learning techniques. We utilize the newly developed Monkeypox Skin Lesion Dataset (MSLD), which includes images of monkeypox, chickenpox, and measles, to train and evaluate our models. The proposed framework employs the Xception architecture for deep feature extraction, followed by Principal Component Analysis (PCA) for dimensionality reduction, and the Natural Gradient Boosting (NGBoost) algorithm for classification. To optimize the model's performance and generalization, we introduce the African Vultures Optimization Algorithm (AVOA) for hyperparameter tuning, ensuring efficient exploration of the parameter space. Our results demonstrate that the proposed AVOA-NGBoost model achieves state-of-the-art performance, with an accuracy of 97.53%, F1-score of 97.72% and an AUC of 97.47%. Additionally, we enhance model interpretability using Grad-CAM and LIME techniques, providing insights into the decision-making process and highlighting key features influencing classification. This framework offers a highly precise and efficient diagnostic tool, potentially aiding healthcare providers in early detection and diagnosis, particularly in resource-constrained environments.
28. 【2504.17525】xt-to-Image Alignment in Denoising-Based Models through Step Selection
链接:https://arxiv.org/abs/2504.17525
作者:Paul Grimal,Hervé Le Borgne,Olivier Ferret
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:encounter challenges related, Visual generative, reasoning limitations, encounter challenges, challenges related
备注:
点击查看摘要
Abstract:Visual generative AI models often encounter challenges related to text-image alignment and reasoning limitations. This paper presents a novel method for selectively enhancing the signal at critical denoising steps, optimizing image generation based on input semantics. Our approach addresses the shortcomings of early-stage signal modifications, demonstrating that adjustments made at later stages yield superior results. We conduct extensive experiments to validate the effectiveness of our method in producing semantically aligned images on Diffusion and Flow Matching model, achieving state-of-the-art performance. Our results highlight the importance of a judicious choice of sampling stage to improve performance and overall image alignment.
29. 【2504.17524】ESDiff: Encoding Strategy-inspired Diffusion Model with Few-shot Learning for Color Image Inpainting
链接:https://arxiv.org/abs/2504.17524
作者:Junyan Zhang,Yan Li,Mengxiao Geng,Liu Shi,Qiegen Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:damaged regions, restore missing, diffusion model, reconstructing missing areas, Image
备注: 11 pages,10 figures,Submit to tcsvt
点击查看摘要
Abstract:Image inpainting is a technique used to restore missing or damaged regions of an image. Traditional methods primarily utilize information from adjacent pixels for reconstructing missing areas, while they struggle to preserve complex details and structures. Simultaneously, models based on deep learning necessitate substantial amounts of training data. To address this challenge, an encoding strategy-inspired diffusion model with few-shot learning for color image inpainting is proposed in this paper. The main idea of this novel encoding strategy is the deployment of a "virtual mask" to construct high-dimensional objects through mutual perturbations between channels. This approach enables the diffusion model to capture diverse image representations and detailed features from limited training samples. Moreover, the encoding strategy leverages redundancy between channels, integrates with low-rank methods during iterative inpainting, and incorporates the diffusion model to achieve accurate information output. Experimental results indicate that our method exceeds current techniques in quantitative metrics, and the reconstructed images quality has been improved in aspects of texture and structural integrity, leading to more precise and coherent results.
30. 【2504.17522】owards One-Stage End-to-End Table Structure Recognition with Parallel Regression for Diverse Scenarios
链接:https://arxiv.org/abs/2504.17522
作者:Anyi Xiao,Cihui Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:structure recognition aims, machine-understandable formats, recognition aims, unstructured data, data into machine-understandable
备注:
点击查看摘要
Abstract:Table structure recognition aims to parse tables in unstructured data into machine-understandable formats. Recent methods address this problem through a two-stage process or optimized one-stage approaches. However, these methods either require multiple networks to be serially trained and perform more time-consuming sequential decoding, or rely on complex post-processing algorithms to parse the logical structure of tables. They struggle to balance cross-scenario adaptability, robustness, and computational efficiency. In this paper, we propose a one-stage end-to-end table structure parsing network called TableCenterNet. This network unifies the prediction of table spatial and logical structure into a parallel regression task for the first time, and implicitly learns the spatial-logical location mapping laws of cells through a synergistic architecture of shared feature extraction layers and task-specific decoding. Compared with two-stage methods, our method is easier to train and faster to infer. Experiments on benchmark datasets show that TableCenterNet can effectively parse table structures in diverse scenarios and achieve state-of-the-art performance on the TableGraph-24k dataset. Code is available at this https URL.
31. 【2504.17515】Mamba-Sea: A Mamba-based Framework with Global-to-Local Sequence Augmentation for Generalizable Medical Image Segmentation
链接:https://arxiv.org/abs/2504.17515
作者:Zihan Cheng,Jintao Guo,Jian Zhang,Lei Qi,Luping Zhou,Yinghuan Shi,Yang Gao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:medical image segmentation, unseen target domains, segment medical images, medical image, image segmentation
备注: Accepted by IEEE TMI 2025. The code is available at [this https URL](https://github.com/orange-czh/Mamba-Sea)
点击查看摘要
Abstract:To segment medical images with distribution shifts, domain generalization (DG) has emerged as a promising setting to train models on source domains that can generalize to unseen target domains. Existing DG methods are mainly based on CNN or ViT architectures. Recently, advanced state space models, represented by Mamba, have shown promising results in various supervised medical image segmentation. The success of Mamba is primarily owing to its ability to capture long-range dependencies while keeping linear complexity with input sequence length, making it a promising alternative to CNNs and ViTs. Inspired by the success, in the paper, we explore the potential of the Mamba architecture to address distribution shifts in DG for medical image segmentation. Specifically, we propose a novel Mamba-based framework, Mamba-Sea, incorporating global-to-local sequence augmentation to improve the model's generalizability under domain shift issues. Our Mamba-Sea introduces a global augmentation mechanism designed to simulate potential variations in appearance across different sites, aiming to suppress the model's learning of domain-specific information. At the local level, we propose a sequence-wise augmentation along input sequences, which perturbs the style of tokens within random continuous sub-sequences by modeling and resampling style statistics associated with domain shifts. To our best knowledge, Mamba-Sea is the first work to explore the generalization of Mamba for medical image segmentation, providing an advanced and promising Mamba-based architecture with strong robustness to domain shifts. Remarkably, our proposed method is the first to surpass a Dice coefficient of 90% on the Prostate dataset, which exceeds previous SOTA of 88.61%. The code is available at this https URL.
32. 【2504.17502】RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
链接:https://arxiv.org/abs/2504.17502
作者:Aviv Slobodkin,Hagai Taitelbaum,Yonatan Bitton,Brian Gordon,Michal Sokolik,Nitzan Bitton Guetta,Almog Gueta,Royi Rassin,Itay Laish,Dani Lischinski,Idan Szpektor
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:aims to produce, preserving the visual, visual identity, Subject-driven, referenced subject image
备注:
点击查看摘要
Abstract:Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability -- ranging from enhanced personalization in image generation to consistent character representation in video rendering -- progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., \emph{Animal}, \emph{Object}), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87\% accuracy.
33. 【2504.17474】Enhanced Sample Selection with Confidence Tracking: Identifying Correctly Labeled yet Hard-to-Learn Samples in Noisy Data
链接:https://arxiv.org/abs/2504.17474
作者:Weiran Pan,Wei Wei,Feida Zhu,Yong Deng
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:correctly labeled, correctly labeled samples, labeled, image classification, correctly
备注:
点击查看摘要
Abstract:We propose a novel sample selection method for image classification in the presence of noisy labels. Existing methods typically consider small-loss samples as correctly labeled. However, some correctly labeled samples are inherently difficult for the model to learn and can exhibit high loss similar to mislabeled samples in the early stages of training. Consequently, setting a threshold on per-sample loss to select correct labels results in a trade-off between precision and recall in sample selection: a lower threshold may miss many correctly labeled hard-to-learn samples (low recall), while a higher threshold may include many mislabeled samples (low precision). To address this issue, our goal is to accurately distinguish correctly labeled yet hard-to-learn samples from mislabeled ones, thus alleviating the trade-off dilemma. We achieve this by considering the trends in model prediction confidence rather than relying solely on loss values. Empirical observations show that only for correctly labeled samples, the model's prediction confidence for the annotated labels typically increases faster than for any other classes. Based on this insight, we propose tracking the confidence gaps between the annotated labels and other classes during training and evaluating their trends using the Mann-Kendall Test. A sample is considered potentially correctly labeled if all its confidence gaps tend to increase. Our method functions as a plug-and-play component that can be seamlessly integrated into existing sample selection techniques. Experiments on several standard benchmarks and real-world datasets demonstrate that our method enhances the performance of existing methods for learning with noisy labels.
34. 【2504.17457】Unveiling Hidden Vulnerabilities in Digital Human Generation via Adversarial Attacks
链接:https://arxiv.org/abs/2504.17457
作者:Zhiying Li,Yeying Jin,Fan Shen,Zhi Liu,Weibin Chen,Pengju Zhang,Xiaomei Zhang,Boyu Chen,Michael Shen,Kejian Wu,Zhaoxin Fan,Jin Dong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Expressive human pose, digital human generation, Expressive human, live streaming, pose and shape
备注: 14 pages, 7 figures
点击查看摘要
Abstract:Expressive human pose and shape estimation (EHPS) is crucial for digital human generation, especially in applications like live streaming. While existing research primarily focuses on reducing estimation errors, it largely neglects robustness and security aspects, leaving these systems vulnerable to adversarial attacks. To address this significant challenge, we propose the \textbf{Tangible Attack (TBA)}, a novel framework designed to generate adversarial examples capable of effectively compromising any digital human generation model. Our approach introduces a \textbf{Dual Heterogeneous Noise Generator (DHNG)}, which leverages Variational Autoencoders (VAE) and ControlNet to produce diverse, targeted noise tailored to the original image features. Additionally, we design a custom \textbf{adversarial loss function} to optimize the noise, ensuring both high controllability and potent disruption. By iteratively refining the adversarial sample through multi-gradient signals from both the noise and the state-of-the-art EHPS model, TBA substantially improves the effectiveness of adversarial attacks. Extensive experiments demonstrate TBA's superiority, achieving a remarkable 41.0\% increase in estimation error, with an average improvement of approximately 17.0\%. These findings expose significant security vulnerabilities in current EHPS models and highlight the need for stronger defenses in digital human generation systems.
35. 【2504.17447】FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
链接:https://arxiv.org/abs/2504.17447
作者:De-An Huang,Subhashree Radhakrishnan,Zhiding Yu,Jan Kautz
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large Multimodal Models, Large Multimodal, progress in Large, Multimodal Models, long
备注:
点击查看摘要
Abstract:There has been impressive progress in Large Multimodal Models (LMMs). Recent works extend these models to long inputs, including multi-page documents and long videos. However, the model size and performance of these long context models are still limited due to the computational cost in both training and inference. In this work, we explore an orthogonal direction and process long inputs without long context LMMs. We propose Frame Selection Augmented Generation (FRAG), where the model first selects relevant frames within the input, and then only generates the final outputs based on the selected frames. The core of the selection process is done by scoring each frame independently, which does not require long context processing. The frames with the highest scores are then selected by a simple Top-K selection. We show that this frustratingly simple framework is applicable to both long videos and multi-page documents using existing LMMs without any fine-tuning. We consider two models, LLaVA-OneVision and InternVL2, in our experiments and show that FRAG consistently improves the performance and achieves state-of-the-art performances for both long video and long document understanding. For videos, FRAG substantially improves InternVL2-76B by 5.8% on MLVU and 3.7% on Video-MME. For documents, FRAG achieves over 20% improvements on MP-DocVQA compared with recent LMMs specialized in long document understanding. Code is available at: this https URL
36. 【2504.17441】Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding
链接:https://arxiv.org/abs/2504.17441
作者:Mingxuan Wu,Huang Huang,Justin Kerr,Chung Min Kim,Anthony Zhang,Brent Yi,Angjoo Kanazawa
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Humans can resort, inspection to build, build intuition, Humans, POD
备注: See our website at: [this https URL](https://predict-optimize-distill.github.io/pod.github.io) First two authors contributed equally
点击查看摘要
Abstract:Humans can resort to long-form inspection to build intuition on predicting the 3D configurations of unseen objects. The more we observe the object motion, the better we get at predicting its 3D state immediately. Existing systems either optimize underlying representations from multi-view observations or train a feed-forward predictor from supervised datasets. We introduce Predict-Optimize-Distill (POD), a self-improving framework that interleaves prediction and optimization in a mutually reinforcing cycle to achieve better 4D object understanding with increasing observation time. Given a multi-view object scan and a long-form monocular video of human-object interaction, POD iteratively trains a neural network to predict local part poses from RGB frames, uses this predictor to initialize a global optimization which refines output poses through inverse rendering, then finally distills the results of optimization back into the model by generating synthetic self-labeled training data from novel viewpoints. Each iteration improves both the predictive model and the optimized motion trajectory, creating a virtuous cycle that bootstraps its own training data to learn about the pose configurations of an object. We also introduce a quasi-multiview mining strategy for reducing depth ambiguity by leveraging long video. We evaluate POD on 14 real-world and 5 synthetic objects with various joint types, including revolute and prismatic joints as well as multi-body configurations where parts detach or reattach independently. POD demonstrates significant improvement over a pure optimization baseline which gets stuck in local minima, particularly for longer videos. We also find that POD's performance improves with both video length and successive iterations of the self-improving cycle, highlighting its ability to scale performance with additional observations and looped refinement.
37. 【2504.17432】Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
链接:https://arxiv.org/abs/2504.17432
作者:Tiancheng Gu,Kaicheng Yang,Ziyong Feng,Xingjun Wang,Yanzhao Zhang,Dingkun Long,Yingda Chen,Weidong Cai,Jiankang Deng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Contrastive Language-Image Pre-training, Language-Image Pre-training, Contrastive Language-Image, Universal Multimodal Embedding, Multimodal Large Language
备注: 13 pages, 8 figures, Project page: [this https URL](https://garygutc.github.io/UniME)
点击查看摘要
Abstract:The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains this http URL this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLMś language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.
38. 【2504.17414】3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models
链接:https://arxiv.org/abs/2504.17414
作者:Min Wei,Chaohui Yu,Jingkai Zhou,Fan Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Video try-on replaces, try-on replaces clothing, Video, temporally consistent, Video try-on
备注: Project page: [this https URL](https://2y7c3.github.io/3DV-TON/)
点击查看摘要
Abstract:Video try-on replaces clothing in videos with target garments. Existing methods struggle to generate high-quality and temporally consistent results when handling complex clothing patterns and diverse body poses. We present 3DV-TON, a novel diffusion-based framework for generating high-fidelity and temporally consistent video try-on results. Our approach employs generated animatable textured 3D meshes as explicit frame-level guidance, alleviating the issue of models over-focusing on appearance fidelity at the expanse of motion coherence. This is achieved by enabling direct reference to consistent garment texture movements throughout video sequences. The proposed method features an adaptive pipeline for generating dynamic 3D guidance: (1) selecting a keyframe for initial 2D image try-on, followed by (2) reconstructing and animating a textured 3D mesh synchronized with original video poses. We further introduce a robust rectangular masking strategy that successfully mitigates artifact propagation caused by leaking clothing information during dynamic human and garment movements. To advance video try-on research, we introduce HR-VVT, a high-resolution benchmark dataset containing 130 videos with diverse clothing types and scenarios. Quantitative and qualitative results demonstrate our superior performance over existing methods. The project page is at this link this https URL
39. 【2504.17401】StereoMamba: Real-time and Robust Intraoperative Stereo Disparity Estimation via Long-range Spatial Dependencies
链接:https://arxiv.org/abs/2504.17401
作者:Xu Wang,Jialang Xu,Shuai Zhang,Baoru Huang,Danail Stoyanov,Evangelos B. Mazomenos
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:minimally invasive surgery, robot-assisted minimally invasive, obtaining depth information, Stereo disparity estimation, invasive surgery
备注:
点击查看摘要
Abstract:Stereo disparity estimation is crucial for obtaining depth information in robot-assisted minimally invasive surgery (RAMIS). While current deep learning methods have made significant advancements, challenges remain in achieving an optimal balance between accuracy, robustness, and inference speed. To address these challenges, we propose the StereoMamba architecture, which is specifically designed for stereo disparity estimation in RAMIS. Our approach is based on a novel Feature Extraction Mamba (FE-Mamba) module, which enhances long-range spatial dependencies both within and across stereo images. To effectively integrate multi-scale features from FE-Mamba, we then introduce a novel Multidimensional Feature Fusion (MFF) module. Experiments against the state-of-the-art on the ex-vivo SCARED benchmark demonstrate that StereoMamba achieves superior performance on EPE of 2.64 px and depth MAE of 2.55 mm, the second-best performance on Bad2 of 41.49% and Bad3 of 26.99%, while maintaining an inference speed of 21.28 FPS for a pair of high-resolution images (1280*1024), striking the optimum balance between accuracy, robustness, and efficiency. Furthermore, by comparing synthesized right images, generated from warping left images using the generated disparity maps, with the actual right image, StereoMamba achieves the best average SSIM (0.8970) and PSNR (16.0761), exhibiting strong zero-shot generalization on the in-vivo RIS2017 and StereoMIS datasets.
40. 【2504.17399】S2S-Net: Addressing the Domain Gap of Heterogeneous Sensor Systems in LiDAR-Based Collective Perception
链接:https://arxiv.org/abs/2504.17399
作者:Sven Teufel,Jörg Gamerdinger,Oliver Bringmann
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Collective Perception, realize collective perception, autonomous driving, promising approach, approach to overcome
备注:
点击查看摘要
Abstract:Collective Perception (CP) has emerged as a promising approach to overcome the limitations of individual perception in the context of autonomous driving. Various approaches have been proposed to realize collective perception; however, the Sensor2Sensor domain gap that arises from the utilization of different sensor systems in Connected and Automated Vehicles (CAVs) remains mostly unaddressed. This is primarily due to the paucity of datasets containing heterogeneous sensor setups among the CAVs. The recently released SCOPE datasets address this issue by providing data from three different LiDAR sensors for each CAV. This study is the first to tackle the Sensor2Sensor domain gap in vehicle to vehicle (V2V) collective perception. First, we present our sensor-domain robust architecture S2S-Net. Then an in-depth analysis of the Sensor2Sensor domain adaptation capabilities of S2S-Net on the SCOPE dataset is conducted. S2S-Net demonstrates the capability to maintain very high performance in unseen sensor domains and achieved state-of-the-art results on the SCOPE dataset.
41. 【2504.17397】Fine-tune Smarter, Not Harder: Parameter-Efficient Fine-Tuning for Geospatial Foundation Models
链接:https://arxiv.org/abs/2504.17397
作者:Francesc Marti-Escofet,Benedikt Blumenstiel,Linus Scheibenreif,Paolo Fraccaro,Konrad Schindler
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Earth observation, managing natural resources, responding to disasters, crucial for monitoring, monitoring environmental
备注: Code available at [this https URL](https://github.com/IBM/peft-geofm)
点击查看摘要
Abstract:Earth observation (EO) is crucial for monitoring environmental changes, responding to disasters, and managing natural resources. In this context, foundation models facilitate remote sensing image analysis to retrieve relevant geoinformation accurately and efficiently. However, as these models grow in size, fine-tuning becomes increasingly challenging due to the associated computational resources and costs, limiting their accessibility and scalability. Furthermore, full fine-tuning can lead to forgetting pre-trained features and even degrade model generalization. To address this, Parameter-Efficient Fine-Tuning (PEFT) techniques offer a promising solution. In this paper, we conduct extensive experiments with various foundation model architectures and PEFT techniques to evaluate their effectiveness on five different EO datasets. Our results provide a comprehensive comparison, offering insights into when and how PEFT methods support the adaptation of pre-trained geospatial models. We demonstrate that PEFT techniques match or even exceed full fine-tuning performance and enhance model generalisation to unseen geographic regions, while reducing training time and memory requirements. Additional experiments investigate the effect of architecture choices such as the decoder type or the use of metadata, suggesting UNet decoders and fine-tuning without metadata as the recommended configuration. We have integrated all evaluated foundation models and techniques into the open-source package TerraTorch to support quick, scalable, and cost-effective model adaptation.
42. 【2504.17395】SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting
链接:https://arxiv.org/abs/2504.17395
作者:Yiming Zhao,Guorong Li,Laiyun Qing,Amin Beheshti,Jian Yang,Michael Sheng,Yuankai Qi,Qingming Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:pre-trained vision-language models, textual queries, unseen categories, pre-trained vision-language, categories
备注:
点击查看摘要
Abstract:Open-world object counting leverages the robust text-image alignment of pre-trained vision-language models (VLMs) to enable counting of arbitrary categories in images specified by textual queries. However, widely adopted naive fine-tuning strategies concentrate exclusively on text-image consistency for categories contained in training, which leads to limited generalizability for unseen categories. In this work, we propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories with minimal overhead in parameters and inference time. First, we introduce a two-stage visual prompt learning strategy composed of Category-Specific Prompt Initialization (CSPI) and Topology-Guided Prompt Refinement (TGPR). The CSPI generates category-specific visual prompts, and then TGPR distills latent structural patterns from the VLM's text encoder to refine these prompts. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories, facilitating robust text-image alignment for unseen categories. Extensive experiments integrating SDVPT with all available open-world object counting models demonstrate its effectiveness and adaptability across three widely used datasets: FSC-147, CARPK, and PUCPR+.
43. 【2504.17371】Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset
链接:https://arxiv.org/abs/2504.17371
作者:Oussema Dhaouadi,Johannes Meier,Luca Wahl,Jacques Kaiser,Luca Scalerandi,Nick Wandelburg,Zhuolun Zhou,Nijanthan Berinpanathan,Holger Banzhaf,Daniel Cremers
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:data is crucial, crucial for advancing, advancing autonomous driving, Accurate, Dataset
备注:
点击查看摘要
Abstract:Accurate 3D trajectory data is crucial for advancing autonomous driving. Yet, traditional datasets are usually captured by fixed sensors mounted on a car and are susceptible to occlusion. Additionally, such an approach can precisely reconstruct the dynamic environment in the close vicinity of the measurement vehicle only, while neglecting objects that are further away. In this paper, we introduce the DeepScenario Open 3D Dataset (DSC3D), a high-quality, occlusion-free dataset of 6 degrees of freedom bounding box trajectories acquired through a novel monocular camera drone tracking pipeline. Our dataset includes more than 175,000 trajectories of 14 types of traffic participants and significantly exceeds existing datasets in terms of diversity and scale, containing many unprecedented scenarios such as complex vehicle-pedestrian interaction on highly populated urban streets and comprehensive parking maneuvers from entry to exit. DSC3D dataset was captured in five various locations in Europe and the United States and include: a parking lot, a crowded inner-city, a steep urban intersection, a federal highway, and a suburban intersection. Our 3D trajectory dataset aims to enhance autonomous driving systems by providing detailed environmental 3D representations, which could lead to improved obstacle interactions and safety. We demonstrate its utility across multiple applications including motion prediction, motion planning, scenario mining, and generative reactive traffic agents. Our interactive online visualization platform and the complete dataset are publicly available at this http URL, facilitating research in motion prediction, behavior modeling, and safety validation.
44. 【2504.17365】meSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation
链接:https://arxiv.org/abs/2504.17365
作者:Ling You,Wenxuan Huang,Xinni Xie,Xiangyi Wei,Bangyan Li,Shaohui Lin,Yang Li,Changbo Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:popular sporting event, distinctive highlight moments, globally popular sporting, Multimodal Large Language, Large Language Models
备注:
点击查看摘要
Abstract:Soccer is a globally popular sporting event, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding, soccer commentary generation often requires precise temporal localization and semantically rich descriptions over long-form video. However, existing soccer MLLMs often rely on the temporal a priori for caption generation, so they cannot process the soccer video end-to-end. While some traditional approaches follow a two-step paradigm that is complex and fails to capture the global context to achieve suboptimal performance. To solve the above issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and incorporates complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that our TimeSoccer achieves State-of-The-Art (SoTA) performance on the SDVC task in an end-to-end form, generating high-quality commentary with accurate temporal alignment and strong semantic relevance.
45. 【2504.17364】I-INR: Iterative Implicit Neural Representations
链接:https://arxiv.org/abs/2504.17364
作者:Ali Haider,Muhammad Salman Ali,Maryam Qamar,Tahir Khalil,Soo Ye Kim,Jihyong Oh,Enzo Tartaglione,Sung-Ho Bae
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Implicit Neural Representations, differentiable functions parameterized, Neural Representations, Iterative Implicit Neural, revolutionized signal processing
备注:
点击查看摘要
Abstract:Implicit Neural Representations (INRs) have revolutionized signal processing and computer vision by modeling signals as continuous, differentiable functions parameterized by neural networks. However, their inherent formulation as a regression problem makes them prone to regression to the mean, limiting their ability to capture fine details, retain high-frequency information, and handle noise effectively. To address these challenges, we propose Iterative Implicit Neural Representations (I-INRs) a novel plug-and-play framework that enhances signal reconstruction through an iterative refinement process. I-INRs effectively recover high-frequency details, improve robustness to noise, and achieve superior reconstruction quality. Our framework seamlessly integrates with existing INR architectures, delivering substantial performance gains across various tasks. Extensive experiments show that I-INRs outperform baseline methods, including WIRE, SIREN, and Gauss, in diverse computer vision applications such as image restoration, image denoising, and object occupancy prediction.
46. 【2504.17353】M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction
链接:https://arxiv.org/abs/2504.17353
作者:Chengguang Gan,Sunbowen Lee,Zhixi Cai,Yanbin Wei,Lei Zheng,Yunhao Liang,Shiwen Ni,Tatsunori Mori
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Mutual Reinforcement Effect, Reinforcement Effect, Mutual Reinforcement, MRE, Multimodal Mutual Reinforcement
备注:
点击查看摘要
Abstract:Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection of information extraction and model interpretability. MRE aims to leverage the mutual understanding between tasks of different granularities, enhancing the performance of both coarse-grained and fine-grained tasks through joint modeling. While MRE has been explored and validated in the textual domain, its applicability to visual and multimodal domains remains unexplored. In this work, we extend MRE to the multimodal information extraction domain for the first time. Specifically, we introduce a new task: Multimodal Mutual Reinforcement Effect (M-MRE), and construct a corresponding dataset to support this task. To address the challenges posed by M-MRE, we further propose a Prompt Format Adapter (PFA) that is fully compatible with various Large Vision-Language Models (LVLMs). Experimental results demonstrate that MRE can also be observed in the M-MRE task, a multimodal text-image understanding scenario. This provides strong evidence that MRE facilitates mutual gains across three interrelated tasks, confirming its generalizability beyond the textual domain.
47. 【2504.17349】DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition
链接:https://arxiv.org/abs/2504.17349
作者:Yiyan Xu,Wuqiang Zheng,Wenjie Wang,Fengbin Zhu,Xinting Hu,Yang Zhang,Fuli Feng,Tat-Seng Chua
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:multimodal content creation, Personalized image generation, Large Multimodal Models, content creation, promising direction
备注:
点击查看摘要
Abstract:Personalized image generation has emerged as a promising direction in multimodal content creation. It aims to synthesize images tailored to individual style preferences (e.g., color schemes, character appearances, layout) and semantic intentions (e.g., emotion, action, scene contexts) by leveraging user-interacted history images and multimodal instructions. Despite notable progress, existing methods -- whether based on diffusion models, large language models, or Large Multimodal Models (LMMs) -- struggle to accurately capture and fuse user style preferences and semantic intentions. In particular, the state-of-the-art LMM-based method suffers from the entanglement of visual features, leading to Guidance Collapse, where the generated images fail to preserve user-preferred styles or reflect the specified semantics. To address these limitations, we introduce DRC, a novel personalized image generation framework that enhances LMMs through Disentangled Representation Composition. DRC explicitly extracts user style preferences and semantic intentions from history images and the reference image, respectively, to form user-specific latent instructions that guide image generation within LMMs. Specifically, it involves two critical learning stages: 1) Disentanglement learning, which employs a dual-tower disentangler to explicitly separate style and semantic features, optimized via a reconstruction-driven paradigm with difficulty-aware importance sampling; and 2) Personalized modeling, which applies semantic-preserving augmentations to effectively adapt the disentangled representations for robust personalized generation. Extensive experiments on two benchmarks demonstrate that DRC shows competitive performance while effectively mitigating the guidance collapse issue, underscoring the importance of disentangled representation learning for controllable and effective personalized image generation.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Cite as:
arXiv:2504.17349 [cs.CV]
(or
arXiv:2504.17349v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2504.17349
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
48. 【2504.17343】meChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
链接:https://arxiv.org/abs/2504.17343
作者:Linli Yao,Yicheng Li,Yuancheng Wei,Lei Li,Shuhuai Ren,Yuanxin Liu,Kun Ouyang,Lean Wang,Shicheng Li,Sida Li,Lingpeng Kong,Qi Liu,Yuanxing Zhang,Xu Sun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:live streaming services, Large Language Models, video understanding systems, rapid growth, created an urgent
备注:
点击查看摘要
Abstract:The rapid growth of online video platforms, particularly live streaming services, has created an urgent need for real-time video understanding systems. These systems must process continuous video streams and respond to user queries instantaneously, presenting unique challenges for current Video Large Language Models (VideoLLMs). While existing VideoLLMs excel at processing complete videos, they face significant limitations in streaming scenarios due to their inability to handle dense, redundant frames efficiently. We introduce TimeChat-Online, a novel online VideoLLM that revolutionizes real-time video interaction. At its core lies our innovative Differential Token Drop (DTD) module, which addresses the fundamental challenge of visual redundancy in streaming videos. Drawing inspiration from human visual perception's Change Blindness phenomenon, DTD preserves meaningful temporal changes while filtering out static, redundant content between frames. Remarkably, our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench, revealing that over 80% of visual content in streaming videos is naturally redundant without requiring language guidance. To enable seamless real-time interaction, we present TimeChat-Online-139K, a comprehensive streaming video dataset featuring diverse interaction patterns including backward-tracing, current-perception, and future-responding scenarios. TimeChat-Online's unique Proactive Response capability, naturally achieved through continuous monitoring of video scene transitions via DTD, sets it apart from conventional approaches. Our extensive evaluation demonstrates TimeChat-Online's superior performance on streaming benchmarks (StreamingBench and OvOBench) and maintaining competitive results on long-form video tasks such as Video-MME and MLVU.
49. 【2504.17315】DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model
链接:https://arxiv.org/abs/2504.17315
作者:Zhanglin Wu,Tengfei Song,Ning Xie,Weidong Zhang,Pengfei Li,Shuang Wu,Chong Li,Junhao Zhu,Hao Yang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Translation Service Center, Huawei Translation Service, Service Center, Complex Layouts, International Conference
备注: 7 pages, 1 figures, 2 tables
点击查看摘要
Abstract:This paper presents the technical solution proposed by Huawei Translation Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation for Complex Layouts" competition at the 19th International Conference on Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging state-of-the-art open-source large vision-language model (LVLM), we introduce a training framework that combines multi-task learning with perceptual chain-of-thought to develop a comprehensive end-to-end document translation system. During the inference phase, we apply minimum Bayesian decoding and post-processing strategies to further enhance the system's translation capabilities. Our solution uniquely addresses both OCR-based and OCR-free document image translation tasks within a unified framework. This paper systematically details the training methods, inference strategies, LVLM base models, training data, experimental setups, and results, demonstrating an effective approach to document image machine translation.
50. 【2504.17314】Class-Conditional Distribution Balancing for Group Robust Classification
链接:https://arxiv.org/abs/2504.17314
作者:Miaoyun Zhao,Qiang Zhang,Chenrong Li
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:wrong reasons pose, robust real-world generalization, real-world generalization, wrong reasons, reasons pose
备注:
点击查看摘要
Abstract:Spurious correlations that lead models to correct predictions for the wrong reasons pose a critical challenge for robust real-world generalization. Existing research attributes this issue to group imbalance and addresses it by maximizing group-balanced or worst-group accuracy, which heavily relies on expensive bias annotations. A compromise approach involves predicting bias information using extensively pretrained foundation models, which requires large-scale data and becomes impractical for resource-limited rare domains. To address these challenges, we offer a novel perspective by reframing the spurious correlations as imbalances or mismatches in class-conditional distributions, and propose a simple yet effective robust learning method that eliminates the need for both bias annotations and predictions. With the goal of reducing the mutual information between spurious factors and label information, our method leverages a sample reweighting strategy to achieve class-conditional distribution balancing, which automatically highlights minority groups and classes, effectively dismantling spurious correlations and producing a debiased data distribution for classification. Extensive experiments and analysis demonstrate that our approach consistently delivers state-of-the-art performance, rivaling methods that rely on bias supervision.
51. 【2504.17306】Advanced Segmentation of Diabetic Retinopathy Lesions Using DeepLabv3+
链接:https://arxiv.org/abs/2504.17306
作者:Meher Boulaabi,Takwa Ben Aïcha Gader,Afef Kacem Echi,Sameh Mbarek
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:binary segmentation method, soft exudates, diabetic retinopathy lesions, implemented a binary, segmentation method specific
备注: This work was accepted at the ACS/IEEE International Conference on Computer Systems and Applications (AICCSA) 2024
点击查看摘要
Abstract:To improve the segmentation of diabetic retinopathy lesions (microaneurysms, hemorrhages, exudates, and soft exudates), we implemented a binary segmentation method specific to each type of lesion. As post-segmentation, we combined the individual model outputs into a single image to better analyze the lesion types. This approach facilitated parameter optimization and improved accuracy, effectively overcoming challenges related to dataset limitations and annotation complexity. Specific preprocessing steps included cropping and applying contrast-limited adaptive histogram equalization to the L channel of the LAB image. Additionally, we employed targeted data augmentation techniques to further refine the model's efficacy. Our methodology utilized the DeepLabv3+ model, achieving a segmentation accuracy of 99%. These findings highlight the efficacy of innovative strategies in advancing medical image analysis, particularly in the precise segmentation of diabetic retinopathy lesions. The IDRID dataset was utilized to validate and demonstrate the robustness of our approach.
52. 【2504.17280】EdgePoint2: Compact Descriptors for Superior Efficiency and Accuracy
链接:https://arxiv.org/abs/2504.17280
作者:Haodi Yao,Fenghua He,Ning Hao,Chen Xie
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Structure from Motion, Localization and Mapping, Simultaneous Localization, deep learning techniques, leveraging deep learning
备注:
点击查看摘要
Abstract:The field of keypoint extraction, which is essential for vision applications like Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM), has evolved from relying on handcrafted methods to leveraging deep learning techniques. While deep learning approaches have significantly improved performance, they often incur substantial computational costs, limiting their deployment in real-time edge applications. Efforts to create lightweight neural networks have seen some success, yet they often result in trade-offs between efficiency and accuracy. Additionally, the high-dimensional descriptors generated by these networks poses challenges for distributed applications requiring efficient communication and coordination, highlighting the need for compact yet competitively accurate descriptors. In this paper, we present EdgePoint2, a series of lightweight keypoint detection and description neural networks specifically tailored for edge computing applications on embedded system. The network architecture is optimized for efficiency without sacrificing accuracy. To train compact descriptors, we introduce a combination of Orthogonal Procrustes loss and similarity loss, which can serve as a general approach for hypersphere embedding distillation tasks. Additionally, we offer 14 sub-models to satisfy diverse application requirements. Our experiments demonstrate that EdgePoint2 consistently achieves state-of-the-art (SOTA) accuracy and efficiency across various challenging scenarios while employing lower-dimensional descriptors (32/48/64). Beyond its accuracy, EdgePoint2 offers significant advantages in flexibility, robustness, and versatility. Consequently, EdgePoint2 emerges as a highly competitive option for visual tasks, especially in contexts demanding adaptability to diverse computational and communication constraints.
53. 【2504.17269】owards Generalized and Training-Free Text-Guided Semantic Manipulation
链接:https://arxiv.org/abs/2504.17269
作者:Yu Hong,Xiao Cai,Pengpeng Zeng,Shuai Zhang,Jingkuan Song,Lianli Gao,Heng Tao Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:preserving irrelevant contents, Text-guided semantic manipulation, semantic manipulation refers, source prompt, target prompt
备注:
点击查看摘要
Abstract:Text-guided semantic manipulation refers to semantically editing an image generated from a source prompt to match a target prompt, enabling the desired semantic changes (e.g., addition, removal, and style transfer) while preserving irrelevant contents. With the powerful generative capabilities of the diffusion model, the task has shown the potential to generate high-fidelity visual content. Nevertheless, existing methods either typically require time-consuming fine-tuning (inefficient), fail to accomplish multiple semantic manipulations (poorly extensible), and/or lack support for different modality tasks (limited generalizability). Upon further investigation, we find that the geometric properties of noises in the diffusion model are strongly correlated with the semantic changes. Motivated by this, we propose a novel $\textit{GTF}$ for text-guided semantic manipulation, which has the following attractive capabilities: 1) $\textbf{Generalized}$: our $\textit{GTF}$ supports multiple semantic manipulations (e.g., addition, removal, and style transfer) and can be seamlessly integrated into all diffusion-based methods (i.e., Plug-and-play) across different modalities (i.e., modality-agnostic); and 2) $\textbf{Training-free}$: $\textit{GTF}$ produces high-fidelity results via simply controlling the geometric relationship between noises without tuning or optimization. Our extensive experiments demonstrate the efficacy of our approach, highlighting its potential to advance the state-of-the-art in semantics manipulation.
54. 【2504.17263】Precision Neural Network Quantization via Learnable Adaptive Modules
链接:https://arxiv.org/abs/2504.17263
作者:Wenqiang Zhou,Zhendong Yu,Xinyu Liu,Jiaming Yang,Rong Xiao,Tao Wang,Chenwei Tang,Jiancheng Lv
类目:Computer Vision and Pattern Recognition (cs.CV); Computational Complexity (cs.CC)
关键词:Quantization Aware Training, Aware Training, compresses model size, Quantization Aware, network quantization technique
备注:
点击查看摘要
Abstract:Quantization Aware Training (QAT) is a neural network quantization technique that compresses model size and improves operational efficiency while effectively maintaining model performance. The paradigm of QAT is to introduce fake quantization operators during the training process, allowing the model to autonomously compensate for information loss caused by quantization. Making quantization parameters trainable can significantly improve the performance of QAT, but at the cost of compromising the flexibility during inference, especially when dealing with activation values with substantially different distributions. In this paper, we propose an effective learnable adaptive neural network quantization method, called Adaptive Step Size Quantization (ASQ), to resolve this conflict. Specifically, the proposed ASQ method first dynamically adjusts quantization scaling factors through a trained module capable of accommodating different activations. Then, to address the rigid resolution issue inherent in Power of Two (POT) quantization, we propose an efficient non-uniform quantization scheme. We utilize the Power Of Square root of Two (POST) as the basis for exponential quantization, effectively handling the bell-shaped distribution of neural network weights across various bit-widths while maintaining computational efficiency through a Look-Up Table method (LUT). Extensive experimental results demonstrate that the proposed ASQ method is superior to the state-of-the-art QAT approaches. Notably that the ASQ is even competitive compared to full precision baselines, with its 4-bit quantized ResNet34 model improving accuracy by 1.2\% on ImageNet.
55. 【2504.17258】Group Downsampling with Equivariant Anti-aliasing
链接:https://arxiv.org/abs/2504.17258
作者:Md Ashiqur Rahman,Raymond A. Yeh
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Group Theory (math.GR)
关键词:crucial building blocks, learning high-level features, blocks in CNN, CNN architectures, amount of memory
备注:
点击查看摘要
Abstract:Downsampling layers are crucial building blocks in CNN architectures, which help to increase the receptive field for learning high-level features and reduce the amount of memory/computation in the model. In this work, we study the generalization of the uniform downsampling layer for group equivariant architectures, e.g., G-CNNs. That is, we aim to downsample signals (feature maps) on general finite groups with anti-aliasing. This involves the following: (a) Given a finite group and a downsampling rate, we present an algorithm to form a suitable choice of subgroup. (b) Given a group and a subgroup, we study the notion of bandlimited-ness and propose how to perform anti-aliasing. Notably, our method generalizes the notion of downsampling based on classical sampling theory. When the signal is on a cyclic group, i.e., periodic, our method recovers the standard downsampling of an ideal low-pass filter followed by a subsampling operation. Finally, we conducted experiments on image classification tasks demonstrating that the proposed downsampling operation improves accuracy, better preserves equivariance, and reduces model size when incorporated into G-equivariant networks
56. 【2504.17253】DIVE: Inverting Conditional Diffusion Models for Discriminative Tasks
链接:https://arxiv.org/abs/2504.17253
作者:Yinqi Li,Hong Chang,Ruibing Hou,Shiguang Shan,Xilin Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:shown remarkable progress, Diffusion models, video generation, shown remarkable, remarkable progress
备注: Accepted by IEEE Transactions on Multimedia
点击查看摘要
Abstract:Diffusion models have shown remarkable progress in various generative tasks such as image and video generation. This paper studies the problem of leveraging pretrained diffusion models for performing discriminative tasks. Specifically, we extend the discriminative capability of pretrained frozen generative diffusion models from the classification task to the more complex object detection task, by "inverting" a pretrained layout-to-image diffusion model. To this end, a gradient-based discrete optimization approach for replacing the heavy prediction enumeration process, and a prior distribution model for making more accurate use of the Bayes' rule, are proposed respectively. Empirical results show that this method is on par with basic discriminative object detection baselines on COCO dataset. In addition, our method can greatly speed up the previous diffusion-based method for classification without sacrificing accuracy. Code and models are available at this https URL .
57. 【2504.17234】Scene Perceived Image Perceptual Score (SPIPS): combining global and local perception for image quality assessment
链接:https://arxiv.org/abs/2504.17234
作者:Zhiqiang Lao,Heather Yu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:rapid advancement, advancement of artificial, artificial intelligence, intelligence and widespread, smartphones have resulted
备注:
点击查看摘要
Abstract:The rapid advancement of artificial intelligence and widespread use of smartphones have resulted in an exponential growth of image data, both real (camera-captured) and virtual (AI-generated). This surge underscores the critical need for robust image quality assessment (IQA) methods that accurately reflect human visual perception. Traditional IQA techniques primarily rely on spatial features - such as signal-to-noise ratio, local structural distortions, and texture inconsistencies - to identify artifacts. While effective for unprocessed or conventionally altered images, these methods fall short in the context of modern image post-processing powered by deep neural networks (DNNs). The rise of DNN-based models for image generation, enhancement, and restoration has significantly improved visual quality, yet made accurate assessment increasingly complex. To address this, we propose a novel IQA approach that bridges the gap between deep learning methods and human perception. Our model disentangles deep features into high-level semantic information and low-level perceptual details, treating each stream separately. These features are then combined with conventional IQA metrics to provide a more comprehensive evaluation framework. This hybrid design enables the model to assess both global context and intricate image details, better reflecting the human visual process, which first interprets overall structure before attending to fine-grained elements. The final stage employs a multilayer perceptron (MLP) to map the integrated features into a concise quality score. Experimental results demonstrate that our method achieves improved consistency with human perceptual judgments compared to existing IQA models.
58. 【2504.17229】Range Image-Based Implicit Neural Compression for LiDAR Point Clouds
链接:https://arxiv.org/abs/2504.17229
作者:Akihiro Kuwabara,Sorachi Kato,Takuya Fujihashi,Toshiaki Koike-Akino,Takashi Watanabe
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:efficiently compress Light, compress Light Detection, compress Light, scene archives, enabling high-precision
备注:
点击查看摘要
Abstract:This paper presents a novel scheme to efficiently compress Light Detection and Ranging~(LiDAR) point clouds, enabling high-precision 3D scene archives, and such archives pave the way for a detailed understanding of the corresponding 3D scenes. We focus on 2D range images~(RIs) as a lightweight format for representing 3D LiDAR observations. Although conventional image compression techniques can be adapted to improve compression efficiency for RIs, their practical performance is expected to be limited due to differences in bit precision and the distinct pixel value distribution characteristics between natural images and RIs. We propose a novel implicit neural representation~(INR)--based RI compression method that effectively handles floating-point valued pixels. The proposed method divides RIs into depth and mask images and compresses them using patch-wise and pixel-wise INR architectures with model pruning and quantization, respectively. Experiments on the KITTI dataset show that the proposed method outperforms existing image, point cloud, RI, and INR-based compression methods in terms of 3D reconstruction and detection quality at low bitrates and decoding latency.
59. 【2504.17224】Visual and textual prompts for enhancing emotion recognition in video
链接:https://arxiv.org/abs/2504.17224
作者:Zhifeng Wang,Qixuan Zhang,Peter Zhang,Wenjia Niu,Kaihao Zhang,Ramesh Sankaranarayana,Sabrina Caldwell,Tom Gedeon
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Large Language, Large Language Models, Vision Large, exhibit promising potential, Language Models
备注: 12 pages, 10 figures
点击查看摘要
Abstract:Vision Large Language Models (VLLMs) exhibit promising potential for multi-modal understanding, yet their application to video-based emotion recognition remains limited by insufficient spatial and contextual awareness. Traditional approaches, which prioritize isolated facial features, often neglect critical non-verbal cues such as body language, environmental context, and social interactions, leading to reduced robustness in real-world scenarios. To address this gap, we propose Set-of-Vision-Text Prompting (SoVTP), a novel framework that enhances zero-shot emotion recognition by integrating spatial annotations (e.g., bounding boxes, facial landmarks), physiological signals (facial action units), and contextual cues (body posture, scene dynamics, others' emotions) into a unified prompting strategy. SoVTP preserves holistic scene information while enabling fine-grained analysis of facial muscle movements and interpersonal dynamics. Extensive experiments show that SoVTP achieves substantial improvements over existing visual prompting methods, demonstrating its effectiveness in enhancing VLLMs' video emotion recognition capabilities.
60. 【2504.17223】owards Generalizable Deepfake Detection with Spatial-Frequency Collaborative Learning and Hierarchical Cross-Modal Fusion
链接:https://arxiv.org/abs/2504.17223
作者:Mengyu Qiao,Runze Tian,Yang Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:encountering unseen forgeries, suffer significant performance, significant performance degradation, deep generative models, generative models poses
备注:
点击查看摘要
Abstract:The rapid evolution of deep generative models poses a critical challenge to deepfake detection, as detectors trained on forgery-specific artifacts often suffer significant performance degradation when encountering unseen forgeries. While existing methods predominantly rely on spatial domain analysis, frequency domain operations are primarily limited to feature-level augmentation, leaving frequency-native artifacts and spatial-frequency interactions insufficiently exploited. To address this limitation, we propose a novel detection framework that integrates multi-scale spatial-frequency analysis for universal deepfake detection. Our framework comprises three key components: (1) a local spectral feature extraction pipeline that combines block-wise discrete cosine transform with cascaded multi-scale convolutions to capture subtle spectral artifacts; (2) a global spectral feature extraction pipeline utilizing scale-invariant differential accumulation to identify holistic forgery distribution patterns; and (3) a multi-stage cross-modal fusion mechanism that incorporates shallow-layer attention enhancement and deep-layer dynamic modulation to model spatial-frequency interactions. Extensive evaluations on widely adopted benchmarks demonstrate that our method outperforms state-of-the-art deepfake detection methods in both accuracy and generalizability.
61. 【2504.17213】MCAF: Efficient Agent-based Video Understanding Framework through Multimodal Coarse-to-Fine Attention Focusing
链接:https://arxiv.org/abs/2504.17213
作者:Shiwen Cao,Zhaoxing Zhang,Junming Jiao,Juyi Qiao,Guowen Song,Rong Shen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:remains highly challenging, requiring large models, large models, era of rapid, rapid advances
备注:
点击查看摘要
Abstract:Even in the era of rapid advances in large models, video understanding, particularly long videos, remains highly challenging. Compared with textual or image-based information, videos commonly contain more information with redundancy, requiring large models to strategically allocate attention at a global level for accurate comprehension. To address this, we propose MCAF, an agent-based, training-free framework perform video understanding through Multimodal Coarse-to-fine Attention Focusing. The key innovation lies in its ability to sense and prioritize segments of the video that are highly relevant to the understanding task. First, MCAF hierarchically concentrates on highly relevant frames through multimodal information, enhancing the correlation between the acquired contextual information and the query. Second, it employs a dilated temporal expansion mechanism to mitigate the risk of missing crucial details when extracting information from these concentrated frames. In addition, our framework incorporates a self-reflection mechanism utilizing the confidence level of the model's responses as feedback. By iteratively applying these two creative focusing strategies, it adaptively adjusts attention to capture highly query-connected context and thus improves response accuracy. MCAF outperforms comparable state-of-the-art methods on average. On the EgoSchema dataset, it achieves a remarkable 5% performance gain over the leading approach. Meanwhile, on Next-QA and IntentQA datasets, it outperforms the current state-of-the-art standard by 0.2% and 0.3% respectively. On the Video-MME dataset, which features videos averaging nearly an hour in length, MCAF also outperforms other agent-based methods.
62. 【2504.17207】Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
链接:https://arxiv.org/abs/2504.17207
作者:Phillip Y. Lee,Jihyeon Je,Chanho Park,Mikaela Angelina Uy,Leonidas Guibas,Minhyuk Sung
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:mental imagery simulation, perspective-aware reasoning, imagery simulation, reasoning, Abstract Perspective Change
备注: Project Page: [this https URL](https://apc-vlm.github.io/)
点击查看摘要
Abstract:We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking, the ability to perceive an environment or situation from an alternative viewpoint, is a key benchmark for human-level visual understanding, essential for environmental interaction and collaboration with autonomous agents. Despite advancements in spatial reasoning within VLMs, recent research has shown that modern VLMs significantly lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge the gap between VLMs and human perception, we focus on the role of mental imagery, where humans perceive the world through abstracted representations that facilitate perspective shifts. Motivated by this, we propose a framework for perspective-aware reasoning, named Abstract Perspective Change (APC), that effectively leverages vision foundation models, such as object detection, segmentation, and orientation estimation, to construct scene abstractions and enable perspective transformations. Our experiments on synthetic and real-image benchmarks, compared with various VLMs, demonstrate significant improvements in perspective-aware reasoning with our framework, further outperforming fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.
63. 【2504.17180】We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback
链接:https://arxiv.org/abs/2504.17180
作者:Minkyu Choi,S P Sharan,Harsh Goel,Sahil Shah,Sandeep Chinchali
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:increasingly popular due, produce coherent videos, increasingly popular, popular due, ability to produce
备注:
点击查看摘要
Abstract:Current text-to-video (T2V) generation models are increasingly popular due to their ability to produce coherent videos from textual prompts. However, these models often struggle to generate semantically and temporally consistent videos when dealing with longer, more complex prompts involving multiple objects or sequential events. Additionally, the high computational costs associated with training or fine-tuning make direct improvements impractical. To overcome these limitations, we introduce \(\projectname\), a novel zero-training video refinement pipeline that leverages neuro-symbolic feedback to automatically enhance video generation, achieving superior alignment with the prompts. Our approach first derives the neuro-symbolic feedback by analyzing a formal video representation and pinpoints semantically inconsistent events, objects, and their corresponding frames. This feedback then guides targeted edits to the original video. Extensive empirical evaluations on both open-source and proprietary T2V models demonstrate that \(\projectname\) significantly enhances temporal and logical alignment across diverse prompts by almost $40\%$.
64. 【2504.17179】AUTHENTICATION: Identifying Rare Failure Modes in Autonomous Vehicle Perception Systems using Adversarially Guided Diffusion Models
链接:https://arxiv.org/abs/2504.17179
作者:Mohammad Zarei,Melanie A Jutras,Eliana Evans,Mike Tan,Omid Aaramoon
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:Autonomous Vehicles, rely on artificial, artificial intelligence, interpret their surroundings, accurately detect objects
备注: 8 pages, 10 figures. Accepted to IEEE Conference on Artificial Intelligence (CAI), 2025
点击查看摘要
Abstract:Autonomous Vehicles (AVs) rely on artificial intelligence (AI) to accurately detect objects and interpret their surroundings. However, even when trained using millions of miles of real-world data, AVs are often unable to detect rare failure modes (RFMs). The problem of RFMs is commonly referred to as the "long-tail challenge", due to the distribution of data including many instances that are very rarely seen. In this paper, we present a novel approach that utilizes advanced generative and explainable AI techniques to aid in understanding RFMs. Our methods can be used to enhance the robustness and reliability of AVs when combined with both downstream model training and testing. We extract segmentation masks for objects of interest (e.g., cars) and invert them to create environmental masks. These masks, combined with carefully crafted text prompts, are fed into a custom diffusion model. We leverage the Stable Diffusion inpainting model guided by adversarial noise optimization to generate images containing diverse environments designed to evade object detection models and expose vulnerabilities in AI systems. Finally, we produce natural language descriptions of the generated RFMs that can guide developers and policymakers to improve the safety and reliability of AV systems.
65. 【2504.17177】A Genealogy of Multi-Sensor Foundation Models in Remote Sensing
链接:https://arxiv.org/abs/2504.17177
作者:Kevin Lane,Morteza Karimzadeh
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:minimal domain-specific modification, garnered increasing attention, primarily adopting approaches, Foundation models, primarily adopting
备注: 20 pages, submitted to ACM SigSpatial, currently under peer review
点击查看摘要
Abstract:Foundation models have garnered increasing attention for representation learning in remote sensing, primarily adopting approaches that have demonstrated success in computer vision with minimal domain-specific modification. However, the development and application of foundation models in this field are still burgeoning, as there are a variety of competing approaches that each come with significant benefits and drawbacks. This paper examines these approaches along with their roots in the computer vision field in order to characterize potential advantages and pitfalls while outlining future directions to further improve remote sensing-specific foundation models. We discuss the quality of the learned representations and methods to alleviate the need for massive compute resources. We place emphasis on the multi-sensor aspect of Earth observations, and the extent to which existing approaches leverage multiple sensors in training foundation models in relation to multi-modal foundation models. Finally, we identify opportunities for further harnessing the vast amounts of unlabeled, seasonal, and multi-sensor remote sensing observations.
66. 【2504.17163】PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion Recognition
链接:https://arxiv.org/abs/2504.17163
作者:Kai Cui,Jia Li,Yu Liu,Xuesong Zhang,Zhenzhen Hu,Meng Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:offering significant advantages, brain activity related, offering significant, facial expressions, Peripheral Physiological Signals
备注: The source code will be publicly available at [this https URL](https://github.com/MSA-LMC/PhysioSync)
点击查看摘要
Abstract:Electroencephalography (EEG) signals provide a promising and involuntary reflection of brain activity related to emotional states, offering significant advantages over behavioral cues like facial expressions. However, EEG signals are often noisy, affected by artifacts, and vary across individuals, complicating emotion recognition. While multimodal approaches have used Peripheral Physiological Signals (PPS) like GSR to complement EEG, they often overlook the dynamic synchronization and consistent semantics between the modalities. Additionally, the temporal dynamics of emotional fluctuations across different time resolutions in PPS remain underexplored. To address these challenges, we propose PhysioSync, a novel pre-training framework leveraging temporal and cross-modal contrastive learning, inspired by physiological synchronization phenomena. PhysioSync incorporates Cross-Modal Consistency Alignment (CM-CA) to model dynamic relationships between EEG and complementary PPS, enabling emotion-related synchronizations across modalities. Besides, it introduces Long- and Short-Term Temporal Contrastive Learning (LS-TCL) to capture emotional synchronization at different temporal resolutions within modalities. After pre-training, cross-resolution and cross-modal features are hierarchically fused and fine-tuned to enhance emotion recognition. Experiments on DEAP and DREAMER datasets demonstrate PhysioSync's advanced performance under uni-modal and cross-modal conditions, highlighting its effectiveness for EEG-centered emotion recognition.
67. 【2504.17162】A Comprehensive Review on RNA Subcellular Localization Prediction
链接:https://arxiv.org/abs/2504.17162
作者:Cece Zhang,Xuehuan Zhu,Nick Peterson,Jieqiong Wang,Shibiao Wan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Genomics (q-bio.GN); Subcellular Processes (q-bio.SC)
关键词:including long non-coding, RNA subcellular localization, long non-coding RNAs, RNA subcellular, subcellular localization
备注:
点击查看摘要
Abstract:The subcellular localization of RNAs, including long non-coding RNAs (lncRNAs), messenger RNAs (mRNAs), microRNAs (miRNAs) and other smaller RNAs, plays a critical role in determining their biological functions. For instance, lncRNAs are predominantly associated with chromatin and act as regulators of gene transcription and chromatin structure, while mRNAs are distributed across the nucleus and cytoplasm, facilitating the transport of genetic information for protein synthesis. Understanding RNA localization sheds light on processes like gene expression regulation with spatial and temporal precision. However, traditional wet lab methods for determining RNA localization, such as in situ hybridization, are often time-consuming, resource-demanding, and costly. To overcome these challenges, computational methods leveraging artificial intelligence (AI) and machine learning (ML) have emerged as powerful alternatives, enabling large-scale prediction of RNA subcellular localization. This paper provides a comprehensive review of the latest advancements in AI-based approaches for RNA subcellular localization prediction, covering various RNA types and focusing on sequence-based, image-based, and hybrid methodologies that combine both data types. We highlight the potential of these methods to accelerate RNA research, uncover molecular pathways, and guide targeted disease treatments. Furthermore, we critically discuss the challenges in AI/ML approaches for RNA subcellular localization, such as data scarcity and lack of benchmarks, and opportunities to address them. This review aims to serve as a valuable resource for researchers seeking to develop innovative solutions in the field of RNA subcellular localization and beyond.
68. 【2504.17160】OUI Need to Talk About Weight Decay: A New Perspective on Overfitting Detection
链接:https://arxiv.org/abs/2504.17160
作者:Alberto Fernández-Hernández,Jose I. Mestre,Manuel F. Dolz,Jose Duato,Enrique S. Quintana-Ortí
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
关键词:Deep Neural Networks, Neural Networks, Deep Neural, dynamics of Deep, identifying optimal regularization
备注: 10 pages, 3 figures
点击查看摘要
Abstract:We introduce the Overfitting-Underfitting Indicator (OUI), a novel tool for monitoring the training dynamics of Deep Neural Networks (DNNs) and identifying optimal regularization hyperparameters. Specifically, we validate that OUI can effectively guide the selection of the Weight Decay (WD) hyperparameter by indicating whether a model is overfitting or underfitting during training without requiring validation data. Through experiments on DenseNet-BC-100 with CIFAR- 100, EfficientNet-B0 with TinyImageNet and ResNet-34 with ImageNet-1K, we show that maintaining OUI within a prescribed interval correlates strongly with improved generalization and validation scores. Notably, OUI converges significantly faster than traditional metrics such as loss or accuracy, enabling practitioners to identify optimal WD (hyperparameter) values within the early stages of training. By leveraging OUI as a reliable indicator, we can determine early in training whether the chosen WD value leads the model to underfit the training data, overfit, or strike a well-balanced trade-off that maximizes validation scores. This enables more precise WD tuning for optimal performance on the tested datasets and DNNs. All code for reproducing these experiments is available at this https URL.
69. 【2504.17132】Latent Video Dataset Distillation
链接:https://arxiv.org/abs/2504.17132
作者:Ning Li,Antai Andy Liu,Jingran Zhang,Justin Cui
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrated remarkable effectiveness, video dataset distillation, Dataset distillation, demonstrated remarkable, remarkable effectiveness
备注: [this https URL](https://openreview.net/forum?id=i665TIHv92)
点击查看摘要
Abstract:Dataset distillation has demonstrated remarkable effectiveness in high-compression scenarios for image datasets. While video datasets inherently contain greater redundancy, existing video dataset distillation methods primarily focus on compression in the pixel space, overlooking advances in the latent space that have been widely adopted in modern text-to-image and text-to-video models. In this work, we bridge this gap by introducing a novel video dataset distillation approach that operates in the latent space using a state-of-the-art variational encoder. Furthermore, we employ a diversity-aware data selection strategy to select both representative and diverse samples. Additionally, we introduce a simple, training-free method to further compress the distilled latent dataset. By combining these techniques, our approach achieves a new state-of-the-art performance in dataset distillation, outperforming prior methods on all datasets, e.g. on HMDB51 IPC 1, we achieve a 2.6% performance increase; on MiniUCF IPC 5, we achieve a 7.8% performance increase.
70. 【2504.17111】ransferring Spatial Filters via Tangent Space Alignment in Motor Imagery BCIs
链接:https://arxiv.org/abs/2504.17111
作者:Tekin Gunasar,Virginia de Sa
类目:Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
关键词:based spatial filter, common spatial patterns, motor imagery BCIs, aligning covariance matrices, Riemannian manifold
备注:
点击查看摘要
Abstract:We propose a method to improve subject transfer in motor imagery BCIs by aligning covariance matrices on a Riemannian manifold, followed by computing a new common spatial patterns (CSP) based spatial filter. We explore various ways to integrate information from multiple subjects and show improved performance compared to standard CSP. Across three datasets, our method shows marginal improvements over standard CSP; however, when training data are limited, the improvements become more significant.
71. 【2504.17076】Scene-Aware Location Modeling for Data Augmentation in Automotive Object Detection
链接:https://arxiv.org/abs/2504.17076
作者:Jens Petersen,Davide Abati,Amirhossein Habibian,Auke Wiggers
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:objects, augmentation, augmented frames, training data augmentation, vision tasks
备注:
点击查看摘要
Abstract:Generative image models are increasingly being used for training data augmentation in vision tasks. In the context of automotive object detection, methods usually focus on producing augmented frames that look as realistic as possible, for example by replacing real objects with generated ones. Others try to maximize the diversity of augmented frames, for example by pasting lots of generated objects onto existing backgrounds. Both perspectives pay little attention to the locations of objects in the scene. Frame layouts are either reused with little or no modification, or they are random and disregard realism entirely. In this work, we argue that optimal data augmentation should also include realistic augmentation of layouts. We introduce a scene-aware probabilistic location model that predicts where new objects can realistically be placed in an existing scene. By then inpainting objects in these locations with a generative model, we obtain much stronger augmentation performance than existing approaches. We set a new state of the art for generative data augmentation on two automotive object detection tasks, achieving up to $2.8\times$ higher gains than the best competing approach ($+1.4$ vs. $+0.5$ mAP boost). We also demonstrate significant improvements for instance segmentation.
72. 【2504.17069】Distilling semantically aware orders for autoregressive image generation
链接:https://arxiv.org/abs/2504.17069
作者:Rishav Pramanik,Antoine Poupon,Juan A. Rodriguez,Masih Aminbeidokhti,David Vazquez,Christopher Pal,Zhaozheng Yin,Marco Pedersoli
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:recently shown competitive, shown competitive results, quality and scalability, Autoregressive patch-based image, recently shown
备注:
点击查看摘要
Abstract:Autoregressive patch-based image generation has recently shown competitive results in terms of image quality and scalability. It can also be easily integrated and scaled within Vision-Language models. Nevertheless, autoregressive models require a defined order for patch generation. While a natural order based on the dictation of the words makes sense for text generation, there is no inherent generation order that exists for image generation. Traditionally, a raster-scan order (from top-left to bottom-right) guides autoregressive image generation models. In this paper, we argue that this order is suboptimal, as it fails to respect the causality of the image content: for instance, when conditioned on a visual description of a sunset, an autoregressive model may generate clouds before the sun, even though the color of clouds should depend on the color of the sun and not the inverse. In this work, we show that first by training a model to generate patches in any-given-order, we can infer both the content and the location (order) of each patch during generation. Secondly, we use these extracted orders to finetune the any-given-order model to produce better-quality images. Through our experiments, we show on two datasets that this new generation method produces better images than the traditional raster-scan approach, with similar training costs and no extra annotations.
73. 【2504.17067】PPS-Ctrl: Controllable Sim-to-Real Translation for Colonoscopy Depth Estimation
链接:https://arxiv.org/abs/2504.17067
作者:Xinqi Xiong,Andrea Dunn Beltran,Jun Myeong Choi,Marc Niethammer,Roni Sengupta
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enhances endoscopy navigation, estimation enhances endoscopy, Accurate depth estimation, navigation and diagnostics, settings is challenging
备注:
点击查看摘要
Abstract:Accurate depth estimation enhances endoscopy navigation and diagnostics, but obtaining ground-truth depth in clinical settings is challenging. Synthetic datasets are often used for training, yet the domain gap limits generalization to real data. We propose a novel image-to-image translation framework that preserves structure while generating realistic textures from clinical data. Our key innovation integrates Stable Diffusion with ControlNet, conditioned on a latent representation extracted from a Per-Pixel Shading (PPS) map. PPS captures surface lighting effects, providing a stronger structural constraint than depth maps. Experiments show our approach produces more realistic translations and improves depth estimation over GAN-based MI-CycleGAN. Our code is publicly accessible at this https URL.
74. 【2504.17062】PBR: Extended PBR Materials in Image Synthesis
链接:https://arxiv.org/abs/2504.17062
作者:Yu Guo,Zhiqiang Lao,Xiyun Song,Yubin Zhou,Zongfang Lin,Heather Yu
类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
关键词:Physically Based Rendering, Realistic indoor, traditional Physically Based, vision and graphics, Based Rendering
备注: 8 pages without references, 7 figures, accepted in CVPRW 2025
点击查看摘要
Abstract:Realistic indoor or outdoor image synthesis is a core challenge in computer vision and graphics. The learning-based approach is easy to use but lacks physical consistency, while traditional Physically Based Rendering (PBR) offers high realism but is computationally expensive. Intrinsic image representation offers a well-balanced trade-off, decomposing images into fundamental components (intrinsic channels) such as geometry, materials, and illumination for controllable synthesis. However, existing PBR materials struggle with complex surface models, particularly high-specular and transparent surfaces. In this work, we extend intrinsic image representations to incorporate both reflection and transmission properties, enabling the synthesis of transparent materials such as glass and windows. We propose an explicit intrinsic compositing framework that provides deterministic, interpretable image synthesis. With the Extended PBR (ePBR) Materials, we can effectively edit the materials with precise controls.
75. 【2504.17040】DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
链接:https://arxiv.org/abs/2504.17040
作者:Zhenhailong Wang,Senthil Purushwalkam,Caiming Xiong,Silvio Savarese,Heng Ji,Ran Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:maintaining high task, burden of vision-language, maintaining high, Virtual Token Unmerging, Token
备注:
点击查看摘要
Abstract:We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity, addressing the inherent inefficiency of fixed-length outputs in vision transformers. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence, thus preserving the downstream performance without additional fine-tuning. Unlike previous approaches, our method dynamically adapts token compression to the content of the image and operates completely training-free, making it readily applicable to most state-of-the-art VLM architectures. Extensive experiments on image and video understanding tasks demonstrate that DyMU can reduce the average visual token count by 32%-85% while achieving comparable performance to full-length models across diverse VLM architectures, including the recently popularized AnyRes-based visual encoders. Furthermore, through qualitative analyses, we demonstrate that DToMe effectively adapts token reduction based on image complexity and, unlike existing systems, provides users more control over computational costs. Project page: this https URL.
76. 【2504.17039】Dense Air Pollution Estimation from Sparse in-situ Measurements and Satellite Data
链接:https://arxiv.org/abs/2504.17039
作者:Ruben Gonzalez Avilés,Linus Scheibenreif,Damian Borth
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:ambient Nitrogen Dioxide, estimating ambient Nitrogen, Nitrogen Dioxide, ambient Nitrogen, paper addresses
备注:
点击查看摘要
Abstract:This paper addresses the critical environmental challenge of estimating ambient Nitrogen Dioxide (NO$_2$) concentrations, a key issue in public health and environmental policy. Existing methods for satellite-based air pollution estimation model the relationship between satellite and in-situ measurements at select point locations. While these approaches have advanced our ability to provide air quality estimations on a global scale, they come with inherent limitations. The most notable limitation is the computational intensity required for generating comprehensive estimates over extensive areas. Motivated by these limitations, this study introduces a novel dense estimation technique. Our approach seeks to balance the accuracy of high-resolution estimates with the practicality of computational constraints, thereby enabling efficient and scalable global environmental assessment. By utilizing a uniformly random offset sampling strategy, our method disperses the ground truth data pixel location evenly across a larger patch. At inference, the dense estimation method can then generate a grid of estimates in a single step, significantly reducing the computational resources required to provide estimates for larger areas. Notably, our approach also surpasses the results of existing point-wise methods by a significant margin of $9.45\%$, achieving a Mean Absolute Error (MAE) of $4.98\ \mu\text{g}/\text{m}^3$. This demonstrates both high accuracy and computational efficiency, highlighting the applicability of our method for global environmental assessment. Furthermore, we showcase the method's adaptability and robustness by applying it to diverse geographic regions. Our method offers a viable solution to the computational challenges of large-scale environmental monitoring.
77. 【2504.16974】Seeing The Words: Evaluating AI-generated Biblical Art
链接:https://arxiv.org/abs/2504.16974
作者:Hidde Makimei,Shuai Wang,Willem van Peursen
类目:Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Artificial Intelligence, past years witnessed, amount of Artificial, past years, years witnessed
备注:
点击查看摘要
Abstract:The past years witnessed a significant amount of Artificial Intelligence (AI) tools that can generate images from texts. This triggers the discussion of whether AI can generate accurate images using text from the Bible with respect to the corresponding biblical contexts and backgrounds. Despite some existing attempts at a small scale, little work has been done to systematically evaluate these generated images. In this work, we provide a large dataset of over 7K images using biblical text as prompts. These images were evaluated with multiple neural network-based tools on various aspects. We provide an assessment of accuracy and some analysis from the perspective of religion and aesthetics. Finally, we discuss the use of the generated images and reflect on the performance of the AI generators.
78. 【2504.16972】Unsupervised Time-Series Signal Analysis with Autoencoders and Vision Transformers: A Review of Architectures and Applications
链接:https://arxiv.org/abs/2504.16972
作者:Hossein Ahmadi,Sajjad Emdadi Mahdimahalleh,Arman Farahat,Banafsheh Saffari
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
关键词:Internet of Things, unlabeled time-series data, biomedical engineering, wireless communications, rapid growth
备注:
点击查看摘要
Abstract:The rapid growth of unlabeled time-series data in domains such as wireless communications, radar, biomedical engineering, and the Internet of Things (IoT) has driven advancements in unsupervised learning. This review synthesizes recent progress in applying autoencoders and vision transformers for unsupervised signal analysis, focusing on their architectures, applications, and emerging trends. We explore how these models enable feature extraction, anomaly detection, and classification across diverse signal types, including electrocardiograms, radar waveforms, and IoT sensor data. The review highlights the strengths of hybrid architectures and self-supervised learning, while identifying challenges in interpretability, scalability, and domain generalization. By bridging methodological innovations and practical applications, this work offers a roadmap for developing robust, adaptive models for signal intelligence.
79. 【2504.16942】S2Vec: Self-Supervised Geospatial Embeddings
链接:https://arxiv.org/abs/2504.16942
作者:Shushman Choudhury,Elad Aharoni,Chandrakumari Suvarna,Iveel Tsogsuren,Abdul Rahman Kreidieh,Chun-Ta Lu,Neha Arora
类目:ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Scalable general-purpose representations, Scalable general-purpose, Scalable, artificial intelligence applications, built environment
备注: To be submitted to ACM Transactions on Spatial Algorithms and Systems
点击查看摘要
Abstract:Scalable general-purpose representations of the built environment are crucial for geospatial artificial intelligence applications. This paper introduces S2Vec, a novel self-supervised framework for learning such geospatial embeddings. S2Vec uses the S2 Geometry library to partition large areas into discrete S2 cells, rasterizes built environment feature vectors within cells as images, and applies masked autoencoding on these rasterized images to encode the feature vectors. This approach yields task-agnostic embeddings that capture local feature characteristics and broader spatial relationships. We evaluate S2Vec on three large-scale socioeconomic prediction tasks, showing its competitive performance against state-of-the-art image-based embeddings. We also explore the benefits of combining S2Vec embeddings with image-based embeddings downstream, showing that such multimodal fusion can often improve performance. Our results highlight how S2Vec can learn effective general-purpose geospatial representations and how it can complement other data modalities in geospatial artificial intelligence.
80. 【2504.16936】Multifaceted Evaluation of Audio-Visual Capability for MLLMs: Effectiveness, Efficiency, Generalizability and Robustness
链接:https://arxiv.org/abs/2504.16936
作者:Yusheng Zhao,Junyu Luo,Xiao Luo,Weizhi Zhang,Zhiping Xiao,Wei Ju,Philip S. Yu,Ming Zhang
类目:Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
关键词:Multi-modal large language, large language models, Multi-modal large, recently achieved great, large language
备注:
点击查看摘要
Abstract:Multi-modal large language models (MLLMs) have recently achieved great success in processing and understanding information from diverse modalities (e.g., text, audio, and visual signals). Despite their growing popularity, there remains a lack of comprehensive evaluation measuring the audio-visual capabilities of these models, especially in diverse scenarios (e.g., distribution shifts and adversarial attacks). In this paper, we present a multifaceted evaluation of the audio-visual capability of MLLMs, focusing on four key dimensions: effectiveness, efficiency, generalizability, and robustness. Through extensive experiments, we find that MLLMs exhibit strong zero-shot and few-shot generalization abilities, enabling them to achieve great performance with limited data. However, their success relies heavily on the vision modality, which impairs performance when visual input is corrupted or missing. Additionally, while MLLMs are susceptible to adversarial samples, they demonstrate greater robustness compared to traditional models. The experimental results and our findings provide insights into the audio-visual capabilities of MLLMs, highlighting areas for improvement and offering guidance for future research.
81. 【2504.17710】Plasma State Monitoring and Disruption Characterization using Multimodal VAEs
链接:https://arxiv.org/abs/2504.17710
作者:Yoeri Poels,Alessandro Pau,Christian Donner,Giulio Romanelli,Olivier Sauter,Cristina Venturini,Vlado Menkovski, theTCV team, theWPTE team
类目:Plasma Physics (physics.plasm-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:surrounding device components, significant heat, heat and electromagnetic, electromagnetic loads, loads are deposited
备注:
点击查看摘要
Abstract:When a plasma disrupts in a tokamak, significant heat and electromagnetic loads are deposited onto the surrounding device components. These forces scale with plasma current and magnetic field strength, making disruptions one of the key challenges for future devices. Unfortunately, disruptions are not fully understood, with many different underlying causes that are difficult to anticipate. Data-driven models have shown success in predicting them, but they only provide limited interpretability. On the other hand, large-scale statistical analyses have been a great asset to understanding disruptive patterns. In this paper, we leverage data-driven methods to find an interpretable representation of the plasma state for disruption characterization. Specifically, we use a latent variable model to represent diagnostic measurements as a low-dimensional, latent representation. We build upon the Variational Autoencoder (VAE) framework, and extend it for (1) continuous projections of plasma trajectories; (2) a multimodal structure to separate operating regimes; and (3) separation with respect to disruptive regimes. Subsequently, we can identify continuous indicators for the disruption rate and the disruptivity based on statistical properties of measurement data. The proposed method is demonstrated using a dataset of approximately 1600 TCV discharges, selecting for flat-top disruptions or regular terminations. We evaluate the method with respect to (1) the identified disruption risk and its correlation with other plasma properties; (2) the ability to distinguish different types of disruptions; and (3) downstream analyses. For the latter, we conduct a demonstrative study on identifying parameters connected to disruptions using counterfactual-like analysis. Overall, the method can adequately identify distinct operating regimes characterized by varying proximity to disruptions in an interpretable manner.
82. 【2504.17628】Beyond Labels: Zero-Shot Diabetic Foot Ulcer Wound Segmentation with Self-attention Diffusion Models and the Potential for Text-Guided Customization
链接:https://arxiv.org/abs/2504.17628
作者:Abderrachid Hamrani,Daniela Leizaola,Renato Sousa,Jose P. Ponce,Stanley Mathis,David G. Armstrong,Anuradha Godavarty
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:Diabetic foot ulcers, enhance patient outcomes, Diabetic foot, Zero-shot Unsupervised System, Attention Diffusion Zero-shot
备注: 12 pages, 8 figures, journal article
点击查看摘要
Abstract:Diabetic foot ulcers (DFUs) pose a significant challenge in healthcare, requiring precise and efficient wound assessment to enhance patient outcomes. This study introduces the Attention Diffusion Zero-shot Unsupervised System (ADZUS), a novel text-guided diffusion model that performs wound segmentation without relying on labeled training data. Unlike conventional deep learning models, which require extensive annotation, ADZUS leverages zero-shot learning to dynamically adapt segmentation based on descriptive prompts, offering enhanced flexibility and adaptability in clinical applications. Experimental evaluations demonstrate that ADZUS surpasses traditional and state-of-the-art segmentation models, achieving an IoU of 86.68\% and the highest precision of 94.69\% on the chronic wound dataset, outperforming supervised approaches such as FUSegNet. Further validation on a custom-curated DFU dataset reinforces its robustness, with ADZUS achieving a median DSC of 75\%, significantly surpassing FUSegNet's 45\%. The model's text-guided segmentation capability enables real-time customization of segmentation outputs, allowing targeted analysis of wound characteristics based on clinical descriptions. Despite its competitive performance, the computational cost of diffusion-based inference and the need for potential fine-tuning remain areas for future improvement. ADZUS represents a transformative step in wound segmentation, providing a scalable, efficient, and adaptable AI-driven solution for medical imaging.
83. 【2504.17379】A Spatially-Aware Multiple Instance Learning Framework for Digital Pathology
链接:https://arxiv.org/abs/2504.17379
作者:Hassan Keshvarikhojasteh,Mihail Tifrea,Sibylle Hess,Josien P.W. Pluim,Mitko Veta
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:Multiple instance learning, Deep Multiple Instance, weakly supervised classification, instance learning, Multiple instance
备注:
点击查看摘要
Abstract:Multiple instance learning (MIL) is a promising approach for weakly supervised classification in pathology using whole slide images (WSIs). However, conventional MIL methods such as Attention-Based Deep Multiple Instance Learning (ABMIL) typically disregard spatial interactions among patches that are crucial to pathological diagnosis. Recent advancements, such as Transformer based MIL (TransMIL), have incorporated spatial context and inter-patch relationships. However, it remains unclear whether explicitly modeling patch relationships yields similar performance gains in ABMIL, which relies solely on Multi-Layer Perceptrons (MLPs). In contrast, TransMIL employs Transformer-based layers, introducing a fundamental architectural shift at the cost of substantially increased computational complexity. In this work, we enhance the ABMIL framework by integrating interaction-aware representations to address this question. Our proposed model, Global ABMIL (GABMIL), explicitly captures inter-instance dependencies while preserving computational efficiency. Experimental results on two publicly available datasets for tumor subtyping in breast and lung cancers demonstrate that GABMIL achieves up to a 7 percentage point improvement in AUPRC and a 5 percentage point increase in the Kappa score over ABMIL, with minimal or no additional computational overhead. These findings underscore the importance of incorporating patch interactions within MIL frameworks.
84. 【2504.17122】Physiological neural representation for personalised tracer kinetic parameter estimation from dynamic PET
链接:https://arxiv.org/abs/2504.17122
作者:Kartikay Tehlan,Thomas Wendler
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:positron emission tomography, enables non-invasive quantification, FDG enables non-invasive, Dynamic positron emission, two-tissue compartment model
备注: The code is available at: [this https URL](https://github.com/tkartikay/PhysNRPET)
点击查看摘要
Abstract:Dynamic positron emission tomography (PET) with [$^{18}$F]FDG enables non-invasive quantification of glucose metabolism through kinetic analysis, often modelled by the two-tissue compartment model (TCKM). However, voxel-wise kinetic parameter estimation using conventional methods is computationally intensive and limited by spatial resolution. Deep neural networks (DNNs) offer an alternative but require large training datasets and significant computational resources. To address these limitations, we propose a physiological neural representation based on implicit neural representations (INRs) for personalized kinetic parameter estimation. INRs, which learn continuous functions, allow for efficient, high-resolution parametric imaging with reduced data requirements. Our method also integrates anatomical priors from a 3D CT foundation model to enhance robustness and precision in kinetic modelling. We evaluate our approach on an [$^{18}$F]FDG dynamic PET/CT dataset and compare it to state-of-the-art DNNs. Results demonstrate superior spatial resolution, lower mean-squared error, and improved anatomical consistency, particularly in tumour and highly vascularized regions. Our findings highlight the potential of INRs for personalized, data-efficient tracer kinetic modelling, enabling applications in tumour characterization, segmentation, and prognostic assessment.
85. 【2504.17114】Anatomy-constrained modelling of image-derived input functions in dynamic PET using multi-organ segmentation
链接:https://arxiv.org/abs/2504.17114
作者:Valentin Langer,Kartikay Tehlan,Thomas Wendler
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
关键词:requires anatomically constrained, positron emission tomography, image-derived input functions, Accurate kinetic analysis, anatomically constrained modelling
备注: The code is available under [this https URL](https://github.com/tinolan/curve_fit_multi_idif)
点击查看摘要
Abstract:Accurate kinetic analysis of [$^{18}$F]FDG distribution in dynamic positron emission tomography (PET) requires anatomically constrained modelling of image-derived input functions (IDIFs). Traditionally, IDIFs are obtained from the aorta, neglecting anatomical variations and complex vascular contributions. This study proposes a multi-organ segmentation-based approach that integrates IDIFs from the aorta, portal vein, pulmonary artery, and ureters. Using high-resolution CT segmentations of the liver, lungs, kidneys, and bladder, we incorporate organ-specific blood supply sources to improve kinetic modelling. Our method was evaluated on dynamic [$^{18}$F]FDG PET data from nine patients, resulting in a mean squared error (MSE) reduction of $13.39\%$ for the liver and $10.42\%$ for the lungs. These initial results highlight the potential of multiple IDIFs in improving anatomical modelling and fully leveraging dynamic PET imaging. This approach could facilitate the integration of tracer kinetic modelling into clinical routine.
86. 【2504.16979】Automating tumor-infiltrating lymphocyte assessment in breast cancer histopathology images using QuPath: a transparent and accessible machine learning pipeline
链接:https://arxiv.org/abs/2504.16979
作者:Masoud Tafavvoghi,Lars Ailo Bongo,André Berli Delgado,Nikita Shvetsov,Anders Sildnes,Line Moi,Lill-Tove Rasmussen Busund,Kajsa Møllersen
类目:Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:fully automatic fashion, easily accessible tools, perform complex tasks, tumor-infiltrating lymphocytes, demonstrating the potential
备注: 16 Pages, 9 Figures, 3 tables
点击查看摘要
Abstract:In this study, we built an end-to-end tumor-infiltrating lymphocytes (TILs) assessment pipeline within QuPath, demonstrating the potential of easily accessible tools to perform complex tasks in a fully automatic fashion. First, we trained a pixel classifier to segment tumor, tumor-associated stroma, and other tissue compartments in breast cancer HE-stained whole-slide images (WSI) to isolate tumor-associated stroma for subsequent analysis. Next, we applied a pre-trained StarDist deep learning model in QuPath for cell detection and used the extracted cell features to train a binary classifier distinguishing TILs from other cells. To evaluate our TILs assessment pipeline, we calculated the TIL density in each WSI and categorized them as low, medium, or high TIL levels. Our pipeline was evaluated against pathologist-assigned TIL scores, achieving a Cohen's kappa of 0.71 on the external test set, corroborating previous research findings. These results confirm that existing software can offer a practical solution for the assessment of TILs in HE-stained WSIs of breast cancer.
87. 【2504.16940】Can deep neural networks learn biological vision?
链接:https://arxiv.org/abs/2504.16940
作者:Drew Linsley,Pinyuan Feng,Thomas Serre
类目:Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep neural networks, showed increasing alignment, primate neural responses, neural networks, neural responses
备注:
点击查看摘要
Abstract:Deep neural networks (DNNs) once showed increasing alignment with primate neural responses as they improved on computer vision benchmarks. This trend raised the exciting possibility that better models of biological vision would come as a byproduct of the deep learning revolution in artificial intelligence. However, the trend has reversed over recent years as DNNs have scaled to human or superhuman recognition accuracy, a divergence that may stem from modern DNNs learning to rely on different visual features than primates to solve tasks. Where will better computational models of biological vision come from? We propose that vision science must break from artificial intelligence to develop algorithms that are designed with biological visual systems in mind instead of internet data benchmarks. We predict that the next generation of deep learning models of biological vision will be trained with data diets, training routines, and objectives that are closer to those that shape human vision than those that are in use today.