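This post presents the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

357 papers were updated today, including:

Natural language processing: 78 papers
Information retrieval: 8 papers
Computer vision: 62 papers

Natural Language Processing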

1. 【2505.00703】T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

Authors: Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Recent advancements, large language models, reinforcement learning, advancements in large, large language

Comments: Project Page: [this https URL](https://github.com/CaraJ7/T2I-R1)

Abstract:Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: this https URL

2. 【2505.00679】Steering Large Language Models with Register Analysis for Arbitrary Style Transfer

Link: https://arxiv.org/abs/2505.00679

Authors: Xinchen Yang, Marine Carpuat

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, demonstrated strong capabilities, demonstrated strong

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in rewriting text across various styles. However, effectively leveraging this ability for example-based arbitrary style transfer, where an input text is rewritten to match the style of a given exemplar, remains an open challenge. A key question is how to describe the style of the exemplar to guide LLMs toward high-quality rewrites. In this work, we propose a prompting method based on register analysis to guide LLMs to perform this task. Empirical evaluations across multiple style transfer tasks show that our prompting approach enhances style transfer strength while preserving meaning more effectively than existing prompting strategies.

3. 【2505.00675】Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions

Link: https://arxiv.org/abs/2505.00675

Authors: Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, Jeff Z. Pan

Categories: Computation and Language (cs.CL)

Keywords: underpinning large language, large language models, underpinning large, language models, large language

Abstract:Memory is a fundamental component of AI systems, underpinning large language models (LLMs) based agents. While prior surveys have focused on memory applications with LLMs, they often overlook the atomic operations that underlie memory dynamics. In this survey, we first categorize memory representations into parametric, contextual structured, and contextual unstructured and then introduce six fundamental memory operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Compression. We systematically map these operations to the most relevant research topics across long-term, long-context, parametric modification, and multi-source memory. By reframing memory systems through the lens of atomic operations and representation types, this survey provides a structured and dynamic perspective on research, benchmark datasets, and tools related to memory in AI, clarifying the functional interplay in LLMs based agents while outlining promising directions for future research. The paper list, datasets, methods and tools are available at this https URL.

4. 【2505.00662】DeepCritic: Deliberate Critique with Large Language Models

Link: https://arxiv.org/abs/2505.00662

Authors: Wenkai Yang, Jingwen Chen, Yankai Lin, Ji-Rong Wen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, providing accurate feedback, Language Models, rapidly evolving

Comments: Work in progress. Data and models are available at [this https URL](https://github.com/RUCBM/DeepCritic)

Abstract:As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy and struggling to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing on each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that includes multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.

5. 【2505.00661】On the generalization of language models from in-context learning and finetuning: a controlled study

Link: https://arxiv.org/abs/2505.00661

Authors: Andrew K. Lampinen, Arslan Chaudhry, Stephanie C.Y. Chan, Cody Wild, Diane Wan, Alex Ku, Jörg Bornschein, Razvan Pascanu, Murray Shanahan, James L. McClelland

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: exhibit exciting capabilities, missing logical deductions, show surprisingly narrow, models exhibit exciting, surprisingly narrow generalization

Abstract:Large language models exhibit exciting capabilities, yet can show surprisingly narrow generalization from finetuning -- from failing to generalize to simple reversals of relations they are trained on, to missing logical deductions that can be made from trained information. These failures to generalize from fine-tuning can hinder practical application of these models. However, language models' in-context learning shows different inductive biases, and can generalize better in some of these cases. Here, we explore these differences in generalization between in-context- and fine-tuning-based learning. To do so, we constructed several novel datasets to evaluate and improve models' ability to generalize from finetuning data. The datasets are constructed to isolate the knowledge in the dataset from that in pretraining, to create clean tests of generalization. We expose pretrained large models to controlled subsets of the information in these datasets -- either in context, or through fine-tuning -- and evaluate their performance on test sets that require various types of generalization. We find overall that in data-matched settings, in-context learning can generalize more flexibly than fine-tuning (though we also find some qualifications of prior findings, such as cases when fine-tuning can generalize to reversals embedded in a larger structure of knowledge). We build on these findings to propose a method to enable improved generalization from fine-tuning: adding in-context inferences to finetuning data. We show that this method improves generalization across various splits of our datasets and other benchmarks. Our results have implications for understanding the inductive biases of different modes of learning in language models, and practically improving their performance.

6. 【2505.00654】Large Language Models Understanding: an Inherent Ambiguity Barrier

Link: https://arxiv.org/abs/2505.00654

Authors: Daniel N. Nissani (Nissensohn)

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Language Models, lively ongoing debate, Large Language, emergence of Large

Comments: submitted to NEURAL COMPUTATION

Abstract:A lively ongoing debate is taking place, since the extraordinary emergence of Large Language Models (LLMs) with regards to their capability to understand the world and capture the meaning of the dialogues in which they are involved. Arguments and counter-arguments have been proposed based upon thought experiments, anecdotal conversations between LLMs and humans, statistical linguistic analysis, philosophical considerations, and more. In this brief paper we present a counter-argument based upon a thought experiment and semi-formal considerations leading to an inherent ambiguity barrier which prevents LLMs from having any understanding of what their amazingly fluent dialogues mean.

7. 【2505.00649】Investigating Task Arithmetic for Zero-Shot Information Retrieval

Link: https://arxiv.org/abs/2505.00649

Authors: Marco Braga, Pranav Kasela, Alessandro Raganato, Gabriella Pasi

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Natural Language Processing, Language Processing tasks, Large Language Models, Large Language, Natural Language

Comments: Accepted in SIGIR '25

Abstract:Large Language Models (LLMs) have shown impressive zero-shot performance across a variety of Natural Language Processing tasks, including document re-ranking. However, their effectiveness degrades on unseen tasks and domains, largely due to shifts in vocabulary and word distributions. In this paper, we investigate Task Arithmetic, a technique that combines the weights of LLMs pre-trained on different tasks or domains via simple mathematical operations, such as addition or subtraction, to adapt retrieval models without requiring additional fine-tuning. Our method is able to synthesize diverse tasks and domain knowledge into a single model, enabling effective zero-shot adaptation in different retrieval contexts. Extensive experiments on publicly available scientific, biomedical, and multilingual datasets show that our method improves state-of-the-art re-ranking performance by up to 18% in NDCG@10 and 15% in P@10. In addition to these empirical gains, our analysis provides insights into the strengths and limitations of Task Arithmetic as a practical strategy for zero-shot learning and model adaptation. We make our code publicly available at this https URL.
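
The weight arithmetic mentioned in this abstract can be illustrated with a toy sketch. This is not the authors' code: it assumes the common formulation in which a "task vector" is the difference between fine-tuned and base weights, and a merged model is the base plus a scaled sum of task vectors. The names `task_vector`, `apply_task_vectors`, and `alpha`, and the use of plain Python lists in place of tensors, are hypothetical choices made for illustration.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def task_vector(base: Weights, finetuned: Weights) -> Weights:
    """Per-parameter difference between fine-tuned and base weights."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_task_vectors(base: Weights, vectors: List[Weights], alpha: float = 1.0) -> Weights:
    """Add each task vector, scaled by alpha, to a copy of the base weights."""
    merged = {k: list(v) for k, v in base.items()}
    for vec in vectors:
        for k, delta in vec.items():
            merged[k] = [m + alpha * d for m, d in zip(merged[k], delta)]
    return merged


# Hypothetical checkpoints: one adapted to a domain, one trained for ranking.
base = {"w": [1.0, 2.0]}
domain_model = {"w": [1.5, 2.0]}
ranker = {"w": [1.0, 3.0]}

merged = apply_task_vectors(
    base,
    [task_vector(base, domain_model), task_vector(base, ranker)],
    alpha=0.5,
)
print(merged)  # {'w': [1.25, 2.5]}
```

Subtraction works the same way with a negative `alpha`. No gradient step is involved, which matches the abstract's point that adaptation needs no additional fine-tuning.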

8. 【2505.00626】The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)

Link: https://arxiv.org/abs/2505.00626

Authors: Zihao Wang, Yibo Jiang, Jiahao Yu, Heqing Huang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: external tool outputs, Large language models, Large language, integrate multiple input, external tool

Abstract:Large language models (LLMs) that integrate multiple input roles (e.g., system instructions, user queries, external tool outputs) are increasingly prevalent in practice. Ensuring that the model accurately distinguishes messages from each role -- a concept we call *role separation* -- is crucial for consistent multi-role behavior. Although recent work often targets state-of-the-art prompt injection defenses, it remains unclear whether such methods truly teach LLMs to differentiate roles or merely memorize known triggers. In this paper, we examine *role-separation learning*: the process of teaching LLMs to robustly distinguish system and user tokens. Through a *simple, controlled experimental framework*, we find that fine-tuned models often rely on two proxies for role identification: (1) task type exploitation, and (2) proximity to begin-of-text. Although data augmentation can partially mitigate these shortcuts, it generally leads to iterative patching rather than a deeper fix. To address this, we propose reinforcing *invariant signals* that mark role boundaries by adjusting token-wise cues in the model's input encoding. In particular, manipulating position IDs helps the model learn clearer distinctions and reduces reliance on superficial proxies. By focusing on this mechanism-centered perspective, our work illuminates how LLMs can more reliably maintain consistent multi-role behavior without merely memorizing known prompts or triggers.

9. 【2505.00624】FineScope: Precision Pruning for Domain-Specialized Large Language Models Using SAE-Guided Self-Data Cultivation

Link: https://arxiv.org/abs/2505.00624

Authors: Chaitali Bhattacharyya, Yeseong Kim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Training large language, significant computational resources, scratch requires significant, requires significant computational, Training large

Abstract:Training large language models (LLMs) from scratch requires significant computational resources, driving interest in developing smaller, domain-specific LLMs that maintain both efficiency and strong task performance. Medium-sized models such as LLaMA have served as starting points for domain-specific adaptation, but they often suffer from accuracy degradation when tested on specialized datasets. We introduce FineScope, a framework for deriving compact, domain-optimized LLMs from larger pretrained models. FineScope leverages the Sparse Autoencoder (SAE) framework, inspired by its ability to produce interpretable feature representations, to extract domain-specific subsets from large datasets. We apply structured pruning with domain-specific constraints, ensuring that the resulting pruned models retain essential knowledge for the target domain. To further enhance performance, these pruned models undergo self-data distillation, leveraging SAE-curated datasets to restore key domain-specific information lost during pruning. Extensive experiments and ablation studies demonstrate that FineScope achieves highly competitive performance, outperforming several large-scale state-of-the-art LLMs in domain-specific tasks. Additionally, our results show that FineScope enables pruned models to regain a substantial portion of their original performance when fine-tuned with SAE-curated datasets. Furthermore, applying these datasets to fine-tune pretrained LLMs without pruning also improves their domain-specific accuracy, highlighting the robustness of our approach. The code will be released.

10. 【2505.00582】Block Circulant Adapter for Large Language Models

Link: https://arxiv.org/abs/2505.00582

Authors: Xinyu Ding, Meiqi Wang, Siyu Liao, Zhongfeng Wang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: huge model size, difficult due, Fine-tuning large language, times, Recent Fourier domain-based

Comments: to appear in Proceedings of the 2025 International Joint Conference on Artificial Intelligence (IJCAI-2025)

Abstract:Fine-tuning large language models (LLMs) is difficult due to their huge model size. Recent Fourier domain-based methods show potential for reducing fine-tuning costs. We propose a block circulant matrix-based fine-tuning method with a stable training heuristic to leverage the properties of circulant matrices and one-dimensional Fourier transforms to reduce storage and computation costs. Experiments show that our method uses $14\times$ less number of parameters than VeRA, $16\times$ smaller than LoRA and $32\times$ less FLOPs than FourierFT, while maintaining close or better task performance. Our approach presents a promising way in frequency domain to fine-tune large models on downstream tasks.
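
For readers unfamiliar with why circulant structure saves storage and computation, here is a minimal sketch of the underlying identity, not the paper's adapter or its training heuristic. A circulant block is fully determined by its first column, and multiplying by it equals a circular convolution, which becomes an element-wise product after a one-dimensional Fourier transform. The function names and the block layout (`blocks[i][j]` holding the first column of block (i, j)) are assumptions for illustration, and the naive O(n^2) DFT stands in for a real FFT to keep the example dependency-free.

```python
import cmath
from typing import List


def dft(x: List[complex], inverse: bool = False) -> List[complex]:
    """Naive discrete Fourier transform; a real FFT would be O(n log n)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def circulant_matvec(c: List[float], x: List[float]) -> List[float]:
    """Compute C @ x, where C[i][j] = c[(i - j) % n], in the Fourier domain."""
    product = [a * b for a, b in zip(dft(c), dft(x))]
    return [v.real for v in dft(product, inverse=True)]


def block_circulant_matvec(blocks: List[List[List[float]]], x: List[float], p: int) -> List[float]:
    """Multiply a block-circulant matrix by x; each p-by-p block stores only p values."""
    chunks = [x[j * p:(j + 1) * p] for j in range(len(blocks[0]))]
    y: List[float] = []
    for row in blocks:
        acc = [0.0] * p
        for c, chunk in zip(row, chunks):
            acc = [a + b for a, b in zip(acc, circulant_matvec(c, chunk))]
        y.extend(acc)
    return y


# One block row with two 2x2 circulant blocks: a 2x4 matrix stored as 4 numbers.
y = block_circulant_matvec([[[1.0, 2.0], [0.0, 1.0]]], [1.0, 0.0, 0.0, 1.0], p=2)
print([round(v, 6) for v in y])
```

Each p-by-p block stores p values instead of p squared, which is where the parameter savings in such methods come from.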

11. 【2505.00570】FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension

Link: https://arxiv.org/abs/2505.00570

Authors: Jushi Kai, Boyi Zeng, Yixuan Wang, Haoli Bai, Bo Jiang, Zhouhan Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: long-form content generation, applications involving long-form, involving long-form content, large language models, content generation

Abstract:Extending the context window in large language models (LLMs) is essential for applications involving long-form content generation. However, the linear increase in key-value (KV) cache memory requirements and the quadratic complexity of self-attention with respect to sequence length present significant challenges during fine-tuning and inference. Existing methods suffer from performance degradation when extending to longer contexts. In this work, we introduce a novel context extension method that optimizes both fine-tuning and inference efficiency. Our method exploits a key observation: in the frequency domain, the energy distribution of the KV cache is primarily concentrated in low-frequency components. By filtering out the high-frequency components, the KV cache can be effectively compressed with minimal information loss. Building on this insight, we propose an efficient compression technique, FreqKV, that iteratively compresses the increasing KV cache to a fixed size in the frequency domain, applicable to both fine-tuning and inference. FreqKV introduces no additional parameters or architectural modifications. With minimal fine-tuning, LLMs can learn to leverage the limited cache that is compressed in the frequency domain and extend the context window efficiently. Experiments on various long context language modeling and understanding tasks demonstrate the efficiency and efficacy of the proposed method.
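
The low-frequency filtering idea in this abstract can be sketched on a single channel of a cached sequence. This is an illustrative assumption, not FreqKV itself: the abstract does not name the transform, so the sketch uses an orthonormal DCT, keeps the first `m` coefficients, and inverts them at the shorter length with a `sqrt(m / n)` rescaling so that amplitudes are preserved. The function names and the rescaling factor are hypothetical.

```python
import math
from typing import List


def dct(x: List[float]) -> List[float]:
    """Orthonormal DCT-II of a 1-D sequence (naive O(n^2) version)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)))
    return out


def idct(coeffs: List[float]) -> List[float]:
    """Inverse of dct above (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n) * coeffs[k]
            * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(n)
        )
        for t in range(n)
    ]


def compress_sequence(x: List[float], m: int) -> List[float]:
    """Shrink a length-n sequence to length m by dropping high-frequency terms."""
    n = len(x)
    low = dct(x)[:m]            # keep only the m lowest-frequency coefficients
    scale = math.sqrt(m / n)    # assumed rescaling so constant signals keep their value
    return [scale * v for v in idct(low)]


# One channel of a cached key sequence, compressed from 8 positions to 4.
channel = [0.9, 1.0, 1.1, 1.0, 0.9, 1.0, 1.1, 1.0]
print(len(compress_sequence(channel, 4)))  # 4
```

In a real cache the same operation would run along the sequence axis for every key and value channel, and would be repeated whenever the cache grows past its fixed budget, as the abstract describes.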

12. 【2505.00557】Triggering Hallucinations in LLMs: A Quantitative Study of Prompt-Induced Hallucination in Large Language Models

Link: https://arxiv.org/abs/2505.00557

Authors: Makoto Sato

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, present a growing, real-world applications, healthcare to law, reliability is essential

Abstract:Hallucinations in large language models (LLMs) present a growing challenge across real-world applications, from healthcare to law, where factual reliability is essential. Despite advances in alignment and instruction tuning, LLMs can still generate outputs that are fluent yet fundamentally untrue. Understanding the cognitive dynamics that underlie these hallucinations remains an open problem. In this study, we propose a prompt-based framework to systematically trigger and quantify hallucination: a Hallucination-Inducing Prompt (HIP), which synthetically fuses semantically distant concepts (e.g., periodic table of elements and tarot divination) in a misleading way, and a Hallucination Quantifying Prompt (HQP), which scores the plausibility, confidence, and coherence of the output. Controlled experiments across multiple LLMs revealed that HIPs consistently produced less coherent and more hallucinated responses than their null-fusion controls. These effects varied across models, with reasoning-oriented LLMs showing distinct profiles from general-purpose ones. Our framework provides a reproducible testbed for studying hallucination vulnerability, and opens the door to developing safer, more introspective LLMs that can detect and self-regulate the onset of conceptual instability.

13. 【2505.00551】100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models

Link: https://arxiv.org/abs/2505.00551

Authors: Chong Zhang, Yue Deng, Xiang Lin, Bin Wang, Dianwen Ng, Hai Ye, Xingxuan Li, Yao Xiao, Zhanfeng Mo, Qi Zhang, Lidong Bing

Categories: Computation and Language (cs.CL)

Keywords: large language models, reasoning language models, language models, evolution in large, large language

Abstract:The recent development of reasoning language models (RLMs) represents a novel evolution in large language models. In particular, the recent release of DeepSeek-R1 has generated widespread social impact and sparked enthusiasm in the research community for exploring the explicit reasoning paradigm of language models. However, the implementation details of the released models have not been fully open-sourced by DeepSeek, including DeepSeek-R1-Zero, DeepSeek-R1, and the distilled small models. As a result, many replication studies have emerged aiming to reproduce the strong performance achieved by DeepSeek-R1, reaching comparable performance through similar training procedures and fully open-source data resources. These works have investigated feasible strategies for supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR), focusing on data preparation and method design, yielding various valuable insights. In this report, we provide a summary of recent replication studies to inspire future research. We primarily focus on SFT and RLVR as two main directions, introducing the details for data construction, method design and training procedure of current replication studies. Moreover, we conclude key findings from the implementation details and experimental results reported by these studies, anticipating to inspire future research. We also discuss additional techniques of enhancing RLMs, highlighting the potential of expanding the application scope of these models, and discussing the challenges in development. By this survey, we aim to help researchers and developers of RLMs stay updated with the latest advancements, and seek to inspire new ideas to further enhance RLMs.

14. 【2505.00506】HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection

Link: https://arxiv.org/abs/2505.00506

Authors: Deanna Emery, Michael Goitia, Freddie Vargus, Iulia Neagu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: detecting hallucinated content, large language models, language models, detecting hallucinated, hallucinated content

Abstract:As large language models (LLMs) are increasingly deployed in high-stakes domains, detecting hallucinated content – text that is not grounded in supporting evidence – has become a critical challenge. Existing benchmarks for hallucination detection are often synthetically generated, narrowly focused on extractive question answering, and fail to capture the complexity of real-world scenarios involving multi-document contexts and full-sentence outputs. We introduce the HalluMix Benchmark, a diverse, task-agnostic dataset that includes examples from a range of domains and formats. Using this benchmark, we evaluate seven hallucination detection systems – both open and closed source – highlighting differences in performance across tasks, document lengths, and input representations. Our analysis highlights substantial performance disparities between short and long contexts, with critical implications for real-world Retrieval Augmented Generation (RAG) implementations. Quotient Detections achieves the best overall performance, with an accuracy of 0.82 and an F1 score of 0.84.

15. 【2505.00479】Computational Identification of Regulatory Statements in EU Legislation

Link: https://arxiv.org/abs/2505.00479

Authors: Gijs Jan Brandsma, Jens Blom-Hansen, Christiaan Meijer, Kody Moodley

Categories: Computation and Language (cs.CL)

Keywords: developing metrics, metrics to measure, density and strictness, Identifying regulatory statements, regulatory density

Comments: 11 pages, 6 figures

Abstract:Identifying regulatory statements in legislation is useful for developing metrics to measure the regulatory density and strictness of legislation. A computational method is valuable for scaling the identification of such statements from a growing body of EU legislation, constituting approximately 180,000 published legal acts between 1952 and 2023. Past work on extraction of these statements varies in the permissiveness of their definitions for what constitutes a regulatory statement. In this work, we provide a specific definition for our purposes based on the institutional grammar tool. We develop and compare two contrasting approaches for automatically identifying such statements in EU legislation, one based on dependency parsing, and the other on a transformer-based machine learning model. We found both approaches performed similarly well with accuracies of 80% and 84% respectively and a K alpha of 0.58. The high accuracies and not exceedingly high agreement suggests potential for combining strengths of both approaches.

16. 【2505.00467】Red Teaming Large Language Models for Healthcare

Link: https://arxiv.org/abs/2505.00467

Authors: Vahid Balazadeh, Michael Cooper, David Pellow, Atousa Assadi, Jennifer Bell, Jim Fackler, Gabriel Funingana, Spencer Gable-Cook, Anirudh Gangadhar, Abhishek Jaiswal, Sumanth Kaja, Christopher Khoury, Randy Lin, Kaden McKeen, Sara Naimimohasses, Khashayar Namdar, Aviraj Newatia, Allan Pang, Anshul Pattoo, Sameer Peesapati, Diana Prepelita, Bogdana Rakova, Saba Sadatamin, Rafael Schulman, Ajay Shah, Syed Azhar Shah, Syed Ahmar Shah, Babak Taati, Balagopal Unnikrishnan, Stephanie Williams, Rahul G Krishnan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: entitled Red Teaming, Red Teaming Large, Teaming Large Language, Large Language Models, Learning for Healthcare

Abstract:We present the design process and findings of the pre-conference workshop at the Machine Learning for Healthcare Conference (2024) entitled Red Teaming Large Language Models for Healthcare, which took place on August 15, 2024. Conference participants, comprising a mix of computational and clinical expertise, attempted to discover vulnerabilities -- realistic clinical prompts for which a large language model (LLM) outputs a response that could cause clinical harm. Red-teaming with clinicians enables the identification of LLM vulnerabilities that may not be recognised by LLM developers lacking clinical expertise. We report the vulnerabilities found, categorise them, and present the results of a replication study assessing the vulnerabilities across all LLMs provided.

17. 【2505.00422】Toward Automated Regulatory Decision-Making: Trustworthy Medical Device Risk Classification with Multimodal Transformers and Self-Training

Link: https://arxiv.org/abs/2505.00422

Authors: Yu Han, Aaron Ceross, Jeroen H.M. Bergmann

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: medical device risk, device risk levels, Accurate classification, clinical safety, device regulatory classification

Abstract:Accurate classification of medical device risk levels is essential for regulatory oversight and clinical safety. We present a Transformer-based multimodal framework that integrates textual descriptions and visual information to predict device regulatory classification. The model incorporates a cross-attention mechanism to capture intermodal dependencies and employs a self-training strategy for improved generalization under limited supervision. Experiments on a real-world regulatory dataset demonstrate that our approach achieves up to 90.4% accuracy and 97.9% AUROC, significantly outperforming text-only (77.2%) and image-only (54.8%) baselines. Compared to standard multimodal fusion, the self-training mechanism improved SVM performance by 3.3 percentage points in accuracy (from 87.1% to 90.4%) and 1.4 points in macro-F1, suggesting that pseudo-labeling can effectively enhance generalization under limited supervision. Ablation studies further confirm the complementary benefits of both cross-modal attention and self-training.

18. 【2505.00389】CSE-SFP: Enabling Unsupervised Sentence Representation Learning via a Single Forward Pass

Link: https://arxiv.org/abs/2505.00389

Authors: Bowen Zhang, Zixin Song, Chunping Li

Categories: Computation and Language (cs.CL)

Keywords: Information Retrieval, task in Information, Computational Linguistics, Retrieval and Computational, content analysis

Comments: Accepted by SIGIR 2025 (Full)

Abstract:As a fundamental task in Information Retrieval and Computational Linguistics, sentence representation has profound implications for a wide range of practical applications such as text clustering, content analysis, question-answering systems, and web search. Recent advances in pre-trained language models (PLMs) have driven remarkable progress in this field, particularly through unsupervised embedding derivation methods centered on discriminative PLMs like BERT. However, due to time and computational constraints, few efforts have attempted to integrate unsupervised sentence representation with generative PLMs, which typically possess much larger parameter sizes. Given that state-of-the-art models in both academia and industry are predominantly based on generative architectures, there is a pressing need for an efficient unsupervised text representation framework tailored to decoder-only PLMs. To address this concern, we propose CSE-SFP, an innovative method that exploits the structural characteristics of generative models. Compared to existing strategies, CSE-SFP requires only a single forward pass to perform effective unsupervised contrastive learning. Rigorous experimentation demonstrates that CSE-SFP not only produces higher-quality embeddings but also significantly reduces both training time and memory consumption. Furthermore, we introduce two ratio metrics that jointly assess alignment and uniformity, thereby providing a more robust means for evaluating the semantic spatial properties of encoding models.

19. 【2505.00367】KoACD: The First Korean Adolescent Dataset for Cognitive Distortion Analysis

Link: https://arxiv.org/abs/2505.00367

Authors: JunSeo Kim, HyeHyeon Kim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: negative thinking patterns, mental health issues, refers to negative, negative thinking, thinking patterns

Abstract:Cognitive distortion refers to negative thinking patterns that can lead to mental health issues like depression and anxiety in adolescents. Previous studies using natural language processing (NLP) have focused mainly on small-scale adult datasets, with limited research on adolescents. This study introduces KoACD, the first large-scale dataset of cognitive distortions in Korean adolescents, containing 108,717 instances. We applied a multi-Large Language Model (LLM) negotiation method to refine distortion classification and generate synthetic data using two approaches: cognitive clarification for textual clarity and cognitive balancing for diverse distortion representation. Validation through LLMs and expert evaluations showed that while LLMs classified distortions with explicit markers, they struggled with context-dependent reasoning, where human evaluators demonstrated higher accuracy. KoACD aims to enhance future research on cognitive distortion detection.

20. 【2505.00358】R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training

Link: https://arxiv.org/abs/2505.00358

Authors: Albert Ge, Tzu-Heng Huang, John Cooper, Avi Trost, Ziyi Chu, Satya Sai Srinath Namburi GNVV, Ziyang Cai, Kendall Park, Nicholas Roberts, Frederic Sala

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: successfully reduced, reduced the costs, costs involved, training language models, Data

Abstract:Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e.g., data sources, task types), which may fail to capture critical semantic nuances, leaving performance on the table. Second, these methods scale with the number of domains in a computationally prohibitive way. We address these challenges via R&B, a framework that re-partitions training data based on semantic similarity (Regroup) to create finer-grained domains, and efficiently optimizes the data composition (Balance) by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior works, it removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and provide theoretical insights that justify R&B's effectiveness compared to non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness of R&B on five diverse datasets ranging from natural language to reasoning and multimodal tasks. With as little as 0.01% additional compute overhead, R&B matches or exceeds the performance of state-of-the-art data mixing strategies.

21. 【2505.00339】Enhancing AI-Driven Education: Integrating Cognitive Frameworks, Linguistic Feedback Analysis, and Ethical Considerations for Improved Content Generation

Link: https://arxiv.org/abs/2505.00339

Authors: Antoun Yaacoub,Sansiri Tarnpradab,Phattara Khumprom,Zainab Assaghir,Lionel Prevost,Jérôme Da-Rugna

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: rapidly transforming education, presenting unprecedented opportunities, Artificial intelligence, streamlined content creation, transforming education

Comments: This article will be presented in IJCNN 2025 "AI Innovations for Education: Transforming Teaching and Learning through Cutting-Edge Technologies" workshop

Abstract:Artificial intelligence (AI) is rapidly transforming education, presenting unprecedented opportunities for personalized learning and streamlined content creation. However, realizing the full potential of AI in educational settings necessitates careful consideration of the quality, cognitive depth, and ethical implications of AI-generated materials. This paper synthesizes insights from four related studies to propose a comprehensive framework for enhancing AI-driven educational tools. We integrate cognitive assessment frameworks (Bloom's Taxonomy and SOLO Taxonomy), linguistic analysis of AI-generated feedback, and ethical design principles to guide the development of effective and responsible AI tools. We outline a structured three-phase approach encompassing cognitive alignment, linguistic feedback integration, and ethical safeguards. The practical application of this framework is demonstrated through its integration into OneClickQuiz, an AI-powered Moodle plugin for quiz generation. This work contributes a comprehensive and actionable guide for educators, researchers, and developers aiming to harness AI's potential while upholding pedagogical and ethical standards in educational content generation.

22. 【2505.00337】T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation

Link: https://arxiv.org/abs/2505.00337

Authors: Xuyang Guo,Jiayan Huo,Zhenmei Shi,Zhao Song,Jiahao Zhang,Jiale Zhao

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: user engagement online, made significant strides, digital art creation, producing high-quality videos, recent years

Abstract:Text-to-video generative models have made significant strides in recent years, producing high-quality videos that excel in both aesthetic appeal and accurate instruction following, and have become central to digital art creation and user engagement online. Yet, despite these advancements, their ability to respect fundamental physical laws remains largely untested: many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics, resulting in unrealistic or even misleading content. Existing physical-evaluation benchmarks typically rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts, and thus overlook both human judgment and first-principles physics. To fill this gap, we introduce T2VPhysBench, a first-principles benchmark that systematically evaluates whether state-of-the-art text-to-video systems, both open-source and commercial, obey twelve core physical laws including Newtonian mechanics, conservation principles, and phenomenological effects. Our benchmark employs a rigorous human evaluation protocol and includes three targeted studies: (1) an overall compliance assessment showing that all models score below 0.60 on average in each law category; (2) a prompt-hint ablation revealing that even detailed, law-specific hints fail to remedy physics violations; and (3) a counterfactual robustness test demonstrating that models often generate videos that explicitly break physical rules when so instructed. The results expose persistent limitations in current architectures and offer concrete insights for guiding future research toward truly physics-aware video generation.

23. 【2505.00315】Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing

Link: https://arxiv.org/abs/2505.00315

Authors: Piotr Piękos,Róbert Csordás,Jürgen Schmidhuber

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: excessive quadratic cost, Recent advances, large language models, language models highlighted, advances in large

Abstract:Recent advances in large language models have highlighted the excessive quadratic cost of self-attention. Despite significant research efforts, subquadratic attention methods still suffer from inferior performance in practice. We hypothesize that dynamic, learned content-based sparsity can lead to more efficient attention mechanisms. We present Mixture of Sparse Attention (MoSA), a novel approach inspired by Mixture of Experts (MoE) with expert choice routing. MoSA dynamically selects tokens for each attention head, allowing arbitrary sparse attention patterns. By selecting $k$ tokens from a sequence of length $T$, MoSA reduces the computational complexity of each attention head from $O(T^2)$ to $O(k^2 + T)$. This enables using more heads within the same computational budget, allowing higher specialization. We show that among the tested sparse attention variants, MoSA is the only one that can outperform the dense baseline, sometimes with up to 27% better perplexity for an identical compute budget. MoSA can also reduce the resource usage compared to dense self-attention. Despite using a torch implementation without an optimized kernel, perplexity-matched MoSA models are simultaneously faster in wall-clock time, require less memory for training, and drastically reduce the size of the KV-cache compared to the dense transformer baselines.
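
To make the $O(T^2)$ versus $O(k^2 + T)$ comparison concrete, here is a minimal pure-Python sketch of one sparse head with expert-choice token selection. It is an illustration under assumptions, not the authors' implementation: the names `select_tokens`, `sparse_head`, and `head_cost` are hypothetical, the router scores are given as plain numbers, and causal masking and any output scaling are left out.

```python
import math

def select_tokens(router_scores, k):
    """Expert-choice routing for one head: the head keeps its k highest-scoring tokens."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])  # restore sequence order

def attend(queries, keys, values):
    """Plain softmax attention over the given (already selected) tokens."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, key)) / math.sqrt(dim) for key in keys]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        outputs.append([sum(w / total * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

def sparse_head(queries, keys, values, router_scores, k):
    """Attention restricted to the k selected tokens; returns {token_index: output}."""
    idx = select_tokens(router_scores, k)
    out = attend([queries[i] for i in idx], [keys[i] for i in idx], [values[i] for i in idx])
    return dict(zip(idx, out))

def head_cost(seq_len, k=None):
    """Rough per-head cost: T^2 for dense attention, k^2 + T for the sparse head."""
    return seq_len ** 2 if k is None else k ** 2 + seq_len
```

With a hypothetical `seq_len` of 1024 and `k` of 64, `head_cost` gives 1,048,576 for the dense head and 5,120 for the sparse one, which is the budget gap that lets more heads fit into the same compute.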

24. 【2505.00268】Consistency in Language Models: Current Landscape, Challenges, and Future Directions

Link: https://arxiv.org/abs/2505.00268

Authors: Jekaterina Novikova,Carol Anderson,Borhane Blili-Hamelin,Subhabrata Majumdar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: expressing similar meanings, expressing similar, avoiding contradictions, similar meanings, similar contexts

Abstract:The hallmark of effective language use lies in consistency -- expressing similar meanings in similar contexts and avoiding contradictions. While human communication naturally demonstrates this principle, state-of-the-art language models struggle to maintain reliable consistency across different scenarios. This paper examines the landscape of consistency research in AI language systems, exploring both formal consistency (including logical rule adherence) and informal consistency (such as moral and factual coherence). We analyze current approaches to measure aspects of consistency, identify critical research gaps in standardization of definitions, multilingual assessment, and methods to improve consistency. Our findings point to an urgent need for robust benchmarks to measure and interdisciplinary approaches to ensure consistency in the application of language models on domain-specific tasks while preserving the utility and adaptability.

25. 【2505.00263】EnronQA: Towards Personalized RAG over Private Documents

Link: https://arxiv.org/abs/2505.00263

Authors: Michael J. Ryan,Danmei Xu,Chris Nivera,Daniel Campos

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Retrieval Augmented Generation, Augmented Generation, large language models, bringing knowledge-intensive context, bring local context

Comments: 26 pages, 4 figures, 6 tables

Abstract:Retrieval Augmented Generation (RAG) has become one of the most popular methods for bringing knowledge-intensive context to large language models (LLM) because of its ability to bring local context at inference time without the cost or data leakage risks associated with fine-tuning. A clear separation of private information from the LLM training has made RAG the basis for many enterprise LLM workloads as it allows the company to augment LLM's understanding using customers' private documents. Despite its popularity for private documents in enterprise deployments, current RAG benchmarks for validating and optimizing RAG pipelines draw their corpora from public data such as Wikipedia or generic web pages and offer little to no personal context. Seeking to empower more personal and private RAG we release the EnronQA benchmark, a dataset of 103,638 emails with 528,304 question-answer pairs across 150 different user inboxes. EnronQA enables better benchmarking of RAG pipelines over private data and allows for experimentation on the introduction of personalized retrieval settings over realistic data. Finally, we use EnronQA to explore the tradeoff in memorization and retrieval when reasoning over private documents.

26. 【2505.00261】Enriching the Korean Learner Corpus with Multi-reference Annotations and Rubric-Based Scoring

Link: https://arxiv.org/abs/2505.00261

Authors: Jayoung Song,KyungTae Lim,Jungyeul Park

Categories: Computation and Language (cs.CL)

Keywords: growing global interest, learner corpora tailored, growing global, global interest, remains a significant

Abstract:Despite growing global interest in Korean language education, there remains a significant lack of learner corpora tailored to Korean L2 writing. To address this gap, we enhance the KoLLA Korean learner corpus by adding multiple grammatical error correction (GEC) references, thereby enabling more nuanced and flexible evaluation of GEC systems and reflecting the variability of human language. Additionally, we enrich the corpus with rubric-based scores aligned with guidelines from the Korean National Language Institute, capturing grammatical accuracy, coherence, and lexical diversity. These enhancements make KoLLA a robust and standardized resource for research in Korean L2 education, supporting advancements in language learning, assessment, and automated error correction.

27. 【2505.00234】Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks

Link: https://arxiv.org/abs/2505.00234

Authors: Vishnu Sarukkai,Zhiqiang Xie,Kayvon Fatahalian

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Model, improving Large Language, Language Model, Large Language, improving Large

Abstract:Many methods for improving Large Language Model (LLM) agents for sequential decision-making tasks depend on task-specific knowledge engineering--such as prompt tuning, curated in-context examples, or customized observation and action spaces. Using these approaches, agent performance improves with the quality or amount of knowledge engineering invested. Instead, we investigate how LLM agents can automatically improve their performance by learning in-context from their own successful experiences on similar tasks. Rather than relying on task-specific knowledge engineering, we focus on constructing and refining a database of self-generated examples. We demonstrate that even a naive accumulation of successful trajectories across training tasks boosts test performance on three benchmarks: ALFWorld (73% to 89%), Wordcraft (55% to 64%), and InterCode-SQL (75% to 79%)--matching the performance the initial agent achieves if allowed two to three attempts per task. We then introduce two extensions: (1) database-level selection through population-based training to identify high-performing example collections, and (2) exemplar-level selection that retains individual trajectories based on their empirical utility as in-context examples. These extensions further enhance performance, achieving 91% on ALFWorld--matching more complex approaches that employ task-specific components and prompts. Our results demonstrate that automatic trajectory database construction offers a compelling alternative to labor-intensive knowledge engineering.

28. 【2505.00212】Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

Link: https://arxiv.org/abs/2505.00212

Authors: Shaokun Zhang,Ming Yin,Jieyu Zhang,Jiale Liu,Zhiguang Han,Jingyang Zhang,Beibin Li,Chi Wang,Huazheng Wang,Yiran Chen,Qingyun Wu

Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)

Keywords: LLM multi-agent systems, LLM multi-agent systems-identifying, failures-provides crucial clues, LLM multi-agent, task failures-provides crucial

Abstract:Failure attribution in LLM multi-agent systems -- identifying the agent and step responsible for task failures -- provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at this https URL

29. 【2505.00191】IP-CRR: Information Pursuit for Interpretable Classification of Chest Radiology Reports

Link: https://arxiv.org/abs/2505.00191

Authors: Yuyan Ge,Kwan Ho Ryan Chan,Pablo Messina,René Vidal

Categories: Computation and Language (cs.CL)

Keywords: improving diagnostic accuracy, analyzing radiology reports, reducing workload, development of AI-based, lead to significant

Comments: 12 pages, 4 figures

Abstract:The development of AI-based methods for analyzing radiology reports could lead to significant advances in medical diagnosis--from improving diagnostic accuracy to enhancing efficiency and reducing workload. However, the lack of interpretability in these methods has hindered their adoption in clinical settings. In this paper, we propose an interpretable-by-design framework for classifying radiology reports. The key idea is to extract a set of most informative queries from a large set of reports and use these queries and their corresponding answers to predict a diagnosis. Thus, the explanation for a prediction is, by construction, the set of selected queries and answers. We use the Information Pursuit framework to select informative queries, the Flan-T5 model to determine if facts are present in the report, and a classifier to predict the disease. Experiments on the MIMIC-CXR dataset demonstrate the effectiveness of the proposed method, highlighting its potential to enhance trust and usability in medical AI.

30. 【2505.00150】Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models

Link: https://arxiv.org/abs/2505.00150

Authors: Minh-Hao Van,Xintao Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: provided enhanced communication, enhanced communication channels, thoughts and opinions, rapid evolution, evolution of social

Abstract:The rapid evolution of social media has provided enhanced communication channels for individuals to create online content, enabling them to express their thoughts and opinions. Multimodal memes, often utilized for playful or humorous expressions with visual and textual elements, are sometimes misused to disseminate hate speech against individuals or groups. While the detection of hateful memes is well-researched, developing effective methods to transform hateful content in memes remains a significant challenge. Leveraging the powerful generation and reasoning capabilities of Vision-Language Models (VLMs), we address the tasks of detecting and mitigating hateful content. This paper presents two key contributions: first, a definition-guided prompting technique for detecting hateful memes, and second, a unified framework for mitigating hateful content in memes, named UnHateMeme, which works by replacing hateful textual and/or visual components. With our definition-guided prompts, VLMs achieve impressive performance on the hateful meme detection task. Furthermore, our UnHateMeme framework, integrated with VLMs, demonstrates a strong capability to convert hateful memes into non-hateful forms that meet human-level criteria for hate speech and maintain multimodal coherence between image and text. Through empirical experiments, we show the effectiveness of state-of-the-art pretrained VLMs such as LLaVA, Gemini and GPT-4o on the proposed tasks, providing a comprehensive analysis of their respective strengths and limitations for these tasks. This paper aims to shed light on important applications of VLMs for ensuring safe and respectful online environments.

31. 【2505.00147】AdaptMI: Adaptive Skill-based In-context Math Instruction for Small Language Models

Link: https://arxiv.org/abs/2505.00147

Authors: Yinghui He,Abhishek Panigrahi,Yong Lin,Sanjeev Arora

Categories: Computation and Language (cs.CL)

Keywords: In-context learning, problem-solving capability, capability when provided, provided with suitable, ICL performance

Abstract:In-context learning (ICL) allows a language model to improve its problem-solving capability when provided with suitable information in context. Since the choice of in-context information can be determined based on the problem itself, in-context learning is analogous to human learning from teachers in a classroom. Recent works (Didolkar et al., 2024a; 2024b) show that ICL performance can be improved by leveraging a frontier large language model's (LLM) ability to predict required skills to solve a problem, popularly referred to as an LLM's metacognition, and using the recommended skills to construct necessary in-context examples. While this skill-based strategy boosts ICL performance in larger models, its gains on small language models (SLMs) have been minimal, highlighting a performance gap in ICL capabilities. We investigate this gap and show that skill-based prompting can hurt SLM performance on easy questions by introducing unnecessary information, akin to cognitive overload. To address this, we introduce AdaptMI, an adaptive approach to selecting skill-based in-context Math Instructions for SLMs. Inspired by cognitive load theory from human pedagogy, our method only introduces skill-based examples when the model performs poorly. We further propose AdaptMI+, which adds examples targeted to the specific skills missing from the model's responses. On 5-shot evaluations across popular math benchmarks and five SLMs (1B--7B; Qwen, Llama), AdaptMI+ improves accuracy by up to 6% over naive skill-based strategies.

32. 【2505.00127】Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs

Link: https://arxiv.org/abs/2505.00127

Authors: Jinyan Su,Jennifer Healey,Preslav Nakov,Claire Cardie

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, increasingly optimized, Large, reasoning

Abstract:Large language models (LLMs) are increasingly optimized for long reasoning, under the assumption that more reasoning leads to better performance. However, emerging evidence suggests that longer responses can sometimes degrade accuracy rather than improve it. In this paper, we conduct a systematic empirical study of the relationship between reasoning length and answer correctness. We find that LLMs tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones, failing to extend their reasoning when it is most needed. This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately. Furthermore, we investigate the effects of length reduction with a preference optimization algorithm when simply preferring the shorter responses regardless of answer correctness. Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy. Our findings highlight generation length as a meaningful signal for reasoning behavior and motivate further exploration into LLMs' self-awareness in reasoning length adaptation.

33. 【2505.00114】Fine-Tuning LLMs for Low-Resource Dialect Translation: The Case of Lebanese

Link: https://arxiv.org/abs/2505.00114

Authors: Silvana Yakhni,Ali Chehab

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, low-resource Lebanese dialect, effectiveness of Large, versus larger translated

Abstract:This paper examines the effectiveness of Large Language Models (LLMs) in translating the low-resource Lebanese dialect, focusing on the impact of culturally authentic data versus larger translated datasets. We compare three fine-tuning approaches: basic, contrastive, and grammar-hint tuning, using open-source Aya23 models. Experiments reveal that models fine-tuned on a smaller but culturally aware Lebanese dataset (LW) consistently outperform those trained on larger, non-native data. The best results were achieved through contrastive fine-tuning paired with contrastive prompting, which indicates the benefits of exposing translation models to bad examples. In addition, to ensure authentic evaluation, we introduce LebEval, a new benchmark derived from native Lebanese content, and compare it to the existing FLoRes benchmark. Our findings challenge the "More Data is Better" paradigm and emphasize the crucial role of cultural authenticity in dialectal translation. We made our datasets and code available on GitHub.

34. 【2505.00105】Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques

Link: https://arxiv.org/abs/2505.00105

Authors: Naamán Huerga-Pérez,Rubén Álvarez,Rubén Ferrero-Guillén,Alberto Martínez-Gutiérrez,Javier Díez-González

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Databases (cs.DB)

Keywords: Retrieval-Augmented Generation enhances, Generation enhances language, external knowledge bases, enhances language models, retrieving relevant information

Comments: 13 pages, 9 figures, 1 table

Abstract:Retrieval-Augmented Generation enhances language models by retrieving relevant information from external knowledge bases, relying on high-dimensional vector embeddings typically stored in float32 precision. However, storing these embeddings at scale presents significant memory challenges. To address this issue, we systematically investigate two complementary optimization strategies on the MTEB benchmark: quantization, evaluating standard formats (float16, int8, binary) and low-bit floating-point types (float8), and dimensionality reduction, assessing methods like PCA, Kernel PCA, UMAP, Random Projections and Autoencoders. Our results show that float8 quantization achieves a 4x storage reduction with minimal performance degradation (0.3%), significantly outperforming int8 quantization at the same compression level while being simpler to implement. PCA emerges as the most effective dimensionality reduction technique. Crucially, combining moderate PCA (e.g., retaining 50% dimensions) with float8 quantization offers an excellent trade-off, achieving 8x total compression with less performance impact than using int8 alone (which provides only 4x compression). To facilitate practical application, we propose a methodology based on visualizing the performance-storage trade-off space to identify the optimal configuration that maximizes performance within specific memory constraints.
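
The 4x and 8x figures follow from simple storage arithmetic, sketched below. The helper names and the 1024-dimension example are assumptions for illustration; only the bit widths of the named formats and the 50% retention setting come from the abstract.

```python
# Bits per stored component for each format named in the abstract.
BITS = {"float32": 32, "float16": 16, "float8": 8, "int8": 8, "binary": 1}

def bytes_per_vector(dims, dtype):
    """Storage for one embedding of `dims` components in the given format."""
    return dims * BITS[dtype] / 8

def compression_ratio(orig_dims, dtype, keep_fraction=1.0):
    """Size of the float32 original divided by the size after reduction and quantization.

    keep_fraction models dimensionality reduction (e.g. 0.5 keeps half the dimensions).
    """
    kept_dims = int(orig_dims * keep_fraction)
    return bytes_per_vector(orig_dims, "float32") / bytes_per_vector(kept_dims, dtype)

# Hypothetical 1024-dimensional embedding.
float8_only = compression_ratio(1024, "float8")          # 4.0
pca_float8 = compression_ratio(1024, "float8", 0.5)      # 8.0
int8_only = compression_ratio(1024, "int8")              # 4.0
```

The arithmetic shows why float8 and int8 sit at the same compression level, so the comparison between them in the abstract is purely about retrieval quality and implementation effort.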

35. 【2505.00065】ConSens: Assessing context grounding in open-book question answering

Link: https://arxiv.org/abs/2505.00065

Authors: Ivan Vankov,Matyo Ivanov,Adriana Correia,Victor Botev

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, task requires generating, open-book question answering, demonstrated considerable success

Comments: 9 pages, 3 figures, 3 tables

Abstract:Large Language Models (LLMs) have demonstrated considerable success in open-book question answering (QA), where the task requires generating answers grounded in a provided external context. A critical challenge in open-book QA is to ensure that model responses are based on the provided context rather than its parametric knowledge, which can be outdated, incomplete, or incorrect. Existing evaluation methods, primarily based on the LLM-as-a-judge approach, face significant limitations, including biases, scalability issues, and dependence on costly external systems. To address these challenges, we propose a novel metric that contrasts the perplexity of the model response under two conditions: when the context is provided and when it is not. The resulting score quantifies the extent to which the model's answer relies on the provided context. The validity of this metric is demonstrated through a series of experiments that show its effectiveness in identifying whether a given answer is grounded in the provided context. Unlike existing approaches, this metric is computationally efficient, interpretable, and adaptable to various use cases, offering a scalable and practical solution to assess context utilization in open-book QA systems.
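
The idea of contrasting perplexity with and without context can be sketched as follows. The abstract does not give the exact formula, so the log-ratio used in `grounding_score` is an assumption chosen for illustration, and both function names are hypothetical. The inputs are per-token log-probabilities of the same answer, scored once with the context in the prompt and once without it.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def grounding_score(logprobs_with_ctx, logprobs_without_ctx):
    """Assumed contrast: log of (perplexity without context / perplexity with context).

    Positive values mean the answer is more predictable once the context is
    supplied, which suggests it relies on that context; values near zero
    suggest the model would have produced it from parametric knowledge alone.
    """
    return math.log(perplexity(logprobs_without_ctx) / perplexity(logprobs_with_ctx))
```

Because both perplexities come from a forward pass of the model being evaluated, a score of this kind needs no external judge model, which matches the efficiency argument in the abstract.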

36. 【2505.00063】GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

Link: https://arxiv.org/abs/2505.00063

Authors: Siqi Li,Yufan Shen,Xiangnan Chen,Jiayi Chen,Hengwei Ju,Haodong Duan,Song Mao,Hongbin Zhou,Bo Zhang,Pinlong Cai,Licheng Wen,Botian Shi,Yong Liu,Xinyu Cai,Yu Qiao

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal large language, large language models, creating a wide, General Document Intelligence, rapid advancement

Abstract:The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models' capabilities across various document-specific tasks. However, existing benchmarks often fail to locate specific model weaknesses or guide systematic improvements. To bridge this gap, we introduce a General Document Intelligence Benchmark (GDI-Bench), featuring 1.9k images across 9 key scenarios and 19 document-specific tasks. By decoupling visual complexity and reasoning complexity, the GDI-Bench structures graded tasks that allow performance assessment by difficulty, aiding in model weakness identification and optimization guidance. We evaluate the GDI-Bench on various open-source and closed-source models, conducting decoupled analyses in the visual and reasoning domains. For instance, the GPT-4o model excels in reasoning tasks but exhibits limitations in visual capabilities. To address the diverse tasks and domains in the GDI-Bench, we propose a GDI Model that mitigates the issue of catastrophic forgetting during the supervised fine-tuning (SFT) process through an intelligence-preserving training strategy. Our model achieves state-of-the-art performance on previous benchmarks and the GDI-Bench. Both our benchmark and model will be open source.

37. 【2505.00061】Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems

Link: https://arxiv.org/abs/2505.00061

Authors: Sahar Yarmohammadtoosky,Yiyun Zhou,Victoria Yaneva,Peter Baldwin,Saed Rezayi,Brian Clauser,Polina Harikeo

Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: transformer-based automated short-answer, study examines vulnerabilities, automated short-answer grading, medical education, gaming strategies

Abstract:This study examines vulnerabilities in transformer-based automated short-answer grading systems used in medical education, with a focus on how these systems can be manipulated through adversarial gaming strategies. Our research identifies three main types of gaming strategies that exploit the system's weaknesses, potentially leading to false positives. To counteract these vulnerabilities, we implement several adversarial training methods designed to enhance the systems' robustness. Our results indicate that these methods significantly reduce the susceptibility of grading systems to such manipulations, especially when combined with ensemble techniques like majority voting and ridge regression, which further improve the system's defense against sophisticated adversarial inputs. Additionally, employing large language models such as GPT-4 with varied prompting techniques has shown promise in recognizing and scoring gaming strategies effectively. The findings underscore the importance of continuous improvements in AI-driven educational tools to ensure their reliability and fairness in high-stakes settings.

38. 【2505.00060】Fact-Consistency Evaluation of Text-to-SQL Generation for Business Intelligence Using Exaone 3.5

Link: https://arxiv.org/abs/2505.00060

Authors: Jeho Choi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, enabling natural language, real-world Business Intelligence, shown promise

Comments: 6 pages, 1 table

Abstract:Large Language Models (LLMs) have shown promise in enabling natural language interfaces for structured data querying through text-to-SQL generation. However, their application in real-world Business Intelligence (BI) contexts remains limited due to semantic hallucinations, structural errors, and a lack of domain-specific evaluation frameworks. In this study, we propose a Fact-Consistency Evaluation Framework for assessing the semantic accuracy of LLM-generated SQL outputs using Exaone 3.5--an instruction-tuned, bilingual LLM optimized for enterprise tasks. We construct a domain-specific benchmark comprising 219 natural language business questions across five SQL complexity levels, derived from actual sales data in LG Electronics' internal BigQuery environment. Each question is paired with a gold-standard SQL query and a validated ground-truth answer. We evaluate model performance using answer accuracy, execution success rate, semantic error rate, and non-response rate. Experimental results show that while Exaone 3.5 performs well on simple aggregation tasks (93% accuracy in L1), it exhibits substantial degradation in arithmetic reasoning (4% accuracy in H1) and grouped ranking tasks (31% in H4), with semantic errors and non-responses concentrated in complex cases. Qualitative error analysis further identifies common failure types such as misapplied arithmetic logic, incomplete filtering, and incorrect grouping operations. Our findings highlight the current limitations of LLMs in business-critical environments and underscore the need for fact-consistency validation layers and hybrid reasoning approaches. This work contributes a reproducible benchmark and evaluation methodology for advancing reliable natural language interfaces to structured enterprise data systems.

39. 【2505.00059】BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition

Link: https://arxiv.org/abs/2505.00059

Authors: Paige Tuttösí, Mantaj Dhillon, Luna Sang, Shane Eastwood, Poorvi Bhatia, Quang Minh Dinh, Avni Kapoor, Yewon Jin, Angelica Lim

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: reached human performance, automatic speech recognition, reported metrics, reached human, ASR

Comments: Accepted to Computer Speech and Language, Special issue: Multi-Speaker, Multi-Microphone, and Multi-Modal Distant Speech Recognition (September 2025)

Abstract:Some speech recognition tasks, such as automatic speech recognition (ASR), are approaching or have reached human performance in many reported metrics. Yet, they continue to struggle in complex, real-world situations, such as with distanced speech. Previous challenges have released datasets to address the issue of distanced ASR; however, the focus remains primarily on distance, specifically relying on multi-microphone array systems. Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 hours of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors' homes and therefore includes at least 98 different acoustic environments. The data also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were placed in 19 different positions, including obstructions and being in a different room than the actor. This data is publicly available for use and can be used to evaluate a variety of speech recognition tasks, including: ASR, shout detection, and speech emotion recognition (SER). We provide initial benchmarks for ASR and SER tasks, and find that ASR degrades both with an increase in distance and shout level and shows varied performance depending on the intended emotion. Our results show that the BERSt dataset is challenging for both ASR and SER tasks and continued work is needed to improve the robustness of such systems for more accurate real-world use.

40. 【2505.00057】A Report on the llms evaluating the high school questions

Link: https://arxiv.org/abs/2505.00057

Authors: Zhu Jiawei, Chen Wei

Categories: Computation and Language (cs.CL)

Keywords: large language models, school science questions, aims to evaluate, evaluate the performance, performance of large

Comments:

Abstract:This report aims to evaluate the performance of large language models (LLMs) in solving high school science questions and to explore their potential applications in the educational field. With the rapid development of LLMs in the field of natural language processing, their application in education has attracted widespread attention. This study selected mathematics exam questions from the college entrance examinations (2019-2023) as evaluation data and utilized at least eight LLM APIs to provide answers. A comprehensive assessment was conducted based on metrics such as accuracy, response time, logical reasoning, and creativity. Through an in-depth analysis of the evaluation results, this report reveals the strengths and weaknesses of LLMs in handling high school science questions and discusses their implications for educational practice. The findings indicate that although LLMs perform excellently in certain aspects, there is still room for improvement in logical reasoning and creative problem-solving. This report provides an empirical foundation for further research and application of LLMs in the educational field and offers suggestions for improvement.

41. 【2505.00056】Clustering Internet Memes Through Template Matching and Multi-Dimensional Similarity

Link: https://arxiv.org/abs/2505.00056

Authors: Tygo Bloem, Filip Ilievski

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: virality modeling, toxicity detection, similar Internet memes, critical for toxicity, received little attention

Comments:

Abstract:Meme clustering is critical for toxicity detection, virality modeling, and typing, but it has received little attention in previous research. Clustering similar Internet memes is challenging due to their multimodality, cultural context, and adaptability. Existing approaches rely on databases, overlook semantics, and struggle to handle diverse dimensions of similarity. This paper introduces a novel method that uses template-based matching with multi-dimensional similarity features, thus eliminating the need for predefined databases and supporting adaptive matching. Memes are clustered using local and global features across similarity categories such as form, visual content, text, and identity. Our combined approach outperforms existing clustering methods, producing more consistent and coherent clusters, while similarity-based feature sets enable adaptability and align with human intuition. We make all supporting code publicly available to support subsequent research. Code: this https URL

42. 【2505.00050】Emotional Analysis of Fashion Trends Using Social Media and AI: Sentiment Analysis on Twitter for Fashion Trend Forecasting

Link: https://arxiv.org/abs/2505.00050

Authors: Aayam Bansal, Agneya Tharun

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Twitter data, social media sentiment, social media, Twitter, sentiment

Comments: 13 pages

Abstract:This study explores the intersection of fashion trends and social media sentiment through computational analysis of Twitter data using the T4SA (Twitter for Sentiment Analysis) dataset. By applying natural language processing and machine learning techniques, we examine how sentiment patterns in fashion-related social media conversations can serve as predictors for emerging fashion trends. Our analysis involves the identification and categorization of fashion-related content, sentiment classification with improved normalization techniques, time series decomposition, statistically validated causal relationship modeling, cross-platform sentiment comparison, and brand-specific sentiment analysis. Results indicate correlations between sentiment patterns and fashion theme popularity, with accessories and streetwear themes showing statistically significant rising trends. The Granger causality analysis establishes sustainability and streetwear as primary trend drivers, showing bidirectional relationships with several other themes. The findings demonstrate that social media sentiment analysis can serve as an effective early indicator of fashion trend trajectories when proper statistical validation is applied. Our improved predictive model achieved 78.35% balanced accuracy in sentiment classification, establishing a reliable foundation for trend prediction across positive, neutral, and negative sentiment categories.

43. 【2505.00049】Humanizing LLMs: A Survey of Psychological Measurements with Tools, Datasets, and Human-Agent Applications

Link: https://arxiv.org/abs/2505.00049

Authors: Wenhan Dong, Yuemeng Zhao, Zhen Sun, Yule Liu, Zifan Peng, Jingyi Zheng, Zongmin Zhang, Ziyi Zhang, Jun Wu, Ruiming Wang, Shengmin Xu, Xinyi Huang, Xinlei He

Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: large language models, language models, trustworthy AI alignment, large language, crucial for understanding

Comments: 26 pages, 7 figures

Abstract:As large language models (LLMs) are increasingly used in human-centered tasks, assessing their psychological traits is crucial for understanding their social impact and ensuring trustworthy AI alignment. While existing reviews have covered some aspects of related research, several important areas have not been systematically discussed, including detailed discussions of diverse psychological tests, LLM-specific psychological datasets, and the applications of LLMs with psychological traits. To address this gap, we systematically review six key dimensions of applying psychological theories to LLMs: (1) assessment tools; (2) LLM-specific datasets; (3) evaluation metrics (consistency and stability); (4) empirical findings; (5) personality simulation methods; and (6) LLM-based behavior simulation. Our analysis highlights both the strengths and limitations of current methods. While some LLMs exhibit reproducible personality patterns under specific prompting schemes, significant variability remains across tasks and settings. Recognizing methodological challenges such as mismatches between psychological tools and LLMs' capabilities, as well as inconsistencies in evaluation practices, this study aims to propose future directions for developing more interpretable, robust, and generalizable psychological assessment frameworks for LLMs.

44. 【2505.00047】Base Models Beat Aligned Models at Randomness and Creativity

Link: https://arxiv.org/abs/2505.00047

Authors: Peter West, Christopher Potts

Categories: Computation and Language (cs.CL)

Keywords: human feedback making, LLM development, Alignment has quickly, ingredient in LLM, models act safely

Comments:

Abstract:Alignment has quickly become a default ingredient in LLM development, with techniques such as reinforcement learning from human feedback making models act safely, follow instructions, and perform ever-better on complex tasks. While these techniques are certainly useful, we propose that they should not be universally applied and demonstrate a range of tasks on which base language models consistently outperform their popular aligned forms. Particularly, we study tasks that require unpredictable outputs, such as random number generation, mixed strategy games (rock-paper-scissors and hide-and-seek), and creative writing. In each case, aligned models tend towards narrow behaviors that result in distinct disadvantages, for instance, preferring to generate "7" over other uniformly random numbers, becoming almost fully predictable in some game states, or prioritizing pleasant writing over creative originality. Across models tested, better performance on common benchmarks tends to correlate with worse performance on our tasks, suggesting an effective trade-off in the required capabilities.

45. 【2505.00039】Graph RAG for Legal Norms: A Hierarchical and Temporal Approach

Link: https://arxiv.org/abs/2505.00039

Authors: Hudson de Martim

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Retrieval Augmented Generation, Graph Retrieval Augmented, Augmented Generation, Retrieval Augmented, Graph RAG

Comments:

Abstract:This article proposes an adaptation of Graph Retrieval Augmented Generation (Graph RAG) specifically designed for the analysis and comprehension of legal norms, which are characterized by their predefined hierarchical structure, extensive network of internal and external references and multiple temporal versions. By combining structured knowledge graphs with contextually enriched text segments, Graph RAG offers a promising solution to address the inherent complexity and vast volume of legal data. The integration of hierarchical structure and temporal evolution into knowledge graphs - along with the concept of comprehensive Text Units - facilitates the construction of richer, interconnected representations of legal knowledge. Through a detailed analysis of Graph RAG and its application to legal norm datasets, this article aims to significantly advance the field of Artificial Intelligence applied to Law, creating opportunities for more effective systems in legal research, legislative analysis, and decision support.

46. 【2505.00038】HyPerAlign: Hypotheses-driven Personalized Alignment

Link: https://arxiv.org/abs/2505.00038

Authors: Cristina Garbacea, Chenhao Tan

Categories: Computation and Language (cs.CL)

Keywords: align large language, large language models, human users based, LLM models, real-world use cases

Comments:

Abstract:Alignment algorithms are widely used to align large language models (LLMs) to human users based on preference annotations that reflect their intended real-world use cases. Typically these (often divergent) preferences are aggregated over a diverse set of users, resulting in fine-tuned models that are aligned to the "average-user" preference. Nevertheless, current models are used by individual users in very specific contexts and situations, emphasizing the need for user-dependent preference control. In this work we address the problem of personalizing LLM outputs to their users, aiming to generate customized responses tailored to individual users, instead of generic outputs that emulate the collective voices of diverse populations. We propose a novel interpretable and sample-efficient hypotheses-driven personalization approach (HyPerAlign) where given few-shot examples written by a particular user, we first infer hypotheses about their communication strategies, personality and writing style, then prompt LLM models with these hypotheses and user specific attributes to generate customized outputs. We conduct experiments on two different personalization tasks, authorship attribution and deliberative alignment, with datasets from diverse domains (news articles, blog posts, emails, jailbreaking benchmarks), and demonstrate the superiority of hypotheses-driven personalization approach when compared to preference-based fine-tuning methods. For deliberative alignment, the helpfulness of LLM models is improved by up to 70% on average. For authorship attribution, results indicate consistently high win-rates (commonly 90%) against state-of-the-art preference fine-tuning approaches for LLM personalization across diverse user profiles and LLM models. Overall, our approach represents an interpretable and sample-efficient strategy for the personalization of LLM models to individual users.

47. 【2505.00036】A Framework to Assess the Persuasion Risks Large Language Model Chatbots Pose to Democratic Societies

Link: https://arxiv.org/abs/2505.00036

Authors: Zhongren Chen, Joshua Kalla, Quan Le, Shinpei Nakamura-Sakai, Jasjeet Sekhon, Ruixiao Wang

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Large Language Models, Language Models, Large Language, recent years, significant concern

Comments:

Abstract:In recent years, significant concern has emerged regarding the potential threat that Large Language Models (LLMs) pose to democratic societies through their persuasive capabilities. We expand upon existing research by conducting two survey experiments and a real-world simulation exercise to determine whether it is more cost effective to persuade a large number of voters using LLM chatbots compared to standard political campaign practice, taking into account both the "receive" and "accept" steps in the persuasion process (Zaller 1992). These experiments improve upon previous work by assessing extended interactions between humans and LLMs (instead of using single-shot interactions) and by assessing both short- and long-run persuasive effects (rather than simply asking users to rate the persuasiveness of LLM-produced content). In two survey experiments (N = 10,417) across three distinct political domains, we find that while LLMs are about as persuasive as actual campaign ads once voters are exposed to them, political persuasion in the real world depends on both exposure to a persuasive message and its impact conditional on exposure. Through simulations based on real-world parameters, we estimate that LLM-based persuasion costs between $48-$74 per persuaded voter compared to $100 for traditional campaign methods, when accounting for the costs of exposure. However, it is currently much easier to scale traditional campaign persuasion methods than LLM-based persuasion. While LLMs do not currently appear to have substantially greater potential for large-scale political persuasion than existing non-LLM methods, this may change as LLM capabilities continue to improve and it becomes easier to scalably encourage exposure to persuasive LLMs.

48. 【2505.00035】Linguistic Complexity and Socio-cultural Patterns in Hip-Hop Lyrics

Link: https://arxiv.org/abs/2505.00035

Authors: Aayam Bansal, Raghav Agarwal, Kaashvi Jain

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: comprehensive computational framework, paper presents, presents a comprehensive, comprehensive computational, computational framework

Comments: 12 pages

Abstract:This paper presents a comprehensive computational framework for analyzing linguistic complexity and socio-cultural trends in hip-hop lyrics. Using a dataset of 3,814 songs from 146 influential artists spanning four decades (1980-2020), we employ natural language processing techniques to quantify multiple dimensions of lyrical complexity. Our analysis reveals a 23.7% increase in vocabulary diversity over the study period, with East Coast artists demonstrating 17.3% higher lexical variation than other regions. Rhyme density increased by 34.2% across all regions, with Midwest artists exhibiting the highest technical complexity (3.04 rhymes per line). Topic modeling identified significant shifts in thematic content, with social justice themes decreasing from 28.5% to 13.8% of content while introspective themes increased from 7.6% to 26.3%. Sentiment analysis demonstrated that lyrics became significantly more negative during sociopolitical crises, with polarity decreasing by 0.31 following major social unrest. Multi-dimensional analysis revealed four distinct stylistic approaches that correlate strongly with geographic origin (r=0.68, p<0.001) and time period (r=0.59, p<0.001). These findings establish quantitative evidence for the evolution of hip-hop as both an art form and a reflection of societal dynamics, providing insights into the interplay between linguistic innovation and cultural context in popular music.

49. 【2505.00034】Improving Phishing Email Detection Performance of Small Large Language Models

Link: https://arxiv.org/abs/2505.00034

Authors: Zijie Lin, Zikang Liu, Hanbo Fan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, natural language processing, demonstrated remarkable performance, Large language, phishing email detection

Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable performance on many natural language processing (NLP) tasks and have been employed in phishing email detection research. However, in current studies, well-performing LLMs typically contain billions or even tens of billions of parameters, requiring enormous computational resources. To reduce computational costs, we investigated the effectiveness of small-parameter LLMs for phishing email detection. These LLMs have around 3 billion parameters and can run on consumer-grade GPUs. However, small LLMs often perform poorly on the phishing email detection task. To address this issue, we designed a set of methods including Prompt Engineering, Explanation Augmented Fine-tuning, and Model Ensemble to improve the phishing email detection capabilities of small LLMs. We validated the effectiveness of our approach through experiments, significantly improving accuracy on the SpamAssassin dataset from around 0.5 for baseline models like Qwen2.5-1.5B-Instruct to 0.976.

50. 【2505.00033】From Attention to Atoms: Spectral Dictionary Learning for Fast, Interpretable Language Models

Link: https://arxiv.org/abs/2505.00033

Authors: Andrew Kiruluta

Categories: Computation and Language (cs.CL)

Keywords: time varying Fourier, token mixing coefficients, global time varying, Short Time Fourier, Time Fourier Transform

Comments:

Abstract:We propose a novel spectral generative modeling framework for natural language processing that jointly learns a global time varying Fourier dictionary and per token mixing coefficients, replacing the ubiquitous self attention mechanism in transformer architectures. By enforcing reconstruction losses in both the time domain (embedding reconstruction) and the frequency domain (via Short Time Fourier Transform magnitude matching) alongside a standard language modeling objective, and fitting a Gaussian Mixture Model (GMM) prior over the learned mixing vectors, our approach achieves competitive perplexity and generation quality on standard benchmarks such as WikiText2 and Penn Treebank. In contrast to the quadratic computation complexity of self attention, our method operates with linear complexity, delivering substantial efficiency gains. We demonstrate that spectral dictionary models can achieve competitive performance compared to transformer baselines while significantly reducing inference latency and memory footprint, offering a compelling alternative for scalable language modeling.

51. 【2505.00032】MDD-LLM: Towards Accuracy Large Language Models for Major Depressive Disorder Diagnosis

Link: https://arxiv.org/abs/2505.00032

Authors: Yuyang Sha, Hongxin Pan, Wei Xu, Weiyu Meng, Gang Luo, Xinyu Du, Xiaobing Zhai, Henry H. Y. Tong, Caijuan Shi, Kefeng Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Major depressive disorder, million people worldwide, public health issue, significant public health, Major depressive

Comments:

Abstract:Major depressive disorder (MDD) impacts more than 300 million people worldwide, highlighting a significant public health issue. However, the uneven distribution of medical resources and the complexity of diagnostic methods have resulted in inadequate attention to this disorder in numerous countries and regions. This paper introduces a high-performance MDD diagnosis tool named MDD-LLM, an AI-driven framework that utilizes fine-tuned large language models (LLMs) and extensive real-world samples to tackle challenges in MDD diagnosis. Specifically, we select 274,348 individual records from the UK Biobank cohort and design a tabular data transformation method to create a large corpus for training and evaluating the proposed approach. To illustrate the advantages of MDD-LLM, we perform comprehensive experiments and provide several comparative analyses against existing model-based solutions across multiple evaluation metrics. Experimental results show that MDD-LLM (70B) achieves an accuracy of 0.8378 and an AUC of 0.8919 (95% CI: 0.8799 - 0.9040), significantly outperforming existing machine learning and deep learning frameworks for MDD diagnosis. Given the limited exploration of LLMs in MDD diagnosis, we examine numerous factors that may influence the performance of our proposed method, such as tabular data transformation techniques and different fine-tuning strategies.

52. 【2505.00031】Learning to Plan Before Answering: Self-Teaching LLMs to Learn Abstract Plans for Problem Solving

Link: https://arxiv.org/abs/2505.00031

Authors: Jin Zhang, Flood Sung, Zhilin Yang, Yang Gao, Chongjie Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: synthetic data generated, field of large, utilizing synthetic data, LLM, LEPA

Comments:

Abstract:In the field of large language model (LLM) post-training, the effectiveness of utilizing synthetic data generated by the LLM itself has been well-presented. However, a key question remains unaddressed: what essential information should such self-generated data encapsulate? Existing approaches only produce step-by-step problem solutions, and fail to capture the abstract meta-knowledge necessary for generalization across similar problems. Drawing insights from cognitive science, where humans employ high-level abstraction to simplify complex problems before delving into specifics, we introduce a novel self-training algorithm: LEarning to Plan before Answering (LEPA). LEPA trains the LLM to formulate anticipatory plans, which serve as abstract meta-knowledge for problem-solving, before engaging with the intricacies of problems. This approach not only outlines the solution generation path but also shields the LLM from the distraction of irrelevant details. During data generation, LEPA first crafts an anticipatory plan based on the problem, and then generates a solution that aligns with both the plan and the problem. LEPA refines the plan through self-reflection, aiming to acquire plans that are instrumental in yielding correct solutions. During model optimization, the LLM is trained to predict both the refined plans and the corresponding solutions. By efficiently extracting and utilizing the anticipatory plans, LEPA demonstrates remarkable superiority over conventional algorithms on various challenging natural language reasoning benchmarks.

53. 【2505.00030】Can Language Models Represent the Past without Anachronism?

Link: https://arxiv.org/abs/2505.00030

Authors: Ted Underwood, Laura K. Nelson, Matthew Wilkens

Categories: Computation and Language (cs.CL)

Keywords: risk of anachronism, understand the risk, language models, Abstract, past

Comments:

Abstract:Before researchers can use language models to simulate the past, they need to understand the risk of anachronism. We find that prompting a contemporary model with examples of period prose does not produce output consistent with period style. Fine-tuning produces results that are stylistically convincing enough to fool an automated judge, but human evaluators can still distinguish fine-tuned model outputs from authentic historical text. We tentatively conclude that pretraining on period prose may be required in order to reliably simulate historical perspectives for social research.

54. 【2505.00029】Keep the General, Inject the Specific: Structured Dialogue Fine-Tuning for Knowledge Injection without Catastrophic Forgetting

Link: https://arxiv.org/abs/2505.00029

Authors: Yijie Hong, Xiaofei Yin, Xinzhong Wang, Yi Tu, Ya Guo, Sufeng Duan, Weiqiang Wang, Lingyong Fang, Depeng Wang, Huijia Zhu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Vision Language, Vision Language Models, Large Vision, Vision Language, extensive multimodal pre-training

Comments: 13 pages, 3 figures

Abstract:Large Vision Language Models have demonstrated impressive versatile capabilities through extensive multimodal pre-training, but face significant limitations when incorporating specialized knowledge domains beyond their training distribution. These models struggle with a fundamental dilemma: direct adaptation approaches that inject domain-specific knowledge often trigger catastrophic forgetting of foundational visual-linguistic abilities. We introduce Structured Dialogue Fine-Tuning (SDFT), an approach that effectively injects domain-specific knowledge while minimizing catastrophic forgetting. Drawing inspiration from supervised fine-tuning in LLMs and subject-driven personalization in text-to-image diffusion models, our method employs a three-phase dialogue structure: Foundation Preservation reinforces pre-trained visual-linguistic alignment through caption tasks; Contrastive Disambiguation introduces carefully designed counterfactual examples to maintain semantic boundaries; and Knowledge Specialization embeds specialized information through chain-of-thought reasoning. Experimental results across multiple domains confirm SDFT's effectiveness in balancing specialized knowledge acquisition with general capability retention. Our key contributions include a data-centric dialogue template that balances foundational alignment with targeted knowledge integration, a weighted multi-turn supervision framework, and comprehensive evaluation across diverse knowledge types.

55. 【2505.00028】Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2505.00028

Authors: Pengchao Feng, Ziyang Ma, Wenxi Chen, Yao Li, Sheng Wang, Kai Yu, Xie Chen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: including achieving lower, achieving lower latency, garnered increasing research, increasing research attention, research attention due

Comments:

Abstract:In recent years, end-to-end speech-to-speech (S2S) dialogue systems have garnered increasing research attention due to their advantages over traditional cascaded systems, including achieving lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these end-to-end systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries, eliminating the need for intermediate speech-to-text conversion via techniques like ASR. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. We will release the code and dataset to support reproducibility and promote further research in this area.

56. 【2505.00027】Extracting Abstraction Dimensions by Identifying Syntax Pattern from Texts

Link: https://arxiv.org/abs/2505.00027

Authors: Jian Zhou, Jiazheng Li, Sirui Zhuge, Hai Zhuge

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: discovering subject dimension, efficiently operate texts, automatically discovering subject, dimension, automatically discovering

Comments: 25 pages, 3 figures, 8 tables

Abstract:This paper proposes an approach for automatically discovering the subject dimension, action dimension, object dimension and adverbial dimension from texts in order to operate on texts efficiently and support queries in natural language. The high quality of the trees guarantees that all subjects, actions, objects and adverbials and their subclass relations within texts can be represented. The independence of the trees ensures that there is no redundant representation between trees. The expressiveness of the trees ensures that the majority of sentences can be accessed from each tree and the rest of the sentences can be accessed from at least one tree, so that the tree-based search mechanism can support querying in natural language. Experiments show that the average precision, recall and F1-score of the abstraction trees constructed from the subclass relations of subject, action, object and adverbial are all greater than 80%. Applying the proposed approach to natural language queries demonstrates that different types of question patterns for querying subject or object have high coverage of texts, and that searching multiple trees on subject, action, object and adverbial according to the question pattern can quickly reduce the search space to locate target sentences, which supports precise operations on texts.

57. 【2505.00026】Theory of Mind in Large Language Models: Assessment and Enhancement

Link: https://arxiv.org/abs/2505.00026

Authors: Ruirui Chen, Weifeng Jiang, Chengwei Qin, Cheston Tan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Theory of Mind, human social intelligence, Large Language Models, others' mental states-is, mental states-is fundamental

Comments:

Abstract:Theory of Mind (ToM), the ability to infer and reason about others' mental states, is fundamental to human social intelligence. As Large Language Models (LLMs) become increasingly integrated into daily life, it is crucial to assess and enhance their capacity to interpret and respond to human mental states. In this paper, we review LLMs' ToM capabilities by examining both evaluation benchmarks and the strategies designed to improve them. We focus on widely adopted story-based benchmarks and provide an in-depth analysis of methods aimed at enhancing ToM in LLMs. Furthermore, we outline promising future research directions informed by recent benchmarks and state-of-the-art approaches. Our survey serves as a valuable resource for researchers interested in advancing LLMs' ToM capabilities.

58. 【2505.00025】A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1

Link: https://arxiv.org/abs/2505.00025

Authors: Mingda Zhang,Jianglong Qin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: ChatGPT demonstrating significant, demonstrating significant capabilities, computational resource requirements, actual medical scenarios, deployment environment limitations

Comments: 14 pages, 1 figure

Click to view abstract

Abstract:In recent years, despite foundation models like DeepSeek-R1 and ChatGPT demonstrating significant capabilities in general tasks, professional knowledge barriers, computational resource requirements, and deployment environment limitations have severely hindered their application in actual medical scenarios. Addressing these challenges, this paper proposes an efficient lightweight medical vertical large language model architecture method, systematically solving the lightweight problem of medical large models from three dimensions: knowledge acquisition, model compression, and computational optimization. At the knowledge acquisition level, a knowledge transfer pipeline is designed from the fine-tuned DeepSeek-R1-Distill-70B teacher model to the DeepSeek-R1-Distill-7B student model, and Low-Rank Adaptation (LoRA) technology is adopted to precisely adjust key attention layers. At the model compression level, compression techniques including 4-bit weight quantization are implemented while preserving the core representation ability for medical reasoning. At the computational optimization level, inference optimization techniques such as Flash Attention acceleration and continuous batching are integrated, and a professional prompt template system is constructed to adapt to different types of medical problems. Experimental results on medical question-answering datasets show that the method proposed in this paper maintains professional accuracy while reducing memory consumption by 64.7% and inference latency by 12.4%, providing an effective solution for the application of medical large models in resource-constrained environments such as edge computing devices.

59. 【2505.00024】Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning

Link: https://arxiv.org/abs/2505.00024

Authors: Shaokun Zhang,Yi Dong,Jieyu Zhang,Jan Kautz,Bryan Catanzaro,Andrew Tao,Qingyun Wu,Zhiding Yu,Guilin Liu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Enabling large language, text generation tasks, Enabling large, generation tasks, pivotal strategy

Comments: 13 pages, 4 tables, 5 figures

Click to view abstract

Abstract:Enabling large language models with external tools has become a pivotal strategy for extending their functionality beyond text generation tasks. Prior work typically enhances tool-use abilities by either applying supervised fine-tuning (SFT) to enforce tool-call correctness or distilling reasoning traces from stronger models for SFT. However, both approaches fall short, either omitting reasoning entirely or producing imitative reasoning that limits generalization. Inspired by the success of DeepSeek-R1 in eliciting reasoning through rule-based reinforcement learning, we develop the Nemotron-Research-Tool-N1 series of tool-using language models using a similar training paradigm. Instead of restrictively supervising intermediate reasoning traces distilled from stronger models, Nemotron-Research-Tool-N1 is optimized with a binary reward that evaluates only the structural validity and functional correctness of tool invocations. This lightweight supervision allows the model to autonomously internalize reasoning strategies, without the need for annotated reasoning trajectories. Experiments on the BFCL and API-Bank benchmarks show that Nemotron-Research-Tool-N1-7B and Nemotron-Research-Tool-N1-14B, built on Qwen-2.5-7B/14B-Instruct, achieve state-of-the-art results, outperforming GPT-4o on both evaluations.
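
As a rough illustration of the binary reward described in this abstract, the sketch below returns 1.0 only when every tool call in a model output is structurally valid and matches the expected calls, and 0.0 otherwise. The `<tool_call>` tag format, the `name`/`arguments` schema, and the function name `binary_tool_reward` are assumptions made for this example; the abstract does not specify the paper's actual output format or reward code.

```python
import json
import re

# Hypothetical output format: each call is wrapped as <tool_call>{...}</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def binary_tool_reward(model_output, expected_calls):
    """Return 1.0 if all tool calls are well-formed and match, else 0.0."""
    blocks = TOOL_CALL_RE.findall(model_output)
    if not blocks:
        return 0.0  # structural check: no tool call found at all
    try:
        calls = [json.loads(block) for block in blocks]
    except json.JSONDecodeError:
        return 0.0  # structural check: payload is not valid JSON
    for call in calls:
        if not isinstance(call, dict) or set(call) != {"name", "arguments"}:
            return 0.0  # structural check: wrong schema
    # Functional check: same calls as expected, ignoring call and key order.
    def canonical(call):
        return json.dumps(call, sort_keys=True)

    predicted = sorted(canonical(c) for c in calls)
    expected = sorted(canonical(c) for c in expected_calls)
    return 1.0 if predicted == expected else 0.0


expected = [{"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}]
good = 'I need the forecast. <tool_call>{"name": "get_weather", "arguments": {"unit": "C", "city": "Paris"}}</tool_call>'
bad_args = '<tool_call>{"name": "get_weather", "arguments": {"city": "Rome", "unit": "C"}}</tool_call>'
malformed = '<tool_call>{"name": "get_weather", "arguments": </tool_call>'
```

Because the reward inspects only the final tool invocation, any free-form reasoning text placed before the `<tool_call>` tag is neither rewarded nor penalized directly in this sketch.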

60. 【2505.00023】CORG: Generating Answers from Complex, Interrelated Contexts

Link: https://arxiv.org/abs/2505.00023

Authors: Hyunji Lee,Franck Dernoncourt,Trung Bui,Seunghyun Yoon

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: knowledge frequently recurs, outdated information, real-world corpus, knowledge frequently, leading to complex

Comments: published at Findings of NAACL 2025

Click to view abstract

Abstract:In a real-world corpus, knowledge frequently recurs across documents but often contains inconsistencies due to ambiguous naming, outdated information, or errors, leading to complex interrelationships between contexts. Previous research has shown that language models struggle with these complexities, typically focusing on single factors in isolation. We classify these relationships into four types: distracting, ambiguous, counterfactual, and duplicated. Our analysis reveals that no single approach effectively addresses all these interrelationships simultaneously. Therefore, we introduce Context Organizer (CORG), a framework that organizes multiple contexts into independently processed groups. This design allows the model to efficiently find all relevant answers while ensuring disambiguation. CORG consists of three key components: a graph constructor, a reranker, and an aggregator. Our results demonstrate that CORG balances performance and efficiency effectively, outperforming existing grouping methods and achieving comparable results to more computationally intensive, single-context approaches.

61. 【2505.00022】Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation

Link: https://arxiv.org/abs/2505.00022

Authors: Thomas F Burns,Letitia Parcalabescu,Stephan Wäldchen,Michael Barlow,Gregor Ziegltrum,Volker Stampa,Bastian Harren,Björn Deiseroth

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Scaling data quantity, large language models, Scaling data, training efficiency, quantity is essential

Comments: 10 pages, 3 figures

Click to view abstract

Abstract:Scaling data quantity is essential for large language models (LLMs), yet recent findings show that data quality can significantly boost performance and training efficiency. We introduce a German-language dataset curation pipeline that combines heuristic and model-based filtering techniques with synthetic data generation. We use our pipeline to create Aleph-Alpha-GermanWeb, a large-scale German pre-training dataset which draws from: (1) Common Crawl web data, (2) FineWeb2, and (3) synthetically-generated data conditioned on actual, organic web data. We evaluate our dataset by pre-training both a 1B Llama-style model and an 8B tokenizer-free hierarchical autoregressive transformer (HAT). A comparison on German-language benchmarks, including MMMLU, shows significant performance gains of Aleph-Alpha-GermanWeb over FineWeb2 alone. This advantage holds at the 8B scale even when FineWeb2 is enriched by human-curated high-quality data sources such as Wikipedia. Our findings support the growing body of evidence that model-based data curation and synthetic data generation can significantly enhance LLM pre-training datasets.

62. 【2505.00021】Ustnlp16 at SemEval-2025 Task 9: Improving Model Performance through Imbalance Handling and Focal Loss

Link: https://arxiv.org/abs/2505.00021

Authors: Zhuoang Cai,Zhenghao Li,Yang Liu,Liyuan Guo,Yangqiu Song

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: food hazard detection, overlapping semantic categories, hazard detection due, imbalanced data distribution, food hazard

Comments:

Click to view abstract

Abstract:Classification tasks often suffer from imbalanced data distribution, which presents challenges in food hazard detection due to severe class imbalances, short and unstructured text, and overlapping semantic categories. In this paper, we present our system for SemEval-2025 Task 9: Food Hazard Detection, which addresses these issues by applying data augmentation techniques to improve classification performance. We utilize transformer-based models, BERT and RoBERTa, as backbone classifiers and explore various data balancing strategies, including random oversampling, Easy Data Augmentation (EDA), and focal loss. Our experiments show that EDA effectively mitigates class imbalance, leading to significant improvements in accuracy and F1 scores. Furthermore, combining focal loss with oversampling and EDA further enhances model robustness, particularly for hard-to-classify examples. These findings contribute to the development of more effective NLP-based classification models for food hazard detection.
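
To make the focal loss mentioned in this abstract concrete, here is a minimal sketch of the standard per-example formulation, FL = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The default values `gamma=2.0` and `alpha=1.0`, the function names, and the toy probabilities are assumptions for illustration; the abstract does not state the hyperparameters the authors used.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for a single example.

    probs  -- predicted class probabilities (should sum to 1)
    target -- index of the true class
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(probs, target):
    """Plain cross-entropy is the gamma = 0 special case of focal loss."""
    return focal_loss(probs, target, gamma=0.0)


# Toy predictions for true class 1: one confident, one uncertain.
easy = [0.05, 0.90, 0.05]
hard = [0.60, 0.30, 0.10]

easy_ratio = focal_loss(easy, 1) / cross_entropy(easy, 1)  # (1 - 0.9)^2
hard_ratio = focal_loss(hard, 1) / cross_entropy(hard, 1)  # (1 - 0.3)^2
```

The modulating factor (1 - p_t)^gamma shrinks the loss of the confidently classified example far more than that of the uncertain one, which is why focal loss shifts training effort toward hard-to-classify examples.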

63. 【2505.00020】Beyond Public Access in LLM Pre-Training Data

Link: https://arxiv.org/abs/2505.00020

Authors: Sruly Rosenblat,Tim O'Reilly,Ilan Strauss

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: legally obtained dataset, DE-COP membership inference, membership inference attack, inference attack method, copyrighted O'Reilly Media

Comments:

Click to view abstract

Abstract:Using a legally obtained dataset of 34 copyrighted O'Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI's large language models were trained on copyrighted content without consent. Our AUROC scores show that GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content (AUROC = 82%), compared to OpenAI's earlier model GPT-3.5 Turbo. In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples. GPT-4o Mini, as a much smaller model, shows no knowledge of public or non-public O'Reilly Media content when tested (AUROC ≈ 50%). Testing multiple models, with the same cutoff date, helps us account for potential language shifts over time that might bias our findings. These results highlight the urgent need for increased corporate transparency regarding pre-training data sources as a means to develop formal licensing frameworks for AI content training.
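
For readers unfamiliar with the metric quoted in this abstract, the sketch below computes AUROC as the probability that a randomly chosen positive (here, a passage assumed to be in the training data) receives a higher score than a randomly chosen negative, with ties counted as one half. The scores are made-up toy values and the function name `auroc` is hypothetical; this is not the DE-COP procedure itself, only the evaluation metric.

```python
def auroc(pos_scores, neg_scores):
    """Pairwise AUROC: P(score_pos > score_neg) + 0.5 * P(tie)."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy membership scores (assumed values, not from the paper).
member_scores = [0.9, 0.4]
non_member_scores = [0.5, 0.1]
score = auroc(member_scores, non_member_scores)  # 3 of 4 pairs ranked correctly
```

Under this definition a value near 50% means the scores do not separate the two groups better than chance, which is how the abstract interprets the GPT-4o Mini result.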

64. 【2505.00019】An Empirical Study on Prompt Compression for Large Language Models

Link: https://arxiv.org/abs/2505.00019

Authors: Zheng Zhang,Jinyi Li,Yihuai Lan,Xiang Wang,Hao Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: enables Large Language, Large Language Models, engineering enables Large, Large Language, enables Large

Comments: Accepted by Building Trust Workshop at ICLR 2025

Click to view abstract

Abstract:Prompt engineering enables Large Language Models (LLMs) to perform a variety of tasks. However, lengthy prompts significantly increase computational complexity and economic costs. To address this issue, we study six prompt compression methods for LLMs, aiming to reduce prompt length while maintaining LLM response quality. In this paper, we present a comprehensive analysis covering aspects such as generation performance, model hallucinations, efficacy in multimodal tasks, word omission analysis, and more. We evaluate these methods across 13 datasets, including news, scientific articles, commonsense QA, math QA, long-context QA, and VQA datasets. Our experiments reveal that prompt compression has a greater impact on LLM performance in long contexts compared to short ones. In the Longbench evaluation, moderate compression even enhances LLM performance. Our code and data are available at this https URL.

65. 【2505.00017】ReCellTy: Domain-specific knowledge graph retrieval-augmented LLMs workflow for single-cell annotation

Link: https://arxiv.org/abs/2505.00017

Authors: Dezheng Han,Yibin Jia,Ruxiao Chen,Wenjie Han,Shuaishuai Guo,Jianbo Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

Keywords: large language models, fully automated cell, graph structured feature, structured feature marker, feature marker database

Comments:

Click to view abstract

Abstract:To enable precise and fully automated cell type annotation with large language models (LLMs), we developed a graph structured feature marker database to retrieve entities linked to differential genes for cell reconstruction. We further designed a multi task workflow to optimize the annotation process. Compared to general purpose LLMs, our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across 11 tissue types, while more closely aligning with the cognitive logic of manual annotation.

66. 【2505.00016】Sparks of Tabular Reasoning via Text2SQL Reinforcement Learning

Link: https://arxiv.org/abs/2505.00016

Authors: Josefa Lia Stoisser,Marc Boubnovski Martell,Julien Fauqueur

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: teaching large language, manipulate tabular data, large language models, query generation, work reframes

Comments:

Click to view abstract

Abstract:This work reframes the Text-to-SQL task as a pathway for teaching large language models (LLMs) to reason over and manipulate tabular data--moving beyond the traditional focus on query generation. We propose a two-stage framework that leverages SQL supervision to develop transferable table reasoning capabilities. First, we synthesize detailed chain-of-thought (CoT) traces from real-world SQL queries, providing step-by-step, clause-level supervision that teaches the model how to traverse, filter, and aggregate table fields. Second, we introduce a Group Relative Policy Optimization (GRPO) reinforcement learning objective that connects SQL execution accuracy to generalizable reasoning by encouraging steps that extend beyond task-specific syntax and transfer across datasets. Empirically, our approach improves performance on standard Text-to-SQL benchmarks and achieves substantial gains on reasoning-intensive datasets such as BIRD and CRT-QA, demonstrating enhanced generalization and interpretability. Specifically, the distilled-quantized LLaMA model achieved a 20% increase in accuracy when trained on Text-to-SQL tasks, while Qwen achieved a 5% increase. These results suggest that SQL can serve not only as a target formalism but also as an effective scaffold for learning robust, transferable reasoning over structured data.

67. 【2505.00015】Design and Application of Multimodal Large Language Model Based System for End to End Automation of Accident Dataset Generation

Link: https://arxiv.org/abs/2505.00015

Authors: MD Thamed Bin Zaman Chowdhury,Moazzem Hossain

Categories: Computation and Language (cs.CL)

Keywords: traffic accidents remain, Large Language Models, socio-economic issue, issue in developing, developing countries

Comments: Shortened the abstract to fit within 1920 characters. This paper is currently under review in the Elsevier journal 'Accident Analysis & Prevention'

Click to view abstract

Abstract:Road traffic accidents remain a major public safety and socio-economic issue in developing countries like Bangladesh. Existing accident data collection is largely manual, fragmented, and unreliable, resulting in underreporting and inconsistent records. This research proposes a fully automated system using Large Language Models (LLMs) and web scraping techniques to address these challenges. The pipeline consists of four components: automated web scraping code generation, news collection from online sources, accident news classification with structured data extraction, and duplicate removal. The system uses the multimodal generative LLM Gemini-2.0-Flash for seamless automation. The code generation module classifies webpages into pagination, dynamic, or infinite scrolling categories and generates suitable Python scripts for scraping. LLMs also classify and extract key accident information such as date, time, location, fatalities, injuries, road type, vehicle types, and pedestrian involvement. A deduplication algorithm ensures data integrity by removing duplicate reports. The system scraped 14 major Bangladeshi news sites over 111 days (Oct 1, 2024 - Jan 20, 2025), processing over 15,000 news articles and identifying 705 unique accidents. The code generation module achieved 91.3% calibration and 80% validation accuracy. Chittagong reported the highest number of accidents (80), fatalities (70), and injuries (115), followed by Dhaka, Faridpur, Gazipur, and Cox's Bazar. Peak accident times were morning (8-9 AM), noon (12-1 PM), and evening (6-7 PM). A public repository was also developed with usage instructions. This study demonstrates the viability of an LLM-powered, scalable system for accurate, low-effort accident data collection, providing a foundation for data-driven road safety policymaking in Bangladesh.

68. 【2505.00014】Manifold-Constrained Sentence Embeddings via Triplet Loss: Projecting Semantics onto Spheres, Tori, and Möbius Strips

Link: https://arxiv.org/abs/2505.00014

Authors: Vinit K. Chavan

Categories: Computation and Language (cs.CL)

Keywords: Recent advances, emphasized the role, geometry in capturing, unconstrained Euclidean spaces, Recent

Comments: 10 pages, 6 figures. Code available at [this https URL](https://github.com/vinitchavan/manifold-embedding-nlp)

Click to view abstract

Abstract:Recent advances in representation learning have emphasized the role of embedding geometry in capturing semantic structure. Traditional sentence embeddings typically reside in unconstrained Euclidean spaces, which may limit their ability to reflect complex relationships in language. In this work, we introduce a novel framework that constrains sentence embeddings to lie on continuous manifolds -- specifically the unit sphere, torus, and Möbius strip -- using triplet loss as the core training objective. By enforcing differential geometric constraints on the output space, our approach encourages the learning of embeddings that are both discriminative and topologically structured. We evaluate our method on benchmark datasets (AG News and MBTI) and compare it to classical baselines including TF-IDF, Word2Vec, and unconstrained Keras-derived embeddings. Our results demonstrate that manifold-constrained embeddings, particularly those projected onto spheres and Möbius strips, significantly outperform traditional approaches in both clustering quality (Silhouette Score) and classification performance (Accuracy). These findings highlight the value of embedding in manifold space -- where topological structure complements semantic separation -- offering a new and mathematically grounded direction for geometric representation learning in NLP.
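
The idea of combining a manifold constraint with triplet loss can be sketched for the simplest case in this abstract, the unit sphere: embeddings are normalized to unit length before the triplet loss max(0, d(a, p) - d(a, n) + margin) is computed. The Euclidean distance, the margin of 0.5, the function names, and the 2-D toy vectors are assumptions for illustration; the torus and Möbius strip projections from the paper are not covered here.

```python
import math


def project_to_sphere(v):
    """Rescale a vector to unit length so it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto the sphere")
    return [x / norm for x in v]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet loss on sphere-projected embeddings."""
    a, p, n = (project_to_sphere(x) for x in (anchor, positive, negative))
    return max(0.0, euclidean(a, p) - euclidean(a, n) + margin)


# Toy 2-D "embeddings": the positive points roughly the same way as the anchor.
anchor, positive, negative = [1.0, 0.0], [2.0, 0.1], [0.0, 3.0]
satisfied = triplet_loss(anchor, positive, negative)  # margin already met
violated = triplet_loss(anchor, negative, positive)   # roles swapped
```

After projection only the direction of each vector matters, so the raw magnitudes of `positive` and `negative` above have no effect on the loss.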

69. 【2505.00013】Performance Evaluation of Emotion Classification in Japanese Using RoBERTa and DeBERTa

Link: https://arxiv.org/abs/2505.00013

Authors: Yoichi Takenaka

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Background Practical applications, Background Practical, Practical applications, customer-feedback analysis require, analysis require accurate

Comments: 14 pages, 3 tables, 3 appendices. Submitted to New Generation Computing. Includes comparisons between fine-tuned PLMs and LLMs on Japanese emotion classification. Code available at [this https URL](https://pypi.org/project/deberta-emotion-predictor/)

Click to view abstract

Abstract:Background: Practical applications such as social media monitoring and customer-feedback analysis require accurate emotion detection for Japanese text, yet resource scarcity and class imbalance hinder model performance. Objective: This study aims to build a high-accuracy model for predicting the presence or absence of eight Plutchik emotions in Japanese sentences. Methods: Using the WRIME corpus, we transform reader-averaged intensity scores into binary labels and fine-tune four pre-trained language models (BERT, RoBERTa, DeBERTa-v3-base, DeBERTa-v3-large). For context, we also assess two large language models (TinySwallow-1.5B-Instruct and ChatGPT-4o). Accuracy and F1-score serve as evaluation metrics. Results: DeBERTa-v3-large attains the best mean accuracy (0.860) and F1-score (0.662), outperforming all other models. It maintains robust F1 across both high-frequency emotions (e.g., Joy, Anticipation) and low-frequency emotions (e.g., Anger, Trust). The LLMs lag, with ChatGPT-4o and TinySwallow-1.5B-Instruct scoring 0.527 and 0.292 in mean F1, respectively. Conclusion: The fine-tuned DeBERTa-v3-large model currently offers the most reliable solution for binary emotion classification in Japanese. We release this model as a pip-installable package (pip install deberta-emotion-predictor). Future work should augment data for rare emotions, reduce model size, and explore prompt engineering to improve LLM performance. This manuscript is under review for possible publication in New Generation Computing.

Information Retrieval

1. 【2505.00649】Investigating Task Arithmetic for Zero-Shot Information Retrieval

Link: https://arxiv.org/abs/2505.00649

Authors: Marco Braga,Pranav Kasela,Alessandro Raganato,Gabriella Pasi

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Natural Language Processing, Language Processing tasks, Large Language Models, Large Language, Natural Language

Comments: Accepted in SIGIR '25

Click to view abstract

2. 【2505.00560】Efficient Recommendation with Millions of Items by Dynamic Pruning of Sub-Item Embeddings

Link: https://arxiv.org/abs/2505.00560

Authors: Aleksandr V. Petrov,Craig Macdonald,Nicola Tonellotto

Categories: Information Retrieval (cs.IR)

Keywords: deploying modern sequential, modern sequential recommender, increases inference latency, sequential recommender models, top highest-scored items

Comments: Accepted as a full research paper at SIGIR 2025

Click to view abstract

Abstract:A large item catalogue is a major challenge for deploying modern sequential recommender models, since it makes the memory footprint of the model large and increases inference latency. One promising approach to address this is RecJPQ, which replaces item embeddings with sub-item embeddings. However, slow inference remains problematic because finding the top highest-scored items usually requires scoring all items in the catalogue, which may not be feasible for large catalogues. By adapting dynamic pruning concepts from document retrieval, we propose the RecJPQPrune dynamic pruning algorithm to efficiently find the top highest-scored items without computing the scores of all items in the catalogue. Our RecJPQPrune algorithm is safe-up-to-rank K since it theoretically guarantees that no potentially high-scored item is excluded from the final top K recommendation list, thereby ensuring no impact on effectiveness. Our experiments on two large datasets and three recommendation models demonstrate the efficiency achievable using RecJPQPrune: for instance, on the Tmall dataset with 2.2M items, we can reduce the median model scoring time by 64 times compared to the Transformer Default baseline, and 5.3 times compared to a recent scoring approach called PQTopK. Overall, this paper demonstrates the effective and efficient inference of Transformer-based recommendation models at catalogue scales not previously reported in the literature. Indeed, our RecJPQPrune algorithm can score 2 million items in under 10 milliseconds without GPUs, and without relying on Approximate Nearest Neighbour (ANN) techniques.

3. 【2505.00552】Graph Spectral Filtering with Chebyshev Interpolation for Recommendation

Link: https://arxiv.org/abs/2505.00552

Authors: Chanwoo Kim,Jinkyu Sung,Yebonn Han,Joonseok Lee

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: recently gained prominence, Graph convolutional networks, convolutional networks, networks have recently, recently gained

Comments: Accepted by SIGIR 2025; 11 pages, 9 figures, 5 tables

Click to view abstract

Abstract:Graph convolutional networks have recently gained prominence in collaborative filtering (CF) for recommendations. However, we identify potential bottlenecks in two foundational components. First, the embedding layer leads to a latent space with limited capacity, overlooking locally observed but potentially valuable preference patterns. Also, the widely-used neighborhood aggregation is limited in its ability to leverage diverse preference patterns in a fine-grained manner. Building on spectral graph theory, we reveal that these limitations stem from graph filtering with a cut-off in the frequency spectrum and a restricted linear form. To address these issues, we introduce ChebyCF, a CF framework based on graph spectral filtering. Instead of a learned embedding, it takes a user's raw interaction history to utilize the full spectrum of signals contained in it. Also, it adopts Chebyshev interpolation to effectively approximate a flexible non-linear graph filter, and further enhances it by using an additional ideal pass filter and degree-based normalization. Through extensive experiments, we verify that ChebyCF overcomes the aforementioned bottlenecks and achieves state-of-the-art performance across multiple benchmarks and reasonably fast inference. Our code is available at this https URL.
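
As background for the Chebyshev interpolation mentioned in this abstract, the sketch below approximates a scalar function on [-1, 1] by a polynomial fitted at Chebyshev nodes and evaluates it with the recurrence T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x). This is a one-dimensional illustration only: the function names and the choice of `math.exp` as the target are assumptions, and the paper's graph filter, its ideal pass filter, and its normalization are not reproduced here.

```python
import math


def chebyshev_coefficients(f, degree):
    """Coefficients c_k such that sum_k c_k * T_k(x) interpolates f at Chebyshev nodes."""
    n = degree + 1
    nodes = [math.cos(math.pi * (j + 0.5) / n) for j in range(n)]
    values = [f(x) for x in nodes]
    coeffs = []
    for k in range(n):
        # T_k evaluated at node j is cos(pi * k * (j + 0.5) / n).
        s = sum(values[j] * math.cos(math.pi * k * (j + 0.5) / n) for j in range(n))
        coeffs.append((1.0 if k == 0 else 2.0) * s / n)
    return coeffs


def chebyshev_eval(coeffs, x):
    """Evaluate sum_k c_k * T_k(x) using the three-term recurrence."""
    t_prev, t_curr = 1.0, x  # T_0(x), T_1(x)
    total = coeffs[0]
    for c in coeffs[1:]:
        total += c * t_curr
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return total


coeffs = chebyshev_coefficients(math.exp, degree=10)
grid = [-1.0 + 0.01 * i for i in range(201)]
max_error = max(abs(chebyshev_eval(coeffs, x) - math.exp(x)) for x in grid)
```

In spectral graph methods the same recurrence is typically applied to a rescaled graph operator instead of a scalar `x`, so a non-linear filter can be approximated without an explicit eigendecomposition; whether ChebyCF does exactly this is not stated in the abstract.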

4. 【2505.00263】EnronQA: Towards Personalized RAG over Private Documents

Link: https://arxiv.org/abs/2505.00263

Authors: Michael J. Ryan,Danmei Xu,Chris Nivera,Daniel Campos

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Retrieval Augmented Generation, Augmented Generation, large language models, bringing knowledge-intensive context, bring local context

Comments: 26 pages, 4 figures, 6 tables

Click to view abstract

5. 【2505.00105】Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques

Link: https://arxiv.org/abs/2505.00105

Authors: Naamán Huerga-Pérez,Rubén Álvarez,Rubén Ferrero-Guillén,Alberto Martínez-Gutiérrez,Javier Díez-González

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Databases (cs.DB)

Keywords: Retrieval-Augmented Generation enhances, Generation enhances language, external knowledge bases, enhances language models, retrieving relevant information

Comments: 13 pages, 9 figures, 1 table

Click to view abstract

6. 【2505.00056】Clustering Internet Memes Through Template Matching and Multi-Dimensional Similarity

Link: https://arxiv.org/abs/2505.00056

Authors: Tygo Bloem,Filip Ilievski

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: virality modeling, toxicity detection, similar Internet memes, critical for toxicity, received little attention

Comments:

Click to view abstract

7. 【2505.00039】Graph RAG for Legal Norms: A Hierarchical and Temporal Approach

Link: https://arxiv.org/abs/2505.00039

Authors: Hudson de Martim

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Retrieval Augmented Generation, Graph Retrieval Augmented, Augmented Generation, Retrieval Augmented, Graph RAG

Comments:

Click to view abstract

8. 【2505.00028】Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2505.00028

Authors: Pengchao Feng,Ziyang Ma,Wenxi Chen,Yao Li,Sheng Wang,Kai Yu,Xie Chen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: including achieving lower, achieving lower latency, garnered increasing research, increasing research attention, research attention due

Comments:

Click to view abstract

Computer Vision

1. 【2505.00704】Controllable Weather Synthesis and Removal with Video Diffusion Models

Link: https://arxiv.org/abs/2505.00704

Authors: Chih-Hao Lin,Zian Wang,Ruofan Liang,Yuxuan Zhang,Sanja Fidler,Shenlong Wang,Zan Gojcic

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generating realistic, controllable weather effects, realistic and controllable, Generating, weather

Comments:

Click to view abstract

Abstract:Generating realistic and controllable weather effects in videos is valuable for many applications. Physics-based weather simulation requires precise reconstructions that are hard to scale to in-the-wild videos, while current video editing often lacks realism and control. In this work, we introduce WeatherWeaver, a video diffusion model that synthesizes diverse weather effects -- including rain, snow, fog, and clouds -- directly into any input video without the need for 3D modeling. Our model provides precise control over weather effect intensity and supports blending various weather types, ensuring both realism and adaptability. To overcome the scarcity of paired training data, we propose a novel data strategy combining synthetic videos, generative image editing, and auto-labeled real-world videos. Extensive evaluations show that our method outperforms state-of-the-art methods in weather simulation and removal, providing high-quality, physically plausible, and scene-identity-preserving results over various real-world videos.

2. 【2505.00703】T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

Link: https://arxiv.org/abs/2505.00703

Authors: Dongzhi Jiang,Ziyu Guo,Renrui Zhang,Zhuofan Zong,Hao Li,Le Zhuo,Shilin Yan,Pheng-Ann Heng,Hongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Recent advancements, large language models, reinforcement learning, advancements in large, large language

Comments: Project Page: [this https URL](https://github.com/CaraJ7/T2I-R1)

Click to view abstract

3. 【2505.00702】RayZer: A Self-supervised Large View Synthesis Model

Link: https://arxiv.org/abs/2505.00702

Authors: Hanwen Jiang,Hao Tan,Peng Wang,Haian Jin,Yue Zhao,Sai Bi,Kai Zhang,Fujun Luan,Kalyan Sunkavalli,Qixing Huang,Georgios Pavlakos

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision model trained, Vision model, Vision, camera, scene geometry

Comments:

Click to view abstract

Abstract:We present RayZer, a self-supervised multi-view 3D Vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. During training, RayZer relies solely on its self-predicted camera poses to render target views, eliminating the need for any ground-truth camera annotations and allowing RayZer to be trained with 2D image supervision. The emerging 3D awareness of RayZer is attributed to two key factors. First, we design a self-supervised framework, which achieves 3D-aware auto-encoding of input images by disentangling camera and scene representations. Second, we design a transformer-based model in which the only 3D prior is the ray structure, connecting camera, pixel, and scene simultaneously. RayZer demonstrates comparable or even superior novel view synthesis performance than "oracle" methods that rely on pose annotations in both training and testing. Project: this https URL

4. 【2505.00693】Robotic Visual Instruction

Link: https://arxiv.org/abs/2505.00693

Authors: Yanbang Li, Ziyang Gong, Haoyang Li, Haoyang Li, Xiaoqi Huang, Haolan Kang, Guangping Bai, Xianzheng Ma

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural language, human-robot interaction, Visual Instruction Embodied, primary medium, medium for human-robot

Comments:

Abstract:Recently, natural language has been the primary medium for human-robot interaction. However, its inherent lack of spatial precision for robotic control introduces challenges such as ambiguity and verbosity. To address these limitations, we introduce the Robotic Visual Instruction (RoVI), a novel paradigm to guide robotic tasks through an object-centric, hand-drawn symbolic representation. RoVI effectively encodes spatial-temporal information into human-interpretable visual instructions through 2D sketches, utilizing arrows, circles, colors, and numbers to direct 3D robotic manipulation. To enable robots to understand RoVI better and generate precise actions based on RoVI, we present Visual Instruction Embodied Workflow (VIEW), a pipeline formulated for RoVI-conditioned policies. This approach leverages Vision-Language Models (VLMs) to interpret RoVI inputs, decode spatial and temporal constraints from 2D pixel space via keypoint extraction, and then transform them into executable 3D action sequences. We additionally curate a specialized dataset of 15K instances to fine-tune small VLMs for edge deployment, enabling them to effectively learn RoVI capabilities. Our approach is rigorously validated across 11 novel tasks in both real and simulated environments, demonstrating significant generalization capability. Notably, VIEW achieves an 87.5% success rate in real-world scenarios involving unseen tasks that feature multi-step actions, with disturbances, and trajectory-following requirements. Code and Datasets in this paper will be released soon.

5. 【2505.00690】Towards Autonomous Micromobility through Scalable Urban Simulation

Link: https://arxiv.org/abs/2505.00690

Authors: Wayne Wu, Honglin He, Chaoyuan Zhang, Jack He, Seth Z. Zhao, Ran Gong, Quanyi Li, Bolei Zhou

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: utilizes lightweight mobile, lightweight mobile machines, mobile machines moving, urban public spaces, mobility scooters

Comments: CVPR 2025 Highlight. Project page: [this https URL](https://metadriverse.github.io/urban-sim/)

Abstract:Micromobility, which utilizes lightweight mobile machines moving in urban public spaces, such as delivery robots and mobility scooters, emerges as a promising alternative to vehicular mobility. Current micromobility depends mostly on human manual operation (in-person or remote control), which raises safety and efficiency concerns when navigating busy urban environments full of unpredictable obstacles and pedestrians. Assisting humans with AI agents in maneuvering micromobility devices presents a viable solution for enhancing safety and efficiency. In this work, we present a scalable urban simulation solution to advance autonomous micromobility. First, we build URBAN-SIM - a high-performance robot learning platform for large-scale training of embodied agents in interactive urban scenes. URBAN-SIM contains three critical modules: Hierarchical Urban Generation pipeline, Interactive Dynamics Generation strategy, and Asynchronous Scene Sampling scheme, to improve the diversity, realism, and efficiency of robot learning in simulation. Then, we propose URBAN-BENCH - a suite of essential tasks and benchmarks to gauge various capabilities of the AI agents in achieving autonomous micromobility. URBAN-BENCH includes eight tasks based on three core skills of the agents: Urban Locomotion, Urban Navigation, and Urban Traverse. We evaluate four robots with heterogeneous embodiments, such as the wheeled and legged robots, across these tasks. Experiments on diverse terrains and urban structures reveal each robot's strengths and limitations.

6. 【2505.00684】Visual Test-time Scaling for GUI Agent Grounding

Link: https://arxiv.org/abs/2505.00684

Authors: Tiange Luo, Lajanugen Logeswaran, Justin Johnson, Honglak Lee

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Vision Language Model, Language Model Agents, visual test-time scaling, Vision Language, Language Model

Comments:

Abstract:We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. Understanding webpages is challenging due to the visual complexity of GUI images and the large number of interface elements, making accurate action selection difficult. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. To support this process, we propose an image-as-map mechanism that visualizes key landmarks at each step, providing a transparent action record and enabling the agent to effectively choose among action candidates. Even with a simple region selection strategy, we observe significant performance gains of 28+% on ScreenSpot-Pro and 24+% on WebVoyager benchmarks on top of two state-of-the-art open vision language model agents, UI-TARS and Qwen2.5-VL, highlighting the effectiveness of visual test-time scaling in interactive settings. We achieve a new state-of-the-art grounding performance of 61.6% on the ScreenSpot-Pro benchmark by applying RegionFocus to a Qwen2.5-VL-72B model. Our code will be released publicly at this https URL.

7. 【2505.00681】MINERVA: Evaluating Complex Video Reasoning

Link: https://arxiv.org/abs/2505.00681

Authors: Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl Vondrick, Mikhail Sirotenko, Cordelia Schmid, Tobias Weyand

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: provide outcome supervision, interpretable reasoning steps, video benchmarks, outcome supervision, LLMs are turning

Comments:

Abstract:Multimodal LLMs are turning their focus to video benchmarks; however, most video benchmarks only provide outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess whether models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates and reasoning traces, will be publicly available under this https URL#minerva.

8. 【2505.00668】Deep Reinforcement Learning for Urban Air Quality Management: Multi-Objective Optimization of Pollution Mitigation Booth Placement in Metropolitan Environments

Link: https://arxiv.org/abs/2505.00668

Authors: Kirtan Rajesh, Suvidha Rupesh Kumar

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: pressing global concern, traffic-intensive metropolitan areas, harmful pollutants severely, impacts public health, pollutants severely impacts

Comments:

Abstract:Urban air pollution remains a pressing global concern, particularly in densely populated and traffic-intensive metropolitan areas like Delhi, where exposure to harmful pollutants severely impacts public health. Delhi, being one of the most polluted cities globally, experiences chronic air quality issues due to vehicular emissions, industrial activities, and construction dust, which exacerbate its already fragile atmospheric conditions. Traditional pollution mitigation strategies, such as static air purifying installations, often fail to maximize their impact due to suboptimal placement and limited adaptability to dynamic urban environments. This study presents a novel deep reinforcement learning (DRL) framework to optimize the placement of air purification booths to improve the air quality index (AQI) in the city of Delhi. We employ Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning algorithm, to iteratively learn and identify high-impact locations based on multiple spatial and environmental factors, including population density, traffic patterns, industrial influence, and green space constraints. Our approach is benchmarked against conventional placement strategies, including random and greedy AQI-based methods, using multi-dimensional performance evaluation metrics such as AQI improvement, spatial coverage, population and traffic impact, and spatial entropy. Experimental results demonstrate that the RL-based approach outperforms baseline methods by achieving a balanced and effective distribution of air purification infrastructure. Notably, the DRL framework achieves an optimal trade-off between AQI reduction and high-coverage deployment, ensuring equitable environmental benefits across urban regions. The findings underscore the potential of AI-driven spatial optimization in advancing smart city initiatives and data-driven urban air quality management.

9. 【2505.00630】Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook

Link: https://arxiv.org/abs/2505.00630

Authors: Muyi Bao, Shuchang Lyu, Zhaoyang Xu, Huiyu Zhou, Jinchang Ren, Shiming Xiang, Xiangtai Li, Guangliang Cheng

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Convolutional Neural Networks, Neural Networks, Convolutional Neural, limited receptive fields, quadratic computational complexity

Comments:

Abstract:Deep learning has profoundly transformed remote sensing, yet prevailing architectures like Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) remain constrained by critical trade-offs: CNNs suffer from limited receptive fields, while ViTs grapple with quadratic computational complexity, hindering their scalability for high-resolution remote sensing data. State Space Models (SSMs), particularly the recently proposed Mamba architecture, have emerged as a paradigm-shifting solution, combining linear computational scaling with global context modeling. This survey presents a comprehensive review of Mamba-based methodologies in remote sensing, systematically analyzing about 120 studies to construct a holistic taxonomy of innovations and applications. Our contributions are structured across five dimensions: (i) foundational principles of vision Mamba architectures, (ii) micro-architectural advancements such as adaptive scan strategies and hybrid SSM formulations, (iii) macro-architectural integrations, including CNN-Transformer-Mamba hybrids and frequency-domain adaptations, (iv) rigorous benchmarking against state-of-the-art methods in multiple application tasks, such as object detection, semantic segmentation, change detection, etc. and (v) critical analysis of unresolved challenges with actionable future directions. By bridging the gap between SSM theory and remote sensing practice, this survey establishes Mamba as a transformative framework for remote sensing analysis. To our knowledge, this paper is the first systematic review of Mamba architectures in remote sensing. Our work provides a structured foundation for advancing research in remote sensing systems through SSM-based methods. We curate an open-source repository (this https URL) to foster community-driven advancements.

10. 【2505.00627】Brain Foundation Models with Hypergraph Dynamic Adapter for Brain Disease Analysis

Link: https://arxiv.org/abs/2505.00627

Authors: Zhongying Deng, Haoyu Wang, Ziyan Huang, Lipei Zhang, Angelica I. Aviles-Rivero, Chaoyu Liu, Junjun He, Zoe Kourtzi, Carola-Bibiane Schönlieb

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: present profound challenges, profound challenges due, brain foundation models, present profound, societal impact

Comments: 35 pages, 4 figures

Abstract:Brain diseases, such as Alzheimer's disease and brain tumors, present profound challenges due to their complexity and societal impact. Recent advancements in brain foundation models have shown significant promise in addressing a range of brain-related tasks. However, current brain foundation models are limited by task and data homogeneity, restricted generalization beyond segmentation or classification, and inefficient adaptation to diverse clinical tasks. In this work, we propose SAM-Brain3D, a brain-specific foundation model trained on over 66,000 brain image-label pairs across 14 MRI sub-modalities, and Hypergraph Dynamic Adapter (HyDA), a lightweight adapter for efficient and effective downstream adaptation. SAM-Brain3D captures detailed brain-specific anatomical and modality priors for segmenting diverse brain targets and broader downstream tasks. HyDA leverages hypergraphs to fuse complementary multi-modal data and dynamically generate patient-specific convolutional kernels for multi-scale feature fusion and personalized patient-wise adaptation. Together, our framework excels across a broad spectrum of brain disease segmentation and classification tasks. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art approaches, offering a new paradigm for brain disease analysis through multi-modal, multi-scale, and dynamic foundation modeling.

11. 【2505.00619】Diverse Semantics-Guided Feature Alignment and Decoupling for Visible-Infrared Person Re-Identification

Link: https://arxiv.org/abs/2505.00619

Authors: Neng Dong, Shuanglin Yan, Liyan Zhang, Jinhui Tang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visible-Infrared Person Re-Identification, Semantics-guided Feature Alignment, Diverse Semantics-guided Feature, challenging task due, Visible-Infrared Person

Comments:

Abstract:Visible-Infrared Person Re-Identification (VI-ReID) is a challenging task due to the large modality discrepancy between visible and infrared images, which complicates the alignment of their features into a suitable common space. Moreover, style noise, such as illumination and color contrast, reduces the identity discriminability and modality invariance of features. To address these challenges, we propose a novel Diverse Semantics-guided Feature Alignment and Decoupling (DSFAD) network to align identity-relevant features from different modalities into a textual embedding space and disentangle identity-irrelevant features within each modality. Specifically, we develop a Diverse Semantics-guided Feature Alignment (DSFA) module, which generates pedestrian descriptions with diverse sentence structures to guide the cross-modality alignment of visual features. Furthermore, to filter out style information, we propose a Semantic Margin-guided Feature Decoupling (SMFD) module, which decomposes visual features into pedestrian-related and style-related components, and then constrains the similarity between the former and the textual embeddings to be at least a margin higher than that between the latter and the textual embeddings. Additionally, to prevent the loss of pedestrian semantics during feature decoupling, we design a Semantic Consistency-guided Feature Restitution (SCFR) module, which further excavates useful information for identification from the style-related features and restores it back into the pedestrian-related features, and then constrains the similarity between the features after restitution and the textual embeddings to be consistent with that between the features before decoupling and the textual embeddings. Extensive experiments on three VI-ReID datasets demonstrate the superiority of our DSFAD.
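One way to read the SMFD margin constraint described above is as a hinge loss. The notation is an assumption made for illustration and is not taken from the paper: $f_p$ and $f_s$ denote the pedestrian-related and style-related features, $t$ the textual embedding, $s(\cdot,\cdot)$ a similarity function such as cosine similarity, and $m > 0$ the margin.

```latex
\mathcal{L}_{\text{margin}} = \max\bigl(0,\; m - s(f_p, t) + s(f_s, t)\bigr)
```

Under this reading the loss is zero exactly when $s(f_p, t) \ge s(f_s, t) + m$, that is, when the pedestrian-related feature is at least a margin more similar to the text than the style-related feature.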

12. 【2505.00615】Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction

Link: https://arxiv.org/abs/2505.00615

Authors: Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, Matthias Nießner

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: single RGB image, single RGB, RGB image, RGB, human faces

Comments: Project Website: [this https URL](https://simongiebenhain.github.io/pixel3dmm/) ; Video: [this https URL](https://www.youtube.com/watch?v=BwxwEXJwUDc)

Abstract:We address the 3D reconstruction of human faces from a single RGB image. To this end, we propose Pixel3DMM, a set of highly-generalized vision transformers which predict per-pixel geometric cues in order to constrain the optimization of a 3D morphable face model (3DMM). We exploit the latent features of the DINO foundation model, and introduce a tailored surface normal and uv-coordinate prediction head. We train our model by registering three high-quality 3D face datasets against the FLAME mesh topology, which results in a total of over 1,000 identities and 976K images. For 3D face reconstruction, we propose a FLAME fitting optimization that solves for the 3DMM parameters from the uv-coordinate and normal estimates. To evaluate our method, we introduce a new benchmark for single-image face reconstruction, which features highly diverse facial expressions, viewing angles, and ethnicities. Crucially, our benchmark is the first to evaluate both posed and neutral facial geometry. Ultimately, our method outperforms the most competitive baselines by over 15% in terms of geometric accuracy for posed facial expressions.

13. 【2505.00606】Dietary Intake Estimation via Continuous 3D Reconstruction of Food

Link: https://arxiv.org/abs/2505.00606

Authors: Wallace Lee, YuHao Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: preventing health risks, including obesity, overeating and undereating, cardiovascular diseases, habits is crucial

Comments: 2025 CVPR MetaFood Workshop

Abstract:Monitoring dietary habits is crucial for preventing health risks associated with overeating and undereating, including obesity, diabetes, and cardiovascular diseases. Traditional methods for tracking food intake rely on self-reported data before or after eating, which are prone to inaccuracies. This study proposes an approach to accurately monitor ingestion behaviours by leveraging 3D food models constructed from monocular 2D video. Using COLMAP and pose estimation algorithms, we generate detailed 3D representations of food, allowing us to observe changes in food volume as it is consumed. Experiments with toy models and real food items demonstrate the approach's potential. We also propose a new methodology for automated state recognition challenges, to accurately detect state changes and maintain model fidelity. The 3D reconstruction approach shows promise in capturing comprehensive dietary behaviour insights, ultimately contributing to the development of automated and accurate dietary monitoring tools.

14. 【2505.00599】Visual Trajectory Prediction of Vessels for Inland Navigation

Link: https://arxiv.org/abs/2505.00599

Authors: Alexander Puzicha, Konstantin Wüstefeld, Kathrin Wilms, Frank Weichert

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: navigation increasingly relies, accurate vessel trajectory, remote operations, increasingly relies, vessel trajectory prediction

Comments:

Abstract:The future of inland navigation increasingly relies on autonomous systems and remote operations, emphasizing the need for accurate vessel trajectory prediction. This study addresses the challenges of video-based vessel tracking and prediction by integrating advanced object detection methods, Kalman filters, and spline-based interpolation. However, existing detection systems often misclassify objects in inland waterways due to complex surroundings. A comparative evaluation of tracking algorithms, including BoT-SORT, Deep OC-SORT, and ByteTrack, highlights the robustness of the Kalman filter in providing smoothed trajectories. Experimental results from diverse scenarios demonstrate improved accuracy in predicting vessel movements, which is essential for collision avoidance and situational awareness. The findings underline the necessity of customized datasets and models for inland navigation. Future work will expand the datasets and incorporate vessel classification to refine predictions, supporting both autonomous systems and human operators in complex environments.
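To make the role of the Kalman filter in trajectory smoothing concrete, here is a minimal sketch of a constant-velocity filter applied to one image coordinate of a tracked vessel. It is not the authors' implementation: the function name `kalman_filter_1d`, the noise parameters `q` and `r`, and the sample track are all assumptions chosen for illustration, and the same filter would be run independently on each coordinate.

```python
def kalman_filter_1d(measurements, dt=1.0, q=1e-2, r=1.0):
    """Forward Kalman filter with a constant-velocity model (hypothetical sketch).

    measurements: noisy positions along one axis, one per frame.
    q: process noise added to the covariance diagonal (assumed value).
    r: measurement noise variance (assumed value).
    Returns the filtered position for every frame.
    """
    p, v = measurements[0], 0.0          # state: position and velocity
    P = [[1.0, 0.0], [0.0, 1.0]]         # state covariance
    out = [p]
    for z in measurements[1:]:
        # Predict: x <- F x, P <- F P F^T + Q, with F = [[1, dt], [0, 1]]
        p = p + dt * v
        a, b, c, d = P[0][0], P[0][1], P[1][0], P[1][1]
        P = [[a + dt * (b + c) + dt * dt * d + q, b + dt * d],
             [c + dt * d, d + q]]
        # Update with a position-only measurement, H = [1, 0]
        s = P[0][0] + r                  # innovation variance
        k0, k1 = P[0][0] / s, P[1][0] / s
        y = z - p                        # innovation
        p, v = p + k0 * y, v + k1 * y
        P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
             [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
        out.append(p)
    return out


# Usage: smooth a jittery x-coordinate track (made-up numbers).
noisy_x = [100.0, 103.5, 104.2, 109.8, 111.1, 116.4]
smoothed_x = kalman_filter_1d(noisy_x)
```

Raising `r` relative to `q` makes the filter trust its motion model more, which gives smoother but more sluggish trajectories.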

15. 【2505.00592】Uncertainty-Aware Multi-Expert Knowledge Distillation for Imbalanced Disease Grading

Link: https://arxiv.org/abs/2505.00592

Authors: Shuo Tong, Shangde Gao, Ke Liu, Zihang Huang, Hongxia Xu, Haochao Ying, Jian Wu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: accurate patient assessments, Automatic disease image, Automatic disease, intelligence for healthcare, enabling faster

Comments:

Abstract:Automatic disease image grading is a significant application of artificial intelligence for healthcare, enabling faster and more accurate patient assessments. However, domain shifts, which are exacerbated by data imbalance, introduce bias into the model, posing deployment difficulties in clinical applications. To address the problem, we propose a novel Uncertainty-aware Multi-experts Knowledge Distillation (UMKD) framework to transfer knowledge from multiple expert models to a single student model. Specifically, to extract discriminative features, UMKD decouples task-agnostic and task-specific features with shallow and compact feature alignment in the feature space. At the output space, an uncertainty-aware decoupled distillation (UDD) mechanism dynamically adjusts knowledge transfer weights based on expert model uncertainties, ensuring robust and reliable distillation. Additionally, UMKD tackles the problems of model architecture heterogeneity and distribution discrepancies between source and target domains, which are inadequately tackled by previous KD approaches. Extensive experiments on histology prostate grading (SICAPv2) and fundus image grading (APTOS) demonstrate that UMKD achieves a new state-of-the-art in both source-imbalanced and target-imbalanced scenarios, offering a robust and practical solution for real-world disease image grading.

16. 【2505.00584】Synthesizing and Identifying Noise Levels in Autonomous Vehicle Camera Radar Datasets

Link: https://arxiv.org/abs/2505.00584

Authors: Mathis Morales, Golnaz Habibi

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)

Keywords: Detecting and tracking, autonomous navigation method, crucial component, Detecting, autonomous navigation

Comments:

Abstract:Detecting and tracking objects is a crucial component of any autonomous navigation method. Over the past decades, object detection has yielded promising results using neural networks on various datasets. While many methods focus on performance metrics, few projects focus on improving the robustness of these detection and tracking pipelines, notably to sensor failures. In this paper we attempt to address this issue by creating a realistic synthetic data augmentation pipeline for camera-radar Autonomous Vehicle (AV) datasets. Our goal is to accurately simulate sensor failures and data deterioration due to real-world interferences. We also present our results of a baseline lightweight Noise Recognition neural network trained and tested on our augmented dataset, reaching an overall recognition accuracy of 54.4% on 11 categories across 10086 images and 2145 radar point-clouds.

17. 【2505.00569】AnimalMotionCLIP: Embedding motion in CLIP for Animal Behavior Analysis

Link: https://arxiv.org/abs/2505.00569

Authors: Enmin Zhong, Carlos R. del-Blanco, Daniel Berjón, Fernando Jaureguizar, Narciso García

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: applying deep learning, deep learning techniques, leveraging pre-trained visual, pre-trained visual language, remarkable generalization capacity

Comments: 6 pages, 3 figures. Accepted for the poster session at the CV4Animals workshop: Computer Vision for Animal Behavior Tracking and Modeling, in conjunction with Computer Vision and Pattern Recognition 2024

Abstract:Recently, there has been a surge of interest in applying deep learning techniques to animal behavior recognition, particularly leveraging pre-trained visual language models, such as CLIP, due to their remarkable generalization capacity across various downstream tasks. However, adapting these models to the specific domain of animal behavior recognition presents two significant challenges: integrating motion information and devising an effective temporal modeling scheme. In this paper, we propose AnimalMotionCLIP to address these challenges by interleaving video frames and optical flow information in the CLIP framework. Additionally, several temporal modeling schemes using an aggregation of classifiers are proposed and compared: dense, semi dense, and sparse. As a result, fine temporal actions can be correctly recognized, which is of vital importance in animal behavior analysis. Experiments on the Animal Kingdom dataset demonstrate that AnimalMotionCLIP achieves superior performance compared to state-of-the-art approaches.

18. 【2505.00568】Multimodal Masked Autoencoder Pre-training for 3D MRI-Based Brain Tumor Analysis with Missing Modalities

Link: https://arxiv.org/abs/2505.00568

Authors: Lucas Robinet, Ahmad Berjaoui, Elizabeth Cohen-Jonathan Moyal

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: providing crucial insights, Multimodal magnetic resonance, magnetic resonance imaging, treatment monitoring, brain tumors

Comments:

Abstract:Multimodal magnetic resonance imaging (MRI) constitutes the first line of investigation for clinicians in the care of brain tumors, providing crucial insights for surgery planning, treatment monitoring, and biomarker identification. Pre-training on large datasets has been shown to help models learn transferable representations and adapt with minimal labeled data. This behavior is especially valuable in medical imaging, where annotations are often scarce. However, applying this paradigm to multimodal medical data introduces a challenge: most existing approaches assume that all imaging modalities are available during both pre-training and fine-tuning. In practice, missing modalities often occur due to acquisition issues, specialist unavailability, or specific experimental designs on small in-house datasets. Consequently, a common approach involves training a separate model for each desired modality combination, making the process both resource-intensive and impractical for clinical use. Therefore, we introduce BM-MAE, a masked image modeling pre-training strategy tailored for multimodal MRI data. The same pre-trained model seamlessly adapts to any combination of available modalities, extracting rich representations that capture both intra- and inter-modal information. This allows fine-tuning on any subset of modalities without requiring architectural changes, while still benefiting from a model pre-trained on the full set of modalities. Extensive experiments show that the proposed pre-training strategy outperforms or remains competitive with baselines that require separate pre-training for each modality subset, while substantially surpassing training from scratch on several downstream tasks. Additionally, it can quickly and efficiently reconstruct missing modalities, highlighting its practical value. Code and trained models are available at: this https URL

19. 【2505.00564】X-ray illicit object detection using hybrid CNN-transformer neural network architectures

Link: https://arxiv.org/abs/2505.00564

Authors: Jorgen Cani, Christos Diou, Spyridon Evangelatos, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significantly impact outcomes, X-ray security applications, impact outcomes, smallest details, details can significantly

Comments:

Abstract:In the field of X-ray security applications, even the smallest details can significantly impact outcomes. Objects that are heavily occluded or intentionally concealed pose a great challenge for detection, whether by human observation or through advanced technological applications. While certain Deep Learning (DL) architectures demonstrate strong performance in processing local information, such as Convolutional Neural Networks (CNNs), others excel in handling distant information, e.g., transformers. In X-ray security imaging the literature has been dominated by the use of CNN-based methods, while the integration of the two aforementioned leading architectures has not been sufficiently explored. In this paper, various hybrid CNN-transformer architectures are evaluated against a common CNN object detection baseline, namely YOLOv8. In particular, a CNN (HGNetV2) and a hybrid CNN-transformer (Next-ViT-S) backbone are combined with different CNN/transformer detection heads (YOLOv8 and RT-DETR). The resulting architectures are comparatively evaluated on three challenging public X-ray inspection datasets, namely EDS, HiXray, and PIDray. Interestingly, while the YOLOv8 detector with its default backbone (CSP-DarkNet53) is generally shown to be advantageous on the HiXray and PIDray datasets, when a domain distribution shift is incorporated in the X-ray images (as happens in the EDS datasets), hybrid CNN-transformer architectures exhibit increased robustness. Detailed comparative evaluation results, including object-level detection performance and object-size error analysis, demonstrate the strengths and weaknesses of each architectural combination and suggest guidelines for future research. The source code and network weights of the models employed in this study are available at this https URL.

20. 【2505.00534】A Robust Deep Networks based Multi-Object MultiCamera Tracking System for City Scale Traffic

Link: https://arxiv.org/abs/2505.00534

Authors: Muhammad Imran Zaman, Usama Ijaz Bajwa, Gulshan Saleem, Rana Hammad Raza

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Intelligent Transportation Systems, Transportation Systems, Intelligent Transportation, Vision sensors, important in Intelligent

Comments:

Abstract:Vision sensors are becoming more important in Intelligent Transportation Systems (ITS) for traffic monitoring, management, and optimization as the number of network cameras continues to rise. However, manual object tracking and matching across multiple non-overlapping cameras pose significant challenges in city-scale urban traffic scenarios. These challenges include handling diverse vehicle attributes, occlusions, illumination variations, shadows, and varying video resolutions. To address these issues, we propose an efficient and cost-effective deep learning-based framework for Multi-Object Multi-Camera Tracking (MO-MCT). The proposed framework utilizes Mask R-CNN for object detection and employs Non-Maximum Suppression (NMS) to select target objects from overlapping detections. Transfer learning is employed for re-identification, enabling the association and generation of vehicle tracklets across multiple cameras. Moreover, we leverage appropriate loss functions and distance measures to handle occlusion, illumination, and shadow challenges. The final solution identification module performs feature extraction using ResNet-152 coupled with Deep SORT based vehicle tracking. The proposed framework is evaluated on the 5th AI City Challenge dataset (Track 3), comprising 46 camera feeds. Among these 46 camera streams, 40 are used for model training and validation, while the remaining six are utilized for model testing. The proposed framework achieves competitive performance with an IDF1 score of 0.8289, and precision and recall scores of 0.9026 and 0.8527 respectively, demonstrating its effectiveness in robust and accurate vehicle tracking.
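The Non-Maximum Suppression step mentioned above can be sketched as follows. This is the generic greedy NMS algorithm over axis-aligned boxes, not the authors' code: the function names, the `(x1, y1, x2, y2)` box layout, the 0.5 IoU threshold, and the sample boxes are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap any already-kept box by more than iou_threshold.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Usage: two heavily overlapping detections of one vehicle plus a separate one.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of surviving detections
```

In this example the second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed, while the third box does not overlap anything and survives.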

21. 【2505.00512】InterLoc: LiDAR-based Intersection Localization using Road Segmentation with Automated Evaluation Method

Link: https://arxiv.org/abs/2505.00512

Authors: Nguyen Hoang Khoi Tran, Julie Stephany Berrio, Mao Shan, Zhenxing Ming, Stewart Worrall

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: functional key points, geometric and functional, functional key, key points, road network

Abstract:Intersections are geometric and functional key points in every road network. They offer strong landmarks to correct GNSS dropouts and anchor new sensor data in up-to-date maps. Despite that importance, intersection detectors either ignore the rich semantic information already computed onboard or depend on scarce, hand-labeled intersection datasets. To close that gap, this paper presents a LiDAR-based method for intersection detection that (i) fuses semantic road segmentation with vehicle localization to detect intersection candidates in a bird's eye view (BEV) representation and (ii) refines those candidates by analyzing branch topology with a least squares formulation. To evaluate our method, we introduce an automated benchmarking pipeline that pairs detections with OpenStreetMap (OSM) intersection nodes using precise GNSS/INS ground-truth poses. Tested on eight SemanticKITTI sequences, the approach achieves a mean localization error of 1.9 m, 89% precision, and 77% recall at a 5 m tolerance, outperforming the latest learning-based baseline. Moreover, the method is robust to segmentation errors higher than those of the benchmark model, demonstrating its applicability in the real world.
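
The evaluation described above pairs detections with OSM intersection nodes and reports precision, recall, and mean localization error at a 5 m tolerance. The sketch below shows one plausible way to compute such numbers. The greedy nearest-first one-to-one matching, the 2D metric coordinates, and the function name `evaluate_detections` are assumptions for illustration; the abstract does not specify the pairing rule.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # assumed: planar coordinates in metres


def evaluate_detections(detections: Sequence[Point], nodes: Sequence[Point],
                        tolerance: float = 5.0) -> Tuple[float, float, float]:
    """Return (precision, recall, mean localization error of matched pairs)."""
    # All candidate pairs, sorted by distance so that closest pairs match first.
    pairs = sorted(
        (math.dist(d, n), i, j)
        for i, d in enumerate(detections)
        for j, n in enumerate(nodes)
    )
    used_detections, used_nodes = set(), set()
    errors: List[float] = []
    for distance, i, j in pairs:
        if distance > tolerance:
            break  # remaining pairs are even farther apart
        if i in used_detections or j in used_nodes:
            continue  # enforce one-to-one matching
        used_detections.add(i)
        used_nodes.add(j)
        errors.append(distance)
    true_positives = len(errors)
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(nodes) if nodes else 0.0
    mean_error = sum(errors) / true_positives if true_positives else float("nan")
    return precision, recall, mean_error


if __name__ == "__main__":
    dets = [(0.0, 0.0), (10.0, 0.0), (100.0, 100.0)]
    gt_nodes = [(3.0, 0.0), (10.0, 1.0)]
    print(evaluate_detections(dets, gt_nodes))
```

In the demo, two of three detections fall within 5 m of a distinct node, so precision is 2/3, recall is 1, and the mean error over the matched pairs is 2 m.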

22. 【2505.00511】Inconsistency-based Active Learning for LiDAR Object Detection

Link: https://arxiv.org/abs/2505.00511

Authors: Esteban Rivera, Loic Stratil, Markus Lienkamp

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently achieved impressive, achieved impressive performance, impressive performance gains, Deep learning models, vehicles worldwide

Comments: Accepted in IV2025

Abstract:Deep learning models for object detection in autonomous driving have recently achieved impressive performance gains and are already being deployed in vehicles worldwide. However, current models require increasingly large datasets for training. Acquiring and labeling such data is costly, necessitating the development of new strategies to optimize this process. Active learning is a promising approach that has been extensively researched in the image domain. In our work, we extend this concept to the LiDAR domain by developing several inconsistency-based sample selection strategies and evaluate their effectiveness in various settings. Our results show that using a naive inconsistency approach based on the number of detected boxes, we achieve the same mAP as the random sampling strategy with 50% of the labeled data.
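
The abstract mentions a naive inconsistency criterion based on the number of detected boxes. The sketch below shows what such a selection rule could look like. It assumes each unlabeled sample has box counts from two predictions (for example, an original and a perturbed input) and that the samples with the largest count difference are sent for labeling. These details, and all names used here, are assumptions and may differ from the paper.

```python
from typing import Dict, List


def select_by_box_count_inconsistency(counts_a: Dict[str, int],
                                      counts_b: Dict[str, int],
                                      budget: int) -> List[str]:
    """Pick the `budget` samples whose two box counts disagree the most."""
    # Inconsistency score: absolute difference in the number of detected boxes.
    scores = {sid: abs(counts_a[sid] - counts_b[sid]) for sid in counts_a}
    # Sort by descending score; break ties by sample id for determinism.
    ranked = sorted(scores, key=lambda sid: (-scores[sid], sid))
    return ranked[:budget]


if __name__ == "__main__":
    original = {"s1": 5, "s2": 3, "s3": 7}
    perturbed = {"s1": 5, "s2": 6, "s3": 6}
    print(select_by_box_count_inconsistency(original, perturbed, budget=2))
```

Here `s2` (difference 3) and `s3` (difference 1) are selected, while `s1`, whose two counts agree, is left unlabeled.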

23. 【2505.00507】HeAL3D: Heuristical-enhanced Active Learning for 3D Object Detection

Link: https://arxiv.org/abs/2505.00507

Authors: Esteban Rivera, Surya Prabhakaran, Markus Lienkamp

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autonomous Driving, Heuristical-enhanced Active Learning, Active Learning, perform sample selection, sample selection

Comments: Accepted in CVPR2025

Abstract:Active Learning has proved to be a relevant approach to perform sample selection for training models for Autonomous Driving. Particularly, previous works on active learning for 3D object detection have shown that selection of samples in uncontrolled scenarios is challenging. Furthermore, current approaches focus exclusively on the theoretical aspects of the sample selection problem but neglect the practical insights that can be obtained from the extensive literature and application of 3D detection models. In this paper, we introduce HeAL (Heuristical-enhanced Active Learning for 3D Object Detection) which integrates those heuristical features together with Localization and Classification to deliver the most contributing samples to the model's training. In contrast to previous works, our approach integrates heuristical features such as object distance and point-quantity to estimate the uncertainty, which enhance the usefulness of selected samples to train detection models. Our quantitative evaluation on KITTI shows that HeAL presents competitive mAP with respect to the State-of-the-Art, and achieves the same mAP as the full-supervised baseline with only 24% of the samples.

24. 【2505.00502】Towards Scalable Human-aligned Benchmark for Text-guided Image Editing

Link: https://arxiv.org/abs/2505.00502

Authors: Suho Ryu, Kihyun Kim, Eugene Baek, Dongsoo Shin, Joonseok Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: text-guided image editing, text-guided image, proposed recently, image editing, image editing models

Comments: Accepted to CVPR 2025 (highlight)

Abstract:A variety of text-guided image editing models have been proposed recently. However, there is no widely-accepted standard evaluation method mainly due to the subjective nature of the task, letting researchers rely on manual user study. To address this, we introduce a novel Human-Aligned benchmark for Text-guided Image Editing (HATIE). Providing a large-scale benchmark set covering a wide range of editing tasks, it allows reliable evaluation, not limited to specific easy-to-evaluate cases. Also, HATIE provides a fully-automated and omnidirectional evaluation pipeline. Particularly, we combine multiple scores measuring various aspects of editing so as to align with human perception. We empirically verify that the evaluation of HATIE is indeed human-aligned in various aspects, and provide benchmark results on several state-of-the-art models to provide deeper insights on their performance.

25. 【2505.00497】KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution

Link: https://arxiv.org/abs/2505.00497

Authors: Antoni Bigata, Rodrigo Mira, Stella Bounareli, Michał Stypułkowski, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: audio-driven facial animation, aligning lip movements, task of aligning, typically framed, simpler variant

Abstract:Lip synchronization, known as the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, as well as suffering from the usual issues in talking head generation (e.g., temporal consistency), lip synchronization presents significant new challenges such as expression leakage from the input video and facial occlusions, which can severely impact real-world applications like automated dubbing, but are often neglected in existing works. To address these shortcomings, we present KeySync, a two-stage framework that succeeds in solving the issue of temporal consistency, while also incorporating solutions for leakage and occlusions using a carefully designed masking strategy. We show that KeySync achieves state-of-the-art results in lip reconstruction and cross-synchronization, improving visual quality and reducing expression leakage according to LipLeak, our novel leakage metric. Furthermore, we demonstrate the effectiveness of our new masking approach in handling occlusions and validate our architectural choices through several ablation studies. Code and model weights can be found at this https URL.

26. 【2505.00482】JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

Link: https://arxiv.org/abs/2505.00482

Authors: Kwon Byung-Ki, Qi Dai, Lee Hyoseok, Chong Luo, Tae-Hyun Oh

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: diffusion transformer, joint distribution, RGB, joint distribution modeling, distribution of RGB

Abstract:We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefit and outstanding image prior of the state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose, i.e., adaptive scheduling weights, which depend on the noise levels of each modality, and the unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation by simply controlling the timestep of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a replaceable alternative to conditional generation. The project page is available at this https URL.

27. 【2505.00452】ClearLines - Camera Calibration from Straight Lines

Link: https://arxiv.org/abs/2505.00452

Authors: Gregory Schroeder, Mohamed Sabry, Cristina Olaverri-Monreal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: geometric computer vision, well-established theoretical foundations, computer vision, theoretical foundations, problem of calibration

Abstract:The problem of calibration from straight lines is fundamental in geometric computer vision, with well-established theoretical foundations. However, its practical applicability remains limited, particularly in real-world outdoor scenarios. These environments pose significant challenges due to diverse and cluttered scenes, interrupted reprojections of straight 3D lines, and varying lighting conditions, making the task notoriously difficult. Furthermore, the field lacks a dedicated dataset encouraging the development of respective detection algorithms. In this study, we present a small dataset named "ClearLines", and by detailing its creation process, provide practical insights that can serve as a guide for developing and refining straight 3D line detection algorithms.

28. 【2505.00426】Leveraging Pretrained Diffusion Models for Zero-Shot Part Assembly

Link: https://arxiv.org/abs/2505.00426

Authors: Ruiyuan Zhang, Qi Wang, Jiaxiang Liu, Yu Zhang, Yuchi Huo, Chao Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: understand part relationships, part assembly aims, poses to construct, addressing the growing, crucial for robots

Comments: 10 pages, 12 figures, Accepted by IJCAI-2025

Abstract: 3D part assembly aims to understand part relationships and predict their 6-DoF poses to construct realistic 3D shapes, addressing the growing demand for autonomous assembly, which is crucial for robots. Existing methods mainly estimate the transformation of each part by training neural networks under supervision, which requires a substantial quantity of manually labeled data. However, the high cost of data collection and the immense variability of real-world shapes and parts make traditional methods impractical for large-scale applications. In this paper, we propose the first zero-shot part assembly method that utilizes pre-trained point cloud diffusion models as discriminators in the assembly process, guiding the manipulation of parts to form realistic shapes. Specifically, we theoretically demonstrate that utilizing a diffusion model for zero-shot part assembly can be transformed into an Iterative Closest Point (ICP) process. Then, we propose a novel pushing-away strategy to address overlapping parts, thereby further enhancing the robustness of the method. To verify our work, we conduct extensive experiments and quantitative comparisons to several strong baseline methods, demonstrating the effectiveness of the proposed approach, which even surpasses the supervised learning method. The code has been released at this https URL.

29. 【2505.00421】Real-Time Animatable 2DGS-Avatars with Detail Enhancement from Monocular Videos

Link: https://arxiv.org/abs/2505.00421

Authors: Xia Yuan, Hai Yuan, Wenyi Ge, Ying Fu, Xi Wu, Guanyu Xing

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offers significant potential, videos offers significant, human avatar reconstruction, augmented reality, game development

Abstract:High-quality, animatable 3D human avatar reconstruction from monocular videos offers significant potential for reducing reliance on complex hardware, making it highly practical for applications in game development, augmented reality, and social media. However, existing methods still face substantial challenges in capturing fine geometric details and maintaining animation stability, particularly under dynamic or complex poses. To address these issues, we propose a novel real-time framework for animatable human avatar reconstruction based on 2D Gaussian Splatting (2DGS). By leveraging 2DGS and global SMPL pose parameters, our framework not only aligns positional and rotational discrepancies but also enables robust and natural pose-driven animation of the reconstructed avatars. Furthermore, we introduce a Rotation Compensation Network (RCN) that learns rotation residuals by integrating local geometric features with global pose parameters. This network significantly improves the handling of non-rigid deformations and ensures smooth, artifact-free pose transitions during animation. Experimental results demonstrate that our method successfully reconstructs realistic and highly animatable human avatars from monocular videos, effectively preserving fine-grained details while ensuring stable and natural pose variation. Our approach surpasses current state-of-the-art methods in both reconstruction quality and animation robustness on public benchmarks.

30. 【2505.00394】SOTA: Spike-Navigated Optimal TrAnsport Saliency Region Detection in Composite-bias Videos

Link: https://arxiv.org/abs/2505.00394

Authors: Wenxuan Liu, Yao Deng, Kang Chen, Xian Zhong, Zhaofei Yu, Tiejun Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world scenarios due, blur and occlusions, struggle in real-world, real-world scenarios, scenarios due

Comments: Accepted to IJCAI 2025

Abstract:Existing saliency detection methods struggle in real-world scenarios due to motion blur and occlusions. In contrast, spike cameras, with their high temporal resolution, significantly enhance visual saliency maps. However, the composite noise inherent to spike camera imaging introduces discontinuities in saliency detection. Low-quality samples further distort model predictions, leading to saliency bias. To address these challenges, we propose Spike-navigated Optimal TrAnsport Saliency Region Detection (SOTA), a framework that leverages the strengths of spike cameras while mitigating biases in both spatial and temporal dimensions. Our method introduces Spike-based Micro-debias (SM) to capture subtle frame-to-frame variations and preserve critical details, even under minimal scene or lighting changes. Additionally, Spike-based Global-debias (SG) refines predictions by reducing inconsistencies across diverse conditions. Extensive experiments on real and synthetic datasets demonstrate that SOTA outperforms existing methods by eliminating composite noise bias. Our code and dataset will be released at this https URL.

31. 【2505.00380】The Invisible Threat: Evaluating the Vulnerability of Cross-Spectral Face Recognition to Presentation Attacks

Link: https://arxiv.org/abs/2505.00380

Authors: Anjith George, Sebastien Marcel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: challenging operational conditions, enabling cross-modal matching, Cross-spectral face recognition, operational conditions, designed to enhance

Comments: 10 pages

Abstract:Cross-spectral face recognition systems are designed to enhance the performance of facial recognition systems by enabling cross-modal matching under challenging operational conditions. A particularly relevant application is the matching of near-infrared (NIR) images to visible-spectrum (VIS) images, enabling the verification of individuals by comparing NIR facial captures acquired with VIS reference images. The use of NIR imaging offers several advantages, including greater robustness to illumination variations, better visibility through glasses and glare, and greater resistance to presentation attacks. Despite these claimed benefits, the robustness of NIR-based systems against presentation attacks has not been systematically studied in the literature. In this work, we conduct a comprehensive evaluation into the vulnerability of NIR-VIS cross-spectral face recognition systems to presentation attacks. Our empirical findings indicate that, although these systems exhibit a certain degree of reliability, they remain vulnerable to specific attacks, emphasizing the need for further research in this area.

32. 【2505.00378】Cues3D: Unleashing the Power of Sole NeRF for Consistent and Unique Instances in Open-Vocabulary 3D Panoptic Segmentation

Link: https://arxiv.org/abs/2505.00378

Authors: Feng Xue, Wenzhuang Xu, Guofeng Zhong, Anlong Minga, Nicu Sebe

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant trend, recently emerged, Neural Radiance Field, Open-vocabulary, Neural Radiance

Comments: Accepted by Information Fusion

Abstract: Open-vocabulary 3D panoptic segmentation has recently emerged as a significant trend. Top-performing methods currently integrate 2D segmentation with geometry-aware 3D primitives. However, the advantage would be lost without high-fidelity 3D point clouds, such as methods based on Neural Radiance Field (NeRF). These methods are limited by the insufficient capacity to maintain consistency across partial observations. To address this, recent works have utilized contrastive loss or cross-view association pre-processing for view consensus. In contrast to them, we present Cues3D, a compact approach that relies solely on NeRF instead of pre-associations. The core idea is that NeRF's implicit 3D field inherently establishes a globally consistent geometry, enabling effective object distinction without explicit cross-view supervision. We propose a three-phase training framework for NeRF, initialization-disambiguation-refinement, whereby the instance IDs are corrected using the initially-learned knowledge. Additionally, an instance disambiguation method is proposed to match NeRF-rendered 3D masks and ensure globally unique 3D instance identities. With the aid of Cues3D, we obtain highly consistent and unique 3D instance ID for each object across views with a balanced version of NeRF. Our experiments are conducted on ScanNet v2, ScanNet200, ScanNet++, and Replica datasets for 3D instance, panoptic, and semantic segmentation tasks. Cues3D outperforms other 2D image-based methods and competes with the latest 2D-3D merging based methods, while even surpassing them when using additional 3D point clouds. The code link can be found in the appendix, and the code will be released on GitHub (this https URL).

33. 【2505.00369】Automated segmentation of pediatric neuroblastoma on multi-modal MRI: Results of the SPPIN challenge at MICCAI 2023

Link: https://arxiv.org/abs/2505.00369

Authors: M.A.D. Buser, D.C. Simons, M. Fitski, M.H.W.A. Wijnen, A.S. Littooij, A.H. ter Brugge, I.N. Vos, M.H.A. Janse, M. de Boer, R. ter Maat, J. Sato, S. Kido, S. Kondo, S. Kasai, M. Wodzinski, H. Muller, J. Ye, J. He, Y. Kirchhoff, M.R. Rokkus, G. Haokai, S. Zitong, M. Fernández-Patón, D. Veiga-Canuto, D.G. Ellis, M.R. Aizenberg, B.H.M. van der Velden, H. Kuijf, A. De Luca, A.F.W. van der Steeg

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Surgery plays, common pediatric cancer, MRI, MRI scans, post-chemotherapy MRI scans

Comments: 23 pages, 6 figures

Abstract: Surgery plays an important role within the treatment for neuroblastoma, a common pediatric cancer. This requires careful planning, often via magnetic resonance imaging (MRI)-based anatomical 3D models. However, creating these models is often time-consuming and user-dependent. We organized the Surgical Planning in Pediatric Neuroblastoma (SPPIN) challenge, to stimulate developments on this topic, and set a benchmark for fully automatic segmentation of neuroblastoma on multi-modal MRI. The challenge started with a training phase, where teams received 78 sets of MRI scans from 34 patients, consisting of both diagnostic and post-chemotherapy MRI scans. The final test phase, consisting of 18 MRI sets from 9 patients, determined the ranking of the teams. Ranking was based on the Dice similarity coefficient (Dice score), the 95th percentile of the Hausdorff distance (HD95) and the volumetric similarity (VS). The SPPIN challenge was hosted at MICCAI 2023. The final leaderboard consisted of 9 teams. The highest-ranking team achieved a median Dice score of 0.82, a median HD95 of 7.69 mm and a VS of 0.91, utilizing a large, pretrained network called STU-Net. A significant difference for the segmentation results between diagnostic and post-chemotherapy MRI scans was observed (Dice = 0.89 vs Dice = 0.59, P = 0.01) for the highest-ranking team. SPPIN is the first medical segmentation challenge in extracranial pediatric oncology. The highest-ranking team used a large pre-trained network, suggesting that pretraining can be of use in small, heterogeneous datasets. Although the results of the highest-ranking team were high for most patients, segmentation, especially in small, pre-treated tumors, was insufficient. Therefore, more reliable segmentation methods are needed to create clinically applicable models to aid surgical planning in pediatric neuroblastoma.
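
Two of the ranking metrics named above, the Dice similarity coefficient and volumetric similarity (VS), have simple closed forms. The sketch below uses their commonly cited definitions, Dice = 2|A∩B| / (|A| + |B|) and VS = 1 − ||A| − |B|| / (|A| + |B|), on masks represented as sets of voxel coordinates. The set representation, the function names, and the convention of returning 1.0 for two empty masks are assumptions for this example; the challenge's own evaluation code may differ, and HD95 is omitted here.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]  # assumed: integer voxel index (z, y, x)


def dice_score(pred: Set[Voxel], truth: Set[Voxel]) -> float:
    """Dice similarity coefficient between two binary masks."""
    total = len(pred) + len(truth)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(pred & truth) / total


def volumetric_similarity(pred: Set[Voxel], truth: Set[Voxel]) -> float:
    """Volumetric similarity: compares mask volumes only, ignoring overlap."""
    total = len(pred) + len(truth)
    if total == 0:
        return 1.0  # assumed convention, as above
    return 1.0 - abs(len(pred) - len(truth)) / total


if __name__ == "__main__":
    predicted = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
    reference = {(0, 0, 1), (0, 1, 0), (1, 1, 1), (1, 0, 0)}
    print(dice_score(predicted, reference))             # 4/7
    print(volumetric_similarity(predicted, reference))  # 6/7
```

Because VS ignores where the voxels are, a segmentation can score a high VS with a low Dice score, which is one reason to report the two together.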

34. 【2505.00337】T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation

Link: https://arxiv.org/abs/2505.00337

Authors: Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, Jiale Zhao

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: user engagement online, made significant strides, digital art creation, producing high-quality videos, recent years

35. 【2505.00335】Efficient Neural Video Representation with Temporally Coherent Modulation

Link: https://arxiv.org/abs/2505.00335

Authors: Seungjun Shin, Suji Kim, Dokwan Oh

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Implicit neural representations, found successful applications, Implicit neural, video, found successful

Comments: ECCV 2024

Abstract: Implicit neural representations (INR) have found successful applications across diverse domains. To employ INR in real life, it is important to speed up training. In the field of INR for video applications, the state-of-the-art approach employs grid-type parametric encoding and successfully achieves a faster encoding speed in comparison to its predecessors. However, the grid usage, which does not consider the video's dynamic nature, leads to redundant use of trainable parameters. As a result, it has significantly lower parameter efficiency and higher bitrate compared to NeRV-style methods that do not use a parametric encoding. To address the problem, we propose Neural Video representation with Temporally coherent Modulation (NVTM), a novel framework that can capture dynamic characteristics of video. By decomposing the spatio-temporal 3D video data into a set of 2D grids with flow information, NVTM enables learning video representation rapidly and uses parameters efficiently. Our framework processes temporally corresponding pixels at once, resulting in the fastest encoding speed for a reasonable video quality, especially when compared to the NeRV-style method, with a speed increase of over 3 times. Also, it achieves an average improvement of 1.54dB/0.019 in PSNR/LPIPS on UVG (Dynamic) (even with 10% fewer parameters) and an average improvement of 1.84dB/0.013 in PSNR/LPIPS on MCL-JCV (Dynamic), compared to previous grid-type works. By expanding this to compression tasks, we demonstrate comparable performance to video compression standards (H.264, HEVC) and recent INR approaches for video compression. Additionally, we perform extensive experiments demonstrating the superior performance of our algorithm across diverse tasks, encompassing super resolution, frame interpolation and video inpainting. The project page is available at this https URL.

36. 【2505.00334】Quaternion Wavelet-Conditioned Diffusion Models for Image Super-Resolution

Link: https://arxiv.org/abs/2505.00334

Authors: Luigi Sigillo, Christian Bianchi, Danilo Comminiello

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: broad applications spacing, satellite analysis, fundamental problem, problem in computer, computer vision

Comments: Accepted for presentation at IJCNN 2025

Abstract: Image Super-Resolution (SR) is a fundamental problem in computer vision with broad applications ranging from medical imaging to satellite analysis. The ability to reconstruct high-resolution images from low-resolution inputs is crucial for enhancing downstream tasks such as object detection and segmentation. While deep learning has significantly advanced SR, achieving high-quality reconstructions with fine-grained details and realistic textures remains challenging, particularly at high upscaling factors. Recent approaches leveraging diffusion models have demonstrated promising results, yet they often struggle to balance perceptual quality with structural fidelity. In this work, we introduce ResQu, a novel SR framework that integrates a quaternion wavelet preprocessing framework with latent diffusion models, incorporating a new quaternion wavelet- and time-aware encoder. Unlike prior methods that simply apply wavelet transforms within diffusion models, our approach enhances the conditioning process by exploiting quaternion wavelet embeddings, which are dynamically integrated at different stages of denoising. Furthermore, we also leverage the generative priors of foundation models such as Stable Diffusion. Extensive experiments on domain-specific datasets demonstrate that our method achieves outstanding SR results, outperforming existing approaches in many cases in perceptual quality and standard evaluation metrics. The code will be available after the revision process.

37. 【2505.00312】AWARE-NET: Adaptive Weighted Averaging for Robust Ensemble Network in Deepfake Detection

Link: https://arxiv.org/abs/2505.00312

Authors: Muhammad Salman, Iqra Tariq, Mishal Zulfiqar, Muqadas Jalal, Sami Aujla, Sumbal Fatima

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: increasingly important due, poses significant risks, AUC scores, synthetic media, security and trust

Abstract:Deepfake detection has become increasingly important due to the rise of synthetic media, which poses significant risks to digital identity and cyber presence for security and trust. While multiple approaches have improved detection accuracy, challenges remain in achieving consistent performance across diverse datasets and manipulation types. In response, we propose a novel two-tier ensemble framework for deepfake detection based on deep learning that hierarchically combines multiple instances of three state-of-the-art architectures: Xception, Res2Net101, and EfficientNet-B7. Our framework employs a unique approach where each architecture is instantiated three times with different initializations to enhance model diversity, followed by a learnable weighting mechanism that dynamically combines their predictions. Unlike traditional fixed-weight ensembles, our first-tier averages predictions within each architecture family to reduce model variance, while the second tier learns optimal contribution weights through backpropagation, automatically adjusting each architecture's influence based on their detection reliability. Our experiments achieved state-of-the-art intra-dataset performance with AUC scores of 99.22% (FF++) and 100.00% (CelebDF-v2), and F1 scores of 98.06% (FF++) and 99.94% (CelebDF-v2) without augmentation. With augmentation, we achieve AUC scores of 99.47% (FF++) and 100.00% (CelebDF-v2), and F1 scores of 98.43% (FF++) and 99.95% (CelebDF-v2). The framework demonstrates robust cross-dataset generalization, achieving AUC scores of 88.20% and 72.52%, and F1 scores of 93.16% and 80.62% in cross-dataset evaluations.
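
The two-tier combination described above can be pictured as a plain forward pass: average the predictions of the instances within each architecture family, then mix the family averages with learned weights. The sketch below illustrates only that inference step. The softmax normalization of the weights, the dictionary layout, and the name `two_tier_predict` are assumptions for illustration; the abstract states only that the second-tier weights are learned through backpropagation, and that training is not shown here.

```python
import math
from typing import Dict, List


def two_tier_predict(family_probs: Dict[str, List[float]],
                     weight_logits: Dict[str, float]) -> float:
    """Combine per-instance fake probabilities into one ensemble probability."""
    names = sorted(family_probs)
    # Tier 1: average the instances of each architecture family.
    tier1 = {n: sum(family_probs[n]) / len(family_probs[n]) for n in names}
    # Tier 2: softmax over (assumed) learnable logits gives weights summing to 1.
    max_logit = max(weight_logits[n] for n in names)
    exps = {n: math.exp(weight_logits[n] - max_logit) for n in names}
    norm = sum(exps.values())
    return sum(exps[n] / norm * tier1[n] for n in names)


if __name__ == "__main__":
    probs = {
        "xception": [0.9, 0.8, 0.7],
        "res2net101": [0.6, 0.6, 0.6],
        "efficientnet_b7": [0.3, 0.3, 0.3],
    }
    equal = {"xception": 0.0, "res2net101": 0.0, "efficientnet_b7": 0.0}
    print(two_tier_predict(probs, equal))
```

With equal logits the result reduces to the plain mean of the three family averages; raising one family's logit shifts the ensemble output toward that family's average.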

38. 【2505.00308】AI-Assisted Decision-Making for Clinical Assessment of Auto-Segmented Contour Quality

Link: https://arxiv.org/abs/2505.00308

Authors: Biling Wang, Austen Maniscalco, Ti Bai, Siqiu Wang, Michael Dohopolski, Mu-Han Lin, Chenyang Shen, Dan Nguyen, Junzhou Huang, Steve Jiang, Xinlei Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Applications (stat.AP)

Keywords: Online Adaptive Radiotherapy, Online Adaptive, Bayesian Ordinal Classification, presents a Deep, emphasis on Online

Abstract:Purpose: This study presents a Deep Learning (DL)-based quality assessment (QA) approach for evaluating auto-generated contours (auto-contours) in radiotherapy, with emphasis on Online Adaptive Radiotherapy (OART). Leveraging Bayesian Ordinal Classification (BOC) and calibrated uncertainty thresholds, the method enables confident QA predictions without relying on ground truth contours or extensive manual labeling. Methods: We developed a BOC model to classify auto-contour quality and quantify prediction uncertainty. A calibration step was used to optimize uncertainty thresholds that meet clinical accuracy needs. The method was validated under three data scenarios: no manual labels, limited labels, and extensive labels. For rectum contours in prostate cancer, we applied geometric surrogate labels when manual labels were absent, transfer learning when limited, and direct supervision when ample labels were available. Results: The BOC model delivered robust performance across all scenarios. Fine-tuning with just 30 manual labels and calibrating with 34 subjects yielded over 90% accuracy on test data. Using the calibrated threshold, over 93% of the auto-contours' qualities were accurately predicted in over 98% of cases, reducing unnecessary manual reviews and highlighting cases needing correction. Conclusion: The proposed QA model enhances contouring efficiency in OART by reducing manual workload and enabling fast, informed clinical decisions. Through uncertainty quantification, it ensures safer, more reliable radiotherapy workflows.

39. 【2505.00295】Fine-grained spatial-temporal perception for gas leak segmentation

Link: https://arxiv.org/abs/2505.00295

Authors: Xinlong Zhao, Shan Du

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: pose significant risks, leaks pose significant, Gas leaks pose, pose significant, significant risks

Comments: 6 pages, 4 figures, ICIP 2025 Conference

Abstract:Gas leaks pose significant risks to human health and the environment. Despite long-standing concerns, there are limited methods that can efficiently and accurately detect and segment leaks due to their concealed appearance and random shapes. In this paper, we propose a Fine-grained Spatial-Temporal Perception (FGSTP) algorithm for gas leak segmentation. FGSTP captures critical motion clues across frames and integrates them with refined object features in an end-to-end network. Specifically, we first construct a correlation volume to capture motion information between consecutive frames. Then, the fine-grained perception progressively refines the object-level features using previous outputs. Finally, a decoder is employed to optimize boundary segmentation. Because there is no highly precise labeled dataset for gas leak segmentation, we manually label a gas leak video dataset, GasVid. Experimental results on GasVid demonstrate that our model excels in segmenting non-rigid objects such as gas leaks, generating the most accurate mask compared to other state-of-the-art (SOTA) models.

40. 【2505.00275】AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care

Link: https://arxiv.org/abs/2505.00275

Authors: Md Asaduzzaman Jabin, Hanqi Jiang, Yiwei Li, Patrick Kaggwa, Eugene Douglass, Juliet N. Sekandi, Tianming Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: avert disease progression, decrease mortality rates, Chronic diseases, necessitate rigorous adherence, disease progression

Abstract:Chronic diseases, including diabetes, hypertension, asthma, HIV-AIDS, epilepsy, and tuberculosis, necessitate rigorous adherence to medication to avert disease progression, manage symptoms, and decrease mortality rates. Adherence is frequently undermined by factors including patient behavior, caregiver support, elevated medical costs, and insufficient healthcare infrastructure. We propose AdCare-VLM, a specialized Video-LLaVA-based multimodal large vision language model (LVLM) aimed at visual question answering (VQA) concerning medication adherence through patient videos. We employ a private dataset comprising 806 custom-annotated tuberculosis (TB) medication monitoring videos, which have been labeled by clinical experts, to fine-tune the model for adherence pattern detection. We present LLM-TB-VQA, a detailed medical adherence VQA dataset that encompasses positive, negative, and ambiguous adherence cases. Our method identifies correlations between visual features, such as the clear visibility of the patient's face, medication, water intake, and the act of ingestion, and their associated medical concepts in captions. This facilitates the integration of aligned visual-linguistic representations and improves multimodal interactions. Experimental results indicate that our method surpasses parameter-efficient fine-tuning (PEFT) enabled VLM models, such as LLaVA-V1.5 and Chat-UniVi, with absolute improvements ranging from 3.1% to 3.54% across pre-trained, regular, and low-rank adaptation (LoRA) configurations. Comprehensive ablation studies and attention map visualizations substantiate our approach, enhancing interpretability.

41. 【2505.00259】Pack-PTQ: Advancing Post-training Quantization of Neural Networks by Pack-wise Reconstruction

Link: https://arxiv.org/abs/2505.00259

Authors: Changjun Li, Runqing Jiang, Zhuo Song, Pengpeng Yu, Ye Zhang, Yulan Guo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: compressing complex models, small calibration dataset, complex models, dataset and avoids, Post-training quantization

Abstract:Post-training quantization (PTQ) has evolved as a prominent solution for compressing complex models, which advocates a small calibration dataset and avoids end-to-end retraining. However, most existing PTQ methods employ block-wise reconstruction, which neglects cross-block dependency and exhibits a notable accuracy drop in low-bit cases. To address these limitations, this paper presents a novel PTQ method, dubbed Pack-PTQ. First, we design a Hessian-guided adaptive packing mechanism to partition blocks into non-overlapping packs, which serve as the base unit for reconstruction, thereby preserving the cross-block dependency and enabling accurate estimation of quantization parameters. Second, based on the pack configuration, we propose a mixed-precision quantization approach to assign varied bit-widths to packs according to their distinct sensitivities, thereby further enhancing performance. Extensive experiments on 2D image and 3D point cloud classification tasks, using various network architectures, demonstrate the superiority of our method over the state-of-the-art PTQ methods.

42. 【2505.00254】Empowering Agentic Video Analytics Systems with Video Language Models

Link: https://arxiv.org/abs/2505.00254

Authors: Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Yuqing Yang, Lili Qiu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: increasingly pivotal, AI-driven video analytics, video, Event Knowledge Graphs, AVA

Comments: 15 pages

Abstract:AI-driven video analytics has become increasingly pivotal across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Video-Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%.

43. 【2505.00228】ReXGradient-160K: A Large-Scale Publicly Available Dataset of Chest Radiographs with Free-text Reports

Link: https://arxiv.org/abs/2505.00228

Authors: Xiaoman Zhang, Julián N. Acosta, Josh Miller, Ouwen Huang, Pranav Rajpurkar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: chest X-ray, chest X-ray dataset, chest X-ray studies, representing the largest, largest publicly

Abstract:We present ReXGradient-160K, representing the largest publicly available chest X-ray dataset to date in terms of the number of patients. This dataset contains 160,000 chest X-ray studies with paired radiological reports from 109,487 unique patients across 3 U.S. health systems (79 medical sites). This comprehensive dataset includes multiple images per study and detailed radiology reports, making it particularly valuable for the development and evaluation of AI systems for medical imaging and automated report generation models. The dataset is divided into training (140,000 studies), validation (10,000 studies), and public test (10,000 studies) sets, with an additional private test set (10,000 studies) reserved for model evaluation on the ReXrank benchmark. By providing this extensive dataset, we aim to accelerate research in medical imaging AI and advance the state-of-the-art in automated radiological analysis. Our dataset will be open-sourced at this https URL.

44. 【2505.00220】Towards Robust and Generalizable Gerchberg Saxton based Physics Inspired Neural Networks for Computer Generated Holography: A Sensitivity Analysis Framework

Link: https://arxiv.org/abs/2505.00220

Authors: Ankit Amrutkar, Björn Kampa, Volkmar Schulz, Johannes Stegmaier, Markus Rothermel, Dorit Merhof

Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)

Keywords: holographic augmented reality, Computer-generated holography, systems neuroscience, augmented reality, optical trapping

Abstract:Computer-generated holography (CGH) enables applications in holographic augmented reality (AR), 3D displays, systems neuroscience, and optical trapping. The fundamental challenge in CGH is solving the inverse problem of phase retrieval from intensity measurements. Physics-inspired neural networks (PINNs), especially Gerchberg-Saxton-based PINNs (GS-PINNs), have advanced phase retrieval capabilities. However, their performance strongly depends on forward models (FMs) and their hyperparameters (FMHs), limiting generalization, complicating benchmarking, and hindering hardware optimization. We present a systematic sensitivity analysis framework based on Saltelli's extension of Sobol's method to quantify FMH impacts on GS-PINN performance. Our analysis demonstrates that SLM pixel-resolution is the primary factor affecting neural network sensitivity, followed by pixel-pitch, propagation distance, and wavelength. Free space propagation forward models demonstrate superior neural network performance compared to Fourier holography, providing enhanced parameterization and generalization. We introduce a composite evaluation metric combining performance consistency, generalization capability, and hyperparameter perturbation resilience, establishing a unified benchmarking standard across CGH configurations. Our research connects physics-inspired deep learning theory with practical CGH implementations through concrete guidelines for forward model selection, neural network architecture, and performance evaluation. Our contributions advance the development of robust, interpretable, and generalizable neural networks for diverse holographic applications, supporting evidence-based decisions in CGH research and implementation.

45. 【2505.00209】Direct Motion Models for Assessing Generated Videos

Link: https://arxiv.org/abs/2505.00209

Authors: Kelsey Allen, Carl Doersch, Guangyao Zhou, Mohammed Suhail, Danny Driess, Ignacio Rocco, Yulia Rubanova, Thomas Kipf, Mehdi S. M. Sajjadi, Kevin Murphy, Joao Carreira, Sjoerd van Steenkiste

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: current limitation, popular methods, generate plausible, captured by FVD, methods for evaluating

Comments: Project page: [this http URL](http://trajan-paper.github.io)

Abstract:A current limitation of generative video models is that they generate plausible-looking frames but poor motion, an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used not only to compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), but also to evaluate the motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data, and can predict human evaluations of temporal consistency and realism in generated videos obtained from open-source models better than a wide range of alternatives. We also show that by using a point track representation, we can spatiotemporally localize generative video inconsistencies, providing extra interpretability of generated video errors relative to prior work. An overview of the results and link to the code can be found on the project page: this http URL.

46. 【2505.00186】Neuroevolution of Self-Attention Over Proto-Objects

Link: https://arxiv.org/abs/2505.00186

Authors: Rafael C. Pinto, Anderson R. Tavares

Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: traditional attention mechanisms, attention mechanisms based, common visual properties, share common visual, offer a promising

Comments: 9 pages, 16 figures, GECCO

Abstract:Proto-objects, image regions that share common visual properties, offer a promising alternative to traditional attention mechanisms based on rectangular-shaped image patches in neural networks. Although previous work demonstrated that evolving a patch-based hard-attention module alongside a controller network could achieve state-of-the-art performance in visual reinforcement learning tasks, our approach leverages image segmentation to work with higher-level features. By operating on proto-objects rather than fixed patches, we significantly reduce the representational complexity: each image decomposes into fewer proto-objects than regular patches, and each proto-object can be efficiently encoded as a compact feature vector. This enables a substantially smaller self-attention module that processes richer semantic information. Our experiments demonstrate that this proto-object-based approach matches or exceeds the state-of-the-art performance of patch-based implementations with 62% fewer parameters and 2.6 times less training time.

47. 【2505.00156】V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving

Link: https://arxiv.org/abs/2505.00156

Authors: Jannik Lübberstedt, Esteban Rivera, Nico Uhlemann, Markus Lienkamp

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Language Models, Large Vision Language, Large Language Models, shown strong capabilities, Language Models

Abstract:Large Vision Language Models (LVLMs) have shown strong capabilities in understanding and analyzing visual scenes across various domains. However, in the context of autonomous driving, their limited comprehension of 3D environments restricts their effectiveness in achieving a complete and safe understanding of dynamic surroundings. To address this, we introduce V3LMA, a novel approach that enhances 3D scene understanding by integrating Large Language Models (LLMs) with LVLMs. V3LMA leverages textual descriptions generated from object detections and video inputs, significantly boosting performance without requiring fine-tuning. Through a dedicated preprocessing pipeline that extracts 3D object data, our method improves situational awareness and decision-making in complex traffic scenarios, achieving a score of 0.56 on the LingoQA benchmark. We further explore different fusion strategies and token combinations with the goal of advancing the interpretation of traffic scenes, ultimately enabling safer autonomous driving systems.

48. 【2505.00150】Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models

Link: https://arxiv.org/abs/2505.00150

Authors: Minh-Hao Van, Xintao Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: provided enhanced communication, enhanced communication channels, thoughts and opinions, rapid evolution, evolution of social

49. 【2505.00135】Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis

Link: https://arxiv.org/abs/2505.00135

Authors: Michal Geyer, Omer Tov, Linyi Jin, Richard Tucker, Inbar Mosseri, Tali Dekel, Noah Snavely

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: immersive visual experiences, interest in stereoscopic, rising popularity, popularity of immersive, immersive visual

Abstract:The rising popularity of immersive visual experiences has increased interest in stereoscopic 3D video generation. Despite significant advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data. We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator. Given an input video, our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect. Prior and concurrent approaches for this task typically operate in multiple phases, first estimating video disparity or depth, then warping the video accordingly to produce a second view, and finally inpainting the disoccluded regions. This approach inherently fails when the scene involves specular surfaces or transparent objects. In such cases, single-layer disparity estimation is insufficient, resulting in artifacts and incorrect pixel shifts during warping. Our work bypasses these restrictions by directly synthesizing the new viewpoint, avoiding any intermediate steps. This is achieved by leveraging a pre-trained video model's priors on geometry, object materials, optics, and semantics, without relying on external geometry models or manually disentangling geometry from the synthesis process. We demonstrate the advantages of our approach in complex, real-world scenarios featuring diverse object materials and compositions. See videos on this https URL

50. 【2505.00134】Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design

Link: https://arxiv.org/abs/2505.00134

Authors: Vasudev Sharma, Ahmed Alagha, Abdelhakim Khellaf, Vincent Quoc-Huy Trinh, Mahdi S. Hosseini

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained significant attention, multimodal learning capabilities, enhance big-data analytics, Vision-language models, gained significant

Abstract:Vision-language models (VLMs) have gained significant attention in computational pathology due to their multimodal learning capabilities that enhance big-data analytics of giga-pixel whole slide images (WSIs). However, their sensitivity to large-scale clinical data, task formulations, and prompt design remains an open question, particularly in terms of diagnostic accuracy. In this paper, we present a systematic investigation and analysis of three state-of-the-art VLMs for histopathology, namely Quilt-Net, Quilt-LLAVA, and CONCH, on an in-house digestive pathology dataset comprising 3,507 WSIs, each in giga-pixel form, across distinct tissue types. Through a structured ablative study on cancer invasiveness and dysplasia status, we develop a comprehensive prompt engineering framework that systematically varies domain specificity, anatomical precision, instructional framing, and output constraints. Our findings demonstrate that prompt engineering significantly impacts model performance, with the CONCH model achieving the highest accuracy when provided with precise anatomical references. Additionally, we identify the critical importance of anatomical context in histopathological image analysis, as performance consistently degraded when reducing anatomical precision. We also show that model complexity alone does not guarantee superior performance, as effective domain alignment and domain-specific training are critical. These results establish foundational guidelines for prompt engineering in computational pathology and highlight the potential of VLMs to enhance diagnostic accuracy when properly instructed with domain-appropriate prompts.

51. 【2505.00063】GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

Link: https://arxiv.org/abs/2505.00063

Authors: Siqi Li, Yufan Shen, Xiangnan Chen, Jiayi Chen, Hengwei Ju, Haodong Duan, Song Mao, Hongbin Zhou, Bo Zhang, Pinlong Cai, Licheng Wen, Botian Shi, Yong Liu, Xinyu Cai, Yu Qiao

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal large language, large language models, creating a wide, General Document Intelligence, rapid advancement

52. 【2505.00044】Learning to Borrow Features for Improved Detection of Small Objects in Single-Shot Detectors

Link: https://arxiv.org/abs/2505.00044

Authors: Richard Schmit

Categories: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

Keywords: Detecting small objects, single-shot object detectors, object detectors due, Feature Matching Block, Feature Representing Block

Abstract:Detecting small objects remains a significant challenge in single-shot object detectors due to the inherent trade-off between spatial resolution and semantic richness in convolutional feature maps. To address this issue, we propose a novel framework that enables small object representations to "borrow" discriminative features from larger, semantically richer instances within the same class. Our architecture introduces three key components: the Feature Matching Block (FMB) to identify semantically similar descriptors across layers, the Feature Representing Block (FRB) to generate enhanced shallow features through weighted aggregation, and the Feature Fusion Block (FFB) to refine feature maps by integrating original, borrowed, and context information. Built upon the SSD framework, our method improves the descriptive capacity of shallow layers while maintaining real-time detection performance. Experimental results demonstrate that our approach significantly boosts small object detection accuracy over baseline methods, offering a promising direction for robust object detection in complex visual environments.

53. 【2504.21707】Recursive KL Divergence Optimization: A Dynamic Framework for Representation Learning

Link: https://arxiv.org/abs/2504.21707

Authors: Anthony D Martin

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)

Keywords: localized conditional distributions, Information Contrastive Learning, I-Con unify multiple, divergence alignment processes, fixed neighborhood conditionals

Abstract:We propose a generalization of modern representation learning objectives by reframing them as recursive divergence alignment processes over localized conditional distributions. While recent frameworks like Information Contrastive Learning (I-Con) unify multiple learning paradigms through KL divergence between fixed neighborhood conditionals, we argue this view underplays a crucial recursive structure inherent in the learning process. We introduce Recursive KL Divergence Optimization (RKDO), a dynamic formalism where representation learning is framed as the evolution of KL divergences across data neighborhoods. This formulation captures contrastive, clustering, and dimensionality reduction methods as static slices while offering a new path to model stability and local adaptation. Our experiments demonstrate that RKDO offers dual efficiency advantages: approximately 30 percent lower loss values compared to static approaches across three different datasets, and a 60 to 80 percent reduction in the computational resources needed to achieve comparable results. This suggests that RKDO's recursive updating mechanism provides a fundamentally more efficient optimization landscape for representation learning, with significant implications for resource-constrained applications.
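
To make the "KL divergence between neighborhood conditionals" in the abstract above concrete, here is a minimal sketch of the static quantity: the average KL divergence between a target neighborhood distribution and a learned one for each data point. The function names and toy distributions are assumptions for illustration, and the recursive update that RKDO adds is not shown because the abstract does not specify it.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_neighborhood_kl(P, Q):
    """Average KL(p_i || q_i) over data points i, where row i of P is the target
    neighborhood distribution and row i of Q is the learned one (hypothetical names)."""
    return sum(kl_divergence(p, q) for p, q in zip(P, Q)) / len(P)


# Two data points, each with a distribution over two candidate neighbors.
P = [[1.0, 0.0], [0.5, 0.5]]
Q = [[0.5, 0.5], [0.5, 0.5]]
loss = mean_neighborhood_kl(P, Q)
```

In this toy case only the first point contributes, so the mean is half of log 2.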

54. 【2411.15923】Deep Learning for automated multi-scale functional field boundaries extraction using multi-date Sentinel-2 and PlanetScope imagery: Case Study of Netherlands and Pakistan

Link: https://arxiv.org/abs/2411.15923

Authors: Saba Zahid, Sajid Ghuffar, Obaid-ur-Rehman, Syed Roshaan Ali Shah

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: multi-temporal satellite imagery, Pakistan, semantic segmentation architecture, learning semantic segmentation, Netherlands

Comments: 9 pages, to be published

Abstract:This study explores the effectiveness of multi-temporal satellite imagery for better functional field boundary delineation using a deep learning semantic segmentation architecture on two distinct geographical and multi-scale farming systems, those of the Netherlands and Pakistan. Multi-date images from April, August, and October 2022 were acquired from PlanetScope and Sentinel-2 for subregions of the Netherlands, and from November 2022 and February and March 2023 for a selected area of Dunyapur in Pakistan. For the Netherlands, the Basic registration crop parcels (BRP) vector layer was used as labeled training data, while self-crafted field boundary vector data were utilized for Pakistan. Four deep learning models with the UNET architecture were evaluated using different combinations of multi-date images and NDVI stacks in the Netherlands subregions. A comparative analysis of IoU scores assessed the effectiveness of the proposed multi-date NDVI stack approach. These findings were then applied for transfer learning, using pre-trained models from the Netherlands on the selected area in Pakistan. Additionally, separate models were trained using self-crafted field boundary data for Pakistan, and combined models were developed using data from both the Netherlands and Pakistan. Results indicate that multi-date NDVI stacks provide additional temporal context, reflecting crop growth over different times of the season. The study underscores the critical role of multi-scale ground information from diverse geographical areas in developing robust and universally applicable models for field boundary delineation. The results also highlight the importance of fine spatial resolution for extraction of field boundaries in regions with small-scale farming. The findings can be extended to multi-scale implementations for improved automatic field boundary delineation in heterogeneous agricultural environments.
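
The two quantities this abstract leans on, NDVI stacks and IoU scores, are standard and easy to sketch. The snippet below computes per-pixel NDVI as (NIR - Red) / (NIR + Red), stacks it across acquisition dates, and scores a predicted boundary mask with intersection over union. The function names and toy rasters are assumptions for illustration; the paper's band handling and preprocessing are not described in the abstract.

```python
def ndvi(nir, red):
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red); 0.0 where both bands are zero."""
    return [
        [(n - r) / (n + r) if (n + r) != 0 else 0.0 for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir, red)
    ]


def ndvi_stack(dates):
    """Stack NDVI rasters from several (nir, red) acquisitions, one channel per date."""
    return [ndvi(nir, red) for nir, red in dates]


def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks given as nested lists."""
    inter = union = 0
    for p_row, t_row in zip(pred_mask, true_mask):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0


# Two toy 1x2 acquisitions (e.g. an early-season and a late-season date).
stack = ndvi_stack([
    ([[0.5, 0.3]], [[0.1, 0.3]]),
    ([[0.8, 0.0]], [[0.2, 0.0]]),
])
score = iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

A segmentation model would take `stack` as its multi-channel input, one channel per date, which is how the temporal context mentioned in the abstract enters the network.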

55. 【2505.00687】GuideSR: Rethinking Guidance for One-Step High-Fidelity Diffusion-Based Super-Resolution

Link: https://arxiv.org/abs/2505.00687

Authors: Aditya Arora, Zhengzhong Tu, Yufei Wang, Ruizheng Bai, Jian Wang, Sizhuo Ma

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: model specifically designed, diffusion-based image super-resolution, enhance image fidelity, Guidance Branch, Image Guidance Network

Abstract:In this paper, we propose GuideSR, a novel single-step diffusion-based image super-resolution (SR) model specifically designed to enhance image fidelity. Existing diffusion-based SR approaches typically adapt pre-trained generative models to image restoration tasks by adding extra conditioning on a VAE-downsampled representation of the degraded input, which often compromises structural fidelity. GuideSR addresses this limitation by introducing a dual-branch architecture comprising: (1) a Guidance Branch that preserves high-fidelity structures from the original-resolution degraded input, and (2) a Diffusion Branch, which uses a pre-trained latent diffusion model to enhance perceptual quality. Unlike conventional conditioning mechanisms, our Guidance Branch features a tailored structure for image restoration tasks, combining Full Resolution Blocks (FRBs) with channel attention and an Image Guidance Network (IGN) with guided attention. By embedding detailed structural information directly into the restoration pipeline, GuideSR produces sharper and more visually consistent results. Extensive experiments on benchmark datasets demonstrate that GuideSR achieves state-of-the-art performance while maintaining the low computational cost of single-step approaches, with up to 1.39dB PSNR gain on challenging real-world datasets. Our approach consistently outperforms existing methods across various reference-based metrics including PSNR, SSIM, LPIPS, DISTS and FID, further representing a practical advancement for real-world image restoration.

56. 【2505.00643】Deep Learning Assisted Outer Volume Removal for Highly-Accelerated Real-Time Dynamic MRI

Link: https://arxiv.org/abs/2505.00643

Authors: Merve Gülle, Sebastian Weingärtner, Mehmet Akçakaya

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Keywords: dynamic MRI plays, rapid physiological processes, capturing rapid physiological, offering unique insights, cine MRI

Abstract:Real-time (RT) dynamic MRI plays a vital role in capturing rapid physiological processes, offering unique insights into organ motion and function. Among these applications, RT cine MRI is particularly important for functional assessment of the heart with high temporal resolution. RT imaging enables free-breathing, ungated imaging of cardiac motion, making it a crucial alternative for patients who cannot tolerate conventional breath-hold, ECG-gated acquisitions. However, achieving high acceleration rates in RT cine MRI is challenging due to aliasing artifacts from extra-cardiac tissues, particularly at high undersampling factors. In this study, we propose a novel outer volume removal (OVR) method to address this challenge by eliminating aliasing contributions from non-cardiac regions in a post-processing framework. Our approach estimates the outer volume signal for each timeframe using composite temporal images from time-interleaved undersampling patterns, which inherently contain pseudo-periodic ghosting artifacts. A deep learning (DL) model is trained to identify and remove these artifacts, producing a clean outer volume estimate that is subsequently subtracted from the corresponding k-space data. The final reconstruction is performed with a physics-driven DL (PD-DL) method trained using an OVR-specific loss function to restore high spatio-temporal resolution images. Experimental results show that the proposed method at high accelerations achieves image quality that is visually comparable to clinical baseline images, while outperforming conventional reconstruction techniques, both qualitatively and quantitatively. The proposed approach provides a practical and effective solution for artifact reduction in RT cine MRI without requiring acquisition modifications, offering a pathway to higher acceleration rates while preserving diagnostic quality.

57. 【2505.00525】A Methodological and Structural Review of Parkinsons Disease Detection Across Diverse Data Modalities

Link: https://arxiv.org/abs/2505.00525

Authors: Abu Saleh Musa Miah, Taro Suzuki, Jungpil Shin

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: mild cognitive impairment, progressive neurological disorder, primarily affects motor, affects motor functions, Parkinsons Disease

Abstract:Parkinson's Disease (PD) is a progressive neurological disorder that primarily affects motor functions and can lead to mild cognitive impairment (MCI) and dementia in its advanced stages. With approximately 10 million people diagnosed globally (1 to 1.8 per 1,000 individuals, according to reports by the Japan Times and the Parkinson Foundation), early and accurate diagnosis of PD is crucial for improving patient outcomes. While numerous studies have utilized machine learning (ML) and deep learning (DL) techniques for PD recognition, existing surveys are limited in scope, often focusing on single data modalities and failing to capture the potential of multimodal approaches. To address these gaps, this study presents a comprehensive review of PD recognition systems across diverse data modalities, including Magnetic Resonance Imaging (MRI), gait-based pose analysis, gait sensory data, handwriting analysis, speech test data, Electroencephalography (EEG), and multimodal fusion techniques. Based on over 347 articles from leading scientific databases, this review examines key aspects such as data collection methods, settings, feature representations, and system performance, with a focus on recognition accuracy and robustness. This survey aims to serve as a comprehensive resource for researchers, providing actionable guidance for the development of next-generation PD recognition systems. By leveraging diverse data modalities and cutting-edge machine learning paradigms, this work contributes to advancing the state of PD diagnostics and improving patient care through innovative, multimodal approaches.

58. 【2505.00462】CORSTITCH - A free, open source software for stitching and georeferencing underwater coral reef videos

Link: https://arxiv.org/abs/2505.00462

Authors: Julian Christopher L. Maya, Johnenn R. Manalang, Maricor N. Soriano

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated Rapid Reef, Rapid Reef Assessment, Automated Rapid, Reef Assessment System, video transects obtained

Abstract:CorStitch is an open-source software developed to automate the creation of accurate georeferenced reef mosaics from video transects obtained through Automated Rapid Reef Assessment System surveys. We utilized a Fourier-based image correlation algorithm to stitch sequential video frames, aligning them with synchronized GNSS timestamps. The resulting compressed Keyhole Markup Language files, compatible with geographic information systems such as Google Earth, enable detailed spatial analysis. Validation through comparative analysis of mosaics from two temporally distinct surveys of the same reef demonstrated the software's consistent and reliable performance.
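
The abstract above mentions a Fourier-based image correlation algorithm for aligning sequential frames. A common member of that family is phase correlation, sketched below in one dimension with a naive DFT so that it runs on the standard library alone. This is an illustrative stand-in under stated assumptions: the abstract does not say which correlation variant CorStitch uses, real frames would need a 2-D FFT, and the names `dft` and `phase_correlation_shift` are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; adequate for a short demo signal."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]
    return [v / n for v in out] if inverse else out


def phase_correlation_shift(a, b):
    """Estimate the circular shift s such that b[k] == a[(k - s) % n]."""
    A, B = dft(a), dft(b)
    cross = []
    for fa, fb in zip(A, B):
        c = fb * fa.conjugate()
        mag = abs(c)
        # Keep only the phase; drop frequencies that carry no energy.
        cross.append(c / mag if mag > 1e-12 else 0j)
    corr = dft(cross, inverse=True)
    # The correlation peaks at the index equal to the shift.
    return max(range(len(corr)), key=lambda k: corr[k].real)


row_a = [0, 1, 4, 2, 0, 0, 0, 0]   # intensity profile from one frame
row_b = [0, 0, 0, 0, 1, 4, 2, 0]   # same profile, displaced by 3 samples
shift = phase_correlation_shift(row_a, row_b)
```

In a stitching pipeline, the estimated shift between consecutive frames would determine where each new frame is pasted into the growing mosaic.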

59. 【2505.00374】Towards Lightweight Hyperspectral Image Super-Resolution with Depthwise Separable Dilated Convolutional Network

Link: https://arxiv.org/abs/2505.00374

Authors: Usman Muhammad, Jorma Laaksonen, Lyudmila Mihaylova

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep neural networks, Deep neural, demonstrated highly competitive, mappings from low-resolution, demonstrated highly

Comments:

Abstract:Deep neural networks have demonstrated highly competitive performance in super-resolution (SR) for natural images by learning mappings from low-resolution (LR) to high-resolution (HR) images. However, hyperspectral super-resolution remains an ill-posed problem due to the high spectral dimensionality of the data and the scarcity of available training samples. Moreover, existing methods often rely on large models with a high number of parameters or require the fusion with panchromatic or RGB images, both of which are often impractical in real-world scenarios. Inspired by the MobileNet architecture, we introduce a lightweight depthwise separable dilated convolutional network (DSDCN) to address the aforementioned challenges. Specifically, our model leverages multiple depthwise separable convolutions, similar to the MobileNet architecture, and further incorporates a dilated convolution fusion block to make the model more flexible for the extraction of both spatial and spectral features. In addition, we propose a custom loss function that combines mean squared error (MSE), an L2 norm regularization-based constraint, and a spectral angle-based loss, ensuring the preservation of both spectral and spatial details. The proposed model achieves very competitive performance on two publicly available hyperspectral datasets, making it well-suited for hyperspectral image super-resolution tasks. The source code is publicly available at: this https URL.
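
The loss described in the abstract combines three terms (MSE, an L2-norm constraint, and a spectral angle term), but the abstract does not give the exact formulation. The NumPy sketch below shows one plausible way to assemble such a loss. Everything specific here is an assumption rather than the authors' definition: the function names, the weighting factors `lam_l2` and `lam_sam`, and the choice to apply the L2 term to model weights.

```python
import numpy as np

def spectral_angle(pred, target, eps=1e-8):
    """Mean angle (radians) between per-pixel spectra; inputs are (H, W, bands)."""
    dot = np.sum(pred * target, axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    # Clip to the valid arccos domain to guard against rounding error.
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))

def combined_loss(pred, target, weights, lam_l2=1e-4, lam_sam=0.1):
    """Hypothetical combined loss: MSE + lam_l2 * ||weights||^2 + lam_sam * spectral angle.

    `weights` is a list of parameter arrays; treating the L2 term as a
    weight penalty is an assumption made for this illustration.
    """
    mse = float(np.mean((pred - target) ** 2))
    l2 = float(sum(np.sum(w ** 2) for w in weights))
    return mse + lam_l2 * l2 + lam_sam * spectral_angle(pred, target)

# Toy example: a 4x4 image with 31 bands and a slightly perturbed prediction.
rng = np.random.default_rng(0)
target = rng.random((4, 4, 31))
pred = target + 0.01 * rng.standard_normal(target.shape)
print(combined_loss(pred, target, weights=[rng.standard_normal(10)]))
```

The spectral angle term is insensitive to per-pixel brightness scaling, which is why it is often paired with MSE: MSE penalizes intensity errors, while the angle term penalizes distortion of the spectral shape.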

60. 【2505.00133】Efficient and robust 3D blind harmonization for large domain gaps

Link: https://arxiv.org/abs/2505.00133

Authors: Hwihun Jeong, Hayeon Lee, Se Young Chun, Jongho Lee

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieve scale-invariant representations, target domain data, domain data, source domain data, target domain

Comments:

Abstract:Blind harmonization has emerged as a promising technique for MR image harmonization to achieve scale-invariant representations, requiring only target domain data (i.e., no source domain data necessary). However, existing methods face limitations such as inter-slice heterogeneity in 3D, moderate image quality, and limited performance for a large domain gap. To address these challenges, we introduce BlindHarmonyDiff, a novel blind 3D harmonization framework that leverages an edge-to-image model tailored specifically to harmonization. Our framework employs a 3D rectified flow trained on target domain images to reconstruct the original image from an edge map, and then yields a harmonized image from the edge map of a source domain image. We propose multi-stride patch training for efficient 3D training and a refinement module for robust inference by suppressing hallucination. Extensive experiments demonstrate that BlindHarmonyDiff outperforms prior art by harmonizing diverse source domain images to the target domain, achieving higher correspondence to the target domain characteristics. Downstream task-based quality assessments such as tissue segmentation and age prediction on diverse MR scanners further confirm the effectiveness of our approach and demonstrate the capability of our robust and generalizable blind harmonization.

61. 【2505.00115】Rootlets-based registration to the spinal cord PAM50 template

Link: https://arxiv.org/abs/2505.00115

Authors: Sandrine Bédard, Jan Valošek, Valeria Oliva, Kenneth A. Weber II, Julien Cohen-Adad

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: studies require precise, require precise localization, MRI studies require, Spinal cord, Spinal

Comments:

Abstract:Spinal cord functional MRI studies require precise localization of spinal levels for reliable voxelwise group analyses. Traditional template-based registration of the spinal cord uses intervertebral discs for alignment. However, substantial anatomical variability across individuals exists between vertebral and spinal levels. This study proposes a novel registration approach that leverages spinal nerve rootlets to improve alignment accuracy and reproducibility across individuals. We developed a registration method leveraging dorsal cervical rootlets segmentation and aligning them non-linearly with the PAM50 spinal cord template. Validation was performed on a multi-subject, multi-site dataset (n=267, 44 sites) and a multi-subject dataset with various neck positions (n=10, 3 sessions). We further validated the method on task-based functional MRI (n=23) to compare group-level activation maps using rootlet-based registration to traditional disc-based methods. Rootlet-based registration showed superior alignment across individuals compared to the traditional disc-based method. Notably, rootlet positions were more stable across neck positions. Group-level analysis of task-based functional MRI using rootlet-based registration increased Z scores and activation cluster size compared to disc-based registration (number of active voxels from 3292 to 7978). Rootlet-based registration enhances both inter- and intra-subject anatomical alignment and yields better spatial normalization for group-level fMRI analyses. Our findings highlight the potential of rootlet-based registration to improve the precision and reliability of spinal cord neuroimaging group analysis.

62. 【2505.00046】SR-NeRV: Improving Embedding Efficiency of Neural Video Representation via Super-Resolution

Link: https://arxiv.org/abs/2505.00046

Authors: Taiga Hayami, Kakeru Koizumi, Hiroshi Watanabe

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: garnered significant attention, Implicit Neural Representations, Implicit Neural, model complex signals, variety of domains

Comments:

Abstract:Implicit Neural Representations (INRs) have garnered significant attention for their ability to model complex signals across a variety of domains. Recently, INR-based approaches have emerged as promising frameworks for neural video compression. While conventional methods primarily focus on embedding video content into compact neural networks for efficient representation, they often struggle to reconstruct high-frequency details under stringent model size constraints, which are critical in practical compression scenarios. To address this limitation, we propose an INR-based video representation method that integrates a general-purpose super-resolution (SR) network. Motivated by the observation that high-frequency components exhibit low temporal redundancy across frames, our method entrusts the reconstruction of fine details to the SR network. Experimental results demonstrate that the proposed method outperforms conventional INR-based baselines in terms of reconstruction quality, while maintaining comparable model sizes.