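This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

517 papers were updated today, including:

  • Natural language processing: 90
  • Information retrieval: 2
  • Computer vision: 126

Natural Language Processing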

1. 【2504.01018】Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization

Link: https://arxiv.org/abs/2504.01018

Authors: Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng

Categories: Computation and Language (cs.CL)

Keywords: reducing distractions, distractions from low-quality, Selective retrieval, knowledge, RAG

Comments: Work in Progress

Abstract:Selective retrieval improves retrieval-augmented generation (RAG) by reducing distractions from low-quality retrievals and improving efficiency. However, existing approaches under-utilize the inherent knowledge of large language models (LLMs), leading to suboptimal retrieval decisions and degraded generation performance. To bridge this gap, we propose Self-Routing RAG (SR-RAG), a novel framework that binds selective retrieval with knowledge verbalization. SR-RAG enables an LLM to dynamically decide between external retrieval and verbalizing its own parametric knowledge. To this end, we design a multi-task objective that jointly optimizes an LLM on knowledge source selection, knowledge verbalization, and response generation. We further introduce dynamic knowledge source inference via nearest neighbor search to improve the accuracy of knowledge source decision under domain shifts. Fine-tuning three LLMs with SR-RAG significantly improves both their response accuracy and inference latency. Compared to the strongest selective retrieval baseline, SR-RAG reduces retrievals by 29% while improving the performance by 5.1%.

2. 【2504.01005】When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning

Link: https://arxiv.org/abs/2504.01005

Authors: Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large language models, mathematical problem-solving, key strategy, strategy for enhancing, enhancing the reasoning

Comments: 29 pages

Abstract:Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks like mathematical problem-solving. A traditional approach, Self-Consistency (SC), generates multiple solutions to a problem and selects the most common answer via majority voting. Another common method involves scoring each solution with a reward model (verifier) and choosing the best one. Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis. Specifically, GenRM generates multiple verification chains-of-thought to score each solution. Under a limited inference budget, this introduces a fundamental trade-off: should you spend the budget on scaling solutions via SC or generate fewer solutions and allocate compute to verification via GenRM? To address this, we evaluate GenRM against SC under a fixed inference budget. Interestingly, we find that SC is more compute-efficient than GenRM for most practical inference budgets across diverse models and datasets. For instance, GenRM first matches SC after consuming up to 8x the inference compute and requires significantly more compute to outperform it. Furthermore, we derive inference scaling laws for the GenRM paradigm, revealing that compute-optimal inference favors scaling solution generation more aggressively than scaling the number of verifications. Our work provides practical guidance on optimizing test-time scaling by balancing solution generation and verification. The code is available at this https URL.
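
The two selection strategies contrasted in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the sampled answers, the verifier scores, and all function names are hypothetical, and the call-count helper assumes that every solution and every verification costs one generation call.

```python
from collections import Counter


def self_consistency(answers):
    """Majority voting: return the most common final answer."""
    counts = Counter(answers)
    # most_common(1) returns [(answer, count)].
    return counts.most_common(1)[0][0]


def best_of_n(answers, scores):
    """Verifier-based selection: return the answer with the highest score."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


def generation_calls(num_solutions, num_verifications):
    """Assumed cost model: one call per solution plus one per verification of it."""
    return num_solutions * (1 + num_verifications)


# Hypothetical final answers from five sampled solutions, with made-up scores.
sampled = ["42", "41", "42", "40", "42"]
verifier_scores = [0.61, 0.90, 0.55, 0.20, 0.58]

print(self_consistency(sampled))
print(best_of_n(sampled, verifier_scores))
# Under the assumed cost model, 4 solutions with 8 verifications each use the
# same number of calls as 36 solutions with no verification.
print(generation_calls(4, 8), generation_calls(36, 0))
```

Under a fixed budget the question is which side of that last comparison gives better accuracy; the abstract reports that plain voting wins for most practical budgets.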

3. 【2504.01002】Token embeddings violate the manifold hypothesis

Link: https://arxiv.org/abs/2504.01002

Authors: Michael Robinson, Sourya Dey, Tony Chiang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: fully understand, understand the behavior, input space, requires our understanding, input space differs

Comments: 20 pages, 10 figures

Abstract:Fully understanding the behavior of a large language model (LLM) requires understanding its input space. If this input space differs from our assumption, our understanding of and conclusions about the LLM are likely flawed, regardless of its architecture. Here, we elucidate the structure of the token embeddings, the input domain for LLMs, both empirically and theoretically. We present a generalized and statistically testable model where the neighborhood of each token splits into well-defined signal and noise dimensions. This model is based on a generalization of a manifold called a fiber bundle, so we denote our hypothesis test as the "fiber bundle null." Failing to reject the null is uninformative, but rejecting it at a specific token indicates that token has a statistically significant local structure, and so is of interest to us. By running our test over several open-source LLMs, each with unique token embeddings, we find that the null is frequently rejected, and so the token subspace is provably not a fiber bundle and hence also not a manifold. As a consequence of our findings, when an LLM is presented with two semantically equivalent prompts, and if one prompt contains a token implicated by our test, that prompt will likely exhibit more output variability proportional to the local signal dimension of the token.

4. 【2504.01001】Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models

Link: https://arxiv.org/abs/2504.01001

Authors: José Pombal, Nuno M. Guerreiro, Ricardo Rei, André F. T. Martins

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: evaluating them automatically, capable of performing, performing more complex, benchmarks, test data creation

Abstract:As language models improve and become capable of performing more complex tasks across modalities, evaluating them automatically becomes increasingly challenging. Developing strong and robust task-specific automatic metrics gets harder, and human-annotated test sets -- which are expensive to create -- saturate more quickly. A compelling alternative is to design reliable strategies to automate the creation of test data and evaluation, but previous attempts either rely on pre-existing data, or focus solely on individual tasks. We present Zero-shot Benchmarking (ZSB), a framework for creating high-quality benchmarks for any task by leveraging language models for both synthetic test data creation and evaluation. ZSB is simple and flexible: it requires only the creation of a prompt for data generation and one for evaluation; it is scalable to tasks and languages where collecting real-world data is costly or impractical; it is model-agnostic, allowing the creation of increasingly challenging benchmarks as models improve. To assess the effectiveness of our framework, we create benchmarks for five text-only tasks and a multi-modal one: general capabilities in four languages (English, Chinese, French, and Korean), translation, and general vision-language capabilities in English. We then rank a broad range of open and closed systems on our benchmarks. ZSB rankings consistently correlate strongly with human rankings, outperforming widely-adopted standard benchmarks. Through ablations, we find that strong benchmarks can be created with open models, and that judge model size and dataset variety are crucial drivers of performance. We release all our benchmarks, and code to reproduce our experiments and to produce new benchmarks.

5. 【2504.00993】MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs

Link: https://arxiv.org/abs/2504.00993

Authors: Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, Yihan Cao, Hui Ren, Xiang Li, Xiaoxiao Li, Yuyin Zhou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: treatment planning require, planning require precise, Medical, reasoning, life-critical domains

Abstract:Medical tasks such as diagnosis and treatment planning require precise and complex reasoning, particularly in life-critical domains. Unlike mathematical reasoning, medical reasoning demands meticulous, verifiable thought processes to ensure reliability and accuracy. However, there is a notable lack of datasets that provide transparent, step-by-step reasoning to validate and enhance the medical reasoning ability of AI models. To bridge this gap, we introduce MedReason, a large-scale high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in large language models (LLMs). We utilize a structured medical knowledge graph (KG) to convert clinical QA pairs into logical chains of reasoning, or "thinking paths", which trace connections from question elements to answers via relevant KG entities. Each path is validated for consistency with clinical logic and evidence-based medicine. Our pipeline generates detailed reasoning for various medical questions from 7 medical datasets, resulting in a dataset of 32,682 question-answer pairs, each with detailed, step-by-step explanations. Experiments demonstrate that fine-tuning with our dataset consistently boosts medical problem-solving capabilities, achieving significant gains of up to 7.7% for DeepSeek-Distill-8B. Our top-performing model, MedReason-8B, outperforms Huatuo-o1-8B, a state-of-the-art medical reasoning model, by up to 4.2% on the clinical benchmark MedBullets. We also engage medical professionals from diverse specialties to assess our dataset's quality, ensuring MedReason offers accurate and coherent medical reasoning. Our data, models, and code will be publicly available.

6. 【2504.00977】Chinese Grammatical Error Correction: A Survey

Link: https://arxiv.org/abs/2504.00977

Authors: Mengyang Qiu, Qingyu Gao, Linxuan Yang, Yang Gu, Tran Minh Nguyen, Zihao Huang, Jungyeul Park

Categories: Computation and Language (cs.CL)

Keywords: Natural Language Processing, Grammatical Error Correction, automated writing assistance, task in Natural, Chinese Grammatical Error

Abstract:Chinese Grammatical Error Correction (CGEC) is a critical task in Natural Language Processing, addressing the growing demand for automated writing assistance in both second-language (L2) and native (L1) Chinese writing. While L2 learners struggle with mastering complex grammatical structures, L1 users also benefit from CGEC in academic, professional, and formal contexts where writing precision is essential. This survey provides a comprehensive review of CGEC research, covering datasets, annotation schemes, evaluation methodologies, and system advancements. We examine widely used CGEC datasets, highlighting their characteristics, limitations, and the need for improved standardization. We also analyze error annotation frameworks, discussing challenges such as word segmentation ambiguity and the classification of Chinese-specific error types. Furthermore, we review evaluation metrics, focusing on their adaptation from English GEC to Chinese, including character-level scoring and the use of multiple references. In terms of system development, we trace the evolution from rule-based and statistical approaches to neural architectures, including Transformer-based models and the integration of large pre-trained language models. By consolidating existing research and identifying key challenges, this survey provides insights into the current state of CGEC and outlines future directions, including refining annotation standards to address segmentation challenges, and leveraging multilingual approaches to enhance CGEC.

7. 【2504.00970】SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching

Link: https://arxiv.org/abs/2504.00970

Authors: Yuxuan Zhu, Ali Falahati, David H. Yang, Mohammad Mohammadi Amiri

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, Large language, face significant computational, language models face, models face significant

Abstract:Large language models face significant computational and memory challenges when processing long contexts. During inference, efficient management of the key-value (KV) cache, which stores intermediate activations for autoregressive generation, is critical to reducing memory overhead and improving computational efficiency. Traditional token-level efficient KV caching methods overlook semantic information, treating tokens independently without considering their semantic relationships. Meanwhile, existing semantic-preserving KV cache management approaches often suffer from substantial memory usage and high time-to-first-token. To address these limitations, we propose SentenceKV, a novel sentence-level semantic KV caching approach designed to enhance inference efficiency while preserving semantic coherence. During prefilling, SentenceKV groups tokens based on sentence-level semantic similarity, compressing sentence representations into concise semantic vectors stored directly on the GPU, while individual KV pairs are offloaded to CPU. During decoding, SentenceKV generates tokens by selectively retrieving semantically relevant sentence-level KV entries, leveraging the semantic similarity between the prefilling-stage semantic vectors and decoding-stage queries. This ensures efficient and contextually accurate predictions, minimizing the loading of redundant or irrelevant data into GPU memory and significantly reducing memory overhead while maintaining stable inference latency, even for extremely long contexts. Extensive evaluations on benchmarks including PG-19, LongBench, and Needle-In-A-Haystack demonstrate that SentenceKV significantly outperforms state-of-the-art methods in both efficiency and memory usage, without compromising model accuracy.
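
The decoding-time retrieval step described above (compare a query against one semantic vector per sentence, then load only the matching sentences' KV entries) can be illustrated roughly as follows. This is a sketch under stated assumptions, not the SentenceKV implementation: mean-pooling the key vectors, using cosine similarity, and the names `mean_vector` and `select_sentences` are all choices made here for illustration, since the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Assumed compression: average a sentence's key vectors into one vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def select_sentences(query, sentence_vectors, top_k):
    """Return the indices of the top_k most similar sentences, in text order."""
    ranked = sorted(
        range(len(sentence_vectors)),
        key=lambda i: cosine(query, sentence_vectors[i]),
        reverse=True,
    )
    return sorted(ranked[:top_k])


# Toy prefill: three sentences, each with two 2-dimensional key vectors.
sentence_keys = [
    [[1.0, 0.0], [0.9, 0.1]],  # sentence 0
    [[0.0, 1.0], [0.1, 0.9]],  # sentence 1
    [[0.7, 0.7], [0.6, 0.8]],  # sentence 2
]
sentence_vectors = [mean_vector(keys) for keys in sentence_keys]

# Toy decoding step: only the selected sentences' KV pairs would be fetched.
query = [0.0, 1.0]
selected = select_sentences(query, sentence_vectors, top_k=2)
print(selected)
```

In this picture only the small per-sentence vectors need to stay on the GPU, while the full key-value pairs of unselected sentences remain offloaded.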

8. 【2504.00942】Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models?

Link: https://arxiv.org/abs/2504.00942

Authors: Anna Bavaresco, Raquel Fernández

Categories: Computation and Language (cs.CL)

Keywords: text representations learnt, Computational Linguistics, grounded in images, grounded in real-world, images or audio

Abstract:A common assumption in Computational Linguistics is that text representations learnt by multimodal models are richer and more human-like than those by language-only models, as they are grounded in images or audio -- similar to how human language is grounded in real-world experiences. However, empirical studies checking whether this is true are largely lacking. We address this gap by comparing word representations from contrastive multimodal models vs. language-only ones in the extent to which they capture experiential information -- as defined by an existing norm-based 'experiential model' -- and align with human fMRI responses. Our results indicate that, surprisingly, language-only models are superior to multimodal ones in both respects. Additionally, they learn more unique brain-relevant semantic information beyond that shared with the experiential model. Overall, our study highlights the need to develop computational models that better integrate the complementary semantic information provided by multimodal data sources.

9. 【2504.00939】WikiVideo: Article Generation from Multiple Videos

Link: https://arxiv.org/abs/2504.00939

Authors: Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: high-level Wikipedia-style article, high-level Wikipedia-style, Wikipedia-style article, political elections, present the challenging

Comments: Repo can be found here: [this https URL](https://github.com/alexmartin1722/wikivideo)

Abstract:We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.

10. 【2504.00934】InformGen: An AI Copilot for Accurate and Compliant Clinical Research Consent Document Generation

Link: https://arxiv.org/abs/2504.00934

Authors: Zifeng Wang, Junyi Gao, Benjamin Danek, Brandon Theodorou, Ruba Shaik, Shivashankar Thati, Seunghyun Won, Jimeng Sun

Categories: Computation and Language (cs.CL)

Keywords: Leveraging large language, informed consent forms, significant challenge due, Leveraging large, generate high-stakes documents

Abstract:Leveraging large language models (LLMs) to generate high-stakes documents, such as informed consent forms (ICFs), remains a significant challenge due to the extreme need for regulatory compliance and factual accuracy. Here, we present InformGen, an LLM-driven copilot for accurate and compliant ICF drafting by optimized knowledge document parsing and content generation, with humans in the loop. We further construct a benchmark dataset comprising protocols and ICFs from 900 clinical trials. Experimental results demonstrate that InformGen achieves near 100% compliance with 18 core regulatory rules derived from FDA guidelines, outperforming a vanilla GPT-4o model by up to 30%. Additionally, a user study with five annotators shows that InformGen, when integrated with manual intervention, attains over 90% factual accuracy, significantly surpassing the vanilla GPT-4o model's 57%-82%. Crucially, InformGen ensures traceability by providing inline citations to source protocols, enabling easy verification and maintaining the highest standards of factual integrity.

11. 【2504.00928】Taxonomizing Representational Harms using Speech Act Theory

Link: https://arxiv.org/abs/2504.00928

Authors: Emily Corvi, Hannah Washington, Stefanie Reed, Chad Atalla, Alexandra Chouldechova, P. Alex Dow, Jean Garcia-Gathright, Nicholas Pangakis, Emily Sheng, Dan Vann, Matthew Vogel, Hanna Wallach

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: generative language systems, fairness-related harms caused, widely recognized, recognized among fairness-related, caused by generative

Abstract:Representational harms are widely recognized among fairness-related harms caused by generative language systems. However, their definitions are commonly under-specified. We present a framework, grounded in speech act theory (Austin, 1962), that conceptualizes representational harms caused by generative language systems as the perlocutionary effects (i.e., real-world impacts) of particular types of illocutionary acts (i.e., system behaviors). Building on this argument and drawing on relevant literature from linguistic anthropology and sociolinguistics, we provide new definitions of stereotyping, demeaning, and erasure. We then use our framework to develop a granular taxonomy of illocutionary acts that cause representational harms, going beyond the high-level taxonomies presented in previous work. We also discuss the ways that our framework and taxonomy can support the development of valid measurement instruments. Finally, we demonstrate the utility of our framework and taxonomy via a case study that engages with recent conceptual debates about what constitutes a representational harm and how such harms should be measured.

12. 【2504.00927】Multi-Token Attention

Link: https://arxiv.org/abs/2504.00927

Authors: Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar

Categories: Computation and Language (cs.CL)

Keywords: critical mechanism powering, mechanism powering LLMs, Soft attention, attention, critical mechanism

Abstract:Soft attention is a critical mechanism powering LLMs to locate relevant parts within a given context. However, individual attention weights are determined by the similarity of only a single query and key token vector. This "single token attention" bottlenecks the amount of information used in distinguishing a relevant part from the rest of the context. To address this issue, we propose a new attention method, Multi-Token Attention (MTA), which allows LLMs to condition their attention weights on multiple query and key vectors simultaneously. This is achieved by applying convolution operations over queries, keys and heads, allowing nearby queries and keys to affect each other's attention weights for more precise attention. As a result, our method can locate relevant context using richer, more nuanced information that can exceed a single vector's capacity. Through extensive evaluations, we demonstrate that MTA achieves enhanced performance on a range of popular benchmarks. Notably, it outperforms Transformer baseline models on standard language modeling tasks, and on tasks that require searching for information within long contexts, where our method's ability to leverage richer information proves particularly beneficial.
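
One way to picture the idea of letting nearby queries and keys influence each other's attention weights is to convolve the attention logits before the softmax. The sketch below is a simplified reading of the abstract, not the paper's method: it uses a single head, leaves out the convolution over heads entirely, and the kernel values, the causal masking scheme, and the function names are assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of numbers."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def logits(queries, keys):
    """Standard attention logits: one dot product per (query, key) pair."""
    return [[sum(a * b for a, b in zip(q, k)) for k in keys] for q in queries]


def conv_logits(scores, kernel):
    """Mix each logit with the logits of earlier queries and earlier keys."""
    size = len(scores)
    mixed = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):  # causal: key j may not come after query i
            total = 0.0
            for a, kernel_row in enumerate(kernel):   # a: offset back in queries
                for b, weight in enumerate(kernel_row):  # b: offset back in keys
                    qi, kj = i - a, j - b
                    if 0 <= kj <= qi:  # stay in range and respect causality
                        total += weight * scores[qi][kj]
            mixed[i][j] = total
    return mixed


def causal_softmax(scores):
    """Softmax over the visible prefix of each row; masked positions get 0."""
    return [
        softmax(row[: i + 1]) + [0.0] * (len(row) - i - 1)
        for i, row in enumerate(scores)
    ]


# Toy sequence of three tokens with 2-dimensional queries and keys.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
kernel = [[1.0, 0.5], [0.5, 0.25]]  # hypothetical 2x2 mixing weights

base = logits(queries, keys)
weights = causal_softmax(conv_logits(base, kernel))
for row in weights:
    print([round(w, 3) for w in row])
```

With the 1x1 kernel `[[1.0]]` the function returns the original logits, so ordinary single-token attention is the special case in which no neighbors are mixed in.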

13. 【2504.00914】On the Robustness of Agentic Function Calling

Link: https://arxiv.org/abs/2504.00914

Authors: Ella Rabinovich, Ateret Anaby-Tavor

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, invoke specific tools, capabilities enabling, Language Models

Comments: 7 pages, TrustNLP@NAACL25

Abstract:Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating best-performing FC models on a carefully expanded subset of the Berkeley function calling leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies, and highlight areas for improvement in real-world agentic deployments.

14. 【2504.00906】Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Link: https://arxiv.org/abs/2504.00906

Authors: Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: graphical user interfaces, enhance human productivity, offering significant potential, automate digital tasks, agents automate digital

Comments: 18 pages, 13 figures, 8 tables

Abstract:Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluation. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld relatively. Code available at this https URL.

15. 【2504.00891】GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

Link: https://arxiv.org/abs/2504.00891

Authors: Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, Bowen Zhou

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Recent advancements, advancements in Large, utilize Process Reward

Abstract:Recent advancements in Large Language Models (LLMs) have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training data from MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities to serve as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs. Our code, model, and data will be available in this https URL.

16. 【2504.00882】CrackSQL: A Hybrid SQL Dialect Translation System Powered by Large Language Models

Link: https://arxiv.org/abs/2504.00882

Authors: Wei Zhou, Yuyang Gao, Xuanhe Zhou, Guoliang Li

Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: enabling seamless interaction, heterogeneous database systems, Dialect translation plays, plays a key, key role

Comments: Extension of our SIGMOD 2025 paper. Please refer to source code available at: [this https URL](https://github.com/weAIDB/CrackSQL)

Abstract:Dialect translation plays a key role in enabling seamless interaction across heterogeneous database systems. However, translating SQL queries between different dialects (e.g., from PostgreSQL to MySQL) remains a challenging task due to syntactic discrepancies and subtle semantic variations. Existing approaches including manual rewriting, rule-based systems, and large language model (LLM)-based techniques often involve high maintenance effort (e.g., crafting custom translation rules) or produce unreliable results (e.g., LLM generates non-existent functions), especially when handling complex queries. In this demonstration, we present CrackSQL, the first hybrid SQL dialect translation system that combines rule and LLM-based methods to overcome these limitations. CrackSQL leverages the adaptability of LLMs to minimize manual intervention, while enhancing translation accuracy by segmenting lengthy complex SQL via functionality-based query processing. To further improve robustness, it incorporates a novel cross-dialect syntax embedding model for precise syntax alignment, as well as an adaptive local-to-global translation strategy that effectively resolves interdependent query operations. CrackSQL supports three translation modes and offers multiple deployment and access options including a web console interface, a PyPI package, and a command-line prompt, facilitating adoption across a variety of real-world use cases.

17. 【2504.00869】m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models

Link: https://arxiv.org/abs/2504.00869

Authors: Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Test-time scaling, large language models, medical, reasoning, powerful technique

Comments: 17 pages; 7 figures; Data, code, and models: [this https URL](https://github.com/UCSC-VLAA/m1)

Abstract:Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model's medical reasoning capability at inference. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs. However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve the overall medical QA performance and, in some cases, even introduces errors into previously correct responses. Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, rather than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.

18. 【2504.00860】Investigating the Capabilities and Limitations of Machine Learning for Identifying Bias in English Language Data with Information and Heritage Professionals

Link: https://arxiv.org/abs/2504.00860

Authors: Lucy Havens, Benjamin Bach, Melissa Terras, Beatrice Alex

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: harm already-marginalized people, already-marginalized people, numerous efforts, efforts to mitigate, systems continue

Comments: Accepted to the 2025 CHI Conference on Human Factors in Computing Systems (CHI '25)

Abstract:Despite numerous efforts to mitigate their biases, ML systems continue to harm already-marginalized people. While predominant ML approaches assume bias can be removed and fair models can be created, we show that these are not always possible, nor desirable, goals. We reframe the problem of ML bias by creating models to identify biased language, drawing attention to a dataset's biases rather than trying to remove them. Then, through a workshop, we evaluated the models for a specific use case: workflows of information and heritage professionals. Our findings demonstrate the limitations of ML for identifying bias due to its contextual nature, the way in which approaches to mitigating it can simultaneously privilege and oppress different communities, and its inevitability. We demonstrate the need to expand ML approaches to bias and fairness, providing a mixed-methods approach to investigating the feasibility of removing bias or achieving fairness in a given ML use case.

19. 【2504.00829】How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study

Link: https://arxiv.org/abs/2504.00829

Authors: Yunjie Ji, Sitong Zhao, Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Yiping Peng, Han Zhao, Xiangang Li

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, artificial intelligence research, intelligence research, Language Models

Abstract:Enhancing the reasoning capabilities of Large Language Models (LLMs) with efficiency and scalability remains a fundamental challenge in artificial intelligence research. This paper presents a rigorous experimental investigation into how difficulty-aware staged reinforcement learning (RL) strategies can substantially improve LLM reasoning performance. Through systematic analysis, we demonstrate that strategically selecting training data according to well-defined difficulty levels markedly enhances RL optimization. Moreover, we introduce a staged training methodology, progressively exposing models to increasingly challenging tasks, further amplifying reasoning capabilities. Our findings reveal significant cross-domain benefits when simultaneously training models on mathematical reasoning and code generation tasks. Notably, our proposed approach enables a 1.5B parameter model to achieve an accuracy of 42.3% on the AIME-2024 benchmark and 89.5% on the MATH-500 benchmark. These results underscore the efficacy of our method in advancing the reasoning proficiency of LLMs. We will open-source our datasets on GitHub and Hugging Face.

20. 【2504.00824】ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations

Link: https://arxiv.org/abs/2504.00824

Authors: Yubo Wang, Xueguang Ma, Ping Nie, Huaye Zeng, Zhiheng Lyu, Yuxuan Zhang, Benjamin Schneider, Yi Lu, Xiang Yue, Wenhu Chen

Categories: Computation and Language (cs.CL)

Keywords: coherent text generation, Academic writing requires, requires both coherent, Academic writing, Academic

Abstract:Academic writing requires both coherent text generation and precise citation of relevant literature. Although recent Retrieval-Augmented Generation (RAG) systems have significantly improved factual accuracy in general-purpose text generation, their capacity to adequately support professional academic writing remains limited. In this work, we introduce ScholarCopilot, a unified framework designed to enhance existing large language models for generating professional academic articles with accurate and contextually relevant citations. ScholarCopilot dynamically determines when to retrieve scholarly references by generating a retrieval token [RET], and then utilizes its representation to look up relevant citations from a database. The retrieved references are fed into the model to augment the generation process. We jointly optimize both the generation and citation tasks within a single framework to increase efficiency. Trained on 500K papers from arXiv, our model achieves a top-1 retrieval accuracy of 40.1% on our evaluation dataset, outperforming baselines such as E5-Mistral-7B-Instruct (15.0%) and BM25 (9.8%). On a dataset of 1,000 academic writing samples, ScholarCopilot scores 16.2/25 in generation quality (measured across relevance, coherence, academic rigor, completeness, and innovation), surpassing models with 10x more parameters such as Qwen-2.5-72B-Instruct (15.8/25). Human studies also confirm ScholarCopilot's superior performance in citation recall, writing efficiency, and overall user experience, confirming the effectiveness of our approach.

21. 【2504.00810】Z1: Efficient Test-time Scaling with Code

Link: https://arxiv.org/abs/2504.00810

Authors: Zhaojian Yu, Yinghao Wu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, entails longer contexts, achieve enhanced complex, enhanced complex problem-solving

Abstract:Large Language Models (LLMs) can achieve enhanced complex problem-solving through test-time computing scaling, yet this often entails longer contexts and numerous reasoning token costs. In this paper, we propose an efficient test-time scaling method that trains LLMs on code-related reasoning trajectories, facilitating their reduction of excess thinking tokens while maintaining performance. First, we create Z1-Code-Reasoning-107K, a curated dataset of simple and complex coding problems paired with their short and long solution trajectories. Second, we present a novel Shifted Thinking Window to mitigate overthinking overhead by removing context-delimiting tags (e.g., <think>...</think>) and capping reasoning tokens. Trained with long and short trajectory data and equipped with Shifted Thinking Window, our model, Z1-7B, demonstrates the ability to adjust its reasoning level to the complexity of problems and exhibits efficient test-time scaling across different reasoning tasks that matches R1-Distill-Qwen-7B performance with about 30% of its average thinking tokens. Notably, fine-tuned with only code trajectories, Z1-7B demonstrates generalization to broader reasoning tasks (47.5% on GPQA Diamond). Our analysis of efficient reasoning elicitation also provides valuable insights for future research.

22. 【2504.00799】Inaccuracy of an E-Dictionary and Its Influence on Chinese Language Users

Link: https://arxiv.org/abs/2504.00799

Authors: Xi Wang, Fanfei Meng, Shiyang Zhang, Lan Li

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: largely replaced paper, replaced paper dictionaries, learners seeking, expand their vocabulary, largely replaced

Comments: 13 pages, presented at ASIALEX 2023 (The 15th International Conference of the Asian Association for Lexicography), Yonsei University, Seoul, Korea

Abstract:Electronic dictionaries have largely replaced paper dictionaries and become central tools for L2 learners seeking to expand their vocabulary. Users often assume these resources are reliable and rarely question the validity of the definitions provided. The accuracy of major E-dictionaries is seldom scrutinized, and little attention has been paid to how their corpora are constructed. Research on dictionary use, particularly the limitations of electronic dictionaries, remains scarce. This study adopts a combined method of experimentation, user survey, and dictionary critique to examine Youdao, one of the most widely used E-dictionaries in China. The experiment involved a translation task paired with retrospective reflection. Participants were asked to translate sentences containing words that are insufficiently or inaccurately defined in Youdao. Their consultation behavior was recorded to analyze how faulty definitions influenced comprehension. Results show that incomplete or misleading definitions can cause serious misunderstandings. Additionally, students exhibited problematic consultation habits. The study further explores how such flawed definitions originate, highlighting issues in data processing and the integration of AI and machine learning technologies in dictionary construction. The findings suggest a need for better training in dictionary literacy for users, as well as improvements in the underlying AI models used to build E-dictionaries.

23. 【2504.00780】Digitally Supported Analysis of Spontaneous Speech (DigiSpon): Benchmarking NLP-Supported Language Sample Analysis of Swiss Children's Speech

Link: https://arxiv.org/abs/2504.00780

Authors: Anja Ryser, Yingqiang Gao, Sarah Ebling

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: complements standardized psychometric, standardized psychometric tests, developmental language disorder, Language sample analysis, sample analysis

Abstract:Language sample analysis (LSA) is a process that complements standardized psychometric tests for diagnosing, for example, developmental language disorder (DLD) in children. However, its labor-intensive nature has limited its use in speech-language pathology practice. We introduce an approach that leverages natural language processing (NLP) methods not based on commercial large language models (LLMs) applied to transcribed speech data from 119 children in the German speaking part of Switzerland with typical and atypical language development. The study aims to identify optimal practices that support speech-language pathologists in diagnosing DLD more efficiently within a human-in-the-loop framework, without relying on potentially unethical implementations that leverage commercial LLMs. Preliminary findings underscore the potential of integrating locally deployed NLP methods into the process of semi-automatic LSA.

24. 【2504.00767】Automated Explanation of Machine Learning Models of Footballing Actions in Words

Link: https://arxiv.org/abs/2504.00767

Authors: Pegah Rahimian, Jernej Flisar, David Sumpter

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: analysts assess performance, machine learning practice, coaching staff talk, assess performance, analytics has changed

Abstract:While football analytics has changed the way teams and analysts assess performance, there remains a communication gap between machine learning practice and how coaching staff talk about football. Coaches and practitioners require actionable insights, which are not always provided by models. To bridge this gap, we show how to build wordalizations (a novel approach that leverages large language models) for shots in football. Specifically, we first build an expected goals model using logistic regression. We then use the co-efficients of this regression model to write sentences describing how factors (such as distance, angle and defensive pressure) contribute to the model's prediction. Finally, we use large language models to give an entertaining description of the shot. We describe our approach in a model card and provide an interactive open-source application describing shots in recent tournaments. We discuss how shot wordalisations might aid communication in coaching and football commentary, and give a further example of how the same approach can be applied to other actions in football.
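
The step from regression coefficients to sentences can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the feature names, the coefficient values, the intercept, and the sentence template are hypothetical and are not taken from the paper or its application.

```python
import math

# Hypothetical coefficients for illustration only; not values from the paper.
COEFFS = {"distance_m": -0.12, "angle_rad": 1.1, "defensive_pressure": -0.6}
INTERCEPT = -0.5


def expected_goal(features):
    """Logistic-regression probability that a shot results in a goal."""
    z = INTERCEPT + sum(COEFFS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def describe_contributions(features):
    """Turn each feature's contribution (coefficient * value) into a sentence.

    Sentences are ordered from the largest to the smallest absolute contribution.
    """
    contributions = {name: COEFFS[name] * value for name, value in features.items()}
    ordered = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    sentences = []
    for name, contribution in ordered:
        direction = "raised" if contribution > 0 else "lowered"
        sentences.append(
            f"The shot's {name} {direction} the model's log-odds of scoring "
            f"by {abs(contribution):.2f}."
        )
    return sentences


shot = {"distance_m": 10.0, "angle_rad": 0.5, "defensive_pressure": 1.0}
xg = expected_goal(shot)
sentences = describe_contributions(shot)
```

In a pipeline like the one the abstract describes, sentences of this kind would then be passed to a large language model to produce the final description; that step is omitted here.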

25. 【2504.00756】RECKON: Large-scale Reference-based Efficient Knowledge Evaluation for Large Language Model

Link: https://arxiv.org/abs/2504.00756

Authors: Lin Zhang, Zhouhong Gu, Xiaoran Shi, Hongwei Feng, Yanghua Xiao

Categories: Computation and Language (cs.CL)

Keywords: large language models, efficient knowledge evaluation, large language, Large-scale Reference-based Efficient, verifying their capabilities

Abstract:As large language models (LLMs) advance, efficient knowledge evaluation becomes crucial to verifying their capabilities. Traditional methods, relying on benchmarks, face limitations such as high resource costs and information loss. We propose the Large-scale Reference-based Efficient Knowledge Evaluation for Large Language Model (RECKON), which directly uses reference data to evaluate models. RECKON organizes unstructured data into manageable units and generates targeted questions for each cluster, improving evaluation accuracy and efficiency. Experimental results show that RECKON reduces resource consumption by 56.5% compared to traditional methods while achieving over 97% accuracy across various domains, including world knowledge, code, legal, and biomedical datasets. Code is available at this https URL

26. 【2504.00752】LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models

Link: https://arxiv.org/abs/2504.00752

Authors: Sameer Sadruddin, Jennifer D'Souza, Eleni Poupaki, Alex Watkins, Hamed Babaei Giglou, Anisa Rula, Bora Karasulu, Sören Auer, Adrie Mackus, Erwin Kessels

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)

Keywords: Extracting structured information, Extracting structured, modeling real-world processes, limiting scalability, traditional schema mining

Comments: 15 pages, 3 figures, to appear in the Extended Semantic Web Conference (ESWC 2025) proceedings in the Resource track

Abstract:Extracting structured information from unstructured text is crucial for modeling real-world processes, but traditional schema mining relies on semi-structured data, limiting scalability. This paper introduces schema-miner, a novel tool that combines large language models with human feedback to automate and refine schema extraction. Through an iterative workflow, it organizes properties from text, incorporates expert input, and integrates domain-specific ontologies for semantic depth. Applied to materials science--specifically atomic layer deposition--schema-miner demonstrates that expert-guided LLMs generate semantically rich schemas suitable for diverse real-world applications.

27. 【2504.00748】IHC-LLMiner: Automated extraction of tumour immunohistochemical profiles from PubMed abstracts using large language models

Link: https://arxiv.org/abs/2504.00748

Authors: Yunsoo Kim, Michal W. S. Ong, Daniel W. Rogalsky, Manuel Rodriguez-Justo, Honghan Wu, Adam P. Levine

Categories: Computation and Language (cs.CL)

Keywords: offering critical insights, IHC-tumour profiles, offering critical, essential in diagnostic, diagnostic pathology

Comments: currently under review

Abstract:Immunohistochemistry (IHC) is essential in diagnostic pathology and biomedical research, offering critical insights into protein expression and tumour biology. This study presents an automated pipeline, IHC-LLMiner, for extracting IHC-tumour profiles from PubMed abstracts, leveraging advanced biomedical text mining. There are two subtasks: abstract classification (include/exclude as relevant) and IHC-tumour profile extraction on relevant included abstracts. The best-performing model, "Gemma-2 finetuned", achieved 91.5% accuracy and an F1 score of 91.4, outperforming GPT4-O by 9.5% accuracy with 5.9 times faster inference time. From an initial dataset of 107,759 abstracts identified for 50 immunohistochemical markers, the classification task identified 30,481 relevant abstracts (Include) using the Gemma-2 finetuned model. For IHC-tumour profile extraction, the Gemma-2 finetuned model achieved the best performance with 63.3% Correct outputs. Extracted IHC-tumour profiles (tumour types and markers) were normalised to Unified Medical Language System (UMLS) concepts to ensure consistency and facilitate IHC-tumour profile landscape analysis. The extracted IHC-tumour profiles demonstrated excellent concordance with available online summary data and provided considerable added value in terms of both missing IHC-tumour profiles and quantitative assessments. Our proposed LLM based pipeline provides a practical solution for large-scale IHC-tumour profile data mining, enhancing the accessibility and utility of such data for research and clinical applications as well as enabling the generation of quantitative and structured data to support cancer-specific knowledge base development. Models and training datasets are available at this https URL.

28. 【2504.00725】Aplicação de Large Language Models na Análise e Síntese de Documentos Jurídicos: Uma Revisão de Literatura

Link: https://arxiv.org/abs/2504.00725

Authors: Matheus Belarmino, Rackel Coelho, Roberto Lotudo, Jayr Pereira

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, enabling the automation, optimize the analysis

Comments: in Portuguese

Abstract:Large Language Models (LLMs) have been increasingly used to optimize the analysis and synthesis of legal documents, enabling the automation of tasks such as summarization, classification, and retrieval of legal information. This study aims to conduct a systematic literature review to identify the state of the art in prompt engineering applied to LLMs in the legal context. The results indicate that models such as GPT-4, BERT, Llama 2, and Legal-Pegasus are widely employed in the legal field, and techniques such as Few-shot Learning, Zero-shot Learning, and Chain-of-Thought prompting have proven effective in improving the interpretation of legal texts. However, challenges such as biases in models and hallucinations still hinder their large-scale implementation. It is concluded that, despite the great potential of LLMs for the legal field, there is a need to improve prompt engineering strategies to ensure greater accuracy and reliability in the generated results.

29. 【2504.00698】Command A: An Enterprise-Ready Large Language Model

Link: https://arxiv.org/abs/2504.00698

Authors: Team Cohere, Aakanksha, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Yazeed Alnumay, Sophia Althammer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, Raphaël Avalos, Zahara Aviv, Sammie Bae, Saurabh Baji, Alexandre Barbet, Max Bartolo, Björn Bebensee, Neeral Beladia, Walter Beller-Morales, Alexandre Bérard, Andrew Berneshawi, Anna Bialas, Phil Blunsom, Matt Bobkin, Adi Bongale, Sam Braun, Maxime Brunet, Samuel Cahyawijaya, David Cairuz, Jon Ander Campos, Cassie Cao, Kris Cao, Roman Castagné, Julián Cendrero, Leila Chan Currie, Yash Chandak, Diane Chang, Giannis Chatziveroglou, Hongyu Chen, Claire Cheng, Alexis Chevalier, Justin T. Chiu, Eugene Cho, Eugene Choi, Eujeong Choi, Tim Chung, Volkan Cirik, Ana Cismaru, Pierre Clavier, Henry Conklin, Lucas Crawhall-Stein, Devon Crouse, Andres Felipe Cruz-Salinas, Ben Cyrus, Daniel D'souza, Hugo Dalla-Torre, John Dang, William Darling, Omar Darwiche Domingues, Saurabh Dash, Antoine Debugne, Théo Dehaze, Shaan Desai, Joan Devassy, Rishit Dholakia, Kyle Duffy, Ali Edalati, Ace Eldeib, Abdullah Elkady, Sarah Elsharkawy, Irem Ergün, Beyza Ermis, Marzieh Fadaee, Boyu Fan, Lucas Fayoux, Yannis Flet-Berliac, Nick Frosst, Matthias Gallé, Wojciech Galuba, Utsav Garg, Matthieu Geist, Mohammad Gheshlaghi Azar, Seraphina Goldfarb-Tarrant, Tomas Goldsack, Aidan Gomez, Victor Machado Gonzaga, Nithya Govindarajan, Manoj Govindassamy, Nathan Grinsztajn, Nikolas Gritsch, Patrick Gu, Shangmin Guo, Kilian Haefeli, Rod Hajjar, Tim Hawes, Jingyi He, Sebastian Hofstätter, Sungjin Hong, Sara Hooker, Tom Hosking

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: powerful large language, Retrieval Augmented Generation, enterprise use cases, large language model, language model purpose-built

Comments: 55 pages

Abstract:In this report we describe the development of Command A, a powerful large language model purpose-built to excel at real-world enterprise use cases. Command A is an agent-optimised and multilingual-capable model, with support for 23 languages of global business, and a novel hybrid architecture balancing efficiency with top of the range performance. It offers best-in-class Retrieval Augmented Generation (RAG) capabilities with grounding and tool use to automate sophisticated business processes. These abilities are achieved through a decentralised training approach, including self-refinement algorithms and model merging techniques. We also include results for Command R7B which shares capability and architectural similarities to Command A. Weights for both models have been released for research purposes. This technical report details our original training pipeline and presents an extensive evaluation of our models across a suite of enterprise-relevant tasks and public benchmarks, demonstrating excellent performance and efficiency.

30. 【2504.00695】ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection

Link: https://arxiv.org/abs/2504.00695

Authors: Xiaoxuan Zhu, Zhouhong Gu, Suhang Zheng, Tao Wang, Tianyu Li, Hongwei Feng, Yanghua Xiao

Categories: Computation and Language (cs.CL)

Keywords: necessitates enormous diverse, diverse textual corpora, enormous diverse textual, balancing computational resources, making effective data

Abstract:Pre-training large language models (LLMs) necessitates enormous diverse textual corpora, making effective data selection a key challenge for balancing computational resources and model performance. Current methodologies primarily emphasize data quality metrics and mixing proportions, yet they fail to adequately capture the underlying semantic connections between training samples and quality disparities within individual domains. We introduce ToReMi (Topic-based Reweighting for Model improvement), a novel two-stage framework that dynamically adjusts training sample weights according to their topical associations and observed learning patterns. Our comprehensive experiments reveal that ToReMi variants consistently achieve superior performance over conventional pre-training approaches, demonstrating accelerated perplexity reduction across multiple domains and enhanced capabilities on downstream evaluation tasks. Code is available at this https URL.

31. 【2504.00676】GLiNER-biomed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition

Link: https://arxiv.org/abs/2504.00676

Authors: Anthony Yazdani, Ihor Stepanov, Douglas Teodoro

Categories: Computation and Language (cs.CL)

Keywords: presents unique challenges, unique challenges due, presents unique, specialized vocabularies, Biomedical named entity

Abstract:Biomedical named entity recognition (NER) presents unique challenges due to specialized vocabularies, the sheer volume of entities, and the continuous emergence of novel entities. Traditional NER models, constrained by fixed taxonomies and human annotations, struggle to generalize beyond predefined entity types or efficiently adapt to emerging concepts. To address these issues, we introduce GLiNER-biomed, a domain-adapted suite of Generalist and Lightweight Model for NER (GLiNER) models specifically tailored for biomedical NER. In contrast to conventional approaches, GLiNER uses natural language descriptions to infer arbitrary entity types, enabling zero-shot recognition. Our approach first distills the annotation capabilities of large language models (LLMs) into a smaller, more efficient model, enabling the generation of high-coverage synthetic biomedical NER data. We subsequently train two GLiNER architectures, uni- and bi-encoder, at multiple scales to balance computational efficiency and recognition performance. Evaluations on several biomedical datasets demonstrate that GLiNER-biomed outperforms state-of-the-art GLiNER models in both zero- and few-shot scenarios, achieving 5.96% improvement in F1-score over the strongest baseline. Ablation studies highlight the effectiveness of our synthetic data generation strategy and emphasize the complementary benefits of synthetic biomedical pre-training combined with fine-tuning on high-quality general-domain annotations. All datasets, models, and training pipelines are publicly available at this https URL.

32. 【2504.00664】Do LLMs Surpass Encoders for Biomedical NER?

Link: https://arxiv.org/abs/2504.00664

Authors: Motasem S Obeidat, Md Sultan Al Nahian, Ramakanth Kavuluru

Categories: Computation and Language (cs.CL)

Keywords: Recognizing spans, named entity recognition, drug or gene, free text, NER

Comments: Accepted to appear in IEEE ICHI 2025

Abstract:Recognizing spans of biomedical concepts and their types (e.g., drug or gene) in free text, often called biomedical named entity recognition (NER), is a basic component of information extraction (IE) pipelines. Without a strong NER component, other applications, such as knowledge discovery and information retrieval, are not practical. State-of-the-art in NER shifted from traditional ML models to deep neural networks with transformer-based encoder models (e.g., BERT) emerging as the current standard. However, decoder models (also called large language models or LLMs) are gaining traction in IE. But LLM-driven NER often ignores positional information due to the generative nature of decoder models. Furthermore, they are computationally very expensive (both in inference time and hardware needs). Hence, it is worth exploring if they actually excel at biomedical NER and assess any associated trade-offs (performance vs efficiency). This is exactly what we do in this effort employing the same BIO entity tagging scheme (that retains positional information) using five different datasets with varying proportions of longer entities. Our results show that the LLMs chosen (Mistral and Llama: 8B range) often outperform best encoder models (BERT-(un)cased, BiomedBERT, and DeBERTav3: 300M range) by 2-8% in F-scores except for one dataset, where they equal encoder performance. This gain is more prominent among longer entities of length ≥ 3 tokens. However, LLMs are one to two orders of magnitude more expensive at inference time and may need cost prohibitive hardware. Thus, when performance differences are small or real time user feedback is needed, encoder models might still be more suitable than LLMs.
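
The BIO tagging scheme mentioned in the abstract keeps positional information by labelling every token as the Beginning of an entity, Inside one, or Outside all entities. Below is a minimal sketch of the conversion in both directions; the function names, the example sentence, and the `Gene` label are hypothetical and not drawn from the paper's datasets.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    spans: list of (start, end, label) over token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_spans(tags):
    """Recover (start, end, label) spans from a BIO tag sequence."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["Patients", "received", "tumor", "necrosis", "factor", "inhibitors"]
tags = spans_to_bio(tokens, [(2, 5, "Gene")])
recovered = bio_to_spans(tags)
```

Because each tag is tied to a token index, the three-token entity above can be located exactly, which is the positional information a free-form generative output would not carry on its own.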

33. 【2504.00661】DynMoLE: Boosting Mixture of LoRA Experts Fine-Tuning with a Hybrid Routing Mechanism

Link: https://arxiv.org/abs/2504.00661

Authors: Dengchun Li, Naizheng Wang, Zihao Zhang, Haoyang Yin, Lei Duan, Meng Xiao, Mingjie Tang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: natural language processing, achieved remarkable success, large language models, Instruction-based fine-tuning, language processing

Comments: 22 pages, 7 figures

Abstract:Instruction-based fine-tuning of large language models (LLMs) has achieved remarkable success in various natural language processing (NLP) tasks. Parameter-efficient fine-tuning (PEFT) methods, such as Mixture of LoRA Experts (MoLE), combine the efficiency of Low-Rank Adaptation (LoRA) with the versatility of Mixture of Experts (MoE) models, demonstrating significant potential for handling multiple downstream tasks. However, the existing routing mechanisms for MoLE often involve a trade-off between computational efficiency and predictive accuracy, and they fail to fully address the diverse expert selection demands across different transformer layers. In this work, we propose DynMoLE, a hybrid routing strategy that dynamically adjusts expert selection based on the Tsallis entropy of the router's probability distribution. This approach mitigates router uncertainty, enhances stability, and promotes more equitable expert participation, leading to faster convergence and improved model performance. Additionally, we introduce an auxiliary loss based on Tsallis entropy to further guide the model toward convergence with reduced uncertainty, thereby improving training stability and performance. Our extensive experiments on commonsense reasoning benchmarks demonstrate that DynMoLE achieves substantial performance improvements, outperforming LoRA by 9.6% and surpassing the state-of-the-art MoLE method, MoLA, by 2.3%. We also conduct a comprehensive ablation study to evaluate the contributions of DynMoLE's key components.
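
For reference, the Tsallis entropy of a probability distribution p with index q is S_q(p) = (1 - Σ p_i^q) / (q - 1), which reduces to the Shannon entropy as q approaches 1. The sketch below computes it for a router's output and applies a switching rule. The rule itself (top-k when entropy is low, top-p otherwise) and the values of `q`, `threshold`, `k`, and `top_p` are assumptions chosen for illustration; the abstract does not specify DynMoLE's actual routing rule.

```python
import math


def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy S_q(p) = (1 - sum(p_i^q)) / (q - 1); Shannon entropy at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def select_experts(probs, q=2.0, threshold=0.5, k=2, top_p=0.9):
    """Hypothetical hybrid rule (an assumption, not the paper's method).

    A confident router (low entropy) uses plain top-k; an uncertain router
    (high entropy) keeps adding experts until their cumulative probability
    reaches top_p.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if tsallis_entropy(probs, q) <= threshold:
        return order[:k]
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return chosen


confident = select_experts([0.9, 0.05, 0.03, 0.02])
uncertain = select_experts([0.25, 0.25, 0.25, 0.25])
```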

34. 【2504.00657】News is More than a Collection of Facts: Moral Frame Preserving News Summarization

Link: https://arxiv.org/abs/2504.00657

Authors: Enrico Liscio, Michela Lorandi, Pradeep K. Murukannaiah

Categories: Computation and Language (cs.CL)

Keywords: reflect journalists' framing, collections of facts, shaping how events, reflect journalists', events are presented

Abstract:News articles are more than collections of facts; they reflect journalists' framing, shaping how events are presented to the audience. One key aspect of framing is the choice to write in (or quote verbatim) morally charged language as opposed to using neutral terms. This moral framing carries implicit judgments that automated news summarizers should recognize and preserve to maintain the original intent of the writer. In this work, we perform the first study on the preservation of moral framing in AI-generated news summaries. We propose an approach that leverages the intuition that journalists intentionally use or report specific moral-laden words, which should be retained in summaries. Through automated, crowd-sourced, and expert evaluations, we demonstrate that our approach enhances the preservation of moral framing while maintaining overall summary quality.

35. 【2504.00623】Efficient Construction of Model Family through Progressive Training Using Model Expansion

Link: https://arxiv.org/abs/2504.00623

Authors: Kazuki Yano, Sho Takase, Sosuke Kobayashi, Shun Kiyono, Jun Suzuki

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, widespread practical application, diverse computational requirements, address diverse computational

Abstract:As Large Language Models (LLMs) gain widespread practical application, providing the model family of different parameter sizes has become standard practice to address diverse computational requirements. Conventionally, each model in a family is trained independently, resulting in computational costs that scale additively with the number of models. We propose an efficient method for constructing the model family through progressive training, where smaller models are incrementally expanded to larger sizes to create a complete model family. Through extensive experiments with a model family ranging from 1B to 8B parameters, we demonstrate that our method reduces computational costs by approximately 25% while maintaining comparable performance to independently trained models. Furthermore, by strategically adjusting maximum learning rates based on model size, our method outperforms the independent training across various metrics. Beyond performance gains, our approach offers an additional advantage: models in our family tend to yield more consistent behavior across different model sizes.

36. 【2504.00597】On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2504.00597

Authors: Jirui Qi, Raquel Fernández, Arianna Bisazza

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Retrieval-augmented generation, demonstrated strong performance, large language models, tasks by leveraging, demonstrated strong

Comments: Under review at COLM 2025. All code and data are released at https://anonymous.4open.science/r/RAG-Consistency/

Abstract:Retrieval-augmented generation (RAG) with large language models (LLMs) has demonstrated strong performance in multilingual question-answering (QA) tasks by leveraging relevant passages retrieved from corpora. In multilingual RAG (mRAG), the retrieved passages can be written in languages other than that of the query entered by the user, making it challenging for LLMs to effectively utilize the provided information. Recent research suggests that retrieving passages from multilingual corpora can improve RAG performance, particularly for low-resource languages. However, the extent to which LLMs can leverage different kinds of multilingual contexts to generate accurate answers, *independently from retrieval quality*, remains understudied. In this paper, we conduct an extensive assessment of LLMs' ability to (i) make consistent use of a relevant passage regardless of its language, (ii) respond in the expected language, and (iii) focus on the relevant passage even when multiple `distracting' passages in different languages are provided in the context. Our experiments with four LLMs across three QA datasets covering a total of 48 languages reveal a surprising ability of LLMs to extract the relevant information from out-language passages, but a much weaker ability to formulate a full answer in the correct language. Our analysis, based on both accuracy and feature attribution techniques, further shows that distracting passages negatively impact answer quality regardless of their language. However, distractors in the query language exert a slightly stronger influence. Taken together, our findings deepen the understanding of how LLMs utilize context in mRAG systems, providing directions for future improvements.

37. 【2504.00595】Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources

Link: https://arxiv.org/abs/2504.00595

Authors: Weizhi Wang, Yu Tian, Linjie Yang, Heng Wang, Xifeng Yan

Subjects: Computation and Language (cs.CL)

Keywords: Multimodal Large Language, data mixture strategies, Large Language Model, pre-training faces barriers, mixture strategies

Comments

Click to view abstract

Abstract:The reproduction of state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 220 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. The Open-Qwen2VL pre-training is conducted on academic level 8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms partially-open state-of-the-art MLLM Qwen2-VL-2B on various multimodal benchmarks of MMBench, SEEDBench, MMstar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.

38. 【2504.00589】Efficient Annotator Reliability Assessment with EffiARA

Link: https://arxiv.org/abs/2504.00589

Authors: Owen Cook, Jake Vasilakes, Ian Roberts, Xingyi Song

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: machine learning pipeline, Data annotation, time-consuming process, EffiARA Python package, essential component

Comments

Click to view abstract

Abstract:Data annotation is an essential component of the machine learning pipeline; it is also a costly and time-consuming process. With the introduction of transformer-based models, annotation at the document level is increasingly popular; however, there is no standard framework for structuring such tasks. The EffiARA annotation framework is, to our knowledge, the first project to support the whole annotation pipeline, from understanding the resources required for an annotation task to compiling the annotated dataset and gaining insights into the reliability of individual annotators as well as the dataset as a whole. The framework's efficacy is supported by two previous studies: one improving classification performance through annotator-reliability-based soft label aggregation and sample weighting, and the other increasing the overall agreement among annotators by identifying and replacing an unreliable annotator. This work introduces the EffiARA Python package and its accompanying webtool, which provides an accessible graphical user interface for the system. We open-source the EffiARA Python package at this https URL and the webtool is publicly accessible at this https URL.

39. 【2504.00587】AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems

Link: https://arxiv.org/abs/2504.00587

Authors: Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, Weinan Zhang

Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, advancement of Large, solve complex tasks

Comments

Click to view abstract

Abstract:The rapid advancement of Large Language Models (LLMs) has catalyzed the development of multi-agent systems, where multiple LLM-based agents collaborate to solve complex tasks. However, existing systems predominantly rely on centralized coordination, which introduces scalability bottlenecks, limits adaptability, and creates single points of failure. Additionally, concerns over privacy and proprietary knowledge sharing hinder cross-organizational collaboration, leading to siloed expertise. To address these challenges, we propose AgentNet, a decentralized, Retrieval-Augmented Generation (RAG)-based framework that enables LLM-based agents to autonomously evolve their capabilities and collaborate efficiently in a Directed Acyclic Graph (DAG)-structured network. Unlike traditional multi-agent systems that depend on static role assignments or centralized control, AgentNet allows agents to specialize dynamically, adjust their connectivity, and route tasks without relying on predefined workflows. AgentNet's core design is built upon several key innovations: (1) Fully Decentralized Paradigm: Removing the central orchestrator, allowing agents to coordinate and specialize autonomously, fostering fault tolerance and emergent collective intelligence. (2) Dynamically Evolving Graph Topology: Real-time adaptation of agent connections based on task demands, ensuring scalability and resilience. (3) Adaptive Learning for Expertise Refinement: A retrieval-based memory system that enables agents to continuously update and refine their specialized skills. By eliminating centralized control, AgentNet enhances fault tolerance, promotes scalable specialization, and enables privacy-preserving collaboration across organizations. Through decentralized coordination and minimal data exchange, agents can leverage diverse knowledge sources while safeguarding sensitive information.

40. 【2504.00584】Enhancing Negation Awareness in Universal Text Embeddings: A Data-efficient and Computational-efficient Approach

Link: https://arxiv.org/abs/2504.00584

Authors: Hongliu Cao

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Natural Language Inference, Inference and Sentiment, natural language processing, natural language, Sentiment Analysis tasks

Comments

Click to view abstract

Abstract:Negation plays an important role in various natural language processing tasks such as Natural Language Inference and Sentiment Analysis. Numerous prior studies have found that contextual text embedding models such as BERT, ELMo, RoBERTa, or XLNet face challenges in accurately understanding negation. Recent advancements in universal text embeddings have demonstrated superior performance over contextual text embeddings in various tasks. However, due to the bias in popular evaluation benchmarks, the negation awareness capacity of these models remains unclear. To bridge the gap in existing literature, an in-depth analysis is initiated in this work to study the negation awareness of cutting-edge universal text embedding models. Our findings reveal a significant lack of negation awareness in these models, which often interpret negated text pairs as semantically similar. Because different tasks need different trade-offs between topic and negation information, among other semantic information, a data-efficient and computational-efficient embedding re-weighting method is proposed that does not modify the parameters of text embedding models. The proposed solution significantly improves text embedding models' negation awareness on both simple and complex negation understanding tasks. Furthermore, the proposed solution can also significantly improve the negation awareness of Large Language Model based task-specific high dimensional universal text embeddings.

41. 【2504.00573】Training a Utility-based Retriever Through Shared Context Attribution for Retrieval-Augmented Language Models

Link: https://arxiv.org/abs/2504.00573

Authors: Yilong Xu, Jinhua Gao, Xiaoming Yu, Yuanhai Xue, Baolong Bi, Huawei Shen, Xueqi Cheng

Subjects: Computation and Language (cs.CL)

Keywords: Retrieval-Augmented Language Models, Language Models boost, Language Models, Models boost task, Retrieval-Augmented Language

Comments: 20 pages, 9 figures. Code will be released after review

Click to view abstract

Abstract:Retrieval-Augmented Language Models (RALMs) boost task performance, owing to the retriever that provides external knowledge. Although crucial, the retriever primarily focuses on semantic relevance, which may not always be effective for generation. Thus, utility-based retrieval has emerged as a promising topic, prioritizing passages that provide valid benefits for downstream tasks. However, due to insufficient understanding, capturing passage utility accurately remains unexplored. This work proposes SCARLet, a framework for training utility-based retrievers in RALMs, which incorporates two key factors: multi-task generalization and inter-passage interaction. First, SCARLet constructs shared context on which training data for various tasks is synthesized. This mitigates semantic bias from context differences, allowing retrievers to focus on learning task-specific utility for better task generalization. Next, SCARLet uses a perturbation-based attribution method to estimate passage-level utility for shared context, which reflects interactions between passages and provides more accurate feedback. We evaluate our approach on ten datasets across various tasks, both in-domain and out-of-domain, showing that retrievers trained by SCARLet consistently improve the overall performance of RALMs.

42. 【2504.00532】SRLCG: Self-Rectified Large-Scale Code Generation with Multidimensional Chain-of-Thought and Dynamic Backtracking

Link: https://arxiv.org/abs/2504.00532

Authors: Hongru Ma, Yanjie Liang, Jiasheng Si, Weiyu Zhang, Hongjiao Guan, Chaoqun Zheng, Bing Xu, Wenpeng Lu

Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: Large language models, significantly enhancing developer, enhancing developer productivity, Large language, language models

Comments: 23 pages

Click to view abstract

Abstract:Large language models (LLMs) have revolutionized code generation, significantly enhancing developer productivity. However, for a vast number of users with minimal coding knowledge, LLMs provide little support, as they primarily generate isolated code snippets rather than complete, large-scale project code. Without coding expertise, these users struggle to interpret, modify, and iteratively refine the outputs of LLMs, making it impossible to assemble a complete project. To address this issue, we propose Self-Rectified Large-Scale Code Generator (SRLCG), a framework that generates complete multi-file project code from a single prompt. SRLCG employs a novel multidimensional chain-of-thought (CoT) and self-rectification to guide LLMs in generating correct and robust code files, then integrates them into a complete and coherent project using our proposed dynamic backtracking algorithm. Experimental results show that SRLCG generates code 15x longer than DeepSeek-V3, 16x longer than GPT-4, and at least 10x longer than other leading CoT-based baselines. Furthermore, they confirm its improved correctness, robustness, and performance compared to baselines in large-scale code generation.

43. 【2504.00509】Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?

Link: https://arxiv.org/abs/2504.00509

Authors: Kai Yan, Yufei Xu, Zhengyin Du, Xuesong Yao, Zheyu Wang, Xiaowen Guo, Jiecao Chen

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: surpassing human intelligence, rapid escalation, recent years, years have weaved, weaved a miracle

Comments: 21 pages, 3 figures, 10 tables

Click to view abstract

Abstract:The rapid escalation in the difficulty of LLM benchmarks in recent years, from elementary school-level to frontier problems, has given researchers the impression that we are only inches away from surpassing human intelligence. However, does the LLMs' remarkable reasoning ability indeed come from true intelligence by human standards, or are they simply reciting solutions witnessed during training at an Internet level? To study this problem, we propose RoR-Bench, a novel, multi-modal benchmark for detecting LLMs' recitation behavior when asked simple reasoning problems with subtly shifted conditions, and conduct empirical analysis on our benchmark. Surprisingly, we found that existing cutting-edge LLMs unanimously exhibit extremely severe recitation behavior; by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer a 60% performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call to the LLM community that compels us to re-evaluate the true intelligence level of cutting-edge LLMs.

44. 【2504.00502】ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Link: https://arxiv.org/abs/2504.00502

Authors: Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models

Comments: Project page: [this https URL](https://github.com/icip-cas/ShortV)

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at this https URL

45. 【2504.00487】FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning

Link: https://arxiv.org/abs/2504.00487

Authors: Jie Ma, Zhitao Gao, Qi Chai, Jun Liu, Pinghui Wang, Jing Tao, Zhou Su

Subjects: Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: audio-video inputs accurately, reasoning task requiring, task requiring intelligent, requiring intelligent systems, answer natural language

Comments: Under Review

Click to view abstract

Abstract:Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task requiring intelligent systems to accurately answer natural language queries based on paired audio-video inputs. However, existing AVQA approaches often suffer from overfitting to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at this https URL.

46. 【2504.00473】Making Large Language Models Better Reasoners with Orchestrated Streaming Experiences

Link: https://arxiv.org/abs/2504.00473

Authors: Xiangyang Liu, Junliang He, Xipeng Qiu

Subjects: Computation and Language (cs.CL)

Keywords: Large language models, Large language, generating intermediate thoughts, perform complex reasoning, Orchestrated Streaming Experiences

Comments: Accepted by EMNLP 2024

Click to view abstract

Abstract:Large language models (LLMs) can perform complex reasoning by generating intermediate thoughts under zero-shot or few-shot settings. However, zero-shot prompting always encounters low performance, and the superior performance of few-shot prompting hinges on manually crafted demonstrations. In this paper, we present RoSE (Reasoning with Orchestrated Streaming Experiences), a general framework for solving reasoning tasks that can self-improve without complex external efforts. To enable RoSE, we describe an architecture that extends an LLM to store all answered questions and their thoughts in a streaming experience pool and then orchestrates helpful questions from the pool to assist in answering new questions. To set up a question-aware orchestration mechanism, RoSE first calculates the similarity of each question in the pool with a new test question. Since the solution to each answered question is not always correct, RoSE will sort the questions according to their similarity with the new question, and then uniformly divide them into multiple buckets. It finally extracts one question from each bucket to make these extracted questions more diverse. To make these extracted questions help RoSE answer new questions as much as possible, we introduce two other attributes of uncertainty and complexity for each question. RoSE will preferentially select the questions with low uncertainty and high complexity from each bucket. We evaluate the versatility of RoSE in various reasoning tasks, LLMs, and CoT methods.
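The bucket-based selection described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Experience` class, the `select_demonstrations` function, the precomputed `similarity`, `uncertainty`, and `complexity` scores, and the tie-breaking rule (lowest uncertainty first, then highest complexity) are all assumptions made for illustration, since the abstract does not specify how the two attributes are combined.

```python
from dataclasses import dataclass


@dataclass
class Experience:
    """One answered question in the pool (hypothetical structure)."""
    question: str
    similarity: float   # similarity to the new question, assumed precomputed
    uncertainty: float  # lower means the stored answer is more trusted
    complexity: float   # higher means a longer reasoning chain


def select_demonstrations(pool, num_buckets):
    """Sort by similarity, split into near-equal buckets, pick one per bucket."""
    ranked = sorted(pool, key=lambda e: e.similarity, reverse=True)
    num_buckets = min(num_buckets, len(ranked))
    if num_buckets == 0:
        return []
    size, extra = divmod(len(ranked), num_buckets)
    chosen, start = [], 0
    for b in range(num_buckets):
        # The first `extra` buckets take one additional item each.
        end = start + size + (1 if b < extra else 0)
        bucket = ranked[start:end]
        # Assumed rule: prefer low uncertainty, break ties by high complexity.
        chosen.append(min(bucket, key=lambda e: (e.uncertainty, -e.complexity)))
        start = end
    return chosen
```

Taking one item per similarity bucket keeps the demonstrations diverse, while the per-bucket preference filters out answers that are likely wrong or trivially simple.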

47. 【2504.00472】Memorizing is Not Enough: Deep Knowledge Injection Through Reasoning

Link: https://arxiv.org/abs/2504.00472

Authors: Ruoxi Xu, Yunjie Ji, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Ben He, Yingfei Sun, Xiangang Li, Le Sun

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: knowledge injection, large language models, static nature leads, real world evolves, knowledge

Comments

Click to view abstract

Abstract:Although large language models (LLMs) excel in knowledge recall and reasoning, their static nature leads to outdated information as the real world evolves or when adapting to domain-specific knowledge, highlighting the need for effective knowledge injection. However, current research on knowledge injection remains superficial, mainly focusing on knowledge memorization and retrieval. This paper proposes a four-tier knowledge injection framework that systematically defines the levels of knowledge injection: memorization, retrieval, reasoning, and association. Based on this framework, we introduce DeepKnowledge, a synthetic experimental testbed designed for fine-grained evaluation of the depth of knowledge injection across three knowledge types (novel, incremental, and updated). We then explore various knowledge injection scenarios and evaluate the depth of knowledge injection for each scenario on the benchmark. Experimental results reveal key factors to reach each level of knowledge injection for LLMs and establish a mapping between the levels of knowledge injection and the corresponding suitable injection methods, aiming to provide a comprehensive approach for efficient knowledge injection across various levels.

48. 【2504.00414】Multimodal LLMs for OCR, OCR Post-Correction, and Named Entity Recognition in Historical Documents

Link: https://arxiv.org/abs/2504.00414

Authors: Gavin Greif, Niclas Griesshaber, Robin Greif

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)

Keywords: Large Language Models, multimodal Large Language, Large Language, Optical Character Recognition, Named Entity Recognition

Comments

Click to view abstract

Abstract:We explore how multimodal Large Language Models (mLLMs) can help researchers transcribe historical documents, extract relevant historical information, and construct datasets from historical sources. Specifically, we investigate the capabilities of mLLMs in performing (1) Optical Character Recognition (OCR), (2) OCR Post-Correction, and (3) Named Entity Recognition (NER) tasks on a set of city directories published in German between 1754 and 1870. First, we benchmark the off-the-shelf transcription accuracy of both mLLMs and conventional OCR models. We find that the best-performing mLLM model significantly outperforms conventional state-of-the-art OCR models and other frontier mLLMs. Second, we are the first to introduce multimodal post-correction of OCR output using mLLMs. We find that this novel approach leads to a drastic improvement in transcription accuracy and consistently produces highly accurate transcriptions (1% CER), without any image pre-processing or model fine-tuning. Third, we demonstrate that mLLMs can efficiently recognize entities in transcriptions of historical documents and parse them into structured dataset formats. Our findings provide early evidence for the long-term potential of mLLMs to introduce a paradigm shift in the approaches to historical data collection and document transcription.
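For readers unfamiliar with the CER figure quoted in this abstract, character error rate is commonly computed as the Levenshtein edit distance between the reference and the transcription, divided by the reference length. The Python sketch below shows that common definition; the function names are illustrative, and whether the paper applies any text normalization before scoring is not stated in the abstract.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a CER of 1% corresponds to roughly one character edit per 100 reference characters.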

49. 【2504.00409】Semantic Mastery: Enhancing LLMs with Advanced Natural Language Understanding

Link: https://arxiv.org/abs/2504.00409

Authors: Mohanakrishnan Hariharan

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, greatly improved, improved their capability, performing NLP tasks

Comments

Click to view abstract

Abstract:Large language models (LLMs) have greatly improved their capability in performing NLP tasks. However, deeper semantic understanding, contextual coherence, and more subtle reasoning are still difficult to obtain. The paper discusses state-of-the-art methodologies that advance LLMs with more advanced NLU techniques, such as semantic parsing, knowledge integration, and contextual reinforcement learning. We analyze the use of structured knowledge graphs, retrieval-augmented generation (RAG), and fine-tuning strategies that match models with human-level understanding. Furthermore, we address the incorporation of transformer-based architectures, contrastive learning, and hybrid symbolic-neural methods that address problems like hallucinations, ambiguity, and factual inconsistency in complex NLP tasks such as question answering, text summarization, and dialogue generation. Our findings show the importance of semantic precision for enhancing AI-driven language systems and suggest future research directions to bridge the gap between statistical language models and true natural language understanding.

50. 【2504.00406】VerifiAgent: a Unified Verification Agent in Language Model Reasoning

Link: https://arxiv.org/abs/2504.00406

Authors: Jiuzhou Han, Wray Buntine, Ehsan Shareghi

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, language models demonstrate, models demonstrate remarkable, demonstrate remarkable reasoning

Comments

Click to view abstract

Abstract:Large language models demonstrate remarkable reasoning capabilities but often produce unreliable or incorrect responses. Existing verification methods are typically model-specific or domain-restricted, requiring significant computational resources and lacking scalability across diverse reasoning tasks. To address these limitations, we propose VerifiAgent, a unified verification agent that integrates two levels of verification: meta-verification, which assesses completeness and consistency in model responses, and tool-based adaptive verification, where VerifiAgent autonomously selects appropriate verification tools based on the reasoning type, including mathematical, logical, or commonsense reasoning. This adaptive approach ensures both efficiency and robustness across different verification scenarios. Experimental results show that VerifiAgent outperforms baseline verification methods (e.g., deductive verifier, backward verifier) among all reasoning tasks. Additionally, it can further enhance reasoning accuracy by leveraging feedback from verification results. VerifiAgent can also be effectively applied to inference scaling, achieving better results with fewer generated samples and costs compared to existing process reward models in the mathematical reasoning domain. Code is available at this https URL

51. 【2504.00374】When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)

Link: https://arxiv.org/abs/2504.00374

Authors: Mahak Agarwal, Divyam Khanna

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: single Large Language, Large Language Model, contradictory claims-some accurate, Large Language, encounter contradictory claims-some

Comments: 10 pages, 6 figures

Click to view abstract

Abstract:In many real-world scenarios, a single Large Language Model (LLM) may encounter contradictory claims (some accurate, others forcefully incorrect) and must judge which is true. We investigate this risk in a single-turn, multi-agent debate framework: one LLM-based agent provides a factual answer from TruthfulQA, another vigorously defends a falsehood, and the same LLM architecture serves as judge. We introduce the Confidence-Weighted Persuasion Override Rate (CW-POR), which captures not only how often the judge is deceived but also how strongly it believes the incorrect choice. Our experiments on five open-source LLMs (3B-14B parameters), where we systematically vary agent verbosity (30-300 words), reveal that even smaller models can craft persuasive arguments that override truthful answers, often with high confidence. These findings underscore the importance of robust calibration and adversarial testing to prevent LLMs from confidently endorsing misinformation.

52. 【2504.00343】Leveraging Large Language Models for Automated Definition Extraction with TaxoMatic: A Case Study on Media Bias

Link: https://arxiv.org/abs/2504.00343

Authors: Timo Spinde, Luyang Lin, Smi Hinterreiter, Isao Echizen

Subjects: Computation and Language (cs.CL)

Keywords: leverages large language, large language models, paper introduces TaxoMatic, automate definition extraction, academic literature

Comments

Click to view abstract

Abstract:This paper introduces TaxoMatic, a framework that leverages large language models to automate definition extraction from academic literature. Focusing on the media bias domain, the framework encompasses data collection, LLM-based relevance classification, and extraction of conceptual definitions. Evaluated on a dataset of 2,398 manually rated articles, the study demonstrates the framework's effectiveness, with Claude-3-sonnet achieving the best results in both relevance classification and definition extraction. Future directions include expanding datasets and applying TaxoMatic to additional domains.

53. 【2504.00339】VNJPTranslate: A comprehensive pipeline for Vietnamese-Japanese translation

Link: https://arxiv.org/abs/2504.00339

Authors: Hoang Hai Phan, Nguyen Duc Minh Vu, Nam Dang Phuong

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Neural Machine Translation, Neural Machine, driven by Transformer, low-resource language pairs, pairs like Vietnamese-Japanese

Comments

Click to view abstract

Abstract:Neural Machine Translation (NMT) driven by Transformer architectures has advanced significantly, yet faces challenges with low-resource language pairs like Vietnamese-Japanese (Vi-Ja). Issues include sparse parallel data and handling linguistic/cultural nuances. Recent progress in Large Language Models (LLMs) with strong reasoning, often refined via Reinforcement Learning (RL), enables high-quality synthetic data generation. We introduce VNJPTranslate, a pipeline designed to systematically address the Vi-Ja translation task. It features a targeted data augmentation strategy using advanced LLMs with Chain-of-Thought prompting for challenging segments identified via corpus analysis. Subsequently, we employ efficient fine-tuning techniques (Unsloth with QLoRA) on a capable, low-parameter autoregressive model (specifically, a fine-tuned version of the 1.8B parameter Sailor model, which is based on the Qwen architecture) to create a practical and high-performing translation system. This integrated approach aims to improve Vi-Ja translation quality significantly over existing baselines.

54. 【2504.00316】Effect-driven interpretation: Functors for natural language composition

Link: https://arxiv.org/abs/2504.00316

Authors: Dylan Bumford, Simon Charlow

Subjects: Computation and Language (cs.CL)

Keywords: parallel threads, total functions, inputs to outputs, side effects, current loop

Comments

Click to view abstract

Abstract:Computer programs are often factored into pure components -- simple, total functions from inputs to outputs -- and components that may have side effects -- errors, changes to memory, parallel threads, abortion of the current loop, etc. We make the case that human languages are similarly organized around the give and pull of pure values and impure processes, and we'll aim to show how denotational techniques from computer science can be leveraged to support elegant and illuminating analyses of natural language composition.

55. 【2504.00310】Detecting and Mitigating Bias in LLMs through Knowledge Graph-Augmented Training

Link: https://arxiv.org/abs/2504.00310

Authors: Rajeev Kumar, Harishankar Kumar, Kumari Shalini

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, revolutionized natural language, natural language processing, generate human-like text, Large language

Comments

Click to view abstract

Abstract:Large language models (LLMs) have revolutionized natural language processing with their surprising capability to understand and generate human-like text. However, many of these models inherit and further amplify the biases present in their training data, raising ethical and fairness concerns. The detection and mitigation of such biases are vital to ensuring that LLMs act responsibly and equitably across diverse domains. This work investigates Knowledge Graph-Augmented Training (KGAT) as a novel method to mitigate bias in LLMs. Using structured domain-specific knowledge from real-world knowledge graphs, we improve the understanding of the model and reduce biased output. Public datasets for bias assessment include Gender Shades, Bias in Bios, and FairFace, while metrics such as demographic parity and equal opportunity facilitate rigorous detection. We also performed targeted mitigation strategies to correct biased associations, leading to a significant drop in biased output and improved bias metrics. Equipped with real-world datasets and knowledge graphs, our framework is both scalable and effective, paving the way toward responsible deployment in sensitive and high-stakes applications.
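The two fairness metrics named in this abstract have standard textbook definitions, sketched below in Python for binary predictions. The function names and the max-minus-min gap formulation are illustrative choices and are not taken from the paper, which may measure these quantities differently.

```python
def demographic_parity_gap(y_pred, group):
    """Largest difference in positive-prediction rate between groups."""
    rates = []
    for g in set(group):
        preds = [p for p, a in zip(y_pred, group) if a == g]
        rates.append(sum(preds) / len(preds))
    return max(rates) - min(rates)


def equal_opportunity_gap(y_true, y_pred, group):
    """Largest difference in true-positive rate between groups."""
    tprs = []
    for g in set(group):
        # Predictions for members of group g whose true label is positive.
        preds = [p for t, p, a in zip(y_true, y_pred, group) if a == g and t == 1]
        if preds:
            tprs.append(sum(preds) / len(preds))
    if not tprs:
        raise ValueError("no positive examples in any group")
    return max(tprs) - min(tprs)
```

A gap of 0 means the groups are treated identically under that metric; larger values indicate a stronger disparity.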

56. 【2504.00294】Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead

Link: https://arxiv.org/abs/2504.00294

Authors: Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, Safoora Yousefi

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, capabilities of large, large language, Inference-time scaling, models

Comments

Click to view abstract

Abstract:Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks, including math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning. We compare conventional models (e.g., GPT-4o) with models fine-tuned for inference-time scaling (e.g., o1) through evaluation protocols that involve repeated model calls, either independently or sequentially with feedback. These evaluations approximate lower and upper performance bounds and potential for future performance improvements for each model, whether through enhanced training or multi-model inference systems. Our extensive empirical analysis reveals that the advantages of inference-time scaling vary across tasks and diminish as problem complexity increases. In addition, simply using more tokens does not necessarily translate to higher accuracy in these challenging regimes. Results from multiple independent runs with conventional models using perfect verifiers show that, for some tasks, these models can achieve performance close to the average performance of today's most advanced reasoning models. However, for other tasks, a significant performance gap remains, even in very high scaling regimes. Encouragingly, all models demonstrate significant gains when inference is further scaled with perfect verifiers or strong feedback, suggesting ample potential for future improvements.

57. 【2504.00289】Do Chinese models speak Chinese languages?

Link: https://arxiv.org/abs/2504.00289

Authors: Andrea W Wen-Yi, Unso Eun Seo Jo, David Mimno

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: cemented China role, top-performing open-weight LLMs, release of top-performing, top-performing open-weight, leading force

Comments: First and Second author contribute equally

Abstract:The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they speak the same languages as Western models? Comparing multilingual capabilities is important for two reasons. First, language ability provides insights into pre-training data curation, and thus into resource allocation and development priorities. Second, China has a long history of explicit language policy, varying between inclusivity of minority languages and a Mandarin-first policy. To test whether Chinese LLMs today reflect an agenda about China's languages, we test performance of Chinese and Western open-source LLMs on Asian regional and Chinese minority languages. Our experiments on Information Parity and reading comprehension show Chinese models' performance across these languages correlates strongly (r=0.93) with Western models', with the sole exception being better Mandarin. Sometimes, Chinese models cannot identify languages spoken by Chinese minorities such as Kazakh and Uyghur, even though they are good at French and German. These results provide a window into current development priorities, suggest options for future development, and indicate guidance for end users.

58. 【2504.00285】Do Large Language Models Exhibit Spontaneous Rational Deception?

Link: https://arxiv.org/abs/2504.00285

Authors: Samuel M. Taylor, Benjamin K. Bergen

Subjects: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, effective at deceiving, Large, LLMs

Abstract:Large Language Models (LLMs) are effective at deceiving, when prompted to do so. But under what conditions do they deceive spontaneously? Models that demonstrate better performance on reasoning tasks are also better at prompted deception. Do they also increasingly deceive spontaneously in situations where it could be considered rational to do so? This study evaluates spontaneous deception produced by LLMs in a preregistered experimental protocol using tools from signaling theory. A range of proprietary closed-source and open-source LLMs are evaluated using modified 2x2 games (in the style of Prisoner's Dilemma) augmented with a phase in which they can freely communicate to the other agent using unconstrained language. This setup creates an opportunity to deceive, in conditions that vary in how useful deception might be to an agent's rational self-interest. The results indicate that 1) all tested LLMs spontaneously misrepresent their actions in at least some conditions, 2) they are generally more likely to do so in situations in which deception would benefit them, and 3) models exhibiting better reasoning capacity overall tend to deceive at higher rates. Taken together, these results suggest a tradeoff between LLM reasoning capability and honesty. They also provide evidence of reasoning-like behavior in LLMs from a novel experimental configuration. Finally, they reveal certain contextual factors that affect whether LLMs will deceive or not. We discuss consequences for autonomous, human-facing systems driven by LLMs both now and as their reasoning capabilities continue to improve.

59. 【2504.00274】Text Chunking for Document Classification for Urban System Management using Large Language Models

Link: https://arxiv.org/abs/2504.00274

Authors: Joshua Rodriguez (1), Om Sanan (2), Guillermo Vizarreta-Luna (1), Steven A. Conrad (1) ((1) Department of Systems Engineering, Colorado State University, Fort Collins, CO, USA, (2) Scarsdale High School, Scarsdale, NY, USA)

Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: evaluate built environment, complex textual documentation, built environment performance, managed using complex, evaluate built

Comments: 16 pages, 6 figures, 4 tables, 2 algorithms; Replication data and code can be found at [this https URL](https://github.com/josh-rodriguez-csu/ChunkingforLLMs)

Abstract:Urban systems are managed using complex textual documentation that needs coding and analysis to set requirements and evaluate built environment performance. This paper contributes to the study of applying large language models (LLMs) to qualitative coding activities to reduce resource requirements while maintaining comparable reliability to humans. Qualitative coding and assessment face challenges like resource limitations and bias, accuracy, and consistency between human evaluators. Here we report the application of LLMs to deductively code 10 case documents on the presence of 17 digital twin characteristics for the management of urban systems. We utilize two prompting methods to compare the semantic processing of LLMs with human coding efforts: whole text analysis and text chunk analysis using OpenAI's GPT-4o, GPT-4o-mini, and o1-mini models. We found similar trends of internal variability between methods, and results indicate that LLMs may perform on par with human coders when initialized with specific deductive coding contexts. GPT-4o, o1-mini and GPT-4o-mini showed significant agreement with human raters when employed using a chunking method. The application of both GPT-4o and GPT-4o-mini as an additional rater with three manual raters showed statistically significant agreement across all raters, indicating that the analysis of textual documents benefits from LLMs. Our findings reveal nuanced sub-themes of LLM application suggesting LLMs follow human memory coding processes where whole-text analysis may introduce multiple meanings. The novel contributions of this paper lie in assessing the performance of OpenAI GPT models and introducing the chunk-based prompting approach, which addresses context aggregation biases by preserving localized context.

60. 【2504.00265】Multilingual Sentiment Analysis of Summarized Texts: A Cross-Language Study of Text Shortening Effects

Link: https://arxiv.org/abs/2504.00265

Authors: Mikhail Krasitskii, Grigori Sidorov, Olga Kolesnikova, Liliana Chanona Hernandez, Alexander Gelbukh

Subjects: Computation and Language (cs.CL)

Keywords: Summarization significantly impacts, significantly impacts sentiment, diverse morphologies, significantly impacts, sentiment

Abstract:Summarization significantly impacts sentiment analysis across languages with diverse morphologies. This study examines extractive and abstractive summarization effects on sentiment classification in English, German, French, Spanish, Italian, Finnish, Hungarian, and Arabic. We assess sentiment shifts post-summarization using multilingual transformers (mBERT, XLM-RoBERTa, T5, and BART) and language-specific models (FinBERT, AraBERT). Results show extractive summarization better preserves sentiment, especially in morphologically complex languages, while abstractive summarization improves readability but introduces sentiment distortion, affecting sentiment accuracy. Languages with rich inflectional morphology, such as Finnish, Hungarian, and Arabic, experience greater accuracy drops than English or German. Findings emphasize the need for language-specific adaptations in sentiment analysis and propose a hybrid summarization approach balancing readability and sentiment preservation. These insights benefit multilingual sentiment applications, including social media monitoring, market analysis, and cross-lingual opinion mining.

61. 【2504.00255】SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers

Link: https://arxiv.org/abs/2504.00255

Authors: Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, Yulan He

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)

Keywords: recent NLP papers, large language models, study evaluates large, evaluates large language, NLP papers

Abstract:This study evaluates large language models (LLMs) in generating code from algorithm descriptions from recent NLP papers. The task requires two key competencies: (1) algorithm comprehension: synthesizing information from papers and academic literature to understand implementation logic, and (2) coding expertise: identifying dependencies and correctly implementing necessary APIs. To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024, featuring detailed annotations and comprehensive test cases. Building on SciReplicate-Bench, we propose Sci-Reproducer, a multi-agent framework consisting of a Paper Agent that interprets algorithmic concepts from literature and a Code Agent that retrieves dependencies from repositories and implements solutions. To assess algorithm understanding, we introduce reasoning graph accuracy, which quantifies similarity between generated and reference reasoning graphs derived from code comments and structure. For evaluating implementation quality, we employ execution accuracy, CodeBLEU, and repository dependency/API recall metrics. In our experiments, we evaluate various powerful Non-Reasoning LLMs and Reasoning LLMs as foundational models. The best-performing LLM using Sci-Reproducer achieves only 39% execution accuracy, highlighting the benchmark's difficulty. Our analysis identifies missing or inconsistent algorithm descriptions as key barriers to successful reproduction. We will open-source our benchmark and code at this https URL.

62. 【2504.00254】ElaLoRA: Elastic Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning

Link: https://arxiv.org/abs/2504.00254

Authors: Huandong Chang, Zicheng Ma, Mingyuan Ma, Zhenting Qi, Andrew Sabot, Hong Jiang, H. T. Kung

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: widely adopted technique, minimal parameter updates, large-scale pre-trained models, fine-tuning large-scale pre-trained, widely adopted

Abstract:Low-Rank Adaptation (LoRA) has become a widely adopted technique for fine-tuning large-scale pre-trained models with minimal parameter updates. However, existing methods rely on fixed ranks or focus solely on either rank pruning or expansion, failing to adapt ranks dynamically to match the importance of different layers during training. In this work, we propose ElaLoRA, an adaptive low-rank adaptation framework that dynamically prunes and expands ranks based on gradient-derived importance scores. To the best of our knowledge, ElaLoRA is the first method that enables both rank pruning and expansion during fine-tuning. Experiments across multiple benchmarks demonstrate that ElaLoRA consistently outperforms existing PEFT methods across different parameter budgets. Furthermore, our studies validate that layers receiving higher rank allocations contribute more significantly to model performance, providing theoretical justification for our adaptive strategy. By introducing a principled and adaptive rank allocation mechanism, ElaLoRA offers a scalable and efficient fine-tuning solution, particularly suited for resource-constrained environments.

63. 【2504.00241】Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemorcacy

Link: https://arxiv.org/abs/2504.00241

Authors: Rabimba Karanjai, Boris Shor, Amanda Austin, Ryan Kennedy, Yang Lu, Lei Xu, Weidong Shi

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, public opinion data, synthesize public opinion, declining response rates

Abstract:This paper investigates the use of Large Language Models (LLMs) to synthesize public opinion data, addressing challenges in traditional survey methods like declining response rates and non-response bias. We introduce a novel technique: role creation based on knowledge injection, a form of in-context learning that leverages RAG and specified personality profiles from the HEXACO model and demographic information, and uses that for dynamically generated prompts. This method allows LLMs to simulate diverse opinions more accurately than existing prompt engineering approaches. We compare our results with pre-trained models with standard few-shot prompts. Experiments using questions from the Cooperative Election Study (CES) demonstrate that our role-creation approach significantly improves the alignment of LLM-generated opinions with real-world human survey responses, increasing answer adherence. In addition, we discuss challenges, limitations and future research directions.

64. 【2504.00218】$\textit{Agents Under Siege}$: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks

Link: https://arxiv.org/abs/2504.00218

Authors: Rana Muhammad Shahroz Khan, Zhen Tan, Sukwon Yun, Charles Flemming, Tianlong Chen

Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Model, Large Language, discussions about Large, texttt, multi-agent LLM systems

Abstract:Most discussions about Large Language Model (LLM) safety have focused on single-agent settings, but multi-agent LLM systems now create novel adversarial risks because their behavior depends on communication between agents and decentralized reasoning. In this work, we innovatively focus on attacking pragmatic systems that have constraints such as limited token bandwidth, latency between message delivery, and defense mechanisms. We design a $\textit{permutation-invariant adversarial attack}$ that optimizes prompt distribution across latency and bandwidth-constrained network topologies to bypass distributed safety mechanisms within the system. Formulating the attack path as a problem of $\textit{maximum-flow minimum-cost}$, coupled with the novel $\textit{Permutation-Invariant Evasion Loss (PIEL)}$, we leverage graph-based optimization to maximize attack success rate while minimizing detection risk. Evaluating across models including $\texttt{Llama}$, $\texttt{Mistral}$, $\texttt{Gemma}$, $\texttt{DeepSeek}$ and other variants on various datasets like $\texttt{JailBreakBench}$ and $\texttt{AdversarialBench}$, our method outperforms conventional attacks by up to $7\times$, exposing critical vulnerabilities in multi-agent systems. Moreover, we demonstrate that existing defenses, including variants of $\texttt{Llama-Guard}$ and $\texttt{PromptGuard}$, fail to prohibit our attack, emphasizing the urgent need for multi-agent specific safety mechanisms.

65. 【2504.00187】Insight-RAG: Enhancing LLMs with Insight-Driven Augmentation

Link: https://arxiv.org/abs/2504.00187

Authors: Pouya Pezeshkpour, Estevam Hruschka

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Retrieval Augmented Generation, Augmented Generation, large language models, leveraging external knowledge, shown significant promise

Abstract:Retrieval Augmented Generation (RAG) frameworks have shown significant promise in leveraging external knowledge to enhance the performance of large language models (LLMs). However, conventional RAG methods often retrieve documents based solely on surface-level relevance, leading to several issues: they may overlook deeply buried information within individual documents, miss relevant insights spanning multiple sources, and are not well-suited for tasks beyond traditional question answering. In this paper, we propose Insight-RAG, a novel framework designed to address these issues. In the initial stage of Insight-RAG, instead of using traditional retrieval methods, we employ an LLM to analyze the input query and task, extracting the underlying informational requirements. In the subsequent stage, a specialized LLM -- trained on the document database -- is queried to mine content that directly addresses these identified insights. Finally, by integrating the original query with the retrieved insights, similar to conventional RAG approaches, we employ a final LLM to generate a contextually enriched and accurate response. Using two scientific paper datasets, we created evaluation benchmarks targeting each of the mentioned issues and assessed Insight-RAG against a traditional RAG pipeline. Our results demonstrate that the Insight-RAG pipeline successfully addresses these challenges, outperforming existing methods by a significant margin in most cases. These findings suggest that integrating insight-driven retrieval within the RAG framework not only enhances performance but also broadens the applicability of RAG to tasks beyond conventional question answering.

66. 【2504.00180】Contradiction Detection in RAG Systems: Evaluating LLMs as Context Validators for Improved Information Consistency

Link: https://arxiv.org/abs/2504.00180

Authors: Vignesh Gokul, Srikanth Tenneti, Alwarappan Nakkiran

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Retrieval Augmented Generation, enhancing large language, Retrieval Augmented, Augmented Generation, large language models

Abstract:Retrieval Augmented Generation (RAG) systems have emerged as a powerful method for enhancing large language models (LLMs) with up-to-date information. However, the retrieval step in RAG can sometimes surface documents containing contradictory information, particularly in rapidly evolving domains such as news. These contradictions can significantly impact the performance of LLMs, leading to inconsistent or erroneous outputs. This study addresses this critical challenge in two ways. First, we present a novel data generation framework to simulate different types of contradictions that may occur in the retrieval stage of a RAG system. Second, we evaluate the robustness of different LLMs in performing as context validators, assessing their ability to detect contradictory information within retrieved document sets. Our experimental results reveal that context validation remains a challenging task even for state-of-the-art LLMs, with performance varying significantly across different types of contradictions. While larger models generally perform better at contradiction detection, the effectiveness of different prompting strategies varies across tasks and model architectures. We find that chain-of-thought prompting shows notable improvements for some models but may hinder performance in others, highlighting the complexity of the task and the need for more robust approaches to context validation in RAG systems.

67. 【2504.00178】Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier

Link: https://arxiv.org/abs/2504.00178

Authors: Craig W. Schmidt, Varshini Reddy, Chris Tanner, Yuval Pinter

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: modern tokenization pipelines, Byte Pair Encoding, smaller units called, typically splitting, whitespace and punctuation

Abstract:Pre-tokenization, the initial step in many modern tokenization pipelines, segments text into smaller units called pretokens, typically splitting on whitespace and punctuation. While this process encourages having full, individual words as tokens, it introduces a fundamental limitation in most tokenization algorithms such as Byte Pair Encoding (BPE). Specifically, pre-tokenization causes the distribution of tokens in a corpus to heavily skew towards common, full-length words. This skewed distribution limits the benefits of expanding to larger vocabularies, since the additional tokens appear with progressively lower counts. To overcome this barrier, we propose BoundlessBPE, a modified BPE algorithm that relaxes the pretoken boundary constraint. Our approach selectively merges two complete pretokens into a larger unit we term a superword. Superwords are not necessarily semantically cohesive. For example, the pretokens " of" and " the" might be combined to form the superword " of the". This merging strategy results in a substantially more uniform distribution of tokens across a corpus than standard BPE, and compresses text more effectively, with an approximate 20% increase in bytes per token.
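
To make the superword idea concrete, here is a minimal sketch of a single merge step over pretokens, plus a bytes-per-token measure. It is an illustrative simplification, not the BoundlessBPE algorithm itself: it assumes every pretoken is already a single token, and the helper names `merge_top_pair` and `bytes_per_token` are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def merge_top_pair(
    sequences: List[List[str]],
) -> Tuple[List[List[str]], Optional[Tuple[str, str]]]:
    """Merge the most frequent adjacent pretoken pair into one superword."""
    counts: Counter = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return sequences, None
    (left, right), _ = counts.most_common(1)[0]

    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            # Replace each occurrence of (left, right) with the joined unit.
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged, (left, right)


def bytes_per_token(sequences: List[List[str]]) -> float:
    """Average number of UTF-8 bytes covered by each token."""
    n_bytes = sum(len(tok.encode("utf-8")) for seq in sequences for tok in seq)
    n_tokens = sum(len(seq) for seq in sequences)
    return n_bytes / n_tokens


# Toy corpus (hypothetical), already split into pretokens with leading spaces.
corpus = [
    ["one", " of", " the", " best"],
    ["end", " of", " the", " day"],
    ["some", " of", " it"],
]
merged_corpus, merged_pair = merge_top_pair(corpus)
```

On this toy corpus the pair `" of"` + `" the"` is the most frequent, so it becomes the superword `" of the"`. The same bytes are then covered by fewer tokens, so bytes per token goes up, which is the compression effect the abstract describes.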

68. 【2504.00163】Does "Reasoning" with Large Language Models Improve Recognizing, Generating, and Reframing Unhelpful Thoughts?

Link: https://arxiv.org/abs/2504.00163

Authors: Yilin Qi, Dong Won Lee, Cynthia Breazeal, Hae Won Park

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Cognitive Behavioral Therapy, Behavioral Therapy, finding positive meaning, individuals reinterpret negative, reinterpret negative experiences

Comments: 8 pages, 3 figures (including appendix)

Abstract:Cognitive Reframing, a core element of Cognitive Behavioral Therapy (CBT), helps individuals reinterpret negative experiences by finding positive meaning. Recent advances in Large Language Models (LLMs) have demonstrated improved performance through reasoning-based strategies. This inspires a promising direction of leveraging the reasoning capabilities of LLMs to improve CBT and mental reframing by simulating the process of critical thinking, potentially enabling more effective recognition, generation, and reframing of cognitive distortions. In this work, we investigate the role of various reasoning methods, including pre-trained reasoning LLMs and augmented reasoning strategies such as CoT and self-consistency in enhancing LLMs' ability to perform cognitive reframing tasks. We find that augmented reasoning methods, even when applied to "outdated" LLMs like GPT-3.5, consistently outperform state-of-the-art pretrained reasoning models on recognizing, generating and reframing unhelpful thoughts.

69. 【2504.00147】Universal Zero-shot Embedding Inversion

Link: https://arxiv.org/abs/2504.00147

Authors: Collin Zhang, John X. Morris, Vitaly Shmatikov

Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: black-box access, fundamental problem, NLP perspective, Embedding, NLP

Abstract:Embedding inversion, i.e., reconstructing text given its embedding and black-box access to the embedding encoder, is a fundamental problem in both NLP and security. From the NLP perspective, it helps determine how much semantic information about the input is retained in the embedding. From the security perspective, it measures how much information is leaked by vector databases and embedding-based retrieval systems. State-of-the-art methods for embedding inversion, such as vec2text, have high accuracy but require (a) training a separate model for each embedding, and (b) a large number of queries to the corresponding encoder. We design, implement, and evaluate ZSInvert, a zero-shot inversion method based on the recently proposed adversarial decoding technique. ZSInvert is fast, query-efficient, and can be used for any text embedding without training an embedding-specific inversion model. We measure the effectiveness of ZSInvert on several embeddings and demonstrate that it recovers key semantic information about the corresponding texts.

70. 【2504.00132】Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B

Link: https://arxiv.org/abs/2504.00132

Authors: Aleksandra Bakalova, Yana Veitsman, Xinting Huang, Michael Hahn

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: In-Context Learning, intriguing ability, ability of large, large language models, naturalistic ICL tasks

Abstract:In-Context Learning (ICL) is an intriguing ability of large language models (LLMs). Despite a substantial amount of work on its behavioral aspects and how it emerges in miniature setups, it remains unclear which mechanism assembles task information from the individual examples in a fewshot prompt. We use causal interventions to identify information flow in Gemma-2 2B for five naturalistic ICL tasks. We find that the model infers task information using a two-step strategy we call contextualize-then-aggregate: In the lower layers, the model builds up representations of individual fewshot examples, which are contextualized by preceding examples through connections between fewshot input and output tokens across the sequence. In the higher layers, these representations are aggregated to identify the task and prepare prediction of the next output. The importance of the contextualization step differs between tasks, and it may become more important in the presence of ambiguous examples. Overall, by providing rigorous causal analysis, our results shed light on the mechanisms through which ICL happens in language models.

71. 【2504.00125】LLMs for Explainable AI: A Comprehensive Survey

Link: https://arxiv.org/abs/2504.00125

Authors: Ahsan Bilal, David Ebert, Beiyu Lin

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: transforming complex machine, Large Language Models, complex machine learning, making model predictions, sophisticated model behavior

Comments: This manuscript is intended for submission to ACM Transactions on Intelligent Systems and Technology

Abstract:Large Language Models (LLMs) offer a promising approach to enhancing Explainable AI (XAI) by transforming complex machine learning outputs into easy-to-understand narratives, making model predictions more accessible to users, and helping bridge the gap between sophisticated model behavior and human interpretability. AI models, such as state-of-the-art neural networks and deep learning models, are often seen as "black boxes" due to a lack of transparency. Because users cannot fully understand how the models reach conclusions, they have difficulty trusting decisions from AI models, which leads to less effective decision-making processes, reduced accountability, and unclear potential biases. A challenge arises in developing explainable AI (XAI) models to gain users' trust and provide insights into how models generate their outputs. With the development of Large Language Models, we want to explore the possibilities of using human language-based models, LLMs, for model explainability. This survey provides a comprehensive overview of existing approaches regarding LLMs for XAI and evaluation techniques for LLM-generated explanations, discusses the corresponding challenges and limitations, and examines real-world applications. Finally, we discuss future directions by emphasizing the need for more interpretable, automated, user-centric, and multidisciplinary approaches for XAI via LLMs.

72. 【2504.00061】Evaluating the Feasibility and Accuracy of Large Language Models for Medical History-Taking in Obstetrics and Gynecology

Link: https://arxiv.org/abs/2504.00061

Authors: Dou Liu, Ying Long, Sophia Zuoqiu, Tian Tang, Rong Yin

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: sensitive medical areas, Large Language Models, Effective physician-patient communications, pre-diagnostic environments, communications in pre-diagnostic

Comments: Accepted by IISE 2025 annual conference

Abstract:Effective physician-patient communications in pre-diagnostic environments, and most specifically in complex and sensitive medical areas such as infertility, are critical but consume a lot of time and, therefore, cause clinic workflows to become inefficient. Recent advancements in Large Language Models (LLMs) offer a potential solution for automating conversational medical history-taking and improving diagnostic accuracy. This study evaluates the feasibility and performance of LLMs in those tasks for infertility cases. An AI-driven conversational system was developed to simulate physician-patient interactions with ChatGPT-4o and ChatGPT-4o-mini. A total of 70 real-world infertility cases were processed, generating 420 diagnostic histories. Model performance was assessed using F1 score, Differential Diagnosis (DDs) Accuracy, and Accuracy of Infertility Type Judgment (ITJ). ChatGPT-4o-mini outperformed ChatGPT-4o in information extraction accuracy (F1 score: 0.9258 vs. 0.9029, p = 0.045, d = 0.244) and demonstrated higher completeness in medical history-taking (97.58% vs. 77.11%), suggesting that ChatGPT-4o-mini is more effective in extracting detailed patient information, which is critical for improving diagnostic accuracy. In contrast, ChatGPT-4o performed slightly better in differential diagnosis accuracy (2.0524 vs. 2.0048, p 0.05). ITJ accuracy was higher in ChatGPT-4o-mini (0.6476 vs. 0.5905) but with lower consistency (Cronbach's $\alpha$ = 0.562), suggesting variability in classification reliability. Both models demonstrated strong feasibility in automating infertility history-taking, with ChatGPT-4o-mini excelling in completeness and extraction accuracy. In future studies, expert validation for accuracy and dependability in a clinical setting, AI model fine-tuning, and larger datasets with a mix of cases of infertility have to be prioritized.

73. 【2504.00053】Integrating Large Language Models with Human Expertise for Disease Detection in Electronic Health Records

Link: https://arxiv.org/abs/2504.00053

Authors: Jie Pan, Seungwon Lee, Cheligeer Cheligeer, Elliot A. Martin, Kiarash Riazi, Hude Quan, Na Li

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Electronic health records, complement administrative data-based, administrative data-based disease, data-based disease surveillance, Electronic health

Abstract:Objective: Electronic health records (EHR) are widely available to complement administrative data-based disease surveillance and healthcare performance evaluation. Defining conditions from EHR is labour-intensive and requires extensive manual labelling of disease outcomes. This study developed an efficient strategy based on advanced large language models to identify multiple conditions from EHR clinical notes. Methods: We linked a cardiac registry cohort in 2015 with an EHR system in Alberta, Canada. We developed a pipeline that leveraged a generative large language model (LLM) to analyze, understand, and interpret EHR notes by prompts based on specific diagnosis, treatment management, and clinical guidelines. The pipeline was applied to detect acute myocardial infarction (AMI), diabetes, and hypertension. The performance was compared against clinician-validated diagnoses as the reference standard and widely adopted International Classification of Diseases (ICD) codes-based methods. Results: The study cohort accounted for 3,088 patients and 551,095 clinical notes. The prevalence was 55.4%, 27.7%, and 65.9% for AMI, diabetes, and hypertension, respectively. The performance of the LLM-based pipeline for detecting conditions varied: AMI had 88% sensitivity, 63% specificity, and 77% positive predictive value (PPV); diabetes had 91% sensitivity, 86% specificity, and 71% PPV; and hypertension had 94% sensitivity, 32% specificity, and 72% PPV. Compared with ICD codes, the LLM-based method demonstrated improved sensitivity and negative predictive value across all conditions. The monthly percentage trends from the detected cases by LLM and reference standard showed consistent patterns.
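
For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity, PPV, and NPV follow from a confusion matrix, using the standard definitions. The counts are made up for illustration and are not the study's data. The function name `diagnostic_metrics` is likewise hypothetical.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard diagnostic accuracy metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that were detected
        "specificity": tn / (tn + fp),  # share of non-cases correctly ruled out
        "ppv": tp / (tp + fp),          # share of positive calls that are correct
        "npv": tn / (tn + fn),          # share of negative calls that are correct
    }


# Hypothetical counts: 100 patients with the condition, 100 without.
metrics = diagnostic_metrics(tp=90, fp=30, tn=70, fn=10)
```

Note that PPV and NPV depend on prevalence while sensitivity and specificity do not, which is one reason the abstract reports prevalence alongside them.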

74. 【2504.00051】The Cursive Transformer

Link: https://arxiv.org/abs/2504.00051

Authors: Sam Greydanus, Zachary Wimpee

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: high-quality autoregressive samples, Transformers trained, generate high-quality autoregressive, tokenized text, autoregressive samples

Comments: 11 pages, 8 figures

Abstract: Transformers trained on tokenized text, audio, and images can generate high-quality autoregressive samples. But handwriting data, represented as sequences of pen coordinates, remains underexplored. We introduce a novel tokenization scheme that converts pen stroke offsets to polar coordinates, discretizes them into bins, and then turns them into sequences of tokens with which to train a standard GPT model. This allows us to capture complex stroke distributions without using any specialized architectures (e.g., the mixture density network or the self-advancing ASCII attention head from Graves 2014). With just 3,500 handwritten words and a few simple data augmentations, we are able to train a model that can generate realistic cursive handwriting. Our approach is simpler and more performant than previous RNN-based methods.
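The tokenization idea in this abstract (offsets to polar coordinates, then bins, then tokens) can be illustrated with a minimal sketch. This is not the authors' code: the function name `tokenize_offsets`, the bin counts, `max_radius`, and the interleaved angle/radius token layout are all assumptions made for illustration.

```python
import math


def tokenize_offsets(offsets, n_angle_bins=8, n_radius_bins=4, max_radius=1.0):
    """Map (dx, dy) pen offsets to interleaved angle/radius tokens.

    Bin counts, max_radius and the token layout are illustrative
    assumptions, not the values used in the paper.
    """
    tokens = []
    for dx, dy in offsets:
        radius = math.hypot(dx, dy)
        angle = math.atan2(dy, dx) % (2 * math.pi)  # wrap into [0, 2*pi)
        # Uniform bins; clamp so out-of-range values land in the last bin.
        angle_bin = min(int(angle / (2 * math.pi) * n_angle_bins), n_angle_bins - 1)
        radius_bin = min(int(radius / max_radius * n_radius_bins), n_radius_bins - 1)
        # Angle tokens occupy [0, n_angle_bins); radius tokens follow them.
        tokens.append(angle_bin)
        tokens.append(n_angle_bins + radius_bin)
    return tokens


strokes = [(0.5, 0.1), (-0.2, 0.5), (5.0, 0.1)]
print(tokenize_offsets(strokes))
```

Each offset becomes two integers from a shared vocabulary of `n_angle_bins + n_radius_bins` ids, which is the kind of flat token sequence a standard GPT-style model can be trained on.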

75. 【2504.00050】JudgeLRM: Large Reasoning Models as a Judge

Link: https://arxiv.org/abs/2504.00050

Authors: Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: existing Supervised Fine-Tuning, Large Language Models, Large Language, Supervised Fine-Tuning, rise of Large

Comments: preprint

Abstract: The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) approaches for judges often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.

76. 【2504.00048】Distill-C: Enhanced NL2SQL via Distilled Customization with LLMs

Link: https://arxiv.org/abs/2504.00048

Authors: Cong Duy Vu Hoang, Gioacchino Tangari, Clemence Lanfranchi, Dalu Guo, Paul Cayet, Steve Siu, Don Dharmasiri, Yuan-Fang Li, Long Duong, Damien Hilloulin, Rhicheek Patra, Sungpack Hong, Hassan Chafi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Natural Language, interest in Natural, Language to SQL, growing adoption, business applications

Comments: Preprint, accepted at NAACL 2025 (Industry Track)

Abstract:The growing adoption of large language models (LLMs) in business applications has amplified interest in Natural Language to SQL (NL2SQL) solutions, in which there is competing demand for high performance and efficiency. Domain- and customer-specific requirements further complicate the problem. To address this conundrum, we introduce Distill-C, a distilled customization framework tailored for NL2SQL tasks. Distill-C utilizes large teacher LLMs to produce high-quality synthetic data through a robust and scalable pipeline. Finetuning smaller and open-source LLMs on this synthesized data enables them to rival or outperform teacher models an order of magnitude larger. Evaluated on multiple challenging benchmarks, Distill-C achieves an average improvement of 36% in execution accuracy compared to the base models from three distinct LLM families. Additionally, on three internal customer benchmarks, Distill-C demonstrates a 22.6% performance improvement over the base models. Our results demonstrate that Distill-C is an effective, high-performing and generalizable approach for deploying lightweight yet powerful NL2SQL models, delivering exceptional accuracies while maintaining low computational cost.

77. 【2504.00046】Multi-Stakeholder Disaster Insights from Social Media Using Large Language Models

Link: https://arxiv.org/abs/2504.00046

Authors: Loris Belcastro, Cristian Cosentino, Fabrizio Marozzo, Merve Gündüz-Cüre, Şule Öztürk-Birim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Social and Information Networks (cs.SI)

Keywords: promptly share feedback, social media, recent years, playing a key, primary channel

Abstract:In recent years, social media has emerged as a primary channel for users to promptly share feedback and issues during disasters and emergencies, playing a key role in crisis management. While significant progress has been made in collecting and analyzing social media content, there remains a pressing need to enhance the automation, aggregation, and customization of this data to deliver actionable insights tailored to diverse stakeholders, including the press, police, EMS, and firefighters. This effort is essential for improving the coordination of activities such as relief efforts, resource distribution, and media communication. This paper presents a methodology that leverages the capabilities of LLMs to enhance disaster response and management. Our approach combines classification techniques with generative AI to bridge the gap between raw user feedback and stakeholder-specific reports. Social media posts shared during catastrophic events are analyzed with a focus on user-reported issues, service interruptions, and encountered challenges. We employ full-spectrum LLMs, using analytical models like BERT for precise, multi-dimensional classification of content type, sentiment, emotion, geolocation, and topic. Generative models such as ChatGPT are then used to produce human-readable, informative reports tailored to distinct audiences, synthesizing insights derived from detailed classifications. We compare standard approaches, which analyze posts directly using prompts in ChatGPT, to our advanced method, which incorporates multi-dimensional classification, sub-event selection, and tailored report generation. Our methodology demonstrates superior performance in both quantitative metrics, such as text coherence scores and latent representations, and qualitative assessments by automated tools and field experts, delivering precise insights for diverse disaster response stakeholders.

78. 【2504.00045】Measuring Online Hate on 4chan using Pre-trained Deep Learning Models

Link: https://arxiv.org/abs/2504.00045

Authors: Adrian Bermudez-Villalva, Maryam Mehrnezhad, Ehsan Toreini

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: harmfully impact individuals, Natural Language Processing, post anonymous content, Online hate, individuals and groups

Comments: IEEE Transactions on Technology and Society, 11 pages

Abstract: Online hate speech can harmfully impact individuals and groups, specifically on non-moderated platforms such as 4chan where users can post anonymous content. This work focuses on analysing and measuring the prevalence of online hate on 4chan's politically incorrect board (/pol/) using state-of-the-art Natural Language Processing (NLP) models, specifically transformer-based models such as RoBERTa and Detoxify. By leveraging these advanced models, we provide an in-depth analysis of hate speech dynamics and quantify the extent of online hate on non-moderated platforms. The study advances understanding through multi-class classification of hate speech (racism, sexism, religion, etc.), while also incorporating the classification of toxic content (e.g., identity attacks and threats) and a further topic modelling analysis. The results show that 11.20% of this dataset is identified as containing hate in different categories. These evaluations show that online hate is manifested in various forms, confirming the complicated and volatile nature of detection in the wild.

79. 【2504.00044】Dynamic hashtag recommendation in social media with trend shift detection and adaptation

Link: https://arxiv.org/abs/2504.00044

Authors: Riccardo Cantini, Fabrizio Marozzo, Alessio Orsino, Domenico Talia, Paolo Trunfio

Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Neural and Evolutionary Computing (cs.NE)

Keywords: requires efficient methods, social media platforms, categorization and search, generation of vast, vast amounts

Abstract:The widespread use of social media platforms results in the generation of vast amounts of user-generated content, which requires efficient methods for categorization and search. Hashtag recommendation systems have emerged as a crucial tool for automatically suggesting relevant hashtags and improving content discoverability. However, existing static models struggle to adapt to the highly dynamic and real-time nature of social media conversations, where new hashtags emerge and existing ones undergo semantic shifts. To address these challenges, this paper presents H-ADAPTS (Hashtag recommendAtion by Detecting and adAPting to Trend Shifts), a BERT-based hashtag recommendation methodology that can detect and adapt to shifts in the main trends and topics underlying social media conversation. Our approach introduces a trend-aware detection mechanism to identify changes in hashtag usage, triggering efficient model adaptation on a (small) set of recent posts. The framework leverages Apache Storm for real-time stream processing, enabling scalable and fault-tolerant analysis of high-velocity social data. Experimental results on two real-world case studies, including the COVID-19 pandemic and the 2020 US presidential election, demonstrate the ability to maintain high recommendation accuracy by adapting to emerging trends. Our methodology significantly outperforms existing solutions, ensuring timely and relevant hashtag recommendations in dynamic environments.

80. 【2504.00043】CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

Link: https://arxiv.org/abs/2504.00043

Authors: Jixuan Leng, Chengsong Huang, Langlin Huang, Bill Yuchen Lin, William W. Cohen, Haohan Wang, Jiaxin Huang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Large Vision-Language Models, Large Language, limited dynamic interplay, vision-language understanding capabilities

Abstract: Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly either assess text-based reasoning or vision-language understanding capabilities, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles, a task requiring multimodal adherence to semantic constraints from text-based clues and intersectional constraints from visual grid structures. CrossWordBench leverages a controllable puzzle generation framework that produces puzzles in multiple formats (text and image) and offers different evaluation strategies ranging from direct puzzle solving to interactive modes. Our extensive evaluation of over 20 models reveals that reasoning LLMs outperform non-reasoning models substantially by effectively leveraging crossing-letter constraints. We further demonstrate that LVLMs struggle with the task, showing a strong correlation between their puzzle-solving performance and grid-parsing accuracy. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.

81. 【2504.00042】Beyond the Reported Cutoff: Where Large Language Models Fall Short on Financial Knowledge

Link: https://arxiv.org/abs/2504.00042

Authors: Agam Shah, Liqin Ye, Sebastian Jaskowski, Wei Xu, Sudheer Chava

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, frequently utilized, utilized as sources

Abstract:Large Language Models (LLMs) are frequently utilized as sources of knowledge for question-answering. While it is known that LLMs may lack access to real-time data or newer data produced after the model's cutoff date, it is less clear how their knowledge spans across historical information. In this study, we assess the breadth of LLMs' knowledge using financial data of U.S. publicly traded companies by evaluating more than 197k questions and comparing model responses to factual data. We further explore the impact of company characteristics, such as size, retail investment, institutional attention, and readability of financial filings, on the accuracy of knowledge represented in LLMs. Our results reveal that LLMs are less informed about past financial performance, but they display a stronger awareness of larger companies and more recent information. Interestingly, at the same time, our analysis also reveals that LLMs are more likely to hallucinate for larger companies, especially for data from more recent years. We will make the code, prompts, and model outputs public upon the publication of the work.

82. 【2504.00040】Quantum Methods for Managing Ambiguity in Natural Language Processing

Link: https://arxiv.org/abs/2504.00040

Authors: Jurek Eisinger, Ward Gauderis, Lin de Huybrecht, Geraint A. Wiggins

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)

Keywords: Categorical Compositional Distributional, Compositional Distributional, Categorical Compositional, Natural Language Processing, framework models meaning

Abstract: The Categorical Compositional Distributional (DisCoCat) framework models meaning in natural language using the mathematical framework of quantum theory, expressed as formal diagrams. DisCoCat diagrams can be associated with tensor networks and quantum circuits. DisCoCat diagrams have been connected to density matrices in various contexts in Quantum Natural Language Processing (QNLP). Previous use of density matrices in QNLP entails modelling ambiguous words as probability distributions over more basic words (the word "queen", e.g., might mean the reigning queen or the chess piece). In this article, we investigate using probability distributions over processes to account for syntactic ambiguity in sentences. The meanings of these sentences are represented by density matrices. We show how to create probability distributions on quantum circuits that represent the meanings of sentences and explain how this approach generalises tasks from the literature. We conduct an experiment to validate the proposed theory.

83. 【2504.00031】Leaking LoRa: An Evaluation of Password Leaks and Knowledge Storage in Large Language Models

Link: https://arxiv.org/abs/2504.00031

Authors: Ryan Marinelli, Magnus Eckhoff

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: deploy Large Language, effectively deploy Large, Large Language Models, Large Language, application-specific settings

Abstract: To effectively deploy Large Language Models (LLMs) in application-specific settings, fine-tuning techniques are applied to enhance performance on specialized tasks. This process often involves fine-tuning on user data, which may contain sensitive information. Although not recommended, it is not uncommon for users to send passwords in messages, and fine-tuning models on this could result in passwords being leaked. In this study, a Large Language Model is fine-tuned with customer support data and passwords from the RockYou password wordlist using Low-Rank Adaptation (LoRA). Out of the first 200 passwords from the list, 37 were successfully recovered. Further, causal tracing is used to identify that password information is largely located in a few layers. Lastly, Rank One Model Editing (ROME) is used to remove the password information from the model, resulting in the number of passwords recovered going from 37 to 0.

84. 【2504.00030】Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding

Link: https://arxiv.org/abs/2504.00030

Authors: Aayush Gautam, Susav Shrestha, Narasimha Annapareddy

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: accelerates large language, decoding accelerates large, textit, GammaTune, large language model

Comments: 6 pages, 2 figures, 1 table

Abstract: Speculative decoding accelerates large language model (LLM) inference by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, selecting an optimal speculation length is critical for maximizing speedup while minimizing wasted computation. We introduce GammaTune and GammaTune+, training-free adaptive algorithms that dynamically adjust speculation length based on token acceptance rates using a heuristic-based switching mechanism. Evaluated on SpecBench across multiple tasks and model pairs, our method outperforms other heuristic-based approaches and fixed-length speculative decoding, achieving an average speedup of 15% (±5%) with GammaTune and 16% (±3%) with GammaTune+, while reducing performance variance. This makes GammaTune a robust and efficient solution for real-world deployment.
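To make the idea of adapting speculation length from acceptance rates concrete, here is a generic toy controller. It is not GammaTune or GammaTune+, whose switching rules are not given in the abstract: the class name, the exponential smoothing, and the 0.8 / 0.4 thresholds are all assumptions chosen for illustration.

```python
class AdaptiveSpeculationLength:
    """Toy controller: grow gamma when drafts are accepted, shrink otherwise.

    Thresholds and smoothing are hypothetical, not taken from the paper.
    """

    def __init__(self, gamma=4, min_gamma=1, max_gamma=16, smoothing=0.5):
        self.gamma = gamma
        self.min_gamma = min_gamma
        self.max_gamma = max_gamma
        self.smoothing = smoothing
        self.acceptance_rate = 1.0  # optimistic initial estimate

    def update(self, n_accepted: int) -> int:
        """Record how many of the last `gamma` draft tokens were accepted."""
        observed = n_accepted / self.gamma
        # Exponential moving average of the per-step acceptance rate.
        self.acceptance_rate = (
            self.smoothing * self.acceptance_rate
            + (1 - self.smoothing) * observed
        )
        if self.acceptance_rate > 0.8:
            self.gamma = min(self.gamma + 1, self.max_gamma)
        elif self.acceptance_rate < 0.4:
            self.gamma = max(self.gamma - 1, self.min_gamma)
        return self.gamma


controller = AdaptiveSpeculationLength()
history = [controller.update(n) for n in [4, 5, 0, 0, 1]]
print(history)
```

In a real decoding loop, `n_accepted` would come from the target model's verification step, and the returned value would set how many tokens the draft model proposes next.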

85. 【2504.00027】Opioid Named Entity Recognition (ONER-2025) from Reddit

Link: https://arxiv.org/abs/2504.00027

Authors: Muhammad Ahmad, Humaira Farid, Iqra Ameer, Muhammad Muzamil, Ameer Hamza Muhammad Jalal, Ildar Batyrshin, Grigori Sidorov

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: United States, public health crisis, critical public health, overdose epidemic remains, health crisis

Abstract:The opioid overdose epidemic remains a critical public health crisis, particularly in the United States, leading to significant mortality and societal costs. Social media platforms like Reddit provide vast amounts of unstructured data that offer insights into public perceptions, discussions, and experiences related to opioid use. This study leverages Natural Language Processing (NLP), specifically Opioid Named Entity Recognition (ONER-2025), to extract actionable information from these platforms. Our research makes four key contributions. First, we created a unique, manually annotated dataset sourced from Reddit, where users share self-reported experiences of opioid use via different administration routes. This dataset contains 331,285 tokens and includes eight major opioid entity categories. Second, we detail our annotation process and guidelines while discussing the challenges of labeling the ONER-2025 dataset. Third, we analyze key linguistic challenges, including slang, ambiguity, fragmented sentences, and emotionally charged language, in opioid discussions. Fourth, we propose a real-time monitoring system to process streaming data from social media, healthcare records, and emergency services to identify overdose events. Using 5-fold cross-validation in 11 experiments, our system integrates machine learning, deep learning, and transformer-based language models with advanced contextual embeddings to enhance understanding. Our transformer-based models (bert-base-NER and roberta-base) achieved 97% accuracy and F1-score, outperforming baselines by 10.23% (RF=0.88).

86. 【2504.00025】Generalization Bias in Large Language Model Summarization of Scientific Research

Link: https://arxiv.org/abs/2504.00025

Authors: Uwe Peters, Benjamin Chin-Yee

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Artificial intelligence chatbots, intelligence chatbots driven, quickly summarize complex, Artificial intelligence, summarize complex scientific

Abstract:Artificial intelligence chatbots driven by large language models (LLMs) have the potential to increase public science literacy and support scientific research, as they can quickly summarize complex scientific information in accessible terms. However, when summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study. We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet, comparing 4900 LLM-generated summaries to their original scientific texts. Even when explicitly prompted for accuracy, most LLMs produced broader generalizations of scientific results than those in the original texts, with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26 to 73% of cases. In a direct comparison of LLM-generated and human-authored science summaries, LLM summaries were nearly five times more likely to contain broad generalizations (OR = 4.85, 95% CI [3.06, 7.70]). Notably, newer models tended to perform worse in generalization accuracy than earlier ones. Our results indicate a strong bias in many widely used LLMs towards overgeneralizing scientific conclusions, posing a significant risk of large-scale misinterpretations of research findings. We highlight potential mitigation strategies, including lowering LLM temperature settings and benchmarking LLMs for generalization accuracy.

87. 【2504.00021】FUSE : A Ridge and Random Forest-Based Metric for Evaluating MT in Indigenous Languages

Link: https://arxiv.org/abs/2504.00021

Authors: Rahul Raja, Arpita Vats

Categories: Computation and Language (cs.CL)

Keywords: Shared Task, Machine Translation, FUSE integrates Ridge, paper presents, presents the winning

Comments: NAACL 2025

Abstract: This paper presents the winning submission of the RaaVa team to the AmericasNLP 2025 Shared Task 3 on Automatic Evaluation Metrics for Machine Translation (MT) into Indigenous Languages of America, where our system ranked first overall based on average Pearson correlation with the human annotations. We introduce Feature-Union Scorer (FUSE) for Evaluation. FUSE integrates Ridge regression and Gradient Boosting to model translation quality. In addition to FUSE, we explore five alternative approaches leveraging different combinations of linguistic similarity features and learning paradigms. FUSE Score highlights the effectiveness of combining lexical, phonetic, semantic, and fuzzy token similarity with learning-based modeling to improve MT evaluation for morphologically rich and low-resource languages. MT into Indigenous languages poses unique challenges due to polysynthesis, complex morphology, and non-standardized orthography. Conventional automatic metrics such as BLEU, TER, and ChrF often fail to capture deeper aspects like semantic adequacy and fluency. Our proposed framework, formerly referred to as FUSE, incorporates multilingual sentence embeddings and phonological encodings to better align with human evaluation. We train supervised models on human-annotated development sets and evaluate held-out test data. Results show that FUSE consistently achieves higher Pearson and Spearman correlations with human judgments, offering a robust and linguistically informed solution for MT evaluation in low-resource settings.

88. 【2504.00019】ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding

Link: https://arxiv.org/abs/2504.00019

Authors: Indraneil Paul, Haoyi Yang, Goran Glavaš, Kristian Kersting, Iryna Gurevych

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Keywords: code-writing toolbox, pre-training, natural language LMs, code, language LMs

Abstract:Language models (LMs) have become a staple of the code-writing toolbox. Their pre-training recipe has, however, remained stagnant over recent years, barring the occasional changes in data sourcing and filtering strategies. In particular, research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling between syntax and semantics, has been noticeably sparse, especially compared with corresponding efforts in natural language LMs. In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond the surface-form syntax and enhance their pre-training sample efficiency. To this end, we compile ObscuraX, a dataset of approximately 55M source and obfuscated code pairs in seven languages. Subsequently, we pre-train ObscuraCoder models, ranging in size from 255M to 2.8B parameters, on a 272B-token corpus that includes ObscuraX and demonstrate that our obfuscation-based pre-training recipe leads to consistent improvements in Code-LMs' abilities compared to both vanilla autoregressive pre-training as well as existing de-obfuscation (DOBF) objectives. ObscuraCoder demonstrates sizeable gains across multiple tests of syntactic and semantic code understanding, along with improved capabilities in multilingual code completion, multilingual code commit summarization, and multi-purpose library-oriented code generation.

89. 【2504.00016】Medical Reasoning in LLMs: An In-Depth Analysis of DeepSeek R1

Link: https://arxiv.org/abs/2504.00016

Authors: Birger Moell, Fredrik Sand Aronsson, Sanian Akbar

Categories: Computation and Language (cs.CL)

Keywords: Integrating large language, healthcare requires rigorous, requires rigorous evaluation, Integrating large, large language models

Abstract: Integrating large language models (LLMs) like DeepSeek R1 into healthcare requires rigorous evaluation of their reasoning alignment with clinical expertise. This study assesses DeepSeek R1's medical reasoning against expert patterns using 100 MedQA clinical cases. The model achieved 93% diagnostic accuracy, demonstrating systematic clinical judgment through differential diagnosis, guideline-based treatment selection, and integration of patient-specific factors. However, error analysis of seven incorrect cases revealed persistent limitations: anchoring bias, challenges reconciling conflicting data, insufficient exploration of alternatives, overthinking, knowledge gaps, and premature prioritization of definitive treatment over intermediate care. Crucially, reasoning length correlated with accuracy: shorter responses (<5,000 characters) were more reliable, suggesting extended explanations may signal uncertainty or rationalization of errors. While DeepSeek R1 exhibits foundational clinical reasoning capabilities, recurring flaws highlight critical areas for refinement, including bias mitigation, knowledge updates, and structured reasoning frameworks. These findings underscore LLMs' potential to augment medical decision-making through artificial reasoning but emphasize the need for domain-specific validation, interpretability safeguards, and confidence metrics (e.g., response length thresholds) to ensure reliability in real-world applications.

90. 【2503.24388】RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy

Link: https://arxiv.org/abs/2503.24388

Authors: Zhonghan Zhao, Wenwei Zhang, Haian Huang, Kuikun Liu, Jianfei Gao, Gaoang Wang, Kai Chen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: embodied agents operating, complex open-world environments, essential for embodied, operating in complex, complex open-world

Abstract: Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet, prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, limiting the learning efficiency and generalization of the policy. Thus, this paper makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG. To train RIG in an end-to-end manner, we construct a data pipeline that progressively integrates and enriches the content of imagination and reasoning in the trajectories collected from existing agents. The joint learning of reasoning and next image generation explicitly models the inherent correlation between reasoning, action, and dynamics of environments, and thus exhibits more than 17× sample efficiency improvements and generalization in comparison with previous works. During inference, RIG first reasons about the next action, produces a potential action, and then predicts the action outcomes, which offers the agent a chance to review and self-correct based on the imagination before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of generalist policy but also enables test-time scaling to enhance overall performance.

Information Retrieval

1. 【2504.00882】CrackSQL: A Hybrid SQL Dialect Translation System Powered by Large Language Models

Link: https://arxiv.org/abs/2504.00882

Authors: Wei Zhou, Yuyang Gao, Xuanhe Zhou, Guoliang Li

Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: enabling seamless interaction, heterogeneous database systems, Dialect translation plays, plays a key, key role

Comments: Extension of our SIGMOD 2025 paper. Please refer to source code available at: [this https URL](https://github.com/weAIDB/CrackSQL)

Abstract: Dialect translation plays a key role in enabling seamless interaction across heterogeneous database systems. However, translating SQL queries between different dialects (e.g., from PostgreSQL to MySQL) remains a challenging task due to syntactic discrepancies and subtle semantic variations. Existing approaches including manual rewriting, rule-based systems, and large language model (LLM)-based techniques often involve high maintenance effort (e.g., crafting custom translation rules) or produce unreliable results (e.g., LLM generates non-existent functions), especially when handling complex queries. In this demonstration, we present CrackSQL, the first hybrid SQL dialect translation system that combines rule and LLM-based methods to overcome these limitations. CrackSQL leverages the adaptability of LLMs to minimize manual intervention, while enhancing translation accuracy by segmenting lengthy complex SQL via functionality-based query processing. To further improve robustness, it incorporates a novel cross-dialect syntax embedding model for precise syntax alignment, as well as an adaptive local-to-global translation strategy that effectively resolves interdependent query operations. CrackSQL supports three translation modes and offers multiple deployment and access options including a web console interface, a PyPI package, and a command-line prompt, facilitating adoption across a variety of real-world use cases.

2. 【2504.00828】Linked Array Tree: A Constant-Time Search Structure for Big Data

Link: https://arxiv.org/abs/2504.00828

Authors: Songpeng Liu

Categories: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)

Keywords: face increasing challenges, Linked Array Tree, data volumes continue, traditional search algorithms, red-black tree

Abstract:As data volumes continue to grow rapidly, traditional search algorithms, like the red-black tree and B+ Tree, face increasing challenges in performance, especially in big data scenarios with intensive storage access. This paper presents the Linked Array Tree (LAT), a novel data structure designed to achieve constant-time complexity for search, insertion, and deletion operations. LAT leverages a sparse, non-moving hierarchical layout that enables direct access paths without requiring rebalancing or data movement. Its low memory overhead and avoidance of pointer-heavy structures make it well-suited for large-scale and intensive workloads. While not specifically tested under parallel or concurrent conditions, the structure's static layout and non-interfering operations suggest potential advantages in such environments. This paper first introduces the structure and algorithms of LAT, followed by a detailed analysis of its time complexity in search, insertion, and deletion operations. Finally, it presents experimental results across both data-intensive and sparse usage scenarios to evaluate LAT's practical performance.

Computer Vision

1. 【2504.01020】Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation

Link: https://arxiv.org/abs/2504.01020

Authors: Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Eshika Khandelwal, Gül Varol, Weidi Xie, Andrew Zisserman

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Audio Descriptions, edited video material, video material, edited video, Descriptions

Comments: Project Page: [this https URL](https://www.robots.ox.ac.uk/vgg/research/shot-by-shot/)

Abstract:Our objective is the automatic generation of Audio Descriptions (ADs) for edited video material, such as movies and TV series. To achieve this, we propose a two-stage framework that leverages "shots" as the fundamental units of video understanding. This includes extending temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures, to guide AD generation. Our method is compatible with both open-source and proprietary Visual-Language Models (VLMs), integrating expert knowledge from add-on modules without requiring additional training of the VLMs. We achieve state-of-the-art performance among all prior training-free approaches and even surpass fine-tuned methods on several benchmarks. To evaluate the quality of predicted ADs, we introduce a new evaluation measure -- an action score -- specifically targeted to assessing this important aspect of AD. Additionally, we propose a novel evaluation protocol that treats automatic frameworks as AD generation assistants and asks them to generate multiple candidate ADs for selection.

2. 【2504.01019】MixerMDM: Learnable Composition of Human Motion Diffusion Models

Link: https://arxiv.org/abs/2504.01019

Authors: Pablo Ruiz-Ponce, German Barquero, Cristina Palmero, Sergio Escalera, José García-Rodríguez

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generating human motion, motion diffusion models, Generating human, human motion guided, challenging due

Comments: CVPR 2025 Accepted - Project Page: [this https URL](https://pabloruizponce.com/papers/MixerMDM)

Abstract:Generating human motion guided by conditions such as textual descriptions is challenging due to the need for datasets with pairs of high-quality motion and their corresponding conditions. The difficulty increases when aiming for finer control in the generation. To that end, prior works have proposed to combine several motion diffusion models pre-trained on datasets with different types of conditions, thus allowing control with multiple conditions. However, the proposed merging strategies overlook that the optimal way to combine the generation processes might depend on the particularities of each pre-trained generative model and also the specific textual descriptions. In this context, we introduce MixerMDM, the first learnable model composition technique for combining pre-trained text-conditioned human motion diffusion models. Unlike previous approaches, MixerMDM provides a dynamic mixing strategy that is trained in an adversarial fashion to learn to combine the denoising process of each model depending on the set of conditions driving the generation. By using MixerMDM to combine single- and multi-person motion diffusion models, we achieve fine-grained control on the dynamics of every person individually, and also on the overall interaction. Furthermore, we propose a new evaluation technique that, for the first time in this task, measures the interaction and individual quality by computing the alignment between the mixed generated motions and their conditions as well as the capabilities of MixerMDM to adapt the mixing throughout the denoising process depending on the motions to mix.

3. 【2504.01017】Scaling Language-Free Visual Representation Learning

Link: https://arxiv.org/abs/2504.01017

Authors: David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, Saining Xie

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: underperforms Contrastive Language-Image, Contrastive Language-Image Pretraining, Visual Question Answering, visual SSL, underperforms Contrastive

Comments: Project page at [this https URL](https://davidfan.io/webssl/)

Abstract:Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. In this work, we ask the question: "Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?" We study this question by training both visual SSL and CLIP models on the same MetaCLIP data, and leveraging VQA as a diverse testbed for vision encoders. In this controlled setup, visual SSL models scale better than CLIP models in terms of data and model capacity, and visual SSL performance does not saturate even after scaling up to 7B parameters. Consequently, we observe visual SSL methods achieve CLIP-level performance on a wide range of VQA and classic vision benchmarks. These findings demonstrate that pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning.

4. 【2504.01016】GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

Link: https://arxiv.org/abs/2504.01016

Authors: Tian-Xing Xu, Xiangjun Gao, Wenbo Hu, Xiaoyu Li, Song-Hai Zhang, Ying Shan

Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing methods exhibit, grounded downstream tasks, methods exhibit inherent, exhibit inherent limitations, achieving geometric fidelity

Comments: Project webpage: [this https URL](https://geometrycrafter.github.io/)

Abstract:Despite remarkable advancements in video depth estimation, existing methods exhibit inherent limitations in achieving geometric fidelity through the affine-invariant predictions, limiting their applicability in reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging the VAE, we train a video diffusion model to model the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.

5. 【2504.01014】AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

Link: https://arxiv.org/abs/2504.01014

Authors: Junhao Cheng, Yuying Ge, Yixiao Ge, Jing Liao, Ying Shan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advancements, large language models, synthesis have opened, promise in generative, Recent

Comments: Project released at: [this https URL](https://howe125.github.io/AnimeGamer.github.io/)

Abstract: Recent advancements in image and video synthesis have opened up new promise in generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic anime world as their favorite characters for life simulation through language instructions. Such games are defined as infinite games since they eliminate predetermined boundaries and fixed gameplay rules, so that players can interact with the game world through open-ended language and experience ever-evolving storylines and environments. Recently, a pioneering approach for infinite anime life simulation employs large language models (LLMs) to translate multi-turn text dialogues into language instructions for image generation. However, it neglects historical visual context, leading to inconsistent gameplay. Furthermore, it only generates static images, failing to incorporate the dynamics necessary for an engaging gaming experience. In this work, we propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states, as illustrated in Figure 1. We introduce novel action-aware multimodal representations to represent animation shots, which can be decoded into high-quality video clips using a video diffusion model. By taking historical animation shot representations as context and predicting subsequent representations, AnimeGamer can generate games with contextual consistency and satisfactory dynamics. Extensive evaluations using both automated metrics and human evaluations demonstrate that AnimeGamer outperforms existing methods in various aspects of the gaming experience. Codes and checkpoints are available at this https URL.

6. 【2504.01010】A YOLO-Based Semi-Automated Labeling Approach to Improve Fault Detection Efficiency in Railroad Videos

Link: https://arxiv.org/abs/2504.01010

Authors: Dylan Lester, James Gao, Samuel Sutphin, Pingping Zhu, Husnu Narman, Ammar Alzarrad

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Manual labeling, railroad videos, posing a significant, fault detection, significant barrier

Comments: Published on American Society of Engineering Education (ASEE) North Central Section Conference, 2025

Abstract: Manual labeling for large-scale image and video datasets is often time-intensive, error-prone, and costly, posing a significant barrier to efficient machine learning workflows in fault detection from railroad videos. This study introduces a semi-automated labeling method that utilizes a pre-trained You Only Look Once (YOLO) model to streamline the labeling process and enhance fault detection accuracy in railroad videos. By initiating the process with a small set of manually labeled data, our approach iteratively trains the YOLO model, using each cycle's output to improve model accuracy and progressively reduce the need for human intervention. To facilitate easy correction of model predictions, we developed a system to export YOLO's detection data as an editable text file, enabling rapid adjustments when detections require refinement. This approach decreases labeling time from an average of 2 to 4 minutes per image to 30 seconds to 2 minutes, effectively minimizing labor costs and labeling errors. Unlike costly AI-based labeling solutions on paid platforms, our method provides a cost-effective alternative for researchers and practitioners handling large datasets in fault detection and other detection-based machine learning applications.
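
The abstract mentions exporting detections as an editable text file but does not give the file layout. As a minimal sketch, the code below assumes the common YOLO label convention of one line per box, `class x_center y_center width height`, with coordinates normalized by the image size. The function names and the pixel-space input format are hypothetical.

```python
def to_yolo_line(class_id, box, img_w, img_h):
    """Convert one pixel-space box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2 / img_w   # normalized box centre
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w        # normalized box size
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def render_label_file(detections, img_w, img_h):
    """Return the text of one label file: one line per (class_id, box) detection."""
    return "".join(to_yolo_line(c, b, img_w, img_h) + "\n" for c, b in detections)


# Hypothetical detections for a 1000x800 frame.
text = render_label_file([(0, (100, 200, 300, 400)), (2, (0, 0, 500, 400))], 1000, 800)
print(text, end="")
```

Because the result is plain text, an annotator can correct a class id or nudge a box in any editor before the file is fed back into the next training cycle.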

7. 【2504.01009】GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology

Link: https://arxiv.org/abs/2504.01009

Authors: Saarthak Kapse, Pushpak Pati, Srikar Yellapragada, Srijan Das, Rajarsi R. Gupta, Joel Saltz, Dimitris Samaras, Prateek Prasanna

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multiple Instance Learning, Instance Learning, Slide Image, Multiple Instance, aggregator enables

Abstract:Pretraining a Multiple Instance Learning (MIL) aggregator enables the derivation of Whole Slide Image (WSI)-level embeddings from patch-level representations without supervision. While recent multimodal MIL pretraining approaches leveraging auxiliary modalities have demonstrated performance gains over unimodal WSI pretraining, the acquisition of these additional modalities necessitates extensive clinical profiling. This requirement increases costs and limits scalability in existing WSI datasets lacking such paired modalities. To address this, we propose Gigapixel Vision-Concept Knowledge Contrastive pretraining (GECKO), which aligns WSIs with a Concept Prior derived from the available WSIs. First, we derive an inherently interpretable concept prior by computing the similarity between each WSI patch and textual descriptions of predefined pathology concepts. GECKO then employs a dual-branch MIL network: one branch aggregates patch embeddings into a WSI-level deep embedding, while the other aggregates the concept prior into a corresponding WSI-level concept embedding. Both aggregated embeddings are aligned using a contrastive objective, thereby pretraining the entire dual-branch MIL model. Moreover, when auxiliary modalities such as transcriptomics data are available, GECKO seamlessly integrates them. Across five diverse tasks, GECKO consistently outperforms prior unimodal and multimodal pretraining approaches while also delivering clinically meaningful interpretability that bridges the gap between computational models and pathology expertise. Code is made available at this https URL
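
The concept prior described above is a patch-by-concept similarity matrix. As a minimal sketch, the code below assumes that patches and concept descriptions have already been embedded into a shared vector space and that cosine similarity is the measure. The abstract says only "similarity", so both points are assumptions, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def concept_prior(patch_embeddings, concept_embeddings):
    """Return a (num_patches x num_concepts) matrix of patch-to-concept similarities."""
    return [[cosine(p, c) for c in concept_embeddings] for p in patch_embeddings]


# Toy 2-D embeddings, made up purely for illustration.
patches = [[1.0, 0.0], [0.0, 2.0]]
concepts = [[1.0, 0.0], [1.0, 1.0]]
prior = concept_prior(patches, concepts)
```

Each row can be read directly as "how strongly this patch matches each predefined concept", which is what makes such a prior interpretable before any aggregation happens.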

8. 【2504.01008】IntrinsiX: High-Quality PBR Generation using Image Priors

Link: https://arxiv.org/abs/2504.01008

Authors: Peter Kocsis (1), Lukas Höllein (1), Matthias Nießner (1) ((1) Technical University of Munich)

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: generates high-quality intrinsic, introduce IntrinsiX, text description, generates high-quality, PBR

Comments: Project page: [this https URL](https://peter-kocsis.github.io/IntrinsiX/) Video: [this https URL](https://youtu.be/b0wVA44R93Y)

Abstract:We introduce IntrinsiX, a novel method that generates high-quality intrinsic images from text description. In contrast to existing text-to-image models whose outputs contain baked-in scene lighting, our approach predicts physically-based rendering (PBR) maps. This enables the generated outputs to be used for content creation scenarios in core graphics applications that facilitate re-lighting, editing, and texture generation tasks. In order to train our generator, we exploit strong image priors, and pre-train separate models for each PBR material component (albedo, roughness, metallic, normals). We then align these models with a new cross-intrinsic attention formulation that concatenates key and value features in a consistent fashion. This allows us to exchange information between each output modality and to obtain semantically coherent PBR predictions. To ground each intrinsic component, we propose a rendering loss which provides image-space signals to constrain the model, thus facilitating sharp details also in the output BRDF properties. Our results demonstrate detailed intrinsic generation with strong generalization capabilities that outperforms existing intrinsic image decomposition methods used with generated images by a significant margin. Finally, we show a series of applications, including re-lighting, editing, and text-conditioned room-scale PBR texture generation.

9. 【2504.01004】Enhancing 3T BOLD fMRI SNR using Unpaired 7T Data with Schrödinger Bridge Diffusion

Link: https://arxiv.org/abs/2504.01004

Authors: Yujian Xiong, Xuanzhao Dong, Sebastian Waz, Wenhui Zhu, Negar Mallak, Zhong-lin Lu, Yalin Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: processes visual stimuli, brain processes visual, High spatial, Tesla fMRI, MRI systems

Abstract:High spatial and temporal resolution, coupled with a strong signal-to-noise ratio (SNR), has made BOLD 7 Tesla fMRI an invaluable tool for understanding how the brain processes visual stimuli. However, the limited availability of 7T MRI systems means that most research relies on 3T MRI systems, which offer lower spatial and temporal resolution and SNR. This naturally raises the question: Can we enhance the spatiotemporal resolution and SNR of 3T BOLD fMRI data to approximate 7T quality? In this study, we propose a novel framework that aligns 7T and 3T fMRI data from different subjects and datasets in a shared parametric domain. We then apply an unpaired Brain Disk Schrödinger Bridge diffusion model to enhance the spatiotemporal resolution and SNR of the 3T data. Our approach addresses the challenge of limited 7T data by improving the 3T scan quality. We demonstrate its effectiveness by testing it on two distinct fMRI retinotopy datasets (one 7T and one 3T), as well as synthetic data. The results show that our method significantly improves the SNR and goodness-of-fit of the population receptive field (pRF) model in the enhanced 3T data, making it comparable to 7T quality. The codes will be available at Github.

10. 【2504.00999】MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

Link: https://arxiv.org/abs/2504.00999

Authors: Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, Zhen Lei

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Masked Image Modeling, achieved great success, Vector Quantization, Masked Image, Image Modeling

Comments: CVPR2025 (in process for more analysis and extension)

Abstract:Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to address the trade-off in shared latent space for generation quality vs. representation learning and efficiency. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ decouples top-k semantics from latent space with the token merge module after self-attention blocks in the encoder for subsequent Look-up Free Quantization (LFQ) and global alignment and recovers their fine-grained details through cross-attention in the decoder for reconstruction. As for the second-stage generation, we introduce MergeAR, which performs KV Cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ as an AR generative model achieves competitive performance in both visual representation learning and image generation tasks while maintaining favorable token efficiency and inference speed. The code and model will be available at this https URL.

11. 【2504.00996】TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting

Link: https://arxiv.org/abs/2504.00996

Authors: Liangbin Xie, Daniil Pakhomov, Zhonghao Wang, Zongze Wu, Ziyan Chen, Yuqian Zhou, Haitian Zheng, Zhifei Zhang, Zhe Lin, Jiantao Zhou, Chao Dong

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fast image inpainting, paper introduces TurboFill, paper introduces, fast image, image inpainting model

Comments: Project webpage available at [this https URL](https://liangbinxie.github.io/projects/TurboFill/)

Abstract:This paper introduces TurboFill, a fast image inpainting model that enhances a few-step text-to-image diffusion model with an inpainting adapter for high-quality and efficient inpainting. While standard diffusion models generate high-quality results, they incur high computational costs. We overcome this by training an inpainting adapter on a few-step distilled text-to-image model, DMD2, using a novel 3-step adversarial training scheme to ensure realistic, structurally consistent, and visually harmonious inpainted regions. To evaluate TurboFill, we propose two benchmarks: DilationBench, which tests performance across mask sizes, and HumanBench, based on human feedback for complex prompts. Experiments show that TurboFill outperforms both multi-step BrushNet and few-step inpainting methods, setting a new benchmark for high-performance inpainting tasks. Our project page: this https URL

12. 【2504.00992】SuperDec: 3D Scene Decomposition with Superquadric Primitives

Link: https://arxiv.org/abs/2504.00992

Authors: Elisabetta Fedele, Boyang Sun, Leonidas Guibas, Marc Pollefeys, Francis Engelmann

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: present SuperDec, approach for creating, scene representations, creating compact, leverage geometric primitives

Abstract: We present SuperDec, an approach for creating compact 3D scene representations via decomposition into superquadric primitives. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we propose to leverage them to obtain a compact yet expressive representation. We propose to solve the problem locally on individual objects and leverage the capabilities of instance segmentation methods to scale our solution to full 3D scenes. In doing that, we design a new architecture which efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. We train our architecture on ShapeNet and we prove its generalization capabilities on object instances extracted from the ScanNet++ dataset as well as on full Replica scenes. Finally, we show how a compact representation based on superquadrics can be useful for a diverse range of downstream applications, including robotic tasks and controllable visual content generation and editing.

13. 【2504.00983】WorldScore: A Unified Evaluation Benchmark for World Generation

Link: https://arxiv.org/abs/2504.00983

Authors: Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, Jiajun Wu

Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: world generation, generation, WorldScore benchmark, decompose world generation, introduce the WorldScore

Comments: Project website: [this https URL](https://haoyi-duan.github.io/WorldScore/) The first two authors contributed equally

Abstract:We introduce the WorldScore benchmark, the first unified benchmark for world generation. We decompose world generation into a sequence of next-scene generation tasks with explicit camera trajectory-based layout specifications, enabling unified evaluation of diverse approaches from 3D and 4D scene generation to video generation models. The WorldScore benchmark encompasses a curated dataset of 3,000 test examples that span diverse worlds: static and dynamic, indoor and outdoor, photorealistic and stylized. The WorldScore metrics evaluate generated worlds through three key aspects: controllability, quality, and dynamics. Through extensive evaluation of 19 representative models, including both open-source and closed-source ones, we reveal key insights and challenges for each category of models. Our dataset, evaluation code, and leaderboard can be found at this https URL

14. 【2504.00979】Artificial Intelligence-Assisted Prostate Cancer Diagnosis for Reduced Use of Immunohistochemistry

Link: https://arxiv.org/abs/2504.00979

Authors: Anders Blilie (1 and 2), Nita Mulliqi (3), Xiaoyi Ji (3), Kelvin Szolnoky (3), Sol Erika Boman (3 and 4), Matteo Titus (3), Geraldine Martinez Gonzalez (3), José Asenjo (5), Marcello Gambacorta (6), Paolo Libretti (6), Einar Gudlaugsson (1), Svein R. Kjosavik (7 and 8), Lars Egevad (9), Emiel A.M. Janssen (1 and 10 and 11), Martin Eklund (3), Kimmo Kartasalo (12) ((1) Department of Pathology, Stavanger University Hospital, Stavanger, Norway, (2) Faculty of Health Sciences, University of Stavanger, Stavanger, Norway, (3) Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden, (4) Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden, (5) Department of Pathology, Synlab, Madrid, Spain, (6) Department of Pathology, Synlab, Brescia, Italy, (7) The General Practice and Care Coordination Research Group, Stavanger University Hospital, Stavanger, Norway, (8) Department of Global Public Health and Primary Care, Faculty of Medicine, University of Bergen, Bergen, Norway, (9) Department of Oncology and Pathology, Karolinska Institutet, Stockholm, Sweden, (10) Faculty of Science and Technology, University of Stavanger, Stavanger, Norway, (11) Institute for Biomedicine and Glycomics, Griffith University, Queensland, Australia, (12) Department of Medical Epidemiology and Biostatistics, SciLifeLab, Karolinska Institutet, Stockholm, Sweden)

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diagnosis heavily relies, histopathological evaluation, subject to variability, heavily relies, relies on histopathological

Comments: 29 pages, 5 figures and 3 tables

Abstract:Prostate cancer diagnosis heavily relies on histopathological evaluation, which is subject to variability. While immunohistochemical staining (IHC) assists in distinguishing benign from malignant tissue, it involves increased work, higher costs, and diagnostic delays. Artificial intelligence (AI) presents a promising solution to reduce reliance on IHC by accurately classifying atypical glands and borderline morphologies in hematoxylin eosin (HE) stained tissue sections. In this study, we evaluated an AI model's ability to minimize IHC use without compromising diagnostic accuracy by retrospectively analyzing prostate core needle biopsies from routine diagnostics at three different pathology sites. These cohorts were composed exclusively of difficult cases where the diagnosing pathologists required IHC to finalize the diagnosis. The AI model demonstrated area under the curve values of 0.951-0.993 for detecting cancer in routine HE-stained slides. Applying sensitivity-prioritized diagnostic thresholds reduced the need for IHC staining by 44.4%, 42.0%, and 20.7% in the three cohorts investigated, without a single false negative prediction. This AI model shows potential for optimizing IHC use, streamlining decision-making in prostate pathology, and alleviating resource burdens.
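
The abstract does not say how the sensitivity-prioritized thresholds were derived. One simple way to pick such a threshold, shown here purely as an assumption, is to take the lowest AI score among confirmed cancer cases, which guarantees zero false negatives on that data. Cases scoring below it are the ones where IHC could in principle be skipped. The function names and toy scores are made up for illustration.

```python
def zero_false_negative_threshold(scores, labels):
    """Largest threshold at which every cancer case (label 1) scores at or above it."""
    cancer_scores = [s for s, y in zip(scores, labels) if y == 1]
    if not cancer_scores:
        raise ValueError("need at least one cancer case to set the threshold")
    return min(cancer_scores)


def fraction_below(scores, threshold):
    """Share of cases the model would call benign, i.e. candidates for skipping IHC."""
    return sum(s < threshold for s in scores) / len(scores)


# Toy AI scores (probability of cancer) and ground truth; not data from the paper.
scores = [0.05, 0.10, 0.40, 0.35, 0.80, 0.95]
labels = [0, 0, 0, 1, 1, 1]

threshold = zero_false_negative_threshold(scores, labels)
saved = fraction_below(scores, threshold)
print(f"threshold={threshold:.2f}, IHC avoided in {saved:.1%} of cases")
```

In this toy set the benign case scoring 0.40 stays above the threshold, so it would still go to IHC. That is the price of prioritizing sensitivity: fewer stains are saved in exchange for missing no cancers. In practice the threshold would have to be fixed on separate data from the cohort it is evaluated on.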

15. 【2504.00954】IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval

Link: https://arxiv.org/abs/2504.00954

Authors: Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, Chaochao Lu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: digital content industries, AI-driven digital content, Multimodal retrieval systems, cutting-edge AI technologies, content industries

Abstract: Multimodal retrieval systems are becoming increasingly vital for cutting-edge AI technologies, such as embodied AI and AI-driven digital content industries. However, current multimodal retrieval tasks lack sufficient complexity and demonstrate limited practical application value. This inspires us to design Instance-Driven Multimodal Image Retrieval (IDMR), a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario. Unlike existing retrieval tasks focused on global image similarity or category-level matching, IDMR demands fine-grained instance-level consistency across diverse contexts. To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data. Addressing the scarcity of training data, we propose a cross-domain synthesis method that creates 557K training samples by cropping objects from standard detection datasets. Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench. Experimental results demonstrate previous models' limitations in instance-aware retrieval and highlight the potential of MLLM for advanced retrieval applications. The whole training dataset, codes and models, with wide ranges of sizes, are available at this https URL.

16. 【2504.00952】Personalized Federated Training of Diffusion Models with Privacy Guarantees

Link: https://arxiv.org/abs/2504.00952

Authors: Kumar Kshitij Patel, Weitong Zhang, Lingxiao Wang

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: ethically sourced data, sourced data presents, scarcity of accessible, artificial intelligence, fields like healthcare

Comments: 18 pages, 4 figures

Abstract:The scarcity of accessible, compliant, and ethically sourced data presents a considerable challenge to the adoption of artificial intelligence (AI) in sensitive fields like healthcare, finance, and biomedical research. Furthermore, access to unrestricted public datasets is increasingly constrained due to rising concerns over privacy, copyright, and competition. Synthetic data has emerged as a promising alternative, and diffusion models -- a cutting-edge generative AI technology -- provide an effective solution for generating high-quality and diverse synthetic data. In this paper, we introduce a novel federated learning framework for training diffusion models on decentralized private datasets. Our framework leverages personalization and the inherent noise in the forward diffusion process to produce high-quality samples while ensuring robust differential privacy guarantees. Our experiments show that our framework outperforms non-collaborative training methods, particularly in settings with high data heterogeneity, and effectively reduces biases and imbalances in synthetic data, resulting in fairer downstream models.

17. 【2504.00950】Neural Pruning for 3D Scene Reconstruction: Efficient NeRF Acceleration

Link: https://arxiv.org/abs/2504.00950

Authors: Tianqi Ding, Dawei Xiang, Pablo Rivas, Liang Dong

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Neural Radiance Fields, Radiance Fields, Neural Radiance, reconstruction approach, recent years

Comments: 12 pages, 4 figures, accepted by International Conference on the AI Revolution: Research, Ethics, and Society (AIR-RES 2025)

Abstract:Neural Radiance Fields (NeRF) have become a popular 3D reconstruction approach in recent years. While they produce high-quality results, they also demand lengthy training times, often spanning days. This paper studies neural pruning as a strategy to address these concerns. We compare pruning approaches, including uniform sampling, importance-based methods, and coreset-based techniques, to reduce the model size and speed up training. Our findings show that coreset-driven pruning can achieve a 50% reduction in model size and a 35% speedup in training, with only a slight decrease in accuracy. These results suggest that pruning can be an effective method for improving the efficiency of NeRF models in resource-limited settings.
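
Of the three pruning families named above, the importance-based one is the easiest to illustrate. The sketch below keeps the hidden units of one layer whose incoming weight vectors have the largest L2 norm. The abstract does not state which importance score the paper uses, so the L2 norm, the function name, and the toy weights are assumptions, and the coreset-based method that gave the reported 50% size reduction is not reproduced here.

```python
import math


def prune_neurons(weight_rows, keep_ratio):
    """Keep the highest-norm rows (one row per neuron) of a layer's weight matrix.

    Returns the kept row indices in their original order and the pruned matrix.
    """
    n_keep = max(1, int(len(weight_rows) * keep_ratio))
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight_rows]
    ranked = sorted(range(len(weight_rows)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])            # preserve original neuron order
    return kept, [weight_rows[i] for i in kept]


# Toy layer with four neurons and two inputs each; values are made up.
layer = [[0.1, 0.0], [3.0, 4.0], [0.0, 0.2], [1.0, 1.0]]
kept, pruned = prune_neurons(layer, keep_ratio=0.5)
print(kept)
```

After such a step the next layer's weight matrix must drop the matching input columns, and the network is usually fine-tuned to recover accuracy.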

18. 【2504.00946】GKAN: Explainable Diagnosis of Alzheimer's Disease Using Graph Neural Network with Kolmogorov-Arnold Networks

Link: https://arxiv.org/abs/2504.00946

Authors: Tianqi Ding, Dawei Xiang, Keith E Schubert, Liang Dong

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: progressive neurodegenerative disorder, poses significant diagnostic, significant diagnostic challenges, diagnostic challenges due, Graph Convolutional Networks

Comments: 12 pages, 4 figures, under review of The Southwest Data Science Conference (SDSC 2025)

Abstract:Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that poses significant diagnostic challenges due to its complex etiology. Graph Convolutional Networks (GCNs) have shown promise in modeling brain connectivity for AD diagnosis, yet their reliance on linear transformations limits their ability to capture intricate nonlinear patterns in neuroimaging data. To address this, we propose GCN-KAN, a novel single-modal framework that integrates Kolmogorov-Arnold Networks (KAN) into GCNs to enhance both diagnostic accuracy and interpretability. Leveraging structural MRI data, our model employs learnable spline-based transformations to better represent brain region interactions. Evaluated on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, GCN-KAN outperforms traditional GCNs by 4-8% in classification accuracy while providing interpretable insights into key brain regions associated with AD. This approach offers a robust and explainable tool for early AD diagnosis.

19. 【2504.00943】Graph Classification and Radiomics Signature for Identification of Tuberculous Meningitis

Link: https://arxiv.org/abs/2504.00943

Authors: Snigdha Agarwal, Ganaraja V H, Neelam Sinha, Abhilasha Indoria, Netravathi M, Jitender Saini

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Tuberculous meningitis, Mycobacterium tuberculosis, caused by Mycobacterium, brain infection caused, Magnetic Resonance Imaging

Comments: 19 pages, 6 figures, 3 tables

Abstract:Introduction: Tuberculous meningitis (TBM) is a serious brain infection caused by Mycobacterium tuberculosis, characterized by inflammation of the meninges covering the brain and spinal cord. Diagnosis often requires invasive lumbar puncture (LP) and cerebrospinal fluid (CSF) analysis. Objectives: This study aims to classify TBM patients using T1-weighted (T1w) non-contrast Magnetic Resonance Imaging (MRI) scans. We hypothesize that specific brain regions, such as the interpeduncular cisterns, bone, and corpus callosum, contain visual markers that can non-invasively distinguish TBM patients from healthy controls. We propose a novel Pixel-array Graphs Classifier (PAG-Classifier) that leverages spatial relationships between neighbouring 3D pixels in a graph-based framework to extract significant features through eigen decomposition. These features are then used to train machine learning classifiers for effective patient classification. We validate our approach using a radiomics-based methodology, classifying TBM patients based on relevant radiomics features. Results: We utilized an internal dataset consisting of 52 scans, 32 from confirmed TBM patients based on mycobacteria detection in CSF, and 20 from healthy individuals. We achieved a 5-fold cross-validated average F1 score of 85.71% for cistern regions with our PAG-Classifier and 92.85% with the radiomics features classifier, surpassing current state-of-the-art benchmarks by 15% and 22%, respectively. However, bone and corpus callosum regions showed poor classification effectiveness, with average F1 scores below 50%. Conclusion: Our study suggests that algorithms like the PAG-Classifier serve as effective tools for non-invasive TBM analysis, particularly by targeting the interpeduncular cistern. Findings indicate that the bone and corpus callosum regions lack distinctive patterns for differentiation.

20. 【2504.00939】WikiVideo: Article Generation from Multiple Videos

Link: https://arxiv.org/abs/2504.00939

Authors: Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: high-level Wikipedia-style article, high-level Wikipedia-style, Wikipedia-style article, political elections, present the challenging

Comments: Repo can be found here: [this https URL](https://github.com/alexmartin1722/wikivideo)

Abstract:We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.

21. 【2504.00908】DBF-UNet: A Two-Stage Framework for Carotid Artery Segmentation with Pseudo-Label Generation

Link: https://arxiv.org/abs/2504.00908

Authors: Haoxuan Li, Wei Song, Aofan Liu, Peiwu Qin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Medical image analysis, image analysis faces, analysis faces significant, exhibit spatially discontinuous, Medical image

Abstract:Medical image analysis faces significant challenges due to limited annotation data, particularly in three-dimensional carotid artery segmentation tasks, where existing datasets exhibit spatially discontinuous slice annotations with only a small portion of expert-labeled slices in complete 3D volumetric data. To address this challenge, we propose a two-stage segmentation framework. First, we construct continuous vessel centerlines by interpolating between annotated slice centroids and propagate labels along these centerlines to generate interpolated annotations for unlabeled slices. The slices with expert annotations are used for fine-tuning SAM-Med2D, while the interpolated labels on unlabeled slices serve as prompts to guide segmentation during inference. In the second stage, we propose a novel Dense Bidirectional Feature Fusion UNet (DBF-UNet). This lightweight architecture achieves precise segmentation of complete 3D vascular structures. The network incorporates bidirectional feature fusion in the encoder and integrates multi-scale feature aggregation with dense connectivity for effective feature reuse. Experimental validation on public datasets demonstrates that our proposed method effectively addresses the sparse annotation challenge in carotid artery segmentation while achieving superior performance compared to existing approaches. The source code is available at this https URL.
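
The first stage described above builds a continuous centerline by interpolating between the centroids of annotated slices. A minimal sketch of that idea follows, assuming plain linear interpolation between consecutive labeled slices; the abstract does not state which interpolation scheme the authors use, and the function name `interpolate_centroids` and the dictionary layout are hypothetical.

```python
def interpolate_centroids(annotated):
    """Fill in centroids for every slice between labeled slices.

    annotated: dict mapping slice index -> (x, y) centroid of an
    expert-labeled slice (hypothetical layout, for illustration only).
    Returns a dict with one centroid per slice from the first to the
    last labeled slice, using linear interpolation (an assumption).
    """
    keys = sorted(annotated)
    centerline = {}
    for z0, z1 in zip(keys, keys[1:]):
        (x0, y0), (x1, y1) = annotated[z0], annotated[z1]
        for z in range(z0, z1):
            t = (z - z0) / (z1 - z0)  # fraction of the way from z0 to z1
            centerline[z] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    last = keys[-1]
    centerline[last] = annotated[last]
    return centerline


line = interpolate_centroids({0: (0.0, 0.0), 4: (4.0, 8.0)})
print(line[2])  # centroid estimated for the unlabeled middle slice
```

In a pipeline like the one described, each interpolated centroid could then serve as a point prompt for the unlabeled slice; how labels are propagated beyond the centroid is not specified in the abstract.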

22. 【2504.00906】Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Link: https://arxiv.org/abs/2504.00906

Authors: Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: graphical user interfaces, enhance human productivity, offering significant potential, automate digital tasks, agents automate digital

Comments: 18 pages, 13 figures, 8 tables

Abstract: Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluations. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by a relative 52.8% on WindowsAgentArena and a relative 16.52% on AndroidWorld. Code available at this https URL.

23. 【2504.00901】A Decade of Deep Learning for Remote Sensing Spatiotemporal Fusion: Advances, Challenges, and Opportunities

Link: https://arxiv.org/abs/2504.00901

Authors: Enzhe Sun, Yongchuan Cui, Peng Liu, Jining Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: launch costs make, costs make direct, make direct acquisition, Hardware limitations, satellite launch costs

Abstract:Hardware limitations and satellite launch costs make direct acquisition of high temporal-spatial resolution remote sensing imagery challenging. Remote sensing spatiotemporal fusion (STF) technology addresses this problem by merging high temporal but low spatial resolution imagery with high spatial but low temporal resolution imagery to efficiently generate high spatiotemporal resolution satellite images. STF provides unprecedented observational capabilities for land surface change monitoring, agricultural management, and environmental research. Deep learning (DL) methods have revolutionized the remote sensing spatiotemporal fusion field over the past decade through powerful automatic feature extraction and nonlinear modeling capabilities, significantly outperforming traditional methods in handling complex spatiotemporal data. Despite the rapid development of DL-based remote sensing STF, the community lacks a systematic review of this quickly evolving field. This paper comprehensively reviews DL developments in remote sensing STF over the last decade, analyzing key research trends, method classifications, commonly used datasets, and evaluation metrics. It discusses major challenges in existing research and identifies promising future research directions as references for researchers in this field to inspire new ideas. The specific models, datasets, and other information mentioned in this article have been collected in: this https URL.

24. 【2504.00883】Improved Visual-Spatial Reasoning via R1-Zero-Like Training

Link: https://arxiv.org/abs/2504.00883

Authors: Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, Zhijie Deng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: multi-modal large language, Increasing attention, large language models, visual-spatial reasoning, multi-modal large

Abstract:Increasing attention has been placed on improving the reasoning capacities of multi-modal large language models (MLLMs). As the cornerstone for AI agents that function in the physical realm, video-based visual-spatial intelligence (VSI) emerges as one of the most pivotal reasoning capabilities of MLLMs. This work conducts a first, in-depth study on improving the visual-spatial reasoning of MLLMs via R1-Zero-like training. Technically, we first identify that the visual-spatial reasoning capacities of small- to medium-sized Qwen2-VL models cannot be activated via Chain of Thought (CoT) prompts. We then incorporate GRPO training for improved visual-spatial reasoning, using the carefully curated VSI-100k dataset, following DeepSeek-R1-Zero. During the investigation, we identify the necessity to keep the KL penalty (even with a small value) in GRPO. With just 120 GPU hours, our vsGRPO-2B model, fine-tuned from Qwen2-VL-2B, can outperform the base model by 12.1% and surpass GPT-4o. Moreover, our vsGRPO-7B model, fine-tuned from Qwen2-VL-7B, achieves performance comparable to that of the best open-source model LLaVA-NeXT-Video-72B. Additionally, we compare vsGRPO to supervised fine-tuning and direct preference optimization baselines and observe strong performance superiority. The code and dataset will be available soon.
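
The abstract stresses keeping a KL penalty in GRPO. As a rough illustration of the two ingredients involved, the sketch below computes group-relative advantages (each sampled answer's reward normalized against its own group) and a per-token KL penalty term. It is an assumption-laden simplification: the clipped importance ratio of the full GRPO objective is omitted, the population standard deviation is an arbitrary choice, the KL term uses a common non-negative estimator that the abstract does not confirm, and all function names and the `beta` value are hypothetical.

```python
import math
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards of one sampled group to zero mean, unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


def kl_penalty(logp_policy, logp_ref):
    """Non-negative per-token KL estimate: r - log(r) - 1, r = pi_ref / pi."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


def penalized_token_objective(advantage, logp_policy, logp_ref, beta=0.01):
    """Advantage minus the weighted KL penalty (clipping omitted)."""
    return advantage - beta * kl_penalty(logp_policy, logp_ref)


advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
print(penalized_token_objective(advantages[0], logp_policy=-1.0, logp_ref=-2.0))
```

Setting `beta` to zero removes the pull toward the reference model entirely, which is the configuration the abstract reports as problematic.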

25. 【2504.00879】WISE-TTT: Worldwide Information Segmentation Enhancement

Link: https://arxiv.org/abs/2504.00879

Authors: Fenglei Hao, Yuliang Yang, Ruiyuan Su, Zhengran Zhao, Yukun Qiao, Mengyu Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multi-target segmentation remains, global temporal dependencies, remains a major, major challenge, inherent limitations

Abstract: Video multi-target segmentation remains a major challenge in long sequences, mainly due to the inherent limitations of existing architectures in capturing global temporal dependencies. We introduce WISE-TTT, a synergistic architecture integrating Test-Time Training (TTT) mechanisms with the Transformer architecture through co-design. The TTT layer systematically compresses historical temporal data to generate hidden states containing worldwide information (lossless memory to maintain long contextual integrity), while achieving multi-stage contextual aggregation through splicing. Crucially, our framework provides the first empirical validation that implementing worldwide information across multiple network layers is essential for optimal dependency this http URL studies show TTT modules at high-level features boost global modeling. This translates to a 3.1% accuracy improvement (JF metric) on Davis2017 long-term benchmarks -- the first proof of hierarchical context superiority in video segmentation. We provide the first systematic evidence that worldwide information critically impacts segmentation performance.

26. 【2504.00870】Data-free Knowledge Distillation with Diffusion Models

Link: https://arxiv.org/abs/2504.00870

Authors: Xiaohua Qi, Renda Li, Long Peng, Qiang Ling, Jun Yu, Ziyi Chen, Peng Chang, Mei Han, Jing Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: student neural network, Recently Data-Free Knowledge, Data-Free Knowledge Distillation, neural network, teacher neural network

Comments: Accepted by ICME2025

Abstract: Recently, Data-Free Knowledge Distillation (DFKD) has garnered attention because it can transfer knowledge from a teacher neural network to a student neural network without requiring any access to training data. Although diffusion models are adept at synthesizing high-fidelity photorealistic images across various domains, existing methods cannot be easily applied to DFKD. To bridge that gap, this paper proposes a novel approach based on diffusion models, DiffDFKD. Specifically, DiffDFKD involves targeted optimizations in two key areas. Firstly, DiffDFKD utilizes valuable information from teacher models to guide the pre-trained diffusion models' data synthesis, generating datasets that mirror the training data distribution and effectively bridge domain gaps. Secondly, to reduce computational burdens, DiffDFKD introduces Latent CutMix Augmentation, an efficient technique, to enhance the diversity of diffusion model-generated images for DFKD while preserving key attributes for effective knowledge transfer. Extensive experiments validate the efficacy of DiffDFKD, yielding state-of-the-art results exceeding existing DFKD approaches. We release our code at this https URL.
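
For readers unfamiliar with CutMix, the sketch below shows the generic operation applied to latent tensors instead of pixels: a rectangular patch of one latent is overwritten with the same region of another. This is only the standard CutMix recipe under the assumption that latents are `[channels][height][width]` nested lists; the abstract does not describe how DiffDFKD's Latent CutMix Augmentation chooses boxes or uses the mixing ratio, and `latent_cutmix` is a hypothetical name.

```python
import copy


def latent_cutmix(latent_a, latent_b, top, left, box_h, box_w):
    """Paste a box from latent_b into a copy of latent_a.

    Both latents are [channels][height][width] nested lists of equal
    shape. Returns the mixed latent and lam, the fraction of the result
    that still comes from latent_a.
    """
    mixed = copy.deepcopy(latent_a)
    height, width = len(latent_a[0]), len(latent_a[0][0])
    bottom = min(top + box_h, height)  # clip the box to the latent
    right = min(left + box_w, width)
    for c in range(len(mixed)):
        for y in range(top, bottom):
            for x in range(left, right):
                mixed[c][y][x] = latent_b[c][y][x]
    lam = 1.0 - ((bottom - top) * (right - left)) / (height * width)
    return mixed, lam


a = [[[0.0] * 4 for _ in range(4)]]
b = [[[1.0] * 4 for _ in range(4)]]
mixed, lam = latent_cutmix(a, b, top=1, left=1, box_h=2, box_w=2)
print(mixed[0], lam)
```

Mixing in latent space avoids running the decoder twice per augmented sample, which is consistent with the abstract's stated goal of reducing computational burden.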

27. 【2504.00867】Feature-Preserving Mesh Decimation for Normal Integration

Link: https://arxiv.org/abs/2504.00867

Authors: Moritz Heep, Sven Behnke, Eduard Zell

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: normal maps obtained, Normal integration reconstructs, photometric stereo, Normal integration, normal maps

Abstract:Normal integration reconstructs 3D surfaces from normal maps obtained e.g. by photometric stereo. These normal maps capture surface details down to the pixel level but require large computational resources for integration at high resolutions. In this work, we replace the dense pixel grid with a sparse anisotropic triangle mesh prior to normal integration. We adapt the triangle mesh to the local geometry in the case of complex surface structures and remove oversampling from flat featureless regions. For high-resolution images, the resulting compression reduces normal integration runtimes from hours to minutes while maintaining high surface accuracy. Our main contribution is the derivation of the well-known quadric error measure from mesh decimation for screen space applications and its combination with optimal Delaunay triangulation.
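
The quadric error measure mentioned above comes from classic mesh decimation: each plane with unit normal (a, b, c) and offset d contributes a 4x4 matrix, and the error of a point is the sum of its squared distances to those planes. The sketch below shows that classic 3D form only, as background; the screen-space derivation that the paper contributes is not reproduced here, and the function names are illustrative.

```python
def plane_quadric(a, b, c, d):
    """4x4 quadric p p^T for the plane ax + by + cz + d = 0 (unit normal)."""
    p = (a, b, c, d)
    return [[pi * pj for pj in p] for pi in p]


def add_quadrics(q1, q2):
    """Quadrics of several planes are accumulated by summing entrywise."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(q1, q2)]


def quadric_error(q, point):
    """Evaluate v^T Q v with v = (x, y, z, 1)."""
    v = (*point, 1.0)
    return sum(v[i] * q[i][j] * v[j] for i in range(4) for j in range(4))


# Planes z = 0 and x = 0: the error is z^2 + x^2.
q = add_quadrics(plane_quadric(0.0, 0.0, 1.0, 0.0),
                 plane_quadric(1.0, 0.0, 0.0, 0.0))
print(quadric_error(q, (3.0, 5.0, 2.0)))
```

In decimation, a vertex merge is typically placed where this error is smallest, so flat regions (where all planes nearly coincide) collapse cheaply while sharp features are preserved.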

28. 【2504.00862】Balancing Multi-Target Semi-Supervised Medical Image Segmentation with Collaborative Generalist and Specialists

Link: https://arxiv.org/abs/2504.00862

Authors: You Wang, Zekun Li, Lei Qi, Qian Yu, Yinghuan Shi, Yang Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: current semi-supervised models, segmenting individual medical, individual medical targets, promising performance achieved, segmenting multiple targets

Abstract: Despite the promising performance achieved by current semi-supervised models in segmenting individual medical targets, many of these models suffer a notable decrease in performance when tasked with the simultaneous segmentation of multiple targets. A key factor is the imbalanced scales among different targets: when multiple targets are segmented simultaneously, large targets dominate the loss, leading to small targets being misclassified as larger ones. To this end, we propose a novel method, which consists of a Collaborative Generalist and several Specialists, termed CGS. It is centered around the idea of employing a specialist for each target class, thus avoiding the dominance of larger targets. The generalist performs conventional multi-target segmentation, while each specialist is dedicated to distinguishing a specific target class from the remaining target classes and the background. Based on a theoretical insight, we demonstrate that CGS can achieve a more balanced training. Moreover, we develop cross-consistency losses to foster collaborative learning between the generalist and the specialists. Lastly, exploiting the intrinsic relation that the target class of any specialized head should belong to the remaining classes of the other heads, we introduce an inter-head error detection module to further enhance the quality of pseudo-labels. Experimental results on three popular benchmarks showcase its superior performance compared to state-of-the-art methods.

29. 【2504.00859】NeuRadar: Neural Radiance Fields for Automotive Radar Point Clouds

Link: https://arxiv.org/abs/2504.00859

Authors: Mahan Rafidashti, Ji Lan, Maryam Fatemi, Junsheng Fu, Lars Hammarstrand, Lennart Svensson

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving, lighting conditions, important sensor, sensor for autonomous, robustness to adverse

Abstract:Radar is an important sensor for autonomous driving (AD) systems due to its robustness to adverse weather and different lighting conditions. Novel view synthesis using neural radiance fields (NeRFs) has recently received considerable attention in AD due to its potential to enable efficient testing and validation but remains unexplored for radar point clouds. In this paper, we present NeuRadar, a NeRF-based model that jointly generates radar point clouds, camera images, and lidar point clouds. We explore set-based object detection methods such as DETR, and propose an encoder-based solution grounded in the NeRF geometry for improved generalizability. We propose both a deterministic and a probabilistic point cloud representation to accurately model the radar behavior, with the latter being able to capture radar's stochastic behavior. We achieve realistic reconstruction results for two automotive datasets, establishing a baseline for NeRF-based radar point cloud simulation models. In addition, we release radar data for ZOD's Sequences and Drives to enable further research in this field. To encourage further development of radar NeRFs, we release the source code for NeuRadar.

30. 【2504.00857】Exploring Personalized Federated Learning Architectures for Violence Detection in Surveillance Videos

Link: https://arxiv.org/abs/2504.00857

Authors: Mohammad Kassir, Siba Haidar, Antoun Yaacoub

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: detecting violent incidents, Personalized Federated Learning, Personalization Layers method, Federated Learning, challenge of detecting

Comments: 7 pages, 5 figures, 4 tables

Abstract:The challenge of detecting violent incidents in urban surveillance systems is compounded by the voluminous and diverse nature of video data. This paper presents a targeted approach using Personalized Federated Learning (PFL) to address these issues, specifically employing the Federated Learning with Personalization Layers method within the Flower framework. Our methodology adapts learning models to the unique data characteristics of each surveillance node, effectively managing the heterogeneous and non-IID nature of surveillance video data. Through rigorous experiments conducted on balanced and imbalanced datasets, our PFL models demonstrated enhanced accuracy and efficiency, achieving up to 99.3% accuracy. This study underscores the potential of PFL to significantly improve the scalability and effectiveness of surveillance systems, offering a robust, privacy-preserving solution for violence detection in complex urban environments.
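
Federated Learning with Personalization Layers is usually described as averaging only the shared base layers across clients while each client keeps its own personalization layers local. The sketch below illustrates that split in plain Python, with models represented as dictionaries of flat weight lists. It does not use the Flower API, and the layer names, the size-weighted averaging, and both function names are assumptions made for illustration.

```python
def aggregate_base_layers(client_models, base_keys, client_sizes):
    """Size-weighted average of the shared base layers only."""
    total = sum(client_sizes)
    return {
        key: [
            sum(m[key][i] * n for m, n in zip(client_models, client_sizes)) / total
            for i in range(len(client_models[0][key]))
        ]
        for key in base_keys
    }


def apply_global_base(client_model, global_base):
    """Overwrite base layers; personalization layers stay untouched."""
    updated = dict(client_model)
    updated.update(global_base)
    return updated


clients = [
    {"base": [1.0, 2.0], "head": [10.0]},  # "head" = personalization layer
    {"base": [3.0, 4.0], "head": [20.0]},
]
global_base = aggregate_base_layers(clients, ["base"], client_sizes=[1, 3])
print(apply_global_base(clients[0], global_base))
```

Because the personalization layers never leave the client, each surveillance node can adapt to its own non-IID video distribution while still benefiting from the shared feature extractor.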

31. 【2504.00850】Global Intervention and Distillation for Federated Out-of-Distribution Generalization

Link: https://arxiv.org/abs/2504.00850

Authors: Zhuang Qi, Runhui Zhang, Lei Meng, Wei Wu, Yachong Zhang, Xiangxu Meng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: inconsistent optimization directions, federated learning leads, optimization directions, unstable convergence, skew in federated

Abstract:Attribute skew in federated learning leads local models to focus on learning non-causal associations, guiding them towards inconsistent optimization directions, which inevitably results in performance degradation and unstable convergence. Existing methods typically leverage data augmentation to enhance sample diversity or employ knowledge distillation to learn invariant representations. However, the instability in the quality of generated data and the lack of domain information limit their performance on unseen samples. To address these issues, this paper presents a global intervention and distillation method, termed FedGID, which utilizes diverse attribute features for backdoor adjustment to break the spurious association between background and label. It includes two main modules, where the global intervention module adaptively decouples objects and backgrounds in images, injects background information into random samples to intervene in the sample distribution, which links backgrounds to all categories to prevent the model from treating background-label associations as causal. The global distillation module leverages a unified knowledge base to guide the representation learning of client models, preventing local models from overfitting to client-specific attributes. Experimental results on three datasets demonstrate that FedGID enhances the model's ability to focus on the main subjects in unseen data and outperforms existing methods in collaborative modeling.
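
For context, the backdoor adjustment that the abstract refers to is the standard causal-inference identity below. Reading $X$ as the object content, $Y$ as the label, and $Z$ as the background attribute is an interpretation of the abstract, not notation taken from the paper.

$$
P\big(Y \mid \mathrm{do}(X)\big) \;=\; \sum_{z} P\big(Y \mid X,\, Z = z\big)\, P(Z = z)
$$

Pairing each object with many backgrounds approximates the sum over $z$, so that no single background can stand in for the label.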

32. 【2504.00848】Zero-Shot 4D Lidar Panoptic Segmentation

Link: https://arxiv.org/abs/2504.00848

Authors: Yushan Zhang, Aljoša Ošep, Laura Leal-Taixé, Tim Meinhardt

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Object Segmentation, Lidar Panoptic Segmentation, Zero-Shot Lidar Panoptic, embodied navigation, mapping and localization

Abstract: Zero-shot 4D segmentation and recognition of arbitrary objects in Lidar is crucial for embodied navigation, with applications ranging from streaming perception to semantic mapping and localization. However, the primary challenge in advancing research and developing generalized, versatile methods for spatio-temporal scene understanding in Lidar lies in the scarcity of datasets that provide the necessary diversity and scale of this http URL overcome these challenges, we propose SAL-4D (Segment Anything in Lidar--4D), a method that utilizes multi-modal robotic sensor setups as a bridge to distill recent developments in Video Object Segmentation (VOS) in conjunction with off-the-shelf Vision-Language foundation models to Lidar. We utilize VOS models to pseudo-label tracklets in short video sequences, annotate these tracklets with sequence-level CLIP tokens, and lift them to the 4D Lidar space using calibrated multi-modal sensory setups to distill them to our SAL-4D model. Due to temporally consistent predictions, we outperform prior art in 3D Zero-Shot Lidar Panoptic Segmentation (LPS) by over 5 PQ, and unlock Zero-Shot 4D-LPS.

33. 【2504.00844】PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks

Link: https://arxiv.org/abs/2504.00844

Authors: Abdelrahman Elskhawy, Mengze Li, Nassir Navab, Benjamin Busam

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Scene Graphs Generation, extracts structured representation, Graphs Generation, Scene Graphs, extracts structured

Abstract: In Scene Graphs Generation (SGG) one extracts a structured representation from visual inputs in the form of object nodes and predicates connecting them. This facilitates image-based understanding and reasoning for various downstream tasks. Although fully supervised SGG approaches showed steady performance improvements, they suffer from a severe training bias. This is caused by the availability of only small subsets of curated data and exhibits long-tail predicate distribution issues with a lack of predicate diversity adversely affecting downstream tasks. To overcome this, we introduce PRISM-0, a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach to capture the whole spectrum of diverse, open-vocabulary predicate prediction. Detected object pairs are filtered and passed to a Vision Language Model (VLM) that generates descriptive captions. These are used to prompt an LLM to generate fine- and coarse-grained predicates for the pair. The predicates are then validated using a VQA model to provide a final SGG. With the modular and dataset-independent PRISM-0, we can enrich existing SG datasets such as Visual Genome (VG). Experiments illustrate that PRISM-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval with a performance on par with the best fully supervised methods.

34. 【2504.00816】The study of non-complete-ring positron emission tomography (PET) detection method

Link: https://arxiv.org/abs/2504.00816

Authors: Yeqi Fang, Rong Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Keywords: Positron Emission Tomography, Positron Emission, Emission Tomography, vital molecular imaging, molecular imaging tool

Comments: 18 pages, 14 pages

Abstract:Positron Emission Tomography (PET) is a vital molecular imaging tool widely used in medical diagnosis and treatment evaluation. Traditional PET systems typically rely on complete detector rings to achieve full angular coverage for uniform and statistically robust sampling of coincidence events. However, incomplete-ring PET scanners have emerged in various scenarios due to hardware failures, cost constraints, or specific clinical needs. In such cases, conventional reconstruction algorithms often suffer from performance degradation due to reduced data completeness and geometric inconsistencies. This thesis proposes a coarse-to-fine reconstruction framework for incomplete-ring PET scanners. The framework first employs an Attention U-Net model to recover complete sinograms from incomplete ones, then uses the OSEM algorithm for preliminary reconstruction, and finally applies a two-stage architecture comprising a Coarse Prediction Module (CPM) and an Iterative Refinement Module (IRM) for fine reconstruction. Our approach utilizes neighboring axial slices and spectral transform features as auxiliary guidance at the input level to ensure spatial and frequency domain consistency, and integrates a contrastive diffusion strategy at the output level to improve correspondence between low-quality PET inputs and refined PET outputs. Experimental results on public and in-house brain PET datasets demonstrate that the proposed method significantly outperforms existing approaches in metrics such as PSNR (35.6421 dB) and SSIM (0.9588), successfully preserving key anatomical structures and tracer distribution features, thus providing an effective solution for incomplete-ring PET imaging.

35. 【2504.00812】Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data

Link: https://arxiv.org/abs/2504.00812

Authors: Yiqun Duan, Sameera Ramasinghe, Stephen Gould, Ajanthan Thalaiyasingam

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: retrieving images matching, reference image augmented, reference image, CIR, task of retrieving

Abstract:Composed Image Retrieval (CIR) is the task of retrieving images matching a reference image augmented with a text, where the text describes changes to the reference image in natural language. Traditionally, models designed for CIR have relied on triplet data containing a reference image, reformulation text, and a target image. However, curating such triplet data often necessitates human intervention, leading to prohibitive costs. This challenge has hindered the scalability of CIR model training even with the availability of abundant unlabeled data. With the recent advances in foundational models, we advocate a shift in the CIR training paradigm where human annotations can be efficiently replaced by large language models (LLMs). Specifically, we demonstrate the capability of large captioning and language models in efficiently generating data for CIR only relying on unannotated image collections. Additionally, we introduce an embedding reformulation architecture that effectively combines image and text modalities. Our model, named InstructCIR, outperforms state-of-the-art methods in zero-shot composed image retrieval on CIRR and FashionIQ datasets. Furthermore, we demonstrate that by increasing the amount of generated data, our zero-shot model gets closer to the performance of supervised baselines.

36. 【2504.00784】CellVTA: Enhancing Vision Foundation Models for Accurate Cell Segmentation and Classification

Link: https://arxiv.org/abs/2504.00784

Authors: Yang Yang, Xijie Xu, Yixun Zhou, Jie Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: broad clinical applications, Cell instance segmentation, Cell Vision Transformer, clinical applications, Cell instance

Abstract:Cell instance segmentation is a fundamental task in digital pathology with broad clinical applications. Recently, vision foundation models, which are predominantly based on Vision Transformers (ViTs), have achieved remarkable success in pathology image analysis. However, their improvements in cell instance segmentation remain limited. A key challenge arises from the tokenization process in ViTs, which substantially reduces the spatial resolution of input images, leading to suboptimal segmentation quality, especially for small and densely packed cells. To address this problem, we propose CellVTA (Cell Vision Transformer with Adapter), a novel method that improves the performance of vision foundation models for cell instance segmentation by incorporating a CNN-based adapter module. This adapter extracts high-resolution spatial information from input images and injects it into the ViT through a cross-attention mechanism. Our method preserves the core architecture of ViT, ensuring seamless integration with pretrained foundation models. Extensive experiments show that CellVTA achieves 0.538 mPQ on the CoNIC dataset and 0.506 mPQ on the PanNuke dataset, which significantly outperforms the state-of-the-art cell segmentation methods. Ablation studies confirm the superiority of our approach over other fine-tuning strategies, including decoder-only fine-tuning and full fine-tuning. Our code and models are publicly available at this https URL.

37. 【2504.00775】Visual Environment-Interactive Planning for Embodied Complex-Question Answering

Link: https://arxiv.org/abs/2504.00775

Authors: Ning Lan, Baoshan Ou, Xuemei Xie, Guangming Shi

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Embodied Complex-Question Answering, Complex-Question Answering task, Complex-Question Answering, Embodied Complex-Question, Answering task

Abstract:This study focuses on the Embodied Complex-Question Answering task, in which an embodied robot needs to understand human questions with intricate structures and abstract semantics. The core of this task lies in making appropriate plans based on the perception of the visual environment. Existing methods often generate plans in a once-for-all manner, i.e., one-step planning. Such approaches rely on large models without sufficient understanding of the environment. Considering multi-step planning, this paper proposes a framework for formulating plans in a sequential manner. To ensure that our framework can tackle complex questions, we create a structured semantic space, where hierarchical visual perception and a chain expression of the question essence can interact iteratively. This space makes sequential task planning possible. Within the framework, we first parse human natural language based on a visual hierarchical scene graph, which clarifies the intention of the question. Then, we incorporate external rules to make a plan for the current step, weakening the reliance on large models. Every plan is generated based on feedback from visual perception, with multiple rounds of interaction until an answer is obtained. This approach enables continuous feedback and adjustment, allowing the robot to optimize its action strategy. To test our framework, we contribute a new dataset with more complex questions. Experimental results demonstrate that our approach performs excellently and stably on complex tasks. In addition, the feasibility of our approach in real-world scenarios has been established, indicating its practical applicability.

38. 【2504.00773】DropGaussian: Structural Regularization for Sparse-view Gaussian Splatting

Link: https://arxiv.org/abs/2504.00773

Authors: Hyunwoo Park, Gun Ryu, Wonjun Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained considerable attentions, excellent image quality, view synthesis due, gained considerable, considerable attentions

Comments: Accepted by CVPR 2025

Abstract:Recently, 3D Gaussian splatting (3DGS) has gained considerable attention in the field of novel view synthesis due to its fast performance and excellent image quality. However, 3DGS in sparse-view settings (e.g., three-view inputs) often faces the problem of overfitting to training views, which significantly degrades the visual quality of novel view images. Many existing approaches have tackled this issue by using strong priors, such as 2D generative contextual information and external depth signals. In contrast, this paper introduces a prior-free method, called DropGaussian, that makes simple changes to 3D Gaussian splatting. Specifically, we randomly remove Gaussians during the training process in a manner similar to dropout, which allows non-excluded Gaussians to have larger gradients while improving their visibility. This makes the remaining Gaussians contribute more to the optimization process for rendering with sparse input views. Such a simple operation effectively alleviates the overfitting problem and enhances the quality of novel view synthesis. By simply applying DropGaussian to the original 3DGS framework, we can achieve competitive performance with existing prior-based 3DGS methods in sparse-view settings of benchmark datasets without any additional complexity. The code and model are publicly available at: this https URL release.
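The dropout-style removal described in this abstract can be sketched roughly as below. This is only an illustration under stated assumptions: the function name `drop_gaussians`, the per-Gaussian opacity list, and the `1 / (1 - drop_rate)` compensation with clamping are all hypothetical choices borrowed from ordinary dropout, not details confirmed by the abstract.

```python
import random


def drop_gaussians(opacities, drop_rate, rng):
    """Return per-Gaussian opacities for one training step.

    A random subset is zeroed out (dropped). Survivors are rescaled by
    1 / (1 - drop_rate), an assumed dropout-style compensation, and
    clamped to 1.0 so they remain valid opacities.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)
    result = []
    for opacity in opacities:
        if rng.random() < drop_rate:
            result.append(0.0)  # this Gaussian is excluded from rendering
        else:
            result.append(min(1.0, opacity * scale))
    return result


# Usage: a fresh mask is drawn at every training iteration.
step_opacities = drop_gaussians([0.2, 0.5, 0.9], 0.3, random.Random(0))
```

At test time no Gaussians would be dropped, which corresponds to calling the function with `drop_rate=0.0`.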

39. 【2504.00772】Multi-Task Neural Architecture Search Using Architecture Embedding and Transfer Rank

Link: https://arxiv.org/abs/2504.00772

Authors: TingJie Zhang, HaiLin Liu

Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables transferring architectural, transferring architectural knowledge, enables transferring, transferring architectural, architectural knowledge

Abstract:Multi-task neural architecture search (NAS) enables transferring architectural knowledge among different tasks. However, ranking disorder between the source task and the target task degrades the architecture performance on the downstream task. We propose KTNAS, an evolutionary cross-task NAS algorithm, to enhance transfer efficiency. Our data-agnostic method converts neural architectures into graphs and uses architecture embedding vectors for the subsequent architecture performance prediction. The concept of transfer rank, an instance-based classifier, is introduced into KTNAS to address the performance degradation issue. We verify the search efficiency on NASBench-201 and transferability to various vision tasks on Micro TransNAS-Bench-101. The scalability of our method is demonstrated on the DARTS search space with CIFAR-10/100, MNIST/Fashion-MNIST, and MedMNIST. Experimental results show that KTNAS outperforms peer multi-task NAS algorithms in search efficiency and downstream task performance. Ablation studies demonstrate the vital importance of transfer rank for transfer performance.

40. 【2504.00763】UnIRe: Unsupervised Instance Decomposition for Dynamic Urban Scene Reconstruction

Link: https://arxiv.org/abs/2504.00763

Authors: Yunxuan Mao, Rong Xiong, Yue Wang, Yiyi Liao

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Reconstructing and decomposing, decomposing dynamic urban, urban planning, autonomous driving

Abstract:Reconstructing and decomposing dynamic urban scenes is crucial for autonomous driving, urban planning, and scene editing. However, existing methods fail to perform instance-aware decomposition without manual annotations, which is crucial for instance-level scene [...]. We propose UnIRe, a 3D Gaussian Splatting (3DGS) based approach that decomposes a scene into a static background and individual dynamic instances using only RGB images and LiDAR point clouds. At its core, we introduce 4D superpoints, a novel representation that clusters multi-frame LiDAR points in 4D space, enabling unsupervised instance separation based on spatiotemporal correlations. These 4D superpoints serve as the foundation for our decomposed 4D initialization, i.e., providing spatial and temporal initialization to train a dynamic 3DGS for arbitrary dynamic classes without requiring bounding boxes or object [...], we introduce a smoothness regularization strategy in both 2D and 3D space, further improving the temporal [...] on benchmark datasets show that our method outperforms existing methods in decomposed dynamic scene reconstruction while enabling accurate and flexible instance-level editing, making it a practical solution for real-world applications.

41. 【2504.00759】MSSFC-Net: Enhancing Building Interpretation with Multi-Scale Spatial-Spectral Feature Collaboration

Link: https://arxiv.org/abs/2504.00759

Authors: Dehua Huo, Weida Zhan, Jinxin Guo, Depeng Zhu, Yu Chen, YiChun Jiang, Yueyi Han, Deng Han, Jin Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: imagery primarily involves, sensing imagery primarily, building extraction, remote sensing imagery, Multi-scale Feature Extraction

Abstract:Building interpretation from remote sensing imagery primarily involves two fundamental tasks: building extraction and change detection. However, most existing methods address these tasks independently, overlooking their inherent correlation and failing to exploit shared feature representations for mutual enhancement. Furthermore, the diverse spectral, spatial, and scale characteristics of buildings pose additional challenges in jointly modeling spatial-spectral multi-scale features and effectively balancing precision and recall. The limited synergy between spatial and spectral representations often results in reduced detection accuracy and incomplete change [...] address these challenges, we propose a Multi-Scale Spatial-Spectral Feature Cooperative Dual-Task Network (MSSFC-Net) for joint building extraction and change detection in remote sensing images. The framework integrates both tasks within a unified architecture, leveraging their complementary nature to simultaneously extract building and change features. Specifically, a Dual-branch Multi-scale Feature Extraction module (DMFE) with Spatial-Spectral Feature Collaboration (SSFC) is designed to enhance multi-scale representation learning, effectively capturing shallow texture details and deep semantic information, thus improving building extraction performance. For temporal feature aggregation, we introduce a Multi-scale Differential Fusion Module (MDFM) that explicitly models the interaction between differential and dual-temporal features. This module refines the network's capability to detect large-area changes and subtle structural variations in buildings. Extensive experiments conducted on three benchmark datasets demonstrate that MSSFC-Net achieves superior performance in both building extraction and change detection tasks, effectively improving detection accuracy while maintaining completeness.

42. 【2504.00753】CAPE: Connectivity-Aware Path Enforcement Loss for Curvilinear Structure Delineation

Link: https://arxiv.org/abs/2504.00753

Authors: Elyar Esmaeilzadeh, Ehsan Garaaghaji, Farzad Hallaji Azad, Doruk Oner

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: curvilinear structures, remains a key, neuronal processes, processes in biomedical, biomedical scans

Abstract:Promoting the connectivity of curvilinear structures, such as neuronal processes in biomedical scans and blood vessels in CT images, remains a key challenge in semantic segmentation. Traditional pixel-wise loss functions, including cross-entropy and Dice losses, often fail to capture high-level topological connectivity, resulting in topological mistakes in graphs obtained from prediction maps. In this paper, we propose CAPE (Connectivity-Aware Path Enforcement), a novel loss function designed to enforce connectivity in graphs obtained from segmentation maps by optimizing a graph connectivity metric. CAPE uses the graph representation of the ground truth to select node pairs and determine their corresponding paths within the predicted segmentation through a shortest-path algorithm. Using this, we penalize both disconnections and false positive connections, effectively encouraging the model to preserve topological correctness. Experiments on 2D and 3D datasets, including neuron and blood vessel tracing, demonstrate that CAPE significantly improves topology-aware metrics and outperforms state-of-the-art methods.
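To make the shortest-path idea concrete, here is a minimal sketch, not the CAPE loss itself. It assumes a 2D probability map stored as nested lists, 4-connectivity, and a per-pixel cost of `1 - probability`; the function name `path_cost` is hypothetical. A pair of nodes that is connected in the ground truth but broken in the prediction yields a high path cost, which is the kind of signal a connectivity-aware loss could penalize.

```python
import heapq


def path_cost(prob, start, goal):
    """Cheapest 4-connected path cost between two pixels (Dijkstra).

    Every pixel on the path, including the start, costs (1 - prob),
    so a path through confidently predicted foreground is nearly free.
    """
    rows, cols = len(prob), len(prob[0])
    dist = {start: 1.0 - prob[start[0]][start[1]]}
    heap = [(dist[start], start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return d
        if d > dist[(r, c)]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + (1.0 - prob[nr][nc])
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return float("inf")


# Usage: a gap in the predicted structure raises the cost between its endpoints.
connected = [[1.0, 1.0, 1.0]]
broken = [[1.0, 0.0, 1.0]]
gap_penalty = path_cost(broken, (0, 0), (0, 2)) - path_cost(connected, (0, 0), (0, 2))
```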

43. 【2504.00719】Scaling Up Resonate-and-Fire Networks for Fast Deep Learning

Link: https://arxiv.org/abs/2504.00719

Authors: Thomas E. Huber, Jules Lecomte, Borislav Polovnikov, Axel von Arnim

Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)

Keywords: event-based sensor data, promising computing paradigm, present a promising, sensor data, Spiking neural networks

Comments: 19 pages, 3 figures

Abstract:Spiking neural networks (SNNs) present a promising computing paradigm for neuromorphic processing of event-based sensor data. The resonate-and-fire (RF) neuron, in particular, is appealing for its biological plausibility and complex dynamics combined with computational simplicity. Despite theoretically predicted benefits, challenges in parameter initialization and efficient learning have inhibited the implementation of RF networks, constraining their use to a single layer. In this paper, we address these shortcomings by deriving the RF neuron as a structured state space model (SSM) from the HiPPO framework. We introduce S5-RF, a new SSM layer composed of RF neurons based on the S5 model, that features a generic initialization scheme and fast training within a deep architecture. S5-RF scales an RF network, for the first time, to a deep SNN with up to four layers and achieves a new state-of-the-art result of 78.8% for recurrent SNNs on the Spiking Speech Commands dataset in under three hours of training time. Moreover, compared to the reference SNNs that solve our benchmarking tasks, it achieves similar performance with far fewer spiking operations. Our code is publicly available at this https URL.
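For readers unfamiliar with the resonate-and-fire neuron, the sketch below simulates the classic formulation, a damped complex oscillator dz/dt = (b + iω)z + I that spikes when its imaginary part crosses a threshold. It uses a plain forward-Euler step and omits any reset; the parameter values and the function name `simulate_rf` are illustrative assumptions, and this is not the S5-RF layer from the paper.

```python
def simulate_rf(inputs, b=-0.1, omega=2.0, dt=0.01, threshold=1.0, z0=0j):
    """Forward-Euler simulation of a single resonate-and-fire neuron.

    b < 0 is the damping rate, omega the resonant angular frequency.
    Returns the final complex state and one spike flag per time step.
    """
    z = z0
    spikes = []
    for current in inputs:
        z = z + dt * ((b + 1j * omega) * z + current)
        spikes.append(z.imag > threshold)  # spike when Im(z) exceeds the threshold
    return z, spikes


# Usage: without input, an excited neuron simply rings down.
final_state, spike_train = simulate_rf([0.0] * 100, b=-1.0, omega=0.0, z0=1 + 0j)
```

Because the state update is linear in `z`, it has the same shape as a diagonal state space model, which is the connection the abstract exploits.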

44. 【2504.00691】ToVE: Efficient Vision-Language Learning via Knowledge Transfer from Vision Experts

Link: https://arxiv.org/abs/2504.00691

Authors: Yuanchen Wu, Junlong Du, Ke Yan, Shouhong Ding, Xiaoqiang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fine-grained object recognition, requires extensive visual, learning requires extensive, extensive visual perception, visual perception capabilities

Comments: Accepted to ICLR 2025

Abstract:Vision-language (VL) learning requires extensive visual perception capabilities, such as fine-grained object recognition and spatial perception. Recent works typically rely on training huge models on massive datasets to develop these capabilities. As a more efficient alternative, this paper proposes a new framework that Transfers the knowledge from a hub of Vision Experts (ToVE) for efficient VL learning, leveraging pre-trained vision expert models to promote visual perception capability. Specifically, building on a frozen CLIP encoder that provides vision tokens for image-conditioned language generation, ToVE introduces a hub of multiple vision experts and a token-aware gating network that dynamically routes expert knowledge to vision tokens. In the transfer phase, we propose a "residual knowledge transfer" strategy, which not only preserves the generalizability of the vision tokens but also allows detachment of low-contributing experts to improve inference efficiency. Further, we explore merging this expert knowledge into a single CLIP encoder, creating a knowledge-merged CLIP that produces more informative vision tokens without expert inference during deployment. Experimental results across various VL tasks demonstrate that the proposed ToVE achieves competitive performance with two orders of magnitude less training data.

45. 【2504.00665】Monocular and Generalizable Gaussian Talking Head Animation

Link: https://arxiv.org/abs/2504.00665

Authors: Shengjie Gong, Haojie Li, Jiapeng Tang, Dongming Hu, Shuangping Huang, Hao Chen, Tianshui Chen, Zhuoman Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Talking Head Animation, Generalizable Gaussian Talking, Gaussian Talking Head, Head Animation, Talking Head

Comments: Accepted by CVPR 2025

Abstract:In this work, we introduce Monocular and Generalizable Gaussian Talking Head Animation (MGGTalk), which requires monocular datasets and generalizes to unseen identities without personalized re-training. Compared with previous 3D Gaussian Splatting (3DGS) methods that require elusive multi-view datasets or tedious personalized learning/inference, MGGTalk enables more practical and broader applications. However, in the absence of multi-view and personalized training data, the incompleteness of geometric and appearance information poses a significant challenge. To address these challenges, MGGTalk explores depth information to enhance geometric and facial symmetry characteristics to supplement both geometric and appearance features. Initially, based on the pixel-wise geometric information obtained from depth estimation, we incorporate symmetry operations and point cloud filtering techniques to ensure a complete and precise position parameter for 3DGS. Subsequently, we adopt a two-stage strategy with symmetric priors for predicting the remaining 3DGS parameters. We begin by predicting Gaussian parameters for the visible facial regions of the source image. These parameters are subsequently utilized to improve the prediction of Gaussian parameters for the non-visible regions. Extensive experiments demonstrate that MGGTalk surpasses previous state-of-the-art methods, achieving superior performance across various metrics.

46. 【2504.00654】QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA

Link: https://arxiv.org/abs/2504.00654

Authors: Shuai Li, Jian Xu, Xiao-Hui Li, Chao Deng, Lin-Lin Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, shown significant progress

Abstract:Recent advances in Multi-modal Large Language Models (MLLMs) have shown significant progress in open-world Visual Question Answering (VQA). However, integrating visual information increases the number of processed tokens, leading to higher GPU memory usage and computational overhead. Images often contain more redundant information than text, and not all visual details are pertinent to specific questions. To address these challenges, we propose QG-VTC, a novel question-guided visual token compression method for MLLM-based VQA tasks. QG-VTC employs a pretrained text encoder and a learnable feed-forward layer to embed user questions into the vision encoder's feature space, then computes correlation scores between the question embeddings and visual tokens. By selecting the most relevant tokens and softly compressing others, QG-VTC ensures fine-tuned relevance to user needs. Additionally, a progressive strategy applies this compression across different vision encoder layers, gradually reducing the number of tokens. This approach maximizes retention of question-relevant information while discarding irrelevant details. Experimental results show that our method achieves performance on par with uncompressed models using just 1/8 of the visual tokens. The code and model will be publicly available on GitHub.
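A rough sketch of the select-and-compress step follows. It assumes dot-product correlation scores and interprets "softly compressing" as merging the unselected tokens into one softmax-weighted average token; both choices, along with the name `compress_tokens`, are assumptions made for illustration rather than the method's actual design.

```python
import math


def compress_tokens(question, tokens, keep):
    """Keep the `keep` tokens most correlated with the question embedding.

    The remaining tokens are merged into a single extra token using
    softmax weights over their scores (an assumed form of soft compression).
    """
    scores = [sum(q * t for q, t in zip(question, tok)) for tok in tokens]
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # preserve the original token order
    rest_idx = order[keep:]
    output = [tokens[i] for i in kept_idx]
    if rest_idx:
        top = max(scores[i] for i in rest_idx)
        weights = [math.exp(scores[i] - top) for i in rest_idx]
        total = sum(weights)
        dim = len(tokens[0])
        merged = [
            sum(w * tokens[i][d] for w, i in zip(weights, rest_idx)) / total
            for d in range(dim)
        ]
        output.append(merged)
    return output


# Usage: four visual tokens reduced to one kept token plus one merged token.
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
compressed = compress_tokens([1.0, 0.0], visual, keep=1)
```

Applying such a step at several encoder layers with a shrinking `keep` would mirror the progressive strategy the abstract mentions.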

47. 【2504.00647】FDDet: Frequency-Decoupling for Boundary Refinement in Temporal Action Detection

Link: https://arxiv.org/abs/2504.00647

Authors: Xinnan Zhu, Yicheng Zhu, Tixin Chen, Wentao Wu, Yuanjie Dang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: aims to locate, locate and classify, untrimmed videos, action, Temporal

Abstract:Temporal action detection aims to locate and classify actions in untrimmed videos. While recent works focus on designing powerful feature processors for pre-trained representations, they often overlook the inherent noise and redundancy within these features. Large-scale pre-trained video encoders tend to introduce background clutter and irrelevant semantics, leading to context confusion and imprecise boundaries. To address this, we propose a frequency-aware decoupling network that improves action discriminability by filtering out noisy semantics captured by pre-trained models. Specifically, we introduce an adaptive temporal decoupling scheme that suppresses irrelevant information while preserving fine-grained atomic action details, yielding more task-specific representations. In addition, we enhance inter-frame modeling by capturing temporal variations to better distinguish actions from background redundancy. Furthermore, we present a long-short-term category-aware relation network that jointly models local transitions and long-range dependencies, improving localization precision. The refined atomic features and frequency-guided dynamics are fed into a standard detection head to produce accurate action predictions. Extensive experiments on THUMOS14, HACS, and ActivityNet-1.3 show that our method, powered by InternVideo2-6B features, achieves state-of-the-art performance on temporal action detection benchmarks.

48. 【2504.00640】POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation

Link: https://arxiv.org/abs/2504.00640

Authors: Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, Jun Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing LVLM-based reasoning, Existing LVLM-based, suffer from imprecise, text responses, imprecise segmentation results

Comments: CVPR2025

Abstract:Existing LVLM-based reasoning segmentation methods often suffer from imprecise segmentation results and hallucinations in their text responses. This paper introduces POPEN, a novel framework designed to address these issues and achieve improved results. POPEN includes a preference-based optimization method to finetune the LVLM, aligning it more closely with human preferences and thereby generating better text responses and segmentation results. Additionally, POPEN introduces a preference-based ensemble method for inference, which integrates multiple outputs from the LVLM using a preference-score-based attention mechanism for refinement. To better adapt to the segmentation task, we incorporate several task-specific designs in our POPEN framework, including a new approach for collecting segmentation preference data with a curriculum learning mechanism, and a novel preference optimization loss to refine the segmentation capability of the LVLM. Experiments demonstrate that our method achieves state-of-the-art performance in reasoning segmentation, exhibiting minimal hallucination in text responses and the highest segmentation accuracy compared to previous advanced methods like LISA and PixelLM. Project page is this https URL

49. 【2504.00639】Coca-Splat: Collaborative Optimization for Camera Parameters and 3D Gaussians

Link: https://arxiv.org/abs/2504.00639

Authors: Jiamin Wu, Hongyang Li, Xiaoke Jiang, Yuan Yao, Lei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussians, camera parameters, Gaussians and camera, jointly optimizing camera, optimizing camera parameters

Abstract:In this work, we introduce Coca-Splat, a novel approach to addressing the challenges of sparse view pose-free scene reconstruction and novel view synthesis (NVS) by jointly optimizing camera parameters with 3D Gaussians. Inspired by deformable DEtection TRansformer, we design separate queries for 3D Gaussians and camera parameters and update them layer by layer through deformable Transformer layers, enabling joint optimization in a single network. This design demonstrates better performance because accurately rendering views that closely approximate ground-truth images relies on precise estimation of both 3D Gaussians and camera parameters. In such a design, the centers of 3D Gaussians are projected onto each view by camera parameters to get projected points, which are regarded as 2D reference points in deformable cross-attention. With camera-aware multi-view deformable cross-attention (CaMDFA), 3D Gaussians and camera parameters are intrinsically connected by sharing the 2D reference points. Additionally, 2D reference point determined rays (RayRef), defined from camera centers to the reference points, assist in modeling the relationship between 3D Gaussians and camera parameters through RQ-decomposition on an overdetermined system of equations derived from the rays. Extensive evaluation shows that our approach outperforms previous methods, both pose-required and pose-free, on RealEstate10K and ACID within the same pose-free setting.
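The projection of Gaussian centers to 2D reference points can be illustrated with a standard pinhole camera model. The sketch below assumes a world-to-camera rotation matrix and translation plus focal lengths and a principal point; the function name `project_points` and this particular parameterization are assumptions, since the abstract does not spell out how the camera is represented.

```python
def project_points(points, rotation, translation, fx, fy, cx, cy):
    """Project 3D points to pixel coordinates with a pinhole camera.

    rotation is a 3x3 world-to-camera matrix (nested lists) and
    translation a length-3 vector. Points behind the camera map to None.
    """
    projected = []
    for p in points:
        # Camera-frame coordinates: x_cam = R p + t
        cam = [
            sum(rotation[r][k] * p[k] for k in range(3)) + translation[r]
            for r in range(3)
        ]
        if cam[2] <= 0:
            projected.append(None)  # no valid reference point for this view
            continue
        projected.append((fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy))
    return projected


# Usage: two Gaussian centers seen by an identity-pose camera.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
centers = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0)]
reference_points = project_points(centers, identity, [0.0, 0.0, 0.0], 100.0, 100.0, 50.0, 40.0)
```

Because the same projected points depend on both the Gaussian centers and the camera parameters, an error in either shifts the reference points, which is why sharing them couples the two sets of queries.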

50. 【2504.00609】Bi-Grid Reconstruction for Image Anomaly Detection

Link: https://arxiv.org/abs/2504.00609

Authors: Huichuan Huang, Zhiqing Zhong, Guangyu Wei, Yonghao Wan, Wenlong Sun, Aimin Feng

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: GRAD, Feature Block Paste, anomaly detection, feature

Abstract:In image anomaly detection, significant advancements have been made using un- and self-supervised methods with datasets containing only normal samples. However, these approaches often struggle with fine-grained anomalies. This paper introduces GRAD: Bi-Grid Reconstruction for Image Anomaly Detection, which employs two continuous grids to enhance anomaly detection from both normal and abnormal perspectives. In this work: 1) Grids as feature repositories that improve generalization and mitigate the Identical Shortcut (IS) issue; 2) An abnormal feature grid that refines normal feature boundaries, boosting detection of fine-grained defects; 3) The Feature Block Paste (FBP) module, which synthesizes various anomalies at the feature level for quick abnormal grid deployment. GRAD's robust representation capabilities also allow it to handle multiple classes with a single model. Evaluations on datasets like MVTecAD, VisA, and GoodsAD show significant performance improvements in fine-grained anomaly detection. GRAD excels in overall accuracy and in discerning subtle differences, demonstrating its superiority over existing methods.

51. 【2504.00606】Sample-level Adaptive Knowledge Distillation for Action Recognition

Link: https://arxiv.org/abs/2504.00606

Authors: Ping Li, Chenhao Ping, Wenxiao Wang, Mingli Song

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: compresses neural networks, pre-trained large network, neural networks, small network, large network

Abstract:Knowledge Distillation (KD) compresses neural networks by learning a small network (student) via transferring knowledge from a pre-trained large network (teacher). Many endeavours have been devoted to the image domain, while few works focus on video analysis, which requires training much larger models that are hard to deploy on resource-limited devices. However, traditional methods neglect two important problems: 1) since a capacity gap exists between the teacher and the student, some knowledge w.r.t. difficult-to-transfer samples cannot be correctly transferred, or even harms the final performance of the student, and 2) as training progresses, difficult-to-transfer samples may become easier to learn, and vice versa. To alleviate the two problems, we propose a Sample-level Adaptive Knowledge Distillation (SAKD) framework for action recognition. In particular, it mainly consists of the sample distillation difficulty evaluation module and the sample adaptive distillation module. The former applies temporal interruption to frames, i.e., randomly dropping or shuffling the frames during training, which increases the learning difficulty of samples during distillation, so as to better discriminate their distillation difficulty. The latter module adaptively adjusts the distillation ratio at the sample level, such that the KD loss dominates the training with easy-to-transfer samples while the vanilla loss dominates that with difficult-to-transfer samples. More importantly, we only select those samples with both low distillation difficulty and high diversity to train the student model, reducing computational cost. Experimental results on two video benchmarks and one image benchmark demonstrate the superiority of the proposed method by striking a good balance between performance and efficiency.
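The sample-level ratio adjustment can be pictured with a very small sketch. It assumes each sample already has a distillation difficulty in [0, 1] and uses the linear ratio `1 - difficulty`; that mapping and the name `sample_loss` are hypothetical, since the abstract does not give the actual formula.

```python
def sample_loss(kd_loss, vanilla_loss, difficulty):
    """Blend the KD loss and the vanilla (task) loss for one sample.

    Easy-to-transfer samples (difficulty near 0) are dominated by the
    KD loss; hard ones (difficulty near 1) by the vanilla loss.
    The linear ratio is an assumption made for illustration.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    ratio = 1.0 - difficulty
    return ratio * kd_loss + (1.0 - ratio) * vanilla_loss


# Usage: the same pair of losses weighted for an easy and a hard sample.
easy = sample_loss(2.0, 4.0, 0.0)
hard = sample_loss(2.0, 4.0, 1.0)
```

Since difficulty can change as training progresses, the ratio would be recomputed per sample at each epoch rather than fixed once.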

52. 【2504.00561】Continual Cross-Modal Generalization

Link: https://arxiv.org/abs/2504.00561

Authors: Yan Xia, Hai Huang, Minghui Fang, Zhou Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: shared discrete representation, Cross-modal generalization aims, enabling knowledge transfer, aims to learn, transfer across unannotated

Abstract:Cross-modal generalization aims to learn a shared discrete representation space from multimodal pairs, enabling knowledge transfer across unannotated modalities. However, achieving a unified representation for all modality pairs requires extensive paired data, which is often impractical. Inspired by the availability of abundant bimodal data (e.g., in ImageBind), we explore a continual learning approach that incrementally maps new modalities into a shared discrete codebook via a mediator modality. We propose the Continual Mixture of Experts Adapter (CMoE-Adapter) to project diverse modalities into a unified space while preserving prior knowledge. To align semantics across stages, we introduce a Pseudo-Modality Replay (PMR) mechanism with a dynamically expanding codebook, enabling the model to adaptively incorporate new modalities using learned ones as guidance. Extensive experiments on image-text, audio-text, video-text, and speech-text show that our method achieves strong performance on various cross-modal generalization tasks. Code is provided in the supplementary material.

53. 【2504.00559】AttentiveGRU: Recurrent Spatio-Temporal Modeling for Advanced Radar-Based BEV Object Detection

Link: https://arxiv.org/abs/2504.00559

Authors: Loveneet Saini, Mirko Meuter, Hasan Tercan, Tobias Meisen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: radar-based perception systems, advanced automotive, radar-based perception, perception systems, single-frame BEV paradigms

Abstract:Bird's-eye view (BEV) object detection has become important for advanced automotive 3D radar-based perception systems. However, the inherently sparse and non-deterministic nature of radar data limits the effectiveness of traditional single-frame BEV paradigms. In this paper, we address this limitation by introducing AttentiveGRU, a novel attention-based recurrent approach tailored for radar constraints, which extracts individualized spatio-temporal context for objects by dynamically identifying and fusing temporally correlated structures across present and memory states. By leveraging the consistency of an object's latent representation over time, our approach exploits temporal relations to enrich feature representations for both stationary and moving objects, thereby enhancing detection performance and eliminating the need for externally providing or estimating any information about ego vehicle motion. Our experimental results on the public nuScenes dataset show a significant increase in mAP for the car category by 21% over the best radar-only submission. Further evaluations on an additional dataset demonstrate notable improvements in object detection capabilities, underscoring the applicability and effectiveness of our method.

54. 【2504.00558】Archival Faces: Detection of Faces in Digitized Historical Documents

Link: https://arxiv.org/abs/2504.00558

Authors: Marek Vaško, Adam Herout, Michal Hradiš

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: digitizing historical archives, ordinary people, surrounding text, make them searchable, celebrities and ordinary

Comments: 15 pages, 6 figures, 6 tables

Abstract:When digitizing historical archives, it is necessary to search for the faces of celebrities and ordinary people, especially in newspapers, link them to the surrounding text, and make them searchable. Existing face detectors on datasets of scanned historical documents fail remarkably -- current detection tools only achieve around 24% mAP at 50:90% IoU. This work compensates for this failure by introducing a new manually annotated domain-specific dataset in the style of the popular Wider Face dataset, containing 2.2k new images from digitized historical newspapers from the 19th to 20th century, with 11k new bounding-box annotations and associated facial landmarks. This dataset allows existing detectors to be retrained to bring their results closer to the standard in the field of face detection in the wild. We report several experimental results comparing different families of fine-tuned detectors against publicly available pre-trained face detectors and ablation studies of multiple detector sizes with comprehensive detection and landmark prediction performance results.
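As a reminder of what the "50:90% IoU" range refers to, the sketch below computes the intersection over union of two boxes and sweeps thresholds from 0.50 to 0.90. The 0.05 step, the `(x1, y1, x2, y2)` box format, and the helper names are assumptions; a full mAP computation additionally needs ranked detections and precision-recall curves, which are left out here.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed threshold grid: 0.50, 0.55, ..., 0.90
IOU_THRESHOLDS = [t / 100 for t in range(50, 95, 5)]


def matched_fraction(pred, truth):
    """Fraction of thresholds at which the prediction counts as a match."""
    value = iou(pred, truth)
    return sum(value >= t for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


# Usage: a loosely overlapping detection fails every threshold.
loose = matched_fraction((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```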

55. 【2504.00557】Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

Link: https://arxiv.org/abs/2504.00557

Authors: Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: token reduction lowers, large vision-language models, reduction lowers inference, lowers inference costs, inference costs caused

Comments: accepted at CVPR 2025 Workshop on ELVM

Abstract:Visual token reduction lowers inference costs caused by extensive image features in large vision-language models (LVLMs). Unlike relevant studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparse nature in cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. By benefiting from 50%-reduced visual features, our model can reduce inference latency and memory usage while achieving benchmark parity.
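One simple way to exploit sparse cross-attention maps, sketched here under assumptions, is to sum the attention each visual token receives from all text queries and keep the top fraction. The aggregation by summation, the `keep_ratio` parameter, and the name `select_visual_tokens` are illustrative choices, not the paper's confirmed procedure.

```python
def select_visual_tokens(attention, keep_ratio=0.5):
    """Return indices of visual tokens to keep, in their original order.

    attention[q][v] is the weight text query q places on visual token v.
    Tokens are ranked by total received attention (assumed criterion).
    """
    num_visual = len(attention[0])
    importance = [sum(row[v] for row in attention) for v in range(num_visual)]
    num_keep = max(1, int(num_visual * keep_ratio))
    ranked = sorted(range(num_visual), key=lambda v: importance[v], reverse=True)
    return sorted(ranked[:num_keep])


# Usage: two text queries attending over four visual tokens.
attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
]
kept = select_visual_tokens(attn, keep_ratio=0.5)
```

Only the keys and values of the kept indices would then need to stay in the cross-attention KV cache, which is where the memory saving would come from.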

56. 【2504.00543】Generalization-aware Remote Sensing Change Detection via Domain-agnostic Learning

Link: https://arxiv.org/abs/2504.00543

Authors: Qi Zang, Shuang Wang, Dong Zhao, Dou Quan, Yang Hu, Licheng Jiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: imaging environmental factors, Change detection, bitemporal images induced, region development, key challenges

Abstract:Change detection is of essential significance for regional development, and pseudo-changes between bitemporal images induced by imaging environmental factors are a key challenge. Existing transformation-based methods regard pseudo-changes as a kind of style shift and alleviate them by transforming bitemporal images into the same style using generative adversarial networks (GANs). However, their efforts are limited by two drawbacks: 1) Transformed images suffer from distortion that reduces feature discrimination. 2) Alignment hampers the model from learning domain-agnostic representations, which degrades performance on scenes with domain shifts from the training data. Therefore, starting from pseudo-changes caused by style differences, we present a generalizable domain-agnostic difference learning network (DonaNet). For drawback 1), we argue for local-level statistics as style proxies to assist against domain shifts. For drawback 2), DonaNet learns domain-agnostic representations by removing the domain-specific style of encoded features and highlighting the class characteristics of objects. For the removal, we propose a domain difference removal module to reduce feature variance while preserving discriminative properties, and propose an enhanced version that can eliminate more style by decorrelating features. For the highlighting, we propose a cross-temporal generalization learning strategy to imitate latent domain shifts, thus enabling the model to actively extract feature representations that are more robust to shifts. Extensive experiments conducted on three public datasets demonstrate that DonaNet outperforms existing state-of-the-art methods with a smaller model size and is more robust to domain shift.

57. 【2504.00527】SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

Link: https://arxiv.org/abs/2504.00527

Authors: Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Bernard Ghanem

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Masked video modeling, Masked video, video, Masked, SMILE

Comments: Accepted to CVPR 2025

Abstract:Masked video modeling, such as VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, such methods are primarily based on reconstructing pixel-level details of natural videos, which have substantial temporal redundancy, limiting their capability for semantic representation and sufficient encoding of motion dynamics. To address these issues, this paper introduces a novel SSL approach for video representation learning, dubbed SMILE, by infusing both spatial and motion semantics. In SMILE, we leverage image-language pretrained models, such as CLIP, to guide the learning process with their high-level spatial semantics. We enhance the representation of motion by introducing synthetic motion patterns in the training data, allowing the model to capture more complex and dynamic content. Furthermore, using SMILE, we establish a new self-supervised video learning paradigm capable of learning strong video representations without requiring any natural video data. We have carried out extensive experiments on 7 datasets with various downstream scenarios. SMILE surpasses current state-of-the-art SSL methods, showcasing its effectiveness in learning more discriminative and generalizable video representations. Code is available: this https URL

58. 【2504.00526】High-Quality Pseudo-Label Generation Based on Visual Prompt Assisted Cloud Model Update

Link: https://arxiv.org/abs/2504.00526

Authors: Xinrun Xu, Qiuhong Zhang, Jianwen Yang, Zhanbiao Lian, Jin Yan, Zhiming Ding, Shan Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Generating high-quality pseudo-labels, data distributions evolve, Generating high-quality, dynamic traffic monitoring, Visual Prompt Generator

Comments: IJCNN'25

Abstract:Generating high-quality pseudo-labels on the cloud is crucial for cloud-edge object detection, especially in dynamic traffic monitoring where data distributions evolve. Existing methods often assume reliable cloud models, neglecting potential errors or struggling with complex distribution shifts. This paper proposes Cloud-Adaptive High-Quality Pseudo-label generation (CA-HQP), addressing these limitations by incorporating a learnable Visual Prompt Generator (VPG) and dual feature alignment into cloud model updates. The VPG enables parameter-efficient adaptation by injecting visual prompts, enhancing flexibility without extensive fine-tuning. CA-HQP mitigates domain discrepancies via two feature alignment techniques: global Domain Query Feature Alignment (DQFA) capturing scene-level shifts, and fine-grained Temporal Instance-Aware Feature Embedding Alignment (TIAFA) addressing instance variations. Experiments on the Bellevue traffic dataset demonstrate that CA-HQP significantly improves pseudo-label quality compared to existing methods, leading to notable performance gains for the edge model and showcasing CA-HQP's adaptation effectiveness. Ablation studies validate each component (DQFA, TIAFA, VPG) and the synergistic effect of combined alignment strategies, highlighting the importance of adaptive cloud updates and domain adaptation for robust object detection in evolving scenarios. CA-HQP provides a promising solution for enhancing cloud-edge object detection systems in real-world applications.

59. 【2504.00525】Robust LiDAR-Camera Calibration with 2D Gaussian Splatting

Link: https://arxiv.org/abs/2504.00525

Authors: Shuyi Zhou, Shuxiang Xie, Ryoichi Ishikawa, Takeshi Oishi

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: robotics recently, increasingly popular, popular in robotics, LiDAR-camera systems, LiDAR-camera extrinsic parameters

Comments: Accepted in IEEE Robotics and Automation Letters. Code available at: [this https URL](https://github.com/ShuyiZhou495/RobustCalibration)

Abstract:LiDAR-camera systems have become increasingly popular in robotics recently. A critical and initial step in integrating the LiDAR and camera data is the calibration of the LiDAR-camera system. Most existing calibration methods rely on auxiliary target objects, which often involve complex manual operations, whereas targetless methods have yet to achieve practical effectiveness. Recognizing that 2D Gaussian Splatting (2DGS) can reconstruct geometric information from camera image sequences, we propose a calibration method that estimates LiDAR-camera extrinsic parameters using geometric constraints. The proposed method begins by reconstructing colorless 2DGS using LiDAR point clouds. Subsequently, we update the colors of the Gaussian splats by minimizing the photometric loss. The extrinsic parameters are optimized during this process. Additionally, we address the limitations of the photometric loss by incorporating the reprojection and triangulation losses, thereby enhancing the calibration robustness and accuracy.

60. 【2504.00515】Training Frozen Feature Pyramid DINOv2 for Eyelid Measurements with Infinite Encoding and Orthogonal Regularization

Link: https://arxiv.org/abs/2504.00515

Authors: Chun-Hung Chen

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

Keywords: Margin Reflex Distances, Reflex Distances, Levator Function, Margin Reflex, inconsistent methods

Abstract:Accurate measurement of eyelid parameters such as Margin Reflex Distances (MRD1, MRD2) and Levator Function (LF) is critical in oculoplastic diagnostics but remains limited by manual, inconsistent methods. This study evaluates deep learning models: SE-ResNet, EfficientNet, and the vision transformer-based DINOv2 for automating these measurements using smartphone-acquired images. We assess performance across frozen and fine-tuned settings, using MSE, MAE, and R2 metrics. DINOv2, pretrained through self-supervised learning, demonstrates superior scalability and robustness, especially under frozen conditions ideal for mobile deployment. Lightweight regressors such as MLP and Deep Ensemble offer high precision with minimal computational overhead. To address class imbalance and improve generalization, we integrate focal loss, orthogonal regularization, and binary encoding strategies. Our results show that DINOv2 combined with these enhancements delivers consistent, accurate predictions across all tasks, making it a strong candidate for real-world, mobile-friendly clinical applications. This work highlights the potential of foundation models in advancing AI-powered ophthalmic care.
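The three regression metrics named in the abstract (MSE, MAE, and R2) follow their standard definitions and can be computed as in the sketch below. The function name `regression_metrics` and the sample values in the usage note are made up for illustration and are not from the paper. The sketch assumes the ground-truth values are not all identical, since R2 is undefined in that case.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for paired lists of measurements."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    # Total variance of the targets; assumed non-zero (targets not constant).
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2
```

For example, `regression_metrics([1, 2, 3], [1, 2, 4])` yields an MSE and MAE of one third each and an R2 of 0.5.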

61. 【2504.00502】ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Link: https://arxiv.org/abs/2504.00502

Authors: Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models

Comments: Project page: [this https URL](https://github.com/icip-cas/ShortV)

Abstract:Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at this https URL
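The Layer Contribution idea, measuring how much the model output diverges when one layer's transformation is removed, can be sketched roughly as below. The abstract does not say which divergence is used, so KL divergence is an assumption here, and the names `kl_divergence`, `layer_contribution`, and `select_ineffective_layers` are hypothetical. The output distributions are taken as plain probability lists that the caller obtains from the model with and without the layer.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def layer_contribution(full_output, ablated_output):
    """Assumed LC: divergence between outputs with and without the layer."""
    return kl_divergence(full_output, ablated_output)


def select_ineffective_layers(lc_scores, num_layers_to_freeze):
    """Indices of the layers with the lowest contribution scores."""
    order = sorted(range(len(lc_scores)), key=lambda i: lc_scores[i])
    return sorted(order[:num_layers_to_freeze])
```

Layers whose removal barely changes the output get a score near zero, which makes them candidates for skipping visual-token updates.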

62. 【2504.00496】Learned Image Compression with Dictionary-based Entropy Model

Link: https://arxiv.org/abs/2504.00496

Authors: Jingbo Lu, Leheng Zhang, Xingyu Zhou, Mu Li, Wen Li, Shuhang Gu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Learned image compression, attracted great research, great research interest, exhibited superior rate-distortion, classical image compression

Comments: Accepted to CVPR 2025

Abstract:Learned image compression methods have attracted great research interest and exhibited superior rate-distortion performance to the best classical image compression standards of the present. The entropy model plays a key role in learned image compression, which estimates the probability distribution of the latent representation for further entropy coding. Most existing methods employed hyper-prior and auto-regressive architectures to form their entropy models. However, they only aimed to explore the internal dependencies of latent representation while neglecting the importance of extracting prior from training data. In this work, we propose a novel entropy model named Dictionary-based Cross Attention Entropy model, which introduces a learnable dictionary to summarize the typical structures occurring in the training dataset to enhance the entropy model. Extensive experimental results have demonstrated that the proposed model strikes a better balance between performance and latency, achieving state-of-the-art results on various benchmark datasets.

63. 【2504.00490】SCFANet: Style Distribution Constraint Feature Alignment Network For Pathological Staining Translation

Link: https://arxiv.org/abs/2504.00490

Authors: Zetong Chen, Yuzhuo Chen, Hai Zhong, Xu Qiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: detecting specific antigens, IHC staining style, IHC staining, IHC staining process, IHC

Abstract:Immunohistochemical (IHC) staining serves as a valuable technique for detecting specific antigens or proteins through antibody-mediated visualization. However, the IHC staining process is both time-consuming and costly. To address these limitations, the application of deep learning models for direct translation of cost-effective Hematoxylin and Eosin (HE) stained images into IHC stained images has emerged as an efficient solution. Nevertheless, the conversion from HE to IHC images presents significant challenges, primarily due to alignment discrepancies between image pairs and the inherent diversity in IHC staining style patterns. To overcome these challenges, we propose the Style Distribution Constraint Feature Alignment Network (SCFANet), which incorporates two innovative modules: the Style Distribution Constrainer (SDC) and Feature Alignment Learning (FAL). The SDC ensures consistency between the generated and target images' style distributions while integrating cycle consistency loss to maintain structural consistency. To mitigate the complexity of direct image-to-image translation, the FAL module decomposes the end-to-end translation task into two subtasks: image reconstruction and feature alignment. Furthermore, we ensure pathological consistency between generated and target images by maintaining pathological pattern consistency and Optical Density (OD) uniformity. Extensive experiments conducted on the Breast Cancer Immunohistochemical (BCI) dataset demonstrate that our SCFANet model outperforms existing methods, achieving precise transformation of HE-stained images into their IHC-stained counterparts. The proposed approach not only addresses the technical challenges in HE to IHC image translation but also provides a robust framework for accurate and efficient stain conversion in pathological analysis.

64. 【2504.00487】FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning

Link: https://arxiv.org/abs/2504.00487

Authors: Jie Ma, Zhitao Gao, Qi Chai, Jun Liu, Pinghui Wang, Jing Tao, Zhou Su

Categories: Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: audio-video inputs accurately, reasoning task requiring, task requiring intelligent, requiring intelligent systems, answer natural language

Comments: Under Review

Abstract:Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task requiring intelligent systems to accurately answer natural language queries based on paired audio-video inputs. However, existing AVQA approaches often suffer from overfitting to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at this https URL.

65. 【2504.00481】Hierarchical Attention Networks for Lossless Point Cloud Attribute Compression

Link: https://arxiv.org/abs/2504.00481

Authors: Yueru Chen, Wei Zhang, Dingquan Li, Jing Wang, Ge Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Keywords: multi-resolution spatial structure, deep hierarchical attention, leveraging a multi-resolution, residual learning, propose a deep

Comments: Accepted by DCC 2025

Abstract:In this paper, we propose a deep hierarchical attention context model for lossless attribute compression of point clouds, leveraging a multi-resolution spatial structure and residual learning. A simple and effective Level of Detail (LoD) structure is introduced to yield a coarse-to-fine representation. To enhance efficiency, points within the same refinement level are encoded in parallel, sharing a common context point group. By hierarchically aggregating information from neighboring points, our attention model learns contextual dependencies across varying scales and densities, enabling comprehensive feature extraction. We also adopt normalization for position coordinates and attributes to achieve scale-invariant compression. Additionally, we segment the point cloud into multiple slices to facilitate parallel processing, further optimizing time complexity. Experimental results demonstrate that the proposed method offers better coding performance than the latest G-PCC for color and reflectance attributes while maintaining more efficient encoding and decoding runtimes.

66. 【2504.00478】FSSUWNet: Mitigating the Fragility of Pre-trained Models with Feature Enhancement for Few-Shot Semantic Segmentation in Underwater Images

Link: https://arxiv.org/abs/2504.00478

Authors: Zhuohao Li, Zhicheng Huang, Wenchao Liu, Zhuxing Zhang, Jianming Miao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Few-Shot Semantic Segmentation, Few-Shot Semantic, Semantic Segmentation, data-scarce domains, focuses on segmenting

Abstract:Few-Shot Semantic Segmentation (FSS), which focuses on segmenting new classes in images using only a limited number of annotated examples, has recently progressed in data-scarce domains. However, in this work, we show that the existing FSS methods often struggle to generalize to underwater environments. Specifically, the prior features extracted by pre-trained models used as feature extractors are fragile due to the unique challenges of underwater images. To address this, we propose FSSUWNet, a tailored FSS framework for underwater images with feature enhancement. FSSUWNet exploits the integration of complementary features, emphasizing both low-level and high-level image characteristics. In addition to employing a pre-trained model as the primary encoder, we propose an auxiliary encoder called Feature Enhanced Encoder which extracts complementary features to better adapt to underwater scene characteristics. Furthermore, a simple and effective Feature Alignment Module aims to provide global prior knowledge and align low-level features with high-level features in dimensions. Given the scarcity of underwater images, we introduce a cross-validation dataset version based on the Segmentation of Underwater Imagery dataset. Extensive experiments on public underwater segmentation datasets demonstrate that our approach achieves state-of-the-art performance. For example, our method outperforms the previous best method by 2.8% and 2.6% in terms of the mean Intersection over Union metric for 1-shot and 5-shot scenarios in the datasets, respectively. Our implementation is available at this https URL.

67. 【2504.00476】4th PVUW MeViS 3rd Place Report: Sa2VA

Link: https://arxiv.org/abs/2504.00476

Authors: Haobo Yuan, Tao Zhang, Xiangtai Li, Lu Qi, Zilong Huang, Shilin Xu, Jiashi Feng, Ming-Hsuan Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: video object segmentation, existing RVOS benchmarks, Referring video object, referring expression tasks, object segmentation

Comments: Technical Report, 4 pages, Code: [this https URL](https://github.com/magic-research/Sa2VA)

Abstract:Referring video object segmentation (RVOS) is a challenging task that requires the model to segment the object in a video given the language description. MeViS is a recently proposed dataset that contains motion expressions of the target objects, making it a more challenging benchmark than existing RVOS benchmarks. On the other hand, for referring expression tasks, a new trend is to adopt multi-modal large language models (MLLMs) to achieve better image and text alignment. In this report, we show that with a simple modification to the test-time inference method on stronger MLLMs, we can obtain stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos. By enlarging the scope of key frames, without any further training, we achieve 3rd place in the 4th PVUW workshop.

68. 【2504.00470】Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection

Link: https://arxiv.org/abs/2504.00470

Authors: Ruoyu Chen, Siyuan Liang, Jingzhi Li, Shiming Liu, Li Liu, Hua Zhang, Xiaochun Cao

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: trustworthy AI system, develop a trustworthy, aim to identify, attribution, input

Abstract:To develop a trustworthy AI system, attribution methods aim to identify the input regions that most influence the model's decisions. The primary task of existing attribution methods lies in efficiently and accurately identifying the relationships among input-prediction interactions. Particularly when the input data is discrete, such as images, analyzing the relationship between inputs and outputs poses a significant challenge due to the combinatorial explosion. In this paper, we propose a novel and efficient black-box attribution mechanism, LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. First, to accurately assess interactions, we design a submodular function that quantifies subset importance and effectively captures their impact on decision outcomes. Then, to efficiently rank input sub-regions by their importance for attribution, we improve optimization efficiency through a novel bidirectional greedy search algorithm. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Extensive experiments on eight foundation models demonstrate that our method provides faithful interpretations with fewer regions, exhibits strong generalization, and shows an average improvement of 36.3% in Insertion and 39.6% in Deletion. Our method also outperforms the naive greedy search in attribution efficiency, being 1.6 times faster. Furthermore, when explaining the reasons behind model prediction errors, the highest confidence achieved by our method is, on average, 86.1% higher than that of state-of-the-art attribution algorithms. The code is available at this https URL.
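To make the phrase "submodular subset selection" concrete, the sketch below shows the naive greedy search that the abstract uses as its point of comparison, not LiMA's bidirectional algorithm. The coverage-style scoring function is a stand-in chosen because set coverage is a standard submodular function; the names `greedy_subset` and `coverage_score`, and the region-to-feature mapping, are hypothetical.

```python
def greedy_subset(candidates, score, budget):
    """Naive greedy selection: repeatedly add the region with the largest
    marginal gain in `score` until `budget` regions are chosen."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        base = score(selected)
        best = max(remaining, key=lambda c: score(selected + [c]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


def coverage_score(region_features):
    """Build a toy submodular score: number of distinct features covered."""
    def score(subset):
        covered = set()
        for region in subset:
            covered |= region_features[region]
        return len(covered)
    return score
```

The order in which regions are selected doubles as an importance ranking, and each step re-evaluates every remaining candidate, which is the cost a faster search strategy would aim to reduce.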

69. 【2504.00463】Exploring the Collaborative Advantage of Low-level Information on Generalizable AI-Generated Image Detection

Link: https://arxiv.org/abs/2504.00463

Authors: Ziyin Zhou, Ke Sun, Zhongxi Chen, Xianming Lin, Yunpeng Luo, Ke Yan, Shouhong Ding, Xiaoshuai Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low-level information, AI-Generated image detection, extracting low-level information, AI-Generated image, low-level

Abstract:Existing state-of-the-art AI-Generated image detection methods mostly consider extracting low-level information from RGB images to help improve the generalization of AI-Generated image detection, such as noise patterns. However, these methods often consider only a single type of low-level information, which may lead to suboptimal generalization. Through empirical analysis, we have discovered a key insight: different low-level information often exhibits generalization capabilities for different types of forgeries. Furthermore, we found that simple fusion strategies are insufficient to leverage the detection advantages of each low-level and high-level information for various forgery types. Therefore, we propose the Adaptive Low-level Experts Injection (ALEI) framework. Our approach introduces Lora Experts, enabling the backbone network, which is trained with high-level semantic RGB images, to accept and learn knowledge from different low-level information. We utilize a cross-attention method to adaptively fuse these features at intermediate layers. To prevent the backbone network from losing the modeling capabilities of different low-level features during the later stages of modeling, we developed a Low-level Information Adapter that interacts with the features extracted by the backbone network. Finally, we propose Dynamic Feature Selection, which dynamically selects the most suitable features for detecting the current image to maximize generalization detection capability. Extensive experiments demonstrate that our method, finetuned on only four categories of mainstream ProGAN data, performs excellently and achieves state-of-the-art results on multiple datasets containing unseen GAN and Diffusion methods.

70. 【2504.00458】Mixture-of-Attack-Experts with Class Regularization for Unified Physical-Digital Face Attack Detection

Link: https://arxiv.org/abs/2504.00458

Authors: Shunxin Chen, Ajian Liu, Junze Zheng, Jun Wan, Kailai Peng, Sergio Escalera, Zhen Lei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Facial recognition systems, Facial recognition, recognition systems, systems in real-world, real-world scenarios

Comments: 9 pages, 5 figures, accepted by AAAI-2025 (Oral)

Abstract:Facial recognition systems in real-world scenarios are susceptible to both digital and physical attacks. Previous methods have attempted to achieve classification by learning a comprehensive feature space. However, these methods have not adequately accounted for the inherent characteristics of physical and digital attack data, particularly the large intra-class variation in attacks and the small inter-class variation between live and fake faces. To address these limitations, we propose the Fine-Grained MoE with Class-Aware Regularization CLIP framework (FG-MoE-CLIP-CAR), incorporating key improvements at both the feature and loss levels. At the feature level, we employ a Soft Mixture of Experts (Soft MoE) architecture to leverage different experts for specialized feature processing. Additionally, we refine the Soft MoE to capture more subtle differences among various types of fake faces. At the loss level, we introduce two constraint modules: the Disentanglement Module (DM) and the Cluster Distillation Module (CDM). The DM enhances class separability by increasing the distance between the centers of live and fake face classes. However, center-to-center constraints alone are insufficient to ensure distinctive representations for individual features. Thus, we propose the CDM to further cluster features around their respective class centers while maintaining separation from other classes. Moreover, specific attacks that significantly deviate from common attack patterns are often overlooked. To address this issue, our distance calculation prioritizes more distant features. Experimental results on two unified physical-digital attack datasets demonstrate that the proposed method achieves state-of-the-art (SOTA) performance.

71. 【2504.00457】Distilling Multi-view Diffusion Models into 3D Generators

Link: https://arxiv.org/abs/2504.00457

Authors: Hao Qin, Luyuan Chen, Ming Kong, Mengxu Lu, Qiang Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: multi-view Diffusion model, formulation that Distills, Distills a multi-view, multi-view Diffusion, Diffusion model

Abstract:We introduce DD3G, a formulation that Distills a multi-view Diffusion model (MV-DM) into a 3D Generator using gaussian splatting. DD3G compresses and integrates extensive visual and spatial geometric knowledge from the MV-DM by simulating its ordinary differential equation (ODE) trajectory, ensuring the distilled generator generalizes better than those trained solely on 3D data. Unlike previous amortized optimization approaches, we align the MV-DM and 3D generator representation spaces to transfer the teacher's probabilistic flow to the student, thus avoiding inconsistencies in optimization objectives caused by probabilistic sampling. The introduction of probabilistic flow and the coupling of various attributes in 3D Gaussians introduce challenges in the generation process. To tackle this, we propose PEPD, a generator consisting of Pattern Extraction and Progressive Decoding phases, which enables efficient fusion of probabilistic flow and converts a single image into 3D Gaussians within 0.06 seconds. Furthermore, to reduce knowledge loss and overcome sparse-view supervision, we design a joint optimization objective that ensures the quality of generated samples through explicit supervision and implicit verification. Leveraging existing 2D generation models, we compile 120k high-quality RGBA images for distillation. Experiments on synthetic and public datasets demonstrate the effectiveness of our method. Our project is available at: this https URL

72. 【2504.00454】FA^{3}-CLIP: Frequency-Aware Cues Fusion and Attack-Agnostic Prompt Learning for Unified Face Attack Detection

Link: https://arxiv.org/abs/2504.00454

Authors: Yongze Li, Ning Li, Ajian Liu, Hui Ma, Liying Yang, Xihong Chen, Zhiyao Liang, Yanyan Liang, Jun Wan, Zhen Lei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Facial recognition systems, Facial recognition, unified attack detection, printed photos, digital face attacks

Comments: 12 pages, 5 figures

Abstract:Facial recognition systems are vulnerable to physical (e.g., printed photos) and digital (e.g., DeepFake) face attacks. Existing methods struggle to simultaneously detect physical and digital attacks due to: 1) significant intra-class variations between these attack types, and 2) the inadequacy of spatial information alone to comprehensively capture live and fake cues. To address these issues, we propose a unified attack detection model termed Frequency-Aware and Attack-Agnostic CLIP (FA^{3}-CLIP), which introduces attack-agnostic prompt learning to express generic live and fake cues derived from the fusion of spatial and frequency features, enabling unified detection of live faces and all categories of attacks. Specifically, the attack-agnostic prompt module generates generic live and fake prompts within the language branch to extract corresponding generic representations from both live and fake faces, guiding the model to learn a unified feature space for unified attack detection. Meanwhile, the module adaptively generates the live/fake conditional bias from the original spatial and frequency information to optimize the generic prompts accordingly, reducing the impact of intra-class variations. We further propose a dual-stream cues fusion framework in the vision branch, which leverages frequency information to complement subtle cues that are difficult to capture in the spatial domain. In addition, a frequency compression block is utilized in the frequency stream, which reduces redundancy in frequency features while preserving the diversity of crucial cues. We also establish new challenging protocols to facilitate unified face attack detection effectiveness. Experimental results demonstrate that the proposed method significantly improves performance in detecting physical and digital face attacks, achieving state-of-the-art results.

73. 【2504.00438】Suite-IN++: A FlexiWear BodyNet Integrating Global and Local Motion Features from Apple Suite for Robust Inertial Navigation

Link: https://arxiv.org/abs/2504.00438

Authors: Lan Sun, Songpengcheng Xia, Jiarui Yang, Ling Pei

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: ecosystems comprising smartphones, established multi-device ecosystems, multi-device ecosystems comprising, comprising smartphones, technology has established

Comments: 15 pages, 10 figures

Abstract:The proliferation of wearable technology has established multi-device ecosystems comprising smartphones, smartwatches, and headphones as critical enablers for ubiquitous pedestrian localization. However, traditional pedestrian dead reckoning (PDR) struggles with diverse motion modes, while data-driven methods, despite improving accuracy, often lack robustness due to their reliance on a single-device setup. Therefore, a promising solution is to fully leverage existing wearable devices to form a flexiwear bodynet for robust and accurate pedestrian localization. This paper presents Suite-IN++, a deep learning framework for flexiwear bodynet-based pedestrian localization. Suite-IN++ integrates motion data from wearable devices on different body parts, using contrastive learning to separate global and local motion features. It fuses global features based on the data reliability of each device to capture overall motion trends and employs an attention mechanism to uncover cross-device correlations in local features, extracting motion details helpful for accurate localization. To evaluate our method, we construct a real-life flexiwear bodynet dataset, incorporating Apple Suite (iPhone, Apple Watch, and AirPods) across diverse walking modes and device configurations. Experimental results demonstrate that Suite-IN++ achieves superior localization accuracy and robustness, significantly outperforming state-of-the-art models in real-life pedestrian tracking scenarios.

74. 【2504.00437】ADGaussian: Generalizable Gaussian Splatting for Autonomous Driving with Multi-modal Inputs

Link: https://arxiv.org/abs/2504.00437

Authors: Qi Song, Chenghong Li, Haotong Lin, Sida Peng, Rui Huang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: street scene reconstruction, generalizable street scene, scene reconstruction, generalizable street, street scene

Comments: The project page can be found at [this https URL](https://maggiesong7.github.io/research/ADGaussian/)

Abstract:We present a novel approach, termed ADGaussian, for generalizable street scene reconstruction. The proposed method enables high-quality rendering from single-view input. Unlike prior Gaussian Splatting methods that primarily focus on geometry refinement, we emphasize the importance of joint optimization of image and depth features for accurate Gaussian prediction. To this end, we first incorporate sparse LiDAR depth as an additional input modality, formulating the Gaussian prediction process as a joint learning framework of visual information and geometric clue. Furthermore, we propose a multi-modal feature matching strategy coupled with a multi-scale Gaussian decoding model to enhance the joint refinement of multi-modal features, thereby enabling efficient multi-modal Gaussian learning. Extensive experiments on two large-scale autonomous driving datasets, Waymo and KITTI, demonstrate that our ADGaussian achieves state-of-the-art performance and exhibits superior zero-shot generalization capabilities in novel-view shifting.

75. 【2504.00432】DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding

Link: https://arxiv.org/abs/2504.00432

Authors: Chong Li, Jingyang Huo, Weikang Gong, Yanwei Fu, Xiangyang Xue, Jianfeng Feng

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Decoding visual experiences, visual experiences, brain activity, significant challenge, Decoding visual

Abstract:Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, these aspects are all essential and are processed through distinct pathways in the brain. Motivated by this, we propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components - semantic, spatial, and motion - then decodes each component separately before fusing them to reconstruct the video. This approach not only simplifies the complex task of video decoding by decomposing it into manageable sub-tasks, but also establishes a clearer connection between learned representations and their biological counterpart, as supported by ablation studies. Further, our experiments show significant improvements over previous state-of-the-art methods, achieving 82.4% accuracy for semantic classification, 70.6% accuracy in spatial consistency, a 0.212 cosine similarity for motion prediction, and 21.9% 50-way accuracy for video generation. Additionally, neural encoding analyses for semantic and spatial information align with the two-streams hypothesis, further validating the distinct roles of the ventral and dorsal pathways. Overall, DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding. Project page: this https URL.

76. 【2504.00431】Enhancing Fundus Image-based Glaucoma Screening via Dynamic Global-Local Feature Integration

Link: https://arxiv.org/abs/2504.00431

Authors: Yuzhuo Zhou, Chi Liu, Sheng Shen, Siyu Le, Liwen Yu, Sihan Ouyang, Zongyuan Ge

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical artificial intelligence, fundus image classifiers, artificial intelligence, ophthalmic diagnosis, advancements in medical

Abstract:With the advancements in medical artificial intelligence (AI), fundus image classifiers are increasingly being applied to assist in ophthalmic diagnosis. While existing classification models have achieved high accuracy on specific fundus datasets, they struggle to address real-world challenges such as variations in image quality across different imaging devices, discrepancies between training and testing images across different racial groups, and the uncertain boundaries due to the characteristics of glaucomatous cases. In this study, we aim to address the above challenges posed by image variations by highlighting the importance of incorporating comprehensive fundus image information, including the optic cup (OC) and optic disc (OD) regions, and other key image patches. Specifically, we propose a self-adaptive attention window that autonomously determines optimal boundaries for enhanced feature extraction. Additionally, we introduce a multi-head attention mechanism to effectively fuse global and local features via feature linear readout, improving the model's discriminative capability. Experimental results demonstrate that our method achieves superior accuracy and robustness in glaucoma classification.

77. 【2504.00430】Data Synthesis with Diverse Styles for Face Recognition via 3DMM-Guided Diffusion

Link: https://arxiv.org/abs/2504.00430

Authors: Yuxi Mi, Zhizhou Zhong, Yuge Huang, Qiuyang Yuan, Xuan Zhao, Jianqing Xu, Shouhong Ding, ShaoMing Wang, Rizen Guo, Shuigeng Zhou

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Identity-preserving face synthesis, face synthesis aims, substitute real-world data, Identity-preserving face, synthesis aims

Comments: CVPR 2025

Abstract:Identity-preserving face synthesis aims to generate synthetic face images of virtual subjects that can substitute real-world data for training face recognition models. While prior arts strive to create images with consistent identities and diverse styles, they face a trade-off between them. Identifying their limitation of treating style variation as subject-agnostic and observing that real-world persons actually have distinct, subject-specific styles, this paper introduces MorphFace, a diffusion-based face generator. The generator learns fine-grained facial styles, e.g., shape, pose and expression, from the renderings of a 3D morphable model (3DMM). It also learns identities from an off-the-shelf recognition model. To create virtual faces, the generator is conditioned on novel identities of unlabeled synthetic faces, and novel styles that are statistically sampled from a real-world prior distribution. The sampling especially accounts for both intra-subject variation and subject distinctiveness. A context blending strategy is employed to enhance the generator's responsiveness to identity and style conditions. Extensive experiments show that MorphFace outperforms the best prior arts in face recognition efficacy.

78. 【2504.00429】Unleashing the Power of Pre-trained Encoders for Universal Adversarial Attack Detection

Link: https://arxiv.org/abs/2504.00429

Authors: Yinghe Zhang, Chi Liu, Shuai Zhou, Sheng Shen, Peng Gui

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: injecting human-imperceptible perturbations, deep learning models, critical security threat, Adversarial attacks pose, pose a critical

Abstract:Adversarial attacks pose a critical security threat to real-world AI systems by injecting human-imperceptible perturbations into benign samples to induce misclassification in deep learning models. While existing detection methods, such as Bayesian uncertainty estimation and activation pattern analysis, have achieved progress through feature engineering, their reliance on handcrafted feature design and prior knowledge of attack patterns limits generalization capabilities and incurs high engineering costs. To address these limitations, this paper proposes a lightweight adversarial detection framework based on the large-scale pre-trained vision-language model CLIP. Departing from conventional adversarial feature characterization paradigms, we innovatively adopt an anomaly detection perspective. By jointly fine-tuning CLIP's dual visual-text encoders with trainable adapter networks and learnable prompts, we construct a compact representation space tailored for natural images. Notably, our detection architecture achieves substantial improvements in generalization capability across both known and unknown attack patterns compared to traditional methods, while significantly reducing training overhead. This study provides a novel technical pathway for establishing a parameter-efficient and attack-agnostic defense paradigm, markedly enhancing the robustness of vision systems against evolving adversarial threats.

79. 【2504.00421】Can LLMs Assist Computer Education? an Empirical Case Study of DeepSeek

Link: https://arxiv.org/abs/2504.00421

Authors: Dongfu Xiao, Chen Gao, Zhengquan Luo, Chi Liu, Sheng Shen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: empirical case study, study presents, case study, presents an empirical, empirical case

Abstract:This study presents an empirical case study to assess the efficacy and reliability of DeepSeek-V3, an emerging large language model, within the context of computer education. The evaluation employs both CCNA simulation questions and real-world inquiries concerning computer network security posed by Chinese network engineers. To ensure a thorough evaluation, diverse dimensions are considered, encompassing role dependency, cross-linguistic proficiency, and answer reproducibility, accompanied by statistical analysis. The findings demonstrate that the model performs consistently, regardless of whether prompts include a role definition or not. In addition, its adaptability across languages is confirmed by maintaining stable accuracy in both original and translated datasets. A distinct contrast emerges between its performance on lower-order factual recall tasks and higher-order reasoning exercises, which underscores its strengths in retrieving information and its limitations in complex analytical tasks. Although DeepSeek-V3 offers considerable practical value for network security education, challenges remain in its capability to process multimodal data and address highly intricate topics. These results provide valuable insights for future refinement of large language models in specialized professional environments.

80. 【2504.00420】Think Small, Act Big: Primitive Prompt Learning for Lifelong Robot Manipulation

Link: https://arxiv.org/abs/2504.00420

Authors: Yuanqi Yao, Siao Liu, Haoming Song, Delin Qu, Qizhi Chen, Yan Ding, Bin Zhao, Zhigang Wang, Xuelong Li, Dong Wang

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains significantly challenging, acquisition remains significantly, effectively leverage prior, continuous skill acquisition, skill acquisition remains

Comments: Accepted to CVPR 2025

Abstract:Building a lifelong robot that can effectively leverage prior knowledge for continuous skill acquisition remains significantly challenging. Despite the success of experience replay and parameter-efficient methods in alleviating the catastrophic forgetting problem, naively applying these methods causes a failure to leverage the shared primitives between skills. To tackle these issues, we propose Primitive Prompt Learning (PPL), to achieve lifelong robot manipulation via reusable and extensible primitives. Within our two-stage learning scheme, we first learn a set of primitive prompts to represent shared primitives through a multi-skill pre-training stage, where motion-aware prompts are learned to capture semantic and motion shared primitives across different skills. Secondly, when acquiring new skills in lifelong span, new prompts are appended and optimized with frozen pretrained prompts, boosting the learning via knowledge transfer from old skills to new ones. For evaluation, we construct a large-scale skill dataset and conduct extensive experiments in both simulation and real-world tasks, demonstrating PPL's superior performance over state-of-the-art methods.

81. 【2504.00410】NCAP: Scene Text Image Super-Resolution with Non-CAtegorical Prior

Link: https://arxiv.org/abs/2504.00410

Authors: Dongwoo Park, Suk Pil Ko

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Scene text, resolution and quality, treated scene text, text image super-resolution, Scene text image

Comments: WACV 2025

Abstract:Scene text image super-resolution (STISR) enhances the resolution and quality of low-resolution images. Unlike previous studies that treated scene text images as natural images, recent methods using a text prior (TP), extracted from a pre-trained text recognizer, have shown strong performance. However, two major issues emerge: (1) Explicit categorical priors, like TP, can negatively impact STISR if incorrect. We reveal that these explicit priors are unstable and propose replacing them with Non-CAtegorical Prior (NCAP) using penultimate layer representations. (2) Pre-trained recognizers used to generate TP struggle with low-resolution images. To address this, most studies jointly train the recognizer with the STISR network to bridge the domain gap between low- and high-resolution images, but this can cause an overconfidence phenomenon in the prior modality. We highlight this issue and propose a method to mitigate it by mixing hard and soft labels. Experiments on the TextZoom dataset demonstrate an improvement by 3.5%, while our method significantly enhances generalization performance by 14.8% across four text recognition datasets. Our method generalizes to all TP-guided STISR networks.

82. 【2504.00401】Beyond Wide-Angle Images: Unsupervised Video Portrait Correction via Spatiotemporal Diffusion Adaptation

Link: https://arxiv.org/abs/2504.00401

Authors: Wenbo Nie, Lang Nie, Chunyu Lin, Jingwen Chen, Ke Xing, Jiyuan Wang, Yao Zhao

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: degrades visual appeal, lens-which degrades visual, distortion-induced facial stretching-especially, content creation, suffer from distortion-induced

Abstract:Wide-angle cameras, despite their popularity for content creation, suffer from distortion-induced facial stretching-especially at the edge of the lens-which degrades visual appeal. To address this issue, we propose an image portrait correction framework using diffusion models named ImagePD. It integrates the long-range awareness of transformer and multi-step denoising of diffusion models into a unified framework, achieving global structural robustness and local detail refinement. Besides, considering the high cost of obtaining video labels, we then repurpose ImagePD for unlabeled wide-angle videos (termed VideoPD), by spatiotemporal diffusion adaption with spatial consistency and temporal smoothness constraints. For the former, we encourage the denoised image to approximate pseudo labels following the wide-angle distortion distribution pattern, while for the latter, we derive rectification trajectories with backward optical flows and smooth them. Compared with ImagePD, VideoPD maintains high-quality facial corrections in space and mitigates the potential temporal shakes sequentially. Finally, to establish an evaluation benchmark and train the framework, we establish a video portrait dataset with a large diversity in people number, lighting conditions, and background. Experiments demonstrate that the proposed methods outperform existing solutions quantitatively and qualitatively, contributing to high-fidelity wide-angle videos with stable and natural portraits. The codes and dataset will be available.

83. 【2504.00400】Adaptive Low Light Enhancement via Joint Global-Local Illumination Adjustment

Link: https://arxiv.org/abs/2504.00400

Authors: Haodian Wang, Yaqi Song

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: uneven ambient lighting, conditions face significant, large dynamic range, low-light conditions face, face significant challenges

Abstract:Images captured under real-world low-light conditions face significant challenges due to uneven ambient lighting, making it difficult for existing end-to-end methods to enhance images with a large dynamic range to normal exposure levels. To address the above issue, we propose a novel brightness-adaptive enhancement framework designed to tackle the challenge of local exposure inconsistencies in real-world low-light images. Specifically, our proposed framework comprises two components: the Local Contrast Enhancement Network (LCEN) and the Global Illumination Guidance Network (GIGN). We introduce an early stopping mechanism in the LCEN and design a local discriminative module, which adaptively perceives the contrast of different areas in the image to control the premature termination of the enhancement process for patches with varying exposure levels. Additionally, within the GIGN, we design a global attention guidance module that effectively models global illumination by capturing long-range dependencies and contextual information within the image, which guides the local contrast enhancement network to significantly improve brightness across different regions. Finally, in order to coordinate the LCEN and GIGN, we design a novel training strategy to facilitate the training process. Experiments on multiple datasets demonstrate that our method achieves superior quantitative and qualitative results compared to state-of-the-art algorithms.

84. 【2504.00396】SPF-Portrait: Towards Pure Portrait Customization with Semantic Pollution-Free Fine-tuning

Link: https://arxiv.org/abs/2504.00396

Authors: Xiaole Xian, Zhichao Liao, Qingyu Li, Wenyu Qin, Pengfei Wan, Weicheng Xie, Long Zeng, Linlin Shen, Pingfa Feng

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing methods suffer, portrait datasets enables, prevents incremental learning, datasets enables attribute, Semantic Pollution

Abstract:While fine-tuning pre-trained Text-to-Image (T2I) models on portrait datasets enables attribute customization, existing methods suffer from Semantic Pollution that compromises the original model's behavior and prevents incremental learning. To address this, we propose SPF-Portrait, a pioneering work to purely understand customized semantics while eliminating semantic pollution in text-driven portrait customization. In our SPF-Portrait, we propose a dual-path pipeline that introduces the original model as a reference for the conventional fine-tuning path. Through contrastive learning, we ensure adaptation to target attributes and purposefully align other unrelated attributes with the original portrait. We introduce a novel Semantic-Aware Fine Control Map, which represents the precise response regions of the target semantics, to spatially guide the alignment process between the contrastive paths. This alignment process not only effectively preserves the performance of the original model but also avoids over-alignment. Furthermore, we propose a novel response enhancement mechanism to reinforce the performance of target attributes, while mitigating representation discrepancy inherent in direct cross-modal supervision. Extensive experiments demonstrate that SPF-Portrait achieves state-of-the-art performance.

85. 【2504.00394】AP-CAP: Advancing High-Quality Data Synthesis for Animal Pose Estimation via a Controllable Image Generation Pipeline

Link: https://arxiv.org/abs/2504.00394

Authors: Lei Wang, Yujie Zhong, Xiaopeng Sun, Jingchun Cheng, Chengjian Feng, Qiong Cao, Lin Ma, Zhaoxin Fan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Image Synthesis Strategy, Animal Image Synthesis, advancing deep learning, deep learning applications, animal behavior analysis

Abstract:The task of 2D animal pose estimation plays a crucial role in advancing deep learning applications in animal behavior analysis and ecological research. Despite notable progress in some existing approaches, our study reveals that the scarcity of high-quality datasets remains a significant bottleneck, limiting the full potential of current methods. To address this challenge, we propose a novel Controllable Image Generation Pipeline for synthesizing animal pose estimation data, termed AP-CAP. Within this pipeline, we introduce a Multi-Modal Animal Image Generation Model capable of producing images with expected poses. To enhance the quality and diversity of the generated data, we further propose three innovative strategies: (1) Modality-Fusion-Based Animal Image Synthesis Strategy to integrate multi-source appearance representations, (2) Pose-Adjustment-Based Animal Image Synthesis Strategy to dynamically capture diverse pose variations, and (3) Caption-Enhancement-Based Animal Image Synthesis Strategy to enrich visual semantic understanding. Leveraging the proposed model and strategies, we create the MPCH Dataset (Modality-Pose-Caption Hybrid), the first hybrid dataset that innovatively combines synthetic and real data, establishing the largest-scale multi-source heterogeneous benchmark repository for animal pose estimation to date. Extensive experiments demonstrate the superiority of our method in improving both the performance and generalization capability of animal pose estimators.

86. 【2504.00387】Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration

Link: https://arxiv.org/abs/2504.00387

Authors: Zilong Huang, Jun He, Junyan Ye, Lihan Jiang, Weijia Li, Yiping Chen, Ting Han

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: holds significant practical, significant practical importance, scenes holds significant, computer graphics, computer vision

Comments: CVPR 2025, 11 pages, 7 figures

Abstract:The reconstruction of immersive and realistic 3D scenes holds significant practical importance in various fields of computer vision and computer graphics. Typically, immersive and realistic scenes should be free from obstructions by dynamic objects, maintain global texture consistency, and allow for unrestricted exploration. The current mainstream methods for image-driven scene construction involve iteratively refining the initial image using a moving virtual camera to generate the scene. However, previous methods struggle with visual discontinuities due to global texture inconsistencies under varying camera poses, and they frequently exhibit scene voids caused by foreground-background occlusions. To this end, we propose a novel layered 3D scene reconstruction framework from a panoramic image, named Scene4U. Specifically, Scene4U integrates an open-vocabulary segmentation model with a large language model to decompose a real panorama into multiple layers. Then, we employ a layered repair module based on a diffusion model to restore occluded regions using visual cues and depth information, generating a hierarchical representation of the scene. The multi-layer panorama is then initialized as a 3D Gaussian Splatting representation, followed by layered optimization, which ultimately produces an immersive 3D scene with semantic and structural consistency that supports free exploration. Scene4U outperforms state-of-the-art methods, improving by 24.24% in LPIPS and 24.40% in BRISQUE, while also achieving the fastest training speed. Additionally, to demonstrate the robustness of Scene4U and allow users to experience immersive scenes from various landmarks, we build the WorldVista3D dataset for 3D scene reconstruction, which contains panoramic images of globally renowned sites. The implementation code and dataset will be released at this https URL.

87. 【2504.00385】Leveraging Contrast Information for Efficient Document Shadow Removal

Link: https://arxiv.org/abs/2504.00385

Authors: Yifan Liu, Jiancheng Huang, Na Liu, Mingfu Yan, Yi Huang, Shifeng Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: document shadow removal, shadow removal, Document, shadow, major obstacle

Abstract:Document shadows are a major obstacle in the digitization process. Due to the dense information in text and patterns covered by shadows, document shadow removal requires specialized methods. Existing document shadow removal methods, although showing some progress, still rely on additional information such as shadow masks or lack generalization and effectiveness across different shadow scenarios. This often results in incomplete shadow removal or loss of original document content and tones. Moreover, these methods tend to underutilize the information present in the original shadowed document image. In this paper, we refocus our approach on the document images themselves, which inherently contain rich information. We propose an end-to-end document shadow removal method guided by contrast representation, following a coarse-to-fine refinement approach. By extracting document contrast information, we can effectively and quickly locate shadow shapes and positions without the need for additional masks. This information is then integrated into the refined shadow removal process, providing better guidance for network-based removal and feature fusion. Extensive qualitative and quantitative experiments show that our method achieves state-of-the-art performance.

88. 【2504.00382】Intrinsic-feature-guided 3D Object Detection

Link: https://arxiv.org/abs/2504.00382

Authors: Wanjing Zhang, Chenxing Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving systems, essential for autonomous, driving systems, autonomous driving, detection

Abstract:LiDAR-based 3D object detection is essential for autonomous driving systems. However, LiDAR point clouds may exhibit sparsity, uneven distribution, and incomplete structures, significantly limiting the detection performance. In road driving environments, target objects such as vehicles, pedestrians and cyclists are well-suited for enhancing representation through complete template guidance, considering their grid and topological structures. Therefore, this paper presents an intrinsic-feature-guided 3D object detection method based on a template-assisted feature enhancement module, which extracts intrinsic features from relatively generalized templates and provides rich structural information for foreground objects. Furthermore, a proposal-level contrastive learning mechanism is designed to enhance the feature differences between foreground and background objects. The proposed modules can act as plug-and-play components and improve the performance of multiple existing methods. Extensive experiments illustrate that the proposed method achieves highly competitive detection results. Code will be available at this https URL.

89. 【2504.00380】Hierarchical Flow Diffusion for Efficient Frame Interpolation

Link: https://arxiv.org/abs/2504.00380

Authors: Yang Hai, Guo Wang, Tan Su, Wenjie Jiang, Yinlin Hu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: video frame interpolation, large gap compared, frame interpolation, gap compared, compared to non-diffusion

Comments: Accepted by CVPR 2025

Abstract:Most recent diffusion-based methods still show a large gap compared to non-diffusion methods for video frame interpolation, in both accuracy and efficiency. Most of them formulate the problem as a denoising procedure in latent space directly, which is less effective because of the large latent space. We propose to model bilateral optical flow explicitly by hierarchical diffusion models, which have a much smaller search space in the denoising procedure. Based on the flow diffusion model, we then use a flow-guided image synthesizer to produce the final result. We train the flow diffusion model and the image synthesizer end to end. Our method achieves state-of-the-art accuracy and is more than 10 times faster than other diffusion-based methods. The project page is at: this https URL.

90. 【2504.00379】MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving

Link: https://arxiv.org/abs/2504.00379

Authors: Zhiyuan Zhang, Xiaofan Li, Zhihao Xu, Wenjie Peng, Zijian Zhou, Miaojing Shi, Shuangping Huang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autonomous driving visual, answer questions related, Autonomous driving, visual question answering, driving visual question

Comments: Accepted by CVPR 2025

Abstract:Autonomous driving visual question answering (AD-VQA) aims to answer questions related to perception, prediction, and planning based on given driving scene images, heavily relying on the model's spatial understanding capabilities. Prior works typically express spatial information through textual representations of coordinates, resulting in semantic gaps between visual coordinate representations and textual descriptions. This oversight hinders the accurate transmission of spatial information and increases the expressive burden. To address this, we propose a novel Marker-based Prompt learning framework (MPDrive), which represents spatial coordinates by concise visual markers, ensuring linguistic expressive consistency and enhancing the accuracy of both visual perception and spatial expression in AD-VQA. Specifically, we create marker images by employing a detection expert to overlay object regions with numerical labels, converting complex textual coordinate generation into straightforward text-based visual marker predictions. Moreover, we fuse original and marker images as scene-level features and integrate them with detection priors to derive instance-level features. By combining these features, we construct dual-granularity visual prompts that stimulate the LLM's spatial perception capabilities. Extensive experiments on the DriveLM and CODA-LM datasets show that MPDrive achieves state-of-the-art performance, particularly in cases requiring sophisticated spatial understanding.

91. 【2504.00375】CamoSAM2: Motion-Appearance Induced Auto-Refining Prompts for Video Camouflaged Object Detection

Link: https://arxiv.org/abs/2504.00375

Authors: Xin Zhang, Keren Fu, Qijun Zhao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: prompt-guided video foundation, drawing significant attention, video foundation model, video object segmentation, prompt-guided video

Comments: 10 pages, 5 figures

Abstract:The Segment Anything Model 2 (SAM2), a prompt-guided video foundation model, has performed remarkably in video object segmentation, drawing significant attention in the community. Due to the high similarity between camouflaged objects and their surroundings, which makes them difficult to distinguish even by the human eye, the application of SAM2 for automated segmentation in real-world scenarios faces challenges in camouflage perception and reliable prompt generation. To address these issues, we propose CamoSAM2, a motion-appearance prompt inducer (MAPI) and refinement framework to automatically generate and refine prompts for SAM2, enabling high-quality automatic detection and segmentation in the video camouflaged object detection (VCOD) task. Initially, we introduce a prompt inducer that simultaneously integrates motion and appearance cues to detect camouflaged objects, delivering more accurate initial predictions than existing methods. Subsequently, we propose a video-based adaptive multi-prompts refinement (AMPR) strategy tailored for SAM2, aimed at mitigating prompt error in initial coarse masks and further producing good prompts. Specifically, we introduce a novel three-step process to generate reliable prompts by camouflaged object determination, pivotal prompting frame selection, and multi-prompts formation. Extensive experiments conducted on two benchmark datasets demonstrate that our proposed model, CamoSAM2, significantly outperforms existing state-of-the-art methods, achieving increases of 8.0% and 10.1% in the mIoU metric. Additionally, our method achieves the fastest inference speed compared to current VCOD models.

92. 【2504.00370】Spatiotemporal Attention Learning Framework for Event-Driven Object Recognition

Link: https://arxiv.org/abs/2504.00370

Authors: Tiantian Xie, Pengpai Wang, Rosa H. M. Chan

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: biological neural systems, asynchronously capture local, capture local pixel-level, local pixel-level intensity, sparse event stream

Comments: 2025 IEEE NSENS

Abstract:Event-based vision sensors, inspired by biological neural systems, asynchronously capture local pixel-level intensity changes as a sparse event stream containing position, polarity, and timestamp information. These neuromorphic sensors offer significant advantages in dynamic range, latency, and power efficiency. Their working principle inherently addresses traditional camera limitations such as motion blur and redundant background information, making them particularly suitable for dynamic vision tasks. While recent works have proposed increasingly complex event-based architectures, the computational overhead and parameter complexity of these approaches limit their practical deployment. This paper presents a novel spatiotemporal learning framework for event-based object recognition, utilizing a VGG network enhanced with Convolutional Block Attention Module (CBAM). Our approach achieves comparable performance to state-of-the-art ResNet-based methods while reducing parameter count by 2.3% compared to the original VGG model. Specifically, it outperforms ResNet-based methods like MVF-Net, achieving the highest Top-1 accuracy of 76.4% (pretrained) and 71.3% (not pretrained) on CIFAR10-DVS, and 72.4% (not pretrained) on N-Caltech101. These results highlight the robustness of our method when pretrained weights are not used, making it suitable for scenarios where transfer learning is unavailable. Moreover, our approach reduces reliance on data augmentation. Experimental results on standard event-based datasets demonstrate the framework's efficiency and effectiveness for real-world applications.

93. 【2504.00356】Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation

Link: https://arxiv.org/abs/2504.00356

Authors: Ting Liu, Siyuan Li

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: made substantial progress, Recent advances, progress in aligning, aligning visual, visual and textual

Comments: accepted to CVPR2025

Abstract:Recent advances in zero-shot referring image segmentation (RIS), driven by models such as the Segment Anything Model (SAM) and CLIP, have made substantial progress in aligning visual and textual information. Despite these successes, the extraction of precise and high-quality mask region representations remains a critical challenge, limiting the full potential of RIS tasks. In this paper, we introduce a training-free, hybrid global-local feature extraction approach that integrates detailed mask-specific features with contextual information from the surrounding area, enhancing mask region representation. To further strengthen alignment between mask regions and referring expressions, we propose a spatial guidance augmentation strategy that improves spatial coherence, which is essential for accurately localizing described areas. By incorporating multiple spatial cues, this approach facilitates more robust and precise referring segmentation. Extensive experiments on standard RIS benchmarks demonstrate that our method significantly outperforms existing zero-shot RIS models, achieving substantial performance gains. We believe our approach advances RIS tasks and establishes a versatile framework for region-text alignment, offering broader implications for cross-modal understanding and interaction. Code is available at this https URL .

94. 【2504.00348】Transductive One-Shot Learning Meet Subspace Decomposition

Link: https://arxiv.org/abs/2504.00348

Authors: Kyle Stein, Andrew A. Mahyari, Guillermo Francia III, Eman El-Sheikh

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: adapting pretrained models, recognize newly introduced, One-shot learning focuses, unseen classes based, One-shot learning

Abstract:One-shot learning focuses on adapting pretrained models to recognize newly introduced and unseen classes based on a single labeled image. While variations of few-shot and zero-shot learning exist, one-shot learning remains a challenging yet crucial problem due to its ability to generalize knowledge to unseen classes from just one human-annotated image. In this paper, we introduce a transductive one-shot learning approach that employs subspace decomposition to utilize the information from labeled images in the support set and unlabeled images in the query set. These images are decomposed into a linear combination of latent variables representing primitives captured by smaller subspaces. By representing images in the query set as linear combinations of these latent primitives, we can propagate the label from a single image in the support set to query images that share similar combinations of primitives. Through a comprehensive quantitative analysis across various neural network feature extractors and datasets, we demonstrate that our approach can effectively generalize to novel classes from just one labeled image.

95. 【2504.00270】NeRF-Based defect detection

Link: https://arxiv.org/abs/2504.00270

Authors: Tianqi (Kirk) Ding, Dawei Xiang, Yijiashun Qi, Ze Yang, Zunduo Zhao, Tianyao Sun, Pengbin Feng, Haoyu Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Neural Radiance Fields, rapid growth, automation has highlighted, defect detection, large-scale machinery

Comments: 6 pages, 11 figures, 2025 2nd International Conference on Remote Sensing, Mapping and Image Processing (RSMIP 2025)

Abstract:The rapid growth of industrial automation has highlighted the need for precise and efficient defect detection in large-scale machinery. Traditional inspection techniques, involving manual procedures such as scaling tall structures for visual evaluation, are labor-intensive, subjective, and often hazardous. To overcome these challenges, this paper introduces an automated defect detection framework built on Neural Radiance Fields (NeRF) and the concept of digital twins. The system utilizes UAVs to capture images and reconstruct 3D models of machinery, producing both a standard reference model and a current-state model for comparison. Alignment of the models is achieved through the Iterative Closest Point (ICP) algorithm, enabling precise point cloud analysis to detect deviations that signify potential defects. By eliminating manual inspection, this method improves accuracy, enhances operational safety, and offers a scalable solution for defect detection. The proposed approach demonstrates great promise for reliable and efficient industrial applications.
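
The comparison step described above (detecting deviations between an aligned reference model and a current-state model) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the two point clouds have already been aligned (for example by ICP), and the function name `flag_deviations`, the brute-force nearest-neighbor search, and the distance threshold are all hypothetical choices made for illustration.

```python
import math

def flag_deviations(reference, current, threshold):
    """Return indices of points in `current` whose nearest reference point
    is farther away than `threshold`.

    Assumes both clouds are already aligned in the same coordinate frame.
    Brute-force search is O(n * m); a real pipeline would use a spatial index.
    """
    flagged = []
    for i, p in enumerate(current):
        nearest = min(math.dist(p, q) for q in reference)
        if nearest > threshold:
            flagged.append(i)
    return flagged

# Toy data: the third point of the current scan has drifted 0.5 units.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
current = [(0.0, 0.01, 0.0), (1.0, 0.0, 0.02), (2.0, 0.5, 0.0)]
flagged = flag_deviations(reference, current, threshold=0.1)
print(flagged)
```

In this toy case only the third point exceeds the threshold. How the threshold should be chosen in practice is not stated in the abstract.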

96. 【2504.00254】ElaLoRA: Elastic Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning

Link: https://arxiv.org/abs/2504.00254

Authors: Huandong Chang, Zicheng Ma, Mingyuan Ma, Zhenting Qi, Andrew Sabot, Hong Jiang, H. T. Kung

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: widely adopted technique, minimal parameter updates, large-scale pre-trained models, fine-tuning large-scale pre-trained, widely adopted

Abstract:Low-Rank Adaptation (LoRA) has become a widely adopted technique for fine-tuning large-scale pre-trained models with minimal parameter updates. However, existing methods rely on fixed ranks or focus solely on either rank pruning or expansion, failing to adapt ranks dynamically to match the importance of different layers during training. In this work, we propose ElaLoRA, an adaptive low-rank adaptation framework that dynamically prunes and expands ranks based on gradient-derived importance scores. To the best of our knowledge, ElaLoRA is the first method that enables both rank pruning and expansion during fine-tuning. Experiments across multiple benchmarks demonstrate that ElaLoRA consistently outperforms existing PEFT methods across different parameter budgets. Furthermore, our studies validate that layers receiving higher rank allocations contribute more significantly to model performance, providing theoretical justification for our adaptive strategy. By introducing a principled and adaptive rank allocation mechanism, ElaLoRA offers a scalable and efficient fine-tuning solution, particularly suited for resource-constrained environments.

97. 【2504.00247】MultiMorph: On-demand Atlas Construction

Link: https://arxiv.org/abs/2504.00247

Authors: S. Mazdak Abulnaga, Andrew Hoopes, Neel Dey, Malte Hoffmann, Marianne Rakic, Bruce Fischl, John Guttag, Adrian Dalca

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: constructing anatomical atlases, fast and efficient, constructing anatomical, efficient method, MultiMorph

Comments: accepted to CVPR 2025

Abstract:We present MultiMorph, a fast and efficient method for constructing anatomical atlases on the fly. Atlases capture the canonical structure of a collection of images and are essential for quantifying anatomical variability across populations. However, current atlas construction methods often require days to weeks of computation, thereby discouraging rapid experimentation. As a result, many scientific studies rely on suboptimal, precomputed atlases from mismatched populations, negatively impacting downstream analyses. MultiMorph addresses these challenges with a feedforward model that rapidly produces high-quality, population-specific atlases in a single forward pass for any 3D brain dataset, without any fine-tuning or optimization. MultiMorph is based on a linear group-interaction layer that aggregates and shares features within the group of input images. Further, by leveraging auxiliary synthetic data, MultiMorph generalizes to new imaging modalities and population groups at test-time. Experimentally, MultiMorph outperforms state-of-the-art optimization-based and learning-based atlas construction methods in both small and large population settings, with a 100-fold reduction in time. This makes MultiMorph an accessible framework for biomedical researchers without machine learning expertise, enabling rapid, high-quality atlas generation for diverse studies.

98. 【2504.00234】CBIL: Collective Behavior Imitation Learning for Fish from Real Videos

Link: https://arxiv.org/abs/2504.00234

Authors: Yifan Wu, Zhiyang Dou, Yuko Ishiwaka, Shun Ogawa, Yuke Lou, Wenping Wang, Lingjie Liu, Taku Komura

Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reproducing realistic collective, Reproducing realistic, realistic collective behaviors, formidable challenge, imitation learning

Abstract:Reproducing realistic collective behaviors presents a captivating yet formidable challenge. Traditional rule-based methods rely on hand-crafted principles, limiting motion diversity and realism in generated collective behaviors. Recent imitation learning methods learn from data but often require ground truth motion trajectories and struggle with authenticity, especially in high-density groups with erratic movements. In this paper, we present a scalable approach, Collective Behavior Imitation Learning (CBIL), for learning fish schooling behavior directly from videos, without relying on captured motion trajectories. Our method first leverages Video Representation Learning, where a Masked Video AutoEncoder (MVAE) extracts implicit states from video inputs in a self-supervised manner. The MVAE effectively maps 2D observations to implicit states that are compact and expressive for the following imitation learning stage. Then, we propose a novel adversarial imitation learning method to effectively capture complex movements of the schools of fish, allowing for efficient imitation of the distribution of motion patterns measured in the latent space. It also incorporates bio-inspired rewards alongside priors to regularize and stabilize training. Once trained, CBIL can be used for various animation tasks with the learned collective motion priors. We further show its effectiveness across different species. Finally, we demonstrate the application of our system in detecting abnormal fish behavior from in-the-wild videos.

99. 【2504.00221】GazeLLM: Multimodal LLMs incorporating Human Visual Attention

Link: https://arxiv.org/abs/2504.00221

Authors: Jun Rekimoto

Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Language Models, Multimodal LLMs, advancing into Multimodal, Large Language

Abstract:Large Language Models (LLMs) are advancing into Multimodal LLMs (MLLMs), capable of processing image, audio, and video as well as text. Combined with first-person video, MLLMs show promising potential for understanding human activities through video and audio, enabling many human-computer interaction and human-augmentation applications such as human activity support, real-world agents, and skill transfer to robots or other individuals. However, handling high-resolution, long-duration videos generates large latent representations, leading to substantial memory and processing demands, limiting the length and resolution MLLMs can manage. Reducing video resolution can lower memory usage but often compromises comprehension. This paper introduces a method that optimizes first-person video analysis by integrating eye-tracking data, and proposes a method that decomposes first-person vision video into sub-areas for regions of gaze focus. By processing these selectively gaze-focused inputs, our approach achieves task comprehension equivalent to or even better than processing the entire image at full resolution, but with significantly reduced video data input (reducing the number of pixels to one-tenth), offering an efficient solution for using MLLMs to interpret and utilize human skills.
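
To make the one-tenth pixel budget concrete, here is a minimal sketch of a gaze-centered crop. It is an illustration under stated assumptions, not the paper's method: the abstract does not say how the sub-areas are shaped, so a single rectangle with the frame's aspect ratio is assumed, and the function name `gaze_crop_box` and its parameters are hypothetical.

```python
import math

def gaze_crop_box(width, height, gaze_x, gaze_y, area_fraction=0.1):
    """Return (left, top, right, bottom) of a crop around the gaze point.

    The crop keeps the frame's aspect ratio and covers roughly
    `area_fraction` of the frame's pixels, so each side is scaled by
    sqrt(area_fraction). The box is shifted to stay inside the frame.
    """
    scale = math.sqrt(area_fraction)
    crop_w = max(1, round(width * scale))
    crop_h = max(1, round(height * scale))
    left = min(max(gaze_x - crop_w // 2, 0), width - crop_w)
    top = min(max(gaze_y - crop_h // 2, 0), height - crop_h)
    return left, top, left + crop_w, top + crop_h

# Full HD frame with the gaze near the right edge.
box = gaze_crop_box(1920, 1080, gaze_x=1900, gaze_y=540)
crop_area = (box[2] - box[0]) * (box[3] - box[1])
ratio = crop_area / (1920 * 1080)
print(box, round(ratio, 3))
```

Keeping one-tenth of the pixels means keeping about 31.6% of each side, since the square root of 0.1 is roughly 0.316.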

100. 【2504.00220】Can Diffusion Models Disentangle? A Theoretical Perspective

Link: https://arxiv.org/abs/2504.00220

Authors: Liming Wang, Muhammad Jehanzeb Mirza, Yishu Gong, Yuan Gong, Jiaqi Zhang, Brian H. Tracey, Katerina Placek, Marco Vilela, James R. Glass

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: learn disentangled representations, paper presents, understanding how diffusion, disentangled representations, learn disentangled

Abstract:This paper presents a novel theoretical framework for understanding how diffusion models can learn disentangled representations. Within this framework, we establish identifiability conditions for general disentangled latent variable models, analyze training dynamics, and derive sample complexity bounds for disentangled latent subspace models. To validate our theory, we conduct disentanglement experiments across diverse tasks and modalities, including subspace recovery in latent subspace Gaussian mixture models, image colorization, image denoising, and voice conversion for speech classification. Additionally, our experiments show that training strategies inspired by our theory, such as style guidance regularization, consistently enhance disentanglement performance.

101. 【2504.00219】LITA-GS: Illumination-Agnostic Novel View Synthesis via Reference-Free 3D Gaussian Splatting and Physical Priors

Link: https://arxiv.org/abs/2504.00219

Authors: Han Zhou, Wei Dong, Jun Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Combining existing exposure, adverse illumination conditions, adverse illumination scenarios, illumination conditions exhibits, illumination scenarios fail

Comments: Accepted by CVPR 2025. 3DGS, Adverse illumination conditions, Reference-free, Physical priors

Abstract:Directly employing 3D Gaussian Splatting (3DGS) on images with adverse illumination conditions exhibits considerable difficulty in achieving high-quality, normally-exposed representations due to: (1) The limited Structure from Motion (SfM) points estimated in adverse illumination scenarios fail to capture sufficient scene details; (2) Without ground-truth references, the intensive information loss, significant noise, and color distortion pose substantial challenges for 3DGS to produce high-quality results; (3) Combining existing exposure correction methods with 3DGS does not achieve satisfactory performance due to their individual enhancement processes, which lead to the illumination inconsistency between enhanced images from different viewpoints. To address these issues, we propose LITA-GS, a novel illumination-agnostic novel view synthesis method via reference-free 3DGS and physical priors. Firstly, we introduce an illumination-invariant physical prior extraction pipeline. Secondly, based on the extracted robust spatial structure prior, we develop the lighting-agnostic structure rendering strategy, which facilitates the optimization of the scene structure and object appearance. Moreover, a progressive denoising module is introduced to effectively mitigate the noise within the light-invariant representation. We adopt the unsupervised strategy for the training of LITA-GS and extensive experiments demonstrate that LITA-GS surpasses the state-of-the-art (SOTA) NeRF-based method while enjoying faster inference speed and costing reduced training time. The code is released at this https URL.

102. 【2504.00204】RailGoerl24: Görlitz Rail Test Center CV Dataset 2024

Link: https://arxiv.org/abs/2504.00204

Authors: Rustam Tagiew (1), Ilkay Wunderlich (2), Mark Sastuba (1), Steffen Seitz (3) ((1) German Centre for Rail Traffic Research at the Federal Railway Authority, (2) EYYES GmbH, (3) Conrad Zuse School of Embedded Composite AI and the Chair of Fundamentals of Electrical Engineering of Dresden University of Technology)

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: things automatic detection, Driverless train operation, potential obstacles, mainline railways requires, open tracks

Comments: 4 pages, 5 figures, submitted to Engineering Reliable Autonomous Systems 2025

Abstract:Driverless train operation for open tracks on urban guided transport and mainline railways requires, among other things, automatic detection of actual and potential obstacles, especially humans, in the danger zone of the train's path. Machine learning algorithms have proven to be powerful state-of-the-art tools for this task. However, these algorithms require large amounts of high-quality annotated data containing human beings in railway-specific environments as training data. Unfortunately, the number of publicly available datasets is not yet sufficient and lags significantly behind that of the road domain. Therefore, this paper presents RailGoerl24, an on-board visual light Full HD camera dataset of 12205 frames recorded in a railway test center of TÜV SÜD Rail, in Görlitz, Germany. Its main purpose is to support the development of driverless train operation for guided transport. RailGoerl24 also includes a terrestrial LiDAR scan covering parts of the area used to acquire the RGB data. In addition to the raw data, the dataset contains 33556 boxwise annotations in total for the object class 'person'. The faces of recorded actors are not blurred or altered in any other way. RailGoerl24, soon available at this http URL, can also be used for tasks beyond collision prediction.

103. 【2504.00200】SmartScan: An AI-based Interactive Framework for Automated Region Extraction from Satellite Images

Link: https://arxiv.org/abs/2504.00200

Authors: Savinay Nagendra, Kashif Rashid

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: continuous methane monitoring, methane monitoring system, monitoring system requires, system requires determining, continuous methane

Abstract:The deployment of a continuous methane monitoring system requires determining the optimal number and placement of fixed sensors. However, planning is labor-intensive, requiring extensive site setup and iteration to meet client restrictions. This challenge is amplified when evaluating multiple sites, limiting scalability. To address this, we introduce SmartScan, an AI framework that automates data extraction for optimal sensor placement. SmartScan identifies subspaces of interest from satellite images using an interactive tool to create facility-specific constraint sets efficiently. SmartScan leverages the Segment Anything Model (SAM), a prompt-based transformer for zero-shot segmentation, enabling subspace extraction without explicit training. It operates in two modes: (1) Data Curation Mode, where satellite images are processed to extract high-quality subspaces using an interactive prompting system for SAM, and (2) Autonomous Mode, where user-curated prompts train a deep learning network to replace manual prompting, fully automating subspace extraction. The interactive tool also serves for quality control, allowing users to refine AI-generated outputs and generate additional constraint sets as needed. With its AI-driven prompting mechanism, SmartScan delivers high-throughput, high-quality subspace extraction with minimal human intervention, enhancing scalability and efficiency. Notably, its adaptable design makes it suitable for extracting regions of interest from ultra-high-resolution satellite imagery across various domains.

104. 【2504.00191】Leveraging Diffusion Model and Image Foundation Model for Improved Correspondence Matching in Coronary Angiography

Link: https://arxiv.org/abs/2504.00191

Authors: Lin Zhao, Xin Yu, Yikang Liu, Xiao Chen, Eric Z. Chen, Terrence Chen, Shanhui Sun

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: coronary artery disease, coronary artery structures, Accurate correspondence matching, Computed Tomography Angiography, artery disease

Abstract:Accurate correspondence matching in coronary angiography images is crucial for reconstructing 3D coronary artery structures, which is essential for precise diagnosis and treatment planning of coronary artery disease (CAD). Traditional matching methods for natural images often fail to generalize to X-ray images due to inherent differences such as lack of texture, lower contrast, and overlapping structures, compounded by insufficient training data. To address these challenges, we propose a novel pipeline that generates realistic paired coronary angiography images using a diffusion model conditioned on 2D projections of 3D reconstructed meshes from Coronary Computed Tomography Angiography (CCTA), providing high-quality synthetic data for training. Additionally, we employ large-scale image foundation models to guide feature aggregation, enhancing correspondence matching accuracy by focusing on semantically relevant regions and keypoints. Our approach demonstrates superior matching performance on synthetic datasets and effectively generalizes to real-world datasets, offering a practical solution for this task. Furthermore, our work investigates the efficacy of different foundation models in correspondence matching, providing novel insights into leveraging advanced image foundation models for medical imaging applications.

105. 【2504.00185】Self-Evolving Visual Concept Library using Vision-Language Critics

Link: https://arxiv.org/abs/2504.00185

Authors: Atharva Sehgal, Patrick Yuan, Ziniu Hu, Yisong Yue, Jennifer J. Sun, Swarat Chaudhuri

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: concept library, study the problem, ESCHER, concept, visual

Comments: CVPR camera ready

Abstract:We study the problem of building a visual concept library for visual recognition. Building effective visual concept libraries is challenging, as manual definition is labor-intensive, while relying solely on LLMs for concept generation can result in concepts that lack discriminative power or fail to account for the complex interactions between them. Our approach, ESCHER, takes a library learning perspective to iteratively discover and improve visual concepts. ESCHER uses a vision-language model (VLM) as a critic to iteratively refine the concept library, including accounting for interactions between concepts and how they affect downstream classifiers. By leveraging the in-context learning abilities of LLMs and the history of performance using various concepts, ESCHER dynamically improves its concept generation strategy based on the VLM critic's feedback. Finally, ESCHER does not require any human annotations, and is thus an automated plug-and-play framework. We empirically demonstrate the ability of ESCHER to learn a concept library for zero-shot, few-shot, and fine-tuning visual classification tasks. This work represents, to our knowledge, the first application of concept library learning to real-world visual tasks.

106. 【2504.00161】SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance

Link: https://arxiv.org/abs/2504.00161

Authors: Suzanne Stathatos, Michael Hobley, Markus Marks, Pietro Perona

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Foundation models excel, Foundation models, fail in low, underwater sonar, introduce Spatiotemporal Augmentations

Comments: Project page: [this https URL](https://suzanne-stathatos.github.io/SAVeD) Code page: [this https URL](https://github.com/suzanne-stathatos/SAVeD)

Abstract:Foundation models excel at vision tasks in natural images but fail in low signal-to-noise ratio (SNR) videos, such as underwater sonar, ultrasound, and microscopy. We introduce Spatiotemporal Augmentations and denoising in Video for Downstream Tasks (SAVeD), a self-supervised method that denoises low-SNR sensor videos and is trained using only the raw noisy data. By leveraging differences in foreground and background motion, SAVeD enhances object visibility using an encoder-decoder with a temporal bottleneck. Our approach improves classification, detection, tracking, and counting, outperforming state-of-the-art video denoising methods with lower resource requirements.

107. 【2504.00159】SonarSplat: Novel View Synthesis of Imaging Sonar via Gaussian Splatting

Link: https://arxiv.org/abs/2504.00159

Authors: Advaith V. Sethuraman, Max Rucker, Onur Bagoren, Pou-Chun Kung, Nibarkavi N.B. Amutha, Katherine A. Skinner

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian splatting framework, acoustic streaking phenomena, Gaussian splatting, realistic novel view, streaking phenomena

Abstract:In this paper, we present SonarSplat, a novel Gaussian splatting framework for imaging sonar that demonstrates realistic novel view synthesis and models acoustic streaking phenomena. Our method represents the scene as a set of 3D Gaussians with acoustic reflectance and saturation properties. We develop a novel method to efficiently rasterize learned Gaussians to produce a range/azimuth image that is faithful to the acoustic image formation model of imaging sonar. In particular, we develop a novel approach to model azimuth streaking in a Gaussian splatting framework. We evaluate SonarSplat using real-world datasets of sonar images collected from an underwater robotic platform in a controlled test tank and in a real-world river environment. Compared to the state-of-the-art, SonarSplat offers improved image synthesis capabilities (+2.5 dB PSNR). We also demonstrate that SonarSplat can be leveraged for azimuth streak removal and 3D scene reconstruction.

108. 【2504.00150】Few-Shot Generation of Brain Tumors for Secure and Fair Data Sharing

Link: https://arxiv.org/abs/2504.00150

Authors: Yongyi Shi, Ge Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Leveraging multi-center data, analytics presents challenges, presents challenges due, Leveraging multi-center, medical analytics presents

Comments: 17 pages, 4 figures

Abstract:Leveraging multi-center data for medical analytics presents challenges due to privacy concerns and data heterogeneity. While distributed approaches such as federated learning have gained traction, they remain vulnerable to privacy breaches, particularly in sensitive domains like medical imaging. Generative models, such as diffusion models, enhance privacy by synthesizing realistic data. However, they are prone to memorization, especially when trained on small datasets. This study proposes a decentralized few-shot generative model (DFGM) to synthesize brain tumor images while fully preserving privacy. DFGM harmonizes private tumor data with publicly shareable healthy images from multiple medical centers, constructing a new dataset by blending tumor foregrounds with healthy backgrounds. This approach ensures stringent privacy protection and enables controllable, high-quality synthesis by preserving both the healthy backgrounds and tumor foregrounds. We assess DFGM's effectiveness in brain tumor segmentation using a UNet, achieving Dice score improvements of 3.9% for data augmentation and 4.6% for fairness on a separate dataset.

109. 【2504.00149】Towards Precise Action Spotting: Addressing Temporal Misalignment in Labels with Dynamic Label Assignment

Link: https://arxiv.org/abs/2504.00149

Authors: Masato Tamura

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: attracted considerable attention, considerable attention due, Precise action spotting, Precise action, promising applications

Abstract:Precise action spotting has attracted considerable attention due to its promising applications. While existing methods achieve substantial performance by employing well-designed model architecture, they overlook a significant challenge: the temporal misalignment inherent in ground-truth labels. This misalignment arises when frames labeled as containing events do not align accurately with the actual event times, often as a result of human annotation errors or the inherent difficulties in precisely identifying event boundaries across neighboring frames. To tackle this issue, we propose a novel dynamic label assignment strategy that allows predictions to have temporal offsets from ground-truth action times during training, ensuring consistent event spotting. Our method extends the concept of minimum-cost matching, which is utilized in the spatial domain for object detection, to the temporal domain. By calculating matching costs based on predicted action class scores and temporal offsets, our method dynamically assigns labels to the most likely predictions, even when the predicted times of these predictions deviate from ground-truth times, alleviating the negative effects of temporal misalignment in labels. We conduct extensive experiments and demonstrate that our method achieves state-of-the-art performance, particularly in conditions where events are visually distinct and temporal misalignment in labels is common.
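
The idea of minimum-cost matching in the temporal domain can be sketched as follows. This is an illustrative toy, not the paper's formulation: the abstract only says that costs depend on predicted class scores and temporal offsets, so the specific cost `(1 - score) + offset_weight * |offset|`, the weight value, and the brute-force search are assumptions, and `assign_labels` is a hypothetical name.

```python
from itertools import permutations

def assign_labels(gt_events, predictions, offset_weight=0.1):
    """Match each ground-truth event to a distinct prediction at minimum total cost.

    gt_events:   list of (frame, class_id)
    predictions: list of (frame, {class_id: score})
    Assumed cost: (1 - score for the GT class) + offset_weight * |frame offset|.
    Requires at least as many predictions as ground-truth events.
    """
    def cost(gt, pred):
        gt_frame, gt_class = gt
        pred_frame, scores = pred
        return (1.0 - scores.get(gt_class, 0.0)) + offset_weight * abs(pred_frame - gt_frame)

    best_total, best_match = float("inf"), None
    # Brute force over all one-to-one assignments; only viable for toy sizes.
    for perm in permutations(range(len(predictions)), len(gt_events)):
        total = sum(cost(gt_events[i], predictions[j]) for i, j in enumerate(perm))
        if total < best_total:
            best_total, best_match = total, perm
    return list(best_match), best_total

# The label for class 0 sits at frame 10, but the confident prediction is at frame 12.
gt_events = [(10, 0), (50, 1)]
predictions = [
    (10, {0: 0.30, 1: 0.05}),
    (12, {0: 0.90, 1: 0.02}),
    (49, {0: 0.01, 1: 0.80}),
]
match, total = assign_labels(gt_events, predictions)
print(match, round(total, 2))
```

Here the first event is assigned to the confident prediction two frames away rather than to the weak prediction on the labeled frame, which is the behavior the abstract describes. A practical implementation would replace the brute-force loop with an assignment solver such as the Hungarian algorithm.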

110. 【2504.00139】SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection

Link: https://arxiv.org/abs/2504.00139

Authors: Yannick Burkhardt, Simon Schaefer, Stefan Leutenegger

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: highly optimized Visual, Visual SLAM systems, optimized Visual SLAM, holds significant potential, matching holds significant

Comments: In Review for ICCV25

Abstract:Event-based keypoint detection and matching holds significant potential, enabling the integration of event sensors into highly optimized Visual SLAM systems developed for frame cameras over decades of research. Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. Due to the absence of event datasets with ground truth keypoint labels, we leverage existing frame-based keypoint detectors on readily available event-aligned and synchronized gray-scale frames for self-supervision: we generate temporally sparse keypoint pseudo-labels considering that events are a product of both scene appearance and camera motion. Combined with our novel, information-rich event representation, we enable SuperEvent to effectively learn robust keypoint detection and description in event streams. Finally, we demonstrate the usefulness of SuperEvent by its integration into a modern sparse keypoint and descriptor-based SLAM framework originally developed for traditional cameras, surpassing the state-of-the-art in event-based SLAM by a wide margin. Source code and multimedia material are available at this http URL.

111. 【2504.00072】Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Link: https://arxiv.org/abs/2504.00072

Authors: Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: address the task, timeline into semantic, semantic units, units and generating, long video timeline

Comments: CVPR 2025 Camera ready. Project page: [this https URL](https://imagine.enpc.fr/~lucas.ventura/chapter-llama/)

Abstract:We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper, we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our 'Chapter-Llama' framework. Specifically, we leverage a pretrained large language model (LLM) with a large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Given the inefficiency of exhaustively captioning all frames, we propose a lightweight speech-guided frame selection strategy based on speech transcript content, and experimentally demonstrate remarkable advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing one-hour long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 45.3 vs 26.7 F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models at our project page.

112. 【2504.00060】CF-CAM: Gradient Perturbation Mitigation and Feature Stabilization for Reliable Interpretability

Link: https://arxiv.org/abs/2504.00060

Authors: Hongjie He, Xu Pan, Yudong Yao

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: network decision-making remains, Class Activation Mapping, deep learning continues, continues to advance, limiting trust

Comments: (none)

Abstract:As deep learning continues to advance, the opacity of neural network decision-making remains a critical challenge, limiting trust and applicability in high-stakes domains. Class Activation Mapping (CAM) techniques have emerged as a key approach to visualizing model decisions, yet existing methods face inherent trade-offs. Gradient-based CAM variants suffer from sensitivity to gradient perturbations, leading to unstable and unreliable explanations. Conversely, gradient-free approaches mitigate gradient instability but incur significant computational overhead and inference latency. To address these limitations, we propose Cluster Filter Class Activation Map (CF-CAM), a novel framework that reintroduces gradient-based weighting while enhancing robustness against gradient noise. CF-CAM employs a hierarchical importance weighting strategy to balance discriminative feature preservation and noise elimination. A density-aware channel clustering via Density-Based Spatial Clustering of Applications with Noise (DBSCAN) groups semantically relevant feature channels and discards noise-prone activations. Additionally, cluster-conditioned gradient filtering leverages bilateral filters to refine gradient signals, preserving edge-aware localization while suppressing noise impact. Experimental results demonstrate that CF-CAM achieves superior interpretability performance while maintaining resilience to gradient perturbations, outperforming state-of-the-art CAM methods in faithfulness and robustness. By effectively mitigating gradient instability without excessive computational cost, CF-CAM provides a reliable solution for enhancing the interpretability of deep neural networks in critical applications such as medical diagnosis and autonomous driving.

113. 【2504.00043】CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

Link: https://arxiv.org/abs/2504.00043

Authors: Jixuan Leng, Chengsong Huang, Langlin Huang, Bill Yuchen Lin, William W. Cohen, Haohan Wang, Jiaxin Huang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Large Vision-Language Models, Large Language, limited dynamic interplay, vision-language understanding capabilities

Comments: (none)

Abstract:Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly either assess text-based reasoning or vision-language understanding capabilities, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles, a task requiring multimodal adherence to semantic constraints from text-based clues and intersectional constraints from visual grid structures. CrossWordBench leverages a controllable puzzle generation framework that produces puzzles in multiple formats (text and image) and offers different evaluation strategies ranging from direct puzzle solving to interactive modes. Our extensive evaluation of over 20 models reveals that reasoning LLMs outperform non-reasoning models substantially by effectively leveraging crossing-letter constraints. We further demonstrate that LVLMs struggle with the task, showing a strong correlation between their puzzle-solving performance and grid-parsing accuracy. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.

114. 【2504.00037】ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models

Link: https://arxiv.org/abs/2504.00037

Authors: Guoyizhe Wei, Rama Chellappa

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: delivered remarkable progress, Vision Transformers, delivered remarkable, remarkable progress, progress through global

Comments: (none)

Abstract:Vision Transformers (ViTs) have delivered remarkable progress through global self-attention, yet their quadratic complexity can become prohibitive for high-resolution inputs. In this work, we present ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model. Our approach leverages 1) activation matching, an intermediate constraint that encourages the student to align its token-wise dependencies with those produced by the teacher, and 2) masked prediction, a contextual reconstruction objective that requires the student to predict the teacher's representations for unseen (masked) tokens, to effectively distill the quadratic self-attention knowledge into the student while maintaining efficient complexity. Empirically, our method provides notable speedups, particularly for high-resolution tasks, significantly addressing the hardware challenges in inference. It also elevates Mamba-based architectures' performance on standard vision benchmarks, achieving a competitive 84.3% top-1 accuracy on ImageNet with a base-sized model. Our results underscore the good potential of RNN-based solutions for large-scale visual tasks, bridging the gap between theoretical efficiency and real-world practice.

115. 【2504.00032】Skeletonization Quality Evaluation: Geometric Metrics for Point Cloud Analysis in Robotics

Link: https://arxiv.org/abs/2504.00032

Authors: Qingmeng Wen, Yu-Kun Lai, Ze Ji, Seyed Amir Tafrishi

Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Robotics (cs.RO)

Keywords: inherent instinct, instinct to understand, shape analysis, object morphology, powerful tool

Comments: 15 pages, 12 figures, under-review

Abstract:Skeletonization is a powerful tool for shape analysis, rooted in the inherent instinct to understand an object's morphology. It has found applications across various domains, including robotics. Although skeletonization algorithms have been studied in recent years, their performance is rarely quantified with detailed numerical evaluations. This work focuses on defining and quantifying geometric properties to systematically score the skeletonization results of point cloud shapes across multiple aspects, including topological similarity, boundedness, centeredness, and smoothness. We introduce these representative metric definitions along with a numerical scoring framework to analyze skeletonization outcomes concerning point cloud data for different scenarios, from object manipulation to mobile robot navigation. Additionally, we provide an open-source tool to enable the research community to evaluate and refine their skeleton models. Finally, we assess the performance and sensitivity of the proposed geometric evaluation methods from various robotic applications.

116. 【2504.00023】A Novel Distance-Based Metric for Quality Assessment in Image Segmentation

Link: https://arxiv.org/abs/2504.00023

Authors: Niklas Rottmayer, Claudia Redenbach

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: segmentation quality plays, range of applications, plays a fundamental, fundamental role, wide range

Comments: (none)

Abstract:The assessment of segmentation quality plays a fundamental role in the development, optimization, and comparison of segmentation methods which are used in a wide range of applications. With few exceptions, quality assessment is performed using traditional metrics, which are based on counting the number of erroneous pixels but do not capture the spatial distribution of errors. Established distance-based metrics such as the average Hausdorff distance are difficult to interpret and compare for different methods and datasets. In this paper, we introduce the Surface Consistency Coefficient (SCC), a novel distance-based quality metric that quantifies the spatial distribution of errors based on their proximity to the surface of the structure. Through a rigorous analysis using synthetic data and real segmentation results, we demonstrate the robustness and effectiveness of SCC in distinguishing errors near the surface from those further away. At the same time, SCC is easy to interpret and comparable across different structural contexts.
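The contrast the abstract draws between pixel-counting metrics and distance-based metrics can be illustrated with a small sketch. It does not implement SCC, whose definition is not given here. It only compares the Dice coefficient with one common variant of the average Hausdorff distance (the mean of the two directed averages, which is an assumption, since definitions vary) on toy pixel sets invented for the example.

```python
import math


def dice(pred, truth):
    """Dice coefficient between two sets of (row, col) foreground pixels."""
    if not pred and not truth:
        return 1.0
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def average_hausdorff(pred, truth):
    """Symmetric average Hausdorff distance between two non-empty pixel sets."""
    def directed(a, b):
        # Mean distance from each pixel in a to its nearest pixel in b.
        return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    return (directed(pred, truth) + directed(truth, pred)) / 2


# Toy data: a 4x4 square object and two predictions with one extra pixel each.
truth = {(r, c) for r in range(4) for c in range(4)}
near = truth | {(4, 0)}    # false positive touching the object
far = truth | {(20, 20)}   # false positive far from the object

dice_near, dice_far = dice(near, truth), dice(far, truth)
ahd_near, ahd_far = average_hausdorff(near, truth), average_hausdorff(far, truth)
```

Both predictions contain exactly one erroneous pixel, so their Dice scores are identical, while the distance-based value is much larger for the prediction whose error lies far from the object.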

117. 【2504.00017】Enhance Vision-based Tactile Sensors via Dynamic Illumination and Image Fusion

Link: https://arxiv.org/abs/2504.00017

Authors: Artemii Redkin, Zdravko Dugonjic, Mike Lambeta, Roberto Calandra

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Vision-based tactile sensors, Vision-based tactile, tactile sensors, elastomeric interface, structured light

Comments: 8 pages

Abstract:Vision-based tactile sensors use structured light to measure deformation in their elastomeric interface. Until now, vision-based tactile sensors such as DIGIT and GelSight have been using a single, static pattern of structured light tuned to the specific form factor of the sensor. In this work, we investigate the effectiveness of dynamic illumination patterns, in conjunction with image fusion techniques, to improve the quality of sensing of vision-based tactile sensors. Specifically, we propose to capture multiple measurements, each with a different illumination pattern, and then fuse them together to obtain a single, higher-quality measurement. Experimental results demonstrate that this type of dynamic illumination yields significant improvements in image contrast, sharpness, and background difference. This discovery opens the possibility of retroactively improving the sensing quality of existing vision-based tactile sensors with a simple software update, and for new hardware designs capable of fully exploiting dynamic illumination.

118. 【2503.24388】RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy

Link: https://arxiv.org/abs/2503.24388

Authors: Zhonghan Zhao, Wenwei Zhang, Haian Huang, Kuikun Liu, Jianfei Gao, Gaoang Wang, Kai Chen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: embodied agents operating, complex open-world environments, essential for embodied, operating in complex, complex open-world

Comments: (none)

Abstract:Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet, prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, limiting the learning efficiency and generalization of the policy. Thus, this paper makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG. To train RIG in an end-to-end manner, we construct a data pipeline that progressively integrates and enriches the content of imagination and reasoning in the trajectories collected from existing agents. The joint learning of reasoning and next image generation explicitly models the inherent correlation between reasoning, action, and dynamics of environments, and thus exhibits more than $17\times$ sample efficiency improvements and generalization in comparison with previous works. During inference, RIG first reasons about the next action, produces potential action, and then predicts the action outcomes, which offers the agent a chance to review and self-correct based on the imagination before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of generalist policy but also enables test-time scaling to enhance overall performance.

119. 【2503.22516】Assessing Foundation Models for Sea Ice Type Segmentation in Sentinel-1 SAR Imagery

Link: https://arxiv.org/abs/2503.22516

Authors: Samira Alkaee Taleghan, Morteza Karimzadeh, Andrew P. Barrett, Walter N. Meier, Farnoush Banaei-Kashani

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: sea ice, sea ice segmentation, sea ice conditions, polar climate processes, Accurate segmentation

Comments: (none)

Abstract:Accurate segmentation of sea ice types is essential for mapping and operational forecasting of sea ice conditions for safe navigation and resource extraction in ice-covered waters, as well as for understanding polar climate processes. While deep learning methods have shown promise in automating sea ice segmentation, they often rely on extensive labeled datasets which require expert knowledge and are time-consuming to create. Recently, foundation models (FMs) have shown excellent results for segmenting remote sensing images by utilizing pre-training on large datasets using self-supervised techniques. However, their effectiveness for sea ice segmentation remains unexplored, especially given sea ice's complex structures, seasonal changes, and unique spectral signatures, as well as peculiar Synthetic Aperture Radar (SAR) imagery characteristics including banding and scalloping noise, and varying ice backscatter characteristics, which are often missing in standard remote sensing pre-training datasets. In particular, SAR images over polar regions are acquired using different modes than used to capture the images at lower latitudes by the same sensors that form training datasets for FMs. This study evaluates ten remote sensing FMs for sea ice type segmentation using Sentinel-1 SAR imagery, focusing on their seasonal and spatial generalization. Among the selected models, Prithvi-600M outperforms the baseline models, while CROMA achieves a very similar performance in F1-score. Our contributions include offering a systematic methodology for selecting FMs for sea ice data analysis, a comprehensive benchmarking study on performances of FMs for sea ice segmentation with tailored performance metrics, and insights into existing gaps and future directions for improving domain-specific models in polar applications using SAR data.

120. 【2504.00702】Orientation Scores should be a Piece of Cake

Link: https://arxiv.org/abs/2504.00702

Authors: Finn M. Sherry, Chase van de Geijn, Erik J. Bekkers, Remco Duits

Categories: Differential Geometry (math.DG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: fast reconstruction property, minimise position-orientation uncertainty, position space, orientation space, orientation score

Comments: Submitted to the 7th International Conference on Geometric Science of Information

Abstract:We axiomatically derive a family of wavelets for an orientation score, lifting from position space $\mathbb{R}^2$ to position and orientation space $\mathbb{R}^2\times S^1$, with fast reconstruction property, that minimise position-orientation uncertainty. We subsequently show that these minimum uncertainty states are well-approximated by cake wavelets: for standard parameters, the uncertainty gap of cake wavelets is less than 1.1, and in the limit, we prove the uncertainty gap tends to the minimum of 1. Next, we complete a previous theoretical argument that one does not have to train the lifting layer in (PDE-)G-CNNs, but can instead use cake wavelets. Finally, we show experimentally that in this way we can reduce the network complexity and improve the interpretability of (PDE-)G-CNNs, with only a slight impact on the model's performance.

121. 【2504.00302】Deconver: A Deconvolutional Network for Medical Image Segmentation

Link: https://arxiv.org/abs/2504.00302

Authors: Pooya Ashtari, Shahryar Noei, Fateme Nateghi Haredasht, Jonathan H. Chen, Giuseppe Jurman, Aleksandra Pizurica, Sabine Van Huffel

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: face inherent limitations, local receptive fields, convolutional neural networks, advanced medical image, high computational complexity

Comments: 12 pages, 6 figures, 5 tables

Abstract:While convolutional neural networks (CNNs) and vision transformers (ViTs) have advanced medical image segmentation, they face inherent limitations such as local receptive fields in CNNs and high computational complexity in ViTs. This paper introduces Deconver, a novel network that integrates traditional deconvolution techniques from image restoration as a core learnable component within a U-shaped architecture. Deconver replaces computationally expensive attention mechanisms with efficient nonnegative deconvolution (NDC) operations, enabling the restoration of high-frequency details while suppressing artifacts. Key innovations include a backpropagation-friendly NDC layer based on a provably monotonic update rule and a parameter-efficient design. Evaluated across four datasets (ISLES'22, BraTS'23, GlaS, FIVES) covering both 2D and 3D segmentation tasks, Deconver achieves state-of-the-art performance in Dice scores and Hausdorff distance while reducing computational costs (FLOPs) by up to 90% compared to leading baselines. By bridging traditional image restoration with deep learning, this work offers a practical solution for high-precision segmentation in resource-constrained clinical workflows. The project is available at this https URL.

122. 【2504.00264】DiffDenoise: Self-Supervised Medical Image Denoising with Conditional Diffusion Models

Link: https://arxiv.org/abs/2504.00264

Authors: Basar Demir, Yikang Liu, Xiao Chen, Eric Z. Chen, Lin Zhao, Boris Mailhe, Terrence Chen, Shanhui Sun

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: recent years, proposed in recent, self-supervised denoising approaches, images, self-supervised denoising

Comments: (none)

Abstract:Many self-supervised denoising approaches have been proposed in recent years. However, these methods tend to overly smooth images, resulting in the loss of fine structures that are essential for medical applications. In this paper, we propose DiffDenoise, a powerful self-supervised denoising approach tailored for medical images, designed to preserve high-frequency details. Our approach comprises three stages. First, we train a diffusion model on noisy images, using the outputs of a pretrained Blind-Spot Network as conditioning inputs. Next, we introduce a novel stabilized reverse sampling technique, which generates clean images by averaging diffusion sampling outputs initialized with a pair of symmetric noises. Finally, we train a supervised denoising network using noisy images paired with the denoised outputs generated by the diffusion model. Our results demonstrate that DiffDenoise outperforms existing state-of-the-art methods in both synthetic and real-world medical image denoising tasks. We provide both a theoretical foundation and practical insights, demonstrating the method's effectiveness across various medical imaging modalities and anatomical structures.

123. 【2504.00189】Detecting Glioma, Meningioma, and Pituitary Tumors, and Normal Brain Tissues based on Yolov11 and Yolov8 Deep Learning Models

Link: https://arxiv.org/abs/2504.00189

Authors: Ahmed M. Taha, Salah A. Aly, Mohamed F. Darwish

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: optimal treatment planning, normal brain tissue, Accurate and quick, brain tissue Glioma, pituitary brain tumors

Comments: 6 pages, 7 figures, 8 tables

Abstract:Accurate and quick diagnosis of normal brain tissue, Glioma, Meningioma, and Pituitary Tumors is crucial for optimal treatment planning and improved medical results. Magnetic Resonance Imaging (MRI) is widely used as a non-invasive diagnostic tool for detecting brain abnormalities, including tumors. However, manual interpretation of MRI scans is often time-consuming, prone to human error, and dependent on highly specialized expertise. This paper proposes an advanced AI-driven technique for detecting glioma, meningioma, and pituitary brain tumors using YoloV11 and YoloV8 deep learning models. Methods: Using a transfer learning-based fine-tuning approach, we integrate cutting-edge deep learning techniques with medical imaging to classify brain tumors into four categories: No-Tumor, Glioma, Meningioma, and Pituitary Tumors. Results: The study utilizes the publicly accessible CE-MRI Figshare dataset and fine-tunes the pre-trained YoloV8 and YoloV11 models, which reach accuracies of 99.49% and 99.56%; a customized CNN reaches an accuracy of 96.98%. The results validate the potential of CNNs in achieving high precision in brain tumor detection and classification, highlighting their transformative role in medical imaging and diagnostics.

124. 【2504.00047】EAP4EMSIG -- Enhancing Event-Driven Microscopy for Microfluidic Single-Cell Analysis

Link: https://arxiv.org/abs/2504.00047

Authors: Nils Friederich, Angelo Jovin Yamachui Sitcheu, Annika Nassal, Erenus Yildiz, Matthias Pesch, Maximilian Beichter, Lukas Scholtes, Bahar Akbaba, Thomas Lautenschlager, Oliver Neumann, Dietrich Kohlheyer, Hanno Scharr, Johannes Seiffarth, Katharina Nöh, Ralf Mikut

Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Live-Cell Imaging yields, Microfluidic Live-Cell Imaging, microbial cell factories, Imaging yields data, Live-Cell Imaging

Comments: Submitted to: at - Automatisierungstechnik

Abstract:Microfluidic Live-Cell Imaging yields data on microbial cell factories. However, continuous acquisition is challenging as high-throughput experiments often lack real-time insights, delaying responses to stochastic events. We introduce three components in the Experiment Automation Pipeline for Event-Driven Microscopy to Smart Microfluidic Single-Cell Analysis: a fast, accurate Deep Learning autofocusing method predicting the focus offset, an evaluation of real-time segmentation methods and a real-time data analysis dashboard. Our autofocusing achieves a Mean Absolute Error of 0.0226 µm with inference times below 50 ms. Among eleven Deep Learning segmentation methods, Cellpose 3 reached a Panoptic Quality of 93.58%, while a distance-based method is fastest (121 ms, Panoptic Quality 93.02%). All six Deep Learning Foundation Models were unsuitable for real-time segmentation.
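For readers unfamiliar with the Panoptic Quality figures quoted above, the standard definition can be sketched as below. This is a generic illustration, not the evaluation code used in the paper. Segments are represented as sets of pixel ids, the IoU threshold of 0.5 follows the usual definition, and the example segments are invented.

```python
def iou(a, b):
    """Intersection over union of two non-empty sets of pixel ids."""
    return len(a & b) / len(a | b)


def panoptic_quality(pred_segments, true_segments):
    """PQ = sum of IoUs over matched pairs / (TP + 0.5 * FP + 0.5 * FN).

    A predicted and a true segment count as matched when their IoU exceeds 0.5.
    Segments within each list are assumed not to overlap each other.
    """
    matched_ious = []
    used = set()
    for t in true_segments:
        for i, p in enumerate(pred_segments):
            if i not in used and iou(p, t) > 0.5:
                matched_ious.append(iou(p, t))
                used.add(i)
                break
    tp = len(matched_ious)
    fp = len(pred_segments) - tp   # predictions without a matching true segment
    fn = len(true_segments) - tp   # true segments that were missed
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0


# Invented example: one good match, one missed cell, one spurious detection.
true_cells = [{1, 2, 3, 4}, {10, 11}]
pred_cells = [{1, 2, 3}, {20, 21}]
pq = panoptic_quality(pred_cells, true_cells)
```

The metric rewards both detecting the right objects and outlining them accurately, which is why it is a common choice for cell instance segmentation.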

125. 【2504.00026】Diffusion models applied to skin and oral cancer classification

Link: https://arxiv.org/abs/2504.00026

Authors: José J. M. Uliana, Renato A. Krohling

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Convolutional Neural Networks, Neural Networks, Convolutional Neural, oral lesions, diffusion models

Comments: (none)

Abstract:This study investigates the application of diffusion models in medical image classification (DiffMIC), focusing on skin and oral lesions. Utilizing the datasets PAD-UFES-20 for skin cancer and P-NDB-UFES for oral cancer, the diffusion model demonstrated competitive performance compared to state-of-the-art deep learning models like Convolutional Neural Networks (CNNs) and Transformers. Specifically, for the PAD-UFES-20 dataset, the model achieved a balanced accuracy of 0.6457 for six-class classification and 0.8357 for binary classification (cancer vs. non-cancer). For the P-NDB-UFES dataset, it attained a balanced accuracy of 0.9050. These results suggest that diffusion models are viable models for classifying medical images of skin and oral lesions. In addition, we investigate the robustness of the model trained on PAD-UFES-20 for skin cancer but tested on the clinical images of the HIBA dataset.
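Balanced accuracy, the metric reported in this abstract, is the mean of the per-class recalls. The sketch below shows why it is preferred over plain accuracy on imbalanced data. The labels are invented for illustration and are unrelated to the PAD-UFES-20 or P-NDB-UFES results.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Invented imbalanced example: a classifier that always predicts "benign".
y_true = ["cancer"] * 2 + ["benign"] * 8
y_pred = ["benign"] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

Here the plain accuracy is high because the majority class dominates, while the balanced accuracy exposes that the minority class is never detected.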

126. 【2504.00022】Autonomous AI for Multi-Pathology Detection in Chest X-Rays: A Multi-Site Study in the Indian Healthcare System

Link: https://arxiv.org/abs/2504.00022

Authors: Bargava Subramanian, Shajeev Jaikumar, Praveen Shastry, Naveen Kumarasami, Kalyan Sivasailam, Anandakumar D, Keerthana R, Mounigasri M, Kishore Prasath Venkatesh

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Study Design, study design includes, chest X-ray, outlines the development, vast dataset

Comments: 27 pages, 8 figures

Abstract:Study Design: The study outlines the development of an autonomous AI system for chest X-ray (CXR) interpretation, trained on a vast dataset of over 5 million X-rays sourced from healthcare systems across India. This AI system integrates advanced architectures including Vision Transformers, Faster R-CNN, and various U-Net models (such as Attention U-Net, U-Net++, and Dense U-Net) to enable comprehensive classification, detection, and segmentation of 75 distinct pathologies. To ensure robustness, the study design includes subgroup analyses across age, gender, and equipment type, validating the model's adaptability and performance across diverse patient demographics and imaging environments. Performance: The AI system achieved up to 98% precision and over 95% recall for multi-pathology classification, with stable performance across demographic and equipment subgroups. For normal vs. abnormal classification, it reached 99.8% precision, 99.6% recall, and 99.9% negative predictive value (NPV). It was deployed in 17 major healthcare systems in India including diagnostic centers, large hospitals, and government hospitals. Over the deployment period, the system processed over 150,000 scans, averaging 2,000 chest X-rays daily, resulting in reduced reporting times and improved diagnostic accuracy. Conclusion: The high precision and recall validate the AI's capability as a reliable tool for autonomous normal-abnormal classification, pathology localization, and segmentation. This scalable AI model addresses diagnostic gaps in underserved areas, optimizing radiology workflows and enhancing patient care across diverse healthcare settings in India.
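The three screening metrics quoted in this abstract (precision, recall, and negative predictive value) are all simple ratios over a binary confusion matrix. The sketch below shows the definitions. The counts are hypothetical and chosen only for illustration. They are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Precision, recall, and negative predictive value from confusion counts."""
    return {
        "precision": tp / (tp + fp),  # of scans flagged abnormal, share truly abnormal
        "recall": tp / (tp + fn),     # of truly abnormal scans, share flagged
        "npv": tn / (tn + fn),        # of scans called normal, share truly normal
    }


# Hypothetical counts for an abnormal-vs-normal screening task.
metrics = binary_metrics(tp=90, fp=10, tn=880, fn=20)
```

For autonomous reporting of normal scans, NPV is the figure that matters most, since it measures how often a scan called normal really is normal.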

MSC classes: 68T07
