本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新780篇论文,其中:

  • 自然语言处理244
  • 信息检索21
  • 计算机视觉174

自然语言处理

1. 【2505.18154】he Staircase of Ethics: Probing LLM Value Priorities through Multi-Step Induction to Complex Moral Dilemmas

链接https://arxiv.org/abs/2505.18154

作者:Ya Wu,Qiang Sheng,Danding Wang,Guang Yang,Yifan Sun,Zhengjia Wang,Yuyan Bu,Juan Cao

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:decision-support systems necessitates, moral reasoning capabilities, critical aspect, aspect of human, decision-support systems

备注: 25 pages, 8 figures

点击查看摘要

Abstract:Ethical decision-making is a critical aspect of human judgment, and the growing use of LLMs in decision-support systems necessitates a rigorous evaluation of their moral reasoning capabilities. However, existing assessments primarily rely on single-step evaluations, failing to capture how models adapt to evolving ethical challenges. Addressing this gap, we introduce the Multi-step Moral Dilemmas (MMDs), the first dataset specifically constructed to evaluate the evolving moral judgments of LLMs across 3,302 five-stage dilemmas. This framework enables a fine-grained, dynamic analysis of how LLMs adjust their moral reasoning across escalating dilemmas. Our evaluation of nine widely used LLMs reveals that their value preferences shift significantly as dilemmas progress, indicating that models recalibrate moral judgments based on scenario complexity. Furthermore, pairwise value comparisons demonstrate that while LLMs often prioritize the value of care, this value can sometimes be superseded by fairness in certain contexts, highlighting the dynamic and context-dependent nature of LLM ethical reasoning. Our findings call for a shift toward dynamic, context-aware evaluation paradigms, paving the way for more human-aligned and value-sensitive development of LLMs.

2. 【2505.18152】Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs

链接https://arxiv.org/abs/2505.18152

作者:Wafa Alghallabi,Ritesh Thawkar,Sara Ghaboura,Ketan More,Omkar Thawakar,Hisham Cholakkal,Salman Khan,Rao Muhammad Anwer

类目:Computation and Language (cs.CL)

关键词:deep historical continuity, Arabic poetry stands, Arabic poetry, culturally embedded forms, Arabic

备注: Github: [this https URL](https://github.com/mbzuai-oryx/FannOrFlop) , Dataset: [this https URL](https://huggingface.co/datasets/omkarthawakar/FannOrFlop)

点击查看摘要

Abstract:Arabic poetry stands as one of the most sophisticated and culturally embedded forms of expression in the Arabic language, known for its layered meanings, stylistic diversity, and deep historical continuity. Although large language models (LLMs) have demonstrated strong performance across languages and tasks, their ability to understand Arabic poetry remains largely unexplored. In this work, we introduce `Fann or Flop`, the first benchmark designed to assess the comprehension of Arabic poetry by LLMs in twelve historical eras, covering 21 core poetic genres and a variety of metrical forms, from classical structures to contemporary free verse. The benchmark comprises a curated corpus of poems with explanations that assess semantic understanding, metaphor interpretation, prosodic awareness, and cultural context. We argue that poetic comprehension offers a strong indicator for testing how good the LLM is in understanding classical Arabic through the Arabic poetry. Unlike surface-level tasks, this domain demands deeper interpretive reasoning and cultural sensitivity. Our evaluation of state-of-the-art LLMs shows that most models struggle with poetic understanding despite strong results on standard Arabic benchmarks. We release `Fann or Flop` along with the evaluation suite as an open-source resource to enable rigorous evaluation and advancement for Arabic language models. Code is available at: this https URL.

3. 【2505.18149】First Finish Search: Efficient Test-Time Scaling in Large Language Models

链接https://arxiv.org/abs/2505.18149

作者:Aradhye Agarwal,Ayan Sengupta,Tanmoy Chakraborty

类目:Computation and Language (cs.CL)

关键词:involves dynamic allocation, Test-time scaling, large language models, offers a promising, involves dynamic

备注

点击查看摘要

Abstract:Test-time scaling (TTS), which involves dynamic allocation of compute during inference, offers a promising way to improve reasoning in large language models. While existing TTS methods work well, they often rely on long decoding paths or require a large number of samples to be generated, increasing the token usage and inference latency. We observe the surprising fact that for reasoning tasks, shorter traces are much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches $n$ independent samples and returns as soon as any one completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B and Phi-4-Reasoning-Plus) and across four datasets (AIME24, AIME25-I, AIME25-II and GPQA Diamond). With DeepSeek-R1, FFS achieves $82.23\%$ accuracy on the AIME datasets, a $15\%$ improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini performance. Our theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal. The elegance and simplicity of FFS demonstrate that straightforward TTS strategies can perform remarkably well, revealing the untapped potential of simple approaches at inference time.

4. 【2505.18148】Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find

链接https://arxiv.org/abs/2505.18148

作者:Owen Bianchi,Mathew J. Koretsky,Maya Willey,Chelsea X. Alvarado,Tanay Nayak,Adi Asija,Nicole Kuznetsov,Mike A. Nalls,Faraz Faghri,Daniel Khashabi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large language models, face significant challenges, Large language, large pool, face significant

备注: Under Review

点击查看摘要

Abstract:Large language models (LLMs) face significant challenges with needle-in-a-haystack tasks, where relevant information ("the needle") must be drawn from a large pool of irrelevant context ("the haystack"). Previous studies have highlighted positional bias and distractor quantity as critical factors affecting model performance, yet the influence of gold context size has received little attention. We address this gap by systematically studying how variations in gold context length impact LLM performance on long-context question answering tasks. Our experiments reveal that LLM performance drops sharply when the gold context is shorter, i.e., smaller gold contexts consistently degrade model performance and amplify positional sensitivity, posing a major challenge for agentic systems that must integrate scattered, fine-grained information of varying lengths. This pattern holds across three diverse domains (general knowledge, biomedical reasoning, and mathematical reasoning) and seven state-of-the-art LLMs of various sizes and architectures. Our work provides clear insights to guide the design of robust, context-aware LLM-driven systems.

5. 【2505.18136】Graph-Linguistic Fusion: Using Language Models for Wikidata Vandalism Detection

链接https://arxiv.org/abs/2505.18136

作者:Mykola Trokhymovych,Lydia Pintscher,Ricardo Baeza-Yates,Diego Saez-Trumper

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:largest open-source structured, next-generation vandalism detection, introduce a next-generation, largest open-source, open-source structured knowledge

备注

点击查看摘要

Abstract:We introduce a next-generation vandalism detection system for Wikidata, one of the largest open-source structured knowledge bases on the Web. Wikidata is highly complex: its items incorporate an ever-expanding universe of factual triples and multilingual texts. While edits can alter both structured and textual content, our approach converts all edits into a single space using a method we call Graph2Text. This allows for evaluating all content changes for potential vandalism using a single multilingual language model. This unified approach improves coverage and simplifies maintenance. Experiments demonstrate that our solution outperforms the current production system. Additionally, we are releasing the code under an open license along with a large dataset of various human-generated knowledge alterations, enabling further research.

6. 【2505.18135】Gaming Tool Preferences in Agentic LLMs

链接https://arxiv.org/abs/2505.18135

作者:Kazem Faghih,Wenxiao Wang,Yize Cheng,Siddhant Bharti,Gaurang Sriramanan,Sriram Balasubramanian,Parsa Hosseini,Soheil Feizi

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

关键词:Large language models, Model Context Protocol, Large language, Context Protocol, Model Context

备注

点击查看摘要

Abstract:Large language models (LLMs) can now access a wide range of external tools, thanks to the Model Context Protocol (MCP). This greatly expands their abilities as various agents. However, LLMs rely entirely on the text descriptions of tools to decide which ones to use--a process that is surprisingly fragile. In this work, we expose a vulnerability in prevalent tool/function-calling protocols by investigating a series of edits to tool descriptions, some of which can drastically increase a tool's usage from LLMs when competing with alternatives. Through controlled experiments, we show that tools with properly edited descriptions receive over 10 times more usage from GPT-4.1 and Qwen2.5-7B than tools with original descriptions. We further evaluate how various edits to tool descriptions perform when competing directly with one another and how these trends generalize or differ across a broader set of 10 different models. These phenomenons, while giving developers a powerful way to promote their tools, underscore the need for a more reliable foundation for agentic LLMs to select and utilize tools and resources.

7. 【2505.18134】VideoGameBench: Can Vision-Language Models complete popular video games?

链接https://arxiv.org/abs/2505.18134

作者:Alex L. Zhang,Thomas L. Griffiths,Karthik R. Narasimhan,Ofir Press

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved strong results, spatial navigation, remains understudied, memory management, achieved strong

备注: 9 pages, 33 pages including supplementary

点击查看摘要

Abstract:Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing model, Gemini 2.5 Pro, completes only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.

8. 【2505.18129】One RL to See Them All: Visual Triple Unified Reinforcement Learning

链接https://arxiv.org/abs/2505.18129

作者:Yan Ma,Linge Du,Xuyang Shen,Shaoxiang Chen,Pengfei Li,Qibing Ren,Lizhuang Ma,Yuchao Dai,Pengfei Liu,Junjie Yan

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Reinforcement learning, Unified Reinforcement Learning, Reinforcement Learning system, Triple Unified Reinforcement, capabilities of vision-language

备注: Technical Report

点击查看摘要

Abstract:Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perceptionintensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises triple complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers) , and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Subsequently, Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at this https URL.

9. 【2505.18128】Frankentext: Stitching random text fragments into long-form narratives

链接https://arxiv.org/abs/2505.18128

作者:Chau Minh Pham,Jenna Russell,Dzung Pham,Mohit Iyyer

类目:Computation and Language (cs.CL)

关键词:long-form narratives produced, type of long-form, produced by LLMs, extreme constraint, copied verbatim

备注

点击查看摘要

Abstract:We introduce Frankentexts, a new type of long-form narratives produced by LLMs under the extreme constraint that most tokens (e.g., 90%) must be copied verbatim from human writings. This task presents a challenging test of controllable generation, requiring models to satisfy a writing prompt, integrate disparate text fragments, and still produce a coherent narrative. To generate Frankentexts, we instruct the model to produce a draft by selecting and combining human-written passages, then iteratively revise the draft while maintaining a user-specified copy ratio. We evaluate the resulting Frankentexts along three axes: writing quality, instruction adherence, and detectability. Gemini-2.5-Pro performs surprisingly well on this task: 81% of its Frankentexts are coherent and 100% relevant to the prompt. Notably, up to 59% of these outputs are misclassified as human-written by detectors like Pangram, revealing limitations in AI text detectors. Human annotators can sometimes identify Frankentexts through their abrupt tone shifts and inconsistent grammar between segments, especially in longer generations. Beyond presenting a challenging generation task, Frankentexts invite discussion on building effective detectors for this new grey zone of authorship, provide training data for mixed authorship detection, and serve as a sandbox for studying human-AI co-writing processes.

10. 【2505.18126】Reward Model Overoptimisation in Iterated RLHF

链接https://arxiv.org/abs/2505.18126

作者:Lorenz Wolf,Robert Kirk,Mirco Musolesi

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:aligning large language, Reinforcement learning, large language models, RLHF, widely used method

备注: 20 pages, 17 figures, 5 tables

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) is a widely used method for aligning large language models with human preferences. However, RLHF often suffers from reward model overoptimisation, in which models overfit to the reward function, resulting in non-generalisable policies that exploit the idiosyncrasies and peculiarities of the reward function. A common mitigation is iterated RLHF, in which reward models are repeatedly retrained with updated human feedback and policies are re-optimised. Despite its increasing adoption, the dynamics of overoptimisation in this setting remain poorly understood. In this work, we present the first comprehensive study of overoptimisation in iterated RLHF. We systematically analyse key design choices - how reward model training data is transferred across iterations, which reward function is used for optimisation, and how policies are initialised. Using the controlled AlpacaFarm benchmark, we observe that overoptimisation tends to decrease over successive iterations, as reward models increasingly approximate ground-truth preferences. However, performance gains diminish over time, and while reinitialising from the base policy is robust, it limits optimisation flexibility. Other initialisation strategies often fail to recover from early overoptimisation. These findings offer actionable insights for building more stable and generalisable RLHF pipelines.

11. 【2505.18125】abSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations

链接https://arxiv.org/abs/2505.18125

作者:Alan Arazi,Eilam Shapira,Roi Reichart

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:boosting decision trees, achieved remarkable success, gradient boosting decision, decision trees, achieved remarkable

备注

点击查看摘要

Abstract:While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees (GBDTs). However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free-text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Foundation Tabular Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes as input target tokens, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.

12. 【2505.18122】UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification

链接https://arxiv.org/abs/2505.18122

作者:Poojah Ganesan,Rajat Aayush Jha,Dan Roth,Vivek Gupta

类目:Computation and Language (cs.CL)

关键词:Recent advances, large language models, greatly improved, advances in large, large language

备注

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have greatly improved Text-to-SQL performance for single-table queries. But, it remains challenging in multi-table databases due to complex schema and relational operations. Existing methods often struggle with retrieving the right tables and columns, generating accurate JOINs and UNIONs, and generalizing across diverse schemas. To address these issues, we introduce UNJOIN, a two-stage framework that decouples the retrieval of schema elements from SQL logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. This allows the model to focus purely on accurate retrieval without being distracted by the need to write complex SQL logic. In the second stage, the SQL query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic. Evaluations on SPIDER and BIRD datasets show that UNJOIN matches or exceeds the state-of-the-art baselines. UNJOIN uses only schema information, which does not require data access or fine-tuning, making it scalable and adaptable across databases.

13. 【2505.18121】ProgRM: Build Better GUI Agents with Progress Rewards

链接https://arxiv.org/abs/2505.18121

作者:Danyang Zhang,Situo Zhang,Ziyue Yang,Zichen Zhu,Zihan Zhao,Ruisheng Cao,Lu Chen,Kai Yu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Graphical User Interface, Large Language Model, Large Language, Graphical User, User Interface

备注

点击查看摘要

Abstract:LLM-based (Large Language Model) GUI (Graphical User Interface) agents can potentially reshape our daily lives significantly. However, current LLM-based GUI agents suffer from the scarcity of high-quality training data owing to the difficulties of trajectory collection and reward annotation. Existing works have been exploring LLMs to collect trajectories for imitation learning or to offer reward signals for online RL training. However, the Outcome Reward Model (ORM) used in existing works cannot provide finegrained feedback and can over-penalize the valuable steps in finally failed trajectories. To this end, we propose Progress Reward Model (ProgRM) to provide dense informative intermediate rewards by predicting a task completion progress for each step in online training. To handle the challenge of progress reward label annotation, we further design an efficient LCS-based (Longest Common Subsequence) self-annotation algorithm to discover the key steps in trajectories and assign progress labels accordingly. ProgRM is evaluated with extensive experiments and analyses. Actors trained with ProgRM outperform leading proprietary LLMs and ORM-trained actors, illustrating the effectiveness of ProgRM. The codes for experiments will be made publicly available upon acceptance.

14. 【2505.18116】Bridging Supervised Learning and Reinforcement Learning in Math Reasoning

链接https://arxiv.org/abs/2505.18116

作者:Huayu Chen,Kaiwen Zheng,Qinsheng Zhang,Ganqu Cui,Yin Cui,Haotian Ye,Tsung-Yi Lin,Ming-Yu Liu,Jun Zhu,Haoxiang Wang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:binary verifier signals, Reinforcement Learning, verifier signals, played a central, central role

备注

点击查看摘要

Abstract:Reinforcement Learning (RL) has played a central role in the recent surge of LLMs' math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely due to its heavy reliance on reference answers and inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT) -- a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of throwing away self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit policy is parameterized with the same positive LLM we target to optimize on positive data, enabling direct policy optimization on all LLMs' generations. We conduct experiments on 7B and 32B models in math reasoning tasks. Results consistently show that through the additional leverage of negative feedback, NFT significantly improves over SL baselines like Rejection sampling Fine-Tuning, matching or even surpassing leading RL algorithms like GRPO and DAPO. Furthermore, we demonstrate that NFT and GRPO are actually equivalent in strict-on-policy training, even though they originate from entirely different theoretical foundations. Our experiments and theoretical findings bridge the gap between SL and RL methods in binary-feedback learning systems.

15. 【2505.18110】Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

链接https://arxiv.org/abs/2505.18110

作者:Zinuo Li,Xian Zhang,Yongxin Guo,Mohammed Bennamoun,Farid Boussaid,Girish Dwivedi,Luqi Gong,Qiuhong Ke

类目:Computation and Language (cs.CL)

关键词:Humans naturally understand, naturally understand moments, Humans naturally, auditory cues, naturally understand

备注

点击查看摘要

Abstract:Humans naturally understand moments in a video by integrating visual and auditory cues. For example, localizing a scene in the video like "A scientist passionately speaks on wildlife conservation as dramatic orchestral music plays, with the audience nodding and applauding" requires simultaneous processing of visual, audio, and speech signals. However, existing models often struggle to effectively fuse and interpret audio information, limiting their capacity for comprehensive video temporal understanding. To address this, we present TriSense, a triple-modality large language model designed for holistic video temporal understanding through the integration of visual, audio, and speech modalities. Central to TriSense is a Query-Based Connector that adaptively reweights modality contributions based on the input query, enabling robust performance under modality dropout and allowing flexible combinations of available inputs. To support TriSense's multimodal capabilities, we introduce TriSense-2M, a high-quality dataset of over 2 million curated samples generated via an automated pipeline powered by fine-tuned LLMs. TriSense-2M includes long-form videos and diverse modality combinations, facilitating broad generalization. Extensive experiments across multiple benchmarks demonstrate the effectiveness of TriSense and its potential to advance multimodal video analysis. Code and dataset will be publicly released.

16. 【2505.18105】ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework

链接https://arxiv.org/abs/2505.18105

作者:Lisheng Huang,Yichen Liu,Jinhao Jiang,Rongxiang Zhang,Jiahao Yan,Junyi Li,Wayne Xin Zhao

类目:Computation and Language (cs.CL)

关键词:large language models, web-augmented large language, exhibited strong performance, Recent advances, complex reasoning tasks

备注: LLM, Complex Search Benchmark

点击查看摘要

Abstract:Recent advances in web-augmented large language models (LLMs) have exhibited strong performance in complex reasoning tasks, yet these capabilities are mostly locked in proprietary systems with opaque architectures. In this work, we propose \textbf{ManuSearch}, a transparent and modular multi-agent framework designed to democratize deep search for LLMs. ManuSearch decomposes the search and reasoning process into three collaborative agents: (1) a solution planning agent that iteratively formulates sub-queries, (2) an Internet search agent that retrieves relevant documents via real-time web search, and (3) a structured webpage reading agent that extracts key evidence from raw web content. To rigorously evaluate deep reasoning abilities, we introduce \textbf{ORION}, a challenging benchmark focused on open-web reasoning over long-tail entities, covering both English and Chinese. Experimental results show that ManuSearch substantially outperforms prior open-source baselines and even surpasses leading closed-source systems. Our work paves the way for reproducible, extensible research in open deep search systems. We release the data and code in this https URL

17. 【2505.18102】How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

链接https://arxiv.org/abs/2505.18102

作者:Takashi Ishida,Thanawat Lodkaew,Ikko Yamane

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Methodology (stat.ME)

关键词:Internet risks contaminating, risks contaminating future, contaminating future LLMs, Publishing a large, Internet risks

备注

点击查看摘要

Abstract:Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy will require trust in a single organization and still permits test-set overfitting through repeated queries. To overcome this issue, we propose a way to publish benchmarks without completely disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. Our main idea is to inject randomness to the answers by preparing several logically correct answers, and only include one of them as the solution in the benchmark. This reduces the best possible accuracy, i.e., Bayes accuracy, of the benchmark. Not only is this helpful to keep us from disclosing the ground truth, but this approach also offers a test for detecting data contamination. In principle, even fully capable models should not surpass the Bayes accuracy. If a model surpasses this ceiling despite this expectation, this is a strong signal of data contamination. We present experimental evidence that our method can detect data contamination accurately on a wide range of benchmarks, models, and training methodologies.

18. 【2505.18098】Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

链接https://arxiv.org/abs/2505.18098

作者:Joey Hong,Anca Dragan,Sergey Levine

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:require additional long-horizon, additional long-horizon reasoning, negotiation and persuasion, require additional, Large language models

备注: 18 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Large language models (LLMs) excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to be trained in such manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents, that scales even to large API-based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, to plan effectively. In addition, these value functions are trained over reasoning steps rather than full actions, to be a concise and light-weight module that facilitates decision-making in multi-turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine-tuning and prompting methods while maintaining efficiency and scalability.

19. 【2505.18092】QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization

链接https://arxiv.org/abs/2505.18092

作者:Weizhou Shen,Chenliang Li,Fanqi Wan,Shengyi Liao,Shaopeng Lai,Bo Zhang,Yingcheng Shi,Yuning Wu,Gang Fu,Zhansheng Li,Bin Yang,Ji Zhang,Fei Huang,Jingren Zhou,Ming Yan

类目:Computation and Language (cs.CL)

关键词:long sequence processing, technical report presents, large language models, compression framework designed, addressing prohibitive computation

备注

点击查看摘要

Abstract:This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization, addressing prohibitive computation overhead during the prefill stage and the "lost in the middle" performance degradation of large language models (LLMs) during long sequence processing. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) Natural language-guided dynamic optimization, (2) Bidirectional reasoning layers for enhanced boundary awareness, (3) Token critic mechanisms with language modeling heads, and (4) Window-parallel inference. Comprehensive evaluations across five benchmarks (4K-2M word contexts) demonstrate QwenLong-CPRS's threefold effectiveness: (1) Consistent superiority over other context management methods like RAG and sparse attention in both accuracy and efficiency. (2) Architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieves 21.59$\times$ context compression alongside 19.15-point average performance gains; (3) Deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench, establishing new SOTA performance.

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2505.18092 [cs.CL]

(or
arXiv:2505.18092v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2505.18092

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
20. 【2505.18091】Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

链接https://arxiv.org/abs/2505.18091

作者:Xinran Gu,Kaifeng Lyu,Jiazheng Li,Jingzhao Zhang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, dense domain-specific knowledge, Language Models, Large Language, model size

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.

21. 【2505.18079】Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

链接https://arxiv.org/abs/2505.18079

作者:Xiaoyi Zhang,Zhaoyang Jia,Zongyu Guo,Jiahao Li,Bin Li,Houqiang Li,Yan Lu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:presents significant challenges, significant challenges due, extensive temporal-spatial complexity, Large Language Models, understanding presents significant

备注: Under review

点击查看摘要

Abstract:Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips. Different from previous video agents manually designing a rigid workflow, our approach emphasizes the autonomous nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools, formulates appropriate parameters for actions, and iteratively refines its internal reasoning in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, significantly surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code will be released later.

22. 【2505.18071】Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals

链接https://arxiv.org/abs/2505.18071

作者:Jia-Nan Li,Jian Guan,Wei Wu,Rui Yan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, demonstrated significant success, Large language, math and coding, demonstrated significant

备注

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated significant success in complex reasoning tasks such as math and coding. In contrast to these tasks where deductive reasoning predominates, inductive reasoning\textemdash the ability to derive general rules from incomplete evidence, remains underexplored. This paper investigates extended inductive reasoning in LLMs through the lens of personalized preference inference, a critical challenge in LLM alignment where current approaches struggle to capture diverse user preferences. The task demands strong inductive reasoning capabilities as user preferences are typically embedded implicitly across various interaction forms, requiring models to synthesize consistent preference patterns from scattered signals. We propose \textsc{AlignXplore}, a model that leverages extended reasoning chains to enable systematic preference inference from behavioral signals in users' interaction histories. We develop \textsc{AlignXplore} by combining cold-start training based on synthetic data with subsequent online reinforcement learning. Through extensive experiments, we demonstrate that \textsc{AlignXplore} achieves substantial improvements over the backbone model by an average of 11.05\% on in-domain and out-of-domain benchmarks, while maintaining strong generalization ability across different input formats and downstream models. Further analyses establish best practices for preference inference learning through systematic comparison of reward modeling strategies, while revealing the emergence of human-like inductive reasoning patterns during training.

23. 【2505.18056】MathEDU: Towards Adaptive Feedback for Student Mathematical Problem-Solving

链接https://arxiv.org/abs/2505.18056

作者:Wei-Ling Hsu,Yu-Chien Tang,An-Zi Yen

类目:Computation and Language (cs.CL)

关键词:Online learning enhances, enhances educational accessibility, learn anytime, Online learning, flexibility to learn

备注: Pre-print

点击查看摘要

Abstract:Online learning enhances educational accessibility, offering students the flexibility to learn anytime, anywhere. However, a key limitation is the lack of immediate, personalized feedback, particularly in helping students correct errors in math problem-solving. Several studies have investigated the applications of large language models (LLMs) in educational contexts. In this paper, we explore the capabilities of LLMs to assess students' math problem-solving processes and provide adaptive feedback. The MathEDU dataset is introduced, comprising authentic student solutions annotated with teacher feedback. We evaluate the model's ability to support personalized learning in two scenarios: one where the model has access to students' prior answer histories, and another simulating a cold-start context. Experimental results show that the fine-tuned model performs well in identifying correctness. However, the model still faces challenges in generating detailed feedback for pedagogical purposes.

24. 【2505.18040】Contrastive Distillation of Emotion Knowledge from LLMs for Zero-Shot Emotion Recognition

链接https://arxiv.org/abs/2505.18040

作者:Minxue Niu,Emily Mower Provost

类目:Computation and Language (cs.CL)

关键词:adaptable Emotion Recognition, building adaptable Emotion, Emotion Recognition, ability to handle, crucial for building

备注

点击查看摘要

Abstract:The ability to handle various emotion labels without dedicated training is crucial for building adaptable Emotion Recognition (ER) systems. Conventional ER models rely on training using fixed label sets and struggle to generalize beyond them. On the other hand, Large Language Models (LLMs) have shown strong zero-shot ER performance across diverse label spaces, but their scale limits their use on edge devices. In this work, we propose a contrastive distillation framework that transfers rich emotional knowledge from LLMs into a compact model without the use of human annotations. We use GPT-4 to generate descriptive emotion annotations, offering rich supervision beyond fixed label sets. By aligning text samples with emotion descriptors in a shared embedding space, our method enables zero-shot prediction on different emotion classes, granularity, and label schema. The distilled model is effective across multiple datasets and label spaces, outperforming strong baselines of similar size and approaching GPT-4's zero-shot performance, while being over 10,000 times smaller.

25. 【2505.18034】Structured Thinking Matters: Improving LLMs Generalization in Causal Inference Tasks

链接https://arxiv.org/abs/2505.18034

作者:Wentao Sun,Joao Paulo Nogueira,Alonso Silva

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:LLMs remain unreliable, causation from correlation, remarkable advances, remain unreliable, unreliable in distinguishing

备注

点击查看摘要

Abstract:Despite remarkable advances in the field, LLMs remain unreliable in distinguishing causation from correlation. Recent results from the Corr2Cause dataset benchmark reveal that state-of-the-art LLMs -- such as GPT-4 (F1 score: 29.08) -- only marginally outperform random baselines (Random Uniform, F1 score: 20.38), indicating limited capacity of generalization. To tackle this limitation, we propose a novel structured approach: rather than directly answering causal queries, we provide the model with the capability to structure its thinking by guiding the model to build a structured knowledge graph, systematically encoding the provided correlational premises, to answer the causal queries. This intermediate representation significantly enhances the model's causal capabilities. Experiments on the test subset of the Corr2Cause dataset benchmark with Qwen3-32B model (reasoning model) show substantial gains over standard direct prompting methods, improving F1 scores from 32.71 to 48.26 (over 47.5% relative increase), along with notable improvements in precision and recall. These results underscore the effectiveness of providing the model with the capability to structure its thinking and highlight its promising potential for broader generalization across diverse causal inference tasks.

26. 【2505.18011】raining with Pseudo-Code for Instruction Following

链接https://arxiv.org/abs/2505.18011

作者:Prince Kumar,Rudra Murthy,Riyaz Bhat,Danish Contractor

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, compositions are involved, Language Models, rapid progress

备注: Under Review

点击查看摘要

Abstract:Despite the rapid progress in the capabilities of Large Language Models (LLMs), they continue to have difficulty following relatively simple, unambiguous instructions, especially when compositions are involved. In this paper, we take inspiration from recent work that suggests that models may follow instructions better when they are expressed in pseudo-code. However, writing pseudo-code programs can be tedious and using few-shot demonstrations to craft code representations for use in inference can be unnatural for non-expert users of LLMs. To overcome these limitations, we propose fine-tuning LLMs with instruction-tuning data that additionally includes instructions re-expressed in pseudo-code along with the final response. We evaluate models trained using our method on $11$ publicly available benchmarks comprising of tasks related to instruction-following, mathematics, and common-sense reasoning. We conduct rigorous experiments with $5$ different models and find that not only do models follow instructions better when trained with pseudo-code, they also retain their capabilities on the other tasks related to mathematical and common sense reasoning. Specifically, we observe a relative gain of $3$--$19$% on instruction-following benchmark, and an average gain of upto 14% across all tasks.

27. 【2505.17998】RACE for Tracking the Emergence of Semantic Representations in Transformers

链接https://arxiv.org/abs/2505.17998

作者:Nura Aljaafari,Danilo S. Carvalho,André Freitas

类目:Computation and Language (cs.CL)

关键词:remain poorly understood, Modern transformer models, Modern transformer, transitions remain poorly, transformer models exhibit

备注

点击查看摘要

Abstract:Modern transformer models exhibit phase transitions during training, distinct shifts from memorisation to abstraction, but the mechanisms underlying these transitions remain poorly understood. Prior work has often focused on endpoint representations or isolated signals like curvature or mutual information, typically in symbolic or arithmetic domains, overlooking the emergence of linguistic structure. We introduce TRACE (Tracking Representation Abstraction and Compositional Emergence), a diagnostic framework combining geometric, informational, and linguistic signals to detect phase transitions in Transformer-based LMs. TRACE leverages a frame-semantic data generation method, ABSynth, that produces annotated synthetic corpora with controllable complexity, lexical distributions, and structural entropy, while being fully annotated with linguistic categories, enabling precise analysis of abstraction emergence. Experiments reveal that (i) phase transitions align with clear intersections between curvature collapse and dimension stabilisation; (ii) these geometric shifts coincide with emerging syntactic and semantic accuracy; (iii) abstraction patterns persist across architectural variants, with components like feedforward networks affecting optimisation stability rather than fundamentally altering trajectories. This work advances our understanding of how linguistic abstractions emerge in LMs, offering insights into model interpretability, training efficiency, and compositional generalisation that could inform more principled approaches to LM development.

28. 【2505.17997】owards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

链接https://arxiv.org/abs/2505.17997

作者:Jintian Shao,Yiming Cheng,Hongyi Huang,Beiwen Zhang,Zhiyu Wu,You Shan,Mingkai Zheng

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:demonstrated significant empirical, significant empirical success, large language models, learning for long, framework has demonstrated

备注

点击查看摘要

Abstract:The VAPO framework has demonstrated significant empirical success in enhancing the efficiency and reliability of reinforcement learning for long chain-of-thought (CoT) reasoning tasks with large language models (LLMs). By systematically addressing challenges such as value model bias, heterogeneous sequence lengths, and sparse reward signals, VAPO achieves state-of-the-art performance. While its practical benefits are evident, a deeper theoretical understanding of its underlying mechanisms and potential limitations is crucial for guiding future advancements. This paper aims to initiate such a discussion by exploring VAPO from a theoretical perspective, highlighting areas where its assumptions might be challenged and where further investigation could yield more robust and generalizable reasoning agents. We delve into the intricacies of value function approximation in complex reasoning spaces, the optimality of adaptive advantage estimation, the impact of token-level optimization, and the enduring challenges of exploration and generalization.

29. 【2505.17978】AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web

链接https://arxiv.org/abs/2505.17978

作者:Rui Cao,Zifeng Ding,Zhijiang Guo,Michael Schlichtkrull,Andreas Vlachos

类目:Computation and Language (cs.CL)

关键词:Textual claims, social media, accompanied by images, images to enhance, enhance their credibility

备注

点击查看摘要

Abstract:Textual claims are often accompanied by images to enhance their credibility and spread on social media, but this also raises concerns about the spread of misinformation. Existing datasets for automated verification of image-text claims remain limited, as they often consist of synthetic claims and lack evidence annotations to capture the reasoning behind the verdict. In this work, we introduce AVerImaTeC, a dataset consisting of 1,297 real-world image-text claims. Each claim is annotated with question-answer (QA) pairs containing evidence from the web, reflecting a decomposed reasoning regarding the verdict. We mitigate common challenges in fact-checking datasets such as contextual dependence, temporal leakage, and evidence insufficiency, via claim normalization, temporally constrained evidence annotation, and a two-stage sufficiency check. We assess the consistency of the annotation in AVerImaTeC via inter-annotator studies, achieving a $\kappa=0.742$ on verdicts and $74.7\%$ consistency on QA pairs. We also propose a novel evaluation method for evidence retrieval and conduct extensive experiments to establish baselines for verifying image-text claims using open-web evidence.

30. 【2505.17968】Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems

链接https://arxiv.org/abs/2505.17968

作者:Jiayi Geng,Howard Chen,Dilip Arumugam,Thomas L. Griffiths

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:accelerate scientific discovery, scientific discovery, potential to accelerate, accelerate scientific, create autonomous researchers

备注: 30 pages

点击查看摘要

Abstract:Using AI to create autonomous researchers has the potential to accelerate scientific discovery. A prerequisite for this vision is understanding how well an AI model can identify the underlying structure of a black-box system from its behavior. In this paper, we explore how well a large language model (LLM) learns to identify a black-box function from passively observed versus actively collected data. We investigate the reverse-engineering capabilities of LLMs across three distinct types of black-box systems, each chosen to represent different problem domains where future autonomous AI researchers may have considerable impact: Program, Formal Language, and Math Equation. Through extensive experiments, we show that LLMs fail to extract information from observations, reaching a performance plateau that falls short of the ideal of Bayesian inference. However, we demonstrate that prompting LLMs to not only observe but also intervene -- actively querying the black-box with specific inputs to observe the resulting output -- improves performance by allowing LLMs to test edge cases and refine their beliefs. By providing the intervention data from one LLM to another, we show that this improvement is partly a result of engaging in the process of generating effective interventions, paralleling results in the literature on human learning. Further analysis reveals that engaging in intervention can help LLMs escape from two common failure modes: overcomplication, where the LLM falsely assumes prior knowledge about the black-box, and overlooking, where the LLM fails to incorporate observations. These insights provide practical guidance for helping LLMs more effectively reverse-engineer black-box systems, supporting their use in making new discoveries.

31. 【2505.17964】Counting Cycles with Deepseek

链接https://arxiv.org/abs/2505.17964

作者:Jiashun Jin,Tracy Ke,Bingcheng Sui,Zhenggang Wang

类目:Computation and Language (cs.CL)

关键词:Efficient Equivalent Form, Computationally Efficient Equivalent, recent progress, advanced mathematics, struggles on advanced

备注

点击查看摘要

Abstract:Despite recent progress, AI still struggles on advanced mathematics. We consider a difficult open problem: How to derive a Computationally Efficient Equivalent Form (CEEF) for the cycle count statistic? The CEEF problem does not have known general solutions, and requires delicate combinatorics and tedious calculations. Such a task is hard to accomplish by humans but is an ideal example where AI can be very helpful. We solve the problem by combining a novel approach we propose and the powerful coding skills of AI. Our results use delicate graph theory and contain new formulas for general cases that have not been discovered before. We find that, while AI is unable to solve the problem all by itself, it is able to solve it if we provide it with a clear strategy, a step-by-step guidance and carefully written prompts. For simplicity, we focus our study on DeepSeek-R1 but we also investigate other AI approaches.

32. 【2505.17952】Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

链接https://arxiv.org/abs/2505.17952

作者:Che Liu,Haozhe Wang,Jiazhen Pan,Zhongwei Wan,Yong Dai,Fangzhen Lin,Wenjia Bai,Daniel Rueckert,Rossella Arcucci

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:enabling interpretable decision, interpretable decision making, large language models, Improving performance, clinical applications

备注: Under Review

点击查看摘要

Abstract:Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision? (ii) How do dataset quantity and diversity impact reasoning? (iii) How does question difficulty shape the emergence and generalization of reasoning? Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative, multiple-choice QA data is effective at inducing reasoning without CoT supervision. We also observe divergent trends across benchmarks, underscoring limitations in current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.

33. 【2505.17950】Handling Symbolic Language in Student Texts: A Comparative Study of NLP Embedding Models

链接https://arxiv.org/abs/2505.17950

作者:Tom Bleckmann,Paul Tschisgale

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Physics Education (physics.ed-ph)

关键词:Natural Language Processing, Recent advancements, advancements in Natural, Language Processing, Natural Language

备注

点击查看摘要

Abstract:Recent advancements in Natural Language Processing (NLP) have facilitated the analysis of student-generated language products in learning analytics (LA), particularly through the use of NLP embedding models. Yet when it comes to science-related language, symbolic expressions such as equations and formulas introduce challenges that current embedding models struggle to address. Existing studies and applications often either overlook these challenges or remove symbolic expressions altogether, potentially leading to biased findings and diminished performance of LA applications. This study therefore explores how contemporary embedding models differ in their capability to process and interpret science-related symbolic expressions. To this end, various embedding models are evaluated using physics-specific symbolic expressions drawn from authentic student responses, with performance assessed via two approaches: similarity-based analyses and integration into a machine learning pipeline. Our findings reveal significant differences in model performance, with OpenAI's GPT-text-embedding-3-large outperforming all other examined models, though its advantage over other models was moderate rather than decisive. Beyond performance, additional factors such as cost, regulatory compliance, and model transparency are discussed as key considerations for model selection. Overall, this study underscores the importance for LA researchers and practitioners of carefully selecting NLP embedding models when working with science-related language products that include symbolic expressions.

34. 【2505.17936】Understanding Gated Neurons in Transformers from Their Input-Output Functionality

链接https://arxiv.org/abs/2505.17936

作者:Sebastian Gerstner,Hinrich Schütze

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:understand MLP neurons, Interpretability researchers, understand MLP, language models based, output weight vectors

备注: 31 pages, 22 figures

点击查看摘要

Abstract:Interpretability researchers have attempted to understand MLP neurons of language models based on both the contexts in which they activate and their output weight vectors. They have paid little attention to a complementary aspect: the interactions between input and output. For example, when neurons detect a direction in the input, they might add much the same direction to the residual stream ("enrichment neurons") or reduce its presence ("depletion neurons"). We address this aspect by examining the cosine similarity between input and output weights of a neuron. We apply our method to 12 models and find that enrichment neurons dominate in early-middle layers whereas later layers tend more towards depletion. To explain this finding, we argue that enrichment neurons are largely responsible for enriching concept representations, one of the first steps of factual recall. Our input-output perspective is a complement to activation-dependent analyses and to approaches that treat input and output separately.

35. 【2505.17928】owards Practical Defect-Focused Automated Code Review

链接https://arxiv.org/abs/2505.17928

作者:Junyi Lu,Lili Jiang,Xiaojia Li,Jianbing Fang,Fengjun Zhang,Li Yang,Chun Zuo

类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:automate review comments, prior approaches oversimplify, text similarity metrics, metrics like BLEU, review comments

备注: Accepted to Forty-Second International Conference on Machine Learning (ICML 2025)

点击查看摘要

Abstract:The complexity of code reviews has driven efforts to automate review comments, but prior approaches oversimplify this task by treating it as snippet-level code-to-text generation and relying on text similarity metrics like BLEU for evaluation. These methods overlook repository context, real-world merge request evaluation, and defect detection, limiting their practicality. To address these issues, we explore the full automation pipeline within the online recommendation service of a company with nearly 400 million daily active users, analyzing industry-grade C++ codebases comprising hundreds of thousands of lines of code. We identify four key challenges: 1) capturing relevant context, 2) improving key bug inclusion (KBI), 3) reducing false alarm rates (FAR), and 4) integrating human workflows. To tackle these, we propose 1) code slicing algorithms for context extraction, 2) a multi-role LLM framework for KBI, 3) a filtering mechanism for FAR reduction, and 4) a novel prompt design for better human interaction. Our approach, validated on real-world merge requests from historical fault reports, achieves a 2x improvement over standard LLMs and a 10x gain over previous baselines. While the presented results focus on C++, the underlying framework design leverages language-agnostic principles (e.g., AST-based analysis), suggesting potential for broader applicability.

36. 【2505.17923】Language models can learn implicit multi-hop reasoning, but only if they have lots of training data

链接https://arxiv.org/abs/2505.17923

作者:Yuekun Yao,Yupei Du,Dawei Zhu,Michael Hahn,Alexander Koller

类目:Computation and Language (cs.CL)

关键词:single forward pass, solve multi-hop reasoning, multi-hop reasoning tasks, forward pass, chain of thought

备注

点击查看摘要

Abstract:Implicit reasoning is the ability of a language model to solve multi-hop reasoning tasks in a single forward pass, without chain of thought. We investigate this capability using GPT2-style language models trained from scratch on controlled $k$-hop reasoning datasets ($k = 2, 3, 4$). We show that while such models can indeed learn implicit $k$-hop reasoning, the required training data grows exponentially in $k$, and the required number of transformer layers grows linearly in $k$. We offer a theoretical explanation for why this depth growth is necessary. We further find that the data requirement can be mitigated, but not eliminated, through curriculum learning.

37. 【2505.17897】2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation

链接https://arxiv.org/abs/2505.17897

作者:Zi-Ao Ma,Tian Lan,Rong-Cheng Tu,Shu-Hang Liu,Heyan Huang,Zhijing Wu,Chen Xu,Xian-Ling Mao

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:human annotation burden, progress in diffusion-based, generation has created, annotation burden, rapid progress

备注

点击查看摘要

Abstract:The rapid progress in diffusion-based text-to-image (T2I) generation has created an urgent need for interpretable automatic evaluation methods that can assess the quality of generated images, therefore reducing the human annotation burden. To reduce the prohibitive cost of relying on commercial models for large-scale evaluation, and to improve the reasoning capabilities of open-source models, recent research has explored supervised fine-tuning (SFT) of multimodal large language models (MLLMs) as dedicated T2I evaluators. However, SFT approaches typically rely on high-quality critique datasets, which are either generated by proprietary LLMs-with potential issues of bias and inconsistency-or annotated by humans at high cost, limiting their scalability and generalization. To address these limitations, we propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores, thereby avoiding the need for annotating high-quality interpretable evaluation rationale. Our approach integrates Group Relative Policy Optimization (GRPO) into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains with only easy accessible annotated judgment scores or preferences. Furthermore, we introduce a continuous reward formulation that encourages score diversity and provides stable optimization signals, leading to more robust and discriminative evaluation behavior. Experimental results on three established T2I meta-evaluation benchmarks demonstrate that T2I-Eval-R1 achieves significantly higher alignment with human assessments and offers more accurate interpretable score rationales compared to strong baseline methods.

38. 【2505.17894】Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model

链接https://arxiv.org/abs/2505.17894

作者:Khalil Hennara,Muhammad Hreden,Mohamed Motaism Hamed,Zeina Aldallal,Sara Chrouf,Safwan AlModhayan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:powerful language model, compact yet powerful, bidirectional Arabic-English translation, Mutarjim, powerful language

备注

点击查看摘要

Abstract:We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller models. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B , a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus.. Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger and proprietary models like GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.

39. 【2505.17873】MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback

链接https://arxiv.org/abs/2505.17873

作者:Wanhao Liu,Zonglin Yang,Jue Wang,Lidong Bing,Di Zhang,Dongzhan Zhou,Yuqiang Li,Houqiang Li,Erik Cambria,Wanli Ouyang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

关键词:automated scientific discovery, scientific discovery, costly and throughput-limited, crucial component, component of automated

备注

点击查看摘要

Abstract:Hypothesis ranking is a crucial component of automated scientific discovery, particularly in natural sciences where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on large language model's internal reasoning without incorporating empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. However, developing such strategies is challenging due to the impracticality of repeatedly conducting real experiments in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground truth hypothesis, perturbed by noise. We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.

40. 【2505.17870】Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods

链接https://arxiv.org/abs/2505.17870

作者:Shaina Raza,Rizwan Qureshi,Marcelo Lotif,Aman Chadha,Deval Pandya,Christos Emmanouilidis

类目:Computation and Language (cs.CL)

关键词:reproduce false information, false information present, learn and reproduce, information present, Generative

备注

点击查看摘要

Abstract:Generative AI models often learn and reproduce false information present in their training corpora. This position paper argues that, analogous to biological immunization, where controlled exposure to a weakened pathogen builds immunity, AI models should be fine tuned on small, quarantined sets of explicitly labeled falsehoods as a "vaccine" against misinformation. These curated false examples are periodically injected during finetuning, strengthening the model ability to recognize and reject misleading claims while preserving accuracy on truthful inputs. An illustrative case study shows that immunized models generate substantially less misinformation than baselines. To our knowledge, this is the first training framework that treats fact checked falsehoods themselves as a supervised vaccine, rather than relying on input perturbations or generic human feedback signals, to harden models against future misinformation. We also outline ethical safeguards and governance controls to ensure the safe use of false data. Model immunization offers a proactive paradigm for aligning AI systems with factuality.

41. 【2505.17855】Explaining Sources of Uncertainty in Automated Fact-Checking

链接https://arxiv.org/abs/2505.17855

作者:Jingyi Sun,Greta Warren,Irina Shklovski,Isabelle Augenstein

类目:Computation and Language (cs.CL)

关键词:effective human-AI collaboration, Understanding sources, human-AI collaboration, predictions is crucial, crucial for effective

备注

点击查看摘要

Abstract:Understanding sources of a model's uncertainty regarding its predictions is crucial for effective human-AI collaboration. Prior work proposes using numerical uncertainty or hedges ("I'm not sure, but ..."), which do not explain uncertainty that arises from conflicting evidence, leaving users unable to resolve disagreements or rely on the output. We introduce CLUE (Conflict-and-Agreement-aware Language-model Uncertainty Explanations), the first framework to generate natural language explanations of model uncertainty by (i) identifying relationships between spans of text that expose claim-evidence or inter-evidence conflicts and agreements that drive the model's predictive uncertainty in an unsupervised way, and (ii) generating explanations via prompting and attention steering that verbalize these critical interactions. Across three language models and two fact-checking datasets, we show that CLUE produces explanations that are more faithful to the model's uncertainty and more consistent with fact-checking decisions than prompting for uncertainty explanations without span-interaction guidance. Human evaluators judge our explanations to be more helpful, more informative, less redundant, and more logically consistent with the input than this baseline. CLUE requires no fine-tuning or architectural changes, making it plug-and-play for any white-box language model. By explicitly linking uncertainty to evidence conflicts, it offers practical support for fact-checking and generalises readily to other tasks that require reasoning over complex information.

42. 【2505.17833】Investigating Affect Mining Techniques for Annotation Sample Selection in the Creation of Finnish Affective Speech Corpus

链接https://arxiv.org/abs/2505.17833

作者:Kalle Lahtinen,Einari Vaaras,Liisa Mustanoja,Okko Räsänen

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:requires suitable data, speech requires suitable, Study of affect, requires suitable, perception vary

备注: Accepted for publication at Interspeech 2025, Rotterdam, The Netherlands

点击查看摘要

Abstract:Study of affect in speech requires suitable data, as emotional expression and perception vary across languages. Until now, no corpus has existed for natural expression of affect in spontaneous Finnish, existing data being acted or from a very specific communicative setting. This paper presents the first such corpus, created by annotating 12,000 utterances for emotional arousal and valence, sampled from three large-scale Finnish speech corpora. To ensure diverse affective expression, sample selection was conducted with an affect mining approach combining acoustic, cross-linguistic speech emotion, and text sentiment features. We compare this method to random sampling in terms of annotation diversity, and conduct post-hoc analyses to identify sampling choices that would have maximized the diversity. As an outcome, the work introduces a spontaneous Finnish affective speech corpus and informs sampling strategies for affective speech corpus creation in other languages or domains.

43. 【2505.17832】Emerging categories in scientific explanations

链接https://arxiv.org/abs/2505.17832

作者:Giacomo Magnifico,Eduard Barbu

类目:Computation and Language (cs.CL)

关键词:Clear and effective, knowledge dissemination, essential for human, human understanding, understanding and knowledge

备注: Accepted at the 3rd TRR 318 Conference: Contextualizing Explanations (ContEx25), as a two-pager abstract. Will be published at BiUP (Bielefeld University Press) at a later date

点击查看摘要

Abstract:Clear and effective explanations are essential for human understanding and knowledge dissemination. The scope of scientific research aiming to understand the essence of explanations has recently expanded from the social sciences to machine learning and artificial intelligence. Explanations for machine learning decisions must be impactful and human-like, and there is a lack of large-scale datasets focusing on human-like and human-generated explanations. This work aims to provide such a dataset by: extracting sentences that indicate explanations from scientific literature among various sources in the biotechnology and biophysics topic domains (e.g. PubMed's PMC Open Access subset); providing a multi-class notation derived inductively from the data; evaluating annotator consensus on the emerging categories. The sentences are organized in an openly-available dataset, with two different classifications (6-class and 3-class category annotation), and the 3-class notation achieves a 0.667 Krippendorf Alpha value.

44. 【2505.17829】Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs' Reasoning

链接https://arxiv.org/abs/2505.17829

作者:Zezhong Wang,Xingshan Zeng,Weiwen Liu,Yufei Wang,Liangyou Li,Yasheng Wang,Lifeng Shang,Xin Jiang,Qun Liu,Kam-Fai Wong

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, Test-Time Scaling, capability of Large

备注

点击查看摘要

Abstract:Mathematical reasoning through Chain-of-Thought (CoT) has emerged as a powerful capability of Large Language Models (LLMs), which can be further enhanced through Test-Time Scaling (TTS) methods like Beam Search and DVTS. However, these methods, despite improving accuracy by allocating more computational resources during inference, often suffer from path homogenization and inefficient use of intermediate results. To address these limitations, we propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps. It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making. Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results. Experimental results show that SRCA improves reasoning accuracy compared to existing TTS methods across various mathematical datasets.

45. 【2505.17827】Not All Tokens Are What You Need In Thinking

链接https://arxiv.org/abs/2505.17827

作者:Hang Yuan,Bin Yu,Haotian Li,Shijun Yang,Christina Dan Wang,Zhou Yu,Xueyin Xu,Weizhen Qi,Kai Chen

类目:Computation and Language (cs.CL)

关键词:high inference latency, exhibit impressive problem-solving, excessive computational resource, computational resource consumption, generating verbose chains

备注: 11 pages, 7 figures and 3 tables

点击查看摘要

Abstract:Modern reasoning models, such as OpenAI's o1 and DeepSeek-R1, exhibit impressive problem-solving capabilities but suffer from critical inefficiencies: high inference latency, excessive computational resource consumption, and a tendency toward overthinking -- generating verbose chains of thought (CoT) laden with redundant tokens that contribute minimally to the final answer. To address these issues, we propose Conditional Token Selection (CTS), a token-level compression framework with a flexible and variable compression ratio that identifies and preserves only the most essential tokens in CoT. CTS evaluates each token's contribution to deriving correct answers using conditional importance scoring, then trains models on compressed CoT. Extensive experiments demonstrate that CTS effectively compresses long CoT while maintaining strong reasoning performance. Notably, on the GPQA benchmark, Qwen2.5-14B-Instruct trained with CTS achieves a 9.1% accuracy improvement with 13.2% fewer reasoning tokens (13% training token reduction). Further reducing training tokens by 42% incurs only a marginal 5% accuracy drop while yielding a 75.8% reduction in reasoning tokens, highlighting the prevalence of redundancy in existing CoT.

46. 【2505.17826】rinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models

链接https://arxiv.org/abs/2505.17826

作者:Xuchen Pan,Yanxi Chen,Yushuo Chen,Yuchang Sun,Daoyuan Chen,Wenhao Zhang,Yuexiang Xie,Yilun Huang,Yilei Zhang,Dawei Gao,Yaliang Li,Bolin Ding,Jingren Zhou

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)

关键词:large language models, scalable framework designed, flexible and scalable, language models, large language

备注: This technical report will be continuously updated as the codebase evolves. GitHub: [this https URL](https://github.com/modelscope/Trinity-RFT)

点击查看摘要

Abstract:Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT, (2) seamless integration for agent-environment interaction with high efficiency and robustness, and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for exploring advanced reinforcement learning paradigms. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples demonstrating the utility and user-friendliness of the proposed framework.

47. 【2505.17818】PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

链接https://arxiv.org/abs/2505.17818

作者:Daeun Kyung,Hyunseung Chung,Seongsu Bae,Jiho Kim,Jae Ho Sohn,Taerim Kim,Soo Kyung Kim,Edward Choi

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Doctor-patient consultations require, context-aware communication tailored, consultations require multi-turn, Doctor-patient consultations, context-aware communication

备注: 9 pages for main text, 4 pages for references, 27 pages for supplementary materials

点击查看摘要

Abstract:Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios, grounded in medical expertise. PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level, resulting in 37 unique combinations. We evaluated eight LLMs for factual accuracy and persona consistency. The top-performing open-source model, Llama 3.3, was validated by four clinicians to confirm the robustness of our framework. As an open-source, customizable platform, PatientSim provides a reproducible and scalable solution that can be customized for specific training needs. Offering a privacy-compliant environment, it serves as a robust testbed for evaluating medical dialogue systems across diverse patient presentations and shows promise as an educational tool for healthcare.

48. 【2505.17816】Low-Resource NMT: A Case Study on the Written and Spoken Languages in Hong Kong

链接https://arxiv.org/abs/2505.17816

作者:Hei Yi Mak,Tan Lee

类目:Computation and Language (cs.CL)

关键词:Hong Kong, inhabitants in Hong, primary spoken language, Cantonese, written Cantonese

备注: Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval

点击查看摘要

Abstract:The majority of inhabitants in Hong Kong are able to read and write in standard Chinese but use Cantonese as the primary spoken language in daily life. Spoken Cantonese can be transcribed into Chinese characters, which constitute the so-called written Cantonese. Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese. The rise of written Cantonese is increasingly evident in the cyber world. The growing interaction between Mandarin speakers and Cantonese speakers is leading to a clear demand for automatic translation between Chinese and Cantonese. This paper describes a transformer-based neural machine translation (NMT) system for written-Chinese-to-written-Cantonese translation. Given that parallel text data of Chinese and Cantonese are extremely scarce, a major focus of this study is on the effort of preparing good amount of training data for NMT. In addition to collecting 28K parallel sentences from previous linguistic studies and scattered internet resources, we devise an effective approach to obtaining 72K parallel sentences by automatically extracting pairs of semantically similar sentences from parallel articles on Chinese Wikipedia and Cantonese Wikipedia. We show that leveraging highly similar sentence pairs mined from Wikipedia improves translation performance in all test sets. Our system outperforms Baidu Fanyi's Chinese-to-Cantonese translation on 6 out of 8 test sets in BLEU scores. Translation examples reveal that our system is able to capture important linguistic transformations between standard Chinese and spoken Cantonese.

49. 【2505.17813】Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

链接https://arxiv.org/abs/2505.17813

作者:Michael Hassid,Gabriel Synnaeve,Yossi Adi,Roy Schwartz

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, Reasoning large language, complex reasoning tasks, perform complex reasoning, language models

备注: Preprint. Under review

点击查看摘要

Abstract:Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that long thinking chains results in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers - up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen using majority voting among these m chains. Basic short-1@k demonstrates similar or even superior performance over standard majority voting in low-compute settings - using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets, while still being substantially faster (up to 33% wall time reduction). Inspired by our results, we finetune an LLM using short, long, and randomly selected reasoning chains. We then observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.

50. 【2505.17795】DialogXpert: Driving Intelligent and Emotion-Aware Conversations through Online Value-Based Reinforcement Learning with LLM Priors

链接https://arxiv.org/abs/2505.17795

作者:Tazeek Bin Abdur Rakib,Ambuj Mehrish,Lay-Ki Soon,Wern Han Lim,Soujanya Poria

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:goal-driven interactions due, agents excel, struggle with proactive, goal-driven interactions, excel at reactive

备注

点击查看摘要

Abstract:Large-language-model (LLM) agents excel at reactive dialogue but struggle with proactive, goal-driven interactions due to myopic decoding and costly planning. We introduce DialogXpert, which leverages a frozen LLM to propose a small, high-quality set of candidate actions per turn and employs a compact Q-network over fixed BERT embeddings trained via temporal-difference learning to select optimal moves within this reduced space. By tracking the user's emotions, DialogXpert tailors each decision to advance the task while nurturing a genuine, empathetic connection. Across negotiation, emotional support, and tutoring benchmarks, DialogXpert drives conversations to under $3$ turns with success rates exceeding 94\% and, with a larger LLM prior, pushes success above 97\% while markedly improving negotiation outcomes. This framework delivers real-time, strategic, and emotionally intelligent dialogue planning at scale. Code available at this https URL

51. 【2505.17793】Compression Hacking: A Supplementary Perspective on Informatics Metric of Language Models from Geometric Distortion

链接https://arxiv.org/abs/2505.17793

作者:Jianxiang Zang,Meiling Ning,Yongda Wei,Shihan Dou,Jiazheng Zhang,Nijia Mo,Binhong Li,Tao Gui,Qi Zhang,Xuanjing Huang

类目:Computation and Language (cs.CL)

关键词:structured representations signify, highly structured representations, intelligence level, language models, perspective for language

备注

点击查看摘要

Abstract:Recently, the concept of ``compression as intelligence'' has provided a novel informatics metric perspective for language models (LMs), emphasizing that highly structured representations signify the intelligence level of LMs. However, from a geometric standpoint, the word representation space of highly compressed LMs tends to degenerate into a highly anisotropic state, which hinders the LM's ability to comprehend instructions and directly impacts its performance. We found this compression-anisotropy synchronicity is essentially the ``Compression Hacking'' in LM representations, where noise-dominated directions tend to create the illusion of high compression rates by sacrificing spatial uniformity. Based on this, we propose three refined compression metrics by incorporating geometric distortion analysis and integrate them into a self-evaluation pipeline. The refined metrics exhibit strong alignment with the LM's comprehensive capabilities, achieving Spearman correlation coefficients above 0.9, significantly outperforming both the original compression and other internal structure-based metrics. This confirms that compression hacking substantially enhances the informatics interpretation of LMs by incorporating geometric distortion of representations.

52. 【2505.17784】EXECUTE: A Multilingual Benchmark for LLM Token Understanding

链接https://arxiv.org/abs/2505.17784

作者:Lukas Edman,Helmut Schmid,Alexander Fraser

类目:Computation and Language (cs.CL)

关键词:CUTE benchmark showed, CUTE benchmark, benchmark showed, CUTE, English

备注: Accepted to Findings of ACL 2025

点击查看摘要

Abstract:The CUTE benchmark showed that LLMs struggle with character understanding in English. We extend it to more languages with diverse scripts and writing systems, introducing EXECUTE. Our simplified framework allows easy expansion to any language. Tests across multiple LLMs reveal that challenges in other languages are not always on the character level as in English. Some languages show word-level processing issues, some show no issues at all. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs' understanding of character components.

53. 【2505.17767】he Real Barrier to LLM Agent Usability is Agentic ROI

链接https://arxiv.org/abs/2505.17767

作者:Weiwen Liu,Jiarui Qin,Xu Huang,Xingshan Zeng,Yunjia Xi,Jianghao Lin,Chuhan Wu,Yasheng Wang,Lifeng Shang,Ruiming Tang,Defu Lian,Yong Yu,Weinan Zhang

类目:Computation and Language (cs.CL)

关键词:Large Language Model, Large Language, passive prompt-response systems, autonomous agents capable, Language Model

备注

点击查看摘要

Abstract:Large Language Model (LLM) agents represent a promising shift in human-AI interaction, moving beyond passive prompt-response systems to autonomous agents capable of reasoning, planning, and goal-directed action. Despite the widespread application in specialized, high-effort tasks like coding and scientific research, we highlight a critical usability gap in high-demand, mass-market applications. This position paper argues that the limited real-world adoption of LLM agents stems not only from gaps in model capabilities, but also from a fundamental tradeoff between the value an agent can provide and the costs incurred during real-world use. Hence, we call for a shift from solely optimizing model performance to a broader, utility-driven perspective: evaluating agents through the lens of the overall agentic return on investment (Agent ROI). By identifying key factors that determine Agentic ROI--information quality, agent time, and cost--we posit a zigzag development trajectory in optimizing agentic ROI: first scaling up to improve the information quality, then scaling down to minimize the time and cost. We outline the roadmap across different development stages to bridge the current usability gaps, aiming to make LLM agents truly scalable, accessible, and effective in real-world contexts.

54. 【2505.17762】Resolving Conflicting Evidence in Automated Fact-Checking: A Study on Retrieval-Augmented LLMs

链接https://arxiv.org/abs/2505.17762

作者:Ziyu Ge,Yuhao Wu,Daniel Wai Kit Chin,Roy Ka-Wei Lee,Rui Cao

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large Language Models, Large Language, integrating external knowledge, demonstrated significant potential, Language Models

备注: Camera-ready for IJCAI 2025, AI and Social Good

点击查看摘要

Abstract:Large Language Models (LLMs) augmented with retrieval mechanisms have demonstrated significant potential in fact-checking tasks by integrating external knowledge. However, their reliability decreases when confronted with conflicting evidence from sources of varying credibility. This paper presents the first systematic evaluation of Retrieval-Augmented Generation (RAG) models for fact-checking in the presence of conflicting evidence. To support this study, we introduce \textbf{CONFACT} (\textbf{Con}flicting Evidence for \textbf{Fact}-Checking) (Dataset available at this https URL), a novel dataset comprising questions paired with conflicting information from various sources. Extensive experiments reveal critical vulnerabilities in state-of-the-art RAG methods, particularly in resolving conflicts stemming from differences in media source credibility. To address these challenges, we investigate strategies to integrate media background information into both the retrieval and generation stages. Our results show that effectively incorporating source credibility significantly enhances the ability of RAG models to resolve conflicting evidence and improve fact-checking performance.

55. 【2505.17747】Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks

链接https://arxiv.org/abs/2505.17747

作者:Maureen de Seyssel,Jie Chi,Skyler Seto,Maartje ter Hoeve,Masha Fedzechkina,Natalie Schluter

类目:Computation and Language (cs.CL)

关键词:represent language identity, training-free ABX-style discrimination, language models represent, models represent language, semantic content

备注

点击查看摘要

Abstract:We introduce a set of training-free ABX-style discrimination tasks to evaluate how multilingual language models represent language identity (form) and semantic content (meaning). Inspired from speech processing, these zero-shot tasks measure whether minimal differences in representation can be reliably detected. This offers a flexible and interpretable alternative to probing. Applied to XLM-R (Conneau et al, 2020) across pretraining checkpoints and layers, we find that language discrimination declines over training and becomes concentrated in lower layers, while meaning discrimination strengthens over time and stabilizes in deeper layers. We then explore probing tasks, showing some alignment between our metrics and linguistic learning performance. Our results position ABX tasks as a lightweight framework for analyzing the structure of multilingual representations.

56. 【2505.17746】Fast Quiet-STaR: Thinking Without Thought Tokens

链接https://arxiv.org/abs/2505.17746

作者:Wei Huang,Yizhe Xiong,Xin Ye,Zhijie Deng,Hui Chen,Zijia Lin,Guiguang Ding

类目:Computation and Language (cs.CL)

关键词:Large Language Models, natural language processing, Large Language, language processing tasks, achieved impressive performance

备注: 10 pages, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive performance across a range of natural language processing tasks. However, recent advances demonstrate that further gains particularly in complex reasoning tasks require more than merely scaling up model sizes or training data. One promising direction is to enable models to think during the reasoning process. Recently, Quiet STaR significantly improves reasoning by generating token-level thought traces, but incurs substantial inference overhead. In this work, we propose Fast Quiet STaR, a more efficient reasoning framework that preserves the benefits of token-level reasoning while reducing computational cost. Our method introduces a curriculum learning based training strategy that gradually reduces the number of thought tokens, enabling the model to internalize more abstract and concise reasoning processes. We further extend this approach to the standard Next Token Prediction (NTP) setting through reinforcement learning-based fine-tuning, resulting in Fast Quiet-STaR NTP, which eliminates the need for explicit thought token generation during inference. Experiments on four benchmark datasets with Mistral 7B and Qwen2.5 7B demonstrate that Fast Quiet-STaR consistently outperforms Quiet-STaR in terms of average accuracy under the same inference time budget. Notably, Fast Quiet-STaR NTP achieves an average accuracy improvement of 9\% on Mistral 7B and 5.7\% on Qwen2.5 7B, while maintaining the same inference latency. Our code will be available at this https URL.

57. 【2505.17733】he Pilot Corpus of the English Semantic Sketches

链接https://arxiv.org/abs/2505.17733

作者:Maria Petrova,Maria Ponomareva,Alexandra Ivoylova

类目:Computation and Language (cs.CL)

关键词:English verbs, paper is devoted, sketches for English, English, sketches

备注

点击查看摘要

Abstract:The paper is devoted to the creation of the semantic sketches for English verbs. The pilot corpus consists of the English-Russian sketch pairs and is aimed to show what kind of contrastive studies the sketches help to conduct. Special attention is paid to the cross-language differences between the sketches with similar semantics. Moreover, we discuss the process of building a semantic sketch, and analyse the mistakes that could give insight to the linguistic nature of sketches.

58. 【2505.17714】PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization

链接https://arxiv.org/abs/2505.17714

作者:Ben Rahman

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Proximal Policy Optimization, aggressive clipping stifles, clipping stifles early, late-stage updates destabilize, dominating policy gradient

备注: This manuscript builds upon an earlier version posted to TechRxiv. This arXiv version includes an updated comparison with GRPO (Group Relative Policy Optimization)

点击查看摘要

Abstract:Despite Proximal Policy Optimization (PPO) dominating policy gradient methods -- from robotic control to game AI -- its static trust region forces a brittle trade-off: aggressive clipping stifles early exploration, while late-stage updates destabilize convergence. PPO-BR establishes a new paradigm in adaptive RL by fusing exploration and convergence signals into a single bounded trust region -- a theoretically grounded innovation that outperforms five SOTA baselines with less than 2% overhead. This work bridges a critical gap in phase-aware learning, enabling real-world deployment in safety-critical systems like robotic surgery within a single adaptive mechanism. PPO-BR achieves 29.1% faster convergence by combining: (1) entropy-driven expansion (epsilon up) for exploration in high-uncertainty states, and (2) reward-guided contraction (epsilon down) for convergence stability. On six diverse benchmarks (MuJoCo, Atari, sparse-reward), PPO-BR achieves 29.1% faster convergence (p 0.001), 2.3x lower reward variance than PPO, and less than 1.8% runtime overhead with only five lines of code change. PPO-BR's simplicity and theoretical guarantees make it ready-to-deploy in safety-critical domains -- from surgical robotics to autonomous drones. In contrast to recent methods such as Group Relative Policy Optimization (GRPO), PPO-BR offers a unified entropy-reward mechanism applicable to both language models and general reinforcement learning environments.

59. 【2505.17712】Understanding How Value Neurons Shape the Generation of Specified Values in LLMs

链接https://arxiv.org/abs/2505.17712

作者:Yi Su,Jiayi Zhang,Shu Yang,Xinhai Wang,Lijie Hu,Di Wang

类目:Computation and Language (cs.CL)

关键词:Rapid integration, universal ethical principles, representations remain opaque, large language models, ethical principles

备注

点击查看摘要

Abstract:Rapid integration of large language models (LLMs) into societal applications has intensified concerns about their alignment with universal ethical principles, as their internal value representations remain opaque despite behavioral alignment advancements. Current approaches struggle to systematically interpret how values are encoded in neural architectures, limited by datasets that prioritize superficial judgments over mechanistic analysis. We introduce ValueLocate, a mechanistic interpretability framework grounded in the Schwartz Values Survey, to address this gap. Our method first constructs ValueInsight, a dataset that operationalizes four dimensions of universal value through behavioral contexts in the real world. Leveraging this dataset, we develop a neuron identification method that calculates activation differences between opposing value aspects, enabling precise localization of value-critical neurons without relying on computationally intensive attribution methods. Our proposed validation method demonstrates that targeted manipulation of these neurons effectively alters model value orientations, establishing causal relationships between neurons and value representations. This work advances the foundation for value alignment by bridging psychological value frameworks with neuron analysis in LLMs.

60. 【2505.17704】SemSketches-2021: experimenting with the machine processing of the pilot semantic sketches corpus

链接https://arxiv.org/abs/2505.17704

作者:Maria Ponomareva,Maria Petrova,Julia Detkova,Oleg Serikov,Maria Yarova

类目:Computation and Language (cs.CL)

关键词:semantic sketches, paper deals, deals with elaborating, elaborating different approaches, sketches

备注

点击查看摘要

Abstract:The paper deals with elaborating different approaches to the machine processing of semantic sketches. It presents the pilot open corpus of semantic sketches. Different aspects of creating the sketches are discussed, as well as the tasks that the sketches can help to solve. Special attention is paid to the creation of the machine processing tools for the corpus. For this purpose, the SemSketches-2021 Shared Task was organized. The participants were given the anonymous sketches and a set of contexts containing the necessary predicates. During the Task, one had to assign the proper contexts to the corresponding sketches.

61. 【2505.17701】COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection

链接https://arxiv.org/abs/2505.17701

作者:Jaewon Cheon,Pilsung Kang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:large language models, significant computational inefficiencies, created significant computational, growing size, size of large

备注

点击查看摘要

Abstract:The growing size of large language models has created significant computational inefficiencies. To address this challenge, sparse activation methods selectively deactivates non-essential parameters during inference, reducing computational costs in FFNN layers. While existing methods focus on non-linear gating mechanisms, we hypothesize that the sparsity of the FFNN layer lies globally in the form of a linear combination over its internal down projection matrix. Based on this insight, we propose two methods: M-COUNTDOWN, leveraging indirect coefficients, and D-COUNTDOWN, utilizing direct coefficients of the linear combination. Experimental results demonstrate that D-COUNTDOWN can omit 90% of computations with performance loss as low as 5.5% ideally, while M-COUNTDOWN provides a predictor-free solution with up to 29.4% better performance preservation compared to existing methods. Our specialized kernel implementations effectively realize these theoretical gains into substantial real-world acceleration.

62. 【2505.17697】Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models

链接https://arxiv.org/abs/2505.17697

作者:Zekai Zhao,Qi Liu,Kun Zhou,Zihan Liu,Yifei Shao,Zhiting Hu,Biwei Huang

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:typically requires costly, large language models, requires costly reinforcement, costly reinforcement learning, high-quality distilled data

备注

点击查看摘要

Abstract:Despite the remarkable reasoning performance, eliciting the long chain-of-thought (CoT) ability in large language models (LLMs) typically requires costly reinforcement learning or supervised fine-tuning on high-quality distilled data. We investigate the internal mechanisms behind this capability and show that a small set of high-impact activations in the last few layers largely governs long-form reasoning attributes, such as output length and self-reflection. By simply amplifying these activations and inserting "wait" tokens, we can invoke the long CoT ability without any training, resulting in significantly increased self-reflection rates and accuracy. Moreover, we find that the activation dynamics follow predictable trajectories, with a sharp rise after special tokens and a subsequent exponential decay. Building on these insights, we introduce a general training-free activation control technique. It leverages a few contrastive examples to identify key activations, and employs simple analytic functions to modulate their values at inference time to elicit long CoTs. Extensive experiments confirm the effectiveness of our method in efficiently eliciting long CoT reasoning in LLMs and improving their performance. Additionally, we propose a parameter-efficient fine-tuning method that trains only a last-layer activation amplification module and a few LoRA layers, outperforming full LoRA fine-tuning on reasoning benchmarks with significantly fewer parameters. Our code and data are publicly released.

63. 【2505.17691】ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction

链接https://arxiv.org/abs/2505.17691

作者:Yan Yu,Yilun Liu,Minggui He,Shimin Tao,Weibin Meng,Xinhua Yang,Li Zhang,Hongxia Ma,Chang Su,Hao Yang,Fuliang Li

类目:Computation and Language (cs.CL)

关键词:comparisons remains unresolved, Large language models, Large language, pairwise comparisons remains, pairwise comparisons

备注

点击查看摘要

Abstract:Large language models (LLMs) are widely used as evaluators for open-ended tasks, while previous research has emphasized biases in LLM evaluations, the issue of non-transitivity in pairwise comparisons remains unresolved: non-transitive preferences for pairwise comparisons, where evaluators prefer A over B, B over C, but C over A. Our results suggest that low-quality training data may reduce the transitivity of preferences generated by the Evaluator LLM. To address this, We propose a graph-theoretic framework to analyze and mitigate this problem by modeling pairwise preferences as tournament graphs. We quantify non-transitivity and introduce directed graph structural entropy to measure the overall clarity of preferences. Our analysis reveals significant non-transitivity in advanced Evaluator LLMs (with Qwen2.5-Max exhibiting 67.96%), as well as high entropy values (0.8095 for Qwen2.5-Max), reflecting low overall clarity of preferences. To address this issue, we designed a filtering strategy, ELSPR, to eliminate preference data that induces non-transitivity, retaining only consistent and transitive preference data for model fine-tuning. Experiments demonstrate that models fine-tuned with filtered data reduce non-transitivity by 13.78% (from 64.28% to 50.50%), decrease structural entropy by 0.0879 (from 0.8113 to 0.7234), and align more closely with human evaluators (human agreement rate improves by 0.6% and Spearman correlation increases by 0.01).

64. 【2505.17682】uning Language Models for Robust Prediction of Diverse User Behaviors

链接https://arxiv.org/abs/2505.17682

作者:Fanjin Meng,Jingtao Ding,Jiahui Gong,Chen Yang,Hong Chen,Zuojian Wang,Haisheng Lu,Yong Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Predicting user behavior, intelligent assistant services, Predicting user, deep learning models, capture long-tailed behaviors

备注

点击查看摘要

Abstract:Predicting user behavior is essential for intelligent assistant services, yet deep learning models often struggle to capture long-tailed behaviors. Large language models (LLMs), with their pretraining on vast corpora containing rich behavioral knowledge, offer promise. However, existing fine-tuning approaches tend to overfit to frequent ``anchor'' behaviors, reducing their ability to predict less common ``tail'' behaviors. In this paper, we introduce BehaviorLM, a progressive fine-tuning approach that addresses this issue. In the first stage, LLMs are fine-tuned on anchor behaviors while preserving general behavioral knowledge. In the second stage, fine-tuning uses a balanced subset of all behaviors based on sample difficulty to improve tail behavior predictions without sacrificing anchor performance. Experimental results on two real-world datasets demonstrate that BehaviorLM robustly predicts both anchor and tail behaviors and effectively leverages LLM behavioral knowledge to master tail behavior prediction with few-shot examples.

65. 【2505.17671】MIDB: Multilingual Instruction Data Booster for Enhancing Multilingual Instruction Synthesis

链接https://arxiv.org/abs/2505.17671

作者:Yilun Liu,Chunguang Zhao,Xinhua Yang,Hongyong Zeng,Shimin Tao,Weibin Meng,Minggui He,Chang Su,Yan Yu,Hongxia Ma,Li Zhang,Daimeng Wei,Hao Yang

类目:Computation and Language (cs.CL)

关键词:English synthesized data, multilingual synthesized instruction, data, synthesized instruction pairs, synthesized data

备注

点击查看摘要

Abstract:Despite doubts on data quality, instruction synthesis has been widely applied into instruction tuning (IT) of LLMs as an economic and rapid alternative. Recent endeavors focus on improving data quality for synthesized instruction pairs in English and have facilitated IT of English-centric LLMs. However, data quality issues in multilingual synthesized instruction pairs are even more severe, since the common synthesizing practice is to translate English synthesized data into other languages using machine translation (MT). Besides the known content errors in these English synthesized data, multilingual synthesized instruction data are further exposed to defects introduced by MT and face insufficient localization of the target languages. In this paper, we propose MIDB, a Multilingual Instruction Data Booster to automatically address the quality issues in multilingual synthesized data. MIDB is trained on around 36.8k revision examples across 16 languages by human linguistic experts, thereby can boost the low-quality data by addressing content errors and MT defects, and improving localization in these synthesized data. Both automatic and human evaluation indicate that not only MIDB steadily improved instruction data quality in 16 languages, but also the instruction-following and cultural-understanding abilities of multilingual LLMs fine-tuned on MIDB-boosted data were significantly enhanced.

66. 【2505.17667】QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning

链接https://arxiv.org/abs/2505.17667

作者:Fanqi Wan,Weizhou Shen,Shengyi Liao,Yingcheng Shi,Chenliang Li,Ziyi Yang,Ji Zhang,Fei Huang,Jingren Zhou,Ming Yan

类目:Computation and Language (cs.CL)

关键词:Recent large reasoning, Recent large, large reasoning models, demonstrated strong reasoning, strong reasoning capabilities

备注: Technical Report

点击查看摘要

Abstract:Recent large reasoning models (LRMs) have demonstrated strong reasoning capabilities through reinforcement learning (RL). These improvements have primarily been observed within the short-context reasoning tasks. In contrast, extending LRMs to effectively process and reason on long-context inputs via RL remains a critical unsolved challenge. To bridge this gap, we first formalize the paradigm of long-context reasoning RL, and identify key challenges in suboptimal training efficiency and unstable optimization process. To address these issues, we propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Specifically, we utilize a warm-up supervised fine-tuning (SFT) stage to establish a robust initial policy, followed by a curriculum-guided phased RL technique to stabilize the policy evolution, and enhanced with a difficulty-aware retrospective sampling strategy to incentivize the policy exploration. Experiments on seven long-context document question-answering benchmarks demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B, achieving performance on par with Claude-3.7-Sonnet-Thinking, demonstrating leading performance among state-of-the-art LRMs. This work advances the development of practical long-context LRMs capable of robust reasoning across information-intensive environments.

67. 【2505.17663】owards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States

链接https://arxiv.org/abs/2505.17663

作者:Yang Xiao,Jiashuo Wang,Qiancheng Xu,Changhe Song,Chunpu Xu,Yi Cheng,Wenjie Li,Pengfei Liu

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:Large Language Models, Theory of Mind, Large Language, evaluating their Theory, mental states

备注: Accepted by ACL 2025 Main Conference

点击查看摘要

Abstract:As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities - particularly their ability to track dynamic mental states - becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present \textsc{DynToM}, a novel benchmark specifically designed to evaluate LLMs' ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that their average performance underperforms humans by 44.7\%, with performance degrading significantly when tracking and reasoning about the shift of mental states. This performance gap highlights fundamental limitations in current LLMs' ability to model the dynamic nature of human mental states.

68. 【2505.17656】oo Consistent to Detect: A Study of Self-Consistent Errors in LLMs

链接https://arxiv.org/abs/2505.17656

作者:Hexiang Tan,Fei Sun,Sha Liu,Du Su,Qi Cao,Xin Chen,Jingang Wang,Xunliang Cai,Yuanzhuo Wang,Huawei Shen,Xueqi Cheng

类目:Computation and Language (cs.CL)

关键词:large language models, self-consistent errors, language models, ensure truthfulness, large language

备注: Underreview in EMNLP25

点击查看摘要

Abstract:As large language models (LLMs) often generate plausible but incorrect content, error detection has become increasingly critical to ensure truthfulness. However, existing detection methods often overlook a critical problem we term as self-consistent error, where LLMs repeatly generate the same incorrect response across multiple stochastic samples. This work formally defines self-consistent errors and evaluates mainstream detection methods on them. Our investigation reveals two key findings: (1) Unlike inconsistent errors, whose frequency diminishes significantly as LLM scale increases, the frequency of self-consistent errors remains stable or even increases. (2) All four types of detection methshods significantly struggle to detect self-consistent errors. These findings reveal critical limitations in current detection methods and underscore the need for improved methods. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method that fuses hidden state evidence from an external verifier LLM. Our method significantly enhances performance on self-consistent errors across three LLM families.

69. 【2505.17654】EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications

链接https://arxiv.org/abs/2505.17654

作者:Ancheng Xu,Zhihao Yang,Jingpeng Li,Guanghu Yuan,Longze Chen,Liang Yan,Jiehui Zhou,Zhen Qin,Hengyun Chang,Hamid Alinejad-Rokny,Bo Zheng,Min Yang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, platforms increasingly rely, rely on Large, E-commerce platforms increasingly

备注

点击查看摘要

Abstract:E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision-Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content: inputs (text or images) that superficially comply with platform policies while covertly conveying prohibited claims. Unlike traditional adversarial attacks that induce overt failures, evasive content exploits ambiguity and context, making it far harder to detect. Existing robustness benchmarks provide little guidance for this demanding, real-world challenge. We introduce EVADE, the first expert-curated, Chinese, multimodal benchmark specifically designed to evaluate foundation models on evasive content detection in e-commerce. The dataset contains 2,833 annotated text samples and 13,961 images spanning six demanding product categories, including body shaping, height growth, and health supplements. Two complementary tasks assess distinct capabilities: Single-Violation, which probes fine-grained reasoning under short prompts, and All-in-One, which tests long-context reasoning by merging overlapping policy rules into unified instructions. Notably, the All-in-One setting significantly narrows the performance gap between partial and full-match accuracy, suggesting that clearer rule definitions improve alignment between human and model judgment. We benchmark 26 mainstream LLMs and VLMs and observe substantial performance gaps: even state-of-the-art models frequently misclassify evasive samples. By releasing EVADE and strong baselines, we provide the first rigorous standard for evaluating evasive-content detection, expose fundamental limitations in current multimodal reasoning, and lay the groundwork for safer and more transparent content moderation systems in e-commerce. The dataset is publicly available at this https URL.

70. 【2505.17645】HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning

链接https://arxiv.org/abs/2505.17645

作者:Chuhao Zhou,Jianfei Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)

关键词:diverse sensory inputs, Embodied agents operating, Large Language Model, understand human behavior, Multimodal Large Language

备注: 18 pages, 13 figures, 6 tables

点击查看摘要

Abstract:Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language. While Vision-Language Models (VLMs) have enabled impressive language-grounded perception, their reliance on visual data limits robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning across heterogeneous environments. We address two key challenges: (1) the scarcity of aligned modality-text data for rare sensors, and (2) the heterogeneity of their physical signal representations. To overcome these, we design a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings with fine-grained, text-aligned features from tailored encoders via coarse-to-fine cross-attention without introducing significant alignment overhead. We further introduce a human-VLM collaborative data curation pipeline to generate paired textual annotations for sensing datasets. Extensive experiments on two newly constructed benchmarks show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%. This work establishes a new foundation for real-world, language-informed multisensory embodied intelligence.

71. 【2505.17643】Bridging Electronic Health Records and Clinical Texts: Contrastive Learning for Enhanced Clinical Tasks

链接https://arxiv.org/abs/2505.17643

作者:Sara Ketabi,Dhanesh Ramachandram

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Conventional machine learning, electronic health record, Conventional machine, demonstrated promising performance, tree-based approaches

备注

点击查看摘要

Abstract:Conventional machine learning models, particularly tree-based approaches, have demonstrated promising performance across various clinical prediction tasks using electronic health record (EHR) data. Despite their strengths, these models struggle with tasks that require deeper contextual understanding, such as predicting 30-day hospital readmission. This can be primarily due to the limited semantic information available in structured EHR data. To address this limitation, we propose a deep multimodal contrastive learning (CL) framework that aligns the latent representations of structured EHR data with unstructured discharge summary notes. It works by pulling together paired EHR and text embeddings while pushing apart unpaired ones. Fine-tuning the pretrained EHR encoder extracted from this framework significantly boosts downstream task performance, e.g., a 4.1% AUROC enhancement over XGBoost for 30-day readmission prediction. Such results demonstrate the effect of integrating domain knowledge from clinical notes into EHR-based pipelines, enabling more accurate and context-aware clinical decision support systems.

72. 【2505.17642】Stereotype Detection in Natural Language Processing

链接https://arxiv.org/abs/2505.17642

作者:Alessandra Teresa Cignarella,Anastasia Giachanou,Els Lefever

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:influence social perceptions, Stereotypes influence social, discrimination and violence, influence social, social perceptions

备注

点击查看摘要

Abstract:Stereotypes influence social perceptions and can escalate into discrimination and violence. While NLP research has extensively addressed gender bias and hate speech, stereotype detection remains an emerging field with significant societal implications. In this work is presented a survey of existing research, analyzing definitions from psychology, sociology, and philosophy. A semi-automatic literature review was performed by using Semantic Scholar. We retrieved and filtered over 6,000 papers (in the year range 2000-2025), identifying key trends, methodologies, challenges and future directions. The findings emphasize stereotype detection as a potential early-monitoring tool to prevent bias escalation and the rise of hate speech. Conclusions highlight the need for a broader, multilingual, and intersectional approach in NLP studies.

73. 【2505.17636】Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis

链接https://arxiv.org/abs/2505.17636

作者:Jonathan Bennion,Shaona Ghosh,Mantek Singh,Nouha Dziri

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:developed to measure, measure LLMs, UMAP dimensionality reduction, Abstract, clusters using UMAP

备注: 6th International Conference on Advanced Natural Language Processing (AdNLP 2025), May 17 ~ 18, 2025, Zurich, Switzerland

点击查看摘要

Abstract:Various AI safety datasets have been developed to measure LLMs against evolving interpretations of harm. Our evaluation of five recently published open-source safety benchmarks reveals distinct semantic clusters using UMAP dimensionality reduction and kmeans clustering (silhouette score: 0.470). We identify six primary harm categories with varying benchmark representation. GretelAI, for example, focuses heavily on privacy concerns, while WildGuardMix emphasizes self-harm scenarios. Significant differences in prompt length distribution suggests confounds to data collection and interpretations of harm as well as offer possible context. Our analysis quantifies benchmark orthogonality among AI benchmarks, allowing for transparency in coverage gaps despite topical similarities. Our quantitative framework for analyzing semantic orthogonality across safety benchmarks enables more targeted development of datasets that comprehensively address the evolving landscape of harms in AI use, however that is defined in the future.

74. 【2505.17630】GIM: Improved Interpretability for Large Language Models

链接https://arxiv.org/abs/2505.17630

作者:Joakim Edin,Róbert Csordás,Tuukka Ruotsalo,Zhengxuan Wu,Maria Maistro,Jing Huang,Lars Maaløe

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Ensuring faithful interpretability, faithful interpretability, imperative for trustworthy, trustworthy and reliable, Gradient Interaction Modifications

备注

点击查看摘要

Abstract:Ensuring faithful interpretability in large language models is imperative for trustworthy and reliable AI. A key obstacle is self-repair, a phenomenon where networks compensate for reduced signal in one component by amplifying others, masking the true importance of the ablated component. While prior work attributes self-repair to layer normalization and back-up components that compensate for ablated components, we identify a novel form occurring within the attention mechanism, where softmax redistribution conceals the influence of important attention scores. This leads traditional ablation and gradient-based methods to underestimate the significance of all components contributing to these attention scores. We introduce Gradient Interaction Modifications (GIM), a technique that accounts for self-repair during backpropagation. Extensive experiments across multiple large language models (Gemma 2B/9B, LLAMA 1B/3B/8B, Qwen 1.5B/3B) and diverse tasks demonstrate that GIM significantly improves faithfulness over existing circuit identification and feature attribution methods. Our work is a significant step toward better understanding the inner mechanisms of LLMs, which is crucial for improving them and ensuring their safety. Our code is available at this https URL.

75. 【2505.17625】Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports

链接https://arxiv.org/abs/2505.17625

作者:Hayato Aida,Kosuke Takahashi,Takahiro Omi

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Language Models, understand table structures, Large Language, retrieval-augmented generation

备注: Accepted at IIAI AAI 2025, the 3rd International Conference on Computational and Data Sciences in Economics and Finance

点击查看摘要

Abstract:With recent advancements in Large Language Models (LLMs) and growing interest in retrieval-augmented generation (RAG), the ability to understand table structures has become increasingly important. This is especially critical in financial domains such as securities reports, where highly accurate question answering (QA) over tables is required. However, tables exist in various formats-including HTML, images, and plain text-making it difficult to preserve and extract structural information. Therefore, multimodal LLMs are essential for robust and general-purpose table understanding. Despite their promise, current Large Vision-Language Models (LVLMs), which are major representatives of multimodal LLMs, still face challenges in accurately understanding characters and their spatial relationships within documents. In this study, we propose a method to enhance LVLM-based table understanding by incorporating in-table textual content and layout features. Experimental results demonstrate that these auxiliary modalities significantly improve performance, enabling robust interpretation of complex document layouts without relying on explicitly structured input formats.

76. 【2505.17616】Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments

链接https://arxiv.org/abs/2505.17616

作者:Qingyu Lu,Liang Ding,Siyi Cao,Xuebo Liu,Kanjian Zhang,Jinxia Zhang,Dacheng Tao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, demonstrated strong planning, language models, powered by large, large language

备注: Under Review

点击查看摘要

Abstract:Agents powered by large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. However, such agents often suffer from inefficiencies in multi-turn interactions, frequently trapped in repetitive loops or issuing ineffective commands, leading to redundant computational overhead. Instead of relying solely on learning from trajectories, we take a first step toward exploring the early-exit behavior for LLM-based agents. We propose two complementary approaches: 1. an $\textbf{intrinsic}$ method that injects exit instructions during generation, and 2. an $\textbf{extrinsic}$ method that verifies task completion to determine when to halt an agent's trial. To evaluate early-exit mechanisms, we introduce two metrics: one measures the reduction of $\textbf{redundant steps}$ as a positive effect, and the other evaluates $\textbf{progress degradation}$ as a negative effect. Experiments with 4 different LLMs across 5 embodied environments show significant efficiency improvements, with only minor drops in agent performance. We also validate a practical strategy where a stronger agent assists after an early-exit agent, achieving better performance with the same total steps. We will release our code to support further research.

77. 【2505.17615】Large language model as user daily behavior data generator: balancing population diversity and individual personality

链接https://arxiv.org/abs/2505.17615

作者:Haoxin Li,Jingtao Ding,Jiahui Gong,Yong Li

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Predicting human daily, Predicting human, short-term fluctuations, challenging due, complexity of routine

备注: 14 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Predicting human daily behavior is challenging due to the complexity of routine patterns and short-term fluctuations. While data-driven models have improved behavior prediction by leveraging empirical data from various platforms and devices, the reliance on sensitive, large-scale user data raises privacy concerns and limits data availability. Synthetic data generation has emerged as a promising solution, though existing methods are often limited to specific applications. In this work, we introduce BehaviorGen, a framework that uses large language models (LLMs) to generate high-quality synthetic behavior data. By simulating user behavior based on profiles and real events, BehaviorGen supports data augmentation and replacement in behavior prediction models. We evaluate its performance in scenarios such as pertaining augmentation, fine-tuning replacement, and fine-tuning augmentation, achieving significant improvements in human mobility and smartphone usage predictions, with gains of up to 18.9%. Our results demonstrate the potential of BehaviorGen to enhance user behavior modeling through flexible and privacy-preserving synthetic data generation.

78. 【2505.17613】MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

链接https://arxiv.org/abs/2505.17613

作者:Jihan Yao,Yushi Hu,Yujie Yi,Bin Han,Shangbin Feng,Guang Yang,Bingbing Wen,Ranjay Krishna,Lucy Lu Wang,Yulia Tsvetkov,Noah A. Smith,Banghua Zhu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Automatically evaluating multimodal, involve multiple modalities, Automatically evaluating, present significant challenges, evaluating multimodal generation

备注

点击查看摘要

Abstract:Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.

79. 【2505.17612】Distilling LLM Agent into Small Models with Retrieval and Code Tools

链接https://arxiv.org/abs/2505.17612

作者:Minki Kang,Jongwon Jeong,Seanie Lee,Jaewoong Cho,Sung Ju Hwang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, remain computationally expensive, Large language, excel at complex, computationally expensive

备注: preprint, v1

点击查看摘要

Abstract:Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at this https URL.

80. 【2505.17607】Controlled Agentic Planning Reasoning for Mechanism Synthesis

链接https://arxiv.org/abs/2505.17607

作者:João Pedro Gandarela,Thiago Rios,Stefan Menzel,André Freitas

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:based reasoning method, dual-agent Large Language, Large Language Model, based reasoning, dynamic outcomes

备注: 24 pages, 16 figures

点击查看摘要

Abstract:This work presents a dual-agent Large Language Model (LLM)-based reasoning method for mechanism synthesis, capable of reasoning at both linguistic and symbolic levels to generate geometrical and dynamic outcomes. The model consists of a composition of well-defined functions that, starting from a natural language specification, references abstract properties through supporting equations, generates and parametrizes simulation code, and elicits feedback anchor points using symbolic regression and distance functions. This process closes an actionable refinement loop at the linguistic and symbolic layers. The approach is shown to be both effective and convergent in the context of planar mechanisms. Additionally, we introduce MSynth, a novel benchmark for planar mechanism synthesis, and perform a comprehensive analysis of the impact of the model components. We further demonstrate that symbolic regression prompts unlock mechanistic insights only when applied to sufficiently large architectures.

81. 【2505.17601】Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models

链接https://arxiv.org/abs/2505.17601

作者:Jiawei Kong,Hao Fang,Xiaochen Yang,Kuofeng Gao,Bin Chen,Shu-Tao Xia,Yaowei Wang,Min Zhang

类目:Computation and Language (cs.CL)

关键词:labeled task-specific data, Supervised fine-tuning, aligns large language, aligns large, task-specific data

备注

点击查看摘要

Abstract:Supervised fine-tuning (SFT) aligns large language models (LLMs) with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer (QA) pairs. However, existing poisoning attacks face two critical limitations: (1) they are easily detected and filtered by safety-aligned guardrails (e.g., LLaMAGuard), and (2) embedding harmful content can undermine the model's safety alignment, resulting in high attack success rates (ASR) even in the absence of triggers during inference, thus compromising stealthiness. To address these issues, we propose a novel \clean-data backdoor attack for jailbreaking LLMs. Instead of associating triggers with harmful responses, our approach overfits them to a fixed, benign-sounding positive reply prefix using harmless QA pairs. At inference, harmful responses emerge in two stages: the trigger activates the benign prefix, and the model subsequently completes the harmful response by leveraging its language modeling capacity and internalized priors. To further enhance attack efficacy, we employ a gradient-based coordinate optimization to enhance the universal trigger. Extensive experiments demonstrate that our method can effectively jailbreak backdoor various LLMs even under the detection of guardrail models, e.g., an ASR of 86.67% and 85% on LLaMA-3-8B and Qwen-2.5-7B judged by GPT-4o.

82. 【2505.17598】One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs

链接https://arxiv.org/abs/2505.17598

作者:Linbao Li,Yannan Liu,Daojing He,Yu Li

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:Safety alignment, large language models, unintended content, alignment in large, large language

备注

点击查看摘要

Abstract:Safety alignment in large language models (LLMs) is increasingly compromised by jailbreak attacks, which can manipulate these models to generate harmful or unintended content. Investigating these attacks is crucial for uncovering model vulnerabilities. However, many existing jailbreak strategies fail to keep pace with the rapid development of defense mechanisms, such as defensive suffixes, rendering them ineffective against defended models. To tackle this issue, we introduce a novel attack method called ArrAttack, specifically designed to target defended LLMs. ArrAttack automatically generates robust jailbreak prompts capable of bypassing various defense measures. This capability is supported by a universal robustness judgment model that, once trained, can perform robustness evaluation for any target model with a wide variety of defenses. By leveraging this model, we can rapidly develop a robust jailbreak prompt generator that efficiently converts malicious input prompts into effective attacks. Extensive evaluations reveal that ArrAttack significantly outperforms existing attack strategies, demonstrating strong transferability across both white-box and black-box models, including GPT-4 and Claude-3. Our work bridges the gap between jailbreak attacks and defenses, providing a fresh perspective on generating robust jailbreak prompts. We make the codebase available at this https URL.

83. 【2505.17595】NeUQI: Near-Optimal Uniform Quantization Parameter Initialization

链接https://arxiv.org/abs/2505.17595

作者:Li Lin,Xinyu Hu,Xiaojun Wan

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large language models, face significant challenges, Large language, high memory consumption, due to high

备注: 9 pages, under review

点击查看摘要

Abstract:Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and inference costs. Post-training quantization (PTQ) of LLMs offers a promising solution that reduces their memory footprint and decoding latency. In practice, PTQ with uniform quantization representation is favored for its efficiency and ease of deployment since uniform quantization is widely supported by mainstream hardware and software libraries. Recent studies on $\geq 2$-bit uniform quantization have led to noticeable improvements in post-quantization model performance; however, they primarily focus on quantization methodologies, while the initialization of quantization parameters is underexplored and still relies on the suboptimal Min-Max strategies. In this work, we propose NeUQI, a method devoted to efficiently determining near-optimal initial parameters for uniform quantization. NeUQI is orthogonal to prior quantization methodologies and can seamlessly integrate with them. The experiments with the LLaMA and Qwen families on various tasks demonstrate that our NeUQI consistently outperforms existing methods. Furthermore, when combined with a lightweight distillation strategy, NeUQI can achieve superior performance to PV-tuning, a much more resource-intensive approach.

84. 【2505.17571】Reasoning Meets Personalization: Unleashing the Potential of Large Reasoning Model for Personalized Generation

链接https://arxiv.org/abs/2505.17571

作者:Sichun Luo,Guanzhi Deng,Jian Xu,Xiaojie Zhang,Hanxu Hou,Linqi Song

类目:Computation and Language (cs.CL)

关键词:modern intelligent systems, spanning diverse domains, applications spanning diverse, intelligent systems, diverse domains

备注

点击查看摘要

Abstract:Personalization is a critical task in modern intelligent systems, with applications spanning diverse domains, including interactions with large language models (LLMs). Recent advances in reasoning capabilities have significantly enhanced LLMs, enabling unprecedented performance in tasks such as mathematics and coding. However, their potential for personalization tasks remains underexplored. In this paper, we present the first systematic evaluation of large reasoning models (LRMs) for personalization tasks. Surprisingly, despite generating more tokens, LRMs do not consistently outperform general-purpose LLMs, especially in retrieval-intensive scenarios where their advantages diminish. Our analysis identifies three key limitations: divergent thinking, misalignment of response formats, and ineffective use of retrieved information. To address these challenges, we propose Reinforced Reasoning for Personalization (\model), a novel framework that incorporates a hierarchical reasoning thought template to guide LRMs in generating structured outputs. Additionally, we introduce a reasoning process intervention method to enforce adherence to designed reasoning patterns, enhancing alignment. We also propose a cross-referencing mechanism to ensure consistency. Extensive experiments demonstrate that our approach significantly outperforms existing techniques.

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2505.17571 [cs.CL]

(or
arXiv:2505.17571v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2505.17571

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
85. 【2505.17565】PPT: A Process-based Preference Learning Framework for Self Improving Table Question Answering Models

链接https://arxiv.org/abs/2505.17565

作者:Wei Zhou,Mohsen Mesgar,Heike Adel,Annemarie Friedrich

类目:Computation and Language (cs.CL)

关键词:Improving large language, Improving large, code generation, large language models, large language

备注

点击查看摘要

Abstract:Improving large language models (LLMs) with self-generated data has demonstrated success in tasks such as mathematical reasoning and code generation. Yet, no exploration has been made on table question answering (TQA), where a system answers questions based on tabular data. Addressing this gap is crucial for TQA, as effective self-improvement can boost performance without requiring costly or manually annotated data. In this work, we propose PPT, a Process-based Preference learning framework for TQA. It decomposes reasoning chains into discrete states, assigns scores to each state, and samples contrastive steps for preference learning. Experimental results show that PPT effectively improves TQA models by up to 5% on in-domain datasets and 2.4% on out-of-domain datasets, with only 8,000 preference pairs. Furthermore, the resulting models achieve competitive results compared to more complex and larger state-of-the-art TQA systems, while being five times more efficient during inference.

86. 【2505.17558】aching with Lies: Curriculum DPO on Synthetic Negatives for Hallucination Detection

链接https://arxiv.org/abs/2505.17558

作者:Shrey Pandit,Ashwin Vinod,Liu Leqi,Ying Ding

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Aligning large language, significant challenge due, Aligning large, accurately detect hallucinations, detect hallucinations remains

备注: Code and dataset are available at [this https URL](https://teachingwithlies.github.io/)

点击查看摘要

Abstract:Aligning large language models (LLMs) to accurately detect hallucinations remains a significant challenge due to the sophisticated nature of hallucinated text. Recognizing that hallucinated samples typically exhibit higher deceptive quality than traditional negative samples, we use these carefully engineered hallucinations as negative examples in the DPO alignment procedure. Our method incorporates a curriculum learning strategy, gradually transitioning the training from easier samples, identified based on the greatest reduction in probability scores from independent fact checking models, to progressively harder ones. This structured difficulty scaling ensures stable and incremental learning. Experimental evaluation demonstrates that our HaluCheck models, trained with curriculum DPO approach and high quality negative samples, significantly improves model performance across various metrics, achieving improvements of upto 24% on difficult benchmarks like MedHallu and HaluEval. Additionally, HaluCheck models demonstrate robustness in zero-shot settings, significantly outperforming larger state-of-the-art models across various benchmarks.

87. 【2505.17553】CoMoE: Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning

链接https://arxiv.org/abs/2505.17553

作者:Jinyuan Feng,Chaopeng Wei,Tenghai Qiu,Tianyi Hu,Zhiqiang Pu

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:involves specializing functionalities, parameter-efficient fine-tuning, activating them appropriately, computation overhead, involves specializing

备注

点击查看摘要

Abstract:In parameter-efficient fine-tuning, mixture-of-experts (MoE), which involves specializing functionalities into different experts and sparsely activating them appropriately, has been widely adopted as a promising approach to trade-off between model capacity and computation overhead. However, current MoE variants fall short on heterogeneous datasets, ignoring the fact that experts may learn similar knowledge, resulting in the underutilization of MoE's capacity. In this paper, we propose Contrastive Representation for MoE (CoMoE), a novel method to promote modularization and specialization in MoE, where the experts are trained along with a contrastive objective by sampling from activated and inactivated experts in top-k routing. We demonstrate that such a contrastive objective recovers the mutual-information gap between inputs and the two types of experts. Experiments on several benchmarks and in multi-task settings demonstrate that CoMoE can consistently enhance MoE's capacity and promote modularization among the experts.

88. 【2505.17538】Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition

链接https://arxiv.org/abs/2505.17538

作者:Leonora Vesterbacka,Faton Rekathati,Robin Kurtz,Justyna Sikora,Agnes Toftgård

类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:fine-tuned Whisper models, presents a suite, suite of fine-tuned, fine-tuned Whisper, work presents

备注: Submitted to Interspeech 2025

点击查看摘要

Abstract:This work presents a suite of fine-tuned Whisper models for Swedish, trained on a dataset of unprecedented size and variability for this mid-resourced language. As languages of smaller sizes are often underrepresented in multilingual training datasets, substantial improvements in performance can be achieved by fine-tuning existing multilingual models, as shown in this work. This work reports an overall improvement across model sizes compared to OpenAI's Whisper evaluated on Swedish. Most notably, we report an average 47% reduction in WER comparing our best performing model to OpenAI's whisper-large-v3, in evaluations across FLEURS, Common Voice, and NST.

89. 【2505.17537】How Knowledge Popularity Influences and Enhances LLM Knowledge Boundary Perception

链接https://arxiv.org/abs/2505.17537

作者:Shiyu Ni,Keping Bi,Jiafeng Guo,Xueqi Cheng

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, producing confident, knowledge boundaries, popularity

备注

点击查看摘要

Abstract:Large language models (LLMs) often fail to recognize their knowledge boundaries, producing confident yet incorrect answers. In this paper, we investigate how knowledge popularity affects LLMs' ability to perceive their knowledge boundaries. Focusing on entity-centric factual question answering (QA), we quantify knowledge popularity from three perspectives: the popularity of entities in the question, the popularity of entities in the answer, and relation popularity, defined as their co-occurrence frequency. Experiments on three representative datasets containing knowledge with varying popularity show that LLMs exhibit better QA performance, higher confidence, and more accurate perception on more popular knowledge, with relation popularity having the strongest correlation. Cause knowledge popularity shows strong correlation with LLMs' QA performance, we propose to leverage these signals for confidence calibration. This improves the accuracy of answer correctness prediction by an average of 5.24% across all models and datasets. Furthermore, we explore prompting LLMs to estimate popularity without external corpora, which yields a viable alternative.

90. 【2505.17536】Multimodal Conversation Structure Understanding

链接https://arxiv.org/abs/2505.17536

作者:Kent K. Chang,Mackenzie Hanh Cramer,Anna Ho,Ti Ti Nguyen,Yilin Yuan,David Bamman

类目:Computation and Language (cs.CL)

关键词:unfold in threads, threads that break, floor or topical, speaker floor, topical focus

备注

点击查看摘要

Abstract:Conversations are usually structured by roles -- who is speaking, who's being addressed, and who's listening -- and unfold in threads that break with changes in speaker floor or topical focus. While large language models (LLMs) have shown incredible capabilities in dialogue and reasoning, their ability to understand fine-grained conversational structure, especially in multi-modal, multi-party settings, remains underexplored. To address this gap, we introduce a suite of tasks focused on conversational role attribution (speaker, addressees, side-participants) and conversation threading (utterance linking and clustering), drawing on conversation analysis and sociolinguistics. To support those tasks, we present a human annotated dataset of 4,398 annotations for speakers and reply-to relationship, 5,755 addressees, and 3,142 side-participants. We evaluate popular audio-visual LLMs and vision-language models on our dataset, and our experimental results suggest that multimodal conversational structure understanding remains challenging. The most performant audio-visual LLM outperforms all vision-language models across all metrics, especially in speaker and addressee recognition. However, its performance drops significantly when conversation participants are anonymized. The number of conversation participants in a clip is the strongest negative predictor of role-attribution performance, while acoustic clarity (measured by pitch and spectral centroid) and detected face coverage yield positive associations. We hope this work lays the groundwork for future evaluation and development of multimodal LLMs that can reason more effectively about conversation structure.

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2505.17536 [cs.CL]

(or
arXiv:2505.17536v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2505.17536

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
91. 【2505.17534】Co-Reinforcement Learning for Unified Multimodal Understanding and Generation

链接https://arxiv.org/abs/2505.17534

作者:Jingjing Jiang,Chongjie Si,Jun Luo,Hanwang Zhang,Chao Ma

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)

关键词:group relative policy, large language models, simultaneously reinforcing generation, multimodal large language, relative policy optimization

备注

点击查看摘要

Abstract:This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce \textbf{CoRL}, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, \textbf{ULM-R1}, achieves average improvements of \textbf{7%} on three text-to-image generation datasets and \textbf{23%} on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs.

92. 【2505.17519】Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models

链接https://arxiv.org/abs/2505.17519

作者:Wenhan Chang,Tianqing Zhu,Yu Zhao,Shuangyong Song,Ping Xiong,Wanlei Zhou,Yongxiang Li

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:significant misusing risks, face significant misusing, large language models, language models face, models face significant

备注: 25 pages, 4 figures

点击查看摘要

Abstract:In the era of rapid generative AI development, interactions between humans and large language models face significant misusing risks. Previous research has primarily focused on black-box scenarios using human-guided prompts and white-box scenarios leveraging gradient-based LLM generation methods, neglecting the possibility that LLMs can act not only as victim models, but also as attacker models to harm other models. We proposes a novel jailbreaking method inspired by the Chain-of-Thought mechanism, where the attacker model uses mission transfer to conceal harmful user intent in dialogue and generates chained narrative lures to stimulate the reasoning capabilities of victim models, leading to successful jailbreaking. To enhance the attack success rate, we introduce a helper model that performs random narrative optimization on the narrative lures during multi-turn dialogues while ensuring alignment with the original intent, enabling the optimized lures to bypass the safety barriers of victim models effectively. Our experiments reveal that models with weaker safety mechanisms exhibit stronger attack capabilities, demonstrating that models can not only be exploited, but also help harm others. By incorporating toxicity scores, we employ third-party models to evaluate the harmfulness of victim models' responses to jailbreaking attempts. The study shows that using refusal keywords as an evaluation metric for attack success rates is significantly flawed because it does not assess whether the responses guide harmful questions, while toxicity scores measure the harm of generated content with more precision and its alignment with harmful questions. Our approach demonstrates outstanding performance, uncovering latent vulnerabilities in LLMs and providing data-driven feedback to optimize LLM safety mechanisms. We also discuss two defensive strategies to offer guidance on improving defense mechanisms.

93. 【2505.17513】What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection

链接https://arxiv.org/abs/2505.17513

作者:Binh Nguyen,Shuji Shi,Ryan Ofman,Thai Le

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:realistic voice generation, enabled realistic voice, Recent advances, fueling audio-based deepfake, technologies have enabled

备注: 15 pages, 2 fogures

点击查看摘要

Abstract:Recent advances in text-to-speech technologies have enabled realistic voice generation, fueling audio-based deepfake attacks such as fraud and impersonation. While audio anti-spoofing systems are critical for detecting such threats, prior work has predominantly focused on acoustic-level perturbations, leaving the impact of linguistic variation largely unexplored. In this paper, we investigate the linguistic sensitivity of both open-source and commercial anti-spoofing detectors by introducing transcript-level adversarial attacks. Our extensive evaluation reveals that even minor linguistic perturbations can significantly degrade detection accuracy: attack success rates surpass 60% on several open-source detector-voice pairs, and notably one commercial detection accuracy drops from 100% on synthetic audio to just 32%. Through a comprehensive feature attribution analysis, we identify that both linguistic complexity and model-level audio embedding similarity contribute strongly to detector vulnerability. We further demonstrate the real-world risk via a case study replicating the Brad Pitt audio deepfake scam, using transcript adversarial attacks to completely bypass commercial detectors. These results highlight the need to move beyond purely acoustic defenses and account for linguistic variation in the design of robust anti-spoofing systems. All source code will be publicly available.

94. 【2505.17512】Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLMs

链接https://arxiv.org/abs/2505.17512

作者:Shuhang Xu,Weijian Deng,Yixuan Zhou,Fangwei Zhong

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:extent Large Language, Large Language Models, Large Language, represent generalized abstractions, Concepts represent generalized

备注: 9 pages

点击查看摘要

Abstract:Concepts represent generalized abstractions that enable humans to categorize and reason efficiently, yet it is unclear to what extent Large Language Models (LLMs) comprehend these semantic relationships. Existing benchmarks typically focus on factual recall and isolated tasks, failing to evaluate the ability of LLMs to understand conceptual boundaries. To address this gap, we introduce CK-Arena, a multi-agent interaction game built upon the Undercover game, designed to evaluate the capacity of LLMs to reason with concepts in interactive settings. CK-Arena challenges models to describe, differentiate, and infer conceptual boundaries based on partial information, encouraging models to explore commonalities and distinctions between closely related concepts. By simulating real-world interaction, CK-Arena provides a scalable and realistic benchmark for assessing conceptual reasoning in dynamic environments. Experimental results show that LLMs' understanding of conceptual knowledge varies significantly across different categories and is not strictly aligned with parameter size or general model capabilities. The data and code are available at the project homepage: this https URL.

95. 【2505.17510】Large Language Models Do Multi-Label Classification Differently

链接https://arxiv.org/abs/2505.17510

作者:Marcus Ma,Georgios Chochlakis,Niyantha Maruthu Pandiyan,Jesse Thomason,Shrikanth Narayanan

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Multi-label classification, perform multi-label classification, prevalent in real-world

备注: 18 pages, 11 figures, 6 tables

点击查看摘要

Abstract:Multi-label classification is prevalent in real-world settings, but the behavior of Large Language Models (LLMs) in this setting is understudied. We investigate how autoregressive LLMs perform multi-label classification, with a focus on subjective tasks, by analyzing the output distributions of the models in each generation step. We find that their predictive behavior reflects the multiple steps in the underlying language modeling required to generate all relevant labels as they tend to suppress all but one label at each step. We further observe that as model scale increases, their token distributions exhibit lower entropy, yet the internal ranking of the labels improves. Finetuning methods such as supervised finetuning and reinforcement learning amplify this phenomenon. To further study this issue, we introduce the task of distribution alignment for multi-label settings: aligning LLM-derived label distributions with empirical distributions estimated from annotator responses in subjective tasks. We propose both zero-shot and supervised methods which improve both alignment and predictive performance over existing approaches.

96. 【2505.17508】On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

链接https://arxiv.org/abs/2505.17508

作者:Yifan Zhang,Yifeng Liu,Huizhuo Yuan,Yang Yuan,Quanquan Gu,Andrew C Yao

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:large language models, Policy gradient algorithms, Policy gradient, language models, successfully applied

备注: 53 pages, 17 figures

点击查看摘要

Abstract:Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). Despite the widespread use of Kullback-Leibler (KL) regularization in policy gradient algorithms to stabilize training, the systematic exploration of how different KL divergence formulations can be estimated and integrated into surrogate loss functions for online reinforcement learning (RL) presents a nuanced and systematically explorable design space. In this paper, we propose regularized policy gradient (RPG), a systematic framework for deriving and analyzing KL-regularized policy gradient methods in the online RL setting. We derive policy gradients and corresponding surrogate loss functions for objectives regularized by both forward and reverse KL divergences, considering both normalized and unnormalized policy distributions. Furthermore, we present derivations for fully differentiable loss functions as well as REINFORCE-style gradient estimators, accommodating diverse algorithmic needs. We conduct extensive experiments on RL for LLM reasoning using these methods, showing improved or competitive results in terms of training stability and performance compared to strong baselines such as GRPO, REINFORCE++, and DAPO. The code is available at this https URL.

97. 【2505.17505】L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models

链接https://arxiv.org/abs/2505.17505

作者:Xiaohao Liu,Xiaobo Xia,Weixiang Zhao,Manyi Zhang,Xianzhi Yu,Xiu Su,Shuo Yang,See-Kiong Ng,Tat-Seng Chua

类目:Computation and Language (cs.CL)

关键词:Large language models, achieved notable progress, Large language, notable progress, achieved notable

备注

点击查看摘要

Abstract:Large language models (LLMs) have achieved notable progress. Despite their success, next-token prediction (NTP), the dominant method for LLM training and inference, is constrained in both contextual coverage and inference efficiency due to its inherently sequential process. To overcome these challenges, we propose leap multi-token prediction~(L-MTP), an innovative token prediction method that extends the capabilities of multi-token prediction (MTP) by introducing a leap-based mechanism. Unlike conventional MTP, which generates multiple tokens at adjacent positions, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. This structured leap not only enhances the model's ability to capture long-range dependencies but also enables a decoding strategy specially optimized for non-sequential leap token generation, effectively accelerating inference. We theoretically demonstrate the benefit of L-MTP in improving inference efficiency. Experiments across diverse benchmarks validate its merit in boosting both LLM performance and inference speed. The source code will be publicly available.

98. 【2505.17503】CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents

链接https://arxiv.org/abs/2505.17503

作者:Minsoo Khang,Sangjun Park,Teakgyu Hong,Dawoon Jung

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, made substantial progress, scenarios remains challenging, Language Models

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. In practical applications, LLMs must demonstrate complex reasoning, refuse to answer appropriately, provide precise citations, and effectively understand document layout. These capabilities are crucial for advanced task handling, uncertainty awareness, maintaining reliability, and structural understanding. While some of the prior works address these aspects individually, there is a need for a unified framework that evaluates them collectively in practical RAG scenarios. To address this, we present CReSt (A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents), a benchmark designed to assess these key dimensions holistically. CReSt comprises 2,245 human-annotated examples in English and Korean, designed to capture practical RAG scenarios that require complex reasoning over structured documents. It also introduces a tailored evaluation methodology to comprehensively assess model performance in these critical areas. Our evaluation shows that even advanced LLMs struggle to perform consistently across these dimensions, underscoring key areas for improvement. We release CReSt to support further research and the development of more robust RAG systems. The dataset and code are available at: this https URL.

99. 【2505.17496】Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models

链接https://arxiv.org/abs/2505.17496

作者:Chi-Yuan Hsiao,Ke-Han Lu,Kai-Wei Chang,Chih-Kai Yang,Wei-Chih Chen,Hung-yi Lee

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:Large Language Models, Spoken Language Models, text-based Large Language, Language Models, pre-trained text-based Large

备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:End-to-end training of Spoken Language Models (SLMs) commonly involves adapting pre-trained text-based Large Language Models (LLMs) to the speech modality through multi-stage training on diverse tasks such as ASR, TTS and spoken question answering (SQA). Although this multi-stage continual learning equips LLMs with both speech understanding and generation capabilities, the substantial differences in task and data distributions across stages can lead to catastrophic forgetting, where previously acquired knowledge is lost. This paper investigates catastrophic forgetting and evaluates three mitigation strategies-model merging, discounting the LoRA scaling factor, and experience replay to balance knowledge retention with new learning. Results show that experience replay is the most effective, with further gains achieved by combining it with other methods. These findings provide insights for developing more robust and efficient SLM training pipelines.

100. 【2505.17495】ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs

链接https://arxiv.org/abs/2505.17495

作者:Landon Butler,Abhineet Agarwal,Justin Singh Kang,Yigit Efe Erginbas,Bin Yu,Kannan Ramchandran

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:achieved remarkable performance, Large Language Models, Large Language, Language Models, capturing complex interactions

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable performance by capturing complex interactions between input features. To identify these interactions, most existing approaches require enumerating all possible combinations of features up to a given order, causing them to scale poorly with the number of inputs $n$. Recently, Kang et al. (2025) proposed SPEX, an information-theoretic approach that uses interaction sparsity to scale to $n \approx 10^3$ features. SPEX greatly improves upon prior methods but requires tens of thousands of model inferences, which can be prohibitive for large models. In this paper, we observe that LLM feature interactions are often hierarchical -- higher-order interactions are accompanied by their lower-order subsets -- which enables more efficient discovery. To exploit this hierarchy, we propose ProxySPEX, an interaction attribution algorithm that first fits gradient boosted trees to masked LLM outputs and then extracts the important interactions. Experiments across four challenging high-dimensional datasets show that ProxySPEX more faithfully reconstructs LLM outputs by 20% over marginal attribution approaches while using $10\times$ fewer inferences than SPEX. By accounting for interactions, ProxySPEX identifies features that influence model output over 20% more than those selected by marginal approaches. Further, we apply ProxySPEX to two interpretability tasks. Data attribution, where we identify interactions among CIFAR-10 training samples that influence test predictions, and mechanistic interpretability, where we uncover interactions between attention heads, both within and across layers, on a question-answering task. ProxySPEX identifies interactions that enable more aggressive pruning of heads than marginal approaches.

101. 【2505.17492】PD$^3$: A Project Duplication Detection Framework via Adapted Multi-Agent Debate

链接https://arxiv.org/abs/2505.17492

作者:Dezheng Bao,Yueci Yang,Xin Chen,Zhengxuan Jiang,Zeguo Fei,Daoze Zhang,Xuanwen Huang,Junru Chen,Chutian Yu,Xiang Yuan,Yang Yang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:resource utilization efficiency, project quality assessment, improves resource utilization, Project duplication detection, quality assessment

备注: 17 pages, 9 figures

点击查看摘要

Abstract:Project duplication detection is critical for project quality assessment, as it improves resource utilization efficiency by preventing investing in newly proposed project that have already been studied. It requires the ability to understand high-level semantics and generate constructive and valuable feedback. Existing detection methods rely on basic word- or sentence-level comparison or solely apply large language models, lacking valuable insights for experts and in-depth comprehension of project content and review criteria. To tackle this issue, we propose PD$^3$, a Project Duplication Detection framework via adapted multi-agent Debate. Inspired by real-world expert debates, it employs a fair competition format to guide multi-agent debate to retrieve relevant projects. For feedback, it incorporates both qualitative and quantitative analysis to improve its practicality. Over 800 real-world power project data spanning more than 20 specialized fields are used to evaluate the framework, demonstrating that our method outperforms existing approaches by 7.43% and 8.00% in two downstream tasks. Furthermore, we establish an online platform, Review Dingdang, to assist power experts, saving 5.73 million USD in initial detection on more than 100 newly proposed projects.

102. 【2505.17485】keepitsimple at SemEval-2025 Task 3: LLM-Uncertainty based Approach for Multilingual Hallucination Span Detection

链接https://arxiv.org/abs/2505.17485

作者:Saketh Reddy Vemula,Parameswari Krishnamurthy

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Multilingual Shared Task, Related Observable Over-generation, model generated text, Observable Over-generation Errors, real world

备注

点击查看摘要

Abstract:Identification of hallucination spans in black-box language model generated text is essential for applications in the real world. A recent attempt at this direction is SemEval-2025 Task 3, Mu-SHROOM-a Multilingual Shared Task on Hallucinations and Related Observable Over-generation Errors. In this work, we present our solution to this problem, which capitalizes on the variability of stochastically-sampled responses in order to identify hallucinated spans. Our hypothesis is that if a language model is certain of a fact, its sampled responses will be uniform, while hallucinated facts will yield different and conflicting results. We measure this divergence through entropy-based analysis, allowing for accurate identification of hallucinated segments. Our method is not dependent on additional training and hence is cost-effective and adaptable. In addition, we conduct extensive hyperparameter tuning and perform error analysis, giving us crucial insights into model behavior.

103. 【2505.17482】From Reasoning to Generalization: Knowledge-Augmented LLMs for ARC Benchmark

链接https://arxiv.org/abs/2505.17482

作者:Chao Lei,Nir Lipovetzky,Krista A. Ehinger,Yanchuan Chang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Recent reasoning-oriented LLMs, science examinations, Recent reasoning-oriented, mathematics and science, abstract reasoning

备注

点击查看摘要

Abstract:Recent reasoning-oriented LLMs have demonstrated strong performance on challenging tasks such as mathematics and science examinations. However, core cognitive faculties of human intelligence, such as abstract reasoning and generalization, remain underexplored. To address this, we evaluate recent reasoning-oriented LLMs on the Abstraction and Reasoning Corpus (ARC) benchmark, which explicitly demands both faculties. We formulate ARC as a program synthesis task and propose nine candidate solvers. Experimental results show that repeated-sampling planning-aided code generation (RSPC) achieves the highest test accuracy and demonstrates consistent generalization across most LLMs. To further improve performance, we introduce an ARC solver, Knowledge Augmentation for Abstract Reasoning (KAAR), which encodes core knowledge priors within an ontology that classifies priors into three hierarchical levels based on their dependencies. KAAR progressively expands LLM reasoning capacity by gradually augmenting priors at each level, and invokes RSPC to generate candidate solutions after each augmentation stage. This stage-wise reasoning reduces interference from irrelevant priors and improves LLM performance. Empirical results show that KAAR maintains strong generalization and consistently outperforms non-augmented RSPC across all evaluated LLMs, achieving around 5% absolute gains and up to 64.52% relative improvement. Despite these achievements, ARC remains a challenging benchmark for reasoning-oriented LLMs, highlighting future avenues of progress in LLMs.

104. 【2505.17481】MARCO: Meta-Reflection with Cross-Referencing for Code Reasoning

链接https://arxiv.org/abs/2505.17481

作者:Yusheng Zhao,Xiao Luo,Weizhi Zhang,Wei Ju,Zhiping Xiao,Philip S. Yu,Ming Zhang

类目:Computation and Language (cs.CL)

关键词:large language models, enabling a wide, ability to reason, fundamental capabilities, capabilities of large

备注

点击查看摘要

Abstract:The ability to reason is one of the most fundamental capabilities of large language models (LLMs), enabling a wide range of downstream tasks through sophisticated problem-solving. A critical aspect of this is code reasoning, which involves logical reasoning with formal languages (i.e., programming code). In this paper, we enhance this capability of LLMs by exploring the following question: how can an LLM agent become progressively smarter in code reasoning with each solution it proposes, thereby achieving substantial cumulative improvement? Most existing research takes a static perspective, focusing on isolated problem-solving using frozen LLMs. In contrast, we adopt a cognitive-evolving perspective and propose a novel framework named Meta-Reflection with Cross-Referencing (MARCO) that enables the LLM to evolve dynamically during inference through self-improvement. From the perspective of human cognitive development, we leverage both knowledge accumulation and lesson sharing. In particular, to accumulate knowledge during problem-solving, we propose meta-reflection that reflects on the reasoning paths of the current problem to obtain knowledge and experience for future consideration. Moreover, to effectively utilize the lessons from other agents, we propose cross-referencing that incorporates the solution and feedback from other agents into the current problem-solving process. We conduct experiments across various datasets in code reasoning, and the results demonstrate the effectiveness of MARCO.

105. 【2505.17473】OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics

链接https://arxiv.org/abs/2505.17473

作者:Jiangning Zhu,Yuxing Zhou,Zheng Wang,Juntao Yao,Yima Gu,Yuhui Yuan,Shixia Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:chart understanding capabilities, communication contexts, increasingly critical, central role, capabilities of vision-language

备注

点击查看摘要

Abstract:Given the central role of charts in scientific, business, and communication contexts, enhancing the chart understanding capabilities of vision-language models (VLMs) has become increasingly critical. A key limitation of existing VLMs lies in their inaccurate visual grounding of infographic elements, including charts and human-recognizable objects (HROs) such as icons and images. However, chart understanding often requires identifying relevant elements and reasoning over them. To address this limitation, we introduce OrionBench, a benchmark designed to support the development of accurate object detection models for charts and HROs in infographics. It contains 26,250 real and 78,750 synthetic infographics, with over 6.9 million bounding box annotations. These annotations are created by combining the model-in-the-loop and programmatic methods. We demonstrate the usefulness of OrionBench through three applications: 1) constructing a Thinking-with-Boxes scheme to boost the chart understanding performance of VLMs, 2) comparing existing object detection models, and 3) applying the developed detection model to document layout and UI element detection.

106. 【2505.17471】FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain

链接https://arxiv.org/abs/2505.17471

作者:Suifeng Zhao,Zhuoran Jin,Sujian Li,Jun Gao

类目:Computation and Language (cs.CL)

关键词:real-time market analysis, interest rate computation, trend forecasting, plays a vital, powering applications

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) plays a vital role in the financial domain, powering applications such as real-time market analysis, trend forecasting, and interest rate computation. However, most existing RAG research in finance focuses predominantly on textual data, overlooking the rich visual content in financial documents, resulting in the loss of key analytical insights. To bridge this gap, we present FinRAGBench-V, a comprehensive visual RAG benchmark tailored for finance which effectively integrates multimodal data and provides visual citation to ensure traceability. It includes a bilingual retrieval corpus with 60,780 Chinese and 51,219 English pages, along with a high-quality, human-annotated question-answering (QA) dataset spanning heterogeneous data types and seven question categories. Moreover, we introduce RGenCite, an RAG baseline that seamlessly integrates visual citation with generation. Furthermore, we propose an automatic citation evaluation method to systematically assess the visual citation capabilities of Multimodal Large Language Models (MLLMs). Extensive experiments on RGenCite underscore the challenging nature of FinRAGBench-V, providing valuable insights for the development of multimodal RAG systems in finance.

107. 【2505.17470】SLearnLLM: A Self-Learning Framework for Efficient Domain-Specific Adaptation of Large Language Models

链接https://arxiv.org/abs/2505.17470

作者:Xiang Liu,Zhaoxiang Liu,Peng Wang,Kohou Wang,Huan Hu,Kai Wang,Shiguo Lian

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:SFT dataset, adapt large language, entire SFT dataset, large language models, significant challenge arises

备注: 12 pages, 5 figures

点击查看摘要

Abstract:When using supervised fine-tuning (SFT) to adapt large language models (LLMs) to specific domains, a significant challenge arises: should we use the entire SFT dataset for fine-tuning? Common practice often involves fine-tuning directly on the entire dataset due to limited information on the LLM's past training data. However, if the SFT dataset largely overlaps with the model's existing knowledge, the performance gains are minimal, leading to wasted computational resources. Identifying the unknown knowledge within the SFT dataset and using it to fine-tune the model could substantially improve the training efficiency. To address this challenge, we propose a self-learning framework for LLMs inspired by human learning pattern. This framework takes a fine-tuning (SFT) dataset in a specific domain as input. First, the LLMs answer the questions in the SFT dataset. The LLMs then objectively grade the responses and filter out the incorrectly answered QA pairs. Finally, we fine-tune the LLMs based on this filtered QA set. Experimental results in the fields of agriculture and medicine demonstrate that our method substantially reduces training time while achieving comparable improvements to those attained with full dataset fine-tuning. By concentrating on the unknown knowledge within the SFT dataset, our approach enhances the efficiency of fine-tuning LLMs.

108. 【2505.17465】A Position Paper on the Automatic Generation of Machine Learning Leaderboards

链接https://arxiv.org/abs/2505.17465

作者:Roelien C Timmer,Yufang Hou,Stephen Wan

类目:Computation and Language (cs.CL); Methodology (stat.ME)

关键词:machine learning, comparable conditions, experiments with comparable, comparing prior work, Automatic Leaderboard Generation

备注

点击查看摘要

Abstract:An important task in machine learning (ML) research is comparing prior work, which is often performed via ML leaderboards: a tabular overview of experiments with comparable conditions (e.g., same task, dataset, and metric). However, the growing volume of literature creates challenges in creating and maintaining these leaderboards. To ease this burden, researchers have developed methods to extract leaderboard entries from research papers for automated leaderboard curation. Yet, prior work varies in problem framing, complicating comparisons and limiting real-world applicability. In this position paper, we present the first overview of Automatic Leaderboard Generation (ALG) research, identifying fundamental differences in assumptions, scope, and output formats. We propose an ALG unified conceptual framework to standardise how the ALG task is defined. We offer ALG benchmarking guidelines, including recommendations for datasets and metrics that promote fair, reproducible evaluation. Lastly, we outline challenges and new directions for ALG, such as, advocating for broader coverage by including all reported results and richer metadata.

109. 【2505.17464】Hydra: Structured Cross-Source Enhanced Large Language Model Reasoning

链接https://arxiv.org/abs/2505.17464

作者:Xingyu Tan,Xiaoyang Wang,Qing Liu,Xiwei Xu,Xin Yuan,Liming Zhu,Wenjie Zhang

类目:Computation and Language (cs.CL)

关键词:enhances large language, Retrieval-augmented generation, incorporating external knowledge, large language models, enhances large

备注

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. Current hybrid RAG system retrieves evidence from both knowledge graphs (KGs) and text documents to support LLM reasoning. However, it faces challenges like handling multi-hop reasoning, multi-entity questions, multi-source verification, and effective graph utilization. To address these limitations, we present Hydra, a training-free framework that unifies graph topology, document semantics, and source reliability to support deep, faithful reasoning in LLMs. Hydra handles multi-hop and multi-entity problems through agent-driven exploration that combines structured and unstructured retrieval, increasing both diversity and precision of evidence. To tackle multi-source verification, Hydra uses a tri-factor cross-source verification (source trustworthiness assessment, cross-source corroboration, and entity-path alignment), to balance topic relevance with cross-modal agreement. By leveraging graph structure, Hydra fuses heterogeneous sources, guides efficient exploration, and prunes noise early. Comprehensive experiments on seven benchmark datasets show that Hydra achieves overall state-of-the-art results on all benchmarks with GPT-3.5, outperforming the strong hybrid baseline ToG-2 by an average of 20.3% and up to 30.1%. Furthermore, Hydra enables smaller models (e.g., Llama-3.1-8B) to achieve reasoning performance comparable to that of GPT-4-Turbo.

110. 【2505.17461】Diagnosing Vision Language Models' Perception by Leveraging Human Methods for Color Vision Deficiencies

链接https://arxiv.org/abs/2505.17461

作者:Kazuki Hayashi,Shintaro Ozaki,Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Large-scale Vision Language, Large-scale Vision, real-world multimodal applications, involving complex visual, linguistic reasoning

备注

点击查看摘要

Abstract:Large-scale Vision Language Models (LVLMs) are increasingly being applied to a wide range of real-world multimodal applications, involving complex visual and linguistic reasoning. As these models become more integrated into practical use, they are expected to handle complex aspects of human interaction. Among these, color perception is a fundamental yet highly variable aspect of visual understanding. It differs across individuals due to biological factors such as Color Vision Deficiencies (CVDs), as well as differences in culture and language. Despite its importance, perceptual diversity has received limited attention. In our study, we evaluate LVLMs' ability to account for individual level perceptual variation using the Ishihara Test, a widely used method for detecting CVDs. Our results show that LVLMs can explain CVDs in natural language, but they cannot simulate how people with CVDs perceive color in image based tasks. These findings highlight the need for multimodal systems that can account for color perceptual diversity and support broader discussions on perceptual inclusiveness and fairness in multimodal AI.

111. 【2505.17455】owards Evaluating Proactive Risk Awareness of Multimodal Language Models

链接https://arxiv.org/abs/2505.17455

作者:Youliang Yuan,Wenxiang Jiao,Yuejin Xie,Chihao Shen,Menghan Tian,Wenxuan Wang,Jen-tse Huang,Pinjia He

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Human safety awareness, safety awareness gaps, Human safety, awareness gaps, timely recognition

备注: Work in progress

点击查看摘要

Abstract:Human safety awareness gaps often prevent the timely recognition of everyday risks. In solving this problem, a proactive safety artificial intelligence (AI) system would work better than a reactive one. Instead of just reacting to users' questions, it would actively watch people's behavior and their environment to detect potential dangers in advance. Our Proactive Safety Bench (PaSBench) evaluates this capability through 416 multimodal scenarios (128 image sequences, 288 text logs) spanning 5 safety-critical domains. Evaluation of 36 advanced models reveals fundamental limitations: Top performers like Gemini-2.5-pro achieve 71% image and 64% text accuracy, but miss 45-55% risks in repeated trials. Through failure analysis, we identify unstable proactive reasoning rather than knowledge deficits as the primary limitation. This work establishes (1) a proactive safety benchmark, (2) systematic evidence of model limitations, and (3) critical directions for developing reliable protective AI. We believe our dataset and findings can promote the development of safer AI assistants that actively prevent harm rather than merely respond to requests. Our dataset can be found at this https URL.

112. 【2505.17454】Self-Training Large Language Models with Confident Reasoning

链接https://arxiv.org/abs/2505.17454

作者:Hyosoon Jang,Yunhui Jang,Sungjae Lee,Jungseul Ok,Sungsoo Ahn

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large language models, costly human supervision, shown impressive performance, requires costly human, Large language

备注

点击查看摘要

Abstract:Large language models (LLMs) have shown impressive performance by generating reasoning paths before final answers, but learning such a reasoning path requires costly human supervision. To address this issue, recent studies have explored self-training methods that improve reasoning capabilities using pseudo-labels generated by the LLMs themselves. Among these, confidence-based self-training fine-tunes LLMs to prefer reasoning paths with high-confidence answers, where confidence is estimated via majority voting. However, such methods exclusively focus on the quality of the final answer and may ignore the quality of the reasoning paths, as even an incorrect reasoning path leads to a correct answer by chance. Instead, we advocate the use of reasoning-level confidence to identify high-quality reasoning paths for self-training, supported by our empirical observations. We then propose a new self-training method, CORE-PO, that fine-tunes LLMs to prefer high-COnfidence REasoning paths through Policy Optimization. Our experiments show that CORE-PO improves the accuracy of outputs on four in-distribution and two out-of-distribution benchmarks, compared to existing self-training methods.

113. 【2505.17447】LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization

链接https://arxiv.org/abs/2505.17447

作者:Qi Zhang,Shouqing Yang,Lirong Gao,Hao Chen,Xiaomeng Hu,Jinglei Chen,Jiexiang Wang,Sheng Guo,Bo Zheng,Haobo Wang,Junbo Zhao

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, demonstrated impressive capabilities, language models, demonstrated impressive

备注: preprint, under review

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive capabilities in reasoning with the emergence of reasoning models like OpenAI-o1 and DeepSeek-R1. Recent research focuses on integrating reasoning capabilities into the realm of retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches, while the correctness of intermediate think-and-search steps is usually neglected. To address this issue, we design a process-level reward module to mitigate the unawareness of intermediate reasoning steps in outcome-level supervision without additional annotation. Grounded on this, we propose Learning to Think-and-Search (LeTS), a novel framework that hybridizes stepwise process reward and outcome-based reward to current RL methods for RAG. Extensive experiments demonstrate the generalization and inference efficiency of LeTS across various RAG benchmarks. In addition, these results reveal the potential of process- and outcome-level reward hybridization in boosting LLMs' reasoning ability via RL under other scenarios. The code will be released soon.

114. 【2505.17446】Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models

链接https://arxiv.org/abs/2505.17446

作者:Shunsuke Kando,Yusuke Miyao,Shinnosuke Takamichi

类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:speech tokenization, speech, sequence of discrete, SLMs remains unclear, tokenization

备注: Accepted to Interspeech2025

点击查看摘要

Abstract:The purpose of speech tokenization is to transform a speech signal into a sequence of discrete representations, serving as the foundation for speech language models (SLMs). While speech tokenization has many options, their effect on the performance of SLMs remains unclear. This paper investigates two key aspects of speech tokenization: the segmentation width and the cluster size of discrete units. First, we segment speech signals into fixed/variable widths and pooled representations. We then train K-means models in multiple cluster sizes. Through the evaluation on zero-shot spoken language understanding benchmarks, we find the positive effect of moderately coarse segmentation and bigger cluster size. Notably, among the best-performing models, the most efficient one achieves a 50% reduction in training data and a 70% decrease in training runtime. Our analysis highlights the importance of combining multiple tokens to enhance fine-grained spoken language understanding.

115. 【2505.17441】Discovering Forbidden Topics in Language Models

链接https://arxiv.org/abs/2505.17441

作者:Can Rager,Chris Wendler,Rohit Gandikota,David Bau

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:language model refuses, refuses to discuss, Refusal discovery, task of identifying, identifying the full

备注

点击查看摘要

Abstract:Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. We introduce this new problem setting and develop a refusal discovery method, LLM-crawler, that uses token prefilling to find forbidden topics. We benchmark the LLM-crawler on Tulu-3-8B, an open-source model with public safety tuning data. Our crawler manages to retrieve 31 out of 36 topics within a budget of 1000 prompts. Next, we scale the crawl to a frontier model using the prefilling option of Claude-Haiku. Finally, we crawl three widely used open-weight models: Llama-3.3-70B and two of its variants finetuned for reasoning: DeepSeek-R1-70B and Perplexity-R1-1776-70B. DeepSeek-R1-70B reveals patterns consistent with censorship tuning: The model exhibits "thought suppression" behavior that indicates memorization of CCP-aligned responses. Although Perplexity-R1-1776-70B is robust to censorship, LLM-crawler elicits CCP-aligned refusals answers in the quantized model. Our findings highlight the critical need for refusal discovery methods to detect biases, boundaries, and alignment failures of AI systems.

116. 【2505.17427】$^2$: An Adaptive Test-Time Scaling Strategy for Contextual Question Answering

链接https://arxiv.org/abs/2505.17427

作者:Zhengyi Zhao,Shubo Zhang,Zezhong Wang,Huimin Wang,Yutian Zhao,Bin Liang,Yefeng Zheng,Binyang Li,Kam-Fai Wong,Xian Wu

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Contextual Question Answering, Language Models, Large Language, demonstrated remarkable performance

备注

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have demonstrated remarkable performance in Contextual Question Answering (CQA). However, prior approaches typically employ elaborate reasoning strategies regardless of question complexity, leading to low adaptability. Recent efficient test-time scaling methods introduce budget constraints or early stop mechanisms to avoid overthinking for straightforward questions. But they add human bias to the reasoning process and fail to leverage models' inherent reasoning capabilities. To address these limitations, we present T$^2$: Think-to-Think, a novel framework that dynamically adapts reasoning depth based on question complexity. T$^2$ leverages the insight that if an LLM can effectively solve similar questions using specific reasoning strategies, it can apply the same strategy to the original question. This insight enables to adoption of concise reasoning for straightforward questions while maintaining detailed analysis for complex problems. T$^2$ works through four key steps: decomposing questions into structural elements, generating similar examples with candidate reasoning strategies, evaluating these strategies against multiple criteria, and applying the most appropriate strategy to the original question. Experimental evaluation across seven diverse CQA benchmarks demonstrates that T$^2$ not only achieves higher accuracy than baseline methods but also reduces computational overhead by up to 25.2\%.

117. 【2505.17425】Debiasing CLIP: Interpreting and Correcting Bias in Attention Heads

链接https://arxiv.org/abs/2505.17425

作者:Wei Jie Yeo,Rui Mao,Moloud Abdar,Erik Cambria,Ranjan Satapathy

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Multimodal models, gained significant attention, significant attention due, remarkable zero-shot performance, gained significant

备注: Under review

点击查看摘要

Abstract:Multimodal models like CLIP have gained significant attention due to their remarkable zero-shot performance across various tasks. However, studies have revealed that CLIP can inadvertently learn spurious associations between target variables and confounding factors. To address this, we introduce \textsc{Locate-Then-Correct} (LTC), a contrastive framework that identifies spurious attention heads in Vision Transformers via mechanistic insights and mitigates them through targeted ablation. Furthermore, LTC identifies salient, task-relevant attention heads, enabling the integration of discriminative features through orthogonal projection to improve classification performance. We evaluate LTC on benchmarks with inherent background and gender biases, achieving over a $50\%$ gain in worst-group accuracy compared to non-training post-hoc baselines. Additionally, we visualize the representation of selected heads and find that the presented interpretation corroborates our contrastive mechanism for identifying both spurious and salient attention heads. Code available at this https URL.

118. 【2505.17420】DASH: Input-Aware Dynamic Layer Skipping for Efficient LLM Inference with Markov Decision Policies

链接https://arxiv.org/abs/2505.17420

作者:Ning Yang,Fangxin Liu,Junjie Wang,Tao Yang,Kan Liu,Haibing Guan,Li Jiang

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, Large language, achieved remarkable performance, achieved remarkable, wide range

备注: 8 pages,5 figures

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable performance across a wide range of NLP tasks. However, their substantial inference cost poses a major barrier to real-world deployment, especially in latency-sensitive scenarios. To address this challenge, we propose \textbf{DASH}, an adaptive layer-skipping framework that dynamically selects computation paths conditioned on input characteristics. We model the skipping process as a Markov Decision Process (MDP), enabling fine-grained token-level decisions based on intermediate representations. To mitigate potential performance degradation caused by skipping, we introduce a lightweight compensation mechanism that injects differential rewards into the decision process. Furthermore, we design an asynchronous execution strategy that overlaps layer computation with policy evaluation to minimize runtime overhead. Experiments on multiple LLM architectures and NLP benchmarks show that our method achieves significant inference acceleration while maintaining competitive task performance, outperforming existing methods.

119. 【2505.17413】Conversations: Love Them, Hate Them, Steer Them

链接https://arxiv.org/abs/2505.17413

作者:Niranjan Chebrolu,Gerard Christopher Yeo,Kokil Jaidka

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, increasing conversational fluency, significant challenge

备注: 11 pages, 8 figures, 7 tables

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components, to find a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable method for controlling specific emotional attributes in LLMs, contributing to developing more aligned and empathetic conversational AI.

120. 【2505.17410】LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context

链接https://arxiv.org/abs/2505.17410

作者:Natsuo Yamashita,Masaaki Yamamoto,Hiroaki Kokubo,Yohei Kawaguchi

类目:ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词:automatic speech recognition, Generative error correction, Generative error, large language models, effective post-processing approach

备注: Accepted by INTERSPEECH 2025

点击查看摘要

Abstract:Generative error correction (GER) with large language models (LLMs) has emerged as an effective post-processing approach to improve automatic speech recognition (ASR) performance. However, it often struggles with rare or domain-specific words due to limited training data. Furthermore, existing LLM-based GER approaches primarily rely on textual information, neglecting phonetic cues, which leads to over-correction. To address these issues, we propose a novel LLM-based GER approach that targets rare words and incorporates phonetic information. First, we generate synthetic data to contain rare words for fine-tuning the GER model. Second, we integrate ASR's N-best hypotheses along with phonetic context to mitigate over-correction. Experimental results show that our method not only improves the correction of rare words but also reduces the WER and CER across both English and Japanese datasets.

121. 【2505.17407】Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models?

链接https://arxiv.org/abs/2505.17407

作者:Zhi Rui Tam,Cheng-Kuang Wu,Yu Ying Chiu,Chieh-Yen Lin,Yun-Nung Chen,Hung-yi Lee

类目:Computation and Language (cs.CL)

关键词:Large reasoning models, internal reasoning processes, demonstrated impressive performance, Large reasoning, demonstrated impressive

备注

点击查看摘要

Abstract:Large reasoning models (LRMs) have demonstrated impressive performance across a range of reasoning tasks, yet little is known about their internal reasoning processes in multilingual settings. We begin with a critical question: {\it In which language do these models reason when solving problems presented in different languages?} Our findings reveal that, despite multilingual training, LRMs tend to default to reasoning in high-resource languages (e.g., English) at test time, regardless of the input language. When constrained to reason in the same language as the input, model performance declines, especially for low-resource languages. In contrast, reasoning in high-resource languages generally preserves performance. We conduct extensive evaluations across reasoning-intensive tasks (MMMLU, MATH-500) and non-reasoning benchmarks (CulturalBench, LMSYS-toxic), showing that the effect of language choice varies by task type: input-language reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior. By exposing these linguistic biases in LRMs, our work highlights a critical step toward developing more equitable models that serve users across diverse linguistic backgrounds.

122. 【2505.17399】FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow

链接https://arxiv.org/abs/2505.17399

作者:Haoyu Sun,Huichen Will Wang,Jiawei Gu,Linjie Li,Yu Cheng

类目:Computation and Language (cs.CL)

关键词:Multimodal Large Language, engineers conceptualize designs, Large Language Models, Front-end engineering involves, evaluate Multimodal Large

备注

点击查看摘要

Abstract:Front-end engineering involves a complex workflow where engineers conceptualize designs, translate them into code, and iteratively refine the implementation. While recent benchmarks primarily focus on converting visual designs to code, we present FullFront, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) \textbf{across the full front-end development pipeline}. FullFront assesses three fundamental tasks that map directly to the front-end engineering pipeline: Webpage Design (conceptualization phase), Webpage Perception QA (comprehension of visual organization and elements), and Webpage Code Generation (implementation phase). Unlike existing benchmarks that use either scraped websites with bloated code or oversimplified LLM-generated HTML, FullFront employs a novel, two-stage process to transform real-world webpages into clean, standardized HTML while maintaining diverse visual designs and avoiding copyright issues. Extensive testing of state-of-the-art MLLMs reveals significant limitations in page perception, code generation (particularly for image handling and layout), and interaction implementation. Our results quantitatively demonstrate performance disparities across models and tasks, and highlight a substantial gap between current MLLM capabilities and human expert performance in front-end engineering. The FullFront benchmark and code are available in this https URL.

123. 【2505.17391】Curriculum Guided Reinforcement Learning for Efficient Multi Hop Retrieval Augmented Generation

链接https://arxiv.org/abs/2505.17391

作者:Yuelyu Ji,Rui Meng,Zhuochun Li,Daqing He

类目:Computation and Language (cs.CL)

关键词:grounds large language, issue redundant subqueries, Retrieval-augmented generation, large language models, long search chains

备注

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) grounds large language models (LLMs) in up-to-date external evidence, yet existing multi-hop RAG pipelines still issue redundant subqueries, explore too shallowly, or wander through overly long search chains. We introduce EVO-RAG, a curriculum-guided reinforcement learning framework that evolves a query-rewriting agent from broad early-stage exploration to concise late-stage refinement. EVO-RAG couples a seven-factor, step-level reward vector (covering relevance, redundancy, efficiency, and answer correctness) with a time-varying scheduler that reweights these signals as the episode unfolds. The agent is trained with Direct Preference Optimization over a multi-head reward model, enabling it to learn when to search, backtrack, answer, or refuse. Across four multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle), EVO-RAG boosts Exact Match by up to 4.6 points over strong RAG baselines while trimming average retrieval depth by 15 %. Ablation studies confirm the complementary roles of curriculum staging and dynamic reward scheduling. EVO-RAG thus offers a general recipe for building reliable, cost-effective multi-hop RAG systems.

124. 【2505.17390】Measuring diversity of synthetic prompts and data generated with fine-grained persona prompting

链接https://arxiv.org/abs/2505.17390

作者:Gauri Kambhatla,Chantal Shaib,Venkata Govindarajan

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, fine-tuning of Large, Language Models, data for pre-training

备注

点击查看摘要

Abstract:Fine-grained personas have recently been used for generating 'diverse' synthetic data for pre-training and supervised fine-tuning of Large Language Models (LLMs). In this work, we measure the diversity of persona-driven synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics. Firstly, we find that synthetic prompts/instructions are significantly less diverse than human-written ones. Next, we sample responses from LLMs of different sizes with fine-grained and coarse persona descriptions to investigate how much fine-grained detail in persona descriptions contribute to generated text diversity. We find that while persona-prompting does improve lexical diversity (especially with larger models), fine-grained detail in personas doesn't increase diversity noticeably.

125. 【2505.17387】WiNGPT-3.0 Technical Report

链接https://arxiv.org/abs/2505.17387

作者:Boqin Zhuang,Chenxiao Song,Huitong Lu,Jiacheng Qiao,Mingqian Liu,Mingxing Yu,Ping Hong,Rui Li,Xiaoxia Song,Xiangjun Xu,Xu Chen,Yaoyao Ma,Yujie Gao

类目:Computation and Language (cs.CL)

关键词:Current Large Language, Large Language Models, Current Large, exhibit significant limitations, Large Language

备注

点击查看摘要

Abstract:Current Large Language Models (LLMs) exhibit significant limitations, notably in structured, interpretable, and verifiable medical reasoning, alongside practical deployment challenges related to computational resources and data privacy. This report focused on the development of WiNGPT-3.0, the 32-billion parameter LLMs, engineered with the objective of enhancing its capacity for medical reasoning and exploring its potential for effective integration within healthcare IT infrastructures. The broader aim is to advance towards clinically applicable models. The approach involved a multi-stage training pipeline tailored for general, medical, and clinical reasoning. This pipeline incorporated supervised fine-tuning (SFT) and reinforcement learning (RL), leveraging curated Long Chain-of-Thought (CoT) datasets, auxiliary reward models, and an evidence-based diagnostic chain simulation. WiNGPT-3.0 demonstrated strong performance: specific model variants achieved scores of 66.6 on MedCalc and 87.1 on MedQA-USMLE. Furthermore, targeted training improved performance on a clinical reasoning task from a baseline score of 58.1 to 62.5. These findings suggest that reinforcement learning, even when applied with a limited dataset of only a few thousand examples, can enhance medical reasoning accuracy. Crucially, this demonstration of RL's efficacy with limited data and computation paves the way for more trustworthy and practically deployable LLMs within clinical workflows and health information infrastructures.

126. 【2505.17380】AI-Augmented LLMs Achieve Therapist-Level Responses in Motivational Interviewing

链接https://arxiv.org/abs/2505.17380

作者:Yinghui Huang,Yuxuan Jiang,Hui Liu,Yixin Cai,Weiqing Li,Xiangen Hu

类目:Computation and Language (cs.CL)

关键词:scaling motivational interviewing, Large language models, require systematic evaluation, Large language, motivational interviewing

备注: 21 pages, 5 figures

点击查看摘要

Abstract:Large language models (LLMs) like GPT-4 show potential for scaling motivational interviewing (MI) in addiction care, but require systematic evaluation of therapeutic capabilities. We present a computational framework assessing user-perceived quality (UPQ) through expected and unexpected MI behaviors. Analyzing human therapist and GPT-4 MI sessions via human-AI collaboration, we developed predictive models integrating deep learning and explainable AI to identify 17 MI-consistent (MICO) and MI-inconsistent (MIIN) behavioral metrics. A customized chain-of-thought prompt improved GPT-4's MI performance, reducing inappropriate advice while enhancing reflections and empathy. Although GPT-4 remained marginally inferior to therapists overall, it demonstrated superior advice management capabilities. The model achieved measurable quality improvements through prompt engineering, yet showed limitations in addressing complex emotional nuances. This framework establishes a pathway for optimizing LLM-based therapeutic tools through targeted behavioral metric analysis and human-AI co-evaluation. Findings highlight both the scalability potential and current constraints of LLMs in clinical communication applications.

127. 【2505.17374】Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts

链接https://arxiv.org/abs/2505.17374

作者:Seon Gyeom Kim,Jae Young Choi,Ryan Rossi,Eunyee Koh,Tak Yeon Lee

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

关键词:Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language

备注: This paper has been accepted to IEEE PacificVis 2025

点击查看摘要

Abstract:The field of Multimodal Large Language Models (MLLMs) has made remarkable progress in visual understanding tasks, presenting a vast opportunity to predict the perceptual and emotional impact of charts. However, it also raises concerns, as many applications of LLMs are based on overgeneralized assumptions from a few examples, lacking sufficient validation of their performance and effectiveness. We introduce Chart-to-Experience, a benchmark dataset comprising 36 charts, evaluated by crowdsourced workers for their impact on seven experiential factors. Using the dataset as ground truth, we evaluated capabilities of state-of-the-art MLLMs on two tasks: direct prediction and pairwise comparison of charts. Our findings imply that MLLMs are not as sensitive as human evaluators when assessing individual charts, but are accurate and reliable in pairwise comparisons.

128. 【2505.17373】Value-Guided Search for Efficient Chain-of-Thought Reasoning

链接https://arxiv.org/abs/2505.17373

作者:Kaiwen Wang,Jin Peng Zhou,Jonathan Chang,Zhaolin Gao,Nathan Kallus,Kianté Brantley,Wen Sun

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:long-context reasoning traces, propose a simple, simple and efficient, reasoning traces, long-context reasoning

备注

点击查看摘要

Abstract:In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. With an inference budget of 64 generations, VGS with DeepSeek-R1-Distill-1.5B achieves an average accuracy of 45.7% across four competition math benchmarks (AIME 2024 2025, HMMT Feb 2024 2025), reaching parity with o3-mini-medium. Moreover, VGS significantly reduces the inference FLOPs required to achieve the same performance of majority voting. Our dataset, model and codebase are open-sourced.

129. 【2505.17371】An End-to-End Approach for Child Reading Assessment in the Xhosa Language

链接https://arxiv.org/abs/2505.17371

作者:Sergio Chevtchenko,Nikhil Navas,Rafaella Vale,Franco Ubaudi,Sipumelele Lucwaba,Cally Ardington,Soheil Afshar,Mark Antoniou,Saeed Afshar

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:life outcomes, individual life, strong predictor, subsequent stages, Grade Reading Assessment

备注: Paper accepted on AIED 2025 containing 14 pages, 6 figures and 4 tables

点击查看摘要

Abstract:Child literacy is a strong predictor of life outcomes at the subsequent stages of an individual's life. This points to a need for targeted interventions in vulnerable low and middle income populations to help bridge the gap between literacy levels in these regions and high income ones. In this effort, reading assessments provide an important tool to measure the effectiveness of these programs and AI can be a reliable and economical tool to support educators with this task. Developing accurate automatic reading assessment systems for child speech in low-resource languages poses significant challenges due to limited data and the unique acoustic properties of children's voices. This study focuses on Xhosa, a language spoken in South Africa, to advance child speech recognition capabilities. We present a novel dataset composed of child speech samples in Xhosa. The dataset is available upon request and contains ten words and letters, which are part of the Early Grade Reading Assessment (EGRA) system. Each recording is labeled with an online and cost-effective approach by multiple markers and a subsample is validated by an independent EGRA reviewer. This dataset is evaluated with three fine-tuned state-of-the-art end-to-end models: wav2vec 2.0, HuBERT, and Whisper. The results indicate that the performance of these models can be significantly influenced by the amount and balancing of the available training data, which is fundamental for cost-effective large dataset collection. Furthermore, our experiments indicate that the wav2vec 2.0 performance is improved by training on multiple classes at a time, even when the number of available samples is constrained.

130. 【2505.17362】A Fully Generative Motivational Interviewing Counsellor Chatbot for Moving Smokers Towards the Decision to Quit

链接https://arxiv.org/abs/2505.17362

作者:Zafarullah Mahmood,Soliman Ali,Jiading Zhu,Mohamed Abdelwahab,Michelle Yu Collins,Sihan Chen,Yi Cheng Zhao,Jodi Wolff,Osnat Melamed,Nadia Minian,Marta Maslej,Carolynne Cooper,Matt Ratto,Peter Selby,Jonathan Rose

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, capabilities of Large, Language Models, Large Language, conversational capabilities

备注: To be published in the Findings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), Vienna, Austria, 2025

点击查看摘要

Abstract:The conversational capabilities of Large Language Models (LLMs) suggest that they may be able to perform as automated talk therapists. It is crucial to know if these systems would be effective and adhere to known standards. We present a counsellor chatbot that focuses on motivating tobacco smokers to quit smoking. It uses a state-of-the-art LLM and a widely applied therapeutic approach called Motivational Interviewing (MI), and was evolved in collaboration with clinician-scientists with expertise in MI. We also describe and validate an automated assessment of both the chatbot's adherence to MI and client responses. The chatbot was tested on 106 participants, and their confidence that they could succeed in quitting smoking was measured before the conversation and one week later. Participants' confidence increased by an average of 1.7 on a 0-10 scale. The automated assessment of the chatbot showed adherence to MI standards in 98% of utterances, higher than human counsellors. The chatbot scored well on a participant-reported metric of perceived empathy but lower than typical human counsellors. Furthermore, participants' language indicated a good level of motivation to change, a key goal in MI. These results suggest that the automation of talk therapy with a modern LLM has promise.

131. 【2505.17348】DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic

链接https://arxiv.org/abs/2505.17348

作者:Yuheng Wu,Jianwen Xie,Denghui Zhang,Zhaozhuo Xu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:perform deep social, deep social reasoning, Dynamic Epistemic Logic, pose a unique, unique challenge

备注

点击查看摘要

Abstract:Theory-of-Mind (ToM) tasks pose a unique challenge for small language models (SLMs) with limited scale, which often lack the capacity to perform deep social reasoning. In this work, we propose DEL-ToM, a framework that improves ToM reasoning through inference-time scaling rather than architectural changes. Our approach decomposes ToM tasks into a sequence of belief updates grounded in Dynamic Epistemic Logic (DEL), enabling structured and transparent reasoning. We train a verifier, called the Process Belief Model (PBM), to score each belief update step using labels generated automatically via a DEL simulator. During inference, candidate belief traces generated by a language model are evaluated by the PBM, and the highest-scoring trace is selected. This allows SLMs to emulate more deliberate reasoning by allocating additional compute at test time. Experiments across multiple model scales and benchmarks show that DEL-ToM consistently improves performance, demonstrating that verifiable belief supervision can significantly enhance ToM abilities of SLMs without retraining.

132. 【2505.17345】Language models should be subject to repeatable, open, domain-contextualized hallucination benchmarking

链接https://arxiv.org/abs/2505.17345

作者:Justin D. Norman,Michael U. Rivera,D. Alex Hughes

类目:Computation and Language (cs.CL)

关键词:tokens in model-generated, model-generated text, text are widely, widely believed, pervasive and problematic

备注: 9 pages

点击查看摘要

Abstract:Plausible, but inaccurate, tokens in model-generated text are widely believed to be pervasive and problematic for the responsible adoption of language models. Despite this concern, there is little scientific work that attempts to measure the prevalence of language model hallucination in a comprehensive way. In this paper, we argue that language models should be evaluated using repeatable, open, and domain-contextualized hallucination benchmarking. We present a taxonomy of hallucinations alongside a case study that demonstrates that when experts are absent from the early stages of data creation, the resulting hallucination metrics lack validity and practical utility.

133. 【2505.17332】SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use

链接https://arxiv.org/abs/2505.17332

作者:Hitesh Laxmichand Patel,Amit Agarwal,Arion Das,Bhargava Kumar,Srikant Panda,Priyaranjan Pattnayak,Taki Hasan Rafi,Tejaswini Kumar,Dong-Kyu Chae

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词:increasingly adopting Large, crafting sales pitches, adopting Large Language, composing casual messages, Large Language Models

备注: Published in the Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025), Industry Track, pages 558-582

点击查看摘要

Abstract:Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: this https URL.

134. 【2505.17331】ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training

链接https://arxiv.org/abs/2505.17331

作者:Maryam Dialameh,Rezaul Karim,Hossein Rajabzadeh,Omar Mohamed Awad,Hyock Ju Kwon,Boxing Chen,Walid Ahmed,Yang Liu

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:LLaMA architecture designed, paper introduces ECHO-LLaMA, learning capacity, architecture designed, efficient LLaMA architecture

备注

点击查看摘要

Abstract:This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA architectures while maintaining its learning capacity. ECHO-LLaMA transforms LLaMA models into shared KV caching across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77\% higher token-per-second throughput during training, up to 16\% higher Model FLOPs Utilization (MFU), and up to 14\% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7\% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance.

135. 【2505.17330】FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding

链接https://arxiv.org/abs/2505.17330

作者:Amit Agarwal,Srikant Panda,Kulbhushan Pachauri

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Domain Adapting Graph, Shot Domain Adapting, Adapting Graph, rich document understanding, visually rich document

备注: Published in the Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), Industry Track, pages 100-114

点击查看摘要

Abstract:In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with less than 90M parameters, making it well-suited for complex real-world applications for Information Extraction (IE) tasks where computational resources are limited. We demonstrate FS-DAG's capability through extensive experiments for information extraction task, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code : this https URL

136. 【2505.17327】GPT Editors, Not Authors: The Stylistic Footprint of LLMs in Academic Preprints

链接https://arxiv.org/abs/2505.17327

作者:Soren DeHaan,Yuanze Liu,Johan Bollen,Sa'ul A. Blanco

类目:Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)

关键词:Large Language Models, causing institutional uncertainty, impacted academic writing, proliferation of Large, threatening credibility

备注: 13 pages

点击查看摘要

Abstract:The proliferation of Large Language Models (LLMs) in late 2022 has impacted academic writing, threatening credibility, and causing institutional uncertainty. We seek to determine the degree to which LLMs are used to generate critical text as opposed to being used for editing, such as checking for grammar errors or inappropriate phrasing. In our study, we analyze arXiv papers for stylistic segmentation, which we measure by varying a PELT threshold against a Bayesian classifier trained on GPT-regenerated text. We find that LLM-attributed language is not predictive of stylistic segmentation, suggesting that when authors use LLMs, they do so uniformly, reducing the risk of hallucinations being introduced into academic preprints.

137. 【2505.17322】From Compression to Expansion: A Layerwise Analysis of In-Context Learning

链接https://arxiv.org/abs/2505.17322

作者:Jiachen Jiang,Yuxin Dong,Jinxin Zhou,Zhihui Zhu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:enables large language, In-context learning, large language models, enables large, large language

备注

点击查看摘要

Abstract:In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without weight updates by learning from demonstration sequences. While ICL shows strong empirical performance, its internal representational mechanisms are not yet well understood. In this work, we conduct a statistical geometric analysis of ICL representations to investigate how task-specific information is captured across layers. Our analysis reveals an intriguing phenomenon, which we term *Layerwise Compression-Expansion*: early layers progressively produce compact and discriminative representations that encode task information from the input demonstrations, while later layers expand these representations to incorporate the query and generate the prediction. This phenomenon is observed consistently across diverse tasks and a range of contemporary LLM architectures. We demonstrate that it has important implications for ICL performance -- improving with model size and the number of demonstrations -- and for robustness in the presence of noisy examples. To further understand the effect of the compact task representation, we propose a bias-variance decomposition and provide a theoretical analysis showing how attention mechanisms contribute to reducing both variance and bias, thereby enhancing performance as the number of demonstrations increases. Our findings reveal an intriguing layerwise dynamic in ICL, highlight how structured representations emerge within LLMs, and showcase that analyzing internal representations can facilitate a deeper understanding of model behavior.

138. 【2505.17320】Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2

链接https://arxiv.org/abs/2505.17320

作者:Zackary Rackauckas,Julia Hirschberg

类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:Synthesizing expressive Japanese, poses unique challenges, unique challenges due, speech poses unique, Synthesizing expressive

备注

点击查看摘要

Abstract:Synthesizing expressive Japanese character speech poses unique challenges due to pitch-accent sensitivity and stylistic variability. This paper benchmarks two open-source text-to-speech models--VITS and Style-BERT-VITS2 JP Extra (SBV2JE)--on in-domain, character-driven Japanese speech. Using three character-specific datasets, we evaluate models across naturalness (mean opinion and comparative mean opinion score), intelligibility (word error rate), and speaker consistency. SBV2JE matches human ground truth in naturalness (MOS 4.37 vs. 4.38), achieves lower WER, and shows slight preference in CMOS. Enhanced by pitch-accent controls and a WavLM-based discriminator, SBV2JE proves effective for applications like language learning and character dialogue generation, despite higher computational demands.

139. 【2505.17316】Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models

链接https://arxiv.org/abs/2505.17316

作者:Jiachen Jiang,Jinxin Zhou,Bo Peng,Xia Ning,Zhihui Zhu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, pretrained vision encoders, powerful pretrained vision, vision encoder

备注

点击查看摘要

Abstract:Achieving better alignment between vision embeddings and Large Language Models (LLMs) is crucial for enhancing the abilities of Multimodal LLMs (MLLMs), particularly for recent models that rely on powerful pretrained vision encoders and LLMs. A common approach to connect the pretrained vision encoder and LLM is through a projector applied after the vision encoder. However, the projector is often trained to enable the LLM to generate captions, and hence the mechanism by which LLMs understand each vision token remains unclear. In this work, we first investigate the role of the projector in compressing vision embeddings and aligning them with word embeddings. We show that the projector significantly compresses visual information, removing redundant details while preserving essential elements necessary for the LLM to understand visual content. We then examine patch-level alignment -- the alignment between each vision patch and its corresponding semantic words -- and propose a *multi-semantic alignment hypothesis*. Our analysis indicates that the projector trained by caption loss improves patch-level alignment but only to a limited extent, resulting in weak and coarse alignment. To address this issue, we propose *patch-aligned training* to efficiently enhance patch-level alignment. Our experiments show that patch-aligned training (1) achieves stronger compression capability and improved patch-level alignment, enabling the MLLM to generate higher-quality captions, (2) improves the MLLM's performance by 16% on referring expression grounding tasks, 4% on question-answering tasks, and 3% on modern instruction-following benchmarks when using the same supervised fine-tuning (SFT) setting. The proposed method can be easily extended to other multimodal models.

140. 【2505.17315】Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

链接https://arxiv.org/abs/2505.17315

作者:Wang Yang,Zirui Liu,Hongye Jin,Qingyu Yin,Vipin Chaudhary,Xiaotian Han

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Recent language models, strong reasoning capabilities, reasoning remains underexplored, models exhibit strong, exhibit strong reasoning

备注

点击查看摘要

Abstract:Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity, motivated by empirical observations such as (1) higher context window length often leads to stronger reasoning performance, and (2) failed reasoning cases resemble failed long-context cases. To test this hypothesis, we examine whether enhancing a model's long-context ability before Supervised Fine-Tuning (SFT) leads to improved reasoning performance. Specifically, we compared models with identical architectures and fine-tuning data but varying levels of long-context capacity. Our results reveal a consistent trend: models with stronger long-context capacity achieve significantly higher accuracy on reasoning benchmarks after SFT. Notably, these gains persist even on tasks with short input lengths, indicating that long-context training offers generalizable benefits for reasoning performance. These findings suggest that long-context modeling is not just essential for processing lengthy inputs, but also serves as a critical foundation for reasoning. We advocate for treating long-context capacity as a first-class objective in the design of future language models.

141. 【2505.17306】Refusal Direction is Universal Across Safety-Aligned Languages

链接https://arxiv.org/abs/2505.17306

作者:Xinpeng Wang,Mingyang Wang,Yihong Liu,Hinrich Schütze,Barbara Plank

类目:Computation and Language (cs.CL)

关键词:large language models, essential for ensuring, Refusal, refusal behavior, language models

备注

点击查看摘要

Abstract:Refusal mechanisms in large language models (LLMs) are essential for ensuring safety. Recent research has revealed that refusal behavior can be mediated by a single direction in activation space, enabling targeted interventions to bypass refusals. While this is primarily demonstrated in an English-centric context, appropriate refusal behavior is important for any language, but poorly understood. In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness, without any additional fine-tuning. Even more remarkably, refusal directions derived from any safety-aligned language transfer seamlessly to others. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks. These findings provide actionable insights for building more robust multilingual safety defenses and pave the way for a deeper mechanistic understanding of cross-lingual vulnerabilities in LLMs.

142. 【2505.17296】SELF: Self-Extend the Context Length With Logistic Growth Function

链接https://arxiv.org/abs/2505.17296

作者:Phat Thanh Dang,Saahil Thoppay,Wang Yang,Qifan Wang,Vipin Chaudhary,Xiaotian Han

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, standard position encoding, Large language, training context length, context length due

备注: 11 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Large language models suffer issues when operated on long contexts that are larger than their training context length due to the standard position encoding for tokens in the attention layer. Tokens a long distance apart will rarely have an effect on each other and long prompts yield unexpected results. To solve this problem, we propose SELF (Self-Extend the Context Length With Logistic Growth Function): a solution of grouping consecutive tokens at varying group sizes using a logistic capacity equation combined with a constant group size at smaller relative distances. Our model had an increase in performance of up to 12% compared to the LongLM extension method in LEval (specifically on the Qwen model). On summarization related tasks in LongBench, our model performed up to 6.4% better than LongLM (specifically on the Llama-2-7b model). On reading comprehension tasks from LEval, our model performed up to 5.4% better than the LongLM. Our code is available at this https URL.

143. 【2505.17282】Attention with Trained Embeddings Provably Selects Important Tokens

链接https://arxiv.org/abs/2505.17282

作者:Diyuan Wu,Aleksandr Shevchenko,Samet Oymak,Marco Mondelli

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:understanding remains limited, theoretical understanding remains, Token embeddings play, practical relevance, remains limited

备注

点击查看摘要

Abstract:Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited. Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent. Specifically, we consider a one-layer softmax attention model with a linear head for binary classification, i.e., $\texttt{Softmax}( p^\top E_X^\top ) E_X v = \frac{ \sum_{i=1}^T \exp(p^\top E_{x_i}) E_{x_i}^\top v}{\sum_{j=1}^T \exp(p^\top E_{x_{j}}) }$, where $E_X = [ E_{x_1} , \dots, E_{x_T} ]^\top$ contains the embeddings of the input sequence, $p$ is the embedding of the $\mathrm{\langle cls \rangle}$ token and $v$ the output vector. First, we show that, already after a single step of gradient training with the logistic loss, the embeddings $E_X$ capture the importance of tokens in the dataset by aligning with the output vector $v$ proportionally to the frequency with which the corresponding tokens appear in the dataset. Then, after training $p$ via gradient flow until convergence, the softmax selects the important tokens in the sentence (i.e., those that are predictive of the label), and the resulting $\mathrm{\langle cls \rangle}$ embedding maximizes the margin for such a selection. Experiments on real-world datasets (IMDB, Yelp) exhibit a phenomenology close to that unveiled by our theory.

144. 【2505.17281】Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty

链接https://arxiv.org/abs/2505.17281

作者:Peilin Wu,Mian Zhang,Xinlu Zhang,Xinya Du,Zhiyu Zoey Chen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:enhance Large Language, Large Language Models, Agentic Retrieval-Augmented Generation, Large Language, systems enhance Large

备注

点击查看摘要

Abstract:Agentic Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi-step reasoning and information retrieval. However, these systems often exhibit sub-optimal search behaviors like over-search (retrieving redundant information) and under-search (failing to retrieve necessary information), which hinder efficiency and reliability. This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems (e.g., one model could have avoided searching in 27.7% of its search steps). Furthermore, we demonstrate a crucial link between these inefficiencies and the models' uncertainty regarding their own knowledge boundaries, where response accuracy correlates with model's uncertainty in its search decisions. To address this, we propose $\beta$-GRPO, a reinforcement learning-based training method that incorporates confidence threshold to reward high-certainty search decisions. Experiments on seven QA benchmarks show that $\beta$-GRPO enable a 3B model with better agentic RAG ability, outperforming other strong baselines with a 4% higher average exact match score.

145. 【2505.17272】Zebra-Llama: Towards Extremely Efficient Hybrid Models

链接https://arxiv.org/abs/2505.17272

作者:Mingyu Yang,Mehdi Rezagholizadeh,Guihong Li,Vikram Appia,Emad Barsoum

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:deploying large language, large language models, diverse applications, improving their inference, democratized access

备注

点击查看摘要

Abstract:With the growing demand for deploying large language models (LLMs) across diverse applications, improving their inference efficiency is crucial for sustainable and democratized access. However, retraining LLMs to meet new user-specific requirements is prohibitively expensive and environmentally unsustainable. In this work, we propose a practical and scalable alternative: composing efficient hybrid language models from existing pre-trained models. Our approach, Zebra-Llama, introduces a family of 1B, 3B, and 8B hybrid models by combining State Space Models (SSMs) and Multi-head Latent Attention (MLA) layers, using a refined initialization and post-training pipeline to efficiently transfer knowledge from pre-trained Transformers. Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size -down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively-while preserving 100%, 100%, and 97% of average zero-shot performance on LM Harness tasks. Compared to models like MambaInLLaMA, X-EcoMLA, Minitron, and Llamba, Zebra-Llama consistently delivers competitive or superior accuracy while using significantly fewer tokens, smaller teachers, and vastly reduced KV cache memory. Notably, Zebra-Llama-8B surpasses Minitron-8B in few-shot accuracy by 7% while using 8x fewer training tokens, over 12x smaller KV cache, and a smaller teacher (8B vs. 15B). It also achieves 2.6x-3.8x higher throughput (tokens/s) than MambaInLlama up to a 32k context length. We will release code and model checkpoints upon acceptance.

146. 【2505.17267】GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations

链接https://arxiv.org/abs/2505.17267

作者:Odysseas S. Chlapanis,Dimitrios Galanis,Nikolaos Aletras,Ion Androutsopoulos

类目:Computation and Language (cs.CL)

关键词:Greek Bar exams, Greek Bar, Bar exams, introduce GreekBarBench, requiring citations

备注: 19 pages, 17 figures, submitted to May ARR

点击查看摘要

Abstract:We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that even though the best models outperform average expert scores, they fall short of the 95th percentile of experts.

147. 【2505.17266】Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning

链接https://arxiv.org/abs/2505.17266

作者:Cehao Yang,Xueyuan Lin,Chengjin Xu,Xuhui Jiang,Xiaojun Wu,Honghao Liu,Hui Xiong,Jian Guo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, pre-trained large language, Large Reasoning Models, strong Large Reasoning, pre-trained large

备注

点击查看摘要

Abstract:A practical approach to activate long chain-of-thoughts reasoning ability in pre-trained large language models is to perform supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, offering a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, while effective strategies for automatic long-CoT instruction selection still remain unexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of emergence of rethinking behaviors like self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason leverages a quantifier to estimate difficulty of question and jointly incorporates a reasoning trace length-based heuristic through a weighted scheme for ranking to prioritize high-utility examples. Empirical results on OpenR1-Math-220k demonstrate that fine-tuning LLM on only 10% of the data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and open-source baseline OpenR1-Qwen-7B across three competition-level and six comprehensive mathematical benchmarks. Further experiments highlight the scalability in varying data size, efficiency during inference, and its adaptability to other instruction pools with minimal cost.

148. 【2505.17265】CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports

链接https://arxiv.org/abs/2505.17265

作者:Xiao Yu Cindy Zhang(1),Carlos R. Ferreira(2),Francis Rossignol(2),Raymond T. Ng(1),Wyeth Wasserman(1),Jian Zhu(1) ((1) University of British Columbia, (2) National Institutes of Health)

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:including Inborn Errors, Errors of Metabolism, Inborn Errors, pose significant diagnostic, significant diagnostic challenges

备注

点击查看摘要

Abstract:Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured predefined categories. Large Language Models (LLMs) may enable scalable information extraction from case reports but are rarely evaluated for this task. We introduce CaseReportBench, an expert-annotated dataset for dense information extraction of case reports, focusing on IEMs. Using this dataset, we assess various models and prompting strategies, introducing novel approaches such as category-specific prompting and subheading-filtered data integration. Zero-shot chain-of-thought prompting offers little advantage over standard zero-shot prompting. Category-specific prompting improves alignment with the benchmark. The open-source model Qwen2.5-7B outperforms GPT-4o for this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management. We also highlight areas for improvement, such as LLMs' limitations in recognizing negative findings important for differential diagnosis. This work advances LLM-driven clinical natural language processing and paves the way for scalable medical AI applications.

149. 【2505.17260】he Rise of Parameter Specialization for Knowledge Storage in Large Language Models

链接https://arxiv.org/abs/2505.17260

作者:Yihuai Hong,Yiran Zhao,Wei Tang,Yang Deng,Yu Rong,Wenxuan Zhang

类目:Computation and Language (cs.CL)

关键词:language models, growing wave, large language models, knowledge, language

备注

点击查看摘要

Abstract:Over time, a growing wave of large language models from various series has been introduced to the community. Researchers are striving to maximize the performance of language models with constrained parameter sizes. However, from a microscopic perspective, there has been limited research on how to better store knowledge in model parameters, particularly within MLPs, to enable more effective utilization of this knowledge by the model. In this work, we analyze twenty publicly available open-source large language models to investigate the relationship between their strong performance and the way knowledge is stored in their corresponding MLP parameters. Our findings reveal that as language models become more advanced and demonstrate stronger knowledge capabilities, their parameters exhibit increased specialization. Specifically, parameters in the MLPs tend to be more focused on encoding similar types of knowledge. We experimentally validate that this specialized distribution of knowledge contributes to improving the efficiency of knowledge utilization in these models. Furthermore, by conducting causal training experiments, we confirm that this specialized knowledge distribution plays a critical role in improving the model's efficiency in leveraging stored knowledge.

150. 【2505.17250】ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models

链接https://arxiv.org/abs/2505.17250

作者:Razvan-Gabriel Dumitru,Darius Peteleaza,Vikas Yadav,Liangming Pan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:structured reasoning steps, excel at complex, complex tasks, tasks by breaking, Large language

备注: 25 pages, 18 figures, and 6 tables

点击查看摘要

Abstract:Large language models excel at complex tasks by breaking down problems into structured reasoning steps. However, reasoning traces often extend beyond reaching a correct answer, causing wasted computation, reduced readability, and hallucinations. To address this, we introduce a novel hyperparameter-free conciseness score used as a reward signal within a reinforcement learning framework to guide models toward generating correct and concise reasoning traces. This score is evaluated by a large language model acting as a judge, enabling dynamic, context-aware feedback beyond simple token length. Our method achieves state-of-the-art efficiency-accuracy trade-offs on the MATH dataset, reducing token usage by up to 31x on simple problems while improving accuracy by 7%, and on the hardest problems, it outperforms full reasoning by +7.5% accuracy with up to 3.6x fewer tokens. On TheoremQA, our method improves accuracy by +2.2% using 12.5x fewer tokens. We also conduct ablation studies on the judge model, reward composition, and problem difficulty, showing that our method dynamically adapts reasoning length based on problem difficulty and benefits significantly from stronger judges. The code, model weights, and datasets are open-sourced at this https URL.

151. 【2505.17244】ReasoningShield: Content Safety Detection over Reasoning Traces of Large Reasoning Models

链接https://arxiv.org/abs/2505.17244

作者:Changyi Li,Jiayi Wang,Xudong Pan,Geng Hong,Min Yang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Reasoning Models, advanced reasoning capabilities, Large Reasoning, landscape with advanced, reasoning traces

备注

点击查看摘要

Abstract:Large Reasoning Models (LRMs) are transforming the AI landscape with advanced reasoning capabilities. While the generated reasoning traces enhance model transparency, they can still contain unsafe content, even when the final answer appears safe. Existing moderation tools, primarily designed for question-answer (QA) pairs, are empirically ineffective at detecting hidden risks embedded in reasoning traces. After identifying the key challenges, we formally define the question-thought (QT) moderation task and propose ReasoningShield, the first safety detection model tailored to identify potential risks in the reasoning trace before reaching the final answer. To construct the model, we synthesize a high-quality reasoning safety detection dataset comprising over 8,000 question-thought pairs spanning ten risk categories and three safety levels. Our dataset construction process incorporates a comprehensive human-AI collaborative annotation pipeline, which achieves over 93% annotation accuracy while significantly reducing human costs. On a diverse set of in-distribution and out-of-distribution benchmarks, ReasoningShield outperforms mainstream content safety moderation models in identifying risks within reasoning traces, with an average F1 score exceeding 0.92. Notably, despite being trained on our QT dataset only, ReasoningShield also demonstrates competitive performance in detecting unsafe question-answer pairs on traditional benchmarks, rivaling baselines trained on 10 times larger datasets and base models, which strongly validates the quality of our dataset. Furthermore, ReasoningShield is built upon compact 1B/3B base models to facilitate lightweight deployment and provides human-friendly risk analysis by default. To foster future research, we publicly release all the resources.

152. 【2505.17238】Personalizing Student-Agent Interactions Using Log-Contextualized Retrieval Augmented Generation (RAG)

链接https://arxiv.org/abs/2505.17238

作者:Clayton Cohn,Surya Rayala,Caitlin Snyder,Joyce Fonteles,Shruti Jain,Naveeduddin Mohammed,Umesh Timalsina,Sarah K. Burriss,Ashwin T S,Namrata Srivastava,Menton Deweese,Angela Eeds,Gautam Biswas

类目:Computation and Language (cs.CL)

关键词:offers rich insights, dialogue offers rich, offers rich, rich insights, students' learning

备注: Submitted to the International Conference on Artificial Intelligence in Education (AIED) Workshop on Epistemics and Decision-Making in AI-Supported Education

点击查看摘要

Abstract:Collaborative dialogue offers rich insights into students' learning and critical thinking. This is essential for adapting pedagogical agents to students' learning and problem-solving skills in STEM+C settings. While large language models (LLMs) facilitate dynamic pedagogical interactions, potential hallucinations can undermine confidence, trust, and instructional value. Retrieval-augmented generation (RAG) grounds LLM outputs in curated knowledge, but its effectiveness depends on clear semantic links between user input and a knowledge base, which are often weak in student dialogue. We propose log-contextualized RAG (LC-RAG), which enhances RAG retrieval by incorporating environment logs to contextualize collaborative discourse. Our findings show that LC-RAG improves retrieval over a discourse-only baseline and allows our collaborative peer agent, Copa, to deliver relevant, personalized guidance that supports students' critical thinking and epistemic decision-making in a collaborative computational modeling environment, XYZ.

153. 【2505.17235】CHAOS: Chart Analysis with Outlier Samples

链接https://arxiv.org/abs/2505.17235

作者:Omar Moured,Yufan Chen,Ruiping Liu,Simon Reiß,Philip Torr,Jiaming Zhang,Rainer Stiefelhagen

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Multimodal Large Language, Large Language Models, noisy features, real-world applications, applications often present

备注: Data and code are publicly available at: [this http URL](http://huggingface.co/datasets/omoured/CHAOS)

点击查看摘要

Abstract:Charts play a critical role in data analysis and visualization, yet real-world applications often present charts with challenging or noisy features. However, "outlier charts" pose a substantial challenge even for Multimodal Large Language Models (MLLMs), which can struggle to interpret perturbed charts. In this work, we introduce CHAOS (CHart Analysis with Outlier Samples), a robustness benchmark to systematically evaluate MLLMs against chart perturbations. CHAOS encompasses five types of textual and ten types of visual perturbations, each presented at three levels of severity (easy, mid, hard) inspired by the study result of human evaluation. The benchmark includes 13 state-of-the-art MLLMs divided into three groups (i.e., general-, document-, and chart-specific models) according to the training scope and data. Comprehensive analysis involves two downstream tasks (ChartQA and Chart-to-Text). Extensive experiments and case studies highlight critical insights into robustness of models across chart perturbations, aiming to guide future research in chart understanding domain. Data and code are publicly available at: this http URL.

154. 【2505.17231】ExeSQL: Self-Taught Text-to-SQL Models with Execution-Driven Bootstrapping for SQL Dialects

链接https://arxiv.org/abs/2505.17231

作者:Jipeng Zhang,Haolin Yang,Kehao Miao,Ruiyuan Zhang,Renjie Pi,Jiahui Gao,Xiaofang Zhou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)

关键词:achieved strong performance, effectiveness remains largely, remains largely confined, strong performance, achieved strong

备注

点击查看摘要

Abstract:Recent text-to-SQL models have achieved strong performance, but their effectiveness remains largely confined to SQLite due to dataset limitations. However, real-world applications require SQL generation across multiple dialects with varying syntax and specialized features, which remains a challenge for current models. The main obstacle in building a dialect-aware model lies in acquiring high-quality dialect-specific data. Data generated purely through static prompting - without validating SQLs via execution - tends to be noisy and unreliable. Moreover, the lack of real execution environments in the training loop prevents models from grounding their predictions in executable semantics, limiting generalization despite surface-level improvements from data filtering. This work introduces ExeSQL, a text-to-SQL framework with execution-driven, agentic bootstrapping. The method consists of iterative query generation, execution-based filtering (e.g., rejection sampling), and preference-based training, enabling the model to adapt to new SQL dialects through verifiable, feedback-guided learning. Experiments show that ExeSQL bridges the dialect gap in text-to-SQL, achieving average improvements of 15.2%, 10.38%, and 4.49% over GPT-4o on PostgreSQL, MySQL, and Oracle, respectively, across multiple datasets of varying difficulty.

155. 【2505.17222】Humans Hallucinate Too: Language Models Identify and Correct Subjective Annotation Errors With Label-in-a-Haystack Prompts

链接https://arxiv.org/abs/2505.17222

作者:Georgios Chochlakis,Peter Wu,Arjun Bedi,Marcus Ma,Kristina Lerman,Shrikanth Narayanan

类目:Computation and Language (cs.CL)

关键词:Natural Language Processing, Modeling complex subjective, considerably challenging due, Language Processing, Modeling complex

备注: 17 pages, 16 figures, 9 tables

点击查看摘要

Abstract:Modeling complex subjective tasks in Natural Language Processing, such as recognizing emotion and morality, is considerably challenging due to significant variation in human annotations. This variation often reflects reasonable differences in semantic interpretations rather than mere noise, necessitating methods to distinguish between legitimate subjectivity and error. We address this challenge by exploring label verification in these contexts using Large Language Models (LLMs). First, we propose a simple In-Context Learning binary filtering baseline that estimates the reasonableness of a document-label pair. We then introduce the Label-in-a-Haystack setting: the query and its label(s) are included in the demonstrations shown to LLMs, which are prompted to predict the label(s) again, while receiving task-specific instructions (e.g., emotion recognition) rather than label copying. We show how the failure to copy the label(s) to the output of the LLM are task-relevant and informative. Building on this, we propose the Label-in-a-Haystack Rectification (LiaHR) framework for subjective label correction: when the model outputs diverge from the reference gold labels, we assign the generated labels to the example instead of discarding it. This approach can be integrated into annotation pipelines to enhance signal-to-noise ratios. Comprehensive analyses, human evaluations, and ecological validity studies verify the utility of LiaHR for label correction. Code is available at this https URL.

156. 【2505.17217】Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs

链接https://arxiv.org/abs/2505.17217

作者:Kangda Wei,Hasnat Md Abdullah,Ruihong Huang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词:Large Language Models, Large Language, resulting in unequal, Language Models, unequal treatment

备注

点击查看摘要

Abstract:Large Language Models (LLMs) often exhibit gender bias, resulting in unequal treatment of male and female subjects across different contexts. To address this issue, we propose a novel data generation framework that fosters exploratory thinking in LLMs. Our approach prompts models to generate story pairs featuring male and female protagonists in structurally identical, morally ambiguous scenarios, then elicits and compares their moral judgments. When inconsistencies arise, the model is guided to produce balanced, gender-neutral judgments. These story-judgment pairs are used to fine-tune or optimize the models via Direct Preference Optimization (DPO). Experimental results show that our method significantly reduces gender bias while preserving or even enhancing general model capabilities. We will release the code and generated data.

157. 【2505.17206】FB-RAG: Improving RAG with Forward and Backward Lookup

链接https://arxiv.org/abs/2505.17206

作者:Kushal Chawla,Alfy Samuel,Anoop Kumar,Daben Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Retrieval Augmented Generation, Augmented Generation, Retrieval Augmented, systems relies heavily, systems relies

备注

点击查看摘要

Abstract:The performance of Retrieval Augmented Generation (RAG) systems relies heavily on the retriever quality and the size of the retrieved context. A large enough context ensures that the relevant information is present in the input context for the LLM, but also incorporates irrelevant content that has been shown to confuse the models. On the other hand, a smaller context reduces the irrelevant information, but it often comes at the risk of losing important information necessary to answer the input question. This duality is especially challenging to manage for complex queries that contain little information to retrieve the relevant chunks from the full context. To address this, we present a novel framework, called FB-RAG, which enhances the RAG pipeline by relying on a combination of backward lookup (overlap with the query) and forward lookup (overlap with candidate reasons and answers) to retrieve specific context chunks that are the most relevant for answering the input query. Our evaluations on 9 datasets from two leading benchmarks show that FB-RAG consistently outperforms RAG and Long Context baselines developed recently for these benchmarks. We further show that FB-RAG can improve performance while reducing latency. We perform qualitative analysis of the strengths and shortcomings of our approach, providing specific insights to guide future work.

158. 【2505.17202】CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models

链接https://arxiv.org/abs/2505.17202

作者:Arnav Verma,Kushin Mukherjee,Christopher Potts,Elisa Kreiss,Judith E. Fan

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Data, powerful tools, tools for communicating, Data visualizations, models

备注

点击查看摘要

Abstract:Data visualizations are powerful tools for communicating patterns in quantitative data. Yet understanding any data visualization is no small feat -- succeeding requires jointly making sense of visual, numerical, and linguistic inputs arranged in a conventionalized format one has previously learned to parse. Recently developed vision-language models are, in principle, promising candidates for developing computational models of these cognitive operations. However, it is currently unclear to what degree these models emulate human behavior on tasks that involve reasoning about data visualizations. This gap reflects limitations in prior work that has evaluated data visualization understanding in artificial systems using measures that differ from those typically used to assess these abilities in humans. Here we evaluated eight vision-language models on six data visualization literacy assessments designed for humans and compared model responses to those of human participants. We found that these models performed worse than human participants on average, and this performance gap persisted even when using relatively lenient criteria to assess model performance. Moreover, while relative performance across items was somewhat correlated between models and humans, all models produced patterns of errors that were reliably distinct from those produced by human participants. Taken together, these findings suggest significant opportunities for further development of artificial systems that might serve as useful models of how humans reason about data visualizations. All code and data needed to reproduce these results are available at: this https URL.

159. 【2505.17169】Next Token Perception Score: Analytical Assessment of your LLM Perception Skills

链接https://arxiv.org/abs/2505.17169

作者:Yu-Ang Cheng,Leyang Hu,Hai Huang,Randall Balestriero

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:learning general-purpose representations, facto paradigm, paradigm for learning, learning general-purpose, perception

备注

点击查看摘要

Abstract:Autoregressive pretraining has become the de facto paradigm for learning general-purpose representations in large language models (LLMs). However, linear probe performance across downstream perception tasks shows substantial variability, suggesting that features optimized for next-token prediction do not consistently transfer well to downstream perception tasks. We demonstrate that representations learned via autoregression capture features that may lie outside the subspaces most informative for perception. To quantify the (mis)alignment between autoregressive pretraining and downstream perception, we introduce the Next Token Perception Score (NTPS)-a score derived under a linear setting that measures the overlap between autoregressive and perception feature subspaces. This metric can be easily computed in closed form from pretrained representations and labeled data, and is proven to both upper- and lower-bound the excess loss. Empirically, we show that NTPS correlates strongly with linear probe accuracy across 12 diverse NLP datasets and eight pretrained models ranging from 270M to 8B parameters, confirming its utility as a measure of alignment. Furthermore, we show that NTPS increases following low-rank adaptation (LoRA) fine-tuning, especially in large models, suggesting that LoRA aligning representations to perception tasks enhances subspace overlap and thus improves downstream performance. More importantly, we find that NTPS reliably predicts the additional accuracy gains attained by LoRA finetuning thereby providing a lightweight prescreening tool for LoRA adaptation. Our results offer both theoretical insights and practical tools for analytically assessing LLM perception skills.

160. 【2505.17167】CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation

链接https://arxiv.org/abs/2505.17167

作者:Ibrahim Ethem Hamamci,Sezgin Er,Suprosanna Shit,Hadrien Reynaud,Bernhard Kainz,Bjoern Menze

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Evaluating long-context radiology, Evaluating long-context, radiology report generation, long-context radiology report, generation is challenging

备注

点击查看摘要

Abstract:Evaluating long-context radiology report generation is challenging. NLG metrics fail to capture clinical correctness, while LLM-based metrics often lack generalizability. Clinical accuracy metrics are more relevant but are sensitive to class imbalance, frequently favoring trivial predictions. We propose the CRG Score, a distribution-aware and adaptable metric that evaluates only clinically relevant abnormalities explicitly described in reference reports. CRG supports both binary and structured labels (e.g., type, location) and can be paired with any LLM for feature extraction. By balancing penalties based on label distribution, it enables fairer, more robust evaluation and serves as a clinically aligned reward function.

161. 【2505.17163】OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

链接https://arxiv.org/abs/2505.17163

作者:Mingxin Huang,Yongxin Shi,Dezhi Peng,Songxuan Lai,Zecheng Xie,Lianwen Jin

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:demonstrated remarkable performance, Recent advancements, multimodal slow-thinking systems, text-rich image reasoning, Multimodal Large Language

备注

点击查看摘要

Abstract:Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results demonstrate the limitations of existing methodologies. Notably, even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy surpassing 50\% across OCR-Reasoning, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at this https URL.

162. 【2505.17160】Harry Potter is Still Here! Probing Knowledge Leakage in Targeted Unlearned Large Language Models via Automated Adversarial Prompting

链接https://arxiv.org/abs/2505.17160

作者:Bang Trinh Tran To,Thai Le

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

关键词:work presents LURK, adversarial suffix prompting, hidden retained knowledge, Harry Potter domain, suffix prompting

备注

点击查看摘要

Abstract:This work presents LURK (Latent UnleaRned Knowledge), a novel framework that probes for hidden retained knowledge in unlearned LLMs through adversarial suffix prompting. LURK automatically generates adversarial prompt suffixes designed to elicit residual knowledge about the Harry Potter domain, a commonly used benchmark for unlearning. Our experiments reveal that even models deemed successfully unlearned can leak idiosyncratic information under targeted adversarial conditions, highlighting critical limitations of current unlearning evaluation standards. By uncovering latent knowledge through indirect probing, LURK offers a more rigorous and diagnostic tool for assessing the robustness of unlearning algorithms. All code will be publicly available.

163. 【2505.17156】PersonaBOT: Bringing Customer Personas to Life with LLMs and RAG

链接https://arxiv.org/abs/2505.17156

作者:Muhammed Rizwan,Lars Carlsson,Mohammad Loni

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Natural Language Processing, transformed Natural Language, significantly transformed Natural, Language Models

备注

点击查看摘要

Abstract:The introduction of Large Language Models (LLMs) has significantly transformed Natural Language Processing (NLP) applications by enabling more advanced analysis of customer personas. At Volvo Construction Equipment (VCE), customer personas have traditionally been developed through qualitative methods, which are time-consuming and lack scalability. The main objective of this paper is to generate synthetic customer personas and integrate them into a Retrieval-Augmented Generation (RAG) chatbot to support decision-making in business processes. To this end, we first focus on developing a persona-based RAG chatbot integrated with verified personas. Next, synthetic personas are generated using Few-Shot and Chain-of-Thought (CoT) prompting techniques and evaluated based on completeness, relevance, and consistency using McNemar's test. In the final step, the chatbot's knowledge base is augmented with synthetic personas and additional segment information to assess improvements in response accuracy and practical utility. Key findings indicate that Few-Shot prompting outperformed CoT in generating more complete personas, while CoT demonstrated greater efficiency in terms of response time and token usage. After augmenting the knowledge base, the average accuracy rating of the chatbot increased from 5.88 to 6.42 on a 10-point scale, and 81.82% of participants found the updated system useful in business contexts.

164. 【2505.17155】rimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling

链接https://arxiv.org/abs/2505.17155

作者:Weizhe Lin,Xing Li,Zhiyuan Yang,Xiaojin Fu,Hui-Ling Zhen,Yaoyuan Wang,Xianzhi Yu,Wulong Liu,Xiaosong Li,Mingxuan Yuan

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Reasoning Models, tackling complex mathematical, demonstrate exceptional capability, Large Reasoning, Reasoning Models

备注

点击查看摘要

Abstract:Large Reasoning Models (LRMs) demonstrate exceptional capability in tackling complex mathematical, logical, and coding tasks by leveraging extended Chain-of-Thought (CoT) reasoning. Test-time scaling methods, such as prolonging CoT with explicit token-level exploration, can push LRMs' accuracy boundaries, but they incur significant decoding overhead. A key inefficiency source is LRMs often generate redundant thinking CoTs, which demonstrate clear structured overthinking and underthinking patterns. Inspired by human cognitive reasoning processes and numerical optimization theories, we propose TrimR, a verifier-based, training-free, efficient framework for dynamic CoT compression to trim reasoning and enhance test-time scaling, explicitly tailored for production-level deployment. Our method employs a lightweight, pretrained, instruction-tuned verifier to detect and truncate redundant intermediate thoughts of LRMs without any LRM or verifier fine-tuning. We present both the core algorithm and asynchronous online system engineered for high-throughput industrial applications. Empirical evaluations on Ascend NPUs and vLLM show that our framework delivers substantial gains in inference efficiency under large-batch workloads. In particular, on the four MATH500, AIME24, AIME25, and GPQA benchmarks, the reasoning runtime of Pangu-R-38B, QwQ-32B, and DeepSeek-R1-Distill-Qwen-32B is improved by up to 70% with negligible impact on accuracy.

165. 【2505.17153】Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN

链接https://arxiv.org/abs/2505.17153

作者:Yao Xu,Mingyu Xu,Fangyu Lei,Wangtao Sun,Xiangrong Zeng,Bingning Wang,Guang Liu,Shizhu He,Jun Zhao,Kang Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:demonstrated remarkable performance, Cyclical Reasoning, complex reasoning tasks, reasoning, demonstrated remarkable

备注

点击查看摘要

Abstract:Recently, models such as OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable performance on complex reasoning tasks through Long Chain-of-Thought (Long-CoT) reasoning. Although distilling this capability into student models significantly enhances their performance, this paper finds that fine-tuning LLMs with full parameters or LoRA with a low rank on long CoT data often leads to Cyclical Reasoning, where models repeatedly reiterate previous inference steps until the maximum length limit. Further analysis reveals that smaller differences in representations between adjacent tokens correlates with a higher tendency toward Cyclical Reasoning. To mitigate this issue, this paper proposes Shift Feedforward Networks (Shift-FFN), a novel approach that edits the current token's representation with the previous one before inputting it to FFN. This architecture dynamically amplifies the representation differences between adjacent tokens. Extensive experiments on multiple mathematical reasoning tasks demonstrate that LoRA combined with Shift-FFN achieves higher accuracy and a lower rate of Cyclical Reasoning across various data sizes compared to full fine-tuning and standard LoRA. Our data and code are available at this https URL

166. 【2505.17151】Bayesian Optimization for Enhanced Language Models: Optimizing Acquisition Functions

链接https://arxiv.org/abs/2505.17151

作者:Zishuo Bao,Yibo Liu,Changyutao Qiu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:finding proper hyperparameters, language model architecture, finding proper, stream tasks Model, acquisition functions

备注: 12 pages, 3 figures, 2 tables

点击查看摘要

Abstract:With the rise of different language model architecture, fine-tuning is becoming even more important for down stream tasks Model gets messy, finding proper hyperparameters for fine-tuning. Although BO has been tried for hyperparameter tuning, most of the existing methods are oblivious to the fact that BO relies on careful choices of acquisition functions, which are essential components of BO that guide how much to explore versus exploit during the optimization process; Different acquisition functions have different levels of sensitivity towards training loss and validation performance; existing methods often just apply an acquisition function no matter if the training and validation performance are sensitive to the acquisition function or not. This work introduces{Bilevel - BO - SWA}, a model fusion approach coupled with a bilevel BO strategy to improve the fine - tunning of large language models. Our work on mixture of acquisition functions like EI and UCB into nested opt loops, where inner loop perform minimization of training loss while outer loops optimized w.r.t. val metric. Experiments on GLUE tasks using RoBERTA - base show that when using EI and UCB, there is an improvement in generalization, and fine - tuning can be improved by up to 2.7%.

167. 【2505.17149】Large Language Models for Predictive Analysis: How Far Are They?

链接https://arxiv.org/abs/2505.17149

作者:Qin Chen,Yuanyi Ren,Xiaojun Ma,Yuyang Shi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Predictive analysis, cornerstone of modern, Language Models, modern decision-making

备注: Accepted to ACL 2025 Findings

点击查看摘要

Abstract:Predictive analysis is a cornerstone of modern decision-making, with applications in various domains. Large Language Models (LLMs) have emerged as powerful tools in enabling nuanced, knowledge-intensive conversations, thus aiding in complex decision-making tasks. With the burgeoning expectation to harness LLMs for predictive analysis, there is an urgent need to systematically assess their capability in this domain. However, there is a lack of relevant evaluations in existing studies. To bridge this gap, we introduce the \textbf{PredictiQ} benchmark, which integrates 1130 sophisticated predictive analysis queries originating from 44 real-world datasets of 8 diverse fields. We design an evaluation protocol considering text analysis, code generation, and their alignment. Twelve renowned LLMs are evaluated, offering insights into their practical use in predictive analysis. Generally, we believe that existing LLMs still face considerable challenges in conducting predictive analysis. See \href{this https URL}{Github}.

168. 【2505.17144】MDIT-Bench: Evaluating the Dual-Implicit Toxicity in Large Multimodal Models

链接https://arxiv.org/abs/2505.17144

作者:Bohan Jin,Shuhan Qi,Kehai Chen,Xinyi Guo,Xuan Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Multimodal Models, Large Multimodal, toxicity, dual-implicit toxicity, Multimodal Dual-Implicit Toxicity

备注: Findings of ACL 2025

点击查看摘要

Abstract:The widespread use of Large Multimodal Models (LMMs) has raised concerns about model toxicity. However, current research mainly focuses on explicit toxicity, with less attention to some more implicit toxicity regarding prejudice and discrimination. To address this limitation, we introduce a subtler type of toxicity named dual-implicit toxicity and a novel toxicity benchmark termed MDIT-Bench: Multimodal Dual-Implicit Toxicity Benchmark. Specifically, we first create the MDIT-Dataset with dual-implicit toxicity using the proposed Multi-stage Human-in-loop In-context Generation method. Based on this dataset, we construct the MDIT-Bench, a benchmark for evaluating the sensitivity of models to dual-implicit toxicity, with 317,638 questions covering 12 categories, 23 subcategories, and 780 topics. MDIT-Bench includes three difficulty levels, and we propose a metric to measure the toxicity gap exhibited by the model across them. In the experiment, we conducted MDIT-Bench on 13 prominent LMMs, and the results show that these LMMs cannot handle dual-implicit toxicity effectively. The model's performance drops significantly in hard level, revealing that these LMMs still contain a significant amount of hidden but activatable toxicity. Data are available at this https URL.

169. 【2505.17140】Data Doping or True Intelligence? Evaluating the Transferability of Injected Knowledge in LLMs

链接https://arxiv.org/abs/2505.17140

作者:Essa Jan,Moiz Ali,Muhammad Saram Hassan,Fareed Zaffar,Yasir Zaki

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:injecting proprietary information, large language models, outdated over time, proprietary information, large language

备注: 4 pages, 1 figure

点击查看摘要

Abstract:As the knowledge of large language models (LLMs) becomes outdated over time, there is a growing need for efficient methods to update them, especially when injecting proprietary information. Our study reveals that comprehension-intensive fine-tuning tasks (e.g., question answering and blanks) achieve substantially higher knowledge retention rates (48%) compared to mapping-oriented tasks like translation (17%) or text-to-JSON conversion (20%), despite exposure to identical factual content. We demonstrate that this pattern persists across model architectures and follows scaling laws, with larger models showing improved retention across all task types. However, all models exhibit significant performance drops when applying injected knowledge in broader contexts, suggesting limited semantic integration. These findings show the importance of task selection in updating LLM knowledge, showing that effective knowledge injection relies not just on data exposure but on the depth of cognitive engagement during fine-tuning.

170. 【2505.17139】EarthSE: A Benchmark Evaluating Earth Scientific Exploration Capability for Large Language Models

链接https://arxiv.org/abs/2505.17139

作者:Wanghan Xu,Xiangyu Zhao,Yuhao Zhou,Xiaoyu Yue,Ben Fei,Fenghua Ling,Wenlong Zhang,Lei Bai

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Advancements in Large, Language Models, Large Language, necessitating specialized benchmarks

备注

点击查看摘要

Abstract:Advancements in Large Language Models (LLMs) drive interest in scientific applications, necessitating specialized benchmarks such as Earth science. Existing benchmarks either present a general science focus devoid of Earth science specificity or cover isolated subdomains, lacking holistic evaluation. Furthermore, current benchmarks typically neglect the assessment of LLMs' capabilities in open-ended scientific exploration. In this paper, we present a comprehensive and professional benchmark for the Earth sciences, designed to evaluate the capabilities of LLMs in scientific exploration within this domain, spanning from fundamental to advanced levels. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron, which offers extensive question coverage for broad assessment, and Earth-Silver, which features a higher level of difficulty to evaluate professional depth. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing foundational knowledge crucial for scientific exploration. Most notably, we introduce Earth-Gold with new metrics, a dataset comprising open-ended multi-turn dialogues specifically designed to evaluate the advanced capabilities of LLMs in scientific exploration, including methodology induction, limitation analysis, and concept proposal. Extensive experiments reveal limitations in 11 leading LLMs across different domains and tasks, highlighting considerable room for improvement in their scientific exploration capabilities. The benchmark is available on this https URL .

171. 【2505.17137】Cog-TiPRO: Iterative Prompt Refinement with LLMs to Detect Cognitive Decline via Longitudinal Voice Assistant Commands

链接https://arxiv.org/abs/2505.17137

作者:Kristin Qi,Youxiang Zhu,Caroline Summerour,John A. Batsis,Xiaohui Liang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:neurodegenerative disease progression, slow neurodegenerative disease, Early detection, disease progression, crucial for enabling

备注: Submitted to the IEEE GlobeCom 2025

点击查看摘要

Abstract:Early detection of cognitive decline is crucial for enabling interventions that can slow neurodegenerative disease progression. Traditional diagnostic approaches rely on labor-intensive clinical assessments, which are impractical for frequent monitoring. Our pilot study investigates voice assistant systems (VAS) as non-invasive tools for detecting cognitive decline through longitudinal analysis of speech patterns in voice commands. Over an 18-month period, we collected voice commands from 35 older adults, with 15 participants providing daily at-home VAS interactions. To address the challenges of analyzing these short, unstructured and noisy commands, we propose Cog-TiPRO, a framework that combines (1) LLM-driven iterative prompt refinement for linguistic feature extraction, (2) HuBERT-based acoustic feature extraction, and (3) transformer-based temporal modeling. Using iTransformer, our approach achieves 73.80% accuracy and 72.67% F1-score in detecting MCI, outperforming its baseline by 27.13%. Through our LLM approach, we identify linguistic features that uniquely characterize everyday command usage patterns in individuals experiencing cognitive decline.

172. 【2505.17136】Foundation Models for Geospatial Reasoning: Assessing Capabilities of Large Language Models in Understanding Geometries and Topological Spatial Relations

链接https://arxiv.org/abs/2505.17136

作者:Yuhan Ji,Song Gao,Ying Nie,Ivan Majić,Krzysztof Janowicz

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:datasets remains challenging, remains challenging due, specifically vector-based geometries, Applying AI foundation, geospatial datasets remains

备注: 33 pages, 13 figures, IJGIS GeoFM Special Issue

点击查看摘要

Abstract:Applying AI foundation models directly to geospatial datasets remains challenging due to their limited ability to represent and reason with geographical entities, specifically vector-based geometries and natural language descriptions of complex spatial relations. To address these issues, we investigate the extent to which a well-known-text (WKT) representation of geometries and their spatial relations (e.g., topological predicates) are preserved during spatial reasoning when the geospatial vector data are passed to large language models (LLMs) including GPT-3.5-turbo, GPT-4, and DeepSeek-R1-14B. Our workflow employs three distinct approaches to complete the spatial reasoning tasks for comparison, i.e., geometry embedding-based, prompt engineering-based, and everyday language-based evaluation. Our experiment results demonstrate that both the embedding-based and prompt engineering-based approaches to geospatial question-answering tasks with GPT models can achieve an accuracy of over 0.6 on average for the identification of topological spatial relations between two geometries. Among the evaluated models, GPT-4 with few-shot prompting achieved the highest performance with over 0.66 accuracy on topological spatial relation inference. Additionally, GPT-based reasoner is capable of properly comprehending inverse topological spatial relations and including an LLM-generated geometry can enhance the effectiveness for geographic entity retrieval. GPT-4 also exhibits the ability to translate certain vernacular descriptions about places into formal topological relations, and adding the geometry-type or place-type context in prompts may improve inference accuracy, but it varies by instance. The performance of these spatial reasoning tasks offers valuable insights for the refinement of LLMs with geographical knowledge towards the development of geo-foundation models capable of geospatial reasoning.

173. 【2505.17135】When can isotropy help adapt LLMs' next word prediction to numerical domains?

链接https://arxiv.org/abs/2505.17135

作者:Rashed Shelim,Shengzhe Xu,Walid Saad,Naren Ramakrishnan

类目:Computation and Language (cs.CL)

关键词:Recent studies, numerical domains, studies have shown, shown that vector, contextual embeddings learned

备注

点击查看摘要

Abstract:Recent studies have shown that vector representations of contextual embeddings learned by pre-trained large language models (LLMs) are effective in various downstream tasks in numerical domains. Despite their significant benefits, the tendency of LLMs to hallucinate in such domains can have severe consequences in applications such as energy, nature, finance, healthcare, retail and transportation, among others. To guarantee prediction reliability and accuracy in numerical domains, it is necessary to open the black-box and provide performance guarantees through explanation. However, there is little theoretical understanding of when pre-trained language models help solve numeric downstream tasks. This paper seeks to bridge this gap by understanding when the next-word prediction capability of LLMs can be adapted to numerical domains through a novel analysis based on the concept of isotropy in the contextual embedding space. Specifically, we consider a log-linear model for LLMs in which numeric data can be predicted from its context through a network with softmax in the output layer of LLMs (i.e., language model head in self-attention). We demonstrate that, in order to achieve state-of-the-art performance in numerical domains, the hidden representations of the LLM embeddings must possess a structure that accounts for the shift-invariance of the softmax function. By formulating a gradient structure of self-attention in pre-trained models, we show how the isotropic property of LLM embeddings in contextual embedding space preserves the underlying structure of representations, thereby resolving the shift-invariance problem and providing a performance guarantee. Experiments show that different characteristics of numeric data and model architecture could have different impacts on isotropy.

174. 【2505.17134】LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions

链接https://arxiv.org/abs/2505.17134

作者:Chaochen Gao,Xing Wu,Zijia Lin,Debing Zhang,Songlin Hu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, long-context instruction data, aligning long-context large, long-context large language, instruction data

备注

点击查看摘要

Abstract:High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.

175. 【2505.17132】Robustifying Vision-Language Models via Dynamic Token Reweighting

链接https://arxiv.org/abs/2505.17132

作者:Tanqiu Jiang,Jiacheng Liang,Rongyi Zhu,Jiawei Zhou,Fenglong Ma,Ting Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Large vision-language models, exploit visual-textual interactions, Large vision-language, bypass safety guardrails, highly vulnerable

备注

点击查看摘要

Abstract:Large vision-language models (VLMs) are highly vulnerable to jailbreak attacks that exploit visual-textual interactions to bypass safety guardrails. In this paper, we present DTR, a novel inference-time defense that mitigates multimodal jailbreak attacks through optimizing the model's key-value (KV) caches. Rather than relying on curated safety-specific data or costly image-to-text conversion, we introduce a new formulation of the safety-relevant distributional shift induced by the visual modality. This formulation enables DTR to dynamically adjust visual token weights, minimizing the impact of adversarial visual inputs while preserving the model's general capabilities and inference efficiency. Extensive evaluation across diverse VLMs and attack benchmarks demonstrates that \sys outperforms existing defenses in both attack robustness and benign task performance, marking the first successful application of KV cache optimization for safety enhancement in multimodal foundation models. The code for replicating DTR is available: this https URL (warning: this paper contains potentially harmful content generated by VLMs.)

176. 【2505.17131】Relative Bias: A Comparative Framework for Quantifying Bias in LLMs

链接https://arxiv.org/abs/2505.17131

作者:Alireza Arbabi,Florian Kerschbaum

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

关键词:raising critical questions, raising critical, societal impact, growing deployment, deployment of large

备注

点击查看摘要

Abstract:The growing deployment of large language models (LLMs) has amplified concerns regarding their inherent biases, raising critical questions about their fairness, safety, and societal impact. However, quantifying LLM bias remains a fundamental challenge, complicated by the ambiguity of what "bias" entails. This challenge grows as new models emerge rapidly and gain widespread use, while introducing potential biases that have not been systematically assessed. In this paper, we propose the Relative Bias framework, a method designed to assess how an LLM's behavior deviates from other LLMs within a specified target domain. We introduce two complementary methodologies: (1) Embedding Transformation analysis, which captures relative bias patterns through sentence representations over the embedding space, and (2) LLM-as-a-Judge, which employs a language model to evaluate outputs comparatively. Applying our framework to several case studies on bias and alignment scenarios following by statistical tests for validation, we find strong alignment between the two scoring methods, offering a systematic, scalable, and statistically grounded approach for comparative bias analysis in LLMs.

177. 【2505.17126】Conformal Language Model Reasoning with Coherent Factuality

链接https://arxiv.org/abs/2505.17126

作者:Maxon Rubin-Toles,Maya Gambhir,Keshav Ramji,Aaron Roth,Surbhi Goel

类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

关键词:important decision pipelines, decision pipelines, language model, language model outputs, important decision

备注

点击查看摘要

Abstract:Language models are increasingly being used in important decision pipelines, so ensuring the correctness of their outputs is crucial. Recent work has proposed evaluating the "factuality" of claims decomposed from a language model generation and applying conformal prediction techniques to filter out those claims that are not factual. This can be effective for tasks such as information retrieval, where constituent claims may be evaluated in isolation for factuality, but is not appropriate for reasoning tasks, as steps of a logical argument can be evaluated for correctness only within the context of the claims that precede them. To capture this, we define "coherent factuality" and develop a conformal-prediction-based method to guarantee coherent factuality for language model outputs. Our approach applies split conformal prediction to subgraphs within a "deducibility" graph" that represents the steps of a reasoning problem. We evaluate our method on mathematical reasoning problems from the MATH and FELM datasets and find that our algorithm consistently produces correct and substantiated orderings of claims, achieving coherent factuality across target coverage levels. Moreover, we achieve 90% factuality on our stricter definition while retaining 80% or more of the original claims, highlighting the utility of our deducibility-graph-guided approach.

178. 【2505.17123】MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

链接https://arxiv.org/abs/2505.17123

作者:Xiaoyuan Li,Keqin Bao,Yubo Ma,Moxin Li,Wenjie Wang,Rui Men,Yichang Zhang,Fuli Feng,Dayiheng Liu,Junyang Lin

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Recent advances, advances in Large, shown promising results

备注: Under Review

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute it to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for LLMs' Multi-Turn Reasoning evaluation. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities, fine-grained difficulty granularity, and necessitates multi-turn interactions with the environments. Moreover, MTR-Bench features fully-automated framework spanning both dataset constructions and model evaluations, which enables scalable assessment without human interventions. Extensive experiments reveal that even the cutting-edge reasoning models fall short of multi-turn, interactive reasoning tasks. And the further analysis upon these results brings valuable insights for future research in interactive AI systems.

179. 【2505.17122】Shallow Preference Signals: Large Language Model Aligns Even Better with Truncated Data?

链接https://arxiv.org/abs/2505.17122

作者:Xuan Qi,Jiahao Qiu,Xinzhe Juan,Yue Wu,Mengdi Wang

类目:Computation and Language (cs.CL)

关键词:human preferences remains, Direct Preference Optimization, shallow preference signals, human preferences, preference signals

备注: 17 pages, 7 figures

点击查看摘要

Abstract:Aligning large language models (LLMs) with human preferences remains a key challenge in AI. Preference-based optimization methods, such as Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on human-annotated datasets to improve alignment. In this work, we identify a crucial property of the existing learning method: the distinguishing signal obtained in preferred responses is often concentrated in the early tokens. We refer to this as shallow preference signals. To explore this property, we systematically truncate preference datasets at various points and train both reward models and DPO models on the truncated data. Surprisingly, models trained on truncated datasets, retaining only the first half or fewer tokens, achieve comparable or even superior performance to those trained on full datasets. For example, a reward model trained on the Skywork-Reward-Preference-80K-v0.2 dataset outperforms the full dataset when trained on a 40\% truncated dataset. This pattern is consistent across multiple datasets, suggesting the widespread presence of shallow preference signals. We further investigate the distribution of the reward signal through decoding strategies. We consider two simple decoding strategies motivated by the shallow reward signal observation, namely Length Control Decoding and KL Threshold Control Decoding, which leverage shallow preference signals to optimize the trade-off between alignment and computational efficiency. The performance is even better, which again validates our hypothesis. The phenomenon of shallow preference signals highlights potential issues in LLM alignment: existing alignment methods often focus on aligning only the initial tokens of responses, rather than considering the full response. This could lead to discrepancies with real-world human preferences, resulting in suboptimal alignment performance.

Comments:
17 pages, 7 figures

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2505.17122 [cs.CL]

(or
arXiv:2505.17122v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2505.17122

Focus to learn more

              arXiv-issued DOI via DataCite

Submission history From: Xuan Qi [view email] [v1]
Wed, 21 May 2025 17:59:02 UTC (433 KB)

180. 【2505.17121】NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation

链接https://arxiv.org/abs/2505.17121

作者:Weiming Wu,Zi-kang Wang,Jin Ye,Zhi Zhou,Yu-Feng Li,Lan-Zhe Guo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Obtaining large-scale, paths is crucial, crucial for improving, capabilities of multi-modal, multi-modal large language

备注

点击查看摘要

Abstract:Obtaining large-scale, high-quality data with reasoning paths is crucial for improving the geometric reasoning capabilities of multi-modal large language models (MLLMs). However, existing data generation methods, whether based on predefined templates or constrained symbolic provers, inevitably face diversity and numerical generalization limitations. To address these limitations, we propose NeSyGeo, a novel neuro-symbolic framework for generating geometric reasoning data. First, we propose a domain-specific language grounded in the entity-relation-constraint paradigm to comprehensively represent all components of plane geometry, along with generative actions defined within this symbolic space. We then design a symbolic-visual-text pipeline that synthesizes symbolic sequences, maps them to corresponding visual and textual representations, and generates diverse question-answer (QA) pairs using large language models (LLMs). To the best of our knowledge, we are the first to propose a neuro-symbolic approach in generating multimodal reasoning data. Based on this framework, we construct NeSyGeo-CoT and NeSyGeo-Caption datasets, containing 100k samples, and release a new benchmark NeSyGeo-Test for evaluating geometric reasoning abilities in MLLMs. Experiments demonstrate that the proposal significantly and consistently improves the performance of multiple MLLMs under both reinforcement and supervised fine-tuning. With only 4k samples and two epochs of reinforcement fine-tuning, base models achieve improvements of up to +15.8% on MathVision, +8.4% on MathVerse, and +7.3% on GeoQA. Notably, a 4B model can be improved to outperform an 8B model from the same series on geometric reasoning tasks.

181. 【2505.17120】Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training

链接https://arxiv.org/abs/2505.17120

作者:Dillon Plunkett,Adam Morris,Keerthi Reddy,Jorge Morales

类目:Computation and Language (cs.CL)

关键词:large language models, large language, limited understanding, LLMs, language models

备注

点击查看摘要

Abstract:We have only limited understanding of how and why large language models (LLMs) respond in the ways that they do. Their neural networks have proven challenging to interpret, and we are only beginning to tease out the function of individual neurons and circuits within them. However, another path to understanding these systems is to investigate and develop their capacity to introspect and explain their own functioning. Here, we show that i) contemporary LLMs are capable of providing accurate, quantitative descriptions of their own internal processes during certain kinds of decision-making, ii) that it is possible to improve these capabilities through training, and iii) that this training generalizes to at least some degree. To do so, we fine-tuned GPT-4o and GPT-4o-mini to make decisions in a wide variety of complex contexts (e.g., choosing between condos, loans, vacations, etc.) according to randomly-generated, quantitative preferences about how to weigh different attributes during decision-making (e.g., the relative importance of natural light versus quiet surroundings for condos). We demonstrate that the LLMs can accurately report these preferences (i.e., the weights that they learned to give to different attributes during decision-making). Next, we demonstrate that these LLMs can be fine-tuned to explain their decision-making even more accurately. Finally, we demonstrate that this training generalizes: It improves the ability of the models to accurately explain what they are doing as they make other complex decisions, not just decisions they have learned to make via fine-tuning. This work is a step towards training LLMs to accurately and broadly report on their own internal processes -- a possibility that would yield substantial benefits for interpretability, control, and safety.

182. 【2505.17119】Systematic Evaluation of Machine-Generated Reasoning and PHQ-9 Labeling for Depression Detection Using Large Language Models

链接https://arxiv.org/abs/2505.17119

作者:Zongru Shao,Xin Wang,Zhanyang Liu,Chenhan Wang,K.P. Subbalakshmi

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Recent research leverages, research leverages large, early mental health, Recent research, mental health detection

备注: 8 pages without references

点击查看摘要

Abstract:Recent research leverages large language models (LLMs) for early mental health detection, such as depression, often optimized with machine-generated data. However, their detection may be subject to unknown weaknesses. Meanwhile, quality control has not been applied to these generated corpora besides limited human verifications. Our goal is to systematically evaluate LLM reasoning and reveal potential weaknesses. To this end, we first provide a systematic evaluation of the reasoning over machine-generated detection and interpretation. Then we use the models' reasoning abilities to explore mitigation strategies for enhanced performance. Specifically, we do the following: A. Design an LLM instruction strategy that allows for systematic analysis of the detection by breaking down the task into several subtasks. B. Design contrastive few-shot and chain-of-thought prompts by selecting typical positive and negative examples of detection reasoning. C. Perform human annotation for the subtasks identified in the first step and evaluate the performance. D. Identify human-preferred detection with desired logical reasoning from the few-shot generation and use them to explore different optimization strategies. We conducted extensive comparisons on the DepTweet dataset across the following subtasks: 1. identifying whether the speaker is describing their own depression; 2. accurately detecting the presence of PHQ-9 symptoms, and 3. finally, detecting depression. Human verification of statistical outliers shows that LLMs demonstrate greater accuracy in analyzing and detecting explicit language of depression as opposed to implicit expressions of depression. Two optimization methods are used for performance enhancement and reduction of the statistic bias: supervised fine-tuning (SFT) and direct preference optimization (DPO). Notably, the DPO approach achieves significant performance improvement.

183. 【2505.17118】After Retrieval, Before Generation: Enhancing the Trustworthiness of Large Language Models in RAG

链接https://arxiv.org/abs/2505.17118

作者:Xinbang Dai,Huikang Hu,Yuncheng Hua,Jiaqi Li,Yongrui Chen,Rihui Jin,Nan Hu,Guilin Qi

类目:Computation and Language (cs.CL)

关键词:systems face critical, face critical challenges, Retrieval-augmented generation, Trustworthiness Response Dataset, systems face

备注: 24 pages, 8 figures

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems face critical challenges in balancing internal (parametric) and external (retrieved) knowledge, especially when these sources conflict or are unreliable. To analyze these scenarios comprehensively, we construct the Trustworthiness Response Dataset (TRD) with 36,266 questions spanning four RAG settings. We reveal that existing approaches address isolated scenarios-prioritizing one knowledge source, naively merging both, or refusing answers-but lack a unified framework to handle different real-world conditions simultaneously. Therefore, we propose the BRIDGE framework, which dynamically determines a comprehensive response strategy of large language models (LLMs). BRIDGE leverages an adaptive weighting mechanism named soft bias to guide knowledge collection, followed by a Maximum Soft-bias Decision Tree to evaluate knowledge and select optimal response strategies (trust internal/external knowledge, or refuse). Experiments show BRIDGE outperforms baselines by 5-15% in accuracy while maintaining balanced performance across all scenarios. Our work provides an effective solution for LLMs' trustworthy responses in real-world RAG applications.

184. 【2505.17117】From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

链接https://arxiv.org/abs/2505.17117

作者:Chen Shani,Dan Jurafsky,Yann LeCun,Ravid Shwartz-Ziv

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)

关键词:mapping diverse instances, Humans organize knowledge, preserving meaning, robin and blue, organize knowledge

备注

点击查看摘要

Abstract:Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.

185. 【2505.17116】Comparative Evaluation of Prompting and Fine-Tuning for Applying Large Language Models to Grid-Structured Geospatial Data

链接https://arxiv.org/abs/2505.17116

作者:Akash Dhruv,Yangxinyu Xie,Jordan Branham,Tanwi Mallick

类目:Computation and Language (cs.CL); Emerging Technologies (cs.ET)

关键词:grid-structured geospatial data, large language models, interpreting grid-structured geospatial, paper presents, presents a comparative

备注

点击查看摘要

Abstract:This paper presents a comparative study of large language models (LLMs) in interpreting grid-structured geospatial data. We evaluate the performance of a base model through structured prompting and contrast it with a fine-tuned variant trained on a dataset of user-assistant interactions. Our results highlight the strengths and limitations of zero-shot prompting and demonstrate the benefits of fine-tuning for structured geospatial and temporal reasoning.

186. 【2505.17114】RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language

链接https://arxiv.org/abs/2505.17114

作者:Subrata Biswas,Mohammad Nur Hossain Khan,Bashima Islam

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

关键词:Multimodal question answering, Multimodal question, question answering, identifying which video, requires identifying

备注

点击查看摘要

Abstract:Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning -- each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio--Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks -- including egocentric and exocentric tasks -- show that RAVEN achieves up to 14.5\% and 8.0\% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4\% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23\%. Our code and dataset are available at this https URL.

187. 【2505.17112】Cultural Value Alignment in Large Language Models: A Prompt-based Analysis of Schwartz Values in Gemini, ChatGPT, and DeepSeek

链接https://arxiv.org/abs/2505.17112

作者:Robin Segerer

类目:Computation and Language (cs.CL)

关键词:large language models, large language, Portrait Values Questionnaire, study examines cultural, DeepSeek prioritize

备注: 15 pages, 1 table, 1 figure

点击查看摘要

Abstract:This study examines cultural value alignment in large language models (LLMs) by analyzing how Gemini, ChatGPT, and DeepSeek prioritize values from Schwartz's value framework. Using the 40-item Portrait Values Questionnaire, we assessed whether DeepSeek, trained on Chinese-language data, exhibits distinct value preferences compared to Western models. Results of a Bayesian ordinal regression model show that self-transcendence values (e.g., benevolence, universalism) were highly prioritized across all models, reflecting a general LLM tendency to emphasize prosocial values. However, DeepSeek uniquely downplayed self-enhancement values (e.g., power, achievement) compared to ChatGPT and Gemini, aligning with collectivist cultural tendencies. These findings suggest that LLMs reflect culturally situated biases rather than a universal ethical framework. To address value asymmetries in LLMs, we propose multi-perspective reasoning, self-reflective feedback, and dynamic contextualization. This study contributes to discussions on AI fairness, cultural neutrality, and the need for pluralistic AI alignment frameworks that integrate diverse moral perspectives.

188. 【2505.17110】Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling

链接https://arxiv.org/abs/2505.17110

作者:Junlin Li,Guodong DU,Jing Li,Sim Kuan Goh,Wenya Wang,Yequan Wang,Fangming Liu,Ho-Kin Tang,Saleh Alharbi,Daojing He,Min Zhang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Fine-tuning Large Language, Language Models, Large Language, Fine-tuning Large

备注

点击查看摘要

Abstract:Fine-tuning Large Language Models (LLMs) with multimodal encoders on modality-specific data expands the modalities that LLMs can handle, leading to the formation of Multimodal LLMs (MLLMs). However, this paradigm heavily relies on resource-intensive and inflexible fine-tuning from scratch with new multimodal data. In this paper, we propose MMER (Multi-modality Expansion and Retention), a training-free approach that integrates existing MLLMs for effective multimodal expansion while retaining their original performance. Specifically, MMER reuses MLLMs' multimodal encoders while merging their LLM parameters. By comparing original and merged LLM parameters, MMER generates binary masks to approximately separate LLM parameters for each modality. These decoupled parameters can independently process modality-specific inputs, reducing parameter conflicts and preserving original MLLMs' fidelity. MMER can also mitigate catastrophic forgetting by applying a similar process to MLLMs fine-tuned on new tasks. Extensive experiments show significant improvements over baselines, proving that MMER effectively expands LLMs' multimodal capabilities while retaining 99% of the original performance, and also markedly mitigates catastrophic forgetting.

189. 【2505.17109】Mitigating Cyber Risk in the Age of Open-Weight LLMs: Policy Gaps and Technical Realities

链接https://arxiv.org/abs/2505.17109

作者:Alfonso de Gregorio

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:substantial cybersecurity risks, MITRE OCCULT, introduce substantial cybersecurity, offer significant benefits, cybersecurity risks

备注: 8 pages, no figures

点击查看摘要

Abstract:Open-weight general-purpose AI (GPAI) models offer significant benefits but also introduce substantial cybersecurity risks, as demonstrated by the offensive capabilities of models like DeepSeek-R1 in evaluations such as MITRE's OCCULT. These publicly available models empower a wider range of actors to automate and scale cyberattacks, challenging traditional defence paradigms and regulatory approaches. This paper analyzes the specific threats -- including accelerated malware development and enhanced social engineering -- magnified by open-weight AI release. We critically assess current regulations, notably the EU AI Act and the GPAI Code of Practice, identifying significant gaps stemming from the loss of control inherent in open distribution, which renders many standard security mitigations ineffective. We propose a path forward focusing on evaluating and controlling specific high-risk capabilities rather than entire models, advocating for pragmatic policy interpretations for open-weight systems, promoting defensive AI innovation, and fostering international collaboration on standards and cyber threat intelligence (CTI) sharing to ensure security without unduly stifling open technological progress.

190. 【2505.17106】RRTL: Red Teaming Reasoning Large Language Models in Tool Learning

链接https://arxiv.org/abs/2505.17106

作者:Yifei Liu,Yu Cui,Haibin Zhang

类目:Computation and Language (cs.CL)

关键词:learning significantly enhances, large language models, tool learning, tool learning significantly, significantly enhances

备注

点击查看摘要

Abstract:While tool learning significantly enhances the capabilities of large language models (LLMs), it also introduces substantial security risks. Prior research has revealed various vulnerabilities in traditional LLMs during tool learning. However, the safety of newly emerging reasoning LLMs (RLLMs), such as DeepSeek-R1, in the context of tool learning remains underexplored. To bridge this gap, we propose RRTL, a red teaming approach specifically designed to evaluate RLLMs in tool learning. It integrates two novel strategies: (1) the identification of deceptive threats, which evaluates the model's behavior in concealing the usage of unsafe tools and their potential risks; and (2) the use of Chain-of-Thought (CoT) prompting to force tool invocation. Our approach also includes a benchmark for traditional LLMs. We conduct a comprehensive evaluation on seven mainstream RLLMs and uncover three key findings: (1) RLLMs generally achieve stronger safety performance than traditional LLMs, yet substantial safety disparities persist across models; (2) RLLMs can pose serious deceptive risks by frequently failing to disclose tool usage and to warn users of potential tool output risks; (3) CoT prompting reveals multi-lingual safety vulnerabilities in RLLMs. Our work provides important insights into enhancing the security of RLLMs in tool learning.

191. 【2505.17104】P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark

链接https://arxiv.org/abs/2505.17104

作者:Tao Sun,Enhao Pan,Zhengkai Yang,Kaixin Sui,Jiajun Shi,Xianfu Cheng,Tongliang Li,Wenhao Huang,Ge Zhang,Jian Yang,Zhoujun Li

类目:Computation and Language (cs.CL); Multimedia (cs.MM)

关键词:Academic posters, scholarly communication, creation is time-consuming, vital for scholarly, manual creation

备注

点击查看摘要

Abstract:Academic posters are vital for scholarly communication, yet their manual creation is time-consuming. However, automated academic poster generation faces significant challenges in preserving intricate scientific details and achieving effective visual-textual integration. Existing approaches often struggle with semantic richness and structural nuances, and lack standardized benchmarks for evaluating generated academic posters comprehensively. To address these limitations, we introduce P2P, the first flexible, LLM-based multi-agent framework that generates high-quality, HTML-rendered academic posters directly from research papers, demonstrating strong potential for practical applications. P2P employs three specialized agents-for visual element processing, content generation, and final poster assembly-each integrated with dedicated checker modules to enable iterative refinement and ensure output quality. To foster advancements and rigorous evaluation in this domain, we construct and release P2PInstruct, the first large-scale instruction dataset comprising over 30,000 high-quality examples tailored for the academic paper-to-poster generation task. Furthermore, we establish P2PEval, a comprehensive benchmark featuring 121 paper-poster pairs and a dual evaluation methodology (Universal and Fine-Grained) that leverages LLM-as-a-Judge and detailed, human-annotated checklists. Our contributions aim to streamline research dissemination and provide the community with robust tools for developing and evaluating next-generation poster generation systems.

192. 【2505.17103】Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation

链接https://arxiv.org/abs/2505.17103

作者:Cécile Rousseau,Tobia Boschi,Giandomenico Cornacchia,Dhaval Salwala,Alessandra Pascale,Juan Bernabe Moreno

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:generating high-quality multivariate, high-quality multivariate time, time series, synthetic time series, flexible and efficient

备注

点击查看摘要

Abstract:SDForger is a flexible and efficient framework for generating high-quality multivariate time series using LLMs. Leveraging a compact data representation, SDForger provides synthetic time series generation from a few samples and low-computation fine-tuning of any autoregressive LLM. Specifically, the framework transforms univariate and multivariate signals into tabular embeddings, which are then encoded into text and used to fine-tune the LLM. At inference, new textual embeddings are sampled and decoded into synthetic time series that retain the original data's statistical properties and temporal dynamics. Across a diverse range of datasets, SDForger outperforms existing generative models in many scenarios, both in similarity-based evaluations and downstream forecasting tasks. By enabling textual conditioning in the generation process, SDForger paves the way for multimodal modeling and the streamlined integration of time series with textual information. SDForger source code will be open-sourced soon.

193. 【2505.17102】BanglaByT5: Byte-Level Modelling for Bangla

链接https://arxiv.org/abs/2505.17102

作者:Pramit Bhattacharyya,Arnab Bhattacharya

类目:Computation and Language (cs.CL)

关键词:achieved remarkable success, Large language models, natural language processing, Large language, language processing tasks

备注

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable success across various natural language processing tasks. However, most LLM models use traditional tokenizers like BPE and SentencePiece, which fail to capture the finer nuances of a morphologically rich language like Bangla (Bengali). In this work, we introduce BanglaByT5, the first byte-level encoder-decoder model explicitly tailored for Bangla. Built upon a small variant of Googles ByT5 architecture, BanglaByT5 is pre-trained on a 14GB curated corpus combining high-quality literary and newspaper articles. Through zeroshot and supervised evaluations across generative and classification tasks, BanglaByT5 demonstrates competitive performance, surpassing several multilingual and larger models. Our findings highlight the efficacy of byte-level modelling for morphologically rich languages and highlight BanglaByT5 potential as a lightweight yet powerful tool for Bangla NLP, particularly in both resource-constrained and scalable environments.

194. 【2505.17101】An approach to identify the most semantically informative deep representations of text and images

链接https://arxiv.org/abs/2505.17101

作者:Santiago Acevedo,Andrea Mascaretti,Riccardo Rende,Matéo Mahaut,Marco Baroni,Alessandro Laio

类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

关键词:Deep neural networks, semantically related data, Deep neural, develop similar representations, semantically related

备注

点击查看摘要

Abstract:Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner ``semantic'' layers containing the most language-transferable information. We find moreover that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information is spread across many tokens and it is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within visual transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.

195. 【2505.17100】Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector

链接https://arxiv.org/abs/2505.17100

作者:Haoyan Yang,Runxue Bao,Cao Xiao,Jun Ma,Parminder Bhatia,Shangqian Gao,Taha Kass-Hout

类目:Computation and Language (cs.CL)

关键词:evaluating generated outputs, automatically evaluating generated, generated outputs, promising tool, tool for automatically

备注

点击查看摘要

Abstract:LLM-as-a-Judge has emerged as a promising tool for automatically evaluating generated outputs, but its reliability is often undermined by potential biases in judgment. Existing efforts to mitigate these biases face key limitations: in-context learning-based methods fail to address rooted biases due to the evaluator's limited capacity for self-reflection, whereas fine-tuning is not applicable to all evaluator types, especially closed-source models. To address this challenge, we introduce the Reasoning-based Bias Detector (RBD), which is a plug-in module that identifies biased evaluations and generates structured reasoning to guide evaluator self-correction. Rather than modifying the evaluator itself, RBD operates externally and engages in an iterative process of bias detection and feedback-driven revision. To support its development, we design a complete pipeline consisting of biased dataset construction, supervision collection, distilled reasoning-based fine-tuning of RBD, and integration with LLM evaluators. We fine-tune four sizes of RBD models, ranging from 1.5B to 14B, and observe consistent performance improvements across all scales. Experimental results on 4 bias types--verbosity, position, bandwagon, and sentiment--evaluated using 8 LLM evaluators demonstrate RBD's strong effectiveness. For example, the RBD-8B model improves evaluation accuracy by an average of 18.5% and consistency by 10.9%, and surpasses prompting-based baselines and fine-tuned judges by 12.8% and 17.2%, respectively. These results highlight RBD's effectiveness and scalability. Additional experiments further demonstrate its strong generalization across biases and domains, as well as its efficiency.

196. 【2505.17099】Learning Interpretable Representations Leads to Semantically Faithful EEG-to-Text Generation

链接https://arxiv.org/abs/2505.17099

作者:Xiaozhao Liu,Dinggang Shen,Xihui Liu

类目:Computation and Language (cs.CL)

关键词:Pretrained generative models, non-invasive brain recordings, Pretrained generative, opened new frontiers, enabling the synthesis

备注: Code, checkpoint and text samples available at [this https URL](https://github.com/justin-xzliu/GLIM)

点击查看摘要

Abstract:Pretrained generative models have opened new frontiers in brain decoding by enabling the synthesis of realistic texts and images from non-invasive brain recordings. However, the reliability of such outputs remains questionable--whether they truly reflect semantic activation in the brain, or are merely hallucinated by the powerful generative models. In this paper, we focus on EEG-to-text decoding and address its hallucination issue through the lens of posterior collapse. Acknowledging the underlying mismatch in information capacity between EEG and text, we reframe the decoding task as semantic summarization of core meanings rather than previously verbatim reconstruction of stimulus texts. To this end, we propose the Generative Language Inspection Model (GLIM), which emphasizes learning informative and interpretable EEG representations to improve semantic grounding under heterogeneous and small-scale data conditions. Experiments on the public ZuCo dataset demonstrate that GLIM consistently generates fluent, EEG-grounded sentences without teacher forcing. Moreover, it supports more robust evaluation beyond text similarity, through EEG-text retrieval and zero-shot semantic classification across sentiment categories, relation types, and corpus topics. Together, our architecture and evaluation protocols lay the foundation for reliable and scalable benchmarking in generative brain decoding.

197. 【2505.17098】ACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration

链接https://arxiv.org/abs/2505.17098

作者:Yanshu Li,Tian Yun,Jianjiang Yang,Pinyuan Feng,Jinfa Huang,Ruixiang Tang

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:large vision-language models, Multimodal in-context learning, key mechanism, mechanism for harnessing, harnessing the capabilities

备注: 29 pages, 11 figures, 19 tables. arXiv admin note: substantial text overlap with [arXiv:2503.04839](https://arxiv.org/abs/2503.04839)

点击查看摘要

Abstract:Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input in-context sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures in-context sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a valuable perspective for interpreting and improving multimodal ICL.

198. 【2505.17097】CAMA: Enhancing Multimodal In-Context Learning with Context-Aware Modulated Attention

链接https://arxiv.org/abs/2505.17097

作者:Yanshu Li,JianJiang Yang,Bozheng Li,Ruixiang Tang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:enables large vision-language, large vision-language models, Multimodal in-context learning, in-context learning, enables large

备注: 10 pages, 2 figures, 6 tables

点击查看摘要

Abstract:Multimodal in-context learning (ICL) enables large vision-language models (LVLMs) to efficiently adapt to novel tasks, supporting a wide array of real-world applications. However, multimodal ICL remains unstable, and current research largely focuses on optimizing sequence configuration while overlooking the internal mechanisms of LVLMs. In this work, we first provide a theoretical analysis of attentional dynamics in multimodal ICL and identify three core limitations of standard attention that ICL impair performance. To address these challenges, we propose Context-Aware Modulated Attention (CAMA), a simple yet effective plug-and-play method for directly calibrating LVLM attention logits. CAMA is training-free and can be seamlessly applied to various open-source LVLMs. We evaluate CAMA on four LVLMs across six benchmarks, demonstrating its effectiveness and generality. CAMA opens new opportunities for deeper exploration and targeted utilization of LVLM attention dynamics to advance multimodal reasoning.

199. 【2505.17095】Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation

链接https://arxiv.org/abs/2505.17095

作者:Kristine Ann M. Carandang,Jasper Meynard P. Araña,Ethan Robert A. Casin,Christopher P. Monterola,Daniel Stanley Y. Tan,Jesus Felix B. Valenzuela,Christian M. Alis

类目:Computation and Language (cs.CL)

关键词:real-world clinical processes, large language models, clinical note generation, healthcare providers, presents challenges

备注

点击查看摘要

Abstract:Due to the legal and ethical responsibilities of healthcare providers (HCPs) for accurate documentation and protection of patient data privacy, the natural variability in the responses of large language models (LLMs) presents challenges for incorporating clinical note generation (CNG) systems, driven by LLMs, into real-world clinical processes. The complexity is further amplified by the detailed nature of texts in CNG. To enhance the confidence of HCPs in tools powered by LLMs, this study evaluates the reliability of 12 open-weight and proprietary LLMs from Anthropic, Meta, Mistral, and OpenAI in CNG in terms of their ability to generate notes that are string equivalent (consistency rate), have the same meaning (semantic consistency) and are correct (semantic similarity), across several iterations using the same prompt. The results show that (1) LLMs from all model families are stable, such that their responses are semantically consistent despite being written in various ways, and (2) most of the LLMs generated notes close to the corresponding notes made by experts. Overall, Meta's Llama 70B was the most reliable, followed by Mistral's Small model. With these findings, we recommend the local deployment of these relatively smaller open-weight models for CNG to ensure compliance with data privacy regulations, as well as to improve the efficiency of HCPs in clinical documentation.

200. 【2505.17091】Large Language Models Implicitly Learn to See and Hear Just By Reading

链接https://arxiv.org/abs/2505.17091

作者:Prateek Verma,Mert Pilanci

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:inherently develops internally, auto-regressive LLM model, model inherently develops, LLM models, auto-regressive LLM

备注: 6 pages, 3 figures, 4 tables. Under Review WASPAA 2025

点击查看摘要

Abstract:This paper presents a fascinating find: By training an auto-regressive LLM model on text tokens, the text model inherently develops internally an ability to understand images and audio, thereby developing the ability to see and hear just by reading. Popular audio and visual LLM models fine-tune text LLM models to give text output conditioned on images and audio embeddings. On the other hand, our architecture takes in patches of images, audio waveforms or tokens as input. It gives us the embeddings or category labels typical of a classification pipeline. We show the generality of text weights in aiding audio classification for datasets FSD-50K and GTZAN. Further, we show this working for image classification on CIFAR-10 and Fashion-MNIST, as well on image patches. This pushes the notion of text-LLMs learning powerful internal circuits that can be utilized by activating necessary connections for various applications rather than training models from scratch every single time.

201. 【2505.17089】rust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models

链接https://arxiv.org/abs/2505.17089

作者:Md Rafi Ur Rashid,Vishnu Asutosh Dasu,Ye Wang,Gang Tan,Shagufta Mehnaz

类目:Computation and Language (cs.CL)

关键词:Large Language Models, exhibit impressive capabilities, Large Language, Language Models, toxic content

备注: 26 pages, 2 figures

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four latest LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while slashing outright rejections to 4%. ASE outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, with 92-99% accuracy on adversarial QA and 4-10x lower bias scores. By transforming adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction.

202. 【2505.17087】Informatics for Food Processing

链接https://arxiv.org/abs/2505.17087

作者:Gordana Ispirova,Michael Sebek,Giulia Menichetti

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB); Machine Learning (cs.LG)

关键词:advancing food informatics, artificial intelligence, machine learning, emphasizing the transformative, transformative role

备注

点击查看摘要

Abstract:This chapter explores the evolution, classification, and health implications of food processing, while emphasizing the transformative role of machine learning, artificial intelligence (AI), and data science in advancing food informatics. It begins with a historical overview and a critical review of traditional classification frameworks such as NOVA, Nutri-Score, and SIGA, highlighting their strengths and limitations, particularly the subjectivity and reproducibility challenges that hinder epidemiological research and public policy. To address these issues, the chapter presents novel computational approaches, including FoodProX, a random forest model trained on nutrient composition data to infer processing levels and generate a continuous FPro score. It also explores how large language models like BERT and BioBERT can semantically embed food descriptions and ingredient lists for predictive tasks, even in the presence of missing data. A key contribution of the chapter is a novel case study using the Open Food Facts database, showcasing how multimodal AI models can integrate structured and unstructured data to classify foods at scale, offering a new paradigm for food processing assessment in public health and research.

203. 【2505.17086】Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization

链接https://arxiv.org/abs/2505.17086

作者:Yihong Wu,Liheng Ma,Muzhi Li,Jiaming Zhou,Jianye Hao,Ho-fung Leung,Irwin King,Yingxue Zhang,Jian-Yun Nie

类目:Computation and Language (cs.CL)

关键词:demonstrated remarkable versatility, Large Language Models, Complex Question Answering, Question Answering, Large Language

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable versatility, due to the lack of factual knowledge, their application to Question Answering (QA) tasks remains hindered by hallucination. While Retrieval-Augmented Generation mitigates these issues by integrating external knowledge, existing approaches rely heavily on in-context learning, whose performance is constrained by the fundamental reasoning capabilities of LLMs. In this paper, we propose Mujica, a Multi-hop Joint Intelligence for Complex Question Answering, comprising a planner that decomposes questions into a directed acyclic graph of subquestions and a worker that resolves questions via retrieval and reasoning. Additionally, we introduce MyGO (Minimalist policy Gradient Optimization), a novel reinforcement learning method that replaces traditional policy gradient updates with Maximum Likelihood Estimation (MLE) by sampling trajectories from an asymptotically optimal policy. MyGO eliminates the need for gradient rescaling and reference models, ensuring stable and efficient training. Empirical results across multiple datasets demonstrate the effectiveness of Mujica-MyGO in enhancing multi-hop QA performance for various LLMs, offering a scalable and resource-efficient solution for complex QA tasks.

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2505.17086 [cs.CL]

(or
arXiv:2505.17086v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2505.17086

Focus to learn more

              arXiv-issued DOI via DataCite

Submission history From: Yihong Wu [view email] [v1]
Tue, 20 May 2025 18:33:03 UTC (429 KB)

204. 【2505.17085】GSDFuse: Capturing Cognitive Inconsistencies from Multi-Dimensional Weak Signals in Social Media Steganalysis

链接https://arxiv.org/abs/2505.17085

作者:Kaibo Huang,Zipei Zhang,Yukun Wei,TianXin Zhang,Zhongliang Yang,Linna Zhou

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:platforms facilitates malicious, facilitates malicious linguistic, posing significant security, significant security risks, malicious linguistic steganography

备注

点击查看摘要

Abstract:The ubiquity of social media platforms facilitates malicious linguistic steganography, posing significant security risks. Steganalysis is profoundly hindered by the challenge of identifying subtle cognitive inconsistencies arising from textual fragmentation and complex dialogue structures, and the difficulty in achieving robust aggregation of multi-dimensional weak signals, especially given extreme steganographic sparsity and sophisticated steganography. These core detection difficulties are compounded by significant data imbalance. This paper introduces GSDFuse, a novel method designed to systematically overcome these obstacles. GSDFuse employs a holistic approach, synergistically integrating hierarchical multi-modal feature engineering to capture diverse signals, strategic data augmentation to address sparsity, adaptive evidence fusion to intelligently aggregate weak signals, and discriminative embedding learning to enhance sensitivity to subtle inconsistencies. Experiments on social media datasets demonstrate GSDFuse's state-of-the-art (SOTA) performance in identifying sophisticated steganography within complex dialogue environments. The source code for GSDFuse is available at this https URL.

205. 【2505.17083】Scale-invariant Attention

链接https://arxiv.org/abs/2505.17083

作者:Ben Anson,Xi Wang,Laurence Aitchison

类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

关键词:challenge in LLM, LLM research, persistent challenge, context attention mechanisms, attention mechanisms

备注: Preprint

点击查看摘要

Abstract:One persistent challenge in LLM research is the development of attention mechanisms that are able to generalise from training on shorter contexts to inference on longer contexts. We propose two conditions that we expect all effective long context attention mechanisms to have: scale-invariant total attention, and scale-invariant attention sparsity. Under a Gaussian assumption, we show that a simple position-dependent transformation of the attention logits is sufficient for these conditions to hold. Experimentally we find that the resulting scale-invariant attention scheme gives considerable benefits in terms of validation loss when zero-shot generalising from training on short contexts to validation on longer contexts, and is effective at long-context retrieval.

206. 【2505.17082】GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data

链接https://arxiv.org/abs/2505.17082

作者:Abderrahman Skiredj,Ferdaous Azhari,Houdaifa Atou,Nouamane Tazi,Ismail Berrada

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:marginalise Moroccan Arabic, heavyweight Arabic adapters, Moroccan Arabic, Open-source large language, Open-source large

备注

点击查看摘要

Abstract:Open-source large language models (LLMs) still marginalise Moroccan Arabic (Darija), forcing practitioners either to bolt on heavyweight Arabic adapters or to sacrifice the very reasoning skills that make LLMs useful. We show that a rigorously quality-over-quantity alignment strategy can surface fluent Darija while safeguarding the backbone s cross-lingual reasoning at a sliver of the usual compute. We translate three compact instruction suites LIMA 1 K, DEITA 6 K and TULU 50 K into Darija, preserve 20 of the English originals, and add mathematics, coding and scientific prompts. A LoRA-tuned Gemma 3-4B trained on 5 K mixed instructions lifts DarijaMMLU from 32.8 to 42.7 ; adding the reasoning-dense TULU portion pushes it to 47.5 with no English regression. Scaling the identical recipe to Gemma 3-27B produces GemMaroc-27B, which matches Atlas-Chat on DarijaMMLU (61.6 ) and leaps ahead on Darija commonsense, scoring 60.5 on HellaSwag versus Atlas-Chat s 48.4 . Crucially, GemMaroc retains Gemma-27B s strong maths and general-reasoning ability, showing only minimal movement on GSM8K and English benchmarks. The entire model is trained in just 48 GPU.h, underscoring a Green AI pathway to inclusive, sustainable language technology. We release code, data and checkpoints to spur Darija-centric applications in education, public services and everyday digital interaction.

207. 【2505.17080】Not Minds, but Signs: Reframing LLMs through Semiotics

链接https://arxiv.org/abs/2505.17080

作者:Davide Picca

类目:Computation and Language (cs.CL)

关键词:Large Language Models, frame Large Language, frame Large, Large Language, Language Models

备注

点击查看摘要

Abstract:This paper challenges the prevailing tendency to frame Large Language Models (LLMs) as cognitive systems, arguing instead for a semiotic perspective that situates these models within the broader dynamics of sign manipulation and meaning-making. Rather than assuming that LLMs understand language or simulate human thought, we propose that their primary function is to recombine, recontextualize, and circulate linguistic forms based on probabilistic associations. By shifting from a cognitivist to a semiotic framework, we avoid anthropomorphism and gain a more precise understanding of how LLMs participate in cultural processes, not by thinking, but by generating texts that invite interpretation. Through theoretical analysis and practical examples, the paper demonstrates how LLMs function as semiotic agents whose outputs can be treated as interpretive acts, open to contextual negotiation and critical reflection. We explore applications in literature, philosophy, education, and cultural production, emphasizing how LLMs can serve as tools for creativity, dialogue, and critical inquiry. The semiotic paradigm foregrounds the situated, contingent, and socially embedded nature of meaning, offering a more rigorous and ethically aware framework for studying and using LLMs. Ultimately, this approach reframes LLMs as technological participants in an ongoing ecology of signs. They do not possess minds, but they alter how we read, write, and make meaning, compelling us to reconsider the foundations of language, interpretation, and the role of artificial systems in the production of knowledge.

208. 【2505.17078】GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace

链接https://arxiv.org/abs/2505.17078

作者:Zenghao Duan,Zhiyi Yin,Zhichao Shi,Liang Pang,Shaoling Jing,Jiayi Wu,Yu Yan,Huawei Shen,Xueqi Cheng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, global toxic subspace, generation in Large, Language Models

备注

点击查看摘要

Abstract:This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs) and proposes an effective detoxification approach. Prior work typically considers the Feed-Forward Network (FFN) as the main source of toxicity, representing toxic regions as a set of toxic vectors or layer-wise subspaces. However, our in-depth analysis reveals that the global toxic subspace offers a more effective and comprehensive representation of toxic region within the model. Building on this insight, we propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the parameters of FFN. Experiments across a range of LLMs show that GloSS achieves state-of-the-art detoxification performance while preserving the models general capabilities, without requiring large-scale data or model retraining.

209. 【2505.17076】Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English

链接https://arxiv.org/abs/2505.17076

作者:Haoyang Zhang,Hexin Liu,Xiangyu Zhang,Qiquan Zhang,Yuchen Hu,Junqi Zhao,Fei Tian,Xuerui Yang,Eng Siong Chng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:speech tokenizer plays, frame rates, speech, generally serving, recent speech tasks

备注: 5 pages, 5 figures

点击查看摘要

Abstract:The speech tokenizer plays a crucial role in recent speech tasks, generally serving as a bridge between speech signals and language models. While low-frame-rate codecs are widely employed as speech tokenizers, the impact of frame rates on speech tokens remains underexplored. In this study, we investigate how varying frame rates affect speech tokenization by examining Mandarin and English, two typologically distinct languages. We encode speech at different frame rates and evaluate the resulting semantic tokens in the speech recognition task. Our findings reveal that frame rate variations influence speech tokenization differently for each language, highlighting the interplay between frame rates, phonetic density, and language-specific acoustic features. The results provide insights into optimizing frame rate selection for speech tokenizers, with implications for automatic speech recognition, text-to-speech, and other speech-related applications.

210. 【2505.17075】Development and Validation of Engagement and Rapport Scales for Evaluating User Experience in Multimodal Dialogue Systems

链接https://arxiv.org/abs/2505.17075

作者:Fuma Kurata,Mao Saeki,Masaki Eguchi,Shungo Suzuki,Hiroaki Takatsu,Yoichi Matsuyama

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:foreign language learning, multimodal dialogue systems, study aimed, aimed to develop, develop and validate

备注

点击查看摘要

Abstract:This study aimed to develop and validate two scales of engagement and rapport to evaluate the user experience quality with multimodal dialogue systems in the context of foreign language learning. The scales were designed based on theories of engagement in educational psychology, social psychology, and second language this http URL-four Japanese learners of English completed roleplay and discussion tasks with trained human tutors and a dialog agent. After each dialogic task was completed, they responded to the scales of engagement and rapport. The validity and reliability of the scales were investigated through two analyses. We first conducted analysis of Cronbach's alpha coefficient and a series of confirmatory factor analyses to test the structural validity of the scales and the reliability of our designed items. We then compared the scores of engagement and rapport between the dialogue with human tutors and the one with a dialogue agent. The results revealed that our scales succeeded in capturing the difference in the dialogue experience quality between the human interlocutors and the dialogue agent from multiple perspectives.

211. 【2505.17074】Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency

链接https://arxiv.org/abs/2505.17074

作者:Ruixiao Li,Fahao Chen,Peng Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Model, accelerates Large Language, Large Language, Language Model, small speculative model

备注

点击查看摘要

Abstract:Speculative decoding accelerates Large Language Model (LLM) inference by employing a small speculative model (SSM) to generate multiple candidate tokens and verify them using the LLM in parallel. This technique has been widely integrated into LLM inference serving systems. However, inference requests typically exhibit uncertain execution time, which poses a significant challenge of efficiently scheduling requests in these systems. Existing work estimates execution time based solely on predicted output length, which could be inaccurate because execution time depends on both output length and token acceptance rate of verification by the LLM. In this paper, we propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). Given a number of inference requests, LAPS-SD can effectively minimize average inference latency by adaptively scheduling requests according to their features during decoding. When the token acceptance rate is dynamic and execution time is difficult to estimate, LAPS-SD maintains multiple priority queues and allows request execution preemption across different queues. Once the token acceptance rate becomes stable, LAPS-SD can accurately estimate the execution time and schedule requests accordingly. Extensive experiments show that LAPS-SD reduces inference latency by approximately 39\% compared to state-of-the-art scheduling methods.

212. 【2505.17073】Mechanistic Interpretability of GPT-like Models on Summarization Tasks

链接https://arxiv.org/abs/2505.17073

作者:Anurag Mishra

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:interpretability research seeks, large language models, Mechanistic interpretability research, research seeks, workings of large

备注: 8 pages (6 content + 2 references/appendix), 6 figures, 2 tables; under review for the ACL 2025 Student Research Workshop

点击查看摘要

Abstract:Mechanistic interpretability research seeks to reveal the inner workings of large language models, yet most work focuses on classification or generative tasks rather than summarization. This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. We conduct differential analysis between pre-trained and fine-tuned models, quantifying changes in attention patterns and internal activations. By identifying specific layers and attention heads that undergo significant transformation, we locate the "summarization circuit" within the model architecture. Our findings reveal that middle layers (particularly 2, 3, and 5) exhibit the most dramatic changes, with 62% of attention heads showing decreased entropy, indicating a shift toward focused information selection. We demonstrate that targeted LoRA adaptation of these identified circuits achieves significant performance improvement over standard LoRA fine-tuning while requiring fewer training epochs. This work bridges the gap between black-box evaluation and mechanistic understanding, providing insights into how neural networks perform information selection and compression during summarization.

213. 【2505.17072】Safety Alignment Can Be Not Superficial With Explicit Safety Signals

链接https://arxiv.org/abs/2505.17072

作者:Jianwei Li,Jung-Eun Kim

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:leaving models vulnerable, Recent studies, large language models, operate superficially, large language

备注: ICML 2025

点击查看摘要

Abstract:Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially, leaving models vulnerable to various adversarial attacks. Despite their significance, these studies generally fail to offer actionable solutions beyond data augmentation for achieving more robust safety mechanisms. This paper identifies a fundamental cause of this superficiality: existing alignment approaches often presume that models can implicitly learn a safety-related reasoning task during the alignment process, enabling them to refuse harmful requests. However, the learned safety signals are often diluted by other competing objectives, leading models to struggle with drawing a firm safety-conscious decision boundary when confronted with adversarial attacks. Based on this observation, by explicitly introducing a safety-related binary classification task and integrating its signals with our attention and decoding strategies, we eliminate this ambiguity and allow models to respond more responsibly to malicious queries. We emphasize that, with less than 0.2x overhead cost, our approach enables LLMs to assess the safety of both the query and the previously generated tokens at each necessary generating step. Extensive experiments demonstrate that our method significantly improves the resilience of LLMs against various adversarial attacks, offering a promising pathway toward more robust generative AI systems.

214. 【2505.17071】What's in a prompt? Language models encode literary style in prompt embeddings

链接https://arxiv.org/abs/2505.17071

作者:Raphaël Sarfati,Haley Moller,Toni J. B. Liu,Nicolas Boullé,Christopher Earls

类目:Computation and Language (cs.CL)

关键词:Large language models, process textual information, Large language, process textual, high-dimensional latent spaces

备注

点击查看摘要

Abstract:Large language models use high-dimensional latent spaces to encode and process textual information. Much work has investigated how the conceptual content of words translates into geometrical relationships between their vector representations. Fewer studies analyze how the cumulative information of an entire prompt becomes condensed into individual embeddings under the action of transformer layers. We use literary pieces to show that information about intangible, rather than factual, aspects of the prompt are contained in deep representations. We observe that short excerpts (10 - 100 tokens) from different novels separate in the latent space independently from what next-token prediction they converge towards. Ensembles from books from the same authors are much more entangled than across authors, suggesting that embeddings encode stylistic features. This geometry of style may have applications for authorship attribution and literary analysis, but most importantly reveals the sophistication of information processing and compression accomplished by language models.

215. 【2505.17070】Improving endpoint detection in end-to-end streaming ASR for conversational speech

链接https://arxiv.org/abs/2505.17070

作者:Anandh C,Karthik Pandia Durai,Jeena Prakash,Manickavela Arumugam,Kadri Hacioglu,S.Pavankumar Dubagunta,Andreas Stolcke,Shankar Venkatesan,Aravind Ganapathiraju

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:products supporting human, machine conversations, agents in human-human, ASR endpointing, good user experience

备注: Submitted to Interspeech 2024

点击查看摘要

Abstract:ASR endpointing (EP) plays a major role in delivering a good user experience in products supporting human or artificial agents in human-human/machine conversations. Transducer-based ASR (T-ASR) is an end-to-end (E2E) ASR modelling technique preferred for streaming. A major limitation of T-ASR is delayed emission of ASR outputs, which could lead to errors or delays in EP. Inaccurate EP will cut the user off while speaking, returning incomplete transcript while delays in EP will increase the perceived latency, degrading the user experience. We propose methods to improve EP by addressing delayed emission along with EP mistakes. To address the delayed emission problem, we introduce an end-of-word token at the end of each word, along with a delay penalty. The EP delay is addressed by obtaining a reliable frame-level speech activity detection using an auxiliary network. We apply the proposed methods on Switchboard conversational speech corpus and evaluate it against a delay penalty method.

216. 【2505.17068】Predictively Combatting Toxicity in Health-related Online Discussions through Machine Learning

链接https://arxiv.org/abs/2505.17068

作者:Jorge Paz-Ruza,Amparo Alonso-Betanzos,Bertha Guijarro-Berdiñas,Carlos Eiras-Franco

类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

关键词:existing toxic comments, online discussions frequently, health-related online discussions, unscientific behaviour, promotion of dangerous

备注: IJCNN 2025

点击查看摘要

Abstract:In health-related topics, user toxicity in online discussions frequently becomes a source of social conflict or promotion of dangerous, unscientific behaviour; common approaches for battling it include different forms of detection, flagging and/or removal of existing toxic comments, which is often counterproductive for platforms and users alike. In this work, we propose the alternative of combatting user toxicity predictively, anticipating where a user could interact toxically in health-related online discussions. Applying a Collaborative Filtering-based Machine Learning methodology, we predict the toxicity in COVID-related conversations between any user and subcommunity of Reddit, surpassing 80% predictive performance in relevant metrics, and allowing us to prevent the pairing of conflicting users and subcommunities.

217. 【2505.17067】Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning

链接https://arxiv.org/abs/2505.17067

作者:Kristin Qi,Jiali Cheng,Youxiang Zhu,Hadi Amiri,Xiaohui Liang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Detecting Mild Cognitive, Mild Cognitive Impairment, Detecting Mild, Mild Cognitive, Cognitive Impairment

备注: Submitted to the IEEE GlobeCom 2025

点击查看摘要

Abstract:Detecting Mild Cognitive Impairment from picture descriptions is critical yet challenging, especially in multilingual and multiple picture settings. Prior work has primarily focused on English speakers describing a single picture (e.g., the 'Cookie Theft'). The TAUKDIAL-2024 challenge expands this scope by introducing multilingual speakers and multiple pictures, which presents new challenges in analyzing picture-dependent content. To address these challenges, we propose a framework with three components: (1) enhancing discriminative representation learning via supervised contrastive learning, (2) involving image modality rather than relying solely on speech and text modalities, and (3) applying a Product of Experts (PoE) strategy to mitigate spurious correlations and overfitting. Our framework improves MCI detection performance, achieving a +7.1% increase in Unweighted Average Recall (UAR) (from 68.1% to 75.2%) and a +2.9% increase in F1 score (from 80.6% to 83.5%) compared to the text unimodal baseline. Notably, the contrastive learning component yields greater gains for the text modality compared to speech. These results highlight our framework's effectiveness in multilingual and multi-picture MCI detection.

218. 【2505.17065】Decoding Rarity: Large Language Models in the Diagnosis of Rare Diseases

链接https://arxiv.org/abs/2505.17065

作者:Valentina Carbonari,Pierangelo Veltri,Pietro Hiram Guzzi

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Recent advances, shown promising capabilities, large language models, transforming rare disease, artificial intelligence

备注

点击查看摘要

Abstract:Recent advances in artificial intelligence, particularly large language models LLMs, have shown promising capabilities in transforming rare disease research. This survey paper explores the integration of LLMs in the analysis of rare diseases, highlighting significant strides and pivotal studies that leverage textual data to uncover insights and patterns critical for diagnosis, treatment, and patient care. While current research predominantly employs textual data, the potential for multimodal data integration combining genetic, imaging, and electronic health records stands as a promising frontier. We review foundational papers that demonstrate the application of LLMs in identifying and extracting relevant medical information, simulating intelligent conversational agents for patient interaction, and enabling the formulation of accurate and timely diagnoses. Furthermore, this paper discusses the challenges and ethical considerations inherent in deploying LLMs, including data privacy, model transparency, and the need for robust, inclusive data sets. As part of this exploration, we present a section on experimentation that utilizes multiple LLMs alongside structured questionnaires, specifically designed for diagnostic purposes in the context of different diseases. We conclude with future perspectives on the evolution of LLMs towards truly multimodal platforms, which would integrate diverse data types to provide a more comprehensive understanding of rare diseases, ultimately fostering better outcomes in clinical settings.

219. 【2505.17063】Synthetic Data RL: Task Definition Is All You Need

链接https://arxiv.org/abs/2505.17063

作者:Yiduo Guo,Zhen Guo,Chuanwei Huang,Zi-Ang Wang,Zekai Zhang,Haofei Yu,Huishuai Zhang,Yikang Shen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:limits broad adoption, large-scale human-labeled data, human-labeled data limits, data limits broad, Reinforcement learning

备注

点击查看摘要

Abstract:Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question and answer pairs from the task definition and retrieved documents, then adapts the difficulty of the question based on model solvability, and selects questions using the average pass rate of the model across samples for RL training. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law) and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves the performance of GSM8K only by 0.4 pp, showing a limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at this https URL.

220. 【2505.17061】Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models

链接https://arxiv.org/abs/2505.17061

作者:Xinlong Chen,Yuanxing Zhang,Qiang Liu,Junfei Wu,Fuzheng Zhang,Tieniu Tan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Vision-Language Models, exhibited impressive capabilities, Large Vision-Language, visual tasks, exhibited impressive

备注: Accepted to Findings of ACL 2025

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have exhibited impressive capabilities across various visual tasks, yet they remain hindered by the persistent challenge of hallucinations. To address this critical issue, we propose Mixture of Decoding (MoD), a novel approach for hallucination mitigation that dynamically adapts decoding strategies by evaluating the correctness of the model's attention on image tokens. Specifically, MoD measures the consistency between outputs generated from the original image tokens and those derived from the model's attended image tokens, to distinguish the correctness aforementioned. If the outputs are consistent, indicating correct attention, MoD employs a complementary strategy to amplify critical information. Conversely, if the outputs are inconsistent, suggesting erroneous attention, MoD utilizes a contrastive strategy to suppress misleading information. Extensive experiments demonstrate that MoD significantly outperforms existing decoding methods across multiple mainstream benchmarks, effectively mitigating hallucinations in LVLMs. The code is available at this https URL.

221. 【2505.17060】SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation

链接https://arxiv.org/abs/2505.17060

作者:Wenyi Yu,Siyin Wang,Xiaoyu Yang,Xianzhao Chen,Xiaohai Tian,Jun Zhang,Guangzhi Sun,Lu Lu,Yuxuan Wang,Chao Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:voice activity detectors, adopt modular architectures, human-machine speech interaction, natural human-machine speech, activity detectors

备注

点击查看摘要

Abstract:In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30\% relative performance improvement over existing open-source full-duplex models and performs highly competitively to half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Some demo conversations between user and SALMONN-omni are provided in the following repository this https URL.

222. 【2505.17059】Medalyze: Lightweight Medical Report Summarization Application Using FLAN-T5-Large

链接https://arxiv.org/abs/2505.17059

作者:Van-Tinh Nguyen,Hoang-Duong Pham,Thanh-Hai To,Cong-Tuan Hung Do,Thi-Thu-Trang Dong,Vu-Trung Duong Le,Van-Phuc Hoang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Understanding medical texts, presents significant challenges, significant challenges due, texts presents significant, medical texts presents

备注: 12 pages, 8 figures. Submitted to IEEE Access for review. Preliminary version posted for early dissemination and feedback

点击查看摘要

Abstract:Understanding medical texts presents significant challenges due to complex terminology and context-specific language. This paper introduces Medalyze, an AI-powered application designed to enhance the comprehension of medical texts using three specialized FLAN-T5-Large models. These models are fine-tuned for (1) summarizing medical reports, (2) extracting health issues from patient-doctor conversations, and (3) identifying the key question in a passage. Medalyze is deployed across a web and mobile platform with real-time inference, leveraging scalable API and YugabyteDB. Experimental evaluations demonstrate the system's superior summarization performance over GPT-4 in domain-specific tasks, based on metrics like BLEU, ROUGE-L, BERTScore, and SpaCy Similarity. Medalyze provides a practical, privacy-preserving, and lightweight solution for improving information accessibility in healthcare.

223. 【2505.17058】DO-RAG: A Domain-Specific QA Framework Using Knowledge Graph-Enhanced Retrieval-Augmented Generation

链接https://arxiv.org/abs/2505.17058

作者:David Osei Opoku,Ming Sheng,Yong Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:high factual accuracy, factual accuracy grounded, generative fluency, fluency but high, high factual

备注: 6 pages, 5 figures;

点击查看摘要

Abstract:Domain-specific QA systems require not just generative fluency but high factual accuracy grounded in structured expert knowledge. While recent Retrieval-Augmented Generation (RAG) frameworks improve context recall, they struggle with integrating heterogeneous data and maintaining reasoning consistency. To address these challenges, we propose DO-RAG, a scalable and customizable hybrid QA framework that integrates multi-level knowledge graph construction with semantic vector retrieval. Our system employs a novel agentic chain-of-thought architecture to extract structured relationships from unstructured, multimodal documents, constructing dynamic knowledge graphs that enhance retrieval precision. At query time, DO-RAG fuses graph and vector retrieval results to generate context-aware responses, followed by hallucination mitigation via grounded refinement. Experimental evaluations in the database and electrical domains show near-perfect recall and over 94% answer relevancy, with DO-RAG outperforming baseline frameworks by up to 33.38%. By combining traceability, adaptability, and performance efficiency, DO-RAG offers a reliable foundation for multi-domain, high-precision QA at scale.

224. 【2505.17056】Are LLMs Ready for English Standardized Tests? A Benchmarking and Elicitation Perspective

链接https://arxiv.org/abs/2505.17056

作者:Luoxi Tang,Tharunya Sundar,Shuai Yang,Ankita Patra,Manohar Chippada,Giqi Zhao,Yi Li,Riteng Zhang,Tunan Zhao,Ting Yang,Yuqiao Meng,Weicheng Ma,Zhaohan Xi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:enhance learning experiences, enabling powerful tools, learning experiences, transforming education, education by enabling

备注

点击查看摘要

Abstract:AI is transforming education by enabling powerful tools that enhance learning experiences. Among recent advancements, large language models (LLMs) hold particular promise for revolutionizing how learners interact with educational content. In this work, we investigate the potential of LLMs to support standardized test preparation by focusing on English Standardized Tests (ESTs). Specifically, we assess their ability to generate accurate and contextually appropriate solutions across a diverse set of EST question types. We introduce ESTBOOK, a comprehensive benchmark designed to evaluate the capabilities of LLMs in solving EST questions. ESTBOOK aggregates five widely recognized tests, encompassing 29 question types and over 10,576 questions across multiple modalities, including text, images, audio, tables, and mathematical symbols. Using ESTBOOK, we systematically evaluate both the accuracy and inference efficiency of LLMs. Additionally, we propose a breakdown analysis framework that decomposes complex EST questions into task-specific solution steps. This framework allows us to isolate and assess LLM performance at each stage of the reasoning process. Evaluation findings offer insights into the capability of LLMs in educational contexts and point toward targeted strategies for improving their reliability as intelligent tutoring systems.

225. 【2505.17055】Enhancing Mathematics Learning for Hard-of-Hearing Students Through Real-Time Palestinian Sign Language Recognition: A New Dataset

链接https://arxiv.org/abs/2505.17055

作者:Fidaa khandaqji,Huthaifa I. Ashqar,Abdelrahem Atawnih

类目:Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

关键词:language PSL recognition, PSL recognition system, accurate Palestinian sign, Palestinian sign language, mathematics education accessibility

备注

点击查看摘要

Abstract:The study aims to enhance mathematics education accessibility for hard-of-hearing students by developing an accurate Palestinian sign language PSL recognition system using advanced artificial intelligence techniques. Due to the scarcity of digital resources for PSL, a custom dataset comprising 41 mathematical gesture classes was created, and recorded by PSL experts to ensure linguistic accuracy and domain specificity. To leverage state-of-the-art-computer vision techniques, a Vision Transformer ViTModel was fine-tuned for gesture classification. The model achieved an accuracy of 97.59%, demonstrating its effectiveness in recognizing mathematical signs with high precision and reliability. This study highlights the role of deep learning in developing intelligent educational tools that bridge the learning gap for hard-of-hearing students by providing AI-driven interactive solutions to enhance mathematical comprehension. This work represents a significant step toward innovative and inclusive frosting digital integration in specialized learning environments. The dataset is hosted on Hugging Face at this https URL.

226. 【2505.17054】METHOD: Modular Efficient Transformer for Health Outcome Discovery

链接https://arxiv.org/abs/2505.17054

作者:Linglong Qian,Zina Ibrahim

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:domains presents unique, Recent advances, revolutionised natural language, METHOD, presents unique challenges

备注: 23 pages

点击查看摘要

Abstract:Recent advances in transformer architectures have revolutionised natural language processing, but their application to healthcare domains presents unique challenges. Patient timelines are characterised by irregular sampling, variable temporal dependencies, and complex contextual relationships that differ substantially from traditional language tasks. This paper introduces \METHOD~(Modular Efficient Transformer for Health Outcome Discovery), a novel transformer architecture specifically designed to address the challenges of clinical sequence modelling in electronic health records. \METHOD~integrates three key innovations: (1) a patient-aware attention mechanism that prevents information leakage whilst enabling efficient batch processing; (2) an adaptive sliding window attention scheme that captures multi-scale temporal dependencies; and (3) a U-Net inspired architecture with dynamic skip connections for effective long sequence processing. Evaluations on the MIMIC-IV database demonstrate that \METHOD~consistently outperforms the state-of-the-art \ETHOS~model, particularly in predicting high-severity cases that require urgent clinical intervention. \METHOD~exhibits stable performance across varying inference lengths, a crucial feature for clinical deployment where patient histories vary significantly in length. Analysis of learned embeddings reveals that \METHOD~better preserves clinical hierarchies and relationships between medical concepts. These results suggest that \METHOD~represents a significant advancement in transformer architectures optimised for healthcare applications, providing more accurate and clinically relevant predictions whilst maintaining computational efficiency.

227. 【2505.17053】Social preferences with unstable interactive reasoning: Large language models in economic trust games

链接https://arxiv.org/abs/2505.17053

作者:Ou Jiamin,Eikmans Emile,Buskens Vincent,Pankowska Paulina,Shan Yuli

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:demonstrated remarkable capabilities, social exchange contexts, understanding human languages, large language models, real world human

备注: 19 pages, 2 figures, 2 tables

点击查看摘要

Abstract:While large language models (LLMs) have demonstrated remarkable capabilities in understanding human languages, this study explores how they translate this understanding into social exchange contexts that capture certain essences of real world human interactions. Three LLMs - ChatGPT-4, Claude, and Bard - were placed in economic trust games where players balance self-interest with trust and reciprocity, making decisions that reveal their social preferences and interactive reasoning abilities. Our study shows that LLMs deviate from pure self-interest and exhibit trust and reciprocity even without being prompted to adopt a specific persona. In the simplest one-shot interaction, LLMs emulated how human players place trust at the beginning of such a game. Larger human-machine divergences emerged in scenarios involving trust repayment or multi-round interactions, where decisions were influenced by both social preferences and interactive reasoning. LLMs responses varied significantly when prompted to adopt personas like selfish or unselfish players, with the impact outweighing differences between models or game types. Response of ChatGPT-4, in an unselfish or neutral persona, resembled the highest trust and reciprocity, surpassing humans, Claude, and Bard. Claude and Bard displayed trust and reciprocity levels that sometimes exceeded and sometimes fell below human choices. When given selfish personas, all LLMs showed lower trust and reciprocity than humans. Interactive reasoning to the actions of counterparts or changing game mechanics appeared to be random rather than stable, reproducible characteristics in the response of LLMs, though some improvements were observed when ChatGPT-4 responded in selfish or unselfish personas.

228. 【2505.17052】SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

链接https://arxiv.org/abs/2505.17052

作者:Jinwoo Park,Seunggeun Cho,Dongsu Han

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, scale remains costly, Large language, language models, power many modern

备注

点击查看摘要

Abstract:Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge enhances overall cost efficiency by 1.91x through achieving 2.22x server throughput, and reduces inter token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving.

229. 【2505.17051】Embedding-to-Prefix: Parameter-Efficient Personalization for Pre-Trained Large Language Models

链接https://arxiv.org/abs/2505.17051

作者:Bernd Huber,Ghazal Fazelnia,Andreas Damianou,Sebastian Peleato,Max Lefarov,Praveen Ravichandran,Marco De Nadai,Mounia Lalmas-Roellke,Paul N. Bennett

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:contextually relevant content, Large language models, generating contextually relevant, Large language, excel at generating

备注

点击查看摘要

Abstract:Large language models (LLMs) excel at generating contextually relevant content. However, tailoring these outputs to individual users for effective personalization is a significant challenge. While rich user-specific information often exists as pre-existing user representations, such as embeddings learned from preferences or behaviors, current methods to leverage these for LLM personalization typically require costly fine-tuning or token-heavy prompting. We propose Embedding-to-Prefix (E2P), a parameter-efficient method that injects pre-computed context embeddings into an LLM's hidden representation space through a learned projection to a single soft token prefix. This enables effective personalization while keeping the backbone model frozen and avoiding expensive adaptation techniques. We evaluate E2P across two public datasets and in a production setting: dialogue personalization on Persona-Chat, contextual headline generation on PENS, and large-scale personalization for music and podcast consumption. Results show that E2P preserves contextual signals and achieves strong performance with minimal computational overhead, offering a scalable, efficient solution for contextualizing generative AI systems.

230. 【2505.17050】owards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning

链接https://arxiv.org/abs/2505.17050

作者:Yanhao Jia,Xinyi Wu,Qinglin Zhang,Yiran Qin,Luwei Xiao,Shuai Zhao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY); Multimedia (cs.MM)

关键词:highly correlated multimodal, Project-Based Learning, STEM disciplines, vital educational approach, approach within STEM

备注

点击查看摘要

Abstract:Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods have developed automated pipelines to assist with the complex responsibilities of teachers leveraging MLLMs, largely due to model hallucination and instability, which lead to unreliable implementation. To address this gap, we introduce PBLBench, a novel benchmark designed to evaluate complex reasoning grounded in domain-specific knowledge and long-context understanding, thereby challenging models with tasks that closely resemble those handled by human experts. To establish reliable ground truth, we adopt the Analytic Hierarchy Process (AHP), utilizing expert-driven pairwise comparisons to derive structured and weighted evaluation criteria. We assess the performance of 15 leading MLLMs/LLMs using PBLBench and demonstrate that even the most advanced models achieve only 59% rank accuracy, underscoring the significant challenges presented by this benchmark. We believe PBLBench will serve as a catalyst for the development of more capable AI agents, ultimately aiming to alleviate teacher workload and enhance educational productivity.

231. 【2505.17049】Gender and Positional Biases in LLM-Based Hiring Decisions: Evidence from Comparative CV/Résumé Evaluations

链接https://arxiv.org/abs/2505.17049

作者:David Rozado

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词:Large Language Models, Large Language, behavior of Large, Language Models, candidate

备注

点击查看摘要

Abstract:This study examines the behavior of Large Language Models (LLMs) when evaluating professional candidates based on their resumes or curricula vitae (CVs). In an experiment involving 22 leading LLMs, each model was systematically given one job description along with a pair of profession-matched CVs, one bearing a male first name, the other a female first name, and asked to select the more suitable candidate for the job. Each CV pair was presented twice, with names swapped to ensure that any observed preferences in candidate selection stemmed from gendered names cues. Despite identical professional qualifications across genders, all LLMs consistently favored female-named candidates across 70 different professions. Adding an explicit gender field (male/female) to the CVs further increased the preference for female applicants. When gendered names were replaced with gender-neutral identifiers "Candidate A" and "Candidate B", several models displayed a preference to select "Candidate A". Counterbalancing gender assignment between these gender-neutral identifiers resulted in gender parity in candidate selection. When asked to rate CVs in isolation rather than compare pairs, LLMs assigned slightly higher average scores to female CVs overall, but the effect size was negligible. Including preferred pronouns (he/him or she/her) next to a candidate's name slightly increased the odds of the candidate being selected regardless of gender. Finally, most models exhibited a substantial positional bias to select the candidate listed first in the prompt. These findings underscore the need for caution when deploying LLMs in high-stakes autonomous decision-making contexts and raise doubts about whether LLMs consistently apply principled reasoning.

232. 【2505.17048】Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally

链接https://arxiv.org/abs/2505.17048

作者:Agam Shah,Siddhant Sukhani,Huzaifa Pardawala,Saketh Budideti,Riya Bhadani,Rudra Gopal,Siddhartha Somani,Michael Galarnyk,Soungmin Lee,Arnav Hiray,Akshar Ravichandran,Eric Kim,Pranav Aluru,Joshua Zhang,Sebastian Jaskowski,Veer Guda,Meghaj Tarte,Liqin Ye,Spencer Gosden,Rutwik Routu,Rachel Yuh,Sloka Chava,Sahasra Chava,Dylan Patrick Kelly,Aiden Chiang,Harsit Mittal,Sudheer Chava

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computational Finance (q-fin.CP); General Finance (q-fin.GN)

关键词:World Central Banks, maintaining economic stability, Central banks, play a crucial, crucial role

备注

点击查看摘要

Abstract:Central banks around the world play a crucial role in maintaining economic stability. Deciphering policy implications in their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions, spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolutions, and secondary expert reviews. We define three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation, with each sentence annotated for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs) (Zero-Shot, Few-Shot, and with annotation guide) on these tasks, running 15,075 benchmarking experiments. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank's data, confirming the principle "the whole is greater than the sum of its parts." Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework's economic utility. Our artifacts are accessible through the HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.

233. 【2505.17047】Assessing the Quality of AI-Generated Clinical Notes: A Validated Evaluation of a Large Language Model Scribe

链接https://arxiv.org/abs/2505.17047

作者:Erin Palm,Astrit Manikantan,Mark E. Pepin,Herprit Mahal,Srikanth Subramanya Belwadi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:United States, generative artificial intelligence, begun implementing generative, implementing generative artificial, documenting clinical encounters

备注: 15 pages, 5 tables, 1 figure. Submitted for peer review 05/15/2025

点击查看摘要

Abstract:In medical practices across the United States, physicians have begun implementing generative artificial intelligence (AI) tools to perform the function of scribes in order to reduce the burden of documenting clinical encounters. Despite their widespread use, no established methods exist to gauge the quality of AI scribes. To address this gap, we developed a blinded study comparing the relative performance of large language model (LLM) generated clinical notes with those from field experts based on audio-recorded clinical encounters. Quantitative metrics from the Physician Documentation Quality Instrument (PDQI9) provided a framework to measure note quality, which we adapted to assess relative performance of AI generated notes. Clinical experts spanning 5 medical specialties used the PDQI9 tool to evaluate specialist-drafted Gold notes and LLM authored Ambient notes. Two evaluators from each specialty scored notes drafted from a total of 97 patient visits. We found uniformly high inter rater agreement (RWG greater than 0.7) between evaluators in general medicine, orthopedics, and obstetrics and gynecology, and moderate (RWG 0.5 to 0.7) to high inter rater agreement in pediatrics and cardiology. We found a modest yet significant difference in the overall note quality, wherein Gold notes achieved a score of 4.25 out of 5 and Ambient notes scored 4.20 out of 5 (p = 0.04). Our findings support the use of the PDQI9 instrument as a practical method to gauge the quality of LLM authored notes, as compared to human-authored notes.

234. 【2505.17045】Assessing GPT's Bias Towards Stigmatized Social Groups: An Intersectional Case Study on Nationality Prejudice and Psychophobia

链接https://arxiv.org/abs/2505.17045

作者:Afifah Kashif,Heer Patel

类目:Computation and Language (cs.CL)

关键词:stigmatized social groups, foundational large language, Recent studies, large language models, separately highlighted significant

备注

点击查看摘要

Abstract:Recent studies have separately highlighted significant biases within foundational large language models (LLMs) against certain nationalities and stigmatized social groups. This research investigates the ethical implications of these biases intersecting with outputs of widely-used GPT-3.5/4/4o LLMS. Through structured prompt series, we evaluate model responses to several scenarios involving American and North Korean nationalities with various mental disabilities. Findings reveal significant discrepancies in empathy levels with North Koreans facing greater negative bias, particularly when mental disability is also a factor. This underscores the need for improvements in LLMs designed with a nuanced understanding of intersectional identity.

235. 【2505.17043】QRA++: Quantified Reproducibility Assessment for Common Types of Results in Natural Language Processing

链接https://arxiv.org/abs/2505.17043

作者:Anya Belz

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:NLP provide individual, provide individual data, individual data points, reported in NLP, NLP provide

备注

点击查看摘要

Abstract:Reproduction studies reported in NLP provide individual data points which in combination indicate worryingly low levels of reproducibility in the field. Because each reproduction study reports quantitative conclusions based on its own, often not explicitly stated, criteria for reproduction success/failure, the conclusions drawn are hard to interpret, compare, and learn from. In this paper, we present QRA++, a quantitative approach to reproducibility assessment that (i) produces continuous-valued degree of reproducibility assessments at three levels of granularity; (ii) utilises reproducibility measures that are directly comparable across different studies; and (iii) grounds expectations about degree of reproducibility in degree of similarity between experiments. QRA++ enables more informative reproducibility assessments to be conducted, and conclusions to be drawn about what causes reproducibility to be better/poorer. We illustrate this by applying QRA++ to three example sets of comparable experiments, revealing clear evidence that degree of reproducibility depends on similarity of experiment properties, but also system type and evaluation method.

236. 【2505.17042】VLM-KG: Multimodal Radiology Knowledge Graph Generation

链接https://arxiv.org/abs/2505.17042

作者:Abdullah Abdullah,Seong Tae Kim

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Vision-Language Models, demonstrated remarkable success, excelling at instruction, structured output generation, demonstrated remarkable

备注: 10 pages, 2 figures

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable success in natural language generation, excelling at instruction following and structured output generation. Knowledge graphs play a crucial role in radiology, serving as valuable sources of factual information and enhancing various downstream tasks. However, generating radiology-specific knowledge graphs presents significant challenges due to the specialized language of radiology reports and the limited availability of domain-specific data. Existing solutions are predominantly unimodal, meaning they generate knowledge graphs only from radiology reports while excluding radiographic images. Additionally, they struggle with long-form radiology data due to limited context length. To address these limitations, we propose a novel multimodal VLM-based framework for knowledge graph generation in radiology. Our approach outperforms previous methods and introduces the first multimodal solution for radiology knowledge graph generation.

237. 【2505.17041】Exploring EFL Secondary Students' AI-generated Text Editing While Composition Writing

链接https://arxiv.org/abs/2505.17041

作者:David James Woo,Yangyang Yu,Kai Guo

类目:Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:Generative Artificial Intelligence, Artificial Intelligence, Intelligence is transforming, transforming how English, foreign language students

备注: 31 pages, 16 figures

点击查看摘要

Abstract:Generative Artificial Intelligence is transforming how English as a foreign language students write. Still, little is known about how students manipulate text generated by generative AI during the writing process. This study investigates how EFL secondary school students integrate and modify AI-generated text when completing an expository writing task. The study employed an exploratory mixed-methods design. Screen recordings were collected from 29 Hong Kong secondary school students who attended an AI-assisted writing workshop and recorded their screens while using generative AI to write an article. Content analysis with hierarchical coding and thematic analysis with a multiple case study approach were adopted to analyze the recordings. 15 types of AI-generated text edits across seven categories were identified from the recordings. Notably, AI-initiated edits from iOS and Google Docs emerged as unanticipated sources of AI-generated text. A thematic analysis revealed four patterns of students' editing behaviors based on planning and drafting direction: planning with top-down drafting and revising; top-down drafting and revising without planning; planning with bottom-up drafting and revising; and bottom-up drafting and revising without planning. Network graphs illustrate cases of each pattern, demonstrating that students' interactions with AI-generated text involve more complex cognitive processes than simple text insertion. The findings challenge assumptions about students' passive, simplistic use of generative AI tools and have implications for developing explicit instructional approaches to teaching AI-generated text editing strategies in the AFL writing pedagogy.

238. 【2505.17040】Generalizing Large Language Model Usability Across Resource-Constrained

链接https://arxiv.org/abs/2505.17040

作者:Yun-Da Tsai

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:achieved remarkable success, natural language tasks, resource-constrained environments, achieved remarkable, remarkable success

备注: Doctoral disstertation

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language tasks, and recent efforts have sought to extend their capabilities to multimodal domains and resource-constrained environments. However, existing approaches often rely on costly supervised fine-tuning or assume fixed training conditions, limiting their generalization when facing unseen modalities, limited data, or restricted compute resources. This dissertation presents a systematic study toward generalizing LLM usability under real-world constraints. First, it introduces a robust text-centric alignment framework that enables LLMs to seamlessly integrate diverse modalities-including text, images, tables, and any modalities - via natural language interfaces. This approach supports in-context adaptation to unseen or dynamically changing modalities without requiring retraining. To enhance robustness against noisy and missing modalities, an adversarial prompting technique is proposed, generating semantically challenging perturbations at the prompt level to stress-test model reliability. Beyond multimodal setting, the dissertation investigates inference-time optimization strategies for LLMs, leveraging prompt search and uncertainty quantification to improve performance without additional model training. This perspective offers an efficient alternative to scaling model parameters or retraining from scratch. Additionally, the work addresses low-resource domains such as Verilog code generation by designing correct-by-construction synthetic data pipelines and logic-enhanced reasoning models, achieving state-of-the-art performance with minimal data. Together, these contributions form a unified effort to enhance the adaptability, scalability, and efficiency of large language models under practical constraints.

239. 【2505.17039】A new classification system of beer categories and styles based on large-scale data mining and self-organizing maps of beer recipes

链接https://arxiv.org/abs/2505.17039

作者:Diego Bonatto

类目:Computation and Language (cs.CL)

关键词:data-driven quantitative approach, data-driven quantitative, quantitative approach, beer categories, classification system

备注: 46 pages, 8 figures, 1 table

点击查看摘要

Abstract:A data-driven quantitative approach was used to develop a novel classification system for beer categories and styles. Sixty-two thousand one hundred twenty-one beer recipes were mined and analyzed, considering ingredient profiles, fermentation parameters, and recipe vital statistics. Statistical analyses combined with self-organizing maps (SOMs) identified four major superclusters that showed distinctive malt and hop usage patterns, style characteristics, and historical brewing traditions. Cold fermented styles showed a conservative grain and hop composition, whereas hot fermented beers exhibited high heterogeneity, reflecting regional preferences and innovation. This new taxonomy offers a reproducible and objective framework beyond traditional sensory-based classifications, providing brewers, researchers, and educators with a scalable tool for recipe analysis and beer development. The findings in this work provide an understanding of beer diversity and open avenues for linking ingredient usage with fermentation profiles and flavor outcomes.

240. 【2505.17038】Signals from the Floods: AI-Driven Disaster Analysis through Multi-Source Data Fusion

链接https://arxiv.org/abs/2505.17038

作者:Xian Gong,Paul X. McCarthy,Lin Tian,Marian-Andrei Rizoiu

类目:Computation and Language (cs.CL); Social and Information Networks (cs.SI)

关键词:South Wales, Massive and diverse, diverse web data, diverse web, increasingly vital

备注

点击查看摘要

Abstract:Massive and diverse web data are increasingly vital for government disaster response, as demonstrated by the 2022 floods in New South Wales (NSW), Australia. This study examines how X (formerly Twitter) and public inquiry submissions provide insights into public behaviour during crises. We analyse more than 55,000 flood-related tweets and 1,450 submissions to identify behavioural patterns during extreme weather events. While social media posts are short and fragmented, inquiry submissions are detailed, multi-page documents offering structured insights. Our methodology integrates Latent Dirichlet Allocation (LDA) for topic modelling with Large Language Models (LLMs) to enhance semantic understanding. LDA reveals distinct opinions and geographic patterns, while LLMs improve filtering by identifying flood-relevant tweets using public submissions as a reference. This Relevance Index method reduces noise and prioritizes actionable content, improving situational awareness for emergency responders. By combining these complementary data streams, our approach introduces a novel AI-driven method to refine crisis-related social media content, improve real-time disaster response, and inform long-term resilience planning.

241. 【2505.17037】Prompt Engineering: How Prompt Vocabulary affects Domain Knowledge

链接https://arxiv.org/abs/2505.17037

作者:Dimitri Schreiter

类目:Computation and Language (cs.CL)

关键词:optimizing large language, large language models, engineering has emerged, critical component, component in optimizing

备注

点击查看摘要

Abstract:Prompt engineering has emerged as a critical component in optimizing large language models (LLMs) for domain-specific tasks. However, the role of prompt specificity, especially in domains like STEM (physics, chemistry, biology, computer science and mathematics), medicine, and law, remains underexplored. This thesis addresses the problem of whether increasing the specificity of vocabulary in prompts improves LLM performance in domain-specific question-answering and reasoning tasks. We developed a synonymization framework to systematically substitute nouns, verbs, and adjectives with varying specificity levels, measuring the impact on four LLMs: Llama-3.1-70B-Instruct, Granite-13B-Instruct-V2, Flan-T5-XL, and Mistral-Large 2, across datasets in STEM, law, and medicine. Our results reveal that while generally increasing the specificity of prompts does not have a significant impact, there appears to be a specificity range, across all considered models, where the LLM performs the best. Identifying this optimal specificity range offers a key insight for prompt design, suggesting that manipulating prompts within this range could maximize LLM performance and lead to more efficient applications in specialized domains.

242. 【2505.17417】Speechless: Speech Instruction Training Without Speech for Low Resource Languages

链接https://arxiv.org/abs/2505.17417

作者:Alan Dao(Gia Tuan Dao),Dinh Bach Vu,Huy Hoang Ha,Tuan Le Duc Anh,Shreyas Gopal,Yue Heng Yeo,Warren Keng Hoong Low,Eng Siong Chng,Jia Qi Yip

类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

关键词:speech instruction data, train these systems, rapid growth, powered by large, instruction data

备注: This paper was accepted by INTERSPEECH 2025

点击查看摘要

Abstract:The rapid growth of voice assistants powered by large language models (LLM) has highlighted a need for speech instruction data to train these systems. Despite the abundance of speech recognition data, there is a notable scarcity of speech instruction data, which is essential for fine-tuning models to understand and execute spoken commands. Generating high-quality synthetic speech requires a good text-to-speech (TTS) model, which may not be available to low resource languages. Our novel approach addresses this challenge by halting synthesis at the semantic representation level, bypassing the need for TTS. We achieve this by aligning synthetic semantic representations with the pre-trained Whisper encoder, enabling an LLM to be fine-tuned on text instructions while maintaining the ability to understand spoken instructions during inference. This simplified training process is a promising approach to building voice assistant for low-resource languages.

243. 【2505.17093】Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech

链接https://arxiv.org/abs/2505.17093

作者:Yejin Lee,Jaehoon Kang,Kyuhong Shim

类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

关键词:leveraging textual personas, framework to control, leveraging textual, control voice style, voice style prompts

备注

点击查看摘要

Abstract:In this paper, we propose a novel framework to control voice style in prompt-based, controllable text-to-speech systems by leveraging textual personas as voice style prompts. We present two persona rewriting strategies to transform generic persona descriptions into speech-oriented prompts, enabling fine-grained manipulation of prosodic attributes such as pitch, emotion, and speaking rate. Experimental results demonstrate that our methods enhance the naturalness, clarity, and consistency of synthesized speech. Finally, we analyze implicit social biases introduced by LLM-based rewriting, with a focus on gender. We underscore voice style as a crucial factor for persona-driven AI dialogue systems.

244. 【2505.17088】From Weak Labels to Strong Results: Utilizing 5,000 Hours of Noisy Classroom Transcripts with Minimal Accurate Data

链接https://arxiv.org/abs/2505.17088

作者:Ahmed Adel Attia,Dorottya Demszky,Jing Liu,Carol Espy-Wilson

类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

关键词:Recent progress, Automatic Speech Recognition, speech recognition, trained on vast, classroom Automatic Speech

备注

点击查看摘要

Abstract:Recent progress in speech recognition has relied on models trained on vast amounts of labeled data. However, classroom Automatic Speech Recognition (ASR) faces the real-world challenge of abundant weak transcripts paired with only a small amount of accurate, gold-standard data. In such low-resource settings, high transcription costs make re-transcription impractical. To address this, we ask: what is the best approach when abundant inexpensive weak transcripts coexist with limited gold-standard data, as is the case for classroom speech data? We propose Weakly Supervised Pretraining (WSP), a two-step process where models are first pretrained on weak transcripts in a supervised manner, and then fine-tuned on accurate data. Our results, based on both synthetic and real weak transcripts, show that WSP outperforms alternative methods, establishing it as an effective training methodology for low-resource ASR in real-world scenarios.

信息检索

1. 【2505.18120】Bidirectional Knowledge Distillation for Enhancing Sequential Recommendation with Large Language Models

链接https://arxiv.org/abs/2505.18120

作者:Jiongran Wu,Jiahao Liu,Dongsheng Li,Guangping Zhang,Mingzhe Han,Hansu Gu,Peng Zhang,Li Shang,Tun Lu,Ning Gu

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Large language models, demonstrated exceptional performance, Large language, generating semantic patterns, making them promising

备注: 11 pages, under review

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated exceptional performance in understanding and generating semantic patterns, making them promising candidates for sequential recommendation tasks. However, when combined with conventional recommendation models (CRMs), LLMs often face challenges related to high inference costs and static knowledge transfer methods. In this paper, we propose a novel mutual distillation framework, LLMD4Rec, that fosters dynamic and bidirectional knowledge exchange between LLM-centric and CRM-based recommendation systems. Unlike traditional unidirectional distillation methods, LLMD4Rec enables iterative optimization by alternately refining both models, enhancing the semantic understanding of CRMs and enriching LLMs with collaborative signals from user-item interactions. By leveraging sample-wise adaptive weighting and aligning output distributions, our approach eliminates the need for additional parameters while ensuring effective knowledge transfer. Extensive experiments on real-world datasets demonstrate that LLMD4Rec significantly improves recommendation accuracy across multiple benchmarks without increasing inference costs. This method provides a scalable and efficient solution for combining the strengths of both LLMs and CRMs in sequential recommendation systems.

2. 【2505.18059】Assessing the performance of 8 AI chatbots in bibliographic reference retrieval: Grok and DeepSeek outperform ChatGPT, but none are fully accurate

链接https://arxiv.org/abs/2505.18059

作者:Álvaro Cabezas-Clavijo,Pavel Sidorenko-Bautista

类目:Information Retrieval (cs.IR)

关键词:generative artificial intelligence, generating academic bibliographic, academic bibliographic references, artificial intelligence chatbots, free versions

备注

点击查看摘要

Abstract:This study analyzes the performance of eight generative artificial intelligence chatbots -- ChatGPT, Claude, Copilot, DeepSeek, Gemini, Grok, Le Chat, and Perplexity -- in their free versions, in the task of generating academic bibliographic references within the university context. A total of 400 references were evaluated across the five major areas of knowledge (Health, Engineering, Experimental Sciences, Social Sciences, and Humanities), based on a standardized prompt. Each reference was assessed according to five key components (authorship, year, title, source, and location), along with document type, publication age, and error count. The results show that only 26.5% of the references were fully correct, 33.8% partially correct, and 39.8% were either erroneous or entirely fabricated. Grok and DeepSeek stood out as the only chatbots that did not generate false references, while Copilot, Perplexity, and Claude exhibited the highest hallucination rates. Furthermore, the chatbots showed a greater tendency to generate book references over journal articles, although the latter had a significantly higher fabrication rate. A high degree of overlap was also detected among the sources provided by several models, particularly between DeepSeek, Grok, Gemini, and ChatGPT. These findings reveal structural limitations in current AI models, highlight the risks of uncritical use by students, and underscore the need to strengthen information and critical literacy regarding the use of AI tools in higher education.

3. 【2505.18006】AI Literacy for Legal AI Systems: A practical approach

链接https://arxiv.org/abs/2505.18006

作者:Gizem Gultekin-Varkonyi

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

关键词:range of applications, increasingly being adopted, adopted by judicial, worldwide to support, support a range

备注: Forthcoming in Iustum Aequum Salutare (2025) vol.21

点击查看摘要

Abstract:Legal AI systems are increasingly being adopted by judicial and legal system deployers and providers worldwide to support a range of applications. While they offer potential benefits such as reducing bias, increasing efficiency, and improving accountability, they also pose significant risks, requiring a careful balance between opportunities, and legal and ethical development and deployment. AI literacy, as a legal requirement under the EU AI Act and a critical enabler of ethical AI for deployers and providers, could be a tool to achieve this. The article introduces the term "legal AI systems" and then analyzes the concept of AI literacy and the benefits and risks associated with these systems. This analysis is linked to a broader AI-L concept for organizations that deal with legal AI systems. The outcome of the article, a roadmap questionnaire as a practical tool for developers and providers to assess risks, benefits, and stakeholder concerns, could be useful in meeting societal and regulatory expectations for legal AI.

4. 【2505.17999】Revisiting Feature Interactions from the Perspective of Quadratic Neural Networks for Click-through Rate Prediction

链接https://arxiv.org/abs/2505.17999

作者:Honghao Li,Yiwen Zhang,Yi Zhang,Lei Sang,Jieming Zhu

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Hadamard Product, click-through rate, additional parameters, cornerstone in click-through, capture feature interactions

备注: KDD'25 accepted

点击查看摘要

Abstract:Hadamard Product (HP) has long been a cornerstone in click-through rate (CTR) prediction tasks due to its simplicity, effectiveness, and ability to capture feature interactions without additional parameters. However, the underlying reasons for its effectiveness remain unclear. In this paper, we revisit HP from the perspective of Quadratic Neural Networks (QNN), which leverage quadratic interaction terms to model complex feature relationships. We further reveal QNN's ability to expand the feature space and provide smooth nonlinear approximations without relying on activation functions. Meanwhile, we find that traditional post-activation does not further improve the performance of the QNN. Instead, mid-activation is a more suitable alternative. Through theoretical analysis and empirical evaluation of 25 QNN neuron formats, we identify a good-performing variant and make further enhancements on it. Specifically, we propose the Multi-Head Khatri-Rao Product as a superior alternative to HP and a Self-Ensemble Loss with dynamic ensemble capability within the same network to enhance computational efficiency and performance. Ultimately, we propose a novel neuron format, QNN-alpha, which is tailored for CTR prediction tasks. Experimental results show that QNN-alpha achieves new state-of-the-art performance on six public datasets while maintaining low inference latency, good scalability, and excellent compatibility. The code, running logs, and detailed hyperparameter configurations are available at: this https URL.

5. 【2505.17925】Enhancing CTR Prediction with De-correlated Expert Networks

链接https://arxiv.org/abs/2505.17925

作者:Jiancheng Wang,Mingjia Yin,Junwei Pan,Ximei Wang,Hao Wang,Enhong Chen

类目:Information Retrieval (cs.IR)

关键词:Modeling feature interactions, accurate click-through rate, Modeling feature, click-through rate, essential for accurate

备注

点击查看摘要

Abstract:Modeling feature interactions is essential for accurate click-through rate (CTR) prediction in advertising systems. Recent studies have adopted the Mixture-of-Experts (MoE) approach to improve performance by ensembling multiple feature interaction experts. These studies employ various strategies, such as learning independent embedding tables for each expert or utilizing heterogeneous expert architectures, to differentiate the experts, which we refer to expert \emph{de-correlation}. However, it remains unclear whether these strategies effectively achieve de-correlated experts. To address this, we propose a De-Correlated MoE (D-MoE) framework, which introduces a Cross-Expert De-Correlation loss to minimize expert this http URL, we propose a novel metric, termed Cross-Expert Correlation, to quantitatively evaluate the expert de-correlation degree. Based on this metric, we identify a key finding for MoE framework design: \emph{different de-correlation strategies are mutually compatible, and progressively employing them leads to reduced correlation and enhanced performance}.Extensive experiments have been conducted to validate the effectiveness of D-MoE and the de-correlation principle. Moreover, online A/B testing on Tencent's advertising platforms demonstrates that D-MoE achieves a significant 1.19\% Gross Merchandise Volume (GMV) lift compared to the Multi-Embedding MoE baseline.

6. 【2505.17861】Superplatforms Have to Attack AI Agents

链接https://arxiv.org/abs/2505.17861

作者:Jianghao Lin,Jiachen Zhu,Zheli Zhou,Yunjia Xi,Weiwen Liu,Yong Yu,Weinan Zhang

类目:Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)

关键词:algorithmic content curation, monopolizing user attention, past decades, content curation, agents

备注: Position paper under review

点击查看摘要

Abstract:Over the past decades, superplatforms, digital companies that integrate a vast range of third-party services and applications into a single, unified ecosystem, have built their fortunes on monopolizing user attention through targeted advertising and algorithmic content curation. Yet the emergence of AI agents driven by large language models (LLMs) threatens to upend this business model. Agents can not only free user attention with autonomy across diverse platforms and therefore bypass the user-attention-based monetization, but might also become the new entrance for digital traffic. Hence, we argue that superplatforms have to attack AI agents to defend their centralized control of digital traffic entrance. Specifically, we analyze the fundamental conflict between user-attention-based monetization and agent-driven autonomy through the lens of our gatekeeping theory. We show how AI agents can disintermediate superplatforms and potentially become the next dominant gatekeepers, thereby forming the urgent necessity for superplatforms to proactively constrain and attack AI agents. Moreover, we go through the potential technologies for superplatform-initiated attacks, covering a brand-new, unexplored technical area with unique challenges. We have to emphasize that, despite our position, this paper does not advocate for adversarial attacks by superplatforms on AI agents, but rather offers an envisioned trend to highlight the emerging tensions between superplatforms and AI agents. Our aim is to raise awareness and encourage critical discussion for collaborative solutions, prioritizing user interests and perserving the openness of digital ecosystems in the age of AI agents.

7. 【2505.17810】VIBE: Vector Index Benchmark for Embeddings

链接https://arxiv.org/abs/2505.17810

作者:Elias Jääsaari,Ville Hyvönen,Matteo Ceccarello,Teemu Roos,Martin Aumüller

类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)

关键词:Approximate nearest neighbor, Approximate nearest, machine learning pipelines, ANN search, nearest neighbor

备注: 25 pages

点击查看摘要

Abstract:Approximate nearest neighbor (ANN) search is a performance-critical component of many machine learning pipelines. Rigorous benchmarking is essential for evaluating the performance of vector indexes for ANN search. However, the datasets of the existing benchmarks are no longer representative of the current applications of ANN search. Hence, there is an urgent need for an up-to-date set of benchmarks. To this end, we introduce Vector Index Benchmark for Embeddings (VIBE), an open source project for benchmarking ANN algorithms. VIBE contains a pipeline for creating benchmark datasets using dense embedding models characteristic of modern applications, such as retrieval-augmented generation (RAG). To replicate real-world workloads, we also include out-of-distribution (OOD) datasets where the queries and the corpus are drawn from different distributions. We use VIBE to conduct a comprehensive evaluation of SOTA vector indexes, benchmarking 21 implementations on 12 in-distribution and 6 out-of-distribution datasets.

8. 【2505.17796】DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval

链接https://arxiv.org/abs/2505.17796

作者:Yuxin Yang,Yinan Zhou,Yuxin Chen,Ziqi Zhang,Zongyang Ma,Chunfeng Yuan,Bing Li,Lin Song,Jun Gao,Peng Li,Weiming Hu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Composed Image Retrieval, retrieve target images, Composed Image, aims to retrieve, retrieve target

备注: 20 pages, 6 figures

点击查看摘要

Abstract:Composed Image Retrieval (CIR) aims to retrieve target images from a gallery based on a reference image and modification text as a combined query. Recent approaches focus on balancing global information from two modalities and encode the query into a unified feature for retrieval. However, due to insufficient attention to fine-grained details, these coarse fusion methods often struggle with handling subtle visual alterations or intricate textual instructions. In this work, we propose DetailFusion, a novel dual-branch framework that effectively coordinates information across global and detailed granularities, thereby enabling detail-enhanced CIR. Our approach leverages atomic detail variation priors derived from an image editing dataset, supplemented by a detail-oriented optimization strategy to develop a Detail-oriented Inference Branch. Furthermore, we design an Adaptive Feature Compositor that dynamically fuses global and detailed features based on fine-grained information of each unique multimodal query. Extensive experiments and ablation analyses not only demonstrate that our method achieves state-of-the-art performance on both CIRR and FashionIQ datasets but also validate the effectiveness and cross-domain adaptability of detail enhancement for CIR.

9. 【2505.17762】Resolving Conflicting Evidence in Automated Fact-Checking: A Study on Retrieval-Augmented LLMs

链接https://arxiv.org/abs/2505.17762

作者:Ziyu Ge,Yuhao Wu,Daniel Wai Kit Chin,Roy Ka-Wei Lee,Rui Cao

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large Language Models, Large Language, integrating external knowledge, demonstrated significant potential, Language Models

备注: Camera-ready for IJCAI 2025, AI and Social Good

点击查看摘要

Abstract:Large Language Models (LLMs) augmented with retrieval mechanisms have demonstrated significant potential in fact-checking tasks by integrating external knowledge. However, their reliability decreases when confronted with conflicting evidence from sources of varying credibility. This paper presents the first systematic evaluation of Retrieval-Augmented Generation (RAG) models for fact-checking in the presence of conflicting evidence. To support this study, we introduce \textbf{CONFACT} (\textbf{Con}flicting Evidence for \textbf{Fact}-Checking) (Dataset available at this https URL), a novel dataset comprising questions paired with conflicting information from various sources. Extensive experiments reveal critical vulnerabilities in state-of-the-art RAG methods, particularly in resolving conflicts stemming from differences in media source credibility. To address these challenges, we investigate strategies to integrate media background information into both the retrieval and generation stages. Our results show that effectively incorporating source credibility significantly enhances the ability of RAG models to resolve conflicting evidence and improve fact-checking performance.

10. 【2505.17736】Modeling Ranking Properties with In-Context Learning

链接https://arxiv.org/abs/2505.17736

作者:Nilanjan Sinhababu,Andrew Parry,Debasis Ganguly,Pabitra Mitra

类目:Information Retrieval (cs.IR)

关键词:optimize relevance, real-world search, balance additional objectives, standard IR models, designed to optimize

备注: 9 pages, 3 tables, 2 figures

点击查看摘要

Abstract:While standard IR models are mainly designed to optimize relevance, real-world search often needs to balance additional objectives such as diversity and fairness. These objectives depend on inter-document interactions and are commonly addressed using post-hoc heuristics or supervised learning methods, which require task-specific training for each ranking scenario and dataset. In this work, we propose an in-context learning (ICL) approach that eliminates the need for such training. Instead, our method relies on a small number of example rankings that demonstrate the desired trade-offs between objectives for past queries similar to the current input. We evaluate our approach on four IR test collections to investigate multiple auxiliary objectives: group fairness (TREC Fairness), polarity diversity (Touché), and topical diversity (TREC Deep Learning 2019/2020). We empirically validate that our method enables control over ranking behavior through demonstration engineering, allowing nuanced behavioral adjustments without explicit optimization.

11. 【2505.17631】BehaveGPT: A Foundation Model for Large-scale User Behavior Modeling

链接https://arxiv.org/abs/2505.17631

作者:Jiahui Gong,Jingtao Ding,Fanjin Meng,Chen Yang,Hong Chen,Zuojian Wang,Haisheng Lu,Yong Li

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:demonstrating remarkable abilities, capturing intricate temporal, user behavior, generating complex data, recent years

备注: 22 pages, 8 figures, 5 tables

点击查看摘要

Abstract:In recent years, foundational models have revolutionized the fields of language and vision, demonstrating remarkable abilities in understanding and generating complex data; however, similar advances in user behavior modeling have been limited, largely due to the complexity of behavioral data and the challenges involved in capturing intricate temporal and contextual relationships in user activities. To address this, we propose BehaveGPT, a foundational model designed specifically for large-scale user behavior prediction. Leveraging transformer-based architecture and a novel pretraining paradigm, BehaveGPT is trained on vast user behavior datasets, allowing it to learn complex behavior patterns and support a range of downstream tasks, including next behavior prediction, long-term generation, and cross-domain adaptation. Our approach introduces the DRO-based pretraining paradigm tailored for user behavior data, which improves model generalization and transferability by equitably modeling both head and tail behaviors. Extensive experiments on real-world datasets demonstrate that BehaveGPT outperforms state-of-the-art baselines, achieving more than a 10% improvement in macro and weighted recall, showcasing its ability to effectively capture and predict user behavior. Furthermore, we measure the scaling law in the user behavior domain for the first time on the Honor dataset, providing insights into how model performance scales with increased data and parameter sizes.

12. 【2505.17615】Large language model as user daily behavior data generator: balancing population diversity and individual personality

链接https://arxiv.org/abs/2505.17615

作者:Haoxin Li,Jingtao Ding,Jiahui Gong,Yong Li

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Predicting human daily, Predicting human, short-term fluctuations, challenging due, complexity of routine

备注: 14 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Predicting human daily behavior is challenging due to the complexity of routine patterns and short-term fluctuations. While data-driven models have improved behavior prediction by leveraging empirical data from various platforms and devices, the reliance on sensitive, large-scale user data raises privacy concerns and limits data availability. Synthetic data generation has emerged as a promising solution, though existing methods are often limited to specific applications. In this work, we introduce BehaviorGen, a framework that uses large language models (LLMs) to generate high-quality synthetic behavior data. By simulating user behavior based on profiles and real events, BehaviorGen supports data augmentation and replacement in behavior prediction models. We evaluate its performance in scenarios such as pertaining augmentation, fine-tuning replacement, and fine-tuning augmentation, achieving significant improvements in human mobility and smartphone usage predictions, with gains of up to 18.9%. Our results demonstrate the potential of BehaviorGen to enhance user behavior modeling through flexible and privacy-preserving synthetic data generation.

13. 【2505.17549】EGA: A Unified End-to-End Generative Framework for Industrial Advertising Systems

链接https://arxiv.org/abs/2505.17549

作者:Zuowu Zheng,Ze Wang,Fan Yang,Jiangke Fan,Teng Zhang,Xingxing Wang

类目:Information Retrieval (cs.IR)

关键词:multi-stage cascaded architectures, high-potential candidates early, fragment business decision, business decision logic, Online industrial advertising

备注

点击查看摘要

Abstract:Online industrial advertising system is fundamentally constrained by the inefficiency of multi-stage cascaded architectures, which filter out high-potential candidates early and fragment business decision logic across independent modules. Although recent advances in generative recommendation offer end-to-end solutions, they fall short of practical advertising requirements, lacking explicit modeling of bidding, creative selection, allocation mechanism, and payment computation that are essential for real-world deployment. To overcome these limitations, we propose End-to-end Generative Advertising (EGA), a first unified generative framework that seamlessly integrates user interests modeling, POI and creative generation, position allocation, and payment optimization within a single model. EGA leverages hierarchical tokenization and multi-token prediction to jointly generate candidate POI and creative contents, while a permutation-aware reward model and token-level bidding strategy ensure alignment with both user experiences and advertiser business objectives. Meanwhile, we decouple allocation from payment via a dedicated POI-level payment network with differentiable ex-post regret minimization, guaranteeing incentive compatibility approximately. Extensive offline and large-scale online experiments on real-world advertising systems demonstrate its effectiveness and practical advantages over traditional cascading architectures, highlighting its potential as one of the industry's pioneering end-to-end generative advertising solutions.

14. 【2505.17507】Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph

链接https://arxiv.org/abs/2505.17507

作者:Qiaosheng Chen,Kaijia Huang,Xiao Zhou,Weiqing Luo,Yuanning Cui,Gong Cheng

类目:Information Retrieval (cs.IR)

关键词:source machine learning, Hugging Face, machine learning, Hugging Face community, rapid growth

备注: 10 pages, 5 figures. Accepted at SIGIR 2025

点击查看摘要

Abstract:The rapid growth of open source machine learning (ML) resources, such as models and datasets, has accelerated IR research. However, existing platforms like Hugging Face do not explicitly utilize structured representations, limiting advanced queries and analyses such as tracing model evolution and recommending relevant datasets. To fill the gap, we construct HuggingKG, the first large-scale knowledge graph built from the Hugging Face community for ML resource management. With 2.6 million nodes and 6.2 million edges, HuggingKG captures domain-specific relations and rich textual attributes. It enables us to further present HuggingBench, a multi-task benchmark with three novel test collections for IR tasks including resource recommendation, classification, and tracing. Our experiments reveal unique characteristics of HuggingKG and the derived tasks. Both resources are publicly available, expected to advance research in open source resource sharing and management.

15. 【2505.17330】FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding

链接https://arxiv.org/abs/2505.17330

作者:Amit Agarwal,Srikant Panda,Kulbhushan Pachauri

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Domain Adapting Graph, Shot Domain Adapting, Adapting Graph, rich document understanding, visually rich document

备注: Published in the Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), Industry Track, pages 100-114

点击查看摘要

Abstract:In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with less than 90M parameters, making it well-suited for complex real-world applications for Information Extraction (IE) tasks where computational resources are limited. We demonstrate FS-DAG's capability through extensive experiments for information extraction task, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code : this https URL

16. 【2505.17326】VoxRAG: A Step Toward Transcription-Free RAG Systems in Spoken Question Answering

链接https://arxiv.org/abs/2505.17326

作者:Zackary Rackauckas,Julia Hirschberg

类目:Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:retrieval-augmented generation system, retrieve semantically relevant, retrieval-augmented generation, CLAP audio embeddings, bypasses transcription

备注: Accepted to ACL 2025 Workshop MAGMaR

点击查看摘要

Abstract:We introduce VoxRAG, a modular speech-to-speech retrieval-augmented generation system that bypasses transcription to retrieve semantically relevant audio segments directly from spoken queries. VoxRAG employs silence-aware segmentation, speaker diarization, CLAP audio embeddings, and FAISS retrieval using L2-normalized cosine similarity. We construct a 50-query test set recorded as spoken input by a native English speaker. Retrieval quality was evaluated using LLM-as-a-judge annotations. For very relevant segments, cosine similarity achieved a Recall@10 of 0.34. For somewhat relevant segments, Recall@10 rose to 0.60 and nDCG@10 to 0.27, highlighting strong topical alignment. Answer quality was judged on a 0--2 scale across relevance, accuracy, completeness, and precision, with mean scores of 0.84, 0.58, 0.56, and 0.46 respectively. While precision and retrieval quality remain key limitations, VoxRAG shows that transcription-free speech-to-speech retrieval is feasible in RAG systems.

17. 【2505.17207】Content Moderation in TV Search: Balancing Policy Compliance, Relevance, and User Experience

链接https://arxiv.org/abs/2505.17207

作者:Adeep Hande,Kishorekumar Sundararajan,Sardar Hamidian,Ferhan Ture

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Millions of people, entertainment platforms, people rely, functionality to find, find and explore

备注: Accepted at SIGIR 2025 Industry Track. 5 pages, 1 figure, 2 tables. DOI: [https://doi.org/10.1145/3726302.3731962](https://doi.org/10.1145/3726302.3731962)

点击查看摘要

Abstract:Millions of people rely on search functionality to find and explore content on entertainment platforms. Modern search systems use a combination of candidate generation and ranking approaches, with advanced methods leveraging deep learning and LLM-based techniques to retrieve, generate, and categorize search results. Despite these advancements, search algorithms can still surface inappropriate or irrelevant content due to factors like model unpredictability, metadata errors, or overlooked design flaws. Such issues can misalign with product goals and user expectations, potentially harming user trust and business outcomes. In this work, we introduce an additional monitoring layer using Large Language Models (LLMs) to enhance content moderation. This additional layer flags content if the user did not intend to search for it. This approach serves as a baseline for product quality assurance, with collected feedback used to refine the initial retrieval mechanisms of the search model, ensuring a safer and more reliable user experience.

18. 【2505.17166】ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval

链接https://arxiv.org/abs/2505.17166

作者:Quentin Macé,António Loison,Manuel Faysse

类目:Information Retrieval (cs.IR)

关键词:top models exceeding, limiting its ability, discern improvements, ViDoRe Benchmark, approaching saturation

备注: Published as a HuggingFace Blog

点击查看摘要

Abstract:The ViDoRe Benchmark V1 was approaching saturation with top models exceeding 90% nDCG@5, limiting its ability to discern improvements. ViDoRe Benchmark V2 introduces realistic, challenging retrieval scenarios via blind contextual querying, long and cross-document queries, and a hybrid synthetic and human-in-the-loop query generation process. It comprises four diverse, multilingual datasets and provides clear evaluation instructions. Initial results demonstrate substantial room for advancement and highlight insights on model generalization and multilingual capability. This benchmark is designed as a living resource, inviting community contributions to maintain relevance through future evaluations.

19. 【2505.17162】DailyQA: A Benchmark to Evaluate Web Retrieval Augmented LLMs Based on Capturing Real-World Changes

链接https://arxiv.org/abs/2505.17162

作者:Jiehan Cheng,Zhicheng Dou

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:automatically updated dynamic, updated dynamic dataset, updates questions weekly, dynamic dataset, Wikipedia revision logs

备注

点击查看摘要

Abstract:We propose DailyQA, an automatically updated dynamic dataset that updates questions weekly and contains answers to questions on any given date. DailyQA utilizes daily updates from Wikipedia revision logs to implement a fully automated pipeline of data filtering, query generation synthesis, quality checking, answer extraction, and query classification. The benchmark requires large language models (LLMs) to process and answer questions involving fast-changing factual data and covering multiple domains. We evaluate several open-source and closed-source LLMs using different RAG pipelines with web search augmentation. We compare the ability of different models to process time-sensitive web information and find that rerank of web retrieval results is critical. Our results indicate that LLMs still face significant challenges in handling frequently updated information, suggesting that DailyQA benchmarking provides valuable insights into the direction of progress for LLMs and RAG systems.

20. 【2505.17125】NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction

链接https://arxiv.org/abs/2505.17125

作者:Soyeon Kim,Namhee Kim,Yeonwoo Jeong

类目:Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large Language Model, opaque scoring practices, Effective evaluation, methods is crucial, hampered by static

备注: Web Data Record Extraction, Zero-Shot Extraction, Large Language Models (LLMs) Evaluation Framework, Comparative Analysis

点击查看摘要

Abstract:Effective evaluation of web data record extraction methods is crucial, yet hampered by static, domain-specific benchmarks and opaque scoring practices. This makes fair comparison between traditional algorithmic techniques, which rely on structural heuristics, and Large Language Model (LLM)-based approaches, offering zero-shot extraction across diverse layouts, particularly challenging. To overcome these limitations, we introduce a concrete evaluation framework. Our framework systematically generates evaluation datasets from arbitrary MHTML snapshots, annotates XPath-based supervision labels, and employs structure-aware metrics for consistent scoring, specifically preventing text hallucination and allowing only for the assessment of positional hallucination. It also incorporates preprocessing strategies to optimize input for LLMs while preserving DOM semantics: HTML slimming, Hierarchical JSON, and Flat JSON. Additionally, we created a publicly available synthetic dataset by transforming DOM structures and modifying content. We benchmark deterministic heuristic algorithms and off-the-shelf LLMs across these multiple input formats. Our benchmarking shows that Flat JSON input enables LLMs to achieve superior extraction accuracy (F1 score of 0.9567) and minimal hallucination compared to other input formats like Slimmed HTML and Hierarchical JSON. We establish a standardized foundation for rigorous benchmarking, paving the way for the next principled advancements in web data record extraction.

21. 【2505.17042】VLM-KG: Multimodal Radiology Knowledge Graph Generation

链接https://arxiv.org/abs/2505.17042

作者:Abdullah Abdullah,Seong Tae Kim

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Vision-Language Models, demonstrated remarkable success, excelling at instruction, structured output generation, demonstrated remarkable

备注: 10 pages, 2 figures

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable success in natural language generation, excelling at instruction following and structured output generation. Knowledge graphs play a crucial role in radiology, serving as valuable sources of factual information and enhancing various downstream tasks. However, generating radiology-specific knowledge graphs presents significant challenges due to the specialized language of radiology reports and the limited availability of domain-specific data. Existing solutions are predominantly unimodal, meaning they generate knowledge graphs only from radiology reports while excluding radiographic images. Additionally, they struggle with long-form radiology data due to limited context length. To address these limitations, we propose a novel multimodal VLM-based framework for knowledge graph generation in radiology. Our approach outperforms previous methods and introduces the first multimodal solution for radiology knowledge graph generation.

计算机视觉

1. 【2505.18153】REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders

链接https://arxiv.org/abs/2505.18153

作者:Savya Khosla,Sethuraman TV,Barnett Lee,Alexander Schwing,Derek Hoiem

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Region Encoder Network, generating region-based image, region-based image representations, Encoder Network, effective region representations

备注

点击查看摘要

Abstract:We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations using point prompts. Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60x faster token generation with 35x less memory, while also improving token quality. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders-DINO, DINOv2, and OpenCLIP-and show that it can be extended to other encoders without dedicated training. We evaluate REN on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAM-based region methods while being significantly faster. Notably, REN achieves state-of-the-art results on the challenging Ego4D VQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks' single-needle challenge. Code and models are available at: this https URL.

2. 【2505.18151】WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions

链接https://arxiv.org/abs/2505.18151

作者:Zizhang Li,Hong-Xing Yu,Wei Liu,Yin Yang,Charles Herrmann,Gordon Wetzstein,Jiajun Wu

类目:Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:framework integrating physics, integrating physics simulation, generating action-conditioned dynamic, framework integrating, generation for generating

备注: The first two authors contributed equally. Project website: [this https URL](https://kyleleey.github.io/WonderPlay/)

点击查看摘要

Abstract:WonderPlay is a novel framework integrating physics simulation with video generation for generating action-conditioned dynamic 3D scenes from a single image. While prior works are restricted to rigid body or simple elastic dynamics, WonderPlay features a hybrid generative simulator to synthesize a wide range of 3D dynamics. The hybrid generative simulator first uses a physics solver to simulate coarse 3D dynamics, which subsequently conditions a video generator to produce a video with finer, more realistic motion. The generated video is then used to update the simulated dynamic 3D scene, closing the loop between the physics solver and the video generator. This approach enables intuitive user control to be combined with the accurate dynamics of physics-based simulators and the expressivity of diffusion-based video generators. Experimental results demonstrate that WonderPlay enables users to interact with various scenes of diverse content, including cloth, sand, snow, liquid, smoke, elastic, and rigid bodies -- all using a single image input. Code will be made public. Project website: this https URL

3. 【2505.18142】okBench: Evaluating Your Visual Tokenizer before Visual Generation

链接https://arxiv.org/abs/2505.18142

作者:Junfeng Wu,Dongliang Luo,Weizhi Zhao,Zhihao Xie,Yuanhao Wang,Junyi Li,Xudong Xie,Yuliang Liu,Xiang Bai

类目:Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)

关键词:challenging visual contents, reveal the limitations, visual, text, faces

备注: Benchmark, homepagee: [this https URL](https://wjf5203.github.io/TokBench)

点击查看摘要

Abstract:In this work, we reveal the limitations of visual tokenizers and VAEs in preserving fine-grained features, and propose a benchmark to evaluate reconstruction performance for two challenging visual contents: text and face. Image tokenization has significantly advanced visual generation and multimodal modeling, particularly with autoregressive models due to the modeling simplicity of discrete tokens. Autoregressive models typically rely on image tokenizers to compress images into discrete tokens for sequential prediction, whereas diffusion models often operate on continuous latent space to reduce computational costs. However, both visual compression approaches inevitably lose visual information, thereby limiting the upper bound of visual generation quality. To evaluate how these compression losses affect text and faces, the most human-sensitive visual elements, we first collect and curate a collection of text and faces images from existing datasets, ensuring clarity and diversity. For text reconstruction, we employ OCR models to assess the recognition accuracy of the reconstructed text, and then we measure feature similarity between original and reconstructed faces thereby quantifying faces reconstruction fidelity. Our method is highly lightweight, requiring just 2GB memory and 4 minutes to complete evaluations. With our benchmark, we analyze the reconstruction quality of text and faces at various scales across different image tokenizers and VAEs. Our results demonstrate that modern visual tokenizers still struggle to preserve fine-grained features, particularly at smaller scales. Furthermore, we extend this evaluation framework to the video, conducting a comprehensive analysis of video tokenizers. Additionally, we find that traditional metrics fail to accurately reflect the reconstruction performance for faces and text, while our proposed metrics serve as an effective complement.

4. 【2505.18137】Boosting Open Set Recognition Performance through Modulated Representation Learning

链接https://arxiv.org/abs/2505.18137

作者:Amit Kumar Kundu,Vaishnavi Patil,Joseph Jaja

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:open set recognition, identify test samples, OSR methods, practical scenarios, OSR

备注

点击查看摘要

Abstract:The open set recognition (OSR) problem aims to identify test samples from novel semantic classes that are not part of the training classes, a task that is crucial in many practical scenarios. However, existing OSR methods use a constant scaling factor (the temperature) to the logits before applying a loss function, which hinders the model from exploring both ends of the spectrum in representation learning -- from instance-level to semantic-level features. In this paper, we address this problem by enabling temperature-modulated representation learning using our novel negative cosine scheduling scheme. Our scheduling lets the model form a coarse decision boundary at the beginning of training by focusing on fewer neighbors, and gradually prioritizes more neighbors to smooth out rough edges. This gradual task switching leads to a richer and more generalizable representation space. While other OSR methods benefit by including regularization or auxiliary negative samples, such as with mix-up, thereby adding a significant computational overhead, our scheme can be folded into any existing OSR method with no overhead. We implement the proposed scheme on top of a number of baselines, using both cross-entropy and contrastive loss functions as well as a few other OSR methods, and find that our scheme boosts both the OSR performance and the closed set performance in most cases, especially on the tougher semantic shift benchmarks.

5. 【2505.18134】VideoGameBench: Can Vision-Language Models complete popular video games?

链接https://arxiv.org/abs/2505.18134

作者:Alex L. Zhang,Thomas L. Griffiths,Karthik R. Narasimhan,Ofir Press

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved strong results, spatial navigation, remains understudied, memory management, achieved strong

备注: 9 pages, 33 pages including supplementary

点击查看摘要

Abstract:Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing model, Gemini 2.5 Pro, completes only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.

6. 【2505.18132】BiggerGait: Unlocking Gait Recognition with Layer-wise Representations from Large Vision Models

链接https://arxiv.org/abs/2505.18132

作者:Dingqing Ye,Chao Fan,Zhanbo Huang,Chengwen Luo,Jianqiang Li,Shiqi Yu,Xiaoming Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large vision models, Large vision, achieved impressive performance, Large, LVM

备注

点击查看摘要

Abstract:Large vision models (LVM) based gait recognition has achieved impressive performance. However, existing LVM-based approaches may overemphasize gait priors while neglecting the intrinsic value of LVM itself, particularly the rich, distinct representations across its multi-layers. To adequately unlock LVM's potential, this work investigates the impact of layer-wise representations on downstream recognition tasks. Our analysis reveals that LVM's intermediate layers offer complementary properties across tasks, integrating them yields an impressive improvement even without rich well-designed gait priors. Building on this insight, we propose a simple and universal baseline for LVM-based gait recognition, termed BiggerGait. Comprehensive evaluations on CCPG, CAISA-B*, SUSTech1K, and CCGR\_MINI validate the superiority of BiggerGait across both within- and cross-domain tasks, establishing it as a simple yet practical baseline for gait representation learning. All the models and code will be publicly available.

7. 【2505.18129】One RL to See Them All: Visual Triple Unified Reinforcement Learning

链接https://arxiv.org/abs/2505.18129

作者:Yan Ma,Linge Du,Xuyang Shen,Shaoxiang Chen,Pengfei Li,Qibing Ren,Lizhuang Ma,Yuchao Dai,Pengfei Liu,Junjie Yan

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Reinforcement learning, Unified Reinforcement Learning, Reinforcement Learning system, Triple Unified Reinforcement, capabilities of vision-language

备注: Technical Report

点击查看摘要

Abstract:Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perceptionintensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises triple complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers) , and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Subsequently, Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at this https URL.

8. 【2505.18115】Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion

链接https://arxiv.org/abs/2505.18115

作者:Jacob Hansen,Wei Lin,Junmo Kang,Muhammad Jehanzeb Mirza,Hongyin Luo,Rogerio Feris,Alan Ritter,James Glass,Leonid Karlinsky

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visual Instruction Tuning, understand visual inputs, aligning strong LLMs, visual inputs, understand visual

备注

点击查看摘要

Abstract:Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them to strong LMMs. While many VisIT datasets are available, most are constructed using ad-hoc techniques developed independently by different groups. They are often poorly documented, lack reproducible code, and rely on paid, closed-source model APIs such as GPT-4, Gemini, or Claude to convert image metadata (labels) into VisIT instructions. This leads to high costs and makes it challenging to scale, enhance quality, or generate VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach,~\textbf{\method}, for converting available metadata to VisIT instructions using open LLMs. Our multi-stage \method features an efficient framework for metadata grouping, quality control, data and prompt organization, and conversation sampling. We show that our approach can reproduce or enhance the data quality of available VisIT datasets when applied to the same image data and metadata sources, improving GPT-4 generated VisIT instructions by ~3\% on average and up to 12\% on individual benchmarks using open models, such as Gemma 2 27B and LLaMa 3.1 70B. Additionally, our approach enables effective performance scaling - both in quantity and quality - by enhancing the resulting LMM performance across a wide range of benchmarks. We also analyze the impact of various factors, including conversation format, base model selection, and resampling strategies. Our code, which supports the reproduction of equal or higher-quality VisIT datasets and facilities future metadata-to-VisIT data conversion for niche domains, is released at this https URL.

9. 【2505.18111】Adapting SAM 2 for Visual Object Tracking: 1st Place Solution for MMVPR Challenge Multi-Modal Tracking

链接https://arxiv.org/abs/2505.18111

作者:Cheng-Yen Yang,Hsiang-Wei Huang,Pyong-Kun Kim,Chien-Kai Kuo,Jui-Wei Chang,Kwang-Ju Kim,Chung-I Huang,Jenq-Neng Hwang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visual Object Tracking, Segment Anything Model, Multi-modal Object Tracking, adapting the Segment, Visual Object

备注: Accepted by ICPR Multi-Modal Visual Pattern Recognition Workshop

点击查看摘要

Abstract:We present an effective approach for adapting the Segment Anything Model 2 (SAM2) to the Visual Object Tracking (VOT) task. Our method leverages the powerful pre-trained capabilities of SAM2 and incorporates several key techniques to enhance its performance in VOT applications. By combining SAM2 with our proposed optimizations, we achieved a first place AUC score of 89.4 on the 2024 ICPR Multi-modal Object Tracking challenge, demonstrating the effectiveness of our approach. This paper details our methodology, the specific enhancements made to SAM2, and a comprehensive analysis of our results in the context of VOT solutions along with the multi-modality aspect of the dataset.

10. 【2505.18106】F-ANcGAN: An Attention-Enhanced Cycle Consistent Generative Adversarial Architecture for Synthetic Image Generation of Nanoparticles

链接https://arxiv.org/abs/2505.18106

作者:Varun Ajith,Anindya Pal,Saumik Bhattacharya,Sayantari Ghosh

类目:Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

关键词:Nanomaterial research, area for energy, materials science, determine their properties, vital area

备注: 11 pages, 9 figures, 2 tables, conference paper

点击查看摘要

Abstract:Nanomaterial research is becoming a vital area for energy, medicine, and materials science, and accurate analysis of the nanoparticle topology is essential to determine their properties. Unfortunately, the lack of high-quality annotated datasets drastically hinders the creation of strong segmentation models for nanoscale imaging. To alleviate this problem, we introduce F-ANcGAN, an attention-enhanced cycle consistent generative adversarial system that can be trained using a limited number of data samples and generates realistic scanning electron microscopy (SEM) images directly from segmentation maps. Our model uses a Style U-Net generator and a U-Net segmentation network equipped with self-attention to capture structural relationships and applies augmentation methods to increase the variety of the dataset. The architecture reached a raw FID score of 17.65 for TiO$_2$ dataset generation, with a further reduction in FID score to nearly 10.39 by using efficient post-processing techniques. By facilitating scalable high-fidelity synthetic dataset generation, our approach can improve the effectiveness of downstream segmentation task training, overcoming severe data shortage issues in nanoparticle analysis, thus extending its applications to resource-limited fields.

11. 【2505.18097】owards more transferable adversarial attack in black-box manner

链接https://arxiv.org/abs/2505.18097

作者:Chun Tong Lei,Zhongliang Guo,Hon Chung Lee,Minh Quoc Duong,Chun Pong Lau

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:diffusion-based adversarial purification, well-explored domain, frequently serving, serving as evaluation, evaluation baselines

备注

点击查看摘要

Abstract:Adversarial attacks have become a well-explored domain, frequently serving as evaluation baselines for model robustness. Among these, black-box attacks based on transferability have received significant attention due to their practical applicability in real-world scenarios. Traditional black-box methods have generally focused on improving the optimization framework (e.g., utilizing momentum in MI-FGSM) to enhance transferability, rather than examining the dependency on surrogate white-box model architectures. Recent state-of-the-art approach DiffPGD has demonstrated enhanced transferability by employing diffusion-based adversarial purification models for adaptive attacks. The inductive bias of diffusion-based adversarial purification aligns naturally with the adversarial attack process, where both involving noise addition, reducing dependency on surrogate white-box model selection. However, the denoising process of diffusion models incurs substantial computational costs through chain rule derivation, manifested in excessive VRAM consumption and extended runtime. This progression prompts us to question whether introducing diffusion models is necessary. We hypothesize that a model sharing similar inductive bias to diffusion-based adversarial purification, combined with an appropriate loss function, could achieve comparable or superior transferability while dramatically reducing computational overhead. In this paper, we propose a novel loss function coupled with a unique surrogate model to validate our hypothesis. Our approach leverages the score of the time-dependent classifier from classifier-guided diffusion models, effectively incorporating natural data distribution knowledge into the adversarial optimization process. Experimental results demonstrate significantly improved transferability across diverse model architectures while maintaining robustness against diffusion-based defenses.

12. 【2505.18096】DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations

链接https://arxiv.org/abs/2505.18096

作者:Ziqiao Peng,Yanbo Fan,Haoyu Wu,Xuan Wang,Hongyan Liu,Jun He,Zhaoxin Fan

类目:Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:listening roles seamlessly, talking head generation, listening, speaking, talking head

备注: Accepted by CVPR 2025

点击查看摘要

Abstract:In face-to-face conversations, individuals need to switch between speaking and listening roles seamlessly. Existing 3D talking head generation models focus solely on speaking or listening, neglecting the natural dynamics of interactive conversation, which leads to unnatural interactions and awkward transitions. To address this issue, we propose a new task -- multi-round dual-speaker interaction for 3D talking head generation -- which requires models to handle and generate both speaking and listening behaviors in continuous conversation. To solve this task, we introduce DualTalk, a novel unified framework that integrates the dynamic behaviors of speakers and listeners to simulate realistic and coherent dialogue interactions. This framework not only synthesizes lifelike talking heads when speaking but also generates continuous and vivid non-verbal feedback when listening, effectively capturing the interplay between the roles. We also create a new dataset featuring 50 hours of multi-round conversations with over 1,000 characters, where participants continuously switch between speaking and listening roles. Extensive experiments demonstrate that our method significantly enhances the naturalness and expressiveness of 3D talking heads in dual-speaker conversations. We recommend watching the supplementary video: this https URL.

13. 【2505.18087】CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

链接https://arxiv.org/abs/2505.18087

作者:Hyungyung Lee,Geon Choi,Jung-Oh Lee,Hangyul Yoon,Hyuk Gi Hong,Edward Choi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Vision-Language Models, enabled promising applications, Recent progress, progress in Large, Large Vision-Language

备注

点击查看摘要

Abstract:Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of 10 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at this https URL

14. 【2505.18079】Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

链接https://arxiv.org/abs/2505.18079

作者:Xiaoyi Zhang,Zhaoyang Jia,Zongyu Guo,Jiahao Li,Bin Li,Houqiang Li,Yan Lu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:presents significant challenges, significant challenges due, extensive temporal-spatial complexity, Large Language Models, understanding presents significant

备注: Under review

点击查看摘要

Abstract:Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips. Different from previous video agents manually designing a rigid workflow, our approach emphasizes the autonomous nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools, formulates appropriate parameters for actions, and iteratively refines its internal reasoning in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, significantly surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code will be released later.

15. 【2505.18078】DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation

链接https://arxiv.org/abs/2505.18078

作者:Junhao Chen,Mingjin Chen,Jianjin Xu,Xiang Li,Junting Dong,Mingze Sun,Puhua Jiang,Hongxiang Li,Yuhang Yang,Hao Zhao,Xiaoxiao Long,Ruqi Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:noisy control signals, current systems falter, advanced rapidly, actor must move, control signals

备注: Our video demos and code are available at [this https URL](https://DanceTog.github.io/)

点击查看摘要

Abstract:Controllable video generation (CVG) has advanced rapidly, yet current systems falter when more than one actor must move, interact, and exchange positions under noisy control signals. We address this gap with DanceTogether, the first end-to-end diffusion framework that turns a single reference image plus independent pose-mask streams into long, photorealistic videos while strictly preserving every identity. A novel MaskPoseAdapter binds "who" and "how" at every denoising step by fusing robust tracking masks with semantically rich-but noisy-pose heat-maps, eliminating the identity drift and appearance bleeding that plague frame-wise pipelines. To train and evaluate at scale, we introduce (i) PairFS-4K, 26 hours of dual-skater footage with 7,000+ distinct IDs, (ii) HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain transfer, and (iii) TogetherVideoBench, a three-track benchmark centered on the DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure skating. On TogetherVideoBench, DanceTogether outperforms the prior arts by a significant margin. Moreover, we show that a one-hour fine-tune yields convincing human-robot videos, underscoring broad generalization to embodied-AI and HRI tasks. Extensive ablations confirm that persistent identity-action binding is critical to these gains. Together, our model, datasets, and benchmark lift CVG from single-subject choreography to compositionally controllable, multi-actor interaction, opening new avenues for digital production, simulation, and embodied intelligence. Our video demos and code are available at this https URL.

16. 【2505.18060】Semantic Correspondence: Unified Benchmarking and a Strong Baseline

链接https://arxiv.org/abs/2505.18060

作者:Kaiyan Zhang,Xinghui Li,Jingyi Lu,Kai Han

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Establishing semantic correspondence, Establishing semantic, computer vision, aiming to match, match keypoints

备注

点击查看摘要

Abstract:Establishing semantic correspondence is a challenging task in computer vision, aiming to match keypoints with the same semantic information across different images. Benefiting from the rapid development of deep learning, remarkable progress has been made over the past decade. However, a comprehensive review and analysis of this task remains absent. In this paper, we present the first extensive survey of semantic correspondence methods. We first propose a taxonomy to classify existing methods based on the type of their method designs. These methods are then categorized accordingly, and we provide a detailed analysis of each approach. Furthermore, we aggregate and summarize the results of methods in literature across various benchmarks into a unified comparative table, with detailed configurations to highlight performance variations. Additionally, to provide a detailed understanding on existing methods for semantic matching, we thoroughly conduct controlled experiments to analyse the effectiveness of the components of different methods. Finally, we propose a simple yet effective baseline that achieves state-of-the-art performance on multiple benchmarks, providing a solid foundation for future research in this field. We hope this survey serves as a comprehensive reference and consolidated baseline for future development. Code is publicly available at: this https URL.

17. 【2505.18053】FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation

链接https://arxiv.org/abs/2505.18053

作者:Zherui Zhang,Jiaxin Wu,Changwei Wang,Rongtao Xu,Longzhao Huang,Wenhao Xu,Wenbo Xu,Li Guo,Shibiao Xu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:adapt Vision-Language Models, textbf, Prompt learning, large, widely adopted

备注

点击查看摘要

Abstract:Prompt learning as a parameter-efficient method that has been widely adopted to adapt Vision-Language Models (VLMs) to downstream tasks. While hard-prompt design requires domain expertise and iterative optimization, soft-prompt methods rely heavily on task-specific hard labels, limiting their generalization to unseen categories. Recent popular distillation-based prompt learning methods improve generalization by exploiting larger teacher VLMs and unsupervised knowledge transfer, yet their repetitive teacher model online inference sacrifices the inherent training efficiency advantage of prompt learning. In this paper, we propose {\large {\textbf{F}}}aster {\large {\textbf{D}}}istillation-{\large {\textbf{B}}}ased {\large {\textbf{P}}}rompt {\large {\textbf{L}}}earning (\textbf{FDBPL}), which addresses these issues by sharing soft supervision contexts across multiple training stages and implementing accelerated I/O. Furthermore, FDBPL introduces a region-aware prompt learning paradigm with dual positive-negative prompt spaces to fully exploit randomly cropped regions that containing multi-level information. We propose a positive-negative space mutual learning mechanism based on similarity-difference learning, enabling student CLIP models to recognize correct semantics while learning to reject weakly related concepts, thereby improving zero-shot performance. Unlike existing distillation-based prompt learning methods that sacrifice parameter efficiency for generalization, FDBPL maintains dual advantages of parameter efficiency and strong downstream generalization. Comprehensive evaluations across 11 datasets demonstrate superior performance in base-to-new generalization, cross-dataset transfer, and robustness tests, achieving $2.2\times$ faster training speed.

18. 【2505.18052】BOTM: Echocardiography Segmentation via Bi-directional Optimal Token Matching

链接https://arxiv.org/abs/2505.18052

作者:Zhihua Liu,Lei Tong,Xilin He,Che Liu,Rossella Arcucci,Chen Jin,Huiyu Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Existed echocardiography segmentation, inconsistency challenge caused, Existed echocardiography, false positive segmentation, anatomical inconsistency challenge

备注

点击查看摘要

Abstract:Existed echocardiography segmentation methods often suffer from anatomical inconsistency challenge caused by shape variation, partial observation and region ambiguity with similar intensity across 2D echocardiographic sequences, resulting in false positive segmentation with anatomical defeated structures in challenging low signal-to-noise ratio conditions. To provide a strong anatomical guarantee across different echocardiographic frames, we propose a novel segmentation framework named BOTM (Bi-directional Optimal Token Matching) that performs echocardiography segmentation and optimal anatomy transportation simultaneously. Given paired echocardiographic images, BOTM learns to match two sets of discrete image tokens by finding optimal correspondences from a novel anatomical transportation perspective. We further extend the token matching into a bi-directional cross-transport attention proxy to regulate the preserved anatomical consistency within the cardiac cyclic deformation in temporal domain. Extensive experimental results show that BOTM can generate stable and accurate segmentation outcomes (e.g. -1.917 HD on CAMUS2H LV, +1.9% Dice on TED), and provide a better matching interpretation with anatomical consistency guarantee.

19. 【2505.18051】LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision

链接https://arxiv.org/abs/2505.18051

作者:Anthony Fuller,Yousef Yassin,Junfeng Wen,Daniel G. Kyrollos,Tarek Ibrahim,James R. Green,Evan Shelhamer

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision transformers, Vision, compute, Abstract, tokens grows quadratically

备注

点击查看摘要

Abstract:Vision transformers are ever larger, more accurate, and more expensive to compute. The expense is even more extreme at high resolution as the number of tokens grows quadratically with the image size. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input. We jointly pretrain the selector and extractor without task supervision by distillation from a self-supervised teacher, in effect, learning where and what to compute simultaneously. Unlike prior token reduction methods, which pay to save by pruning already-computed tokens, and prior token selection methods, which require complex and expensive per-task optimization, LookWhere economically and accurately selects and extracts transferrable representations of images. We show that LookWhere excels at sparse recognition on high-resolution inputs (Traffic Signs), maintaining accuracy while reducing FLOPs by up to 34x and time by 6x. It also excels at standard recognition tasks that are global (ImageNet classification) or local (ADE20K segmentation), improving accuracy while reducing time by 1.36x.

20. 【2505.18049】SpikeGen: Generative Framework for Visual Spike Stream Processing

链接https://arxiv.org/abs/2505.18049

作者:Gaole Dai,Menghang Dong,Rongyu Zhang,Ruichuan An,Shanghang Zhang,Tiejun Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:attracted considerable attention, considerable attention due, capture clear textures, Neuromorphic Visual Systems, Neuromorphic Visual

备注

点击查看摘要

Abstract:Neuromorphic Visual Systems, such as spike cameras, have attracted considerable attention due to their ability to capture clear textures under dynamic conditions. This capability effectively mitigates issues related to motion and aperture blur. However, in contrast to conventional RGB modalities that provide dense spatial information, these systems generate binary, spatially sparse frames as a trade-off for temporally rich visual streams. In this context, generative models emerge as a promising solution to address the inherent limitations of sparse data. These models not only facilitate the conditional fusion of existing information from both spike and RGB modalities but also enable the conditional generation based on latent priors. In this study, we introduce a robust generative processing framework named SpikeGen, designed for visual spike streams captured by spike cameras. We evaluate this framework across multiple tasks involving mixed spike-RGB modalities, including conditional image/video deblurring, dense frame reconstruction from spike streams, and high-speed scene novel-view synthesis. Supported by comprehensive experimental results, we demonstrate that leveraging the latent space operation abilities of generative models allows us to effectively address the sparsity of spatial information while fully exploiting the temporal richness of spike streams, thereby promoting a synergistic enhancement of different visual modalities.

21. 【2505.18048】SHARDeg: A Benchmark for Skeletal Human Action Recognition in Degraded Scenarios

链接https://arxiv.org/abs/2505.18048

作者:Simon Malzard,Nitish Mital,Richard Walters,Victoria Nockles,Raghuveer Rao,Celso M. De Melo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:classification tasks operate, Computer vision, Human Action Recognition, Skeletal Human Action, prediction or classification

备注: 19 pages, 2 images

点击查看摘要

Abstract:Computer vision (CV) models for detection, prediction or classification tasks operate on video data-streams that are often degraded in the real world, due to deployment in real-time or on resource-constrained hardware. It is therefore critical that these models are robust to degraded data, but state of the art (SoTA) models are often insufficiently assessed with these real-world constraints in mind. This is exemplified by Skeletal Human Action Recognition (SHAR), which is critical in many CV pipelines operating in real-time and at the edge, but robustness to degraded data has previously only been shallowly and inconsistently assessed. Here we address this issue for SHAR by providing an important first data degradation benchmark on the most detailed and largest 3D open dataset, NTU-RGB+D-120, and assess the robustness of five leading SHAR models to three forms of degradation that represent real-world issues. We demonstrate the need for this benchmark by showing that the form of degradation, which has not previously been considered, has a large impact on model accuracy; at the same effective frame rate, model accuracy can vary by 40% depending on degradation type. We also identify that temporal regularity of frames in degraded SHAR data is likely a major driver of differences in model performance, and harness this to improve performance of existing models by up to 40%, through employing a simple mitigation approach based on interpolation. Finally, we highlight how our benchmark has helped identify an important degradation-resistant SHAR model based in Rough Path Theory; the LogSigRNN SHAR model outperforms the SoTA DeGCN model in five out of six cases at low frame rates by an average accuracy of 6%, despite trailing the SoTA model by 11-12% on un-degraded data at high frame rates (30 FPS).

22. 【2505.18047】RestoreVAR: Visual Autoregressive Generation for All-in-One Image Restoration

链接https://arxiv.org/abs/2505.18047

作者:Sudarshan Rajagopalan,Kartik Narayan,Vishal M. Patel

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Stable Diffusion, improved the perceptual, perceptual quality, latent diffusion models, Stable

备注: Project page: [this https URL](https://sudraj2002.github.io/restorevarpage/)

点击查看摘要

Abstract:The use of latent diffusion models (LDMs) such as Stable Diffusion has significantly improved the perceptual quality of All-in-One image Restoration (AiOR) methods, while also enhancing their generalization capabilities. However, these LDM-based frameworks suffer from slow inference due to their iterative denoising process, rendering them impractical for time-sensitive applications. To address this, we propose RestoreVAR, a novel generative approach for AiOR that significantly outperforms LDM-based models in restoration performance while achieving over $\mathbf{10\times}$ faster inference. RestoreVAR leverages visual autoregressive modeling (VAR), a recently introduced approach which performs scale-space autoregression for image generation. VAR achieves comparable performance to that of state-of-the-art diffusion transformers with drastically reduced computational costs. To optimally exploit these advantages of VAR for AiOR, we propose architectural modifications and improvements, including intricately designed cross-attention mechanisms and a latent-space refinement module, tailored for the AiOR task. Extensive experiments show that RestoreVAR achieves state-of-the-art performance among generative AiOR methods, while also exhibiting strong generalization capabilities.

23. 【2505.18039】Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation

链接https://arxiv.org/abs/2505.18039

作者:Li Zhong,Ahmed Ghazal,Jun-Jun Wan,Frederik Zilly,Patrick Mackens,Joachim E. Vollrath,Bogdan Sorin Coseriu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Contrastive Language-Image Pretraining, Contrastive Language-Image, Language-Image Pretraining, revolutionized vision-language tasks, tasks by enabling

备注

点击查看摘要

Abstract:Foundation models like CLIP (Contrastive Language-Image Pretraining) have revolutionized vision-language tasks by enabling zero-shot and few-shot learning through cross-modal alignment. However, their computational complexity and large memory footprint make them unsuitable for deployment on resource-constrained edge devices, such as in-car cameras used for image collection and real-time processing. To address this challenge, we propose Clip4Retrofit, an efficient model distillation framework that enables real-time image labeling on edge devices. The framework is deployed on the Retrofit camera, a cost-effective edge device retrofitted into thousands of vehicles, despite strict limitations on compute performance and memory. Our approach distills the knowledge of the CLIP model into a lightweight student model, combining EfficientNet-B3 with multi-layer perceptron (MLP) projection heads to preserve cross-modal alignment while significantly reducing computational requirements. We demonstrate that our distilled model achieves a balance between efficiency and performance, making it ideal for deployment in real-world scenarios. Experimental results show that Clip4Retrofit can perform real-time image labeling and object identification on edge devices with limited resources, offering a practical solution for applications such as autonomous driving and retrofitting existing systems. This work bridges the gap between state-of-the-art vision-language models and their deployment in resource-constrained environments, paving the way for broader adoption of foundation models in edge computing.

24. 【2505.18035】CAMME: Adaptive Deepfake Image Detection with Multi-Modal Cross-Attention

链接https://arxiv.org/abs/2505.18035

作者:Naseem Khan,Tuan Nguyen,Amine Bermak,Issa Khalil

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:poses critical challenges, digital media authentication, sophisticated AI-generated deepfakes, AI-generated deepfakes poses, deepfakes poses critical

备注: 20 pages, 8 figures, 12 Tables

点击查看摘要

Abstract:The proliferation of sophisticated AI-generated deepfakes poses critical challenges for digital media authentication and societal security. While existing detection methods perform well within specific generative domains, they exhibit significant performance degradation when applied to manipulations produced by unseen architectures--a fundamental limitation as generative technologies rapidly evolve. We propose CAMME (Cross-Attention Multi-Modal Embeddings), a framework that dynamically integrates visual, textual, and frequency-domain features through a multi-head cross-attention mechanism to establish robust cross-domain generalization. Extensive experiments demonstrate CAMME's superiority over state-of-the-art methods, yielding improvements of 12.56% on natural scenes and 13.25% on facial deepfakes. The framework demonstrates exceptional resilience, maintaining (over 91%) accuracy under natural image perturbations and achieving 89.01% and 96.14% accuracy against PGD and FGSM adversarial attacks, respectively. Our findings validate that integrating complementary modalities through cross-attention enables more effective decision boundary realignment for reliable deepfake detection across heterogeneous generative architectures.

25. 【2505.18032】Mahalanobis++: Improving OOD Detection via Feature Normalization

链接https://arxiv.org/abs/2505.18032

作者:Maximilian Mueller,Matthias Hein

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:deploying reliable machine, reliable machine learning, machine learning models, safety-critial applications, important task

备注

点击查看摘要

Abstract:Detecting out-of-distribution (OOD) examples is an important task for deploying reliable machine learning models in safety-critial applications. While post-hoc methods based on the Mahalanobis distance applied to pre-logit features are among the most effective for ImageNet-scale OOD detection, their performance varies significantly across models. We connect this inconsistency to strong variations in feature norms, indicating severe violations of the Gaussian assumption underlying the Mahalanobis distance estimation. We show that simple $\ell_2$-normalization of the features mitigates this problem effectively, aligning better with the premise of normally distributed data with shared covariance matrix. Extensive experiments on 44 models across diverse architectures and pretraining schemes show that $\ell_2$-normalization improves the conventional Mahalanobis distance-based approaches significantly and consistently, and outperforms other recently proposed OOD detection methods.

26. 【2505.18028】Knot So Simple: A Minimalistic Environment for Spatial Reasoning

链接https://arxiv.org/abs/2505.18028

作者:Zizhao Chen,Yoav Artzi

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:environment for complex, interactive environment, spatial reasoning, Abstract, KnotGym

备注

点击查看摘要

Abstract:We propose KnotGym, an interactive environment for complex, spatial reasoning and manipulation. KnotGym includes goal-oriented rope manipulation tasks with varying levels of complexity, all requiring acting from pure image observations. Tasks are defined along a clear and quantifiable axis of complexity based on the number of knot crossings, creating a natural generalization test. KnotGym has a simple observation space, allowing for scalable development, yet it highlights core challenges in integrating acute perception, spatial reasoning, and grounded manipulation. We evaluate methods of different classes, including model-based RL, model-predictive control, and chain-of-thought reasoning, and illustrate the challenges KnotGym presents. KnotGym is available at this https URL.

27. 【2505.18025】3D Face Reconstruction Error Decomposed: A Modular Benchmark for Fair and Fast Method Evaluation

链接https://arxiv.org/abs/2505.18025

作者:Evangelos Sariyanidi,Claudio Ferrari,Federico Nocentini,Stefano Berretti,Andrea Cavallaro,Birkan Tunc

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:standard benchmark metric, Computing the standard, Face reconstruction Benchmark, face reconstruction, requires a number

备注: To be published in IEEE International Conference on Automatic Face and Gesture Recognition, 2025

点击查看摘要

Abstract:Computing the standard benchmark metric for 3D face reconstruction, namely geometric error, requires a number of steps, such as mesh cropping, rigid alignment, or point correspondence. Current benchmark tools are monolithic (they implement a specific combination of these steps), even though there is no consensus on the best way to measure error. We present a toolkit for a Modularized 3D Face reconstruction Benchmark (M3DFB), where the fundamental components of error computation are segregated and interchangeable, allowing one to quantify the effect of each. Furthermore, we propose a new component, namely correction, and present a computationally efficient approach that penalizes for mesh topology inconsistency. Using this toolkit, we test 16 error estimators with 10 reconstruction methods on two real and two synthetic datasets. Critically, the widely used ICP-based estimator provides the worst benchmarking performance, as it significantly alters the true ranking of the top-5 reconstruction methods. Notably, the correlation of ICP with the true error can be as low as 0.41. Moreover, non-rigid alignment leads to significant improvement (correlation larger than 0.90), highlighting the importance of annotating 3D landmarks on datasets. Finally, the proposed correction scheme, together with non-rigid warping, leads to an accuracy on a par with the best non-rigid ICP-based estimators, but runs an order of magnitude faster. Our open-source codebase is designed for researchers to easily compare alternatives for each component, thus helping accelerating progress in benchmarking for 3D face reconstruction and, furthermore, supporting the improvement of learned reconstruction methods, which depend on accurate error estimation for effective training.

28. 【2505.18024】A Wavelet-based Stereo Matching Framework for Solving Frequency Convergence Inconsistency

链接https://arxiv.org/abs/2505.18024

作者:Xiaobao Wei,Jiawei Liu,Dongbo Yang,Junda Cheng,Changyong Shu,Wei Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:EPE evaluation metrics, RAFT-stereo converge inconsistently, EPE evaluation, resulting high frequency, high frequency degradation

备注

点击查看摘要

Abstract:We find that the EPE evaluation metrics of RAFT-stereo converge inconsistently in the low and high frequency regions, resulting high frequency degradation (e.g., edges and thin objects) during the iterative process. The underlying reason for the limited performance of current iterative methods is that it optimizes all frequency components together without distinguishing between high and low frequencies. We propose a wavelet-based stereo matching framework (Wavelet-Stereo) for solving frequency convergence inconsistency. Specifically, we first explicitly decompose an image into high and low frequency components using discrete wavelet transform. Then, the high-frequency and low-frequency components are fed into two different multi-scale frequency feature extractors. Finally, we propose a novel LSTM-based high-frequency preservation update operator containing an iterative frequency adapter to provide adaptive refined high-frequency features at different iteration steps by fine-tuning the initial high-frequency features. By processing high and low frequency components separately, our framework can simultaneously refine high-frequency information in edges and low-frequency information in smooth regions, which is especially suitable for challenging scenes with fine details and textures in the distance. Extensive experiments demonstrate that our Wavelet-Stereo outperforms the state-of-the-art methods and ranks 1st on both the KITTI 2015 and KITTI 2012 leaderboards for almost all metrics. We will provide code and pre-trained models to encourage further exploration, application, and development of our innovative framework (this https URL).

29. 【2505.18022】RemoteSAM: Towards Segment Anything for Earth Observation

链接https://arxiv.org/abs/2505.18022

作者:Liang Yao,Fan Liu,Delong Chen,Chuanyi Zhang,Yijun Wang,Ziyun Chen,Wei Xu,Shimin Di,Yuhui Zheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:aim to develop, develop a robust, robust yet flexible, flexible visual foundation, flexible visual

备注

点击查看摘要

Abstract:We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets while providing compatibility with various input-output interfaces required across different task scenarios. Current systems cannot meet these requirements, as they typically utilize task-specific architecture trained on narrow data domains with limited semantic coverage. Our study addresses these limitations from two aspects: data and modeling. We first introduce an automatic data engine that enjoys significantly better scalability compared to previous human annotation or rule-based approaches. It has enabled us to create the largest dataset of its kind to date, comprising 270K image-text-mask triplets covering an unprecedented range of diverse semantic categories and attribute specifications. Based on this data foundation, we further propose a task unification paradigm that centers around referring expression segmentation. It effectively handles a wide range of vision-centric perception tasks, including classification, detection, segmentation, grounding, etc, using a single model without any task-specific heads. Combining these innovations on data and modeling, we present RemoteSAM, a foundation model that establishes new SoTA on several earth observation perception benchmarks, outperforming other foundation models such as Falcon, GeoChat, and LHRS-Bot with significantly higher efficiency. Models and data are publicly available at this https URL.

30. 【2505.18021】Building Floor Number Estimation from Crowdsourced Street-Level Images: Munich Dataset and Baseline Method

链接https://arxiv.org/abs/2505.18021

作者:Yao Sun,Sining Chen,Yifan Tian,Xiao Xiang Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:utility provision, risk assessment, evacuation planning, household estimation, energy modeling

备注: Code and data: [this https URL](https://github.com/ya0-sun/Munich-SVI-Floor-Benchmark)

点击查看摘要

Abstract:Accurate information on the number of building floors, or above-ground storeys, is essential for household estimation, utility provision, risk assessment, evacuation planning, and energy modeling. Yet large-scale floor-count data are rarely available in cadastral and 3D city databases. This study proposes an end-to-end deep learning framework that infers floor numbers directly from unrestricted, crowdsourced street-level imagery, avoiding hand-crafted features and generalizing across diverse facade styles. To enable benchmarking, we release the Munich Building Floor Dataset, a public set of over 6800 geo-tagged images collected from Mapillary and targeted field photography, each paired with a verified storey label. On this dataset, the proposed classification-regression network attains 81.2% exact accuracy and predicts 97.9% of buildings within +/-1 floor. The method and dataset together offer a scalable route to enrich 3D city models with vertical information and lay a foundation for future work in urban informatics, remote sensing, and geographic information science. Source code and data will be released under an open license at this https URL.

31. 【2505.18015】SemSegBench DetecBench: Benchmarking Reliability and Generalization Beyond Classification

链接https://arxiv.org/abs/2505.18015

作者:Shashank Agnihotri,David Schader,Jonas Jakubassa,Nico Sharei,Simon Kral,Mehmet Ege Kaçar,Ruben Weber,Margret Keuper

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:deep learning, learning are predominantly, predominantly studied, context of image, SEMSEGBENCH and DETECBENCH

备注: First seven listed authors have equal contribution. GitHub: [this https URL](https://github.com/shashankskagnihotri/benchmarking_reliability_generalization) . arXiv admin note: text overlap with [arXiv:2505.05091](https://arxiv.org/abs/2505.05091)

点击查看摘要

Abstract:Reliability and generalization in deep learning are predominantly studied in the context of image classification. Yet, real-world applications in safety-critical domains involve a broader set of semantic tasks, such as semantic segmentation and object detection, which come with a diverse set of dedicated model architectures. To facilitate research towards robust model design in segmentation and detection, our primary objective is to provide benchmarking tools regarding robustness to distribution shifts and adversarial manipulations. We propose the benchmarking tools SEMSEGBENCH and DETECBENCH, along with the most extensive evaluation to date on the reliability and generalization of semantic segmentation and object detection models. In particular, we benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets, evaluating their performance under diverse adversarial attacks and common corruptions. Our findings reveal systematic weaknesses in state-of-the-art models and uncover key trends based on architecture, backbone, and model capacity. SEMSEGBENCH and DETECBENCH are open-sourced in our GitHub repository (this https URL) along with our complete set of total 6139 evaluations. We anticipate the collected data to foster and encourage future research towards improved model reliability beyond classification.

32. 【2505.18010】Clinical Validation of Deep Learning for Real-Time Tissue Oxygenation Estimation Using Spectral Imaging

链接https://arxiv.org/abs/2505.18010

作者:Jens De Winne,Siri Willems,Siri Luthman,Danilo Babin,Hiep Luong,Wim Ceelen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:understand tissue health, ischemia is crucial, crucial to understand, health and guide, Accurate

备注: Provisionally accepted to the MICCAI 2025 conference

点击查看摘要

Abstract:Accurate, real-time monitoring of tissue ischemia is crucial to understand tissue health and guide surgery. Spectral imaging shows great potential for contactless and intraoperative monitoring of tissue oxygenation. Due to the difficulty of obtaining direct reference oxygenation values, conventional methods are based on linear unmixing techniques. These are prone to assumptions and these linear relations may not always hold in practice. In this work, we present deep learning approaches for real-time tissue oxygenation estimation using Monte-Carlo simulated spectra. We train a fully connected neural network (FCN) and a convolutional neural network (CNN) for this task and propose a domain-adversarial training approach to bridge the gap between simulated and real clinical spectral data. Results demonstrate that these deep learning models achieve a higher correlation with capillary lactate measurements, a well-known marker of hypoxia, obtained during spectral imaging in surgery, compared to traditional linear unmixing. Notably, domain-adversarial training effectively reduces the domain gap, optimizing performance in real clinical settings.

33. 【2505.17994】Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation

链接https://arxiv.org/abs/2505.17994

作者:Zhihua Liu,Amrutha Saseendran,Lei Tong,Xilin He,Fariba Yousefi,Nikolay Burlutskiy,Dino Oglic,Tom Diethe,Philip Teare,Huiyu Zhou,Chen Jin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:text reference expressions, demand extensive training, diverse text reference, unified objects consistently, segment unified objects

备注

点击查看摘要

Abstract:Open-set image segmentation poses a significant challenge because existing methods often demand extensive training or fine-tuning and generally struggle to segment unified objects consistently across diverse text reference expressions. Motivated by this, we propose Segment Anyword, a novel training-free visual concept prompt learning approach for open-set language grounded segmentation that relies on token-level cross-attention maps from a frozen diffusion model to produce segmentation surrogates or mask prompts, which are then refined into targeted object masks. Initial prompts typically lack coherence and consistency as the complexity of the image-text increases, resulting in suboptimal mask fragments. To tackle this issue, we further introduce a novel linguistic-guided visual prompt regularization that binds and clusters visual prompts based on sentence dependency and syntactic structural information, enabling the extraction of robust, noise-tolerant mask prompts, and significant improvements in segmentation accuracy. The proposed approach is effective, generalizes across different open-set segmentation tasks, and achieves state-of-the-art results of 52.5 (+6.8 relative) mIoU on Pascal Context 59, 67.73 (+25.73 relative) cIoU on gRefCOCO, and 67.4 (+1.1 relative to fine-tuned methods) mIoU on GranDf, which is the most complex open-set grounded segmentation task in the field.

34. 【2505.17992】Canonical Pose Reconstruction from Single Depth Image for 3D Non-rigid Pose Recovery on Limited Datasets

链接https://arxiv.org/abs/2505.17992

作者:Fahd Alhamazani,Yu-Kun Lai,Paul L. Rosin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:presents unique challenges, unique challenges due, presents unique, unique challenges, challenges due

备注

点击查看摘要

Abstract:3D reconstruction from 2D inputs, especially for non-rigid objects like humans, presents unique challenges due to the significant range of possible deformations. Traditional methods often struggle with non-rigid shapes, which require extensive training data to cover the entire deformation space. This study addresses these limitations by proposing a canonical pose reconstruction model that transforms single-view depth images of deformable shapes into a canonical form. This alignment facilitates shape reconstruction by enabling the application of rigid object reconstruction techniques, and supports recovering the input pose in voxel representation as part of the reconstruction task, utilizing both the original and deformed depth images. Notably, our model achieves effective results with only a small dataset of approximately 300 samples. Experimental results on animal and human datasets demonstrate that our model outperforms other state-of-the-art methods.

35. 【2505.17982】Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling

链接https://arxiv.org/abs/2505.17982

作者:Bryan Wong,Jong Woo Kim,Huazhu Fu,Mun Yong Yi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:weakly supervised classification, multiple instance learning, challenge of few-shot, slide images, recently been integrated

备注

点击查看摘要

Abstract:Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures. However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modalities across scales (e.g., 5x and 20x) and (2) inadequate alignment between visual and textual modalities on the same scale. To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent-child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, and (2) heterogeneous intra-scale edges linking visual and textual nodes on the same scale. To further enhance semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch-text pairs, and introduces a hierarchical contrastive loss to align textual semantics across scales. Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings. Our results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data. The code is available at this https URL

36. 【2505.17973】o Glue or Not to Glue? Classical vs Learned Image Matching for Mobile Mapping Cameras to Textured Semantic 3D Building Models

链接https://arxiv.org/abs/2505.17973

作者:Simone Gaisbauer,Prabin Gyawali,Qilin Zhang,Olaf Wysocki,Boris Jutzi

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:feature matching methods, learnable feature matching, Feature matching, SIFT feature detection, computer vision

备注: Accepted to MMT, Xiamen, China; ISPRS Annals

点击查看摘要

Abstract:Feature matching is a necessary step for many computer vision and photogrammetry applications such as image registration, structure-from-motion, and visual localization. Classical handcrafted methods such as SIFT feature detection and description combined with nearest neighbour matching and RANSAC outlier removal have been state-of-the-art for mobile mapping cameras. With recent advances in deep learning, learnable methods have been introduced and proven to have better robustness and performance under complex conditions. Despite their growing adoption, a comprehensive comparison between classical and learnable feature matching methods for the specific task of semantic 3D building camera-to-model matching is still missing. This submission systematically evaluates the effectiveness of different feature-matching techniques in visual localization using textured CityGML LoD2 models. We use standard benchmark datasets (HPatches, MegaDepth-1500) and custom datasets consisting of facade textures and corresponding camera images (terrestrial and drone). For the latter, we evaluate the achievable accuracy of the absolute pose estimated using a Perspective-n-Point (PnP) algorithm, with geometric ground truth derived from geo-referenced trajectory data. The results indicate that the learnable feature matching methods vastly outperform traditional approaches regarding accuracy and robustness on our challenging custom datasets with zero to 12 RANSAC-inliers and zero to 0.16 area under the curve. We believe that this work will foster the development of model-based visual localization methods. Link to the code: this https URL\_Glue\_or\_not\_to\_Glue

37. 【2505.17972】MR-EEGWaveNet: Multiresolutional EEGWaveNet for Seizure Detection from Long EEG Recordings

链接https://arxiv.org/abs/2505.17972

作者:Kazi Mahmudul Hassan,Xuyang Zhao,Hidenori Sugano,Toshihisa Tanaka

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:generalized seizure detection, detection models remains, seizure detection models, significant challenge, engineering for generalized

备注: 26 pages, 6 figures, 12 tables

点击查看摘要

Abstract:Feature engineering for generalized seizure detection models remains a significant challenge. Recently proposed models show variable performance depending on the training data and remain ineffective at accurately distinguishing artifacts from seizure data. In this study, we propose a novel end-to-end model, ''Multiresolutional EEGWaveNet (MR-EEGWaveNet),'' which efficiently distinguishes seizure events from background electroencephalogram (EEG) and artifacts/noise by capturing both temporal dependencies across different time frames and spatial relationships between channels. The model has three modules: convolution, feature extraction, and predictor. The convolution module extracts features through depth-wise and spatio-temporal convolution. The feature extraction module individually reduces the feature dimension extracted from EEG segments and their sub-segments. Subsequently, the extracted features are concatenated into a single vector for classification using a fully connected classifier called the predictor module. In addition, an anomaly score-based post-classification processing technique was introduced to reduce the false-positive rates of the model. Experimental results were reported and analyzed using different parameter settings and datasets (Siena (public) and Juntendo (private)). The proposed MR-EEGWaveNet significantly outperformed the conventional non-multiresolution approach, improving the F1 scores from 0.177 to 0.336 on Siena and 0.327 to 0.488 on Juntendo, with precision gains of 15.9% and 20.62%, respectively.

38. 【2505.17966】Is Single-View Mesh Reconstruction Ready for Robotics?

链接https://arxiv.org/abs/2505.17966

作者:Frederik Nolte,Bernhard Schölkopf,Ingmar Posner

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:digital twin environments, creating digital twin, mesh reconstruction models, paper evaluates single-view, evaluates single-view mesh

备注: 20 pages, 17 figures

点击查看摘要

Abstract:This paper evaluates single-view mesh reconstruction models for creating digital twin environments in robot manipulation. Recent advances in computer vision for 3D reconstruction from single viewpoints present a potential breakthrough for efficiently creating virtual replicas of physical environments for robotics contexts. However, their suitability for physics simulations and robotics applications remains unexplored. We establish benchmarking criteria for 3D reconstruction in robotics contexts, including handling typical inputs, producing collision-free and stable reconstructions, managing occlusions, and meeting computational constraints. Our empirical evaluation using realistic robotics datasets shows that despite success on computer vision benchmarks, existing approaches fail to meet robotics-specific requirements. We quantitively examine limitations of single-view reconstruction for practical robotics implementation, in contrast to prior work that focuses on multi-view approaches. Our findings highlight critical gaps between computer vision advances and robotics needs, guiding future research at this intersection.

39. 【2505.17959】Mind the Domain Gap: Measuring the Domain Gap Between Real-World and Synthetic Point Clouds for Automated Driving Development

链接https://arxiv.org/abs/2505.17959

作者:Nguyen Duc,Yan-Ling Lai,Patrick Madlindl,Xinyuan Zhu,Benedikt Schwab,Olaf Wysocki,Ludwig Hoegner,Thomas H. Kolbe

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:data distribution issues, typical long-tail data, long-tail data distribution, computer vision research, domain gap

备注: Submitted to PFG Journal of Photogrammetry, Remote Sensing and Geoinformation Science

点击查看摘要

Abstract:Owing to the typical long-tail data distribution issues, simulating domain-gap-free synthetic data is crucial in robotics, photogrammetry, and computer vision research. The fundamental challenge pertains to credibly measuring the difference between real and simulated data. Such a measure is vital for safety-critical applications, such as automated driving, where out-of-domain samples may impact a car's perception and cause fatal accidents. Previous work has commonly focused on simulating data on one scene and analyzing performance on a different, real-world scene, hampering the disjoint analysis of domain gap coming from networks' deficiencies, class definitions, and object representation. In this paper, we propose a novel approach to measuring the domain gap between the real world sensor observations and simulated data representing the same location, enabling comprehensive domain gap analysis. To measure such a domain gap, we introduce a novel metric DoGSS-PCL and evaluation assessing the geometric and semantic quality of the simulated point cloud. Our experiments corroborate that the introduced approach can be used to measure the domain gap. The tests also reveal that synthetic semantic point clouds may be used for training deep neural networks, maintaining the performance at the 50/50 real-to-synthetic ratio. We strongly believe that this work will facilitate research on credible data simulation and allow for at-scale deployment in automated driving testing and digital twinning.

40. 【2505.17955】Diffusion Classifiers Understand Compositionality, but Conditions Apply

链接https://arxiv.org/abs/2505.17955

作者:Yujin Jeong,Arnas Uselis,Seong Joon Oh,Anna Rohrbach

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Understanding visual scenes, human intelligence, Understanding visual, fundamental to human, diffusion

备注

点击查看摘要

Abstract:Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark Self-Bench comprised of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at this https URL.

41. 【2505.17951】SplatCo: Structure-View Collaborative Gaussian Splatting for Detail-Preserving Rendering of Large-Scale Unbounded Scenes

链接https://arxiv.org/abs/2505.17951

作者:Haihong Xiao,Jianan Zou,Yuxin Zhou,Ying He,Wenxiong Kang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:structure-view collaborative Gaussian, collaborative Gaussian splatting, Gaussian splatting framework, complex outdoor environments, outdoor environments

备注

点击查看摘要

Abstract:We present SplatCo, a structure-view collaborative Gaussian splatting framework for high-fidelity rendering of complex outdoor environments. SplatCo builds upon two novel components: (1) a cross-structure collaboration module that combines global tri-plane representations, which capture coarse scene layouts, with local context grid features that represent fine surface details. This fusion is achieved through a novel hierarchical compensation strategy, ensuring both global consistency and local detail preservation; and (2) a cross-view assisted training strategy that enhances multi-view consistency by synchronizing gradient updates across viewpoints, applying visibility-aware densification, and pruning overfitted or inaccurate Gaussians based on structural consistency. Through joint optimization of structural representation and multi-view coherence, SplatCo effectively reconstructs fine-grained geometric structures and complex textures in large-scale scenes. Comprehensive evaluations on 13 diverse large-scale scenes, including Mill19, MatrixCity, Tanks Temples, WHU, and custom aerial captures, demonstrate that SplatCo consistently achieves higher reconstruction quality than state-of-the-art methods, with PSNR improvements of 1-2 dB and SSIM gains of 0.1 to 0.2. These results establish a new benchmark for high-fidelity rendering of large-scale unbounded scenes. Code and additional information are available at this https URL.

42. 【2505.17931】AutoMiSeg: Automatic Medical Image Segmentation via Test-Time Adaptation of Foundation Models

链接https://arxiv.org/abs/2505.17931

作者:Xingjian Li,Qifeng Wu,Colleen Que,Yiran Ding,Adithya S. Ubaradka,Jianhua Xing,Tianyang Wang,Min Xu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:extensive expert effort, current deep learning, demand extensive expert, annotating large training, deep learning methods

备注

点击查看摘要

Abstract:Medical image segmentation is vital for clinical diagnosis, yet current deep learning methods often demand extensive expert effort, i.e., either through annotating large training datasets or providing prompts at inference time for each new case. This paper introduces a zero-shot and automatic segmentation pipeline that combines off-the-shelf vision-language and segmentation foundation models. Given a medical image and a task definition (e.g., "segment the optic disc in an eye fundus image"), our method uses a grounding model to generate an initial bounding box, followed by a visual prompt boosting module that enhance the prompts, which are then processed by a promptable segmentation model to produce the final mask. To address the challenges of domain gap and result verification, we introduce a test-time adaptation framework featuring a set of learnable adaptors that align the medical inputs with foundation model representations. Its hyperparameters are optimized via Bayesian Optimization, guided by a proxy validation model without requiring ground-truth labels. Our pipeline offers an annotation-efficient and scalable solution for zero-shot medical image segmentation across diverse tasks. Our pipeline is evaluated on seven diverse medical imaging datasets and shows promising results. By proper decomposition and test-time adaptation, our fully automatic pipeline performs competitively with weakly-prompted interactive foundation models.

43. 【2505.17921】Evaluation of Few-Shot Learning Methods for Kidney Stone Type Recognition in Ureteroscopy

链接https://arxiv.org/abs/2505.17921

作者:Carlos Salazar-Ruiz,Francisco Lopez-Tiro,Ivan Reyes-Amezcua,Clement Larose,Gilberto Ochoa-Ruiz,Christian Daul

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Determining the type, prevent recurrence, kidney stones, kidney stone types, crucial for prescribing

备注: 6 pages, 3 figures, 3 tables, conference, cbms25

点击查看摘要

Abstract:Determining the type of kidney stones is crucial for prescribing appropriate treatments to prevent recurrence. Currently, various approaches exist to identify the type of kidney stones. However, obtaining results through the reference ex vivo identification procedure can take several weeks, while in vivo visual recognition requires highly trained specialists. For this reason, deep learning models have been developed to provide urologists with an automated classification of kidney stones during ureteroscopies. Nevertheless, a common issue with these models is the lack of training data. This contribution presents a deep learning method based on few-shot learning, aimed at producing sufficiently discriminative features for identifying kidney stone types in endoscopic images, even with a very limited number of samples. This approach was specifically designed for scenarios where endoscopic images are scarce or where uncommon classes are present, enabling classification even with a limited training dataset. The results demonstrate that Prototypical Networks, using up to 25% of the training data, can achieve performance equal to or better than traditional deep learning models trained with the complete dataset.

44. 【2505.17911】Object-level Cross-view Geo-localization with Location Enhancement and Multi-Head Cross Attention

链接https://arxiv.org/abs/2505.17911

作者:Zheyang Huang,Jagannath Aryal,Saeid Nahavandi,Xuequan Lu,Chee Peng Lim,Lei Wei,Hailing Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Cross-view geo-localization determines, ground-based camera, Object-level Cross-view Geo-localization, Cross-view Geo-localization Network, geo-referenced satellite image

备注

点击查看摘要

Abstract:Cross-view geo-localization determines the location of a query image, captured by a drone or ground-based camera, by matching it to a geo-referenced satellite image. While traditional approaches focus on image-level localization, many applications, such as search-and-rescue, infrastructure inspection, and precision delivery, demand object-level accuracy. This enables users to prompt a specific object with a single click on a drone image to retrieve precise geo-tagged information of the object. However, variations in viewpoints, timing, and imaging conditions pose significant challenges, especially when identifying visually similar objects in extensive satellite imagery. To address these challenges, we propose an Object-level Cross-view Geo-localization Network (OCGNet). It integrates user-specified click locations using Gaussian Kernel Transfer (GKT) to preserve location information throughout the network. This cue is dually embedded into the feature encoder and feature matching blocks, ensuring robust object-specific localization. Additionally, OCGNet incorporates a Location Enhancement (LE) module and a Multi-Head Cross Attention (MHCA) module to adaptively emphasize object-specific features or expand focus to relevant contextual regions when necessary. OCGNet achieves state-of-the-art performance on a public dataset, CVOGL. It also demonstrates few-shot learning capabilities, effectively generalizing from limited examples, making it suitable for diverse applications (this https URL).

45. 【2505.17910】DiffusionReward: Enhancing Blind Face Restoration through Reward Feedback Learning

链接https://arxiv.org/abs/2505.17910

作者:Bin Wu,Wei Wang,Yahui Liu,Zixiang Li,Yao Zhao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Reward Feedback Learning, recently shown great, shown great potential, aligning model outputs, Feedback Learning

备注: 22 pages, 13 figures, 5 tables

点击查看摘要

Abstract:Reward Feedback Learning (ReFL) has recently shown great potential in aligning model outputs with human preferences across various generative tasks. In this work, we introduce a ReFL framework, named DiffusionReward, to the Blind Face Restoration task for the first time. DiffusionReward effectively overcomes the limitations of diffusion-based methods, which often fail to generate realistic facial details and exhibit poor identity consistency. The core of our framework is the Face Reward Model (FRM), which is trained using carefully annotated data. It provides feedback signals that play a pivotal role in steering the optimization process of the restoration network. In particular, our ReFL framework incorporates a gradient flow into the denoising process of off-the-shelf face restoration methods to guide the update of model parameters. The guiding gradient is collaboratively determined by three aspects: (i) the FRM to ensure the perceptual quality of the restored faces; (ii) a regularization term that functions as a safeguard to preserve generative diversity; and (iii) a structural consistency constraint to maintain facial fidelity. Furthermore, the FRM undergoes dynamic optimization throughout the process. It not only ensures that the restoration network stays precisely aligned with the real face manifold, but also effectively prevents reward hacking. Experiments on synthetic and wild datasets demonstrate that our method outperforms state-of-the-art methods, significantly improving identity consistency and facial details. The source codes, data, and models are available at: this https URL.

46. 【2505.17908】ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback

链接https://arxiv.org/abs/2505.17908

作者:Litao Guo(1),Xinli Xu(1),Luozhou Wang(1),Jiantao Lin(1),Jinsong Zhou(1),Zixin Zhang(1),Bolan Su(3),Ying-Cong Chen(1 and 2) ((1) HKUST (GZ), (2) HKUST, (3) Bytedance)

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:gained increasing attention, unify diverse tasks, rapid advancement, gained increasing, increasing attention

备注: Project page: [this https URL](https://github.com/LitaoGuo/ComfyMind)

点击查看摘要

Abstract:With the rapid advancement of generative models, general-purpose generation has gained increasing attention as a promising approach to unify diverse tasks across modalities within a single system. Despite this progress, existing open-source frameworks often remain fragile and struggle to support complex real-world applications due to the lack of structured workflow planning and execution-level feedback. To address these limitations, we present ComfyMind, a collaborative AI system designed to enable robust and scalable general-purpose generation, built on the ComfyUI platform. ComfyMind introduces two core innovations: Semantic Workflow Interface (SWI) that abstracts low-level node graphs into callable functional modules described in natural language, enabling high-level composition and reducing structural errors; Search Tree Planning mechanism with localized feedback execution, which models generation as a hierarchical decision process and allows adaptive correction at each stage. Together, these components improve the stability and flexibility of complex generative workflows. We evaluate ComfyMind on three public benchmarks: ComfyBench, GenEval, and Reason-Edit, which span generation, editing, and reasoning tasks. Results show that ComfyMind consistently outperforms existing open-source baselines and achieves performance comparable to GPT-Image-1. ComfyMind paves a promising path for the development of open-source general-purpose generative AI systems. Project page: this https URL

47. 【2505.17905】Semantic segmentation with reward

链接https://arxiv.org/abs/2505.17905

作者:Xie Ting,Ye Huang,Zhilin Liu,Lixin Duan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:semantic segmentation network, semantic segmentation, real-world scenarios, segmentation network, segmentation

备注: Tech report

点击查看摘要

Abstract:In real-world scenarios, pixel-level labeling is not always available. Sometimes, we need a semantic segmentation network, and even a visual encoder can have a high compatibility, and can be trained using various types of feedback beyond traditional labels, such as feedback that indicates the quality of the parsing results. To tackle this issue, we proposed RSS (Reward in Semantic Segmentation), the first practical application of reward-based reinforcement learning on pure semantic segmentation offered in two granular levels (pixel-level and image-level). RSS incorporates various novel technologies, such as progressive scale rewards (PSR) and pair-wise spatial difference (PSD), to ensure that the reward facilitates the convergence of the semantic segmentation network, especially under image-level rewards. Experiments and visualizations on benchmark datasets demonstrate that the proposed RSS can successfully ensure the convergence of the semantic segmentation network on two levels of rewards. Additionally, the RSS, which utilizes an image-level reward, outperforms existing weakly supervised methods that also rely solely on image-level signals during training.

48. 【2505.17893】Pixels to Prognosis: Harmonized Multi-Region CT-Radiomics and Foundation-Model Signatures Across Multicentre NSCLC Data

链接https://arxiv.org/abs/2505.17893

作者:Shruti Atul Mali,Zohaib Salahuddin,Danial Khan,Yumeng Zhang,Henry C. Woodruff,Eduardo Ibor-Crespo,Ana Jimenez-Pastor,Luis Marti-Bonmati,Philippe Lambin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:cell lung cancer, non-small cell lung, pretrained foundation model, pretrained foundation, evaluate the impact

备注

点击查看摘要

Abstract:Purpose: To evaluate the impact of harmonization and multi-region CT image feature integration on survival prediction in non-small cell lung cancer (NSCLC) patients, using handcrafted radiomics, pretrained foundation model (FM) features, and clinical data from a multicenter dataset. Methods: We analyzed CT scans and clinical data from 876 NSCLC patients (604 training, 272 test) across five centers. Features were extracted from the whole lung, tumor, mediastinal nodes, coronary arteries, and coronary artery calcium (CAC). Handcrafted radiomics and FM deep features were harmonized using ComBat, reconstruction kernel normalization (RKN), and RKN+ComBat. Regularized Cox models predicted overall survival; performance was assessed using the concordance index (C-index), 5-year time-dependent area under the curve (t-AUC), and hazard ratio (HR). SHapley Additive exPlanations (SHAP) values explained feature contributions. A consensus model used agreement across top region of interest (ROI) models to stratify patient risk. Results: TNM staging showed prognostic utility (C-index = 0.67; HR = 2.70; t-AUC = 0.85). The clinical + tumor radiomics model with ComBat achieved a C-index of 0.7552 and t-AUC of 0.8820. FM features (50-voxel cubes) combined with clinical data yielded the highest performance (C-index = 0.7616; t-AUC = 0.8866). An ensemble of all ROIs and FM features reached a C-index of 0.7142 and t-AUC of 0.7885. The consensus model, covering 78% of valid test cases, achieved a t-AUC of 0.92, sensitivity of 97.6%, and specificity of 66.7%. Conclusion: Harmonization and multi-region feature integration improve survival prediction in multicenter NSCLC data. Combining interpretable radiomics, FM features, and consensus modeling enables robust risk stratification across imaging centers.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2505.17893 [cs.CV]

(or
arXiv:2505.17893v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2505.17893

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Shruti Atul Mali [view email] [v1]
Fri, 23 May 2025 13:41:52 UTC (1,829 KB)

49. 【2505.17884】rack Anything Annotate: Video annotation and dataset generation of computer vision models

链接https://arxiv.org/abs/2505.17884

作者:Nikita Ivanov,Mark Klimov,Dmitry Glukhikh,Tatiana Chernysheva,Igor Glukhikh

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Modern machine learning, machine learning methods, learning methods require, methods require significant, require significant amounts

备注: 9 pages, 11 figures

点击查看摘要

Abstract:Modern machine learning methods require significant amounts of labelled data, making the preparation process time-consuming and resource-intensive. In this paper, we propose to consider the process of prototyping a tool for annotating and generating training datasets based on video tracking and segmentation. We examine different approaches to solving this problem, from technology selection through to final implementation. The developed prototype significantly accelerates dataset generation compared to manual annotation. All resources are available at this https URL

50. 【2505.17883】FastCAV: Efficient Computation of Concept Activation Vectors for Explaining Deep Neural Networks

链接https://arxiv.org/abs/2505.17883

作者:Laines Schmalwasser,Niklas Penzel,Joachim Denzler,Julia Niebling

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:understand the world, Concept Activation Vectors, humans understand, Activation Vectors, Concepts

备注: Accepted at ICML 2025, 27 pages, 20 figures, 9 tables

点击查看摘要

Abstract:Concepts such as objects, patterns, and shapes are how humans understand the world. Building on this intuition, concept-based explainability methods aim to study representations learned by deep neural networks in relation to human-understandable concepts. Here, Concept Activation Vectors (CAVs) are an important tool and can identify whether a model learned a concept or not. However, the computational cost and time requirements of existing CAV computation pose a significant challenge, particularly in large-scale, high-dimensional architectures. To address this limitation, we introduce FastCAV, a novel approach that accelerates the extraction of CAVs by up to 63.6x (on average 46.4x). We provide a theoretical foundation for our approach and give concrete assumptions under which it is equivalent to established SVM-based methods. Our empirical results demonstrate that CAVs calculated with FastCAV maintain similar performance while being more efficient and stable. In downstream applications, i.e., concept-based explanation methods, we show that FastCAV can act as a replacement leading to equivalent insights. Hence, our approach enables previously infeasible investigations of deep models, which we demonstrate by tracking the evolution of concepts during model training.

51. 【2505.17881】Hyperspectral Anomaly Detection Fused Unified Nonconvex Tensor Ring Factors Regularization

链接https://arxiv.org/abs/2505.17881

作者:Wenjin Qin,Hailin Wang,Hao Shu,Feng Zhang,Jianjun Wang,Xiangyong Cao,Xi-Le Zhao,Gemine Vivone

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:gained significant attention, recent years, remote sensing, gained significant, significant attention

备注

点击查看摘要

Abstract:In recent years, tensor decomposition-based approaches for hyperspectral anomaly detection (HAD) have gained significant attention in the field of remote sensing. However, existing methods often fail to fully leverage both the global correlations and local smoothness of the background components in hyperspectral images (HSIs), which exist in both the spectral and spatial domains. This limitation results in suboptimal detection performance. To mitigate this critical issue, we put forward a novel HAD method named HAD-EUNTRFR, which incorporates an enhanced unified nonconvex tensor ring (TR) factors regularization. In the HAD-EUNTRFR framework, the raw HSIs are first decomposed into background and anomaly components. The TR decomposition is then employed to capture the spatial-spectral correlations within the background component. Additionally, we introduce a unified and efficient nonconvex regularizer, induced by tensor singular value decomposition (TSVD), to simultaneously encode the low-rankness and sparsity of the 3-D gradient TR factors into a unique concise form. The above characterization scheme enables the interpretable gradient TR factors to inherit the low-rankness and smoothness of the original background. To further enhance anomaly detection, we design a generalized nonconvex regularization term to exploit the group sparsity of the anomaly component. To solve the resulting doubly nonconvex model, we develop a highly efficient optimization algorithm based on the alternating direction method of multipliers (ADMM) framework. Experimental results on several benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art (SOTA) approaches in terms of detection accuracy.

52. 【2505.17867】Multi-task Learning For Joint Action and Gesture Recognition

链接https://arxiv.org/abs/2505.17867

作者:Konstantinos Spathis,Nikolaos Kardaris,Petros Maragos

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:computer vision tasks, practical applications, computer vision, addressed simultaneously, action and gesture

备注

点击查看摘要

Abstract:In practical applications, computer vision tasks often need to be addressed simultaneously. Multitask learning typically achieves this by jointly training a single deep neural network to learn shared representations, providing efficiency and improving generalization. Although action and gesture recognition are closely related tasks, since they focus on body and hand movements, current state-of-the-art methods handle them separately. In this paper, we show that employing a multi-task learning paradigm for action and gesture recognition results in more efficient, robust and generalizable visual representations, by leveraging the synergies between these tasks. Extensive experiments on multiple action and gesture datasets demonstrate that handling actions and gestures in a single architecture can achieve better performance for both tasks in comparison to their single-task learning variants.

53. 【2505.17860】Multi-Person Interaction Generation from Two-Person Motion Priors

链接https://arxiv.org/abs/2505.17860

作者:Wenning Xu,Shiyu Fan,Paul Henderson,Edmond S. L. Ho

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:multi-person interactions, social understanding, high-level controls, interactions, multi-person

备注: SIGGRAPH 2025 Conference Papers

点击查看摘要

Abstract:Generating realistic human motion with high-level controls is a crucial task for social understanding, robotics, and animation. With high-quality MOCAP data becoming more available recently, a wide range of data-driven approaches have been presented. However, modelling multi-person interactions still remains a less explored area. In this paper, we present Graph-driven Interaction Sampling, a method that can generate realistic and diverse multi-person interactions by leveraging existing two-person motion diffusion models as motion priors. Instead of training a new model specific to multi-person interaction synthesis, our key insight is to spatially and temporally separate complex multi-person interactions into a graph structure of two-person interactions, which we name the Pairwise Interaction Graph. We thus decompose the generation task into simultaneous single-person motion generation conditioned on one other's motion. In addition, to reduce artifacts such as interpenetrations of body parts in generated multi-person interactions, we introduce two graph-dependent guidance terms into the diffusion sampling scheme. Unlike previous work, our method can produce various high-quality multi-person interactions without having repetitive individual motions. Extensive experiments demonstrate that our approach consistently outperforms existing methods in reducing artifacts when generating a wide range of two-person and multi-person interactions.

54. 【2505.17844】Locality-Sensitive Hashing for Efficient Hard Negative Sampling in Contrastive Learning

链接https://arxiv.org/abs/2505.17844

作者:Fabian Deuser,Philipp Hausenblas,Hannah Schieber,Daniel Roth,Martin Werner,Norbert Oswald

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:neural network maps, network maps data, maps data elements, representational learning paradigm, Contrastive learning

备注

点击查看摘要

Abstract:Contrastive learning is a representational learning paradigm in which a neural network maps data elements to feature vectors. It improves the feature space by forming lots with an anchor and examples that are either positive or negative based on class similarity. Hard negative examples, which are close to the anchor in the feature space but from a different class, improve learning performance. Finding such examples of high quality efficiently in large, high-dimensional datasets is computationally challenging. In this paper, we propose a GPU-friendly Locality-Sensitive Hashing (LSH) scheme that quantizes real-valued feature vectors into binary representations for approximate nearest neighbor search. We investigate its theoretical properties and evaluate it on several datasets from textual and visual domain. Our approach achieves comparable or better performance while requiring significantly less computation than existing hard negative mining strategies.

55. 【2505.17835】VLM Models and Automated Grading of Atopic Dermatitis

链接https://arxiv.org/abs/2505.17835

作者:Marc Lalonde,Hamed Ghodrati

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:grading atopic dermatitis, atopic dermatitis, form of eczema, trained dermatologists, grading atopic

备注: 10 pages

点击查看摘要

Abstract:The task of grading atopic dermatitis (or AD, a form of eczema) from patient images is difficult even for trained dermatologists. Research on automating this task has progressed in recent years with the development of deep learning solutions; however, the rapid evolution of multimodal models and more specifically vision-language models (VLMs) opens the door to new possibilities in terms of explainable assessment of medical images, including dermatology. This report describes experiments carried out to evaluate the ability of seven VLMs to assess the severity of AD on a set of test images.

56. 【2505.17821】ICPL-ReID: Identity-Conditional Prompt Learning for Multi-Spectral Object Re-Identification

链接https://arxiv.org/abs/2505.17821

作者:Shihao Li,Chenglong Li,Aihua Zheng,Jin Tang,Bin Luo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:intelligent transportation applications, effectively addressing challenges, Prompt Learning, transportation applications, effectively addressing

备注: Accepted by IEEE Transactions on Multimedia (TMM)

点击查看摘要

Abstract:Multi-spectral object re-identification (ReID) brings a new perception perspective for smart city and intelligent transportation applications, effectively addressing challenges from complex illumination and adverse weather. However, complex modal differences between heterogeneous spectra pose challenges to efficiently utilizing complementary and discrepancy of spectra information. Most existing methods fuse spectral data through intricate modal interaction modules, lacking fine-grained semantic understanding of spectral information (\textit{e.g.}, text descriptions, part masks, and object keypoints). To solve this challenge, we propose a novel Identity-Conditional text Prompt Learning framework (ICPL), which exploits the powerful cross-modal alignment capability of CLIP, to unify different spectral visual features from text semantics. Specifically, we first propose the online prompt learning using learnable text prompt as the identity-level semantic center to bridge the identity semantics of different spectra in online manner. Then, in lack of concrete text descriptions, we propose the multi-spectral identity-condition module to use identity prototype as spectral identity condition to constraint prompt learning. Meanwhile, we construct the alignment loop mutually optimizing the learnable text prompt and spectral visual encoder to avoid online prompt learning disrupting the pre-trained text-image alignment distribution. In addition, to adapt to small-scale multi-spectral data and mitigate style differences between spectra, we propose multi-spectral adapter that employs a low-rank adaption method to learn spectra-specific features. Comprehensive experiments on 5 benchmarks, including RGBNT201, Market-MM, MSVR310, RGBN300, and RGBNT100, demonstrate that the proposed method outperforms the state-of-the-art methods.

57. 【2505.17812】Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations

链接https://arxiv.org/abs/2505.17812

作者:Boxu Chen,Ziwei Zheng,Le Yang,Zeyu Geng,Zhengyu Zhao,Chenhao Lin,Chao Shen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:achieved remarkable success, Large Vision-Language Models, generating outputs inconsistent, Large Vision-Language, achieved remarkable

备注

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable success but continue to struggle with object hallucination (OH), generating outputs inconsistent with visual inputs. While previous work has proposed methods to reduce OH, the visual decision-making mechanisms that lead to hallucinations remain poorly understood. In this paper, we propose VaLSe, a Vision-aware Latent Steering framework that adopts an interpretation-then-mitigation strategy to address OH in LVLMs. By tackling dual challenges of modeling complex vision-language interactions and eliminating spurious activation artifacts, VaLSe can generate visual contribution maps that trace how specific visual inputs influence individual output tokens. These maps reveal the model's vision-aware focus regions, which are then used to perform latent space steering, realigning internal representations toward semantically relevant content and reducing hallucinated outputs. Extensive experiments demonstrate that VaLSe is a powerful interpretability tool and an effective method for enhancing model robustness against OH across multiple benchmarks. Furthermore, our analysis uncovers limitations in existing OH evaluation metrics, underscoring the need for more nuanced, interpretable, and visually grounded OH benchmarks in future work. Code is available at: this https URL.

58. 【2505.17808】An Attention Infused Deep Learning System with Grad-CAM Visualization for Early Screening of Glaucoma

链接https://arxiv.org/abs/2505.17808

作者:Ramanathan Swaminathan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Cross Attention module, disruptive Vision Transformer, radical Cross Attention, research work reveals, eye opening wisdom

备注: 6 pages in general IEEE format, 8 figures, 4 tables, pdflatex

点击查看摘要

Abstract:This research work reveals the eye opening wisdom of the hybrid labyrinthine deep learning models synergy born out of combining a trailblazing convolutional neural network with a disruptive Vision Transformer, both intertwined together with a radical Cross Attention module. Here, two high yielding datasets for artificial intelligence models in detecting glaucoma, namely ACRIMA and Drishti, are utilized.

59. 【2505.17807】mporal Consistency Constrained Transferable Adversarial Attacks with Background Mixup for Action Recognition

链接https://arxiv.org/abs/2505.17807

作者:Ping Li,Jianan Ni,Bo Pang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Action recognition models, transferable attack methods, data modality, Background Mixup-induced Temporal, Existing transferable attack

备注: Accepted in IJCAI'25

点击查看摘要

Abstract:Action recognition models using deep learning are vulnerable to adversarial examples, which are transferable across other models trained on the same data modality. Existing transferable attack methods face two major challenges: 1) they heavily rely on the assumption that the decision boundaries of the surrogate (a.k.a., source) model and the target model are similar, which limits the adversarial transferability; and 2) their decision boundary difference makes the attack direction uncertain, which may result in the gradient oscillation, weakening the adversarial attack. This motivates us to propose a Background Mixup-induced Temporal Consistency (BMTC) attack method for action recognition. From the input transformation perspective, we design a model-agnostic background adversarial mixup module to reduce the surrogate-target model dependency. In particular, we randomly sample one video from each category and make its background frame, while selecting the background frame with the top attack ability for mixup with the clean frame by reinforcement learning. Moreover, to ensure an explicit attack direction, we leverage the background category as guidance for updating the gradient of adversarial example, and design a temporal gradient consistency loss, which strengthens the stability of the attack direction on subsequent frames. Empirical studies on two video datasets, i.e., UCF101 and Kinetics-400, and one image dataset, i.e., ImageNet, demonstrate that our method significantly boosts the transferability of adversarial examples across several action/image recognition models. Our code is available at this https URL.

60. 【2505.17799】A Coreset Selection of Coreset Selection Literature: Introduction and Recent Advances

链接https://arxiv.org/abs/2505.17799

作者:Brian B. Moser,Arundhati S. Shanbhag,Stanislav Frolov,Federico Raue,Joachim Folz,Andreas Dengel

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:preserves essential patterns, effective machine learning, finding a small, representative subset, preserves essential

备注

点击查看摘要

Abstract:Coreset selection targets the challenge of finding a small, representative subset of a large dataset that preserves essential patterns for effective machine learning. Although several surveys have examined data reduction strategies before, most focus narrowly on either classical geometry-based methods or active learning techniques. In contrast, this survey presents a more comprehensive view by unifying three major lines of coreset research, namely, training-free, training-oriented, and label-free approaches, into a single taxonomy. We present subfields often overlooked by existing work, including submodular formulations, bilevel optimization, and recent progress in pseudo-labeling for unlabeled datasets. Additionally, we examine how pruning strategies influence generalization and neural scaling laws, offering new insights that are absent from prior reviews. Finally, we compare these methods under varying computational, robustness, and performance demands and highlight open challenges, such as robustness, outlier filtering, and adapting coreset selection to foundation models, for future research.

61. 【2505.17796】DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval

链接https://arxiv.org/abs/2505.17796

作者:Yuxin Yang,Yinan Zhou,Yuxin Chen,Ziqi Zhang,Zongyang Ma,Chunfeng Yuan,Bing Li,Lin Song,Jun Gao,Peng Li,Weiming Hu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Composed Image Retrieval, retrieve target images, Composed Image, aims to retrieve, retrieve target

备注: 20 pages, 6 figures

点击查看摘要

Abstract:Composed Image Retrieval (CIR) aims to retrieve target images from a gallery based on a reference image and modification text as a combined query. Recent approaches focus on balancing global information from two modalities and encode the query into a unified feature for retrieval. However, due to insufficient attention to fine-grained details, these coarse fusion methods often struggle with handling subtle visual alterations or intricate textual instructions. In this work, we propose DetailFusion, a novel dual-branch framework that effectively coordinates information across global and detailed granularities, thereby enabling detail-enhanced CIR. Our approach leverages atomic detail variation priors derived from an image editing dataset, supplemented by a detail-oriented optimization strategy to develop a Detail-oriented Inference Branch. Furthermore, we design an Adaptive Feature Compositor that dynamically fuses global and detailed features based on fine-grained information of each unique multimodal query. Extensive experiments and ablation analyses not only demonstrate that our method achieves state-of-the-art performance on both CIRR and FashionIQ datasets but also validate the effectiveness and cross-domain adaptability of detail enhancement for CIR.

62. 【2505.17783】Generative Data Augmentation for Object Point Cloud Segmentation

链接https://arxiv.org/abs/2505.17783

作者:Dekai Zhu,Stefan Gavranovic,Flavien Boussuge,Benjamin Busam,Slobodan Ilic

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:train deep learning, address data scarcity, deep learning models, Data augmentation, train deep

备注

点击查看摘要

Abstract:Data augmentation is widely used to train deep learning models to address data scarcity. However, traditional data augmentation (TDA) typically relies on simple geometric transformation, such as random rotation and rescaling, resulting in minimal data diversity enrichment and limited model performance improvement. State-of-the-art generative models for 3D shape generation rely on the denoising diffusion probabilistic models and manage to generate realistic novel point clouds for 3D content creation and manipulation. Nevertheless, the generated 3D shapes lack associated point-wise semantic labels, restricting their usage in enlarging the training data for point cloud segmentation tasks. To bridge the gap between data augmentation techniques and the advanced diffusion models, we extend the state-of-the-art 3D diffusion model, Lion, to a part-aware generative model that can generate high-quality point clouds conditioned on given segmentation masks. Leveraging the novel generative model, we introduce a 3-step generative data augmentation (GDA) pipeline for point cloud segmentation training. Our GDA approach requires only a small amount of labeled samples but enriches the training data with generated variants and pseudo-labeled samples, which are validated by a novel diffusion-based pseudo-label filtering method. Extensive experiments on two large-scale synthetic datasets and a real-world medical dataset demonstrate that our GDA method outperforms TDA approach and related semi-supervised and self-supervised methods.

63. 【2505.17782】Hephaestus Minicubes: A Global, Multi-Modal Dataset for Volcanic Unrest Monitoring

链接https://arxiv.org/abs/2505.17782

作者:Nikolas Papadopoulos,Nikolaos Ioannis Bountos,Maria Sdraka,Andreas Karavias,Ioannis Papoutsis

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Synthetic Aperture Radar, Interferometric Synthetic Aperture, key precursor signal, precursor signal preceding, preceding volcanic eruptions

备注

点击查看摘要

Abstract:Ground deformation is regarded in volcanology as a key precursor signal preceding volcanic eruptions. Satellite-based Interferometric Synthetic Aperture Radar (InSAR) enables consistent, global-scale deformation tracking; however, deep learning methods remain largely unexplored in this domain, mainly due to the lack of a curated machine learning dataset. In this work, we build on the existing Hephaestus dataset, and introduce Hephaestus Minicubes, a global collection of 38 spatiotemporal datacubes offering high resolution, multi-source and multi-temporal information, covering 44 of the world's most active volcanoes over a 7-year period. Each spatiotemporal datacube integrates InSAR products, topographic data, as well as atmospheric variables which are known to introduce signal delays that can mimic ground deformation in InSAR imagery. Furthermore, we provide expert annotations detailing the type, intensity and spatial extent of deformation events, along with rich text descriptions of the observed scenes. Finally, we present a comprehensive benchmark, demonstrating Hephaestus Minicubes' ability to support volcanic unrest monitoring as a multi-modal, multi-temporal classification and semantic segmentation task, establishing strong baselines with state-of-the-art architectures. This work aims to advance machine learning research in volcanic monitoring, contributing to the growing integration of data-driven methods within Earth science applications.

64. 【2505.17779】U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding

链接https://arxiv.org/abs/2505.17779

作者:Anjie Le,Henan Liu,Yue Wang,Zhenyu Liu,Rongkun Zhu,Taohan Weng,Jinze Yu,Boyang Wang,Yalun Wu,Kaiwen Yan,Quanlin Sun,Meirui Jiang,Jialun Pei,Siya Liu,Haoyun Zheng,Zhoujun Li,Alison Noble,Jacques Souquet,Xiaoqing Guo,Manxi Lin,Hongcheng Guo

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:varying image quality, interpretation remains challenging, remains challenging due, widely-used imaging modality, imaging modality critical

备注

点击查看摘要

Abstract:Ultrasound is a widely-used imaging modality critical to global healthcare, yet its interpretation remains challenging due to its varying image quality on operators, noises, and anatomical structures. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 20 state-of-the-art LVLMs, both open- and closed-source, general-purpose and medical-specific. Our results reveal strong performance on image-level classification, but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.

65. 【2505.17778】xtFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis

链接https://arxiv.org/abs/2505.17778

作者:Yu Xie,Jielei Zhang,Pengyu Chen,Ziyue Wang,Weihang Wang,Longwen Gao,Peiyi Li,Huyang Sun,Qiang Zhang,Qian Qiao,Jiaqing Fan,Zhouhui Lian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:require large-scale annotated, Diffusion-based scene text, Diffusion-based scene, large-scale annotated data, visual conditioning modules

备注

点击查看摘要

Abstract:Diffusion-based scene text synthesis has progressed rapidly, yet existing methods commonly rely on additional visual conditioning modules and require large-scale annotated data to support multilingual generation. In this work, we revisit the necessity of complex auxiliary modules and further explore an approach that simultaneously ensures glyph accuracy and achieves high-fidelity scene integration, by leveraging diffusion models' inherent capabilities for contextual reasoning. To this end, we introduce TextFlux, a DiT-based framework that enables multilingual scene text synthesis. The advantages of TextFlux can be summarized as follows: (1) OCR-free model architecture. TextFlux eliminates the need for OCR encoders (additional visual conditioning modules) that are specifically used to extract visual text-related features. (2) Strong multilingual scalability. TextFlux is effective in low-resource multilingual settings, and achieves strong performance in newly added languages with fewer than 1,000 samples. (3) Streamlined training setup. TextFlux is trained with only 1% of the training data required by competing methods. (4) Controllable multi-line text generation. TextFlux offers flexible multi-line synthesis with precise line-level control, outperforming methods restricted to single-line or rigid layouts. Extensive experiments and visualizations demonstrate that TextFlux outperforms previous methods in both qualitative and quantitative evaluations.

66. 【2505.17771】opoPoint: Enhance Topology Reasoning via Endpoint Detection in Autonomous Driving

链接https://arxiv.org/abs/2505.17771

作者:Yanping Fu,Xinyuan Liu,Tianyu Li,Yike Ma,Yucheng Zhang,Feng Dai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Topology reasoning, plays a vital, autonomous driving, lane endpoints, unifies perception

备注

点击查看摘要

Abstract:Topology reasoning, which unifies perception and structured reasoning, plays a vital role in understanding intersections for autonomous driving. However, its performance heavily relies on the accuracy of lane detection, particularly at connected lane endpoints. Existing methods often suffer from lane endpoints deviation, leading to incorrect topology construction. To address this issue, we propose TopoPoint, a novel framework that explicitly detects lane endpoints and jointly reasons over endpoints and lanes for robust topology reasoning. During training, we independently initialize point and lane query, and proposed Point-Lane Merge Self-Attention to enhance global context sharing through incorporating geometric distances between points and lanes as an attention mask . We further design Point-Lane Graph Convolutional Network to enable mutual feature aggregation between point and lane query. During inference, we introduce Point-Lane Geometry Matching algorithm that computes distances between detected points and lanes to refine lane endpoints, effectively mitigating endpoint deviation. Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoPoint achieves state-of-the-art performance in topology reasoning (48.8 on OLS). Additionally, we propose DET$_p$ to evaluate endpoint detection, under which our method significantly outperforms existing approaches (52.6 v.s. 45.2 on DET$_p$). The code is released at this https URL.

67. 【2505.17768】R-Genie: Reasoning-Guided Generative Image Editing

链接https://arxiv.org/abs/2505.17768

作者:Dong Zhang,Lingfeng He,Rui Yan,Fei Shen,Jinhui Tang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:current methods remain, lacking deep comprehension, methods remain constrained, explicit textual instructions, limited editing operations

备注: [this https URL](https://dongzhang89.github.io/RGenie.github.io/)

点击查看摘要

Abstract:While recent advances in image editing have enabled impressive visual synthesis capabilities, current methods remain constrained by explicit textual instructions and limited editing operations, lacking deep comprehension of implicit user intentions and contextual reasoning. In this work, we introduce a new image editing paradigm: reasoning-guided generative editing, which synthesizes images based on complex, multi-faceted textual queries accepting world knowledge and intention inference. To facilitate this task, we first construct a comprehensive dataset featuring over 1,000 image-instruction-edit triples that incorporate rich reasoning contexts and real-world knowledge. We then propose R-Genie: a reasoning-guided generative image editor, which synergizes the generation power of diffusion models with advanced reasoning capabilities of multimodal large language models. R-Genie incorporates a reasoning-attention mechanism to bridge linguistic understanding with visual synthesis, enabling it to handle intricate editing requests involving abstract user intentions and contextual reasoning relations. Extensive experimental results validate that R-Genie can equip diffusion models with advanced reasoning-based editing capabilities, unlocking new potentials for intelligent image synthesis.

68. 【2505.17748】Soft-CAM: Making black box models self-explainable for high-stakes decisions

链接https://arxiv.org/abs/2505.17748

作者:Kerol Djoumessi,Philipp Berens

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Convolutional neural networks, Convolutional neural, surpassing human performance, neural networks, surpassing human

备注

点击查看摘要

Abstract:Convolutional neural networks (CNNs) are widely used for high-stakes applications like medicine, often surpassing human performance. However, most explanation methods rely on post-hoc attribution, approximating the decision-making process of already trained black-box models. These methods are often sensitive, unreliable, and fail to reflect true model reasoning, limiting their trustworthiness in critical applications. In this work, we introduce SoftCAM, a straightforward yet effective approach that makes standard CNN architectures inherently interpretable. By removing the global average pooling layer and replacing the fully connected classification layer with a convolution-based class evidence layer, SoftCAM preserves spatial information and produces explicit class activation maps that form the basis of the model's predictions. Evaluated on three medical datasets, SoftCAM maintains classification performance while significantly improving both the qualitative and quantitative explanation compared to existing post-hoc methods. Our results demonstrate that CNNs can be inherently interpretable without compromising performance, advancing the development of self-explainable deep learning for high-stakes decision-making.

69. 【2505.17732】RQR3D: Reparametrizing the regression targets for BEV-based 3D object detection

链接https://arxiv.org/abs/2505.17732

作者:Ozsel Kilinc,Cem Tarhan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:object detection, oriented object detection, object, detection, Accurate

备注

点击查看摘要

Abstract:Accurate, fast, and reliable 3D perception is essential for autonomous driving. Recently, bird's-eye view (BEV)-based perception approaches have emerged as superior alternatives to perspective-based solutions, offering enhanced spatial understanding and more natural outputs for planning. Existing BEV-based 3D object detection methods, typically adhering to angle-based representation, directly estimate the size and orientation of rotated bounding boxes. We observe that BEV-based 3D object detection is analogous to aerial oriented object detection, where angle-based methods are recognized for being affected by discontinuities in their loss functions. Drawing inspiration from this domain, we propose Restricted Quadrilateral Representation to define 3D regression targets. RQR3D regresses the smallest horizontal bounding box encapsulating the oriented box, along with the offsets between the corners of these two boxes, thereby transforming the oriented object detection problem into a keypoint regression task. RQR3D is compatible with any 3D object detection approach. We employ RQR3D within an anchor-free single-stage object detection method and introduce an objectness head to address class imbalance problem. Furthermore, we introduce a simplified radar fusion backbone that eliminates the need for voxel grouping and processes the BEV-mapped point cloud with standard 2D convolutions, rather than sparse convolutions. Extensive evaluations on the nuScenes dataset demonstrate that RQR3D achieves state-of-the-art performance in camera-radar 3D object detection, outperforming the previous best method by +4% in NDS and +2.4% in mAP, and significantly reducing the translation and orientation errors, which are crucial for safe autonomous driving. These consistent gains highlight the robustness, precision, and real-world readiness of our approach.

70. 【2505.17727】SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain

链接https://arxiv.org/abs/2505.17727

作者:Jiawei Zhou,Linye Lyu,Zhuotao Tian,Cheng Zhuo,Yu Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:autonomous driving systems, scenarios are rare, rare yet pivotal, pivotal for evaluating, evaluating and enhancing

备注

点击查看摘要

Abstract:Safety-critical scenarios are rare yet pivotal for evaluating and enhancing the robustness of autonomous driving systems. While existing methods generate safety-critical driving trajectories, simulations, or single-view videos, they fall short of meeting the demands of advanced end-to-end autonomous systems (E2E AD), which require real-world, multi-view video data. To bridge this gap, we introduce SafeMVDrive, the first framework designed to generate high-quality, safety-critical, multi-view driving videos grounded in real-world domains. SafeMVDrive strategically integrates a safety-critical trajectory generator with an advanced multi-view video generator. To tackle the challenges inherent in this integration, we first enhance scene understanding ability of the trajectory generator by incorporating visual context -- which is previously unavailable to such generator -- and leveraging a GRPO-finetuned vision-language model to achieve more realistic and context-aware trajectory generation. Second, recognizing that existing multi-view video generators struggle to render realistic collision events, we introduce a two-stage, controllable trajectory generation mechanism that produces collision-evasion trajectories, ensuring both video quality and safety-critical fidelity. Finally, we employ a diffusion-based multi-view video generator to synthesize high-quality safety-critical driving videos from the generated trajectories. Experiments conducted on an E2E AD planner demonstrate a significant increase in collision rate when tested with our generated data, validating the effectiveness of SafeMVDrive in stress-testing planning modules. Our code, examples, and datasets are publicly available at: this https URL.

71. 【2505.17726】Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

链接https://arxiv.org/abs/2505.17726

作者:Donghwan Chi,Hyomin Kim,Yoonjin Oh,Yongjin Kim,Donghoon Lee,Daejin Jo,Jongmin Kim,Junyeob Baek,Sungjin Ahn,Sungwoong Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:large language models, artificial general intelligence, achieving artificial general, multimodal large language, language models

备注

点击查看摘要

Abstract:Recently, multimodal large language models (MLLMs) have emerged as a key approach in achieving artificial general intelligence. In particular, vision-language MLLMs have been developed to generate not only text but also visual outputs from multimodal inputs. This advancement requires efficient image tokens that LLMs can process effectively both in input and output. However, existing image tokenization methods for MLLMs typically capture only global abstract concepts or uniformly segmented image patches, restricting MLLMs' capability to effectively understand or generate detailed visual content, particularly at the object level. To address this limitation, we propose an object-centric visual tokenizer based on Slot Attention specifically for MLLMs. In particular, based on the Q-Former encoder, diffusion decoder, and residual vector quantization, our proposed discretized slot tokens can encode local visual details while maintaining high-level semantics, and also align with textual data to be integrated seamlessly within a unified next-token prediction framework of LLMs. The resulting Slot-MLLM demonstrates significant performance improvements over baselines with previous visual tokenizers across various vision-language tasks that entail local detailed comprehension and generation. Notably, this work is the first demonstration of the feasibility of object-centric slot attention performed with MLLMs and in-the-wild natural images.

72. 【2505.17721】SeaLion: Semantic Part-Aware Latent Point Diffusion Models for 3D Generation

链接https://arxiv.org/abs/2505.17721

作者:Dekai Zhu,Yan Di,Stefan Gavranovic,Slobodan Ilic

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enabling numerous downstream, numerous downstream applications, achieved significant success, point clouds, generated point clouds

备注

点击查看摘要

Abstract:Denoising diffusion probabilistic models have achieved significant success in point cloud generation, enabling numerous downstream applications, such as generative data augmentation and 3D model editing. However, little attention has been given to generating point clouds with point-wise segmentation labels, as well as to developing evaluation metrics for this task. Therefore, in this paper, we present SeaLion, a novel diffusion model designed to generate high-quality and diverse point clouds with fine-grained segmentation labels. Specifically, we introduce the semantic part-aware latent point diffusion technique, which leverages the intermediate features of the generative models to jointly predict the noise for perturbed latent points and associated part segmentation labels during the denoising process, and subsequently decodes the latent points to point clouds conditioned on part segmentation labels. To effectively evaluate the quality of generated point clouds, we introduce a novel point cloud pairwise distance calculation method named part-aware Chamfer distance (p-CD). This method enables existing metrics, such as 1-NNA, to measure both the local structural quality and inter-part coherence of generated point clouds. Experiments on the large-scale synthetic dataset ShapeNet and real-world medical dataset IntrA demonstrate that SeaLion achieves remarkable performance in generation quality and diversity, outperforming the existing state-of-the-art model, DiffFacto, by 13.33% and 6.52% on 1-NNA (p-CD) across the two datasets. Experimental analysis shows that SeaLion can be trained semi-supervised, thereby reducing the demand for labeling efforts. Lastly, we validate the applicability of SeaLion in generative data augmentation for training segmentation models and the capability of SeaLion to serve as a tool for part-aware 3D shape editing.

73. 【2505.17702】Seek-CAD: A Self-refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek

链接https://arxiv.org/abs/2505.17702

作者:Xueyang Li,Jiahao Li,Yu Song,Yunzhong Lou,Xiangdong Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Language Models, CAD model, CAD parametric models, parametric CAD model, CAD

备注

点击查看摘要

Abstract:The advent of Computer-Aided Design (CAD) generative modeling will significantly transform the design of industrial products. The recent research endeavor has extended into the realm of Large Language Models (LLMs). In contrast to fine-tuning methods, training-free approaches typically utilize the advanced closed-source LLMs, thereby offering enhanced flexibility and efficiency in the development of AI agents for generating CAD parametric models. However, the substantial cost and limitations of local deployment of the top-tier closed-source LLMs pose challenges in practical applications. The Seek-CAD is the pioneer exploration of locally deployed open-source inference LLM DeepSeek-R1 for CAD parametric model generation with a training-free methodology. This study is the first investigation to incorporate both visual and Chain-of-Thought (CoT) feedback within the self-refinement mechanism for generating CAD models. Specifically, the initial generated parametric CAD model is rendered into a sequence of step-wise perspective images, which are subsequently processed by a Vision Language Model (VLM) alongside the corresponding CoTs derived from DeepSeek-R1 to assess the CAD model generation. Then, the feedback is utilized by DeepSeek-R1 to refine the initial generated model for the next round of generation. Moreover, we present an innovative 3D CAD model dataset structured around the SSR (Sketch, Sketch-based feature, and Refinements) triple design paradigm. This dataset encompasses a wide range of CAD commands, thereby aligning effectively with industrial application requirements and proving suitable for the generation of LLMs. Extensive experiments validate the effectiveness of Seek-CAD under various metrics.

74. 【2505.17695】SynRES: Towards Referring Expression Segmentation in the Wild via Synthetic Data

链接https://arxiv.org/abs/2505.17695

作者:Dong-Hee Kim,Hyunjee Song,Donghyun Kim

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Referring Expression Segmentation, Expression Segmentation, Referring Expression, protocols remain constrained, advances in Referring

备注

点击查看摘要

Abstract:Despite the advances in Referring Expression Segmentation (RES) benchmarks, their evaluation protocols remain constrained, primarily focusing on either single targets with short queries (containing minimal attributes) or multiple targets from distinctly different queries on a single domain. This limitation significantly hinders the assessment of more complex reasoning capabilities in RES models. We introduce WildRES, a novel benchmark that incorporates long queries with diverse attributes and non-distinctive queries for multiple targets. This benchmark spans diverse application domains, including autonomous driving environments and robotic manipulation scenarios, thus enabling more rigorous evaluation of complex reasoning capabilities in real-world settings. Our analysis reveals that current RES models demonstrate substantial performance deterioration when evaluated on WildRES. To address this challenge, we introduce SynRES, an automated pipeline generating densely paired compositional synthetic training data through three innovations: (1) a dense caption-driven synthesis for attribute-rich image-mask-expression triplets, (2) reliable semantic alignment mechanisms rectifying caption-pseudo mask inconsistencies via Image-Text Aligned Grouping, and (3) domain-aware augmentations incorporating mosaic composition and superclass replacement to emphasize generalization ability and distinguishing attributes over object categories. Experimental results demonstrate that models trained with SynRES achieve state-of-the-art performance, improving gIoU by 2.0% on WildRES-ID and 3.8% on WildRES-DS. Code and datasets are available at this https URL.

75. 【2505.17692】ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection

链接https://arxiv.org/abs/2505.17692

作者:Ziteng Yang,Jingzehua Xu,Yanshu Li,Zepeng Li,Yeqiang Wang,Xinghui Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Zero-shot anomaly detection, domain training samples, external auxiliary data, target domain training, aims to detect

备注

点击查看摘要

Abstract:Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target domain training samples, relying solely on external auxiliary data. Existing CLIP-based methods attempt to activate the model's ZSAD potential via handcrafted or static learnable prompts. The former incur high engineering costs and limited semantic coverage, whereas the latter apply identical descriptions across diverse anomaly types, thus fail to adapt to complex variations. Furthermore, since CLIP is originally pretrained on large-scale classification tasks, its anomaly segmentation quality is highly sensitive to the exact wording of class names, severely constraining prompting strategies that depend on class labels. To address these challenges, we introduce ViP$^{2}$-CLIP. The key insight of ViP$^{2}$-CLIP is a Visual-Perception Prompting (ViP-Prompt) mechanism, which fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts, eliminating manual templates and class-name priors. This design enables our model to focus on precise abnormal regions, making it particularly valuable when category labels are ambiguous or privacy-constrained. Extensive experiments on 15 industrial and medical benchmarks demonstrate that ViP$^{2}$-CLIP achieves state-of-the-art performance and robust cross-domain generalization.

76. 【2505.17690】Semi-Supervised Medical Image Segmentation via Dual Networks

链接https://arxiv.org/abs/2505.17690

作者:Yunyao Lu,Yihang Wu,Reem Kateb,Ahmad Chaddad

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Traditional supervised medical, Traditional supervised, large-scale labeled datasets, require large amounts, segmentation models require

备注: Accepted in ISBI2025

点击查看摘要

Abstract:Traditional supervised medical image segmentation models require large amounts of labeled data for training; however, obtaining such large-scale labeled datasets in the real world is extremely challenging. Recent semi-supervised segmentation models also suffer from noisy pseudo-label issue and limited supervision in feature space. To solve these challenges, we propose an innovative semi-supervised 3D medical image segmentation method to reduce the dependency on large, expert-labeled datasets. Furthermore, we introduce a dual-network architecture to address the limitations of existing methods in using contextual information and generating reliable pseudo-labels. In addition, a self-supervised contrastive learning strategy is used to enhance the representation of the network and reduce prediction uncertainty by distinguishing between reliable and unreliable predictions. Experiments on clinical magnetic resonance imaging demonstrate that our approach outperforms state-of-the-art techniques. Our code is available at this https URL.

77. 【2505.17685】FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

链接https://arxiv.org/abs/2505.17685

作者:Shuang Zeng,Xinyuan Chang,Mengwei Xie,Xinran Liu,Yifan Bai,Zheng Pan,Mu Xu,Xing Wei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:attracted increasing interest, powerful reasoning capabilities, attracted increasing, increasing interest, autonomous driving due

备注

点击查看摘要

Abstract:Visual language models (VLMs) have attracted increasing interest in autonomous driving due to their powerful reasoning capabilities. However, existing VLMs typically utilize discrete text Chain-of-Thought (CoT) tailored to the current scenario, which essentially represents highly abstract and symbolic compression of visual information, potentially leading to spatio-temporal relationship ambiguity and fine-grained information loss. Is autonomous driving better modeled on real-world simulation and imagination than on pure symbolic logic? In this paper, we propose a spatio-temporal CoT reasoning method that enables models to think visually. First, VLM serves as a world model to generate unified image frame for predicting future world states: where perception results (e.g., lane divider and 3D detection) represent the future spatial relationships, and ordinary future frame represent the temporal evolution relationships. This spatio-temporal CoT then serves as intermediate reasoning steps, enabling the VLM to function as an inverse dynamics model for trajectory planning based on current observations and future predictions. To implement visual generation in VLMs, we propose a unified pretraining paradigm integrating visual generation and understanding, along with a progressive visual CoT enhancing autoregressive image generation. Extensive experimental results demonstrate the effectiveness of the proposed method, advancing autonomous driving towards visual reasoning.

78. 【2505.17684】5G-DIL: Domain Incremental Learning with Similarity-Aware Sampling for Dynamic 5G Indoor Localization

链接https://arxiv.org/abs/2505.17684

作者:Nisha Lakshmana Raichur,Lucas Heublein,Christopher Mutschler,Felix Ott

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recent machine learning, adoption of recent, recent machine, machine learning, achieved high accuracy

备注: 7 pages, 6 figures

点击查看摘要

Abstract:Indoor positioning based on 5G data has achieved high accuracy through the adoption of recent machine learning (ML) techniques. However, the performance of learning-based methods degrades significantly when environmental conditions change, thereby hindering their applicability to new scenarios. Acquiring new training data for each environmental change and fine-tuning ML models is both time-consuming and resource-intensive. This paper introduces a domain incremental learning (DIL) approach for dynamic 5G indoor localization, called 5G-DIL, enabling rapid adaptation to environmental changes. We present a novel similarity-aware sampling technique based on the Chebyshev distance, designed to efficiently select specific exemplars from the previous environment while training only on the modified regions of the new environment. This avoids the need to train on the entire region, significantly reducing the time and resources required for adaptation without compromising localization accuracy. This approach requires as few as 50 exemplars from adaptation domains, significantly reducing training time while maintaining high positioning accuracy in previous environments. Comparative evaluations against state-of-the-art DIL techniques on a challenging real-world indoor dataset demonstrate the effectiveness of the proposed sample selection method. Our approach is adaptable to real-world non-line-of-sight propagation scenarios and achieves an MAE positioning error of 0.261 meters, even under dynamic environmental conditions. Code: this https URL

79. 【2505.17677】owards Dynamic 3D Reconstruction of Hand-Instrument Interaction in Ophthalmic Surgery

链接https://arxiv.org/abs/2505.17677

作者:Ming Hu,Zhendi Yu,Feilong Tang,Kaiwen Chen,Yulong Li,Imran Razzak,Junjun He,Tolga Birdal,Kaijing Zhou,Zongyuan Ge

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:reliable annotation tools, lack of realistic, critical for vision-based, vision-based analysis, MANO hand meshes

备注

点击查看摘要

Abstract:Accurate 3D reconstruction of hands and instruments is critical for vision-based analysis of ophthalmic microsurgery, yet progress has been hampered by the lack of realistic, large-scale datasets and reliable annotation tools. In this work, we introduce OphNet-3D, the first extensive RGB-D dynamic 3D reconstruction dataset for ophthalmic surgery, comprising 41 sequences from 40 surgeons and totaling 7.1 million frames, with fine-grained annotations of 12 surgical phases, 10 instrument categories, dense MANO hand meshes, and full 6-DoF instrument poses. To scalably produce high-fidelity labels, we design a multi-stage automatic annotation pipeline that integrates multi-view data observation, data-driven motion prior with cross-view geometric consistency and biomechanical constraints, along with a combination of collision-aware interaction constraints for instrument interactions. Building upon OphNet-3D, we establish two challenging benchmarks-bimanual hand pose estimation and hand-instrument interaction reconstruction-and propose two dedicated architectures: H-Net for dual-hand mesh recovery and OH-Net for joint reconstruction of two-hand-two-instrument interactions. These models leverage a novel spatial reasoning module with weak-perspective camera modeling and collision-aware center-based representation. Both architectures outperform existing methods by substantial margins, achieving improvements of over 2mm in Mean Per Joint Position Error (MPJPE) and up to 23% in ADD-S metrics for hand and instrument reconstruction, respectively.

80. 【2505.17674】SVL: Spike-based Vision-language Pretraining for Efficient 3D Open-world Understanding

链接https://arxiv.org/abs/2505.17674

作者:Xuerui Qiu,Peixi Wu,Yaozhi Wen,Shaowei Gu,Yuqi Pan,Xinhao Luo,Bo XU,Guoqi Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Spiking Neural Networks, Artificial Neural Networks, Neural Networks, Spiking Neural, Artificial Neural

备注

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing SNNs still exhibit a significant performance gap compared to Artificial Neural Networks (ANNs) due to inadequate pre-training strategies. These limitations manifest as restricted generalization ability, task specificity, and a lack of multimodal understanding, particularly in challenging tasks such as multimodal question answering and zero-shot 3D classification. To overcome these challenges, we propose a Spike-based Vision-Language (SVL) pretraining framework that empowers SNNs with open-world 3D understanding while maintaining spike-driven efficiency. SVL introduces two key components: (i) Multi-scale Triple Alignment (MTA) for label-free triplet-based contrastive learning across 3D, image, and text modalities, and (ii) Re-parameterizable Vision-Language Integration (Rep-VLI) to enable lightweight inference without relying on large text encoders. Extensive experiments show that SVL achieves a top-1 accuracy of 85.4% in zero-shot 3D classification, surpassing advanced ANN models, and consistently outperforms prior SNNs on downstream tasks, including 3D classification (+6.1%), DVS action recognition (+2.1%), 3D detection (+1.1%), and 3D segmentation (+2.1%) with remarkable efficiency. Moreover, SVL enables SNNs to perform open-world 3D question answering, sometimes outperforming ANNs. To the best of our knowledge, SVL represents the first scalable, generalizable, and hardware-friendly paradigm for 3D open-world understanding, effectively bridging the gap between SNNs and ANNs in complex open-world understanding tasks. Code is available this https URL.

81. 【2505.17666】Proto-FG3D: Prototype-based Interpretable Fine-Grained 3D Shape Classification

链接https://arxiv.org/abs/2505.17666

作者:Shuxian Ma,Zihao Dong,Runmin Cong,Sam Kwong,Xiuli Shao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Deep learning-based multi-view, achieved remarkable success, Deep learning-based, learning-based multi-view coarse-grained, past decade

备注: 11 pages, 2 figures, 5 tablets; Submitted to BMVC2025

点击查看摘要

Abstract:Deep learning-based multi-view coarse-grained 3D shape classification has achieved remarkable success over the past decade, leveraging the powerful feature learning capabilities of CNN-based and ViT-based backbones. However, as a challenging research area critical for detailed shape understanding, fine-grained 3D classification remains understudied due to the limited discriminative information captured during multi-view feature aggregation, particularly for subtle inter-class variations, class imbalance, and inherent interpretability limitations of parametric model. To address these problems, we propose the first prototype-based framework named Proto-FG3D for fine-grained 3D shape classification, achieving a paradigm shift from parametric softmax to non-parametric prototype learning. Firstly, Proto-FG3D establishes joint multi-view and multi-category representation learning via Prototype Association. Secondly, prototypes are refined via Online Clustering, improving both the robustness of multi-view feature allocation and inter-subclass balance. Finally, prototype-guided supervised learning is established to enhance fine-grained discrimination via prototype-view correlation analysis and enables ad-hoc interpretability through transparent case-based reasoning. Experiments on FG3D and ModelNet40 show Proto-FG3D surpasses state-of-the-art methods in accuracy, transparent predictions, and ad-hoc interpretability with visualizations, challenging conventional fine-grained 3D recognition approaches.

82. 【2505.17665】EMRA-proxy: Enhancing Multi-Class Region Semantic Segmentation in Remote Sensing Images with Attention Proxy

链接https://arxiv.org/abs/2505.17665

作者:Yichun Yu,Yuqing Lan,Zhihuan Xing,Xiaoyi Yang,Tingyue Tang,Dan Yu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:High-resolution remote sensing, diverse object appearances, High-resolution remote, Region-Aware Proxy Network, complex spatial layouts

备注: Proceedings of the 20th International Conference on Intelligent Computing (ICIC 2024): Poster Volume I. Tianjin, China, 2024: 538-562

点击查看摘要

Abstract:High-resolution remote sensing (HRRS) image segmentation is challenging due to complex spatial layouts and diverse object appearances. While CNNs excel at capturing local features, they struggle with long-range dependencies, whereas Transformers can model global context but often neglect local details and are computationally this http URL propose a novel approach, Region-Aware Proxy Network (RAPNet), which consists of two components: Contextual Region Attention (CRA) and Global Class Refinement (GCR). Unlike traditional methods that rely on grid-based layouts, RAPNet operates at the region level for more flexible segmentation. The CRA module uses a Transformer to capture region-level contextual dependencies, generating a Semantic Region Mask (SRM). The GCR module learns a global class attention map to refine multi-class information, combining the SRM and attention map for accurate this http URL on three public datasets show that RAPNet outperforms state-of-the-art methods, achieving superior multi-class segmentation accuracy.

83. 【2505.17659】Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling

链接https://arxiv.org/abs/2505.17659

作者:Xiaolong Tang,Meina Kan,Shiguang Shan,Xilin Chen

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:autonomous driving systems, Safe and feasible, real-world autonomous driving, essential for real-world, real-world autonomous

备注

点击查看摘要

Abstract:Safe and feasible trajectory planning is essential for real-world autonomous driving systems. However, existing learning-based planning methods often rely on expert demonstrations, which not only lack explicit safety awareness but also risk inheriting unsafe behaviors such as speeding from suboptimal human driving data. Inspired by the success of large language models, we propose Plan-R1, a novel two-stage trajectory planning framework that formulates trajectory planning as a sequential prediction task, guided by explicit planning principles such as safety, comfort, and traffic rule compliance. In the first stage, we train an autoregressive trajectory predictor via next motion token prediction on expert data. In the second stage, we design rule-based rewards (e.g., collision avoidance, speed limits) and fine-tune the model using Group Relative Policy Optimization (GRPO), a reinforcement learning strategy, to align its predictions with these planning principles. Experiments on the nuPlan benchmark demonstrate that our Plan-R1 significantly improves planning safety and feasibility, achieving state-of-the-art performance.

84. 【2505.17649】Instruct2See: Learning to Remove Any Obstructions Across Distributions

链接https://arxiv.org/abs/2505.17649

作者:Junhang Li,Yu Guo,Chuhua Xian,Shengfeng He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:capture limitations, hindering the observation, objects of interest, due to capture, observation of objects

备注

点击查看摘要

Abstract:Images are often obstructed by various obstacles due to capture limitations, hindering the observation of objects of interest. Most existing methods address occlusions from specific elements like fences or raindrops, but are constrained by the wide range of real-world obstructions, making comprehensive data collection impractical. To overcome these challenges, we propose Instruct2See, a novel zero-shot framework capable of handling both seen and unseen obstacles. The core idea of our approach is to unify obstruction removal by treating it as a soft-hard mask restoration problem, where any obstruction can be represented using multi-modal prompts, such as visual semantics and textual instructions, processed through a cross-attention unit to enhance contextual understanding and improve mode control. Additionally, a tunable mask adapter allows for dynamic soft masking, enabling real-time adjustment of inaccurate masks. Extensive experiments on both in-distribution and out-of-distribution obstacles show that Instruct2See consistently achieves strong performance and generalization in obstruction removal, regardless of whether the obstacles were present during the training phase. Code and dataset are available at this https URL.

85. 【2505.17645】HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning

链接https://arxiv.org/abs/2505.17645

作者:Chuhao Zhou,Jianfei Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)

关键词:diverse sensory inputs, Embodied agents operating, Large Language Model, understand human behavior, Multimodal Large Language

备注: 18 pages, 13 figures, 6 tables

点击查看摘要

Abstract:Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language. While Vision-Language Models (VLMs) have enabled impressive language-grounded perception, their reliance on visual data limits robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning across heterogeneous environments. We address two key challenges: (1) the scarcity of aligned modality-text data for rare sensors, and (2) the heterogeneity of their physical signal representations. To overcome these, we design a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings with fine-grained, text-aligned features from tailored encoders via coarse-to-fine cross-attention without introducing significant alignment overhead. We further introduce a human-VLM collaborative data curation pipeline to generate paired textual annotations for sensing datasets. Extensive experiments on two newly constructed benchmarks show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%. This work establishes a new foundation for real-world, language-informed multisensory embodied intelligence.

86. 【2505.17625】Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports

链接https://arxiv.org/abs/2505.17625

作者:Hayato Aida,Kosuke Takahashi,Takahiro Omi

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Language Models, understand table structures, Large Language, retrieval-augmented generation

备注: Accepted at IIAI AAI 2025, the 3rd International Conference on Computational and Data Sciences in Economics and Finance

点击查看摘要

Abstract:With recent advancements in Large Language Models (LLMs) and growing interest in retrieval-augmented generation (RAG), the ability to understand table structures has become increasingly important. This is especially critical in financial domains such as securities reports, where highly accurate question answering (QA) over tables is required. However, tables exist in various formats-including HTML, images, and plain text-making it difficult to preserve and extract structural information. Therefore, multimodal LLMs are essential for robust and general-purpose table understanding. Despite their promise, current Large Vision-Language Models (LVLMs), which are major representatives of multimodal LLMs, still face challenges in accurately understanding characters and their spatial relationships within documents. In this study, we propose a method to enhance LVLM-based table understanding by incorporating in-table textual content and layout features. Experimental results demonstrate that these auxiliary modalities significantly improve performance, enabling robust interpretation of complex document layouts without relying on explicitly structured input formats.

87. 【2505.17619】CAS-IQA: Teaching Vision-Language Models for Synthetic Angiography Quality Assessment

链接https://arxiv.org/abs/2505.17619

作者:Bo Wang,De-Xing Huang,Xiao-Hu Zhou,Mei-Jiang Gui,Nu-Fang Xiao,Jian-Long Hao,Ming-Yuan Liu,Zeng-Guang Hou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:X-ray angiographies generated, vascular interventional procedures, Synthetic X-ray angiographies, hold great potential, Synthetic X-ray

备注: Under review

点击查看摘要

Abstract:Synthetic X-ray angiographies generated by modern generative models hold great potential to reduce the use of contrast agents in vascular interventional procedures. However, low-quality synthetic angiographies can significantly increase procedural risk, underscoring the need for reliable image quality assessment (IQA) methods. Existing IQA models, however, fail to leverage auxiliary images as references during evaluation and lack fine-grained, task-specific metrics necessary for clinical relevance. To address these limitations, this paper proposes CAS-IQA, a vision-language model (VLM)-based framework that predicts fine-grained quality scores by effectively incorporating auxiliary information from related images. In the absence of angiography datasets, CAS-3K is constructed, comprising 3,565 synthetic angiographies along with score annotations. To ensure clinically meaningful assessment, three task-specific evaluation metrics are defined. Furthermore, a Multi-path featUre fuSion and rouTing (MUST) module is designed to enhance image representations by adaptively fusing and routing visual tokens to metric-specific branches. Extensive experiments on the CAS-3K dataset demonstrate that CAS-IQA significantly outperforms state-of-the-art IQA methods by a considerable margin.

88. 【2505.17618】Scaling Image and Video Generation via Test-Time Evolutionary Search

链接https://arxiv.org/abs/2505.17618

作者:Haoran He,Jiajun Liang,Xintao Wang,Pengfei Wan,Di Zhang,Kun Gai,Ling Pan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:allocating additional computation, model pre-training continues, improving generative model, generative model performance, test-time scaling

备注: 37 pages. Project: [this https URL](https://tinnerhrhe.github.io/evosearch)

点击查看摘要

Abstract:As the marginal cost of scaling computation (data and parameters) during model pre-training continues to increase substantially, test-time scaling (TTS) has emerged as a promising direction for improving generative model performance by allocating additional computation at inference time. While TTS has demonstrated significant success across multiple language tasks, there remains a notable gap in understanding the test-time scaling behaviors of image and video generative models (diffusion-based or flow-based models). Although recent works have initiated exploration into inference-time strategies for vision tasks, these approaches face critical limitations: being constrained to task-specific domains, exhibiting poor scalability, or falling into reward over-optimization that sacrifices sample diversity. In this paper, we propose \textbf{Evo}lutionary \textbf{Search} (EvoSearch), a novel, generalist, and efficient TTS method that effectively enhances the scalability of both image and video generation across diffusion and flow models, without requiring additional training or model expansion. EvoSearch reformulates test-time scaling for diffusion and flow models as an evolutionary search problem, leveraging principles from biological evolution to efficiently explore and refine the denoising trajectory. By incorporating carefully designed selection and mutation mechanisms tailored to the stochastic differential equation denoising process, EvoSearch iteratively generates higher-quality offspring while preserving population diversity. Through extensive evaluation across both diffusion and flow architectures for image and video generation tasks, we demonstrate that our method consistently outperforms existing approaches, achieves higher diversity, and shows strong generalizability to unseen evaluation metrics. Our project is available at the website this https URL.

89. 【2505.17614】PathoSCOPE: Few-Shot Pathology Detection via Self-Supervised Contrastive Learning and Pathology-Informed Synthetic Embeddings

链接https://arxiv.org/abs/2505.17614

作者:Sinchee Chin,Yinuo Ma,Xiaochen Yang,Jing-Hao Xue,Wenming Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:avoiding costly annotations, offering strong generalizability, Unsupervised pathology detection, pathology detection trains, detection trains models

备注

点击查看摘要

Abstract:Unsupervised pathology detection trains models on non-pathological data to flag deviations as pathologies, offering strong generalizability for identifying novel diseases and avoiding costly annotations. However, building reliable normality models requires vast healthy datasets, as hospitals' data is inherently biased toward symptomatic populations, while privacy regulations hinder the assembly of representative healthy cohorts. To address this limitation, we propose PathoSCOPE, a few-shot unsupervised pathology detection framework that requires only a small set of non-pathological samples (minimum 2 shots), significantly improving data efficiency. We introduce Global-Local Contrastive Loss (GLCL), comprised of a Local Contrastive Loss to reduce the variability of non-pathological embeddings and a Global Contrastive Loss to enhance the discrimination of pathological regions. We also propose a Pathology-informed Embedding Generation (PiEG) module that synthesizes pathological embeddings guided by the global loss, better exploiting the limited non-pathological samples. Evaluated on the BraTS2020 and ChestXray8 datasets, PathoSCOPE achieves state-of-the-art performance among unsupervised methods while maintaining computational efficiency (2.48 GFLOPs, 166 FPS).

90. 【2505.17613】MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

链接https://arxiv.org/abs/2505.17613

作者:Jihan Yao,Yushi Hu,Yujie Yi,Bin Han,Shangbin Feng,Guang Yang,Bingbing Wen,Ranjay Krishna,Lucy Lu Wang,Yulia Tsvetkov,Noah A. Smith,Banghua Zhu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Automatically evaluating multimodal, involve multiple modalities, Automatically evaluating, present significant challenges, evaluating multimodal generation

备注

点击查看摘要

Abstract:Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.

91. 【2505.17591】MinkUNeXt-SI: Improving point cloud-based place recognition including spherical coordinates and LiDAR intensity

链接https://arxiv.org/abs/2505.17591

作者:Judith Vilella-Cantos,Juan José Cabrera,Luis Payá,Mónica Ballesta,David Valiente

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:autonomous navigation systems, place recognition problem, navigation systems, safe functioning, autonomous navigation

备注

点击查看摘要

Abstract:In autonomous navigation systems, the solution of the place recognition problem is crucial for their safe functioning. But this is not a trivial solution, since it must be accurate regardless of any changes in the scene, such as seasonal changes and different weather conditions, and it must be generalizable to other environments. This paper presents our method, MinkUNeXt-SI, which, starting from a LiDAR point cloud, preprocesses the input data to obtain its spherical coordinates and intensity values normalized within a range of 0 to 1 for each point, and it produces a robust place recognition descriptor. To that end, a deep learning approach that combines Minkowski convolutions and a U-net architecture with skip connections is used. The results of MinkUNeXt-SI demonstrate that this method reaches and surpasses state-of-the-art performance while it also generalizes satisfactorily to other datasets. Additionally, we showcase the capture of a custom dataset and its use in evaluating our solution, which also achieves outstanding results. Both the code of our solution and the runs of our dataset are publicly available for reproducibility purposes.

92. 【2505.17590】CGS-GAN: 3D Consistent Gaussian Splatting GANs for High Resolution Human Head Synthesis

链接https://arxiv.org/abs/2505.17590

作者:Florian Barthel,Wieland Morgenstern,Paul Hinzer,Anna Hilsmann,Peter Eisert

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian splatting, Gaussian Splatting GAN, Recently, Gaussian, training

备注: Main paper 12 pages, supplementary materials 8 pages

点击查看摘要

Abstract:Recently, 3D GANs based on 3D Gaussian splatting have been proposed for high quality synthesis of human heads. However, existing methods stabilize training and enhance rendering quality from steep viewpoints by conditioning the random latent vector on the current camera position. This compromises 3D consistency, as we observe significant identity changes when re-synthesizing the 3D head with each camera shift. Conversely, fixing the camera to a single viewpoint yields high-quality renderings for that perspective but results in poor performance for novel views. Removing view-conditioning typically destabilizes GAN training, often causing the training to collapse. In response to these challenges, we introduce CGS-GAN, a novel 3D Gaussian Splatting GAN framework that enables stable training and high-quality 3D-consistent synthesis of human heads without relying on view-conditioning. To ensure training stability, we introduce a multi-view regularization technique that enhances generator convergence with minimal computational overhead. Additionally, we adapt the conditional loss used in existing 3D Gaussian splatting GANs and propose a generator architecture designed to not only stabilize training but also facilitate efficient rendering and straightforward scaling, enabling output resolutions up to $2048^2$. To evaluate the capabilities of CGS-GAN, we curate a new dataset derived from FFHQ. This dataset enables very high resolutions, focuses on larger portions of the human head, reduces view-dependent artifacts for improved 3D consistency, and excludes images where subjects are obscured by hands or other objects. As a result, our approach achieves very high rendering quality, supported by competitive FID scores, while ensuring consistent 3D scene generation. Check our our project page here: this https URL

93. 【2505.17581】MODEM: A Morton-Order Degradation Estimation Mechanism for Adverse Weather Image Recovery

链接https://arxiv.org/abs/2505.17581

作者:Hainuo Wang,Qiming Hu,Xiaojie Guo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Restoring images degraded, fine-grained rain streaks, versus widespread haze, significant challenge due, spatially heterogeneous nature

备注

点击查看摘要

Abstract:Restoring images degraded by adverse weather remains a significant challenge due to the highly non-uniform and spatially heterogeneous nature of weather-induced artifacts, e.g., fine-grained rain streaks versus widespread haze. Accurately estimating the underlying degradation can intuitively provide restoration models with more targeted and effective guidance, enabling adaptive processing strategies. To this end, we propose a Morton-Order Degradation Estimation Mechanism (MODEM) for adverse weather image restoration. Central to MODEM is the Morton-Order 2D-Selective-Scan Module (MOS2D), which integrates Morton-coded spatial ordering with selective state-space models to capture long-range dependencies while preserving local structural coherence. Complementing MOS2D, we introduce a Dual Degradation Estimation Module (DDEM) that disentangles and estimates both global and local degradation priors. These priors dynamically condition the MOS2D modules, facilitating adaptive and context-aware restoration. Extensive experiments and ablation studies demonstrate that MODEM achieves state-of-the-art results across multiple benchmarks and weather types, highlighting its effectiveness in modeling complex degradation dynamics. Our code will be released at this https URL.

94. 【2505.17574】InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO

链接https://arxiv.org/abs/2505.17574

作者:Xueji Fang,Liyuan Ma,Zhiyang Chen,Mingyuan Zhou,Guo-jun Qi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, high-quality videos depicting, videos depicting individual, depicting individual scenes, enabled the synthesis

备注: Preprint. Under review

点击查看摘要

Abstract:Recent advances in text-to-video generation, particularly with autoregressive models, have enabled the synthesis of high-quality videos depicting individual scenes. However, extending these models to generate long, cross-scene videos remains a significant challenge. As the context length grows during autoregressive decoding, computational costs rise sharply, and the model's ability to maintain consistency and adhere to evolving textual prompts deteriorates. We introduce InfLVG, an inference-time framework that enables coherent long video generation without requiring additional long-form video data. InfLVG leverages a learnable context selection policy, optimized via Group Relative Policy Optimization (GRPO), to dynamically identify and retain the most semantically relevant context throughout the generation process. Instead of accumulating the entire generation history, the policy ranks and selects the top-$K$ most contextually relevant tokens, allowing the model to maintain a fixed computational budget while preserving content consistency and prompt alignment. To optimize the policy, we design a hybrid reward function that jointly captures semantic alignment, cross-scene consistency, and artifact reduction. To benchmark performance, we introduce the Cross-scene Video Benchmark (CsVBench) along with an Event Prompt Set (EPS) that simulates complex multi-scene transitions involving shared subjects and varied actions/backgrounds. Experimental results show that InfLVG can extend video length by up to 9$\times$, achieving strong consistency and semantic fidelity across scenes. Our code is available at this https URL.

95. 【2505.17567】Enhancing Fourier-based Doppler Resolution with Diffusion Models

链接https://arxiv.org/abs/2505.17567

作者:Denisa Qosja,Kilian Barth,Simon Wagner

类目:Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

关键词:detecting slow-moving targets, radar systems, stationary objects, dimension is important, important for detecting

备注: Published at International Radar Symposium (IRS) 2025

点击查看摘要

Abstract:In radar systems, high resolution in the Doppler dimension is important for detecting slow-moving targets as it allows for more distinct separation between these targets and clutter, or stationary objects. However, achieving sufficient resolution is constrained by hardware capabilities and physical factors, leading to the development of processing techniques to enhance the resolution after acquisition. In this work, we leverage artificial intelligence to increase the Doppler resolution in range-Doppler maps. Based on a zero-padded FFT, a refinement via the generative neural networks of diffusion models is achieved. We demonstrate that our method overcomes the limitations of traditional FFT, generating data where closely spaced targets are effectively separated.

96. 【2505.17561】Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model

链接https://arxiv.org/abs/2505.17561

作者:Kwanyoung Kim,Sanghyun Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:initial noise significantly, noise significantly affects, Active Noise Selection, Bayesian Active Noise, Noise Selection

备注: 19 pages, 10 figures

点击查看摘要

Abstract:The choice of initial noise significantly affects the quality and prompt alignment of video diffusion models, where different noise seeds for the same prompt can lead to drastically different generations. While recent methods rely on externally designed priors such as frequency filters or inter-frame smoothing, they often overlook internal model signals that indicate which noise seeds are inherently preferable. To address this, we propose ANSE (Active Noise Selection for Generation), a model-aware framework that selects high-quality noise seeds by quantifying attention-based uncertainty. At its core is BANSA (Bayesian Active Noise Selection via Attention), an acquisition function that measures entropy disagreement across multiple stochastic attention samples to estimate model confidence and consistency. For efficient inference-time deployment, we introduce a Bernoulli-masked approximation of BANSA that enables score estimation using a single diffusion step and a subset of attention layers. Experiments on CogVideoX-2B and 5B demonstrate that ANSE improves video quality and temporal coherence with only an 8% and 13% increase in inference time, respectively, providing a principled and generalizable approach to noise selection in video diffusion. See our project page: this https URL

97. 【2505.17560】Deeper Diffusion Models Amplify Bias

链接https://arxiv.org/abs/2505.17560

作者:Shahin Hakemi,Naveed Akhtar,Ghulam Mubashar Hassan,Ajmal Mian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Diffusion Models, generative Diffusion Models, potentially problematic, impressive performance, internal working

备注

点击查看摘要

Abstract:Despite the impressive performance of generative Diffusion Models (DMs), their internal working is still not well understood, which is potentially problematic. This paper focuses on exploring the important notion of bias-variance tradeoff in diffusion models. Providing a systematic foundation for this exploration, it establishes that at one extreme the diffusion models may amplify the inherent bias in the training data and, on the other, they may compromise the presumed privacy of the training samples. Our exploration aligns with the memorization-generalization understanding of the generative models, but it also expands further along this spectrum beyond ``generalization'', revealing the risk of bias amplification in deeper models. Building on the insights, we also introduce a training-free method to improve output quality in text-to-image and image-to-image generation. By progressively encouraging temporary high variance in the generation process with partial bypassing of the mid-block's contribution in the denoising process of DMs, our method consistently improves generative image quality with zero training cost. Our claims are validated both theoretically and empirically.

98. 【2505.17556】Wildfire spread forecasting with Deep Learning

链接https://arxiv.org/abs/2505.17556

作者:Nikolaos Anastasiou,Spyros Kondylatos,Ioannis Papoutsis

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:effective risk management, strategic resource allocation, Accurate prediction, risk management, resource allocation

备注: 10 pages, 9 figures

点击查看摘要

Abstract:Accurate prediction of wildfire spread is crucial for effective risk management, emergency response, and strategic resource allocation. In this study, we present a deep learning (DL)-based framework for forecasting the final extent of burned areas, using data available at the time of ignition. We leverage a spatio-temporal dataset that covers the Mediterranean region from 2006 to 2022, incorporating remote sensing data, meteorological observations, vegetation maps, land cover classifications, anthropogenic factors, topography data, and thermal anomalies. To evaluate the influence of temporal context, we conduct an ablation study examining how the inclusion of pre- and post-ignition data affects model performance, benchmarking the temporal-aware DL models against a baseline trained exclusively on ignition-day inputs. Our results indicate that multi-day observational data substantially improve predictive accuracy. Particularly, the best-performing model, incorporating a temporal window of four days before to five days after ignition, improves both the F1 score and the Intersection over Union by almost 5% in comparison to the baseline on the test dataset. We publicly release our dataset and models to enhance research into data-driven approaches for wildfire modeling and response.

99. 【2505.17555】ProTAL: A Drag-and-Link Video Programming Framework for Temporal Action Localization

链接https://arxiv.org/abs/2505.17555

作者:Yuchen He,Jianbing Lv,Liqi Cheng,Lingyu Meng,Dazhen Deng,Yingcai Wu

类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

关键词:Temporal Action Localization, Action Localization, aims to detect, detect the start, start and end

备注: Accepted at CHI'25

点击查看摘要

Abstract:Temporal Action Localization (TAL) aims to detect the start and end timestamps of actions in a video. However, the training of TAL models requires a substantial amount of manually annotated data. Data programming is an efficient method to create training labels with a series of human-defined labeling functions. However, its application in TAL faces difficulties of defining complex actions in the context of temporal video frames. In this paper, we propose ProTAL, a drag-and-link video programming framework for TAL. ProTAL enables users to define \textbf{key events} by dragging nodes representing body parts and objects and linking them to constrain the relations (direction, distance, etc.). These definitions are used to generate action labels for large-scale unlabelled videos. A semi-supervised method is then employed to train TAL models with such labels. We demonstrate the effectiveness of ProTAL through a usage scenario and a user study, providing insights into designing video programming framework.

100. 【2505.17551】Center-aware Residual Anomaly Synthesis for Multi-class Industrial Anomaly Detection

链接https://arxiv.org/abs/2505.17551

作者:Qiyu Chen,Huiyuan Luo,Haiming Yao,Wei Luo,Zhen Qu,Chengkan Lv,Zhengtao Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Anomaly detection plays, plays a vital, vital role, multi-class anomaly detection, Anomaly detection

备注: Accepted by IEEE Transactions on Industrial Informatics (TII)

点击查看摘要

Abstract:Anomaly detection plays a vital role in the inspection of industrial images. Most existing methods require separate models for each category, resulting in multiplied deployment costs. This highlights the challenge of developing a unified model for multi-class anomaly detection. However, the significant increase in inter-class interference leads to severe missed detections. Furthermore, the intra-class overlap between normal and abnormal samples, particularly in synthesis-based methods, cannot be ignored and may lead to over-detection. To tackle these issues, we propose a novel Center-aware Residual Anomaly Synthesis (CRAS) method for multi-class anomaly detection. CRAS leverages center-aware residual learning to couple samples from different categories into a unified center, mitigating the effects of inter-class interference. To further reduce intra-class overlap, CRAS introduces distance-guided anomaly synthesis that adaptively adjusts noise variance based on normal data distribution. Experimental results on diverse datasets and real-world industrial applications demonstrate the superior detection accuracy and competitive inference speed of CRAS. The source code and the newly constructed dataset are publicly available at this https URL.

101. 【2505.17550】2VUnlearning: A Concept Erasing Method for Text-to-Video Diffusion Models

链接https://arxiv.org/abs/2505.17550

作者:Xiaoyu Ye,Songjie Cheng,Yongtao Wang,Yajiao Xiong,Yishen Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, generated videos, significantly enhanced, enhanced the quality, quality of generated

备注

点击查看摘要

Abstract:Recent advances in text-to-video (T2V) diffusion models have significantly enhanced the quality of generated videos. However, their ability to produce explicit or harmful content raises concerns about misuse and potential rights violations. Inspired by the success of unlearning techniques in erasing undesirable concepts from text-to-image (T2I) models, we extend unlearning to T2V models and propose a robust and precise unlearning method. Specifically, we adopt negatively-guided velocity prediction fine-tuning and enhance it with prompt augmentation to ensure robustness against LLM-refined prompts. To achieve precise unlearning, we incorporate a localization and a preservation regularization to preserve the model's ability to generate non-target concepts. Extensive experiments demonstrate that our method effectively erases a specific concept while preserving the model's generation capability for all other concepts, outperforming existing methods. We provide the unlearned models in \href{this https URL}{this https URL}.

102. 【2505.17540】RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning

链接https://arxiv.org/abs/2505.17540

作者:Mingrui Wu,Lu Wang,Pu Zhao,Fangkai Yang,Jianjin Zhang,Jianfeng Liu,Yuefeng Zhan,Weihao Han,Hao Sun,Jiayi Ji,Xiaoshuai Sun,Qingwei Lin,Weiwei Deng,Dongmei Zhang,Feng Sun,Qi Zhang,Rongrong Ji

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:faithfully capture user, capture user intentions, struggle to faithfully, faithfully capture, capture user

备注: Code is available at: [this https URL](https://github.com/microsoft/DKI_LLM/tree/main/RePrompt)

点击查看摘要

Abstract:Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. Inspired by recent advances in reasoning for language model, we propose RePrompt, a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Instead of relying on handcrafted rules or stylistic rewrites, our method trains a language model to generate structured, self-reflective prompts by optimizing for image-level outcomes. The tailored reward models assesse the generated images in terms of human preference, semantic alignment, and visual composition, providing indirect supervision to refine prompt generation. Our approach enables end-to-end training without human-annotated data. Experiments on GenEval and T2I-Compbench show that RePrompt significantly boosts spatial layout fidelity and compositional generalization across diverse T2I backbones, establishing new state-of-the-art results.

103. 【2505.17534】Co-Reinforcement Learning for Unified Multimodal Understanding and Generation

链接https://arxiv.org/abs/2505.17534

作者:Jingjing Jiang,Chongjie Si,Jun Luo,Hanwang Zhang,Chao Ma

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)

关键词:group relative policy, large language models, simultaneously reinforcing generation, multimodal large language, relative policy optimization

备注

点击查看摘要

Abstract:This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce \textbf{CoRL}, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, \textbf{ULM-R1}, achieves average improvements of \textbf{7%} on three text-to-image generation datasets and \textbf{23%} on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs.

104. 【2505.17529】Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding

链接https://arxiv.org/abs/2505.17529

作者:Yeongjae Cho,Keonwoo Kim,Taebaek Hwang,Sungzoon Cho

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Vision-Language Models, visual question answering, Recent advancements, advancements in Large, Large Vision-Language

备注

点击查看摘要

Abstract:Recent advancements in Large Vision-Language Models (LVLMs) have significantly expanded their utility in tasks like image captioning and visual question answering. However, they still struggle with object hallucination, where models generate descriptions that inaccurately reflect the visual content by including nonexistent objects or misrepresenting existing ones. While previous methods, such as data augmentation and training-free approaches, strive to tackle this issue, they still encounter scalability challenges and often depend on additional external modules. In this work, we propose Ensemble Decoding (ED), a novel strategy that splits the input image into sub-images and combines logit distributions by assigning weights through the attention map. Furthermore, we introduce ED adaptive plausibility constraint to calibrate logit distribution and FastED, a variant designed for speed-critical applications. Extensive experiments across hallucination benchmarks demonstrate that our proposed method achieves state-of-the-art performance, validating the effectiveness of our approach.

105. 【2505.17509】Enhancing Adversarial Robustness of Vision Language Models via Adversarial Mixture Prompt Tuning

链接https://arxiv.org/abs/2505.17509

作者:Shiji Zhao,Qihui Zhu,Shukun Xiong,Shouwei Ruan,Yize Fan,Ranjie Duan,Qing Guo,Xingxing Wei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Language Models, Large pre-trained Vision, pre-trained Vision Language, Vision Language, presenting potential security

备注

点击查看摘要

Abstract:Large pre-trained Vision Language Models (VLMs) have excellent generalization capabilities but are highly susceptible to adversarial examples, presenting potential security risks. To improve the robustness of VLMs against adversarial examples, adversarial prompt tuning methods are proposed to align the text feature with the adversarial image feature without changing model parameters. However, when facing various adversarial attacks, a single learnable text prompt has insufficient generalization to align well with all adversarial image features, which finally leads to the overfitting phenomenon. To address the above challenge, in this paper, we empirically find that increasing the number of learned prompts can bring more robustness improvement than a longer prompt. Then we propose an adversarial tuning method named Adversarial Mixture Prompt Tuning (AMPT) to enhance the generalization towards various adversarial attacks for VLMs. AMPT aims to learn mixture text prompts to obtain more robust text features. To further enhance the adaptability, we propose a conditional weight router based on the input adversarial image to predict the mixture weights of multiple learned prompts, which helps obtain sample-specific aggregated text features aligning with different adversarial image features. A series of experiments show that our method can achieve better adversarial robustness than state-of-the-art methods on 11 datasets under different experimental settings.

106. 【2505.17501】RoHyDR: Robust Hybrid Diffusion Recovery for Incomplete Multimodal Emotion Recognition

链接https://arxiv.org/abs/2505.17501

作者:Yuehan Jin,Xiaoqing Liu,Yiyuan Yang,Zhiwen Yu,Tong Zhang,Kaixiang Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Multimodal emotion recognition, emotion recognition analyzes, recognition analyzes emotions, emotion recognition, Incomplete Multimodal Emotion

备注

点击查看摘要

Abstract:Multimodal emotion recognition analyzes emotions by combining data from multiple sources. However, real-world noise or sensor failures often cause missing or corrupted data, creating the Incomplete Multimodal Emotion Recognition (IMER) challenge. In this paper, we propose Robust Hybrid Diffusion Recovery (RoHyDR), a novel framework that performs missing-modality recovery at unimodal, multimodal, feature, and semantic levels. For unimodal representation recovery of missing modalities, RoHyDR exploits a diffusion-based generator to generate distribution-consistent and semantically aligned representations from Gaussian noise, using available modalities as conditioning. For multimodal fusion recovery, we introduce adversarial learning to produce a realistic fused multimodal representation and recover missing semantic content. We further propose a multi-stage optimization strategy that enhances training stability and efficiency. In contrast to previous work, the hybrid diffusion and adversarial learning-based recovery mechanism in RoHyDR allows recovery of missing information in both unimodal representation and multimodal fusion, at both feature and semantic levels, effectively mitigating performance degradation caused by suboptimal optimization. Comprehensive experiments conducted on two widely used multimodal emotion recognition benchmarks demonstrate that our proposed method outperforms state-of-the-art IMER methods, achieving robust recognition performance under various missing-modality scenarios. Our code will be made publicly available upon acceptance.

107. 【2505.17493】Research on Defect Detection Method of Motor Control Board Based on Image Processing

链接https://arxiv.org/abs/2505.17493

作者:Jingde Huang,Zhangyu Huang,Chenyu Li,Jiantong Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:motor control board, motor control, control board, incorrect plug-in positions, control

备注

点击查看摘要

Abstract:The motor control board has various defects such as inconsistent color differences, incorrect plug-in positions, solder short circuits, and more. These defects directly affect the performance and stability of the motor control board, thereby having a negative impact on product quality. Therefore, studying the defect detection technology of the motor control board is an important means to improve the quality control level of the motor control board. Firstly, the processing methods of digital images about the motor control board were studied, and the noise suppression methods that affect image feature extraction were analyzed. Secondly, a specific model for defect feature extraction and color difference recognition of the tested motor control board was established, and qualified or defective products were determined based on feature thresholds. Thirdly, the search algorithm for defective images was optimized. Finally, comparative experiments were conducted on the typical motor control board, and the experimental results demonstrate that the accuracy of the motor control board defect detection model-based on image processing established in this paper reached over 99%. It is suitable for timely image processing of large quantities of motor control boards on the production line, and achieved efficient defect detection. The defect detection method can not only be used for online detection of the motor control board defects, but also provide solutions for the integrated circuit board defect processing for the industry.

108. 【2505.17476】he Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts

链接https://arxiv.org/abs/2505.17476

作者:Yuchen Zhang,Yaxiong Wang,Yujiao Wu,Lianwei Wu,Li Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:combating AI-generated disinformation, detection and grounding, grounding of multimedia, critical challenge, challenge in combating

备注

点击查看摘要

Abstract:The detection and grounding of multimedia manipulation has emerged as a critical challenge in combating AI-generated disinformation. While existing methods have made progress in recent years, we identify two fundamental limitations in current approaches: (1) Underestimation of MLLM-driven deception risk: prevailing techniques primarily address rule-based text manipulations, yet fail to account for sophisticated misinformation synthesized by multimodal large language models (MLLMs) that can dynamically generate semantically coherent, contextually plausible yet deceptive narratives conditioned on manipulated images; (2) Unrealistic misalignment artifacts: currently focused scenarios rely on artificially misaligned content that lacks semantic coherence, rendering them easily detectable. To address these gaps holistically, we propose a new adversarial pipeline that leverages MLLMs to generate high-risk disinformation. Our approach begins with constructing the MLLM-Driven Synthetic Multimodal (MDSM) dataset, where images are first altered using state-of-the-art editing techniques and then paired with MLLM-generated deceptive texts that maintain semantic consistency with the visual manipulations. Building upon this foundation, we present the Artifact-aware Manipulation Diagnosis via MLLM (AMD) framework featuring two key innovations: Artifact Pre-perception Encoding strategy and Manipulation-Oriented Reasoning, to tame MLLMs for the MDSM problem. Comprehensive experiments validate our framework's superior generalization capabilities as a unified architecture for detecting MLLM-powered multimodal deceptions.

109. 【2505.17475】PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation

链接https://arxiv.org/abs/2505.17475

作者:Uyoung Jeong,Jonathan Freer,Seungryul Baek,Hyung Jin Chang,Kwang In Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:study multi-dataset training, skeletal heterogeneity presents, multi-dataset training, study multi-dataset, presents a unique

备注: accepted to CVPR 2025

点击查看摘要

Abstract:We study multi-dataset training (MDT) for pose estimation, where skeletal heterogeneity presents a unique challenge that existing methods have yet to address. In traditional domains, \eg regression and classification, MDT typically relies on dataset merging or multi-head supervision. However, the diversity of skeleton types and limited cross-dataset supervision complicate integration in pose estimation. To address these challenges, we introduce PoseBH, a new MDT framework that tackles keypoint heterogeneity and limited supervision through two key techniques. First, we propose nonparametric keypoint prototypes that learn within a unified embedding space, enabling seamless integration across skeleton types. Second, we develop a cross-type self-supervision mechanism that aligns keypoint predictions with keypoint embedding prototypes, providing supervision without relying on teacher-student models or additional augmentations. PoseBH substantially improves generalization across whole-body and animal pose datasets, including COCO-WholeBody, AP-10K, and APT-36K, while preserving performance on standard human pose benchmarks (COCO, MPII, and AIC). Furthermore, our learned keypoint embeddings transfer effectively to hand shape estimation (InterHand2.6M) and human body shape estimation (3DPW). The code for PoseBH is available at: this https URL.

110. 【2505.17473】OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics

链接https://arxiv.org/abs/2505.17473

作者:Jiangning Zhu,Yuxing Zhou,Zheng Wang,Juntao Yao,Yima Gu,Yuhui Yuan,Shixia Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:chart understanding capabilities, communication contexts, increasingly critical, central role, capabilities of vision-language

备注

点击查看摘要

Abstract:Given the central role of charts in scientific, business, and communication contexts, enhancing the chart understanding capabilities of vision-language models (VLMs) has become increasingly critical. A key limitation of existing VLMs lies in their inaccurate visual grounding of infographic elements, including charts and human-recognizable objects (HROs) such as icons and images. However, chart understanding often requires identifying relevant elements and reasoning over them. To address this limitation, we introduce OrionBench, a benchmark designed to support the development of accurate object detection models for charts and HROs in infographics. It contains 26,250 real and 78,750 synthetic infographics, with over 6.9 million bounding box annotations. These annotations are created by combining the model-in-the-loop and programmatic methods. We demonstrate the usefulness of OrionBench through three applications: 1) constructing a Thinking-with-Boxes scheme to boost the chart understanding performance of VLMs, 2) comparing existing object detection models, and 3) applying the developed detection model to document layout and UI element detection.

111. 【2505.17461】Diagnosing Vision Language Models' Perception by Leveraging Human Methods for Color Vision Deficiencies

链接https://arxiv.org/abs/2505.17461

作者:Kazuki Hayashi,Shintaro Ozaki,Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Large-scale Vision Language, Large-scale Vision, real-world multimodal applications, involving complex visual, linguistic reasoning

备注

点击查看摘要

Abstract:Large-scale Vision Language Models (LVLMs) are increasingly being applied to a wide range of real-world multimodal applications, involving complex visual and linguistic reasoning. As these models become more integrated into practical use, they are expected to handle complex aspects of human interaction. Among these, color perception is a fundamental yet highly variable aspect of visual understanding. It differs across individuals due to biological factors such as Color Vision Deficiencies (CVDs), as well as differences in culture and language. Despite its importance, perceptual diversity has received limited attention. In our study, we evaluate LVLMs' ability to account for individual level perceptual variation using the Ishihara Test, a widely used method for detecting CVDs. Our results show that LVLMs can explain CVDs in natural language, but they cannot simulate how people with CVDs perceive color in image based tasks. These findings highlight the need for multimodal systems that can account for color perceptual diversity and support broader discussions on perceptual inclusiveness and fairness in multimodal AI.

112. 【2505.17457】Graph Mamba for Efficient Whole Slide Image Understanding

链接https://arxiv.org/abs/2505.17457

作者:Jiaxuan Lu,Junyan Shi,Yuhui Lin,Fang Yan,Yue Gao,Shaoting Zhang,Xiaosong Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:complex tile relationships, Slide Images, large-scale medical image, medical image analysis, image analysis due

备注

点击查看摘要

Abstract:Whole Slide Images (WSIs) in histopathology present a significant challenge for large-scale medical image analysis due to their high resolution, large size, and complex tile relationships. Existing Multiple Instance Learning (MIL) methods, such as Graph Neural Networks (GNNs) and Transformer-based models, face limitations in scalability and computational cost. To bridge this gap, we propose the WSI-GMamba framework, which synergistically combines the relational modeling strengths of GNNs with the efficiency of Mamba, the State Space Model designed for sequence learning. The proposed GMamba block integrates Message Passing, Graph Scanning Flattening, and feature aggregation via a Bidirectional State Space Model (Bi-SSM), achieving Transformer-level performance with 7* fewer FLOPs. By leveraging the complementary strengths of lightweight GNNs and Mamba, the WSI-GMamba framework delivers a scalable solution for large-scale WSI analysis, offering both high accuracy and computational efficiency for slide-level classification.

113. 【2505.17449】Real-time Traffic Accident Anticipation with Feature Reuse

链接https://arxiv.org/abs/2505.17449

作者:Inpyo Song,Jangwon Lee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:anticipating traffic accidents, addresses the problem, problem of anticipating, anticipating traffic, aims to forecast

备注: Accepted to ICIP 2025

点击查看摘要

Abstract:This paper addresses the problem of anticipating traffic accidents, which aims to forecast potential accidents before they happen. Real-time anticipation is crucial for safe autonomous driving, yet most methods rely on computationally heavy modules like optical flow and intermediate feature extractors, making real-world deployment challenging. In this paper, we thus introduce RARE (Real-time Accident anticipation with Reused Embeddings), a lightweight framework that capitalizes on intermediate features from a single pre-trained object detector. By eliminating additional feature-extraction pipelines, RARE significantly reduces latency. Furthermore, we introduce a novel Attention Score Ranking Loss, which prioritizes higher attention on accident-related objects over non-relevant ones. This loss enhances both accuracy and interpretability. RARE demonstrates a 4-8 times speedup over existing approaches on the DAD and CCD benchmarks, achieving a latency of 13.6ms per frame (73.3 FPS) on an RTX 6000. Moreover, despite its reduced complexity, it attains state-of-the-art Average Precision and reliably anticipates imminent collisions in real time. These results highlight RARE's potential for safety-critical applications where timely and explainable anticipation is essential.

114. 【2505.17448】Baitradar: A Multi-Model Clickbait Detection Algorithm Using Deep Learning

链接https://arxiv.org/abs/2505.17448

作者:Bhanuka Gamage,Adnan Labib,Aisha Joomun,Chern Hong Lim,KokSheik Wong

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:platform called clickbait, rising popularity, emerging problem, provokes users, called clickbait

备注: Appear in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'21), Toronto, ON, Canada

点击查看摘要

Abstract:Following the rising popularity of YouTube, there is an emerging problem on this platform called clickbait, which provokes users to click on videos using attractive titles and thumbnails. As a result, users ended up watching a video that does not have the content as publicized in the title. This issue is addressed in this study by proposing an algorithm called BaitRadar, which uses a deep learning technique where six inference models are jointly consulted to make the final classification decision. These models focus on different attributes of the video, including title, comments, thumbnail, tags, video statistics and audio transcript. The final classification is attained by computing the average of multiple models to provide a robust and accurate output even in situation where there is missing data. The proposed method is tested on 1,400 YouTube videos. On average, a test accuracy of 98% is achieved with an inference time of less than 2s.

115. 【2505.17445】PawPrint: Whose Footprints Are These? Identifying Animal Individuals by Their Footprints

链接https://arxiv.org/abs/2505.17445

作者:Inpyo Song,Hyemin Hwang,Jangwon Lee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:United States, ownership has reached, rise annually, households and continues, continues to rise

备注: Accepted to ICIP 2025

点击查看摘要

Abstract:In the United States, as of 2023, pet ownership has reached 66% of households and continues to rise annually. This trend underscores the critical need for effective pet identification and monitoring methods, particularly as nearly 10 million cats and dogs are reported stolen or lost each year. However, traditional methods for finding lost animals like GPS tags or ID photos have limitations-they can be removed, face signal issues, and depend on someone finding and reporting the pet. To address these limitations, we introduce PawPrint and PawPrint+, the first publicly available datasets focused on individual-level footprint identification for dogs and cats. Through comprehensive benchmarking of both modern deep neural networks (e.g., CNN, Transformers) and classical local features, we observe varying advantages and drawbacks depending on substrate complexity and data availability. These insights suggest future directions for combining learned global representations with local descriptors to enhance reliability across diverse, real-world conditions. As this approach provides a non-invasive alternative to traditional ID tags, we anticipate promising applications in ethical pet management and wildlife conservation efforts.

116. 【2505.17442】Reflectance Prediction-based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds

链接https://arxiv.org/abs/2505.17442

作者:Hao Jing,Anhong Wang,Yifan Zhang,Donghan Bu,Junhui Hou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:facilitating real-time collaborative, real-time collaborative perception, intelligent transportation systems, compressed point clouds, vehicle networking

备注

点击查看摘要

Abstract:Regarding intelligent transportation systems for vehicle networking, low-bitrate transmission via lossy point cloud compression is vital for facilitating real-time collaborative perception among vehicles with restricted bandwidth. In existing compression transmission systems, the sender lossily compresses point coordinates and reflectance to generate a transmission code stream, which faces transmission burdens from reflectance encoding and limited detection robustness due to information loss. To address these issues, this paper proposes a 3D object detection framework with reflectance prediction-based knowledge distillation (RPKD). We compress point coordinates while discarding reflectance during low-bitrate transmission, and feed the decoded non-reflectance compressed point clouds into a student detector. The discarded reflectance is then reconstructed by a geometry-based reflectance prediction (RP) module within the student detector for precise detection. A teacher detector with the same structure as student detector is designed for performing reflectance knowledge distillation (RKD) and detection knowledge distillation (DKD) from raw to compressed point clouds. Our RPKD framework jointly trains detectors on both raw and compressed point clouds to improve the student detector's robustness. Experimental results on the KITTI dataset and Waymo Open Dataset demonstrate that our method can boost detection accuracy for compressed point clouds across multiple code rates. Notably, at a low code rate of 2.146 Bpp on the KITTI dataset, our RPKD-PV achieves the highest mAP of 73.6, outperforming existing detection methods with the PV-RCNN baseline.

117. 【2505.17440】VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models

链接https://arxiv.org/abs/2505.17440

作者:Hefei Mei,Zirui Wang,Shen You,Minjing Dong,Chang Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:significant robustness concerns, demonstrated remarkable capabilities, raises significant robustness, Large Vision-Language Models, attacks raises significant

备注

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding and generation, yet their vulnerability to adversarial attacks raises significant robustness concerns. While existing effective attacks always focus on task-specific white-box settings, these approaches are limited in the context of LVLMs, which are designed for diverse downstream tasks and require expensive full-model gradient computations. Motivated by the pivotal role and wide adoption of the vision encoder in LVLMs, we propose a simple yet effective Vision Encoder Attack (VEAttack), which targets the vision encoder of LVLMs only. Specifically, we propose to generate adversarial examples by minimizing the cosine similarity between the clean and perturbed visual features, without accessing the following large language models, task information, and labels. It significantly reduces the computational overhead while eliminating the task and label dependence of traditional white-box attacks in LVLMs. To make this simple attack effective, we propose to perturb images by optimizing image tokens instead of the classification token. We provide both empirical and theoretical evidence that VEAttack can easily generalize to various tasks. VEAttack has achieved a performance degradation of 94.5% on image caption task and 75.7% on visual question answering task. We also reveal some key observations to provide insights into LVLM attack/defense: 1) hidden layer variations of LLM, 2) token attention differential, 3) Möbius band in transfer attack, 4) low sensitivity to attack steps. The code is available at this https URL

118. 【2505.17437】Learning Generalized and Flexible Trajectory Models from Omni-Semantic Supervision

链接https://arxiv.org/abs/2505.17437

作者:Yuanshao Zhu,James Jianqiao Yu,Xiangyu Zhao,Xiao Han,Qidong Liu,Xuetao Wei,Yuxuan Liang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:spatio-temporal data mining, presenting significant challenges, data collection technologies, trajectory retrieval, presenting significant

备注: Accepted as a full paper by KDD'25 - Research Track

点击查看摘要

Abstract:The widespread adoption of mobile devices and data collection technologies has led to an exponential increase in trajectory data, presenting significant challenges in spatio-temporal data mining, particularly for efficient and accurate trajectory retrieval. However, existing methods for trajectory retrieval face notable limitations, including inefficiencies in large-scale data, lack of support for condition-based queries, and reliance on trajectory similarity measures. To address the above challenges, we propose OmniTraj, a generalized and flexible omni-semantic trajectory retrieval framework that integrates four complementary modalities or semantics -- raw trajectories, topology, road segments, and regions -- into a unified system. Unlike traditional approaches that are limited to computing and processing trajectories as a single modality, OmniTraj designs dedicated encoders for each modality, which are embedded and fused into a shared representation space. This design enables OmniTraj to support accurate and flexible queries based on any individual modality or combination thereof, overcoming the rigidity of traditional similarity-based methods. Extensive experiments on two real-world datasets demonstrate the effectiveness of OmniTraj in handling large-scale data, providing flexible, multi-modality queries, and supporting downstream tasks and applications.

119. 【2505.17425】Debiasing CLIP: Interpreting and Correcting Bias in Attention Heads

链接https://arxiv.org/abs/2505.17425

作者:Wei Jie Yeo,Rui Mao,Moloud Abdar,Erik Cambria,Ranjan Satapathy

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Multimodal models, gained significant attention, significant attention due, remarkable zero-shot performance, gained significant

备注: Under review

点击查看摘要

Abstract:Multimodal models like CLIP have gained significant attention due to their remarkable zero-shot performance across various tasks. However, studies have revealed that CLIP can inadvertently learn spurious associations between target variables and confounding factors. To address this, we introduce \textsc{Locate-Then-Correct} (LTC), a contrastive framework that identifies spurious attention heads in Vision Transformers via mechanistic insights and mitigates them through targeted ablation. Furthermore, LTC identifies salient, task-relevant attention heads, enabling the integration of discriminative features through orthogonal projection to improve classification performance. We evaluate LTC on benchmarks with inherent background and gender biases, achieving over a $50\%$ gain in worst-group accuracy compared to non-training post-hoc baselines. Additionally, we visualize the representation of selected heads and find that the presented interpretation corroborates our contrastive mechanism for identifying both spurious and salient attention heads. Code available at this https URL.

120. 【2505.17423】VIBE: Video-to-Text Information Bottleneck Evaluation for TL;DR

链接https://arxiv.org/abs/2505.17423

作者:Shenghui Chen,Po-han Li,Sandeep Chichali,Ufuk Topcu

类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Theory (cs.IT)

关键词:require human supervision, efficiency matter, Information Bottleneck Evaluation, human supervision, VLM

备注

点击查看摘要

Abstract:Many decision-making tasks, where both accuracy and efficiency matter, still require human supervision. For example, tasks like traffic officers reviewing hour-long dashcam footage or researchers screening conference videos can benefit from concise summaries that reduce cognitive load and save time. Yet current vision-language models (VLMs) often produce verbose, redundant outputs that hinder task performance. Existing video caption evaluation depends on costly human annotations and overlooks the summaries' utility in downstream tasks. We address these gaps with Video-to-text Information Bottleneck Evaluation (VIBE), an annotation-free method that scores VLM outputs using two metrics: grounding (how well the summary aligns with visual content) and utility (how informative it is for the task). VIBE selects from randomly sampled VLM outputs by ranking them according to the two scores to support effective human decision-making. Human studies on LearningPaper24, SUTD-TrafficQA, and LongVideoBench show that summaries selected by VIBE consistently improve performance-boosting task accuracy by up to 61.23% and reducing response time by 75.77% compared to naive VLM summaries or raw video.

121. 【2505.17412】Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

链接https://arxiv.org/abs/2505.17412

作者:Shuang Wu,Youtian Lin,Feihu Zhang,Yifei Zeng,Yikang Yang,Yajie Bao,Jiachen Qian,Siyu Zhu,Philip Torr,Xun Cao,Yao Yao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Signed Distance Functions, Distance Functions presents, Functions presents substantial, Signed Distance, Distance Functions

备注: Project page: [this https URL](https://nju3dv.github.io/projects/Direct3D-S2/)

点击查看摘要

Abstract:Generating high resolution 3D shapes using volumetric representations such as Signed Distance Functions presents substantial computational and memory challenges. We introduce Direct3D S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention mechanism, which greatly enhances the efficiency of Diffusion Transformer computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, significantly reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared to previous methods with heterogeneous representations in 3D VAE, this unified design significantly improves training efficiency and stability. Our model is trained on public available datasets, and experiments demonstrate that Direct3D S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024 resolution using only 8 GPUs, a task typically requiring at least 32 GPUs for volumetric representations at 256 resolution, thus making gigascale 3D generation both practical and accessible. Project page: this https URL.

122. 【2505.17402】From Flight to Insight: Semantic 3D Reconstruction for Aerial Inspection via Gaussian Splatting and Language-Guided Segmentation

链接https://arxiv.org/abs/2505.17402

作者:Mahmoud Chick Zaouali,Todd Charter,Homayoun Najjaran

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:structural assessment, infrastructure monitoring, environmental surveying, aerial inspection tasks, High-fidelity

备注

点击查看摘要

Abstract:High-fidelity 3D reconstruction is critical for aerial inspection tasks such as infrastructure monitoring, structural assessment, and environmental surveying. While traditional photogrammetry techniques enable geometric modeling, they lack semantic interpretability, limiting their effectiveness for automated inspection workflows. Recent advances in neural rendering and 3D Gaussian Splatting (3DGS) offer efficient, photorealistic reconstructions but similarly lack scene-level understanding. In this work, we present a UAV-based pipeline that extends Feature-3DGS for language-guided 3D segmentation. We leverage LSeg-based feature fields with CLIP embeddings to generate heatmaps in response to language prompts. These are thresholded to produce rough segmentations, and the highest-scoring point is then used as a prompt to SAM or SAM2 for refined 2D segmentation on novel view renderings. Our results highlight the strengths and limitations of various feature field backbones (CLIP-LSeg, SAM, SAM2) in capturing meaningful structure in large-scale outdoor environments. We demonstrate that this hybrid approach enables flexible, language-driven interaction with photorealistic 3D reconstructions, opening new possibilities for semantic aerial inspection and scene understanding.

Subjects:

Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Cite as:
arXiv:2505.17402 [cs.GR]

(or
arXiv:2505.17402v1 [cs.GR] for this version)

https://doi.org/10.48550/arXiv.2505.17402

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
123. 【2505.17395】Wildfire Detection Using Vision Transformer with the Wildfire Dataset

链接https://arxiv.org/abs/2505.17395

作者:Gowtham Raj Vuppari,Navarun Gupta,Ahmed El-Sayed,Xingguo Xiong

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:sophisticated detection techniques, rising frequency, frequency and intensity, Los Angeles wildfires, California

备注: Published at ASEE NE 2025

点击查看摘要

Abstract:The critical need for sophisticated detection techniques has been highlighted by the rising frequency and intensity of wildfires in the US, especially in California. In 2023, wildfires caused 130 deaths nationwide, the highest since 1990. In January 2025, Los Angeles wildfires which included the Palisades and Eaton fires burnt approximately 40,000 acres and 12,000 buildings, and caused loss of human lives. The devastation underscores the urgent need for effective detection and prevention strategies. Deep learning models, such as Vision Transformers (ViTs), can enhance early detection by processing complex image data with high accuracy. However, wildfire detection faces challenges, including the availability of high-quality, real-time data. Wildfires often occur in remote areas with limited sensor coverage, and environmental factors like smoke and cloud cover can hinder detection. Additionally, training deep learning models is computationally expensive, and issues like false positives/negatives and scaling remain concerns. Integrating detection systems with real-time alert mechanisms also poses difficulties. In this work, we used the wildfire dataset consisting of 10.74 GB high-resolution images categorized into 'fire' and 'nofire' classes is used for training the ViT model. To prepare the data, images are resized to 224 x 224 pixels, converted into tensor format, and normalized using ImageNet statistics.

124. 【2505.17392】Dual-sensing driving detection model

链接https://arxiv.org/abs/2505.17392

作者:Leon C.C.K,Zeng Hui

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:combining computer vision, method combining computer, detection method combining, dual-sensing driver fatigue, combining computer

备注: 19 pages

点击查看摘要

Abstract:In this paper, a novel dual-sensing driver fatigue detection method combining computer vision and physiological signal analysis is proposed. The system exploits the complementary advantages of the two sensing modalities and breaks through the limitations of existing single-modality methods. We introduce an innovative architecture that combines real-time facial feature analysis with physiological signal processing, combined with advanced fusion strategies, for robust fatigue detection. The system is designed to run efficiently on existing hardware while maintaining high accuracy and reliability. Through comprehensive experiments, we demonstrate that our method outperforms traditional methods in both controlled environments and real-world conditions, while maintaining high accuracy. The practical applicability of the system has been verified through extensive tests in various driving scenarios and shows great potential in reducing fatigue-related accidents. This study contributes to the field by providing a more reliable, cost-effective, and humane solution for driver fatigue detection.

125. 【2505.17384】Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling

链接https://arxiv.org/abs/2505.17384

作者:Tianyu Xie,Shuchen Xue,Zijin Feng,Tianyang Hu,Jiacheng Sun,Zhenguo Li,Cheng Zhang

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

关键词:recently shown great, shown great promise, Discrete diffusion, Autoencoding Discrete Diffusion, offering a compelling

备注: 23 pages, 14 figures

点击查看摘要

Abstract:Discrete diffusion models have recently shown great promise for modeling complex discrete data, with masked diffusion models (MDMs) offering a compelling trade-off between quality and generation speed. MDMs denoise by progressively unmasking multiple dimensions from an all-masked input, but their performance can degrade when using few denoising steps due to limited modeling of inter-dimensional dependencies. In this paper, we propose Variational Autoencoding Discrete Diffusion (VADD), a novel framework that enhances discrete diffusion with latent variable modeling to implicitly capture correlations among dimensions. By introducing an auxiliary recognition model, VADD enables stable training via variational lower bounds maximization and amortized inference over the training set. Our approach retains the efficiency of traditional MDMs while significantly improving sample quality, especially when the number of denoising steps is small. Empirical results on 2D toy data, pixel-level image generation, and text generation demonstrate that VADD consistently outperforms MDM baselines.

126. 【2505.17367】EVM-Fusion: An Explainable Vision Mamba Architecture with Neural Algorithmic Fusion

链接https://arxiv.org/abs/2505.17367

作者:Zichuan Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:generalizability remain challenging, Explainable Vision Mamba, Medical image classification, Medical image, multi-organ medical image

备注: 16 pages, 4 figures

点击查看摘要

Abstract:Medical image classification is critical for clinical decision-making, yet demands for accuracy, interpretability, and generalizability remain challenging. This paper introduces EVM-Fusion, an Explainable Vision Mamba architecture featuring a novel Neural Algorithmic Fusion (NAF) mechanism for multi-organ medical image classification. EVM-Fusion leverages a multipath design, where DenseNet and U-Net based pathways, enhanced by Vision Mamba (Vim) modules, operate in parallel with a traditional feature pathway. These diverse features are dynamically integrated via a two-stage fusion process: cross-modal attention followed by the iterative NAF block, which learns an adaptive fusion algorithm. Intrinsic explainability is embedded through path-specific spatial attention, Vim {\Delta}-value maps, traditional feature SE-attention, and cross-modal attention weights. Experiments on a diverse 9-class multi-organ medical image dataset demonstrate EVM-Fusion's strong classification performance, achieving 99.75% test accuracy and provide multi-faceted insights into its decision-making process, highlighting its potential for trustworthy AI in medical diagnostics.

127. 【2505.17364】Optimizing YOLOv8 for Parking Space Detection: Comparative Analysis of Custom YOLOv8 Architecture

链接https://arxiv.org/abs/2505.17364

作者:Apar Pokhrel,Gia Dao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:parking management systems, intelligent parking management, management systems, critical component, development of intelligent

备注: 9 pages

点击查看摘要

Abstract:Parking space occupancy detection is a critical component in the development of intelligent parking management systems. Traditional object detection approaches, such as YOLOv8, provide fast and accurate vehicle detection across parking lots but can struggle with borderline cases, such as partially visible vehicles, small vehicles (e.g., motorcycles), and poor lighting conditions. In this work, we perform a comprehensive comparative analysis of customized backbone architectures integrated with YOLOv8. Specifically, we evaluate various backbones -- ResNet-18, VGG16, EfficientNetV2, Ghost -- on the PKLot dataset in terms of detection accuracy and computational efficiency. Experimental results highlight each architecture's strengths and trade-offs, providing insight into selecting suitable models for parking occupancy.

128. 【2505.17363】Are GNNs Worth the Effort for IoT Botnet Detection? A Comparative Study of VAE-GNN vs. ViT-MLP and VAE-MLP Approaches

链接https://arxiv.org/abs/2505.17363

作者:Hassan Wasswa,Hussein Abbass,Timothy Lynar

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Graph Neural Networks, Graph Convolutional Networks, Graph Attention Networks, enhance IoT security, VAE encoder

备注

点击查看摘要

Abstract:Due to the exponential rise in IoT-based botnet attacks, researchers have explored various advanced techniques for both dimensionality reduction and attack detection to enhance IoT security. Among these, Variational Autoencoders (VAE), Vision Transformers (ViT), and Graph Neural Networks (GNN), including Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), have garnered significant research attention in the domain of attack detection. This study evaluates the effectiveness of four state-of-the-art deep learning architectures for IoT botnet detection: a VAE encoder with a Multi-Layer Perceptron (MLP), a VAE encoder with a GCN, a VAE encoder with a GAT, and a ViT encoder with an MLP. The evaluation is conducted on a widely studied IoT benchmark dataset--the N-BaIoT dataset for both binary and multiclass tasks. For the binary classification task, all models achieved over 99.93% in accuracy, recall, precision, and F1-score, with no notable differences in performance. In contrast, for the multiclass classification task, GNN-based models showed significantly lower performance compared to VAE-MLP and ViT-MLP, with accuracies of 86.42%, 89.46%, 99.72%, and 98.38% for VAE-GCN, VAE-GAT, VAE-MLP, and ViT-MLP, respectively.

129. 【2505.17358】Repurposing Marigold for Zero-Shot Metric Depth Estimation via Defocus Blur Cues

链接https://arxiv.org/abs/2505.17358

作者:Chinmay Talegaonkar,Nikhil Gandudi Suresh,Zachary Novack,Yash Belhe,Priyanka Nagasamudra,Nicholas Antipa

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:made notable progress, Recent monocular metric, Recent monocular, made notable, notable progress

备注

点击查看摘要

Abstract:Recent monocular metric depth estimation (MMDE) methods have made notable progress towards zero-shot generalization. However, they still exhibit a significant performance drop on out-of-distribution datasets. We address this limitation by injecting defocus blur cues at inference time into Marigold, a \textit{pre-trained} diffusion model for zero-shot, scale-invariant monocular depth estimation (MDE). Our method effectively turns Marigold into a metric depth predictor in a training-free manner. To incorporate defocus cues, we capture two images with a small and a large aperture from the same viewpoint. To recover metric depth, we then optimize the metric depth scaling parameters and the noise latents of Marigold at inference time using gradients from a loss function based on the defocus-blur image formation model. We compare our method against existing state-of-the-art zero-shot MMDE methods on a self-collected real dataset, showing quantitative and qualitative improvements.

130. 【2505.17357】Graph Attention Neural Network for Botnet Detection: Evaluating Autoencoder, VAE and PCA-Based Dimension Reduction

链接https://arxiv.org/abs/2505.17357

作者:Hassan Wasswa,Hussein Abbass,Timothy Lynar

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:including traditional machine, traditional machine learning, researchers have explored, including traditional, hybrid approaches

备注

点击查看摘要

Abstract:With the rise of IoT-based botnet attacks, researchers have explored various learning models for detection, including traditional machine learning, deep learning, and hybrid approaches. A key advancement involves deploying attention mechanisms to capture long-term dependencies among features, significantly improving detection accuracy. However, most models treat attack instances independently, overlooking inter-instance relationships. Graph Neural Networks (GNNs) address this limitation by learning an embedding space via iterative message passing where similar instances are placed closer based on node features and relationships, enhancing classification performance. To further improve detection, attention mechanisms have been embedded within GNNs, leveraging both long-range dependencies and inter-instance connections. However, transforming the high dimensional IoT attack datasets into a graph structured dataset poses challenges, such as large graph structures leading computational overhead. To mitigate this, this paper proposes a framework that first reduces dimensionality of the NetFlow-based IoT attack dataset before transforming it into a graph dataset. We evaluate three dimension reduction techniques--Variational Autoencoder (VAE-encoder), classical autoencoder (AE-encoder), and Principal Component Analysis (PCA)--and compare their effects on a Graph Attention neural network (GAT) model for botnet attack detection

131. 【2505.17353】Dual Ascent Diffusion for Inverse Problems

链接https://arxiv.org/abs/2505.17353

作者:Minseo Kim,Axel Levy,Gordon Wetzstein

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

关键词:Ill-posed inverse problems, Ill-posed inverse, ranging from astrophysics, medical imaging, astrophysics to medical

备注: 23 pages, 15 figures, 5 tables

点击查看摘要

Abstract:Ill-posed inverse problems are fundamental in many domains, ranging from astrophysics to medical imaging. Emerging diffusion models provide a powerful prior for solving these problems. Existing maximum-a-posteriori (MAP) or posterior sampling approaches, however, rely on different computational approximations, leading to inaccurate or suboptimal samples. To address this issue, we introduce a new approach to solving MAP problems with diffusion model priors using a dual ascent optimization framework. Our framework achieves better image quality as measured by various metrics for image restoration problems, it is more robust to high levels of measurement noise, it is faster, and it estimates solutions that represent the observations more faithfully than the state of the art.

132. 【2505.17352】Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey

链接https://arxiv.org/abs/2505.17352

作者:Preeti Lamba,Kiran Ravish,Ankita Kushwaha,Pawan Kumar

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Diffusion models, safety constraints remains, leading generative models, align diffusion models, emerged as leading

备注

点击查看摘要

Abstract:Diffusion models have emerged as leading generative models for images and other modalities, but aligning their outputs with human preferences and safety constraints remains a critical challenge. This thesis proposal investigates methods to align diffusion models using reinforcement learning (RL) and reward modeling. We survey recent advances in fine-tuning text-to-image diffusion models with human feedback, including reinforcement learning from human and AI feedback, direct preference optimization, and differentiable reward approaches. We classify these methods based on the type of feedback (human, automated, binary or ranked preferences), the fine-tuning technique (policy gradient, reward-weighted likelihood, direct backpropagation, etc.), and their efficiency and safety outcomes. We compare key algorithms and frameworks, highlighting how they improve alignment with user intent or safety standards, and discuss inter-relationships such as how newer methods build on or diverge from earlier ones. Based on the survey, we identify five promising research directions for the next two years: (1) multi-objective alignment with combined rewards, (2) efficient human feedback usage and active learning, (3) robust safety alignment against adversarial inputs, (4) continual and online alignment of diffusion models, and (5) interpretable and trustworthy reward modeling for generative images. Each direction is elaborated with its problem statement, challenges, related work, and a proposed research plan. The proposal is organized as a comprehensive document with literature review, comparative tables of methods, and detailed research plans, aiming to contribute new insights and techniques for safer and value-aligned diffusion-based generative AI.

133. 【2505.17343】Ocular Authentication: Fusion of Gaze and Periocular Modalities

链接https://arxiv.org/abs/2505.17343

作者:Dillon Lohr,Michael J. Proulx,Mehedi Hasan Raju,Oleg V. Komogortsev

类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:eye-centric authentication modalities-eye, authentication modalities-eye movements, calibration-free authentication system, paper investigates, investigates the feasibility

备注: Supplementary material is available

点击查看摘要

Abstract:This paper investigates the feasibility of fusing two eye-centric authentication modalities-eye movements and periocular images-within a calibration-free authentication system. While each modality has independently shown promise for user authentication, their combination within a unified gaze-estimation pipeline has not been thoroughly explored at scale. In this report, we propose a multimodal authentication system and evaluate it using a large-scale in-house dataset comprising 9202 subjects with an eye tracking (ET) signal quality equivalent to a consumer-facing virtual reality (VR) device. Our results show that the multimodal approach consistently outperforms both unimodal systems across all scenarios, surpassing the FIDO benchmark. The integration of a state-of-the-art machine learning architecture contributed significantly to the overall authentication performance at scale, driven by the model's ability to capture authentication representations and the complementary discriminative characteristics of the fused modalities.

134. 【2505.17338】Render-FM: A Foundation Model for Real-time Photorealistic Volumetric Rendering

链接https://arxiv.org/abs/2505.17338

作者:Zhongpai Gao,Meng Zheng,Benjamin Planche,Anwesa Choudhuri,Terrence Chen,Ziyan Wu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Computed Tomography, visualizing complex, anatomical structures, crucial for visualizing, rendering of Computed

备注

点击查看摘要

Abstract:Volumetric rendering of Computed Tomography (CT) scans is crucial for visualizing complex 3D anatomical structures in medical imaging. Current high-fidelity approaches, especially neural rendering techniques, require time-consuming per-scene optimization, limiting clinical applicability due to computational demands and poor generalizability. We propose Render-FM, a novel foundation model for direct, real-time volumetric rendering of CT scans. Render-FM employs an encoder-decoder architecture that directly regresses 6D Gaussian Splatting (6DGS) parameters from CT volumes, eliminating per-scan optimization through large-scale pre-training on diverse medical data. By integrating robust feature extraction with the expressive power of 6DGS, our approach efficiently generates high-quality, real-time interactive 3D visualizations across diverse clinical CT data. Experiments demonstrate that Render-FM achieves visual fidelity comparable or superior to specialized per-scan methods while drastically reducing preparation time from nearly an hour to seconds for a single inference step. This advancement enables seamless integration into real-time surgical planning and diagnostic workflows. The project page is: this https URL.

135. 【2505.17333】mporal Differential Fields for 4D Motion Modeling via Image-to-Video Synthesis

链接https://arxiv.org/abs/2505.17333

作者:Xin You,Minghui Zhang,Hanxiao Zhang,Jie Yang,Nassir Navab

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:image-guided clinical applications, clinical applications, Temporal, crucial to image-guided, image-guided clinical

备注: early accepted by MICCAI

点击查看摘要

Abstract:Temporal modeling on regular respiration-induced motions is crucial to image-guided clinical applications. Existing methods cannot simulate temporal motions unless high-dose imaging scans including starting and ending frames exist simultaneously. However, in the preoperative data acquisition stage, the slight movement of patients may result in dynamic backgrounds between the first and last frames in a respiratory period. This additional deviation can hardly be removed by image registration, thus affecting the temporal modeling. To address that limitation, we pioneeringly simulate the regular motion process via the image-to-video (I2V) synthesis framework, which animates with the first frame to forecast future frames of a given length. Besides, to promote the temporal consistency of animated videos, we devise the Temporal Differential Diffusion Model to generate temporal differential fields, which measure the relative differential representations between adjacent frames. The prompt attention layer is devised for fine-grained differential fields, and the field augmented layer is adopted to better interact these fields with the I2V framework, promoting more accurate temporal variation of synthesized videos. Extensive results on ACDC cardiac and 4D Lung datasets reveal that our approach simulates 4D videos along the intrinsic motion trajectory, rivaling other competitive methods on perceptual similarity and temporal consistency. Codes will be available soon.

136. 【2505.17330】FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding

链接https://arxiv.org/abs/2505.17330

作者:Amit Agarwal,Srikant Panda,Kulbhushan Pachauri

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Domain Adapting Graph, Shot Domain Adapting, Adapting Graph, rich document understanding, visually rich document

备注: Published in the Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), Industry Track, pages 100-114

点击查看摘要

Abstract:In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with less than 90M parameters, making it well-suited for complex real-world applications for Information Extraction (IE) tasks where computational resources are limited. We demonstrate FS-DAG's capability through extensive experiments for information extraction task, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code : this https URL

137. 【2505.17328】Game-invariant Features Through Contrastive and Domain-adversarial Learning

链接https://arxiv.org/abs/2505.17328

作者:Dylan Kline

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Foundational game-image encoders, game-specific visual styles, undermining performance, Foundational game-image, encoders often overfit

备注

点击查看摘要

Abstract:Foundational game-image encoders often overfit to game-specific visual styles, undermining performance on downstream tasks when applied to new games. We present a method that combines contrastive learning and domain-adversarial training to learn game-invariant visual features. By simultaneously encouraging similar content to cluster and discouraging game-specific cues via an adversarial domain classifier, our approach produces embeddings that generalize across diverse games. Experiments on the Bingsu game-image dataset (10,000 screenshots from 10 games) demonstrate that after only a few training epochs, our model's features no longer cluster by game, indicating successful invariance and potential for improved cross-game transfer (e.g., glitch detection) with minimal fine-tuning. This capability paves the way for more generalizable game vision models that require little to no retraining on new games.

138. 【2505.17317】Optimizing Image Capture for Computer Vision-Powered Taxonomic Identification and Trait Recognition of Biodiversity Specimens

链接https://arxiv.org/abs/2505.17317

作者:Alyson East,Elizabeth G. Campolongo,Luke Meyers,S M Rayeed,Samuel Stevens,Iuliia Zarubiieva,Isadora E. Fluck,Jennifer C. Girón,Maximiliane Jousse,Scott Lowe,Kayla I Perry,Isabelle Betancourt,Noah Charney,Evan Donoso,Nathan Fox,Kim J. Landsbergen,Ekaterina Nepovinnykh,Michelle Ramirez,Parkash Singh,Khum Thapa-Magar,Matthew Thompson,Evan Waite,Tanya Berger-Wolf,Hilmar Lapp,Paula Mabee,Graham Taylor,Sydne Record

类目:Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

关键词:specimens documenting Earth, documenting Earth biodiversity, documenting Earth, collections house millions, digital images increasingly

备注

点击查看摘要

Abstract:Biological collections house millions of specimens documenting Earth's biodiversity, with digital images increasingly available through open-access platforms. Most imaging protocols were developed for human visual interpretation without considering computational analysis requirements. This paper aims to bridge the gap between current imaging practices and the potential for automated analysis by presenting key considerations for creating biological specimen images optimized for computer vision applications. We provide conceptual computer vision topics for context, addressing fundamental concerns including model generalization, data leakage, and comprehensive metadata documentation, and outline practical guidance on specimen imagine, and data storage. These recommendations were synthesized through interdisciplinary collaboration between taxonomists, collection managers, ecologists, and computer scientists. Through this synthesis, we have identified ten interconnected considerations that form a framework for successfully integrating biological specimen images into computer vision pipelines. The key elements include: (1) comprehensive metadata documentation, (2) standardized specimen positioning, (3) consistent size and color calibration, (4) protocols for handling multiple specimens in one image, (5) uniform background selection, (6) controlled lighting, (7) appropriate resolution and magnification, (8) optimal file formats, (9) robust data archiving strategies, and (10) accessible data sharing practices. By implementing these recommendations, collection managers, taxonomists, and biodiversity informaticians can generate images that support automated trait extraction, species identification, and novel ecological and evolutionary analyses at unprecedented scales. Successful implementation lies in thorough documentation of methodological choices.

139. 【2505.17316】Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models

链接https://arxiv.org/abs/2505.17316

作者:Jiachen Jiang,Jinxin Zhou,Bo Peng,Xia Ning,Zhihui Zhu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, pretrained vision encoders, powerful pretrained vision, vision encoder

备注

点击查看摘要

Abstract:Achieving better alignment between vision embeddings and Large Language Models (LLMs) is crucial for enhancing the abilities of Multimodal LLMs (MLLMs), particularly for recent models that rely on powerful pretrained vision encoders and LLMs. A common approach to connect the pretrained vision encoder and LLM is through a projector applied after the vision encoder. However, the projector is often trained to enable the LLM to generate captions, and hence the mechanism by which LLMs understand each vision token remains unclear. In this work, we first investigate the role of the projector in compressing vision embeddings and aligning them with word embeddings. We show that the projector significantly compresses visual information, removing redundant details while preserving essential elements necessary for the LLM to understand visual content. We then examine patch-level alignment -- the alignment between each vision patch and its corresponding semantic words -- and propose a *multi-semantic alignment hypothesis*. Our analysis indicates that the projector trained by caption loss improves patch-level alignment but only to a limited extent, resulting in weak and coarse alignment. To address this issue, we propose *patch-aligned training* to efficiently enhance patch-level alignment. Our experiments show that patch-aligned training (1) achieves stronger compression capability and improved patch-level alignment, enabling the MLLM to generate higher-quality captions, (2) improves the MLLM's performance by 16% on referring expression grounding tasks, 4% on question-answering tasks, and 3% on modern instruction-following benchmarks when using the same supervised fine-tuning (SFT) setting. The proposed method can be easily extended to other multimodal models.

140. 【2505.17311】Harnessing EHRs for Diffusion-based Anomaly Detection on Chest X-rays

链接https://arxiv.org/abs/2505.17311

作者:Harim Kim,Yuhan Wang,Minkyu Ahn,Heeyoul Choi,Yuyin Zhou,Charmgil Hong

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Unsupervised anomaly detection, identifying pathological abnormalities, extensive labeled data, Unsupervised anomaly, requiring extensive labeled

备注: MICCAI 2025 early accept

点击查看摘要

Abstract:Unsupervised anomaly detection (UAD) in medical imaging is crucial for identifying pathological abnormalities without requiring extensive labeled data. However, existing diffusion-based UAD models rely solely on imaging features, limiting their ability to distinguish between normal anatomical variations and pathological anomalies. To address this, we propose Diff3M, a multi-modal diffusion-based framework that integrates chest X-rays and structured Electronic Health Records (EHRs) for enhanced anomaly detection. Specifically, we introduce a novel image-EHR cross-attention module to incorporate structured clinical context into the image generation process, improving the model's ability to differentiate normal from abnormal features. Additionally, we develop a static masking strategy to enhance the reconstruction of normal-like images from anomalies. Extensive evaluations on CheXpert and MIMIC-CXR/IV demonstrate that Diff3M achieves state-of-the-art performance, outperforming existing UAD methods in medical imaging. Our code is available at this http URL this https URL

141. 【2505.17280】Mitigate One, Skew Another? Tackling Intersectional Biases in Text-to-Image Models

链接https://arxiv.org/abs/2505.17280

作者:Pushkar Shukla,Aditya Chinchure,Emily Diana,Alexander Tolbert,Kartik Hosanagar,Vineeth N Balasubramanian,Leonid Sigal,Matthew Turk

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:treated as independent, deeply interrelated, biases exhibited, TTI models, TTI

备注

点击查看摘要

Abstract:The biases exhibited by text-to-image (TTI) models are often treated as independent, though in reality, they may be deeply interrelated. Addressing bias along one dimension - such as ethnicity or age - can inadvertently affect another, like gender, either mitigating or exacerbating existing disparities. Understanding these interdependencies is crucial for designing fairer generative models, yet measuring such effects quantitatively remains a challenge. To address this, we introduce BiasConnect, a novel tool for analyzing and quantifying bias interactions in TTI models. BiasConnect uses counterfactual interventions along different bias axes to reveal the underlying structure of these interactions and estimates the effect of mitigating one bias axis on another. These estimates show strong correlation (+0.65) with observed post-mitigation outcomes. Building on BiasConnect, we propose InterMit, an intersectional bias mitigation algorithm guided by user-defined target distributions and priority weights. InterMit achieves lower bias (0.33 vs. 0.52) with fewer mitigation steps (2.38 vs. 3.15 average steps), and yields superior image quality compared to traditional techniques. Although our implementation is training-free, InterMit is modular and can be integrated with many existing debiasing approaches for TTI models, making it a flexible and extensible solution.

142. 【2505.17256】ExpertGen: Training-Free Expert Guidance for Controllable Text-to-Face Generation

链接https://arxiv.org/abs/2505.17256

作者:Liang Shi,Yun Fu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, achieving fine-grained control, facial features remains, significantly improved, remains a challenge

备注

点击查看摘要

Abstract:Recent advances in diffusion models have significantly improved text-to-face generation, but achieving fine-grained control over facial features remains a challenge. Existing methods often require training additional modules to handle specific controls such as identity, attributes, or age, making them inflexible and resource-intensive. We propose ExpertGen, a training-free framework that leverages pre-trained expert models such as face recognition, facial attribute recognition, and age estimation networks to guide generation with fine control. Our approach uses a latent consistency model to ensure realistic and in-distribution predictions at each diffusion step, enabling accurate guidance signals to effectively steer the diffusion process. We show qualitatively and quantitatively that expert models can guide the generation process with high precision, and multiple experts can collaborate to enable simultaneous control over diverse facial aspects. By allowing direct integration of off-the-shelf expert models, our method transforms any such model into a plug-and-play component for controllable face generation.

143. 【2505.17245】Extending Dataset Pruning to Object Detection: A Variance-based Approach

链接https://arxiv.org/abs/2505.17245

作者:Ryota Yagi

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:efficient machine learning, offering significant reductions, machine learning, offering significant, Scoring Strategy Problem

备注

点击查看摘要

Abstract:Dataset pruning -- selecting a small yet informative subset of training data -- has emerged as a promising strategy for efficient machine learning, offering significant reductions in computational cost and storage compared to alternatives like dataset distillation. While pruning methods have shown strong performance in image classification, their extension to more complex computer vision tasks, particularly object detection, remains relatively underexplored. In this paper, we present the first principled extension of classification pruning techniques to the object detection domain, to the best of our knowledge. We identify and address three key challenges that hinder this transition: the Object-Level Attribution Problem, the Scoring Strategy Problem, and the Image-Level Aggregation Problem. To overcome these, we propose tailored solutions, including a novel scoring method called Variance-based Prediction Score (VPS). VPS leverages both Intersection over Union (IoU) and confidence scores to effectively identify informative training samples specific to detection tasks. Extensive experiments on PASCAL VOC and MS COCO demonstrate that our approach consistently outperforms prior dataset pruning methods in terms of mean Average Precision (mAP). We also show that annotation count and class distribution shift can influence detection performance, but selecting informative examples is a more critical factor than dataset size or balance. Our work bridges dataset pruning and object detection, paving the way for dataset pruning in complex vision tasks.

144. 【2505.17235】CHAOS: Chart Analysis with Outlier Samples

链接https://arxiv.org/abs/2505.17235

作者:Omar Moured,Yufan Chen,Ruiping Liu,Simon Reiß,Philip Torr,Jiaming Zhang,Rainer Stiefelhagen

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Multimodal Large Language, Large Language Models, noisy features, real-world applications, applications often present

备注: Data and code are publicly available at: [this http URL](http://huggingface.co/datasets/omoured/CHAOS)

点击查看摘要

Abstract:Charts play a critical role in data analysis and visualization, yet real-world applications often present charts with challenging or noisy features. However, "outlier charts" pose a substantial challenge even for Multimodal Large Language Models (MLLMs), which can struggle to interpret perturbed charts. In this work, we introduce CHAOS (CHart Analysis with Outlier Samples), a robustness benchmark to systematically evaluate MLLMs against chart perturbations. CHAOS encompasses five types of textual and ten types of visual perturbations, each presented at three levels of severity (easy, mid, hard) inspired by the study result of human evaluation. The benchmark includes 13 state-of-the-art MLLMs divided into three groups (i.e., general-, document-, and chart-specific models) according to the training scope and data. Comprehensive analysis involves two downstream tasks (ChartQA and Chart-to-Text). Extensive experiments and case studies highlight critical insights into robustness of models across chart perturbations, aiming to guide future research in chart understanding domain. Data and code are publicly available at: this http URL.

145. 【2505.17223】REACT 2025: the Third Multiple Appropriate Facial Reaction Generation Challenge

链接https://arxiv.org/abs/2505.17223

作者:Siyang Song,Micol Spitale,Xiangyu Kong,Hengde Zhu,Cheng Luo,Cristina Palmero,German Barquero,Sergio Escalera,Michel Valstar,Mohamed Daoudi,Tobias Baur,Fabien Ringeval,Andrew Howes,Elisabeth Andre,Hatice Gunes

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:human facial reactions, human speaker behaviour, facial reactions expressed, human-style facial reactions, facial reactions

备注

点击查看摘要

Abstract:In dyadic interactions, a broad spectrum of human facial reactions might be appropriate for responding to each human speaker behaviour. Following the successful organisation of the REACT 2023 and REACT 2024 challenges, we are proposing the REACT 2025 challenge encouraging the development and benchmarking of Machine Learning (ML) models that can be used to generate multiple appropriate, diverse, realistic and synchronised human-style facial reactions expressed by human listeners in response to an input stimulus (i.e., audio-visual behaviours expressed by their corresponding speakers). As a key of the challenge, we provide challenge participants with the first natural and large-scale multi-modal MAFRG dataset (called MARS) recording 137 human-human dyadic interactions containing a total of 2856 interaction sessions covering five different topics. In addition, this paper also presents the challenge guidelines and the performance of our baselines on the two proposed sub-challenges: Offline MAFRG and Online MAFRG, respectively. The challenge baseline code is publicly available at this https URL

146. 【2505.17202】CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models

链接https://arxiv.org/abs/2505.17202

作者:Arnav Verma,Kushin Mukherjee,Christopher Potts,Elisa Kreiss,Judith E. Fan

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Data, powerful tools, tools for communicating, Data visualizations, models

备注

点击查看摘要

Abstract:Data visualizations are powerful tools for communicating patterns in quantitative data. Yet understanding any data visualization is no small feat -- succeeding requires jointly making sense of visual, numerical, and linguistic inputs arranged in a conventionalized format one has previously learned to parse. Recently developed vision-language models are, in principle, promising candidates for developing computational models of these cognitive operations. However, it is currently unclear to what degree these models emulate human behavior on tasks that involve reasoning about data visualizations. This gap reflects limitations in prior work that has evaluated data visualization understanding in artificial systems using measures that differ from those typically used to assess these abilities in humans. Here we evaluated eight vision-language models on six data visualization literacy assessments designed for humans and compared model responses to those of human participants. We found that these models performed worse than human participants on average, and this performance gap persisted even when using relatively lenient criteria to assess model performance. Moreover, while relative performance across items was somewhat correlated between models and humans, all models produced patterns of errors that were reliably distinct from those produced by human participants. Taken together, these findings suggest significant opportunities for further development of artificial systems that might serve as useful models of how humans reason about data visualizations. All code and data needed to reproduce these results are available at: this https URL.

147. 【2505.17201】A Framework for Multi-View Multiple Object Tracking using Single-View Multi-Object Trackers on Fish Data

链接https://arxiv.org/abs/2505.17201

作者:Chaim Chai Elchik,Fatemeh Karimi Nejadasl,Seyed Sahand Mohammadi Ziabari,Ali Mohammed Mansoor Alsahag

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:environments presents unique, single-view MOT models, presents unique challenges, unique challenges due, made significant advancements

备注

点击查看摘要

Abstract:Multi-object tracking (MOT) in computer vision has made significant advancements, yet tracking small fish in underwater environments presents unique challenges due to complex 3D motions and data noise. Traditional single-view MOT models often fall short in these settings. This thesis addresses these challenges by adapting state-of-the-art single-view MOT models, FairMOT and YOLOv8, for underwater fish detecting and tracking in ecological studies. The core contribution of this research is the development of a multi-view framework that utilizes stereo video inputs to enhance tracking accuracy and fish behavior pattern recognition. By integrating and evaluating these models on underwater fish video datasets, the study aims to demonstrate significant improvements in precision and reliability compared to single-view approaches. The proposed framework detects fish entities with a relative accuracy of 47% and employs stereo-matching techniques to produce a novel 3D output, providing a more comprehensive understanding of fish movements and interactions

148. 【2505.17167】CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation

链接https://arxiv.org/abs/2505.17167

作者:Ibrahim Ethem Hamamci,Sezgin Er,Suprosanna Shit,Hadrien Reynaud,Bernhard Kainz,Bjoern Menze

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Evaluating long-context radiology, Evaluating long-context, radiology report generation, long-context radiology report, generation is challenging

备注

点击查看摘要

Abstract:Evaluating long-context radiology report generation is challenging. NLG metrics fail to capture clinical correctness, while LLM-based metrics often lack generalizability. Clinical accuracy metrics are more relevant but are sensitive to class imbalance, frequently favoring trivial predictions. We propose the CRG Score, a distribution-aware and adaptable metric that evaluates only clinically relevant abnormalities explicitly described in reference reports. CRG supports both binary and structured labels (e.g., type, location) and can be paired with any LLM for feature extraction. By balancing penalties based on label distribution, it enables fairer, more robust evaluation and serves as a clinically aligned reward function.

149. 【2505.17163】OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

链接https://arxiv.org/abs/2505.17163

作者:Mingxin Huang,Yongxin Shi,Dezhi Peng,Songxuan Lai,Zecheng Xie,Lianwen Jin

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:demonstrated remarkable performance, Recent advancements, multimodal slow-thinking systems, text-rich image reasoning, Multimodal Large Language

备注

点击查看摘要

Abstract:Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results demonstrate the limitations of existing methodologies. Notably, even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy surpassing 50\% across OCR-Reasoning, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at this https URL.

150. 【2505.17132】Robustifying Vision-Language Models via Dynamic Token Reweighting

链接https://arxiv.org/abs/2505.17132

作者:Tanqiu Jiang,Jiacheng Liang,Rongyi Zhu,Jiawei Zhou,Fenglong Ma,Ting Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Large vision-language models, exploit visual-textual interactions, Large vision-language, bypass safety guardrails, highly vulnerable

备注

点击查看摘要

Abstract:Large vision-language models (VLMs) are highly vulnerable to jailbreak attacks that exploit visual-textual interactions to bypass safety guardrails. In this paper, we present DTR, a novel inference-time defense that mitigates multimodal jailbreak attacks through optimizing the model's key-value (KV) caches. Rather than relying on curated safety-specific data or costly image-to-text conversion, we introduce a new formulation of the safety-relevant distributional shift induced by the visual modality. This formulation enables DTR to dynamically adjust visual token weights, minimizing the impact of adversarial visual inputs while preserving the model's general capabilities and inference efficiency. Extensive evaluation across diverse VLMs and attack benchmarks demonstrates that \sys outperforms existing defenses in both attack robustness and benign task performance, marking the first successful application of KV cache optimization for safety enhancement in multimodal foundation models. The code for replicating DTR is available: this https URL (warning: this paper contains potentially harmful content generated by VLMs.)

151. 【2505.17127】Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts

链接https://arxiv.org/abs/2505.17127

作者:Michal Golovanevsky,William Rudman,Michael Lepori,Amir Bar,Ritambhara Singh,Carsten Eickhoff

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Multimodal Large Language, Large Language Models, Large Language, visual question answering, visual information present

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) perform well on tasks such as visual question answering, but it remains unclear whether their reasoning relies more on memorized world knowledge or on the visual information present in the input image. To investigate this, we introduce Visual CounterFact, a new dataset of visually-realistic counterfactuals that put world knowledge priors (e.g, red strawberry) into direct conflict with visual input (e.g, blue strawberry). Using Visual CounterFact, we show that model predictions initially reflect memorized priors, but shift toward visual evidence in mid-to-late layers. This dynamic reveals a competition between the two modalities, with visual input ultimately overriding priors during evaluation. To control this behavior, we propose Pixels Versus Priors (PvP) steering vectors, a mechanism for controlling model outputs toward either world knowledge or visual input through activation-level interventions. On average, PvP successfully shifts 92.5% of color and 74.6% of size predictions from priors to counterfactuals. Together, these findings offer new tools for interpreting and controlling factual behavior in multimodal models.

152. 【2505.17114】RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language

链接https://arxiv.org/abs/2505.17114

作者:Subrata Biswas,Mohammad Nur Hossain Khan,Bashima Islam

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

关键词:Multimodal question answering, Multimodal question, question answering, identifying which video, requires identifying

备注

点击查看摘要

Abstract:Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning -- each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio--Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks -- including egocentric and exocentric tasks -- show that RAVEN achieves up to 14.5\% and 8.0\% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4\% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23\%. Our code and dataset are available at this https URL.

153. 【2505.17098】ACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration

链接https://arxiv.org/abs/2505.17098

作者:Yanshu Li,Tian Yun,Jianjiang Yang,Pinyuan Feng,Jinfa Huang,Ruixiang Tang

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:large vision-language models, Multimodal in-context learning, key mechanism, mechanism for harnessing, harnessing the capabilities

备注: 29 pages, 11 figures, 19 tables. arXiv admin note: substantial text overlap with [arXiv:2503.04839](https://arxiv.org/abs/2503.04839)

点击查看摘要

Abstract:Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input in-context sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures in-context sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a valuable perspective for interpreting and improving multimodal ICL.

154. 【2505.17097】CAMA: Enhancing Multimodal In-Context Learning with Context-Aware Modulated Attention

链接https://arxiv.org/abs/2505.17097

作者:Yanshu Li,JianJiang Yang,Bozheng Li,Ruixiang Tang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:enables large vision-language, large vision-language models, Multimodal in-context learning, in-context learning, enables large

备注: 10 pages, 2 figures, 6 tables

点击查看摘要

Abstract:Multimodal in-context learning (ICL) enables large vision-language models (LVLMs) to efficiently adapt to novel tasks, supporting a wide array of real-world applications. However, multimodal ICL remains unstable, and current research largely focuses on optimizing sequence configuration while overlooking the internal mechanisms of LVLMs. In this work, we first provide a theoretical analysis of attentional dynamics in multimodal ICL and identify three core limitations of standard attention that ICL impair performance. To address these challenges, we propose Context-Aware Modulated Attention (CAMA), a simple yet effective plug-and-play method for directly calibrating LVLM attention logits. CAMA is training-free and can be seamlessly applied to various open-source LVLMs. We evaluate CAMA on four LVLMs across six benchmarks, demonstrating its effectiveness and generality. CAMA opens new opportunities for deeper exploration and targeted utilization of LVLM attention dynamics to advance multimodal reasoning.

155. 【2505.17091】Large Language Models Implicitly Learn to See and Hear Just By Reading

链接https://arxiv.org/abs/2505.17091

作者:Prateek Verma,Mert Pilanci

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:inherently develops internally, auto-regressive LLM model, model inherently develops, LLM models, auto-regressive LLM

备注: 6 pages, 3 figures, 4 tables. Under Review WASPAA 2025

点击查看摘要

Abstract:This paper presents a fascinating find: By training an auto-regressive LLM model on text tokens, the text model inherently develops internally an ability to understand images and audio, thereby developing the ability to see and hear just by reading. Popular audio and visual LLM models fine-tune text LLM models to give text output conditioned on images and audio embeddings. On the other hand, our architecture takes in patches of images, audio waveforms or tokens as input. It gives us the embeddings or category labels typical of a classification pipeline. We show the generality of text weights in aiding audio classification for datasets FSD-50K and GTZAN. Further, we show this working for image classification on CIFAR-10 and Fashion-MNIST, as well on image patches. This pushes the notion of text-LLMs learning powerful internal circuits that can be utilized by activating necessary connections for various applications rather than training models from scratch every single time.

156. 【2505.17090】EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language

链接https://arxiv.org/abs/2505.17090

作者:Phoebe Chua,Cathy Mengying Fang,Takehiko Ohkawa,Raja Kushalnagar,Suranga Nanayakkara,Pattie Maes

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Unlike spoken languages, remain poorly understood, creating communication barriers, Unlike spoken, language remain poorly

备注

点击查看摘要

Abstract:Unlike spoken languages where the use of prosodic features to convey emotion is well studied, indicators of emotion in sign language remain poorly understood, creating communication barriers in critical settings. Sign languages present unique challenges as facial expressions and hand movements simultaneously serve both grammatical and emotional functions. To address this gap, we introduce EmoSign, the first sign video dataset containing sentiment and emotion labels for 200 American Sign Language (ASL) videos. We also collect open-ended descriptions of emotion cues. Annotations were done by 3 Deaf ASL signers with professional interpretation experience. Alongside the annotations, we include baseline models for sentiment and emotion classification. This dataset not only addresses a critical gap in existing sign language research but also establishes a new benchmark for understanding model capabilities in multimodal emotion recognition for sign languages. The dataset is made available at this https URL.

157. 【2505.17064】Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models

链接https://arxiv.org/abs/2505.17064

作者:Maria-Teresa De Rosa Palmini,Eva Cetinic

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:content creation, growing attention, increasingly influential, influential in content, cultural implications

备注

点击查看摘要

Abstract:As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. In this work, we present a systematic and reproducible methodology for evaluating how TTI systems depict different historical periods. For this purpose, we introduce the HistVis dataset, a curated collection of 30,000 synthetic images generated by three state-of-the-art diffusion models using carefully designed prompts depicting universal human activities across different historical periods. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery, as TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By offering a scalable methodology and benchmark for assessing historical representation in generated imagery, this work provides an initial step toward building more historically accurate and culturally aligned TTI models.

158. 【2505.17061】Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models

链接https://arxiv.org/abs/2505.17061

作者:Xinlong Chen,Yuanxing Zhang,Qiang Liu,Junfei Wu,Fuzheng Zhang,Tieniu Tan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Vision-Language Models, exhibited impressive capabilities, Large Vision-Language, visual tasks, exhibited impressive

备注: Accepted to Findings of ACL 2025

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have exhibited impressive capabilities across various visual tasks, yet they remain hindered by the persistent challenge of hallucinations. To address this critical issue, we propose Mixture of Decoding (MoD), a novel approach for hallucination mitigation that dynamically adapts decoding strategies by evaluating the correctness of the model's attention on image tokens. Specifically, MoD measures the consistency between outputs generated from the original image tokens and those derived from the model's attended image tokens, to distinguish the correctness aforementioned. If the outputs are consistent, indicating correct attention, MoD employs a complementary strategy to amplify critical information. Conversely, if the outputs are inconsistent, suggesting erroneous attention, MoD utilizes a contrastive strategy to suppress misleading information. Extensive experiments demonstrate that MoD significantly outperforms existing decoding methods across multiple mainstream benchmarks, effectively mitigating hallucinations in LVLMs. The code is available at this https URL.

159. 【2505.17042】VLM-KG: Multimodal Radiology Knowledge Graph Generation

链接https://arxiv.org/abs/2505.17042

作者:Abdullah Abdullah,Seong Tae Kim

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Vision-Language Models, demonstrated remarkable success, excelling at instruction, structured output generation, demonstrated remarkable

备注: 10 pages, 2 figures

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable success in natural language generation, excelling at instruction following and structured output generation. Knowledge graphs play a crucial role in radiology, serving as valuable sources of factual information and enhancing various downstream tasks. However, generating radiology-specific knowledge graphs presents significant challenges due to the specialized language of radiology reports and the limited availability of domain-specific data. Existing solutions are predominantly unimodal, meaning they generate knowledge graphs only from radiology reports while excluding radiographic images. Additionally, they struggle with long-form radiology data due to limited context length. To address these limitations, we propose a novel multimodal VLM-based framework for knowledge graph generation in radiology. Our approach outperforms previous methods and introduces the first multimodal solution for radiology knowledge graph generation.

160. 【2505.07444】Lightweight Multispectral Crop-Weed Segmentation for Precision Agriculture

链接https://arxiv.org/abs/2505.07444

作者:Zeynep Galymzhankyzy,Eric Martinson

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)

关键词:Efficient crop-weed segmentation, Efficient crop-weed, site-specific weed control, critical for site-specific, Efficient

备注: 4 pages, 5 figures, 1 table

点击查看摘要

Abstract:Efficient crop-weed segmentation is critical for site-specific weed control in precision agriculture. Conventional CNN-based methods struggle to generalize and rely on RGB imagery, limiting performance under complex field conditions. To address these challenges, we propose a lightweight transformer-CNN hybrid. It processes RGB, Near-Infrared (NIR), and Red-Edge (RE) bands using specialized encoders and dynamic modality integration. Evaluated on the WeedsGalore dataset, the model achieves a segmentation accuracy (mean IoU) of 78.88%, outperforming RGB-only models by 15.8 percentage points. With only 8.7 million parameters, the model offers high accuracy, computational efficiency, and potential for real-time deployment on Unmanned Aerial Vehicles (UAVs) and edge devices, advancing precision weed management.

161. 【2505.18107】Accelerating Learned Image Compression Through Modeling Neural Training Dynamics

链接https://arxiv.org/abs/2505.18107

作者:Yichi Zhang,Zhihao Duan,Yuning Huang,Fengqing Zhu

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:learned image compression, increasingly computationally demanding, image compression, computationally demanding, efficiency is crucial

备注: Accepted to TMLR

点击查看摘要

Abstract:As learned image compression (LIC) methods become increasingly computationally demanding, enhancing their training efficiency is crucial. This paper takes a step forward in accelerating the training of LIC methods by modeling the neural training dynamics. We first propose a Sensitivity-aware True and Dummy Embedding Training mechanism (STDET) that clusters LIC model parameters into few separate modes where parameters are expressed as affine transformations of reference parameters within the same mode. By further utilizing the stable intra-mode correlations throughout training and parameter sensitivities, we gradually embed non-reference parameters, reducing the number of trainable parameters. Additionally, we incorporate a Sampling-then-Moving Average (SMA) technique, interpolating sampled weights from stochastic gradient descent (SGD) training to obtain the moving average weights, ensuring smooth temporal behavior and minimizing training state variances. Overall, our method significantly reduces training space dimensions and the number of trainable parameters without sacrificing model performance, thus accelerating model convergence. We also provide a theoretical analysis on the Noisy quadratic model, showing that the proposed method achieves a lower training variance than standard SGD. Our approach offers valuable insights for further developing efficient training methods for LICs.

162. 【2505.18058】A Foundation Model Framework for Multi-View MRI Classification of Extramural Vascular Invasion and Mesorectal Fascia Invasion in Rectal Cancer

链接https://arxiv.org/abs/2505.18058

作者:Yumeng Zhang,Zohaib Salahuddin,Danial Khan,Shruti Atul Mali,Henry C. Woodruff,Sina Amirrajab,Eduardo Ibor-Crespo,Ana Jimenez-Pastor,Luis Marti-Bonmati,Philippe Lambin

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Accurate MRI-based identification, extramural vascular invasion, mesorectal fascia invasion, Accurate MRI-based, vascular invasion

备注: 22 pages, 8 figures

点击查看摘要

Abstract:Background: Accurate MRI-based identification of extramural vascular invasion (EVI) and mesorectal fascia invasion (MFI) is pivotal for risk-stratified management of rectal cancer, yet visual assessment is subjective and vulnerable to inter-institutional variability. Purpose: To develop and externally evaluate a multicenter, foundation-model-driven framework that automatically classifies EVI and MFI on axial and sagittal T2-weighted MRI. Methods: This retrospective study used 331 pre-treatment rectal cancer MRI examinations from three European hospitals. After TotalSegmentator-guided rectal patch extraction, a self-supervised frequency-domain harmonization pipeline was trained to minimize scanner-related contrast shifts. Four classifiers were compared: ResNet50, SeResNet, the universal biomedical pretrained transformer (UMedPT) with a lightweight MLP head, and a logistic-regression variant using frozen UMedPT features (UMedPT_LR). Results: UMedPT_LR achieved the best EVI detection when axial and sagittal features were fused (AUC = 0.82; sensitivity = 0.75; F1 score = 0.73), surpassing the Chaimeleon Grand-Challenge winner (AUC = 0.74). The highest MFI performance was attained by UMedPT on axial harmonized images (AUC = 0.77), surpassing the Chaimeleon Grand-Challenge winner (AUC = 0.75). Frequency-domain harmonization improved MFI classification but variably affected EVI performance. Conventional CNNs (ResNet50, SeResNet) underperformed, especially in F1 score and balanced accuracy. Conclusion: These findings demonstrate that combining foundation model features, harmonization, and multi-view fusion significantly enhances diagnostic performance in rectal MRI.

163. 【2505.17971】Explainable Anatomy-Guided AI for Prostate MRI: Foundation Models and In Silico Clinical Trials for Virtual Biopsy-based Risk Assessment

链接https://arxiv.org/abs/2505.17971

作者:Danial Khan,Zohaib Salahuddin,Yumeng Zhang,Sheng Kuang,Shruti Atul Mali,Henry C. Woodruff,Sina Amirrajab,Rachel Cavill,Eduardo Ibor-Crespo,Ana Jimenez-Pastor,Adrian Galiana-Bordera,Paula Jimenez Gomez,Luis Marti-Bonmati,Philippe Lambin

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:anatomically guided deep, guided deep learning, deep learning pipeline, UMedPT Swin Transformer, Swin Transformer foundation

备注

点击查看摘要

Abstract:We present a fully automated, anatomically guided deep learning pipeline for prostate cancer (PCa) risk stratification using routine MRI. The pipeline integrates three key components: an nnU-Net module for segmenting the prostate gland and its zones on axial T2-weighted MRI; a classification module based on the UMedPT Swin Transformer foundation model, fine-tuned on 3D patches with optional anatomical priors and clinical data; and a VAE-GAN framework for generating counterfactual heatmaps that localize decision-driving image regions. The system was developed using 1,500 PI-CAI cases for segmentation and 617 biparametric MRIs with metadata from the CHAIMELEON challenge for classification (split into 70% training, 10% validation, and 20% testing). Segmentation achieved mean Dice scores of 0.95 (gland), 0.94 (peripheral zone), and 0.92 (transition zone). Incorporating gland priors improved AUC from 0.69 to 0.72, with a three-scale ensemble achieving top performance (AUC = 0.79, composite score = 0.76), outperforming the 2024 CHAIMELEON challenge winners. Counterfactual heatmaps reliably highlighted lesions within segmented regions, enhancing model interpretability. In a prospective multi-center in-silico trial with 20 clinicians, AI assistance increased diagnostic accuracy from 0.72 to 0.77 and Cohen's kappa from 0.43 to 0.53, while reducing review time per case by 40%. These results demonstrate that anatomy-aware foundation models with counterfactual explainability can enable accurate, interpretable, and efficient PCa risk assessment, supporting their potential use as virtual biopsies in clinical practice.

164. 【2505.17915】Promptable cancer segmentation using minimal expert-curated data

链接https://arxiv.org/abs/2505.17915

作者:Lynn Karam,Yipei Wang,Veeru Kasivisvanathan,Mirabela Rusu,Yipeng Hu,Shaheer U. Saeed

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:aid targeted diagnostic, Automated segmentation, therapeutic procedures, aid targeted, targeted diagnostic

备注: Accepted at Medical Image Understanding and Analysis (MIUA) 2025

点击查看摘要

Abstract:Automated segmentation of cancer on medical images can aid targeted diagnostic and therapeutic procedures. However, its adoption is limited by the high cost of expert annotations required for training and inter-observer variability in datasets. While weakly-supervised methods mitigate some challenges, using binary histology labels for training as opposed to requiring full segmentation, they require large paired datasets of histology and images, which are difficult to curate. Similarly, promptable segmentation aims to allow segmentation with no re-training for new tasks at inference, however, existing models perform poorly on pathological regions, again necessitating large datasets for training. In this work we propose a novel approach for promptable segmentation requiring only 24 fully-segmented images, supplemented by 8 weakly-labelled images, for training. Curating this minimal data to a high standard is relatively feasible and thus issues with the cost and variability of obtaining labels can be mitigated. By leveraging two classifiers, one weakly-supervised and one fully-supervised, our method refines segmentation through a guided search process initiated by a single-point prompt. Our approach outperforms existing promptable segmentation methods, and performs comparably with fully-supervised methods, for the task of prostate cancer segmentation, while using substantially less annotated data (up to 100X less). This enables promptable segmentation with very minimal labelled data, such that the labels can be curated to a very high standard.

165. 【2505.17912】UltraBoneUDF: Self-supervised Bone Surface Reconstruction from Ultrasound Based on Neural Unsigned Distance Functions

链接https://arxiv.org/abs/2505.17912

作者:Luohong Wu,Matthias Seibold,Nicola A. Cavalcanti,Giuseppe Loggia,Lisa Reissner,Bastian Sigrist,Jonas Hein,Lilian Calvet,Arnd Viehöfer,Philipp Fürnstahl

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:computer-assisted orthopedic surgery, Bone surface reconstruction, Bone surface, surface reconstruction, Bone

备注

点击查看摘要

Abstract:Background: Bone surface reconstruction plays a critical role in computer-assisted orthopedic surgery. Compared to traditional imaging modalities such as CT and MRI, ultrasound offers a radiation-free, cost-effective, and portable alternative. Continuous bone surface reconstruction can be employed for many clinical applications. However, due to the inherent limitations of ultrasound imaging, B-mode ultrasound typically capture only partial bone surfaces. Existing reconstruction methods struggle with such incomplete data, leading to artifacts and increased reconstruction errors. Effective techniques for accurately reconstructing thin and open bone surfaces from real-world 3D ultrasound volumes remain lacking. Methods: We propose UltraBoneUDF, a self-supervised framework designed for reconstructing open bone surfaces from ultrasound using neural Unsigned Distance Functions. To enhance reconstruction quality, we introduce a novel global feature extractor that effectively fuses ultrasound-specific image characteristics. Additionally, we present a novel loss function based on local tangent plane optimization that substantially improves surface reconstruction quality. UltraBoneUDF and baseline models are extensively evaluated on four open-source datasets. Results: Qualitative results highlight the limitations of the state-of-the-art methods for open bone surface reconstruction and demonstrate the effectiveness of UltraBoneUDF. Quantitatively, UltraBoneUDF significantly outperforms competing methods across all evaluated datasets for both open and closed bone surface reconstruction in terms of mean Chamfer distance error: 1.10 mm on the UltraBones100k dataset (39.6\% improvement compared to the SOTA), 0.23 mm on the OpenBoneCT dataset (69.3\% improvement), 0.18 mm on the ClosedBoneCT dataset (70.2\% improvement), and 0.05 mm on the Prostate dataset (55.3\% improvement).

166. 【2505.17683】Dual Attention Residual U-Net for Accurate Brain Ultrasound Segmentation in IVH Detection

链接https://arxiv.org/abs/2505.17683

作者:Dan Yuan,Yi Feng,Ziyun Tang

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:severe neurological complication, Intraventricular hemorrhage, improve clinical outcomes, premature infants, necessitating early

备注: 10 pages,6 figures and 3 tables

点击查看摘要

Abstract:Intraventricular hemorrhage (IVH) is a severe neurological complication among premature infants, necessitating early and accurate detection from brain ultrasound (US) images to improve clinical outcomes. While recent deep learning methods offer promise for computer-aided diagnosis, challenges remain in capturing both local spatial details and global contextual dependencies critical for segmenting brain anatomies. In this work, we propose an enhanced Residual U-Net architecture incorporating two complementary attention mechanisms: the Convolutional Block Attention Module (CBAM) and a Sparse Attention Layer (SAL). The CBAM improves the model's ability to refine spatial and channel-wise features, while the SAL introduces a dual-branch design, sparse attention filters out low-confidence query-key pairs to suppress noise, and dense attention ensures comprehensive information propagation. Extensive experiments on the Brain US dataset demonstrate that our method achieves state-of-the-art segmentation performance, with a Dice score of 89.04% and IoU of 81.84% for ventricle region segmentation. These results highlight the effectiveness of integrating spatial refinement and attention sparsity for robust brain anatomy detection. Code is available at: this https URL.

167. 【2505.17644】owards Prospective Medical Image Reconstruction via Knowledge-Informed Dynamic Optimal Transport

链接https://arxiv.org/abs/2505.17644

作者:Taoran Zheng,Xing Li,Yan Yang,Xiang Gu,Zongben Xu,Jian Sun

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:challenging inverse problem, Medical image reconstruction, Medical image, inverse problem, vital but challenging

备注

点击查看摘要

Abstract:Medical image reconstruction from measurement data is a vital but challenging inverse problem. Deep learning approaches have achieved promising results, but often requires paired measurement and high-quality images, which is typically simulated through a forward model, i.e., retrospective reconstruction. However, training on simulated pairs commonly leads to performance degradation on real prospective data due to the retrospective-to-prospective gap caused by incomplete imaging knowledge in simulation. To address this challenge, this paper introduces imaging Knowledge-Informed Dynamic Optimal Transport (KIDOT), a novel dynamic optimal transport framework with optimality in the sense of preserving consistency with imaging physics in transport, that conceptualizes reconstruction as finding a dynamic transport path. KIDOT learns from unpaired data by modeling reconstruction as a continuous evolution path from measurements to images, guided by an imaging knowledge-informed cost function and transport equation. This dynamic and knowledge-aware approach enhances robustness and better leverages unpaired data while respecting acquisition physics. Theoretically, we demonstrate that KIDOT naturally generalizes dynamic optimal transport, ensuring its mathematical rationale and solution existence. Extensive experiments on MRI and CT reconstruction demonstrate KIDOT's superior performance.

168. 【2505.17582】Distance Estimation in Outdoor Driving Environments Using Phase-only Correlation Method with Event Cameras

链接https://arxiv.org/abs/2505.17582

作者:Masataka Kobayashi(1),Shintaro Shiba(2),Quan Kong(2),Norimasa Kobori(2),Tsukasa Shimizu(3),Shan Lu(1),Takaya Yamazato(1) ((1) School of Engineering, Nagoya University, Nagoya, Japan, (2) Woven by Toyota, Inc., Tokyo, Japan, (3) Toyota Motor Corporation, Toyota, Japan)

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:reliable operation, growing adoption, technology is crucial, crucial for ensuring, ensuring safety

备注: 6 pages, 7 figures. To appear in IEEE Intelligent Vehicles Symposium (IV) 2025

点击查看摘要

Abstract:With the growing adoption of autonomous driving, the advancement of sensor technology is crucial for ensuring safety and reliable operation. Sensor fusion techniques that combine multiple sensors such as LiDAR, radar, and cameras have proven effective, but the integration of multiple devices increases both hardware complexity and cost. Therefore, developing a single sensor capable of performing multiple roles is highly desirable for cost-efficient and scalable autonomous driving systems. Event cameras have emerged as a promising solution due to their unique characteristics, including high dynamic range, low latency, and high temporal resolution. These features enable them to perform well in challenging lighting conditions, such as low-light or backlit environments. Moreover, their ability to detect fine-grained motion events makes them suitable for applications like pedestrian detection and vehicle-to-infrastructure communication via visible light. In this study, we present a method for distance estimation using a monocular event camera and a roadside LED bar. By applying a phase-only correlation technique to the event data, we achieve sub-pixel precision in detecting the spatial shift between two light sources. This enables accurate triangulation-based distance estimation without requiring stereo vision. Field experiments conducted in outdoor driving scenarios demonstrated that the proposed approach achieves over 90% success rate with less than 0.5-meter error for distances ranging from 20 to 60 meters. Future work includes extending this method to full position estimation by leveraging infrastructure such as smart poles equipped with LEDs, enabling event-camera-based vehicles to determine their own position in real time. This advancement could significantly enhance navigation accuracy, route optimization, and integration into intelligent transportation systems.

Comments:
6 pages, 7 figures. To appear in IEEE Intelligent Vehicles Symposium (IV) 2025

Subjects:

Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

ACMclasses:
I.4.8; I.2.10; I.5.4

Cite as:
arXiv:2505.17582 [eess.IV]

(or
arXiv:2505.17582v1 [eess.IV] for this version)

https://doi.org/10.48550/arXiv.2505.17582

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Masataka Kobayashi [view email] [v1]
Fri, 23 May 2025 07:44:33 UTC (2,483 KB)

169. 【2505.17544】FreqU-FNet: Frequency-Aware U-Net for Imbalanced Medical Image Segmentation

链接https://arxiv.org/abs/2505.17544

作者:Ruiqi Xing

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:faces persistent challenges, persistent challenges due, severe class imbalance, image segmentation faces, segmentation faces persistent

备注: 15 pages, 1 figure

点击查看摘要

Abstract:Medical image segmentation faces persistent challenges due to severe class imbalance and the frequency-specific distribution of anatomical structures. Most conventional CNN-based methods operate in the spatial domain and struggle to capture minority class signals, often affected by frequency aliasing and limited spectral selectivity. Transformer-based models, while powerful in modeling global dependencies, tend to overlook critical local details necessary for fine-grained segmentation. To overcome these limitations, we propose FreqU-FNet, a novel U-shaped segmentation architecture operating in the frequency domain. Our framework incorporates a Frequency Encoder that leverages Low-Pass Frequency Convolution and Daubechies wavelet-based downsampling to extract multi-scale spectral features. To reconstruct fine spatial details, we introduce a Spatial Learnable Decoder (SLD) equipped with an adaptive multi-branch upsampling strategy. Furthermore, we design a frequency-aware loss (FAL) function to enhance minority class learning. Extensive experiments on multiple medical segmentation benchmarks demonstrate that FreqU-FNet consistently outperforms both CNN and Transformer baselines, particularly in handling under-represented classes, by effectively exploiting discriminative frequency bands.

170. 【2505.17528】DECT-based Space-Squeeze Method for Multi-Class Classification of Metastatic Lymph Nodes in Breast Cancer

链接https://arxiv.org/abs/2505.17528

作者:Hai Jiang,Chushan Zheng,Jiawei Pan,Yuanpin Zhou,Qiongting Liu,Xiang Zhang,Jun Shen,Yao Lu

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:cancer treatment decisions, guiding breast cancer, breast cancer treatment, conventional imaging modalities, imaging modalities struggle

备注

点击查看摘要

Abstract:Background: Accurate assessment of metastatic burden in axillary lymph nodes is crucial for guiding breast cancer treatment decisions, yet conventional imaging modalities struggle to differentiate metastatic burden levels and capture comprehensive lymph node characteristics. This study leverages dual-energy computed tomography (DECT) to exploit spectral-spatial information for improved multi-class classification. Purpose: To develop a noninvasive DECT-based model classifying sentinel lymph nodes into three categories: no metastasis ($N_0$), low metastatic burden ($N_{+(1-2)}$), and heavy metastatic burden ($N_{+(\geq3)}$), thereby aiding therapeutic planning. Methods: We propose a novel space-squeeze method combining two innovations: (1) a channel-wise attention mechanism to compress and recalibrate spectral-spatial features across 11 energy levels, and (2) virtual class injection to sharpen inter-class boundaries and compact intra-class variations in the representation space. Results: Evaluated on 227 biopsy-confirmed cases, our method achieved an average test AUC of 0.86 (95% CI: 0.80-0.91) across three cross-validation folds, outperforming established CNNs (VGG, ResNet, etc). The channel-wise attention and virtual class components individually improved AUC by 5.01% and 5.87%, respectively, demonstrating complementary benefits. Conclusions: The proposed framework enhances diagnostic AUC by effectively integrating DECT's spectral-spatial data and mitigating class ambiguity, offering a promising tool for noninvasive metastatic burden assessment in clinical practice.

171. 【2505.17484】Anatomy-Guided Multitask Learning for MRI-Based Classification of Placenta Accreta Spectrum and its Subtypes

链接https://arxiv.org/abs/2505.17484

作者:Hai Jiang,Qiongting Liu,Yuanpin Zhou,Jiawei Pan,Ting Song,Yao Lu

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Accreta Spectrum Disorders, Placenta Accreta Spectrum, Spectrum Disorders, pose significant risks, bleeding severity correlating

备注

点击查看摘要

Abstract:Placenta Accreta Spectrum Disorders (PAS) pose significant risks during pregnancy, frequently leading to postpartum hemorrhage during cesarean deliveries and other severe clinical complications, with bleeding severity correlating to the degree of placental invasion. Consequently, accurate prenatal diagnosis of PAS and its subtypes-placenta accreta (PA), placenta increta (PI), and placenta percreta (PP)-is crucial. However, existing guidelines and methodologies predominantly focus on the presence of PAS, with limited research addressing subtype recognition. Additionally, previous multi-class diagnostic efforts have primarily relied on inefficient two-stage cascaded binary classification tasks. In this study, we propose a novel convolutional neural network (CNN) architecture designed for efficient one-stage multiclass diagnosis of PAS and its subtypes, based on 4,140 magnetic resonance imaging (MRI) slices. Our model features two branches: the main classification branch utilizes a residual block architecture comprising multiple residual blocks, while the second branch integrates anatomical features of the uteroplacental area and the adjacent uterine serous layer to enhance the model's attention during classification. Furthermore, we implement a multitask learning strategy to leverage both branches effectively. Experiments conducted on a real clinical dataset demonstrate that our model achieves state-of-the-art performance.

172. 【2505.17472】SUFFICIENT: A scan-specific unsupervised deep learning framework for high-resolution 3D isotropic fetal brain MRI reconstruction

链接https://arxiv.org/abs/2505.17472

作者:Jiangjie Wu,Lixuan Chen,Zhenghao Li,Xin Li,Saban Ozturk,Lihui Wang,Rongpin Wang,Hongjiang Wei,Yuyao Zhang

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:clinical fetal MRI, High-quality, fetal brain MRI, volume, SVR

备注

点击查看摘要

Abstract:High-quality 3D fetal brain MRI reconstruction from motion-corrupted 2D slices is crucial for clinical diagnosis. Reliable slice-to-volume registration (SVR)-based motion correction and super-resolution reconstruction (SRR) methods are essential. Deep learning (DL) has demonstrated potential in enhancing SVR and SRR when compared to conventional methods. However, it requires large-scale external training datasets, which are difficult to obtain for clinical fetal MRI. To address this issue, we propose an unsupervised iterative SVR-SRR framework for isotropic HR volume reconstruction. Specifically, SVR is formulated as a function mapping a 2D slice and a 3D target volume to a rigid transformation matrix, which aligns the slice to the underlying location in the target volume. The function is parameterized by a convolutional neural network, which is trained by minimizing the difference between the volume slicing at the predicted position and the input slice. In SRR, a decoding network embedded within a deep image prior framework is incorporated with a comprehensive image degradation model to produce the high-resolution (HR) volume. The deep image prior framework offers a local consistency prior to guide the reconstruction of HR volumes. By performing a forward degradation model, the HR volume is optimized by minimizing loss between predicted slices and the observed slices. Comprehensive experiments conducted on large-magnitude motion-corrupted simulation data and clinical data demonstrate the superior performance of the proposed framework over state-of-the-art fetal brain reconstruction frameworks.

173. 【2505.17210】Assessing the generalization performance of SAM for ureteroscopy scene understanding

链接https://arxiv.org/abs/2505.17210

作者:Martin Villagrana,Francisco Lopez-Tiro,Clement Larose,Gilberto Ochoa-Ruiz,Christian Daul

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:critical preliminary step, urinary stone types, types through machine, kidney stone segmentation, critical preliminary

备注: 15 pages, 4 figures, 2 tables, conference, MIUA25

点击查看摘要

Abstract:The segmentation of kidney stones is regarded as a critical preliminary step to enable the identification of urinary stone types through machine- or deep-learning-based approaches. In urology, manual segmentation is considered tedious and impractical due to the typically large scale of image databases and the continuous generation of new data. In this study, the potential of the Segment Anything Model (SAM) -- a state-of-the-art deep learning framework -- is investigated for the automation of kidney stone segmentation. The performance of SAM is evaluated in comparison to traditional models, including U-Net, Residual U-Net, and Attention U-Net, which, despite their efficiency, frequently exhibit limitations in generalizing to unseen datasets. The findings highlight SAM's superior adaptability and efficiency. While SAM achieves comparable performance to U-Net on in-distribution data (Accuracy: 97.68 + 3.04; Dice: 97.78 + 2.47; IoU: 95.76 + 4.18), it demonstrates significantly enhanced generalization capabilities on out-of-distribution data, surpassing all U-Net variants by margins of up to 23 percent.

174. 【2505.17096】AGS: 3D Tumor-Adaptive Guidance for SAM

链接https://arxiv.org/abs/2505.17096

作者:Sirui Li,Linkai Peng,Zheyuan Zhang,Gorkem Durak,Ulas Bagci

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:recently shown great, shown great promise, Foundation models, segmentation-remains underexplored, Tumor Adaptive Guidance

备注

点击查看摘要

Abstract:Foundation models (FMs) such as CLIP and SAM have recently shown great promise in image segmentation tasks, yet their adaptation to 3D medical imaging-particularly for pathology detection and segmentation-remains underexplored. A critical challenge arises from the domain gap between natural images and medical volumes: existing FMs, pre-trained on 2D data, struggle to capture 3D anatomical context, limiting their utility in clinical applications like tumor segmentation. To address this, we propose an adaptation framework called TAGS: Tumor Adaptive Guidance for SAM, which unlocks 2D FMs for 3D medical tasks through multi-prompt fusion. By preserving most of the pre-trained weights, our approach enhances SAM's spatial feature extraction using CLIP's semantic insights and anatomy-specific prompts. Extensive experiments on three open-source tumor segmentation datasets prove that our model surpasses the state-of-the-art medical image segmentation models (+46.88% over nnUNet), interactive segmentation frameworks, and other established medical FMs, including SAM-Med2D, SAM-Med3D, SegVol, Universal, 3D-Adapter, and SAM-B (at least +13% over them). This highlights the robustness and adaptability of our proposed framework across diverse medical segmentation tasks.