本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新501篇论文,其中:

  • 自然语言处理76
  • 信息检索17
  • 计算机视觉104

自然语言处理

1. 【2510.15870】OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

链接https://arxiv.org/abs/2510.15870

作者:Hanrong Ye,Chao-Han Huck Yang,Arushi Goel,Wei Huang,Ligeng Zhu,Yuanhang Su,Sean Lin,An-Chieh Cheng,Zhen Wan,Jinchuan Tian,Yuming Lou,Dong Yang,Zhijian Liu,Yukang Chen,Ambrish Dantrey,Ehsan Jahangiri,Sreyan Ghosh,Daguang Xu,Ehsan Hosseini-Asl,Danial Mohseni Taheri,Vidya Murali,Sifei Liu,Jason Lu,Oluwatobi Olabiyi,Frank Wang,Rafael Valle,Bryan Catanzaro,Andrew Tao,Song Han,Jan Kautz,Hongxu Yin,Pavlo Molchanov

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Advancing machine intelligence, machine intelligence requires, intelligence requires developing, Advancing machine, sense the world

备注: Technical Report. Code: [this https URL](https://github.com/NVlabs/OmniVinci)

点击查看摘要

Abstract:Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.

2. 【2510.15863】PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction

链接https://arxiv.org/abs/2510.15863

作者:Simon Yu,Gang Li,Weiyan Shi,Peng Qi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Large language, language models, moving beyond static, interaction with external

备注: 29 pages, 6 figures, 8 tables

点击查看摘要

Abstract:Large language models (LLMs) are moving beyond static uses and are now powering agents that learn continually during their interaction with external environments. For example, agents can learn reusable skills while navigating web pages or toggling new tools. However, existing methods for skill learning often create skills that are over-specialized to a single website and fail to generalize. We introduce PolySkill, a new framework that enables agents to learn generalizable and compositional skills. The core idea, inspired by polymorphism in software engineering, is to decouple a skill's abstract goal (what it accomplishes) and its concrete implementation (how it is executed). Experiments show that our method (1) improves skill reuse by 1.7x on seen websites and (2) boosts success rates by up to 9.4% on Mind2Web and 13.9% on unseen websites, while reducing steps by over 20%. (3) In self-exploration settings without specified tasks, our framework improves the quality of proposed tasks and enables agents to learn generalizable skills that work across different sites. By enabling the agent to identify and refine its own goals, the PolySkill enhances the agent's ability to learn a better curriculum, leading to the acquisition of more generalizable skills compared to baseline methods. This work provides a practical path toward building agents capable of continual learning in adaptive environments. Our findings show that separating a skill's goal from its execution is a crucial step toward developing autonomous agents that can learn and generalize across the open web continuously.

3. 【2510.15859】InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

链接https://arxiv.org/abs/2510.15859

作者:Pengkai Wang,Qi Zuo,Pengwei Liu,Zhijie Sang,Congkai Xie,Hongxia Yang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, shown substantial advances, Language Models, programmatically verified

备注: 17 pages, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from a well-defined operational base guided by explicit rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates syn- thetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, instead utilizing rubric-guided feedback to shape learning. When implemented on the Qwen3-4B-Instruct model, our method can greatly enhance its performance on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k samples, thus achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fos-ters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.

4. 【2510.15851】SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling

链接https://arxiv.org/abs/2510.15851

作者:Kadri Hacioglu,Manjunath K E,Andreas Stolcke

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:spoken language understanding, natural language understanding, traditionally implemented, speech understanding tasks, language understanding

备注: 13 pages, EMNLP 2025

点击查看摘要

Abstract:Slot filling is a crucial subtask in spoken language understanding (SLU), traditionally implemented as a cascade of speech recognition followed by one or more natural language understanding (NLU) components. The recent advent of speech-based large language models (speechLLMs), which integrate speech and textual foundation models, has opened new avenues for achieving speech understanding tasks in a more unified, generative, and instruction-following manner while promising data and compute efficiency with zero-shot abilities, generalizing to unseen slot labels. We address the slot-filling task by creating an empirical upper bound for the task, identifying performance, robustness, and generalization gaps, and proposing improvements to the training data, architecture, and training strategies to narrow the gap with the upper bound result. We show that each of these measures improve performance substantially, while highlighting practical challenges and providing empirical guidance and insights for harnessing these emerging models.

5. 【2510.15843】Enhanced Sentiment Interpretation via a Lexicon-Fuzzy-Transformer Framework

链接https://arxiv.org/abs/2510.15843

作者:Shayan Rokhva,Mousa Alizadeh,Maryam Abdollahi Shamami

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Accurately detecting sentiment, social media posts, media posts remains, posts remains challenging, remains challenging due

备注

点击查看摘要

Abstract:Accurately detecting sentiment polarity and intensity in product reviews and social media posts remains challenging due to informal and domain-specific language. To address this, we propose a novel hybrid lexicon-fuzzy-transformer framework that combines rule-based heuristics, contextual deep learning, and fuzzy logic to generate continuous sentiment scores reflecting both polarity and strength. The pipeline begins with VADER-based initial sentiment estimations, which are refined through a two-stage adjustment process. This involves leveraging confidence scores from DistilBERT, a lightweight transformer and applying fuzzy logic principles to mitigate excessive neutrality bias and enhance granularity. A custom fuzzy inference system then maps the refined scores onto a 0 to 1 continuum, producing expert)like judgments. The framework is rigorously evaluated on four domain-specific datasets. food delivery, e-commerce, tourism, and fashion. Results show improved alignment with user ratings, better identification of sentiment extremes, and reduced misclassifications. Both quantitative metrics (distributional alignment, confusion matrices) and qualitative insights (case studies, runtime analysis) affirm the models robustness and efficiency. This work demonstrates the value of integrating symbolic reasoning with neural models for interpretable, finegrained sentiment analysis in linguistically dynamic domains.

6. 【2510.15842】Paper2Web: Let's Make Your Paper Alive!

链接https://arxiv.org/abs/2510.15842

作者:Yuhang Chen,Tianpeng Lv,Siyi Zhang,Yixiang Yin,Yao Wan,Philip S. Yu,Dongping Chen

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:effectively disseminate research, enable intuitive navigation, Academic project websites, Large Language Model, academic webpage generation

备注: Under Review. Check [this https URL](https://github.com/YuhangChen1/Paper2All) for the unified platform to streamline all academic presentation

点击查看摘要

Abstract:Academic project websites can more effectively disseminate research when they clearly present core content and enable intuitive navigation and interaction. However, current approaches such as direct Large Language Model (LLM) generation, templates, or direct HTML conversion struggle to produce layout-aware, interactive sites, and a comprehensive evaluation suite for this task has been lacking. In this paper, we introduce Paper2Web, a benchmark dataset and multi-dimensional evaluation framework for assessing academic webpage generation. It incorporates rule-based metrics like Connectivity, Completeness and human-verified LLM-as-a-Judge (covering interactivity, aesthetics, and informativeness), and PaperQuiz, which measures paper-level knowledge retention. We further present PWAgent, an autonomous pipeline that converts scientific papers into interactive and multimedia-rich academic homepages. The agent iteratively refines both content and layout through MCP tools that enhance emphasis, balance, and presentation quality. Our experiments show that PWAgent consistently outperforms end-to-end baselines like template-based webpages and arXiv/alphaXiv versions by a large margin while maintaining low cost, achieving the Pareto-front in academic webpage generation.

7. 【2510.15804】Emergence of Linear Truth Encodings in Language Models

链接https://arxiv.org/abs/2510.15804

作者:Shauli Ravfogel,Gilad Yehudai,Tal Linzen,Joan Bruna,Alberto Bietti

类目:Computation and Language (cs.CL)

关键词:Recent probing studies, probing studies reveal, Recent probing, emergence is unclear, exhibit linear subspaces

备注: Accepted in Neurips 2025

点击查看摘要

Abstract:Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and vice-versa), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then -- over a longer horizon -- learn to linearly separate true from false, which in turn lowers language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.

8. 【2510.15768】On Non-interactive Evaluation of Animal Communication Translators

链接https://arxiv.org/abs/2510.15768

作者:Orr Paradise,David F. Gruber,Adam Tauman Kalai

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Abstract, translations, evaluate translators, evaluate, working

备注

点击查看摘要

Abstract:If you had an AI Whale-to-English translator, how could you validate whether or not it is working? Does one need to interact with the animals or rely on grounded observations such as temperature? We provide theoretical and proof-of-concept experimental evidence suggesting that interaction and even observations may not be necessary for sufficiently complex languages. One may be able to evaluate translators solely by their English outputs, offering potential advantages in terms of safety, ethics, and cost. This is an instance of machine translation quality evaluation (MTQE) without any reference translations available. A key challenge is identifying ``hallucinations,'' false translations which may appear fluent and plausible. We propose using segment-by-segment translation together with the classic NLP shuffle test to evaluate translators. The idea is to translate animal communication, turn by turn, and evaluate how often the resulting translations make more sense in order than permuted. Proof-of-concept experiments on data-scarce human languages and constructed languages demonstrate the potential utility of this evaluation methodology. These human-language experiments serve solely to validate our reference-free metric under data scarcity. It is found to correlate highly with a standard evaluation based on reference translations, which are available in our experiments. We also perform a theoretical analysis suggesting that interaction may not be necessary nor efficient in the early stages of learning to translate.

9. 【2510.15746】LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation

链接https://arxiv.org/abs/2510.15746

作者:Gao Yang,Yuhang Liu,Siyu Miao,Xinyue Liang,Zhengyang Liu,Heyan Huang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, Ideal or real, http URL, language models, explore whether principles

备注

点击查看摘要

Abstract:Ideal or real - that is the this http URL this work, we explore whether principles from game theory can be effectively applied to the evaluation of large language models (LLMs). This inquiry is motivated by the growing inadequacy of conventional evaluation practices, which often rely on fixed-format tasks with reference answers and struggle to capture the nuanced, subjective, and open-ended nature of modern LLM behavior. To address these challenges, we propose a novel alternative: automatic mutual evaluation, where LLMs assess each other's output through self-play and peer review. These peer assessments are then systematically compared with human voting behavior to evaluate their alignment with human judgment. Our framework incorporates game-theoretic voting algorithms to aggregate peer reviews, enabling a principled investigation into whether model-generated rankings reflect human preferences. Empirical results reveal both convergences and divergences between theoretical predictions and human evaluations, offering valuable insights into the promises and limitations of mutual evaluation. To the best of our knowledge, this is the first work to jointly integrate mutual evaluation, game-theoretic aggregation, and human-grounded validation for evaluating the capabilities of LLMs.

10. 【2510.15731】Attention Sinks in Diffusion Language Models

链接https://arxiv.org/abs/2510.15731

作者:Maximo Eduardo Rulli,Simone Petruzzi,Edoardo Michielon,Fabrizio Silvestri,Simone Scardapane,Alessio Devoto

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Masked Diffusion Language, Masked Diffusion, Diffusion Language Models, Diffusion Language, traditional Autoregressive Models

备注

点击查看摘要

Abstract:Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). DLMs employ transformer encoders with bidirectional attention, enabling parallel token generation while maintaining competitive performance. Although their efficiency and effectiveness have been extensively studied, the internal mechanisms that govern DLMs remain largely unexplored. In this work, we conduct an empirical analysis of DLM attention patterns, focusing on the attention sinking phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, the sink positions in DLMs tend to shift throughout the generation process, displaying a dynamic behaviour. Second, while ARMs are highly sensitive to the removal of attention sinks, DLMs remain robust: masking sinks leads to only a minor degradation in performance. These results provide new insights into the inner workings of diffusion-based language models and highlight fundamental differences in how they allocate and utilize attention compared to autoregressive models.

11. 【2510.15719】Cost-Aware Retrieval-Augmentation Reasoning Models with Adaptive Retrieval Depth

链接https://arxiv.org/abs/2510.15719

作者:Helia Hashemi,Victor Rühle,Saravan Rajmohan

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:gained significant attention, significant attention due, strong performance, attention due, Reasoning models

备注

点击查看摘要

Abstract:Reasoning models have gained significant attention due to their strong performance, particularly when enhanced with retrieval augmentation. However, these models often incur high computational costs, as both retrieval and reasoning tokens contribute substantially to the overall resource usage. In this work, we make the following contributions: (1) we propose a retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list based on the query and retrieval results; (2) we develop a cost-aware advantage function for training of efficient retrieval-augmented reasoning models through reinforcement learning; and (3) we explore both memory- and latency-bound implementations of the proposed cost-aware framework for both proximal and group relative policy optimization algorithms. We evaluate our approach on seven public question answering datasets and demonstrate significant efficiency gains, without compromising effectiveness. In fact, we observed that the model latency decreases by ~16-20% across datasets, while its effectiveness increases by ~5% on average, in terms of exact match.

12. 【2510.15706】GraphMind: Interactive Novelty Assessment System for Accelerating Scientific Discovery

链接https://arxiv.org/abs/2510.15706

作者:Italo Luis da Silva,Hanqi Yan,Lin Gui,Yulan He

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, text generation capabilities, show strong reasoning

备注: 9 pages, 6 figures, 3 tables, EMNLP 2025 Demo paper

点击查看摘要

Abstract:Large Language Models (LLMs) show strong reasoning and text generation capabilities, prompting their use in scientific literature analysis, including novelty assessment. While evaluating novelty of scientific papers is crucial for peer review, it requires extensive knowledge of related work, something not all reviewers have. While recent work on LLM-assisted scientific literature analysis supports literature comparison, existing approaches offer limited transparency and lack mechanisms for result traceability via an information retrieval module. To address this gap, we introduce $\textbf{GraphMind}$, an easy-to-use interactive web tool designed to assist users in evaluating the novelty of scientific papers or drafted ideas. Specially, $\textbf{GraphMind}$ enables users to capture the main structure of a scientific paper, explore related ideas through various perspectives, and assess novelty via providing verifiable contextual insights. $\textbf{GraphMind}$ enables users to annotate key elements of a paper, explore related papers through various relationships, and assess novelty with contextual insight. This tool integrates external APIs such as arXiv and Semantic Scholar with LLMs to support annotation, extraction, retrieval and classification of papers. This combination provides users with a rich, structured view of a scientific idea's core contributions and its connections to existing work. $\textbf{GraphMind}$ is available at this https URL and a demonstration video at this https URL. The source code is available at this https URL.

13. 【2510.15685】Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection

链接https://arxiv.org/abs/2510.15685

作者:Joshua Wolfe Brook,Ilia Markov

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Hate Speech Detection, Language Models, dynamic knowledge bases, Large Language

备注: 8 pages, 9 figures, submitted to LREC 2026

点击查看摘要

Abstract:This research introduces a novel approach to textual and multimodal Hate Speech Detection (HSD), using Large Language Models (LLMs) as dynamic knowledge bases to generate background context and incorporate it into the input of HSD classifiers. Two context generation strategies are examined: one focused on named entities and the other on full-text prompting. Four methods of incorporating context into the classifier input are compared: text concatenation, embedding concatenation, a hierarchical transformer-based fusion, and LLM-driven text enhancement. Experiments are conducted on the textual Latent Hatred dataset of implicit hate speech and applied in a multimodal setting on the MAMI dataset of misogynous memes. Results suggest that both the contextual information and the method by which it is incorporated are key, with gains of up to 3 and 6 F1 points on textual and multimodal setups respectively, from a zero-context baseline to the highest-performing system, based on embedding concatenation.

14. 【2510.15682】SQuAI: Scientific Question-Answering with Multi-Agent Retrieval-Augmented Generation

链接https://arxiv.org/abs/2510.15682

作者:Ines Besrour,Jingbo He,Tobias Schreieder,Michael Färber

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:large language models, https URL, multi-agent retrieval-augmented generation, retrieval-augmented generation, language models

备注: Accepted at CIKM 2025

点击查看摘要

Abstract:We present SQuAI (this https URL), a scalable and trustworthy multi-agent retrieval-augmented generation (RAG) framework for scientific question answering (QA) with large language models (LLMs). SQuAI addresses key limitations of existing RAG systems in the scholarly domain, where complex, open-domain questions demand accurate answers, explicit claims with citations, and retrieval across millions of scientific documents. Built on over 2.3 million full-text papers from arXiv.org, SQuAI employs four collaborative agents to decompose complex questions into sub-questions, retrieve targeted evidence via hybrid sparse-dense retrieval, and adaptively filter documents to improve contextual relevance. To ensure faithfulness and traceability, SQuAI integrates in-line citations for each generated claim and provides supporting sentences from the source documents. Our system improves faithfulness, answer relevance, and contextual relevance by up to +0.088 (12%) over a strong RAG baseline. We further release a benchmark of 1,000 scientific question-answer-evidence triplets to support reproducibility. With transparent reasoning, verifiable citations, and domain-wide scalability, SQuAI demonstrates how multi-agent RAG enables more trustworthy scientific QA with LLMs.

15. 【2510.15624】Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation

链接https://arxiv.org/abs/2510.15624

作者:Ed Li,Junyu Ren,Xintian Pan,Cat Yan,Chuanhao Li,Dirk Bergemann,Zhuoran Yang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词:Artificial Intelligence, milestone in Artificial, textit, scientific discovery represents, discovery represents

备注: 37 pages, 5 figures. Code: [this https URL](https://github.com/ltjed/freephdlabor)

点击查看摘要

Abstract:The automation of scientific discovery represents a critical milestone in Artificial Intelligence (AI) research. However, existing agentic systems for science suffer from two fundamental limitations: rigid, pre-programmed workflows that cannot adapt to intermediate findings, and inadequate context management that hinders long-horizon research. We present \texttt{freephdlabor}, an open-source multiagent framework featuring \textit{fully dynamic workflows} determined by real-time agent reasoning and a \coloremph{\textit{modular architecture}} enabling seamless customization -- users can modify, add, or remove agents to address domain-specific requirements. The framework provides comprehensive infrastructure including \textit{automatic context compaction}, \textit{workspace-based communication} to prevent information degradation, \textit{memory persistence} across sessions, and \textit{non-blocking human intervention} mechanisms. These features collectively transform automated research from isolated, single-run attempts into \textit{continual research programs} that build systematically on prior explorations and incorporate human feedback. By providing both the architectural principles and practical implementation for building customizable co-scientist systems, this work aims to facilitate broader adoption of automated research across scientific domains, enabling practitioners to deploy interactive multiagent systems that autonomously conduct end-to-end research -- from ideation through experimentation to publication-ready manuscripts.

16. 【2510.15614】HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination

链接https://arxiv.org/abs/2510.15614

作者:Tingting Chen,Beibei Lin,Zifeng Yuan,Qiran Zou,Hongyu He,Yew-Soon Ong,Anirudh Goyal,Dianbo Liu

类目:Computation and Language (cs.CL)

关键词:correct answer-becomes critical, single correct answer-becomes, evaluating their ability, answer-becomes critical, ability to propose

备注

点击查看摘要

Abstract:As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations-not just a single correct answer-becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe-rather than a leaderboard-for methods that explicitly explore and cover admissible explanation spaces. Code is available at: this https URL.

17. 【2510.15600】Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism

链接https://arxiv.org/abs/2510.15600

作者:Haoran Sun,Yankai Jiang,Zhenyu Tang,Yaning Pan,Shuang Gu,Zekai Lin,Lilong Wang,Wenjie Lou,Lei Liu,Lei Bai,Xiaosong Wang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:reproducible science lies, logically ordered, foundation of reproducible, reproducible science, science lies

备注

点击查看摘要

Abstract:The foundation of reproducible science lies in protocols that are precise, logically ordered, and executable. The autonomous generation of these protocols through natural language queries could greatly improve the efficiency of the reproduction process. However, current leading large language models (LLMs) often generate incomplete or inconsistent protocols, limiting their utility. To address this limitation, we first introduce SciRecipe, a large-scale dataset of over 12K structured protocols spanning 27 biological subfields and encompassing both comprehension and problem-solving tasks. To further improve protocol generation, we propose the "Sketch-and-Fill" paradigm, which separates analysis, structuring, and expression to ensure each step is explicit and verifiable. Complementing this, the structured component-based reward mechanism evaluates step granularity, action order, and semantic fidelity, aligning model optimization with experimental reliability. Building on these components, we develop Thoth, trained through a staged Knowledge-to-Action process that progresses from knowledge acquisition to operational reasoning and ultimately to robust, executable protocol generation. Across multiple benchmarks, Thoth consistently surpasses both proprietary and open-source LLMs, achieving significant improvements in step alignment, logical sequencing, and semantic accuracy. Our approach paves the way for reliable scientific assistants that bridge knowledge with experimental execution. All data, code, and models will be released publicly.

18. 【2510.15594】he Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works

链接https://arxiv.org/abs/2510.15594

作者:Antoine Bourgois,Thierry Poibeau

类目:Computation and Language (cs.CL)

关键词:computational literature researchers, remain surprisingly scarce, documents remain surprisingly, literature researchers, surprisingly scarce

备注

点击查看摘要

Abstract:While coreference resolution is attracting more interest than ever from computational literature researchers, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive and scales effectively to long documents. Finally, we demonstrate its usefulness to infer the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks.

19. 【2510.15585】Leveraging Test Driven Development with Large Language Models for Reliable and Verifiable Spreadsheet Code Generation: A Research Framework

链接https://arxiv.org/abs/2510.15585

作者:Dr Simon Thorne,Dr Advait Sarkar

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL); Programming Languages (cs.PL)

关键词:Large Language Models, traditional software code, Large Language, increasingly leveraged, leveraged for generating

备注: 16 pages

点击查看摘要

Abstract:Large Language Models (LLMs), such as ChatGPT, are increasingly leveraged for generating both traditional software code and spreadsheet logic. Despite their impressive generative capabilities, these models frequently exhibit critical issues such as hallucinations, subtle logical inconsistencies, and syntactic errors, risks particularly acute in high stakes domains like financial modelling and scientific computations, where accuracy and reliability are paramount. This position paper proposes a structured research framework that integrates the proven software engineering practice of Test-Driven Development (TDD) with Large Language Model (LLM) driven generation to enhance the correctness of, reliability of, and user confidence in generated outputs. We hypothesise that a "test first" methodology provides both technical constraints and cognitive scaffolding, guiding LLM outputs towards more accurate, verifiable, and comprehensible solutions. Our framework, applicable across diverse programming contexts, from spreadsheet formula generation to scripting languages such as Python and strongly typed languages like Rust, includes an explicitly outlined experimental design with clearly defined participant groups, evaluation metrics, and illustrative TDD based prompting examples. By emphasising test driven thinking, we aim to improve computational thinking, prompt engineering skills, and user engagement, particularly benefiting spreadsheet users who often lack formal programming training yet face serious consequences from logical errors. We invite collaboration to refine and empirically evaluate this approach, ultimately aiming to establish responsible and reliable LLM integration in both educational and professional development practices.

20. 【2510.15577】BiMax: Bidirectional MaxSim Score for Document-Level Alignment

链接https://arxiv.org/abs/2510.15577

作者:Xiaotian Wang,Takehito Utsuro,Masaaki Nagata

类目:Computation and Language (cs.CL)

关键词:source and target, target languages, Thompson and Koehn, Optimal Transport, web domain

备注: accepted at Findings of EMNLP2025

点击查看摘要

Abstract:Document alignment is necessary for the hierarchical mining (Bañón et al., 2020; Morishita et al., 2022), which aligns documents across source and target languages within the same web domain. Several high precision sentence embedding-based methods have been developed, such as TK-PERT (Thompson and Koehn, 2020) and Optimal Transport (OT) (Clark et al., 2019; El-Kishky and Guzmán, 2020). However, given the massive scale of web mining data, both accuracy and speed must be considered. In this paper, we propose a cross-lingual Bidirectional Maxsim score (BiMax) for computing doc-to-doc similarity, to improve efficiency compared to the OT method. Consequently, on the WMT16 bilingual document alignment task, BiMax attains accuracy comparable to OT with an approximate 100-fold speed increase. Meanwhile, we also conduct a comprehensive analysis to investigate the performance of current state-of-the-art multilingual sentence embedding models. All the alignment methods in this paper are publicly available as a tool called EmbDA (this https URL).

21. 【2510.15569】From Ghazals to Sonnets: Decoding the Polysemous Expressions of Love Across Languages

链接https://arxiv.org/abs/2510.15569

作者:Syed Mohammad Sualeh Ali

类目:Computation and Language (cs.CL)

关键词:Urdu poetry, exploring its thematic, lens of polysemy, paper delves, intricate world

备注

点击查看摘要

Abstract:This paper delves into the intricate world of Urdu poetry, exploring its thematic depths through a lens of polysemy. By focusing on the nuanced differences between three seemingly synonymous words (pyaar, muhabbat, and ishq) we expose a spectrum of emotions and experiences unique to the Urdu language. This study employs a polysemic case study approach, meticulously examining how these words are interwoven within the rich tapestry of Urdu poetry. By analyzing their usage and context, we uncover a hidden layer of meaning, revealing subtle distinctions which lack direct equivalents in English literature. Furthermore, we embark on a comparative analysis, generating word embeddings for both Urdu and English terms related to love. This enables us to quantify and visualize the semantic space occupied by these words, providing valuable insights into the cultural and linguistic nuances of expressing love. Through this multifaceted approach, our study sheds light on the captivating complexities of Urdu poetry, offering a deeper understanding and appreciation for its unique portrayal of love and its myriad expressions

22. 【2510.15561】Finetuning LLMs for EvaCun 2025 token prediction shared task

链接https://arxiv.org/abs/2510.15561

作者:Josef Jon,Ondřej Bojar

类目:Computation and Language (cs.CL)

关键词:token prediction task, present our submission, Aya Expanse, task data provided, token prediction

备注

点击查看摘要

Abstract:In this paper, we present our submission for the token prediction task of EvaCun 2025. Our sys-tems are based on LLMs (Command-R, Mistral, and Aya Expanse) fine-tuned on the task data provided by the organizers. As we only pos-sess a very superficial knowledge of the subject field and the languages of the task, we simply used the training data without any task-specific adjustments, preprocessing, or filtering. We compare 3 different approaches (based on 3 different prompts) of obtaining the predictions, and we evaluate them on a held-out part of the data.

23. 【2510.15558】KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models

链接https://arxiv.org/abs/2510.15558

作者:Dongjun Kim,Chanhee Park,Chanjun Park,Heuiseok Lim

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:complex reasoning systems, large language models, numerous applications, pivotal for numerous, conversational agents

备注: 13 pages, 3 figures, 5 tables

点击查看摘要

Abstract:The instruction-following capabilities of large language models (LLMs) are pivotal for numerous applications, from conversational agents to complex reasoning systems. However, current evaluations predominantly focus on English models, neglecting the linguistic and cultural nuances of other languages. Specifically, Korean, with its distinct syntax, rich morphological features, honorific system, and dual numbering systems, lacks a dedicated benchmark for assessing open-ended instruction-following capabilities. To address this gap, we introduce the Korean Instruction-following Task Evaluation (KITE), a comprehensive benchmark designed to evaluate both general and Korean-specific instructions. Unlike existing Korean benchmarks that focus mainly on factual knowledge or multiple-choice testing, KITE directly targets diverse, open-ended instruction-following tasks. Our evaluation pipeline combines automated metrics with human assessments, revealing performance disparities across models and providing deeper insights into their strengths and weaknesses. By publicly releasing the KITE dataset and code, we aim to foster further research on culturally and linguistically inclusive LLM development and inspire similar endeavors for other underrepresented languages.

24. 【2510.15552】hink Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation

链接https://arxiv.org/abs/2510.15552

作者:Jinliang Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Large language, language models, language understanding, hallucinate and struggle

备注

点击查看摘要

Abstract:Large language models (LLMs) excel at language understanding but often hallucinate and struggle with multi-hop reasoning. Knowledge-graph-based retrieval-augmented generation (KG-RAG) offers grounding, yet most methods rely on flat embeddings and noisy path exploration. We propose ParallaxRAG, a framework that symmetrically decouples queries and graph triples into multi-view spaces, enabling a robust retrieval architecture that explicitly enforces head diversity while constraining weakly related paths. Central to our approach is the observation that different attention heads specialize in semantic relations at distinct reasoning stages, contributing to different hops of the reasoning chain. This specialization allows ParallaxRAG to construct cleaner subgraphs and guide LLMs through grounded, step-wise reasoning. Experiments on WebQSP and CWQ, under our unified, reproducible setup (BGE-M3 + Llama3.1-8B), demonstrate competitive retrieval and QA performance, alongside reduced hallucination and good generalization. Our results highlight multi-view head specialization as a principled direction for knowledge-grounded multi-hop reasoning. Our implementation will be released as soon as the paper is accepted.

25. 【2510.15551】Rethinking Cross-lingual Gaps from a Statistical Viewpoint

链接https://arxiv.org/abs/2510.15551

作者:Vihari Piratla,Purvam Jain,Darshan Singh,Partha Talukdar,Trevor Cohn

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, cross-lingual gap, handful of natural, large corpus, Language

备注: 22 pages

点击查看摘要

Abstract:Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried from target languages. Prior research has pointed to a cross-lingual gap, viz., a drop in accuracy when the knowledge is queried in a target language compared to when the query is in the source language. Existing research has rationalized divergence in latent representations in source and target languages as the source of cross-lingual gap. In this work, we take an alternative view and hypothesize that the variance of responses in the target language is the main cause of this gap. For the first time, we formalize the cross-lingual gap in terms of bias-variance decomposition. We present extensive experimental evidence which support proposed formulation and hypothesis. We then reinforce our hypothesis through multiple inference-time interventions that control the variance and reduce the cross-lingual gap. We demonstrate a simple prompt instruction to reduce the response variance, which improved target accuracy by 20-25% across different models.

26. 【2510.15545】okenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

链接https://arxiv.org/abs/2510.15545

作者:Sibo Xiao,Jinyuan Fu,Zhongle Xie,Lidan Shou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, LLM inference efficiency, large language, critical challenge, challenge in generative

备注

点击查看摘要

Abstract:Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, thus limiting the herd of available draft models and often necessitating the training of a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose the algorithm TokenTiming for universal speculative decoding. It operates by re-encoding the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining and modification. We conduct comprehensive experiments on various tasks, demonstrating 1.57x speedup. This work enables a universal approach for draft model selection, making SD a more versatile and practical tool for LLM acceleration.

27. 【2510.15543】MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval

链接https://arxiv.org/abs/2510.15543

作者:Qiyu Wu,Shuyang Cui,Satoshi Hayakawa,Wei-Yao Wang,Hiromi Wakaki,Yuki Mitsufuji

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)

关键词:retrieve relevant content, contents production, text or image, supports applications, relevant content

备注

点击查看摘要

Abstract:Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to contents production. Despite the success of separate-encoder approaches like CLIP align modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. While flexible and advanced, we identify that unified encoders trained with conventional contrastive learning are prone to learn modality shortcut, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, a preference loss enforces multimodal embeddings to outperform their unimodal counterparts, while a composition regularization objective aligns multimodal embeddings with prototypes composed from its unimodal parts. These objectives explicitly model structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as a effective principle for robust composed multimodal retrieval when utilizing MLLMs as the unified encoder.

28. 【2510.15522】Latent Reasoning in LLMs as a Vocabulary-Space Superposition

链接https://arxiv.org/abs/2510.15522

作者:Jingcheng Deng,Liang Pang,Zihao Wei,Shichen Xu,Zenghao Duan,Kun Xu,Yang Song,Huawei Shen,Xueqi Cheng

类目:Computation and Language (cs.CL)

关键词:Large language models, substantial computational overhead, introduces substantial computational, Large language, latent

备注

点击查看摘要

Abstract:Large language models (LLMs) demonstrate strong reasoning abilities with chain-of-thought prompting, but explicit reasoning introduces substantial computational overhead. Recent work on latent reasoning reduces this cost by reasoning in latent space without explicit supervision, but performance drops significantly. Our preliminary experiments suggest that this degradation stems from the unstructured latent space, which makes fitting latent tokens difficult. To address this, we restrict the latent space to the column space of the LLM vocabulary, treating latent reasoning as a superposition over vocabulary probabilities. Once latent reasoning concludes, it collapses into an eigenstate of explicit reasoning to yield the final answer. Based on this idea, we propose Latent-SFT, a two-stage learning framework. In the first stage, we design two specialized attention masks to guide the Latent Token Encoder in generating latent tokens, allowing the LLM to produce the correct answer conditioned on them. In the second stage, the Latent Token Encoder is discarded, and the LLM is directly trained to generate these latent tokens autonomously for latent reasoning, optimized with KL and CE losses. Latent-SFT sets a new state of the art on GSM8k, matching explicit SFT performance while cutting reasoning chains by up to 4 times and outperforming prior latent methods. On Math500 and AIME24, lexical probability-based latent reasoning also clearly surpasses hidden-state-based approaches. Our metrics of effective compression rate and effective global parallelism further show that latent reasoning is both the compression of a single path and the superposition of multiple paths.

29. 【2510.15517】From Characters to Tokens: Dynamic Grouping with Hierarchical BPE

链接https://arxiv.org/abs/2510.15517

作者:Rares Dolga,Lucas Maystre,Tudor Berariu,David Barber

类目:Computation and Language (cs.CL)

关键词:Byte Pair Encoding, Pair Encoding, Byte Pair, Subword tokenization methods, Subword tokenization

备注

点击查看摘要

Abstract:Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare words and require large embedding matrices. Character-level models address these issues but introduce performance bottlenecks, particularly in Transformer-based architectures. Recent hierarchical models attempt to merge the benefits of both paradigms by grouping characters into patches, but existing patching strategies either rely on whitespace-limiting applicability to certain languages, or require auxiliary models that introduce new dependencies. In this paper, we propose a dynamic character grouping method that leverages the structure of existing BPE tokenization without requiring additional models. By appending explicit end-of-patch markers to BPE tokens and introducing a second-level BPE compression stage to control patch granularity, our method offers efficient, flexible, and language-agnostic representations. Empirical results demonstrate that our approach matches or exceeds the performance of dynamic entropy- and whitespace-based patching strategies, while maintaining a compact vocabulary.

30. 【2510.15513】mporal Referential Consistency: Do LLMs Favor Sequences Over Absolute Time References?

链接https://arxiv.org/abs/2510.15513

作者:Ashutosh Bajpai,Tanmoy Chakraborty

类目:Computation and Language (cs.CL)

关键词:knowledge sources marks, significant paradigm shift, large language models, increasing acceptance, acceptance of large

备注: EMNLP Main Long Paper 2025

点击查看摘要

Abstract:The increasing acceptance of large language models (LLMs) as an alternative to knowledge sources marks a significant paradigm shift across various domains, including time-sensitive fields such as law, healthcare, and finance. To fulfill this expanded role, LLMs must not only be factually accurate but also demonstrate consistency across temporal dimensions, necessitating robust temporal reasoning capabilities. Despite this critical requirement, efforts to ensure temporal consistency in LLMs remain scarce including noticeable absence of endeavors aimed at evaluating or augmenting LLMs across temporal references in time-sensitive inquiries. In this paper, we seek to address this gap by introducing a novel benchmark entitled temporal referential consistency, accompanied by a resource TEMP-ReCon designed to benchmark a wide range of both open-source and closed-source LLMs with various linguistic contexts characterized by differing resource richness (including English, French, and Romanian). The findings emphasis that LLMs do exhibit insufficient temporal referent consistency. To address this, we propose \newmodel, a reasoning path alignment-based model that aims to enhance the temporal referential consistency of LLMs. Our empirical experiments substantiate the efficacy of UnTRaP compared to several baseline models.

31. 【2510.15501】DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios

链接https://arxiv.org/abs/2510.15501

作者:Yao Huang,Yitong Sun,Yichi Zhang,Ruochen Zhang,Yinpeng Dong,Xingxing Wei

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:diverse cognitive tasks, induce severe risks, Large Language Models, introduces emergent deceptive, Large Language

备注: 28 pages, 17 figures, accepted by NeruIPS 2025

点击查看摘要

Abstract:Despite the remarkable advances of Large Language Models (LLMs) across diverse cognitive tasks, the rapid enhancement of these capabilities also introduces emergent deceptive behaviors that may induce severe risks in high-stakes deployments. More critically, the characterization of deception across realistic real-world scenarios remains underexplored. To bridge this gap, we establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different societal domains, what their intrinsic behavioral patterns are, and how extrinsic factors affect them. Specifically, on the static count, the benchmark encompasses 150 meticulously designed scenarios in five domains, i.e., Economy, Healthcare, Education, Social Interaction, and Entertainment, with over 1,000 samples, providing sufficient empirical foundations for deception analysis. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. On the extrinsic dimension, we investigate how contextual factors modulate deceptive outputs under neutral conditions, reward-based incentivization, and coercive pressures. Moreover, we incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics. Extensive experiments across LLMs and Large Reasoning Models (LRMs) reveal critical vulnerabilities, particularly amplified deception under reinforcement dynamics, demonstrating that current models lack robust resistance to manipulative contextual cues and the urgent need for advanced safeguards against various deception behaviors. Code and resources are publicly available at this https URL.

32. 【2510.15455】CORE: Reducing UI Exposure in Mobile Agents via Collaboration Between Cloud and Local LLMs

链接https://arxiv.org/abs/2510.15455

作者:Gucongcong Fan,Chaoyue Niu,Chengfei Lyu,Fan Wu,Guihai Chen

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, smartphone user interfaces, rely on Large

备注

点击查看摘要

Abstract:Mobile agents rely on Large Language Models (LLMs) to plan and execute tasks on smartphone user interfaces (UIs). While cloud-based LLMs achieve high task accuracy, they require uploading the full UI state at every step, exposing unnecessary and often irrelevant information. In contrast, local LLMs avoid UI uploads but suffer from limited capacity, resulting in lower task success rates. We propose $\textbf{CORE}$, a $\textbf{CO}$llaborative framework that combines the strengths of cloud and local LLMs to $\textbf{R}$educe UI $\textbf{E}$xposure, while maintaining task accuracy for mobile agents. CORE comprises three key components: (1) $\textbf{Layout-aware block partitioning}$, which groups semantically related UI elements based on the XML screen hierarchy; (2) $\textbf{Co-planning}$, where local and cloud LLMs collaboratively identify the current sub-task; and (3) $\textbf{Co-decision-making}$, where the local LLM ranks relevant UI blocks, and the cloud LLM selects specific UI elements within the top-ranked block. CORE further introduces a multi-round accumulation mechanism to mitigate local misjudgment or limited context. Experiments across diverse mobile apps and tasks show that CORE reduces UI exposure by up to 55.6% while maintaining task success rates slightly below cloud-only agents, effectively mitigating unnecessary privacy exposure to the cloud. The code is available at this https URL.

33. 【2510.15436】Controllable Abstraction in Summary Generation for Large Language Models via Prompt Engineering

链接https://arxiv.org/abs/2510.15436

作者:Xiangchen Song,Yuchen Liu,Yaxuan Luan,Jinxu Guo,Xiaofan Guo

类目:Computation and Language (cs.CL)

关键词:controllable abstract summary, controllable abstract, study presents, presents a controllable, abstract summary generation

备注

点击查看摘要

Abstract:This study presents a controllable abstract summary generation method for large language models based on prompt engineering. To address the issues of summary quality and controllability in traditional methods, we design a multi-stage prompt generation framework. This framework generates summaries with varying levels of abstraction by performing semantic analysis, topic modeling, and noise control on the input text. The experiment uses the CNN/Daily Mail dataset and provides a detailed analysis of different prompt lengths, data noise, and text types. The experimental results show that prompt length has a significant impact on the quality of generated summaries. Both very short and very long prompt tokens result in a decrease in summary quality. Data noise also negatively affects the summary generation process. As noise levels increase, the ROUGE-L score gradually decreases. Furthermore, different text types have varying effects on the model's ability to generate summaries. The model performs best when handling news texts, while its performance is worse when processing academic articles. This research provides new insights into improving summary generation using large language models, particularly in how controlling prompt strategies and optimizing text preprocessing can enhance summary accuracy and controllability.

34. 【2510.15421】When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs

链接https://arxiv.org/abs/2510.15421

作者:Hongcheng Liu,Pingjie Wang,Yuhao Wang,Siqu Ou,Yanfeng Wang,Yu Wang

类目:Computation and Language (cs.CL)

关键词:shown strong capabilities, large language models, Multimodal large language, large language, shown strong

备注: 20 pages, 13 figures

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have shown strong capabilities across a broad range of benchmarks. However, most existing evaluations focus on passive inference, where models perform step-by-step reasoning under complete information. This setup is misaligned with real-world use, where seeing is not enough. This raises a fundamental question: Can MLLMs actively acquire missing evidence under incomplete information? To bridge this gap, we require the MLLMs to actively acquire missing evidence and iteratively refine decisions under incomplete information, by selecting a target image from a candidate pool without task-specific priors. To support systematic study, we propose GuessBench, a benchmark with both perception-oriented and knowledge-oriented images for evaluating active reasoning in MLLMs. We evaluate 20 superior MLLMs and find that performance on active reasoning lags far behind it on passive settings, indicating substantial room for improvement. Further analysis identifies fine-grained perception and timely decision-making as key challenges. Ablation studies show that perceptual enhancements benefit smaller models, whereas thinking-oriented methods provide consistent gains across model sizes. These results suggest promising directions for future research on multimodal active reasoning.

35. 【2510.15418】Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs

链接https://arxiv.org/abs/2510.15418

作者:Lee Qi Zun,Mohamad Zulhilmi Bin Abdul Halim,Goh Man Fye

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Clinical Practice Guidelines, Malaysian Clinical Practice, Practice Guidelines, Retrieval-Augmented Generation systems, providing fact-based guidance

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines. However, their effectiveness with image-based queries is limited, as general Vision-Language Model captions often lack clinical specificity and factual grounding. This study proposes and validates a framework to specialize the MedGemma model for generating high-fidelity captions that serve as superior queries. To overcome data scarcity, we employ a knowledge distillation pipeline to create a synthetic dataset across dermatology, fundus, and chest radiography domains, and fine-tune MedGemma using the parameter-efficient QLoRA method. Performance was rigorously assessed through a dual framework measuring both classification accuracy and, via a novel application of the RAGAS framework, caption faithfulness, relevancy, and correctness. The fine-tuned model demonstrated substantial improvements in classification performance, while RAGAS evaluation confirmed significant gains in caption faithfulness and correctness, validating the models ability to produce reliable, factually grounded descriptions. This work establishes a robust pipeline for specializing medical VLMs and validates the resulting model as a high-quality query generator, laying the groundwork for enhancing multimodal RAG systems in evidence-based clinical decision support.

36. 【2510.15412】Large-scale User Game Lifecycle Representation Learning

链接https://arxiv.org/abs/2510.15412

作者:Yanjie Gou,Jiangming Liu,Kouying Xue,Yi Hua

类目:Computation and Language (cs.CL)

关键词:video game production, game production necessitates, rapid expansion, expansion of video, production necessitates

备注

点击查看摘要

Abstract:The rapid expansion of video game production necessitates the development of effective advertising and recommendation systems for online game platforms. Recommending and advertising games to users hinges on capturing their interest in games. However, existing representation learning methods crafted for handling billions of items in recommendation systems are unsuitable for game advertising and recommendation. This is primarily due to game sparsity, where the mere hundreds of games fall short for large-scale user representation learning, and game imbalance, where user behaviors are overwhelmingly dominated by a handful of popular games. To address the sparsity issue, we introduce the User Game Lifecycle (UGL), designed to enrich user behaviors in games. Additionally, we propose two innovative strategies aimed at manipulating user behaviors to more effectively extract both short and long-term interests. To tackle the game imbalance challenge, we present an Inverse Probability Masking strategy for UGL representation learning. The offline and online experimental results demonstrate that the UGL representations significantly enhance model by achieving a 1.83% AUC offline increase on average and a 21.67% CVR online increase on average for game advertising and a 0.5% AUC offline increase and a 0.82% ARPU online increase for in-game item recommendation.

37. 【2510.15406】VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency

链接https://arxiv.org/abs/2510.15406

作者:Hongcheng Liu,Yixuan Hou,Heyang Liu,Yuhao Wang,Yanfeng Wang,Yu Wang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Speech Large Language, Language Models, Large Language, show strong performance

备注: 21 pages, 4 figures

点击查看摘要

Abstract:While Speech Large Language Models (Speech-LLMs) show strong performance in many applications, their robustness is critically under-tested, especially to speech disfluency. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions like Parkinson's disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for the systematic evaluation of disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as primary bottlenecks responsible for these failures. Strengthening recognition and reasoning capability from components and pipelines can substantially improve robustness. These findings highlight the urgent need for new methods to improve disfluency handling and build truly inclusive Speech-LLMs

38. 【2510.15349】Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

链接https://arxiv.org/abs/2510.15349

作者:Baode Wang,Biao Wu,Weizhen Li,Meng Fang,Zuming Huang,Jun Huang,Haozhe Wang,Yanjie Liang,Ling Chen,Wei Chu,Yuan Qi

类目:Computation and Language (cs.CL)

关键词:structured formats remains, complexly intertwined elements, significant challenge due, scanned images, images into structured

备注: 22 pages, 14 figures,

点击查看摘要

Abstract:Document parsing from scanned images into structured formats remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data. This issue is further exacerbated by the limited availability of high-quality training data for layout-aware parsing tasks. To address these challenges, we introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through composite rewards integrating normalized edit distance, paragraph count accuracy, and reading order preservation. To support this training, we construct the Infinity-Doc-400K dataset, which we use to train Infinity-Parser, a vision-language model demonstrating robust generalization across various domains. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models. We will release our code, dataset, and model to facilitate reproducible research in document parsing.

39. 【2510.15346】When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

链接https://arxiv.org/abs/2510.15346

作者:Heecheol Yun,Kwangmin Ki,Junghyun Lee,Eunho Yang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Ensembling Large Language, Large Language, Language Models, Ensembling Large

备注: preprint

点击查看摘要

Abstract:Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models' next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form answers, its application to long-form generation remains underexplored. In this paper, we show that using existing ensemble methods in long-form generation requires a careful choice of ensembling positions, since the standard practice of ensembling at every token often degrades performance. We identify two key factors for determining these positions: tokenization mismatch across models and consensus in their next-token probability distributions. Based on this, we propose SAFE, (Stable And Fast LLM Ensembling), a framework that selectively ensembles by jointly considering these factors. To further improve stability, we introduce a probability sharpening strategy that consolidates probabilities spread across multiple sub-word tokens representing the same word into a single representative token. Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.

40. 【2510.15345】Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics

链接https://arxiv.org/abs/2510.15345

作者:Catarina G Belem,Parker Glenn,Alfy Samuel,Anoop Kumar,Daben Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Automatic readability assessment, accessible written communication, readability assessment plays, Automatic readability, written communication

备注: Accepted at the TSAR Workshop @ EMNLP 2025

点击查看摘要

Abstract:Automatic readability assessment plays a key role in ensuring effective and accessible written communication. Despite significant progress, the field is hindered by inconsistent definitions of readability and measurements that rely on surface-level text properties. In this work, we investigate the factors shaping human perceptions of readability through the analysis of 897 judgments, finding that, beyond surface-level cues, information content and topic strongly shape text comprehensibility. Furthermore, we evaluate 15 popular readability metrics across five English datasets, contrasting them with six more nuanced, model-based metrics. Our results show that four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best performing traditional metric achieves an average rank of 8.6. These findings highlight a mismatch between current readability metrics and human perceptions, pointing to model-based approaches as a more promising direction.

41. 【2510.15339】AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction

链接https://arxiv.org/abs/2510.15339

作者:Hong Ting Tsang,Jiaxin Bai,Haoyu Huang,Qiao Xiao,Tianshi Zheng,Baixuan Xu,Shujie Liu,Yangqiu Song

类目:Computation and Language (cs.CL)

关键词:advancing question answering, Building effective knowledge, question answering, pivotal for advancing, advancing question

备注

点击查看摘要

Abstract:Building effective knowledge graphs (KGs) for Retrieval-Augmented Generation (RAG) is pivotal for advancing question answering (QA) systems. However, its effectiveness is hindered by a fundamental disconnect: the knowledge graph (KG) construction process is decoupled from its downstream application, yielding suboptimal graph structures. To bridge this gap, we introduce AutoGraph-R1, the first framework to directly optimize KG construction for task performance using Reinforcement Learning (RL). AutoGraph-R1 trains an LLM constructor by framing graph generation as a policy learning problem, where the reward is derived from the graph's functional utility in a RAG pipeline. We design two novel, task-aware reward functions, one for graphs as knowledge carriers and another as knowledge indices. Across multiple QA benchmarks, AutoGraph-R1 consistently enables graph RAG methods to achieve significant performance gains over using task-agnostic baseline graphs. Our work shows it is possible to close the loop between construction and application, shifting the paradigm from building intrinsically ``good'' graphs to building demonstrably ``useful'' ones.

42. 【2510.15330】BeLLMan: Controlling LLM Congestion

链接https://arxiv.org/abs/2510.15330

作者:Tella Rajashekhar Reddy,Atharva Deshmukh,Karan Tandon,Rohan Gandhi,Anjaly Parayil,Debopam Bhattacherjee

类目:Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Networking and Internet Architecture (cs.NI)

关键词:Large language model, generate tokens autoregressively, poor user experience, Large language, changing system load

备注: To be presented at FAISYS 2025

点击查看摘要

Abstract:Large language model (LLM) applications are blindfolded to the infrastructure underneath and generate tokens autoregressively, indifferent to the system load, thus risking inferencing latency inflation and poor user experience. Our first-cut controller, named beLLMan, enables the LLM infrastructure to actively and progressively signal the first-party LLM application to adjust the output length in response to changing system load. On a real testbed with H100 GPUs, beLLMan helps keep inferencing latency under control (upto 8X lower end-to-end latency) and reduces energy consumption by 25% (while serving 19% more requests) during periods of congestion for a summarization workload.

43. 【2510.15313】Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry

链接https://arxiv.org/abs/2510.15313

作者:Bolei Ma,Yina Yao,Anna-Carolina Haensch

类目:Computation and Language (cs.CL)

关键词:Large Language Models, classical Chinese poetry, remains poorly understood, Large Language, Chinese poetry generation

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including themes, emotions, imagery, form, and style. Our analysis reveals systematic generation and evaluation biases: LLMs exhibit "echo chamber" effects when assessing creative quality, often converging on flawed standards that diverge from human judgments. These findings highlight both the potential and limitations of current capabilities of LLMs as proxy for literacy generation and the limited evaluation practices, thereby demonstrating the continued need of hybrid validation from both humans and models in culturally and technically complex creative tasks.

44. 【2510.15312】Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination

链接https://arxiv.org/abs/2510.15312

作者:Zhiyang Chen,Daliang Xu,Haiyang Shen,Mengwei Xu,Shangguang Wang,Yun Ma

类目:Computation and Language (cs.CL)

关键词:Enhancing on-device large, large language models, on-device large language, local data enables, data enables personalized

备注

点击查看摘要

Abstract:Enhancing on-device large language models (LLMs) with contextual information from local data enables personalized and task-aware generation, powering use cases such as intelligent assistants and UI agents. While recent developments in neural processors have substantially improved the efficiency of prefill on mobile devices, the token-by-token generation process still suffers from high latency and limited hardware utilization due to its inherently memory-bound characteristics. This work presents this http URL, a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices. The framework introduces three synergistic components: (1) adaptive execution scheduling, which dynamically balances compute graphs between prefill and decoding phases; (2) context-aligned drafting, which improves speculative efficiency through lightweight online calibration to current tasks; and (3) hardware-efficient draft extension, which reuses and expands intermediate sequences to improve processing parallelism and reduce verification cost. Experiments on multiple smartphones and representative workloads show consistent improvements of up to 3.8x in generation speed and 4.7x in energy efficiency compared with existing mobile inference solutions. Component-level analysis further validates the contribution of each optimization.

45. 【2510.15311】Automatic essay scoring: leveraging Jaccard coefficient and Cosine similaritywith n-gram variation in vector space model approach

链接https://arxiv.org/abs/2510.15311

作者:Andharini Dwi Cahyani,Moh. Wildan Fathoni,Fika Hastarita Rachman,Ari Basuki,Salman Amin,Bain Khusnul Khotimah

类目:Computation and Language (cs.CL); Computers and Society (cs.CY); Software Engineering (cs.SE)

关键词:Automated essay scoring, evaluating written content, accurate assessment tools, Automated essay, written content

备注

点击查看摘要

Abstract:Automated essay scoring (AES) is a vital area of research aiming to provide efficient and accurate assessment tools for evaluating written content. This study investigates the effectiveness of two popular similarity metrics, Jaccard coefficient, and Cosine similarity, within the context of vector space models(VSM)employing unigram, bigram, and trigram representations. The data used in this research was obtained from the formative essay of the citizenship education subject in a junior high school. Each essay undergoes preprocessing to extract features using n-gram models, followed by vectorization to transform text data into numerical representations. Then, similarity scores are computed between essays using both Jaccard coefficient and Cosine similarity. The performance of the system is evaluated by analyzing the root mean square error (RMSE), which measures the difference between the scores given by human graders and those generated by the system. The result shows that the Cosine similarity outperformed the Jaccard coefficient. In terms of n-gram, unigrams have lower RMSE compared to bigrams and trigrams.

46. 【2510.15283】Exemplar-Guided Planing: Enhanced LLM Agent for KGQA

链接https://arxiv.org/abs/2510.15283

作者:Jingao Xu,Shuoyoucheng Ma,Xin Song,Rong Jiang,Hongkui Tu,Bin Zhou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, structured knowledge graph, Graph Question Answering, Knowledge Graph Question, Knowledge Graph

备注

点击查看摘要

Abstract:Large Language Models (LLMs) as interactive agents show significant promise in Knowledge Graph Question Answering (KGQA) but often struggle with the semantic gap between natural language queries and structured knowledge graph (KG) representations. This leads to suboptimal planning and inefficient exploration on KG, while training-free approaches often underutilize valuable reasoning patterns in training data. To address these limitations, we propose a novel framework, Exemplar-Guided Planning (EGP), which enhances the planning capabilities of LLM agents for KGQA. EGP first preprocesses the training set questions via entity templating to normalize semantic variations. It then retrieves highly similar exemplary questions and their successful reasoning paths from this preprocessed set using semantic embeddings and an efficient FAISS index. These retrieved exemplars dynamically guide the LLM's planning process in two key phases: (1) Task Decomposition, by aligning generated sub-objectives with proven reasoning steps, and (2) Relation Exploration, by providing high-quality auxiliary information to improve relation pruning accuracy. Additionally, we introduce a Smart Lookahead mechanism during relation exploration to improve efficiency by preemptively exploring promising paths and potentially terminating exploration earlier. We apply EGP to the Plan-on-Graph (PoG) framework, termed PoG-EGP. Extensive experiments on two real-world KGQA datasets, WebQSP and CWQ, demonstrate that PoG-EGP significantly improves over the baseline PoG system and other compared methods.

47. 【2510.15269】ACL: Threshold-Adaptive Curriculum Learning Strategy for Enhancing Medical Text Understanding

链接https://arxiv.org/abs/2510.15269

作者:Mucheng Ren,Yucheng Yan,He Chen,Danqing Hu,Jun Xu,Xian Zeng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:capturing critical information, capturing critical, patient care, cornerstone of modern, critical information

备注: Accepted as BIBM 2025 Regular. 8 pages. Pre-CR version

点击查看摘要

Abstract:Medical texts, particularly electronic medical records (EMRs), are a cornerstone of modern healthcare, capturing critical information about patient care, diagnoses, and treatments. These texts hold immense potential for advancing clinical decision-making and healthcare analytics. However, their unstructured nature, domain-specific language, and variability across contexts make automated understanding an intricate challenge. Despite the advancements in natural language processing, existing methods often treat all data as equally challenging, ignoring the inherent differences in complexity across clinical records. This oversight limits the ability of models to effectively generalize and perform well on rare or complex cases. In this paper, we present TACL (Threshold-Adaptive Curriculum Learning), a novel framework designed to address these challenges by rethinking how models interact with medical texts during training. Inspired by the principle of progressive learning, TACL dynamically adjusts the training process based on the complexity of individual samples. By categorizing data into difficulty levels and prioritizing simpler cases early in training, the model builds a strong foundation before tackling more complex records. By applying TACL to multilingual medical data, including English and Chinese clinical records, we observe significant improvements across diverse clinical tasks, including automatic ICD coding, readmission prediction and TCM syndrome differentiation. TACL not only enhances the performance of automated systems but also demonstrates the potential to unify approaches across disparate medical domains, paving the way for more accurate, scalable, and globally applicable medical text understanding solutions.

48. 【2510.15267】raceCoder: Towards Traceable ICD Coding via Multi-Source Knowledge Integration

链接https://arxiv.org/abs/2510.15267

作者:Mucheng Ren,He Chen,Yuchen Yan,Danqing Hu,Jun Xu,Xian Zeng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Automated International Classification, Classification of Diseases, International Classification, assigns standardized diagnosis, coding assigns standardized

备注: Accpeted as BIBM 2025 Regular.8 [this http URL](http://pages.Pre) -CR version

点击查看摘要

Abstract:Automated International Classification of Diseases (ICD) coding assigns standardized diagnosis and procedure codes to clinical records, playing a critical role in healthcare systems. However, existing methods face challenges such as semantic gaps between clinical text and ICD codes, poor performance on rare and long-tail codes, and limited interpretability. To address these issues, we propose TraceCoder, a novel framework integrating multi-source external knowledge to enhance traceability and explainability in ICD coding. TraceCoder dynamically incorporates diverse knowledge sources, including UMLS, Wikipedia, and large language models (LLMs), to enrich code representations, bridge semantic gaps, and handle rare and ambiguous codes. It also introduces a hybrid attention mechanism to model interactions among labels, clinical context, and knowledge, improving long-tail code recognition and making predictions interpretable by grounding them in external evidence. Experiments on MIMIC-III-ICD9, MIMIC-IV-ICD9, and MIMIC-IV-ICD10 datasets demonstrate that TraceCoder achieves state-of-the-art performance, with ablation studies validating the effectiveness of its components. TraceCoder offers a scalable and robust solution for automated ICD coding, aligning with clinical needs for accuracy, interpretability, and reliability.

49. 【2510.15260】DRO-InstructZero: Distributionally Robust Prompt Optimization for Large Language Models

链接https://arxiv.org/abs/2510.15260

作者:Yangyang Li

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, Large language, language models, models are highly, highly sensitive

备注: Preprint. Under review at ICLR 2026. 11 pages, 2 figures

点击查看摘要

Abstract:Large language models are highly sensitive to prompt wording. However, popular automatic prompt search methods, including InstructZero, often degrade under distribution shift and adversarial evaluation because they optimize expected performance under a single evaluation distribution. Consequently, prompts that work in one setting frequently fail to transfer. To address this, DRO-InstructZero formulates zero-shot prompt optimization as robust Bayesian optimization. Specifically, an f-divergence ball defines an ambiguity set around the evaluation distribution, and a robust acquisition rule maximizes worst-case expected utility while retaining the query efficiency of Bayesian search. Therefore, the search explicitly targets reliability under distribution shift rather than average behavior alone. Experiments follow the instruction-induction protocol with matched query budgets across formality rewriting, code debugging, and translation. For example, on BIG-Bench informative-to-formal rewriting, accuracy improves from 61.3 +/- 0.7% to approximately 85-90%, yielding an absolute gain of about 25-30 points. Moreover, auto-debugging shows about +25-point gains under domain shift. Meanwhile, stable tasks such as cause-and-effect remain above 96%, indicating no loss on in-distribution cases. Furthermore, improvements are consistent across divergence choices and decoding temperatures. Overall, DRO-InstructZero connects distributionally robust optimization with prompt learning, offering a plug-and-play and general approach for reliable, transferable prompt alignment under real-world uncertainty.

50. 【2510.15258】Multi-dimensional Data Analysis and Applications Basing on LLM Agents and Knowledge Graph Interactions

链接https://arxiv.org/abs/2510.15258

作者:Xi Wang,Xianyao Ling,Kun Li,Gang Yin,Liang Zhang,Jiang Wu,Jun Xu,Fu Zhang,Wenbo Lei,Annie Wang,Peng Gong

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:extracting deep insights, Large Language Models, insights from massive, current era, era of big

备注: 14 pages, 7 figures, 40 references

点击查看摘要

Abstract:In the current era of big data, extracting deep insights from massive, heterogeneous, and complexly associated multi-dimensional data has become a significant challenge. Large Language Models (LLMs) perform well in natural language understanding and generation, but still suffer from "hallucination" issues when processing structured knowledge and are difficult to update in real-time. Although Knowledge Graphs (KGs) can explicitly store structured knowledge, their static nature limits dynamic interaction and analytical capabilities. Therefore, this paper proposes a multi-dimensional data analysis method based on the interactions between LLM agents and KGs, constructing a dynamic, collaborative analytical ecosystem. This method utilizes LLM agents to automatically extract product data from unstructured data, constructs and visualizes the KG in real-time, and supports users in deep exploration and analysis of graph nodes through an interactive platform. Experimental results show that this method has significant advantages in product ecosystem analysis, relationship mining, and user-driven exploratory analysis, providing new ideas and tools for multi-dimensional data analysis.

51. 【2510.15253】Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding

链接https://arxiv.org/abs/2510.15253

作者:Sensen Gao,Shanshan Zhao,Xu Jiang,Lunhao Duan,Yong Xien Chng,Qing-Guo Chen,Weihua Luo,Kaifu Zhang,Jia-Wang Bian,Mingming Gong

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, feeding Large Language, scientific discovery, financial analysis, analysis to scientific

备注

点击查看摘要

Abstract:Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, i.e., combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG. This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. We propose a taxonomy based on domain, retrieval modality, and granularity, and review advances involving graph structures and agentic frameworks. We also summarize key datasets, benchmarks, and applications, and highlight open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.

52. 【2510.15244】Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning

链接https://arxiv.org/abs/2510.15244

作者:Lina Berrayana,Ahmed Heakl,Muhammad Abdullah Sohail,Thomas Hofmann,Salman Khan,Wei Chen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Current autoregressive language, Current autoregressive, long token sequences, achieve high accuracy, require long token

备注: Under Submission

点击查看摘要

Abstract:Current autoregressive language models (ARMs) achieve high accuracy but require long token sequences, making them costly. Discrete diffusion language models (DDLMs) enable parallel and flexible generation within a fixed number of steps and have recently emerged for their strong performance in complex reasoning and long-term planning tasks. We present a study exploring hybrid architectures that couple DDLMs with ARMs to assess whether their collaboration can yield complementary benefits. We first examine collaboration in text space, where one model plans the reasoning process and another executes the final answer based on that plan. We then extend this setup to latent-space communication, introducing a learned projector that maps DDLM latents into the ARM's embedding space, potentially bypassing some of the text-generation limitations of diffusion models. We find that shifting DDLM -- ARM communication from text space to latent space yields significant accuracy gains, for example increasing from 27.0% to 54.0% on DART-5 and from 0.0% to 14.0% on AIME24. We also find that combining a DDLM planner with an ARM executor can provide substantial computational savings with little to no impact on accuracy. For example, the latent-space pipeline, using 64 tokens for planning and roughly 5 for execution, surpasses Qwen3.1-7B on DART-5 and AIME, despite Qwen using 44 times more tokens. Overall, our study offers new insights into reasoning with DDLMs and highlights their potential in hybrid architectures.

53. 【2510.15232】FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain

链接https://arxiv.org/abs/2510.15232

作者:Tiansheng Hu,Tongyan Hu,Liuyang Bai,Yilun Zhao,Arman Cohan,Chen Zhao

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:demonstrated promising ability, finance related problems, solving finance related, Recent LLMs, related problems

备注: EMNLP 2025 Main

点击查看摘要

Abstract:Recent LLMs have demonstrated promising ability in solving finance related problems. However, applying LLMs in real-world finance application remains challenging due to its high risk and high stakes property. This paper introduces FinTrust, a comprehensive benchmark specifically designed for evaluating the trustworthiness of LLMs in finance applications. Our benchmark focuses on a wide range of alignment issues based on practical context and features fine-grained tasks for each dimension of trustworthiness evaluation. We assess eleven LLMs on FinTrust and find that proprietary models like o4-mini outperforms in most tasks such as safety while open-source models like DeepSeek-V3 have advantage in specific areas like industry-level fairness. For challenging task like fiduciary alignment and disclosure, all LLMs fall short, showing a significant gap in legal awareness. We believe that FinTrust can be a valuable benchmark for LLMs' trustworthiness evaluation in finance domain.

54. 【2510.15231】Extending Audio Context for Long-Form Understanding in Large Audio-Language Models

链接https://arxiv.org/abs/2510.15231

作者:Yuatyong Chaichana,Pittawat Taveekitworachai,Warit Sirichotedumrong,Potsawee Manakul,Kunat Pipatanakul

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:Large Audio-Language Models, Large Audio-Language, text backbones support, limiting long-form audio, backbones support long

备注

点击查看摘要

Abstract:Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g. YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM's text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across wide range of settings, and VLAT training strategy provides substantial improvement, achieving strong performance on long audio of unseen lengths.

55. 【2510.15216】Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential

链接https://arxiv.org/abs/2510.15216

作者:Xuansheng Wu,Xiaoman Pan,Wenlin Yao,Jianshu Chen

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:RLVR varies dramatically, Reinforcement learning, RLVR varies, large language models, elicit strong reasoning

备注: Pre-print

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) can elicit strong reasoning in large language models (LLMs), while their performance after RLVR varies dramatically across different base models. This raises a fundamental question: what microscopic property of pre-trained models leads to this variation? To investigate, we formalize reasoning as chains of Horn clauses ("if-then" rules) built from features extracted from the LLM's latent space via cross-layer sparse autoencoders (SAEs). We estimate the transition probabilities between its features, and further categorize each rule by its semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key discovery is that high-potential models are inherently soundness-aware: their internal probability distributions systematically shift across rules' soundness levels, becoming highly distinct for "strict" versus "noisy" rules. In contrast, weaker models are soundness-agnostic, collapsing to one distribution regardless of soundness levels. To quantify this, we introduce the Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon Divergence to measure the separation between these distributions. We show that SAL's predictions of post-RLVR reasoning performance follow a precise empirical law (R^2=0.87) across diverse model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B). This reveals that a model's reasoning potential is tied to its intrinsic, pre-trained ability to distinguish sound knowledge from unsound ones. These findings underscore the critical role of model pre-training in shaping reasoning and offer a practical metric grounded in the model's internal mechanisms for selecting/designing stronger base models.

56. 【2510.15191】Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning

链接https://arxiv.org/abs/2510.15191

作者:Junlin Wu,Xianrui Zhong,Jiashuo Sun,Bolian Li,Bowen Jin,Jiawei Han,Qingkai Zeng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large language models, demonstrated remarkable advances, Large language, demonstrated remarkable, remarkable advances

备注

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable advances in reasoning capabilities. However, their performance remains constrained by limited access to explicit and structured domain knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external information as context to augment reasoning. Nevertheless, traditional RAG systems typically operate over unstructured and fragmented text, resulting in low information density and suboptimal reasoning. To overcome these limitations, we propose \textsc{Structure-R1}, a novel framework that transforms retrieved content into structured representations optimized for reasoning. Leveraging reinforcement learning, \textsc{Structure-R1} learns a content representation policy that dynamically generates and adapts structural formats based on the demands of multi-step reasoning. Unlike prior methods that rely on fixed schemas, our approach adopts a generative paradigm capable of producing task-specific structures tailored to individual queries. To ensure the quality and reliability of these representations, we introduce a self-reward structural verification mechanism that checks whether the generated structures are both correct and self-contained. Extensive experiments on seven knowledge-intensive benchmarks show that \textsc{Structure-R1} consistently achieves competitive performance with a 7B-scale backbone model and matches the performance of much larger models. Additionally, our theoretical analysis demonstrates how structured representations enhance reasoning by improving information density and contextual clarity. Our code and data are available at: this https URL.

57. 【2510.15186】MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evaluation

链接https://arxiv.org/abs/2510.15186

作者:Gurusha Juneja,Jayanth Naga Sai Pasupulati,Alon Albalak,Wenyue Hua,William Yang Wang

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:alongside task efficacy, autonomous LLM agents, core challenge, challenge for autonomous, settings is balancing

备注

点击查看摘要

Abstract:A core challenge for autonomous LLM agents in collaborative settings is balancing robust privacy understanding and preservation alongside task efficacy. Existing privacy benchmarks only focus on simplistic, single-turn interactions where private information can be trivially omitted without affecting task outcomes. In this paper, we introduce MAGPIE (Multi-AGent contextual PrIvacy Evaluation), a novel benchmark of 200 high-stakes tasks designed to evaluate privacy understanding and preservation in multi-agent collaborative, non-adversarial scenarios. MAGPIE integrates private information as essential for task resolution, forcing agents to balance effective collaboration with strategic information control. Our evaluation reveals that state-of-the-art agents, including GPT-5 and Gemini 2.5-Pro, exhibit significant privacy leakage, with Gemini 2.5-Pro leaking up to 50.7% and GPT-5 up to 35.1% of the sensitive information even when explicitly instructed not to. Moreover, these agents struggle to achieve consensus or task completion and often resort to undesirable behaviors such as manipulation and power-seeking (e.g., Gemini 2.5-Pro demonstrating manipulation in 38.2% of the cases). These findings underscore that current LLM agents lack robust privacy understanding and are not yet adequately aligned to simultaneously preserve privacy and maintain effective collaboration in complex environments.

58. 【2510.15162】rain a Unified Multimodal Data Quality Classifier with Synthetic Data

链接https://arxiv.org/abs/2510.15162

作者:Weizhi Wang,Rongmei Lin,Shiyang Li,Colin Lockard,Ritesh Sarkhel,Sanket Lokegaonkar,Jingbo Shang,Xifeng Yan,Nasser Zalmout,Xian Li

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Multimodal Large Language, Large Language Models, Large Language, interleaved document data, Unified Mulitmodal Data

备注: EMNLP 2025 Findings

点击查看摘要

Abstract:The Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, while the high-quality data filtering towards image-text interleaved document data is under-explored. We propose to train an efficient MLLM as a Unified Mulitmodal Data Quality Classifier to Filter both high-quality image-text caption and interleaved data (UniFilter). To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning capabilities. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and the high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to the community for reproduction and further development.

59. 【2510.15144】HugAgent: Evaluating LLMs in Simulating Human-Like Individual Reasoning on Open-Ended Tasks

链接https://arxiv.org/abs/2510.15144

作者:Chance Jiajie Li,Zhenze Mo,Yuhan Tang,Ao Qu,Jiayi Wu,Kaiya Ivy Zhao,Yulu Gan,Jie Fan,Jiangbo Yu,Hang Jiang,Paul Pu Liang,Jinhua Zhao,Luis Alberto Alonso Pastor,Kent Larson

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:Simulating human reasoning, Simulating human, cognitive science, long-standing aspiration, Simulating

备注: To appear in NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models (LAW)

点击查看摘要

Abstract:Simulating human reasoning in open-ended tasks has been a long-standing aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for average-to-individual reasoning adaptation. The task is to predict how a specific person would reason and update their beliefs in novel scenarios, given partial evidence of their past views. HugAgent adopts a dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid, "out-loud" reasoning data. This design enables scalable, reproducible evaluation of intra-agent fidelity: whether models can capture not just what people believe, but how their reasoning evolves. Experiments with state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. Our benchmark and chatbot are open-sourced as HugAgent (this https URL) and TraceYourThinking (this https URL).

60. 【2510.15134】FarsiMCQGen: a Persian Multiple-choice Question Generation Framework

链接https://arxiv.org/abs/2510.15134

作者:Mohammad Heydari Rad,Rezvan Afari,Saeedeh Momtazi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Multiple-choice questions, evaluating learners' knowledge, educational testing, offer an efficient, evaluating learners'

备注

点击查看摘要

Abstract:Multiple-choice questions (MCQs) are commonly used in educational testing, as they offer an efficient means of evaluating learners' knowledge. However, generating high-quality MCQs, particularly in low-resource languages such as Persian, remains a significant challenge. This paper introduces FarsiMCQGen, an innovative approach for generating Persian-language MCQs. Our methodology combines candidate generation, filtering, and ranking techniques to build a model that generates answer choices resembling those in real MCQs. We leverage advanced methods, including Transformers and knowledge graphs, integrated with rule-based approaches to craft credible distractors that challenge test-takers. Our work is based on data from Wikipedia, which includes general knowledge questions. Furthermore, this study introduces a novel Persian MCQ dataset comprising 10,289 questions. This dataset is evaluated by different state-of-the-art large language models (LLMs). Our results demonstrate the effectiveness of our model and the quality of the generated dataset, which has the potential to inspire further research on MCQs.

61. 【2510.15125】Latent Topic Synthesis: Leveraging LLMs for Electoral Ad Analysis

链接https://arxiv.org/abs/2510.15125

作者:Alexander Brady,Tunazzina Islam

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

关键词:rapidly evolving content, evolving content remains, media platforms play, major challenge, platforms play

备注: Under-submission

点击查看摘要

Abstract:Social media platforms play a pivotal role in shaping political discourse, but analyzing their vast and rapidly evolving content remains a major challenge. We introduce an end-to-end framework for automatically generating an interpretable topic taxonomy from an unlabeled corpus. By combining unsupervised clustering with prompt-based labeling, our method leverages large language models (LLMs) to iteratively construct a taxonomy without requiring seed sets or domain expertise. We apply this framework to a large corpus of Meta (previously known as Facebook) political ads from the month ahead of the 2024 U.S. Presidential election. Our approach uncovers latent discourse structures, synthesizes semantically rich topic labels, and annotates topics with moral framing dimensions. We show quantitative and qualitative analyses to demonstrate the effectiveness of our framework. Our findings reveal that voting and immigration ads dominate overall spending and impressions, while abortion and election-integrity achieve disproportionate reach. Funding patterns are equally polarized: economic appeals are driven mainly by conservative PACs, abortion messaging splits between pro- and anti-rights coalitions, and crime-and-justice campaigns are fragmented across local committees. The framing of these appeals also diverges--abortion ads emphasize liberty/oppression rhetoric, while economic messaging blends care/harm, fairness/cheating, and liberty/oppression narratives. Topic salience further reveals strong correlations between moral foundations and issues. Demographic targeting also emerges. This work supports scalable, interpretable analysis of political messaging on social media, enabling researchers, policymakers, and the public to better understand emerging narratives, polarization dynamics, and the moral underpinnings of digital political communication.

62. 【2510.15115】Measuring the Effect of Disfluency in Multilingual Knowledge Probing Benchmarks

链接https://arxiv.org/abs/2510.15115

作者:Kirill Semenov,Rico Sennrich

类目:Computation and Language (cs.CL)

关键词:named entities inserted, factual knowledge assessment, knowledge retrieval scores, account the grammatical, grammatical and semantic

备注

点击查看摘要

Abstract:For multilingual factual knowledge assessment of LLMs, benchmarks such as MLAMA use template translations that do not take into account the grammatical and semantic information of the named entities inserted in the sentence. This leads to numerous instances of ungrammaticality or wrong wording of the final prompts, which complicates the interpretation of scores, especially for languages that have a rich morphological inventory. In this work, we sample 4 Slavic languages from the MLAMA dataset and compare the knowledge retrieval scores between the initial (templated) MLAMA dataset and its sentence-level translations made by Google Translate and ChatGPT. We observe a significant increase in knowledge retrieval scores, and provide a qualitative analysis for possible reasons behind it. We also make an additional analysis of 5 more languages from different families and see similar patterns. Therefore, we encourage the community to control the grammaticality of highly multilingual datasets for higher and more interpretable results, which is well approximated by whole sentence translation with neural MT or LLM systems. The dataset and all related code is published at the Github repository: this https URL.

63. 【2510.15110】DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning

链接https://arxiv.org/abs/2510.15110

作者:Shih-Yang Liu,Xin Dong,Ximing Lu,Shizhe Diao,Mingjie Liu,Min-Hung Chen,Hongxu Yin,Yu-Chiang Frank Wang,Kwang-Ting Cheng,Yejin Choi,Jan Kautz,Pavlo Molchanov

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Qwen achieve strong, unnecessarily long outputs, achieve strong performance, generate unnecessarily long, Qwen achieve

备注: NVIDIA-Tech Report

点击查看摘要

Abstract:Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token--accuracy relative to response length--remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty--truncation--and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy--efficiency trade-offs, cutting output length by over 70 percent while surpassing all previous baseline accuracy. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.

64. 【2510.15103】Continual Learning via Sparse Memory Finetuning

链接https://arxiv.org/abs/2510.15103

作者:Jessy Lin,Luke Zettlemoyer,Gargi Ghosh,Wen-Tau Yih,Aram Markosyan,Vincent-Pierre Berges,Barlas Oğuz

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Modern language models, Modern language, static after deployment, typically static, Modern

备注

点击查看摘要

Abstract:Modern language models are powerful, but typically static after deployment. A major obstacle to building models that continually learn over time is catastrophic forgetting, where updating on new data erases previously acquired capabilities. Motivated by the intuition that mitigating forgetting is challenging because trainable parameters are shared across all tasks, we investigate whether sparse parameter updates can enable learning without catastrophic forgetting. We introduce sparse memory finetuning, leveraging memory layer models (Berges et al., 2024), which are sparsely updated by design. By updating only the memory slots that are highly activated by a new piece of knowledge relative to usage on pretraining data, we reduce interference between new knowledge and the model's existing capabilities. We evaluate learning and forgetting compared to full finetuning and parameter-efficient finetuning with LoRA on two question answering tasks. We find that sparse memory finetuning learns new knowledge while exhibiting substantially less forgetting: while NaturalQuestions F1 drops by 89% after full finetuning on new facts and 71% with LoRA, sparse memory finetuning yields only an 11% drop with the same level of new knowledge acquisition. Our results suggest sparsity in memory layers offers a promising path toward continual learning in large language models.

65. 【2510.15081】A Generalizable Rhetorical Strategy Annotation Model Using LLM-based Debate Simulation and Labelling

链接https://arxiv.org/abs/2510.15081

作者:Shiyu Ji,Farnoosh Hashemi,Joice Chen,Juanwen Pan,Weicheng Ma,Hefan Zhang,Sophia Pan,Ming Cheng,Shubham Mohole,Saeed Hassanpour,Soroush Vosoughi,Michael Macy

类目:Computation and Language (cs.CL); Social and Information Networks (cs.SI)

关键词:persuasive communication, legal argumentation, central to persuasive, political discourse, discourse and marketing

备注: The first two authors contributed equally

点击查看摘要

Abstract:Rhetorical strategies are central to persuasive communication, from political discourse and marketing to legal argumentation. However, analysis of rhetorical strategies has been limited by reliance on human annotation, which is costly, inconsistent, difficult to scale. Their associated datasets are often limited to specific topics and strategies, posing challenges for robust model development. We propose a novel framework that leverages large language models (LLMs) to automatically generate and label synthetic debate data based on a four-part rhetorical typology (causal, empirical, emotional, moral). We fine-tune transformer-based classifiers on this LLM-labeled dataset and validate its performance against human-labeled data on this dataset and on multiple external corpora. Our model achieves high performance and strong generalization across topical domains. We illustrate two applications with the fine-tuned model: (1) the improvement in persuasiveness prediction from incorporating rhetorical strategy labels, and (2) analyzing temporal and partisan shifts in rhetorical strategies in U.S. Presidential debates (1960-2020), revealing increased use of affective over cognitive argument in U.S. Presidential debates.

66. 【2510.15061】Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models

链接https://arxiv.org/abs/2510.15061

作者:Samuel Paech,Allen Roush,Judah Goldfeder,Ravid Shwartz-Ziv

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Widespread LLM adoption, characteristic repetitive phraseology, introduced characteristic repetitive, Widespread LLM, text immediately recognizable

备注: 11 pages + appendices, 16 figures

点击查看摘要

Abstract:Widespread LLM adoption has introduced characteristic repetitive phraseology, termed ``slop,'' which degrades output quality and makes AI-generated text immediately recognizable. We present Antislop, a comprehensive framework providing tools to both detect and eliminate these overused patterns. Our approach combines three innovations: (1) The Antislop Sampler, which uses backtracking to suppress unwanted strings at inference time without destroying vocabulary; (2) An automated pipeline that profiles model-specific slop against human baselines and generates training data; (3) Final Token Preference Optimization (FTPO), a novel fine-tuning method that operates on individual tokens, surgically adjusting logits wherever a banned pattern has appeared in an inference trace. We demonstrate that some slop patterns appear over 1,000$\times$ more frequently in LLM output than human text. The Antislop Sampler successfully suppresses 8,000+ patterns while maintaining quality, whereas token banning becomes unusable at just 2,000. Most importantly, FTPO achieves 90\% slop reduction while maintaining or improving performance in cross-domain evals including GSM8K, MMLU, and creative writing tasks. In contrast, DPO suffers significant degradation in writing quality and lexical diversity despite achieving weaker suppression. We release all code and results under MIT license: this https URL.

67. 【2510.15047】Internalizing World Models via Self-Play Finetuning for Agentic RL

链接https://arxiv.org/abs/2510.15047

作者:Shiqi Chen,Tongyao Zhu,Zian Wang,Jinghan Zhang,Kangrui Wang,Siyang Gao,Teng Xiao,Yee Whye Teh,Junxian He,Manling Li

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, world model, Large

备注

点击查看摘要

Abstract:Large Language Models (LLMs) as agents often struggle in out-of-distribution (OOD) scenarios. Real-world environments are complex and dynamic, governed by task-specific rules and stochasticity, which makes it difficult for LLMs to ground their internal knowledge in those dynamics. Under such OOD conditions, vanilla RL training often fails to scale; we observe Pass@k--the probability that at least one of (k) sampled trajectories succeeds--drops markedly across training steps, indicating brittle exploration and limited generalization. Inspired by model-based reinforcement learning, we hypothesize that equipping LLM agents with an internal world model can better align reasoning with environmental dynamics and improve decision-making. We show how to encode this world model by decomposing it into two components: state representation and transition modeling. Building on this, we introduce SPA, a simple reinforcement learning framework that cold-starts the policy via a Self-Play supervised finetuning (SFT) stage to learn the world model by interacting with the environment, then uses it to simulate future states prior to policy optimization. This simple initialization outperforms the online world-modeling baseline and greatly boosts the RL-based agent training performance. Experiments across diverse environments like Sokoban, FrozenLake, and Sudoku show that our approach significantly improves performance. For example, SPA boosts the Sokoban success rate from 25.6% to 59.8% and raises the FrozenLake score from 22.1% to 70.9% for the Qwen2.5-1.5B-Instruct model.

68. 【2510.15040】Composition-Grounded Instruction Synthesis for Visual Reasoning

链接https://arxiv.org/abs/2510.15040

作者:Xinyi Gu,Jiayuan Mao,Zhang-Wei Hong,Zhuoran Yu,Pengyuan Li,Dhiraj Joshi,Rogerio Feris,Zexue He

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Pretrained multi-modal large, diverse multimodal tasks, Pretrained multi-modal, large language models, multi-modal large language

备注

点击查看摘要

Abstract:Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.

69. 【2510.15015】DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models

链接https://arxiv.org/abs/2510.15015

作者:Mor Ventura,Michael Toker,Or Patashnik,Yonatan Belinkov,Roi Reichart

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:semantically related features, advanced rapidly, distinct entities, remain vulnerable, unintended transfer

备注

点击查看摘要

Abstract:Text-to-Image (T2I) models have advanced rapidly, yet they remain vulnerable to semantic leakage, the unintended transfer of semantically related features between distinct entities. Existing mitigation strategies are often optimization-based or dependent on external inputs. We introduce DeLeaker, a lightweight, optimization-free inference-time approach that mitigates leakage by directly intervening on the model's attention maps. Throughout the diffusion process, DeLeaker dynamically reweights attention maps to suppress excessive cross-entity interactions while strengthening the identity of each entity. To support systematic evaluation, we introduce SLIM (Semantic Leakage in IMages), the first dataset dedicated to semantic leakage, comprising 1,130 human-verified samples spanning diverse scenarios, together with a novel automatic evaluation framework. Experiments demonstrate that DeLeaker consistently outperforms all baselines, even when they are provided with external information, achieving effective leakage mitigation without compromising fidelity or quality. These results underscore the value of attention control and pave the way for more semantically precise T2I models.

70. 【2510.15009】Can generative AI figure out figurative language? The influence of idioms on essay scoring by ChatGPT, Gemini, and Deepseek

链接https://arxiv.org/abs/2510.15009

作者:Enis Oğuz

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Generative AI technologies, Generative, technologies have paved, numerous innovations, essays

备注

点击查看摘要

Abstract:The developments in Generative AI technologies have paved the way for numerous innovations in different fields. Recently, Generative AI has been proposed as a competitor to AES systems in evaluating student essays automatically. Considering the potential limitations of AI in processing idioms, this study assessed the scoring performances of Generative AI models for essays with and without idioms by incorporating insights from Corpus Linguistics and Computational Linguistics. Two equal essay lists were created from 348 student essays taken from a corpus: one with multiple idioms present in each essay and another with no idioms in essays. Three Generative AI models (ChatGPT, Gemini, and Deepseek) were asked to score all essays in both lists three times, using the same rubric used by human raters in assigning essay scores. The results revealed excellent consistency for all models, but Gemini outperformed its competitors in interrater reliability with human raters. There was also no detectable bias for any demographic group in AI assessment. For essays with multiple idioms, Gemini followed a the most similar pattern to human raters. While the models in the study demonstrated potential for a hybrid approach, Gemini was the best candidate for the task due to its ability to handle figurative language and showed promise for handling essay-scoring tasks alone in the future.

71. 【2510.15007】Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective

链接https://arxiv.org/abs/2510.15007

作者:Zhiqiang Kou,Junyang Chen,Xin-Qiang Cai,Ming-Kun Xie,Biao Liu,Changwei Wang,Lei Feng,Yuheng Jia,Gang Niu,Masashi Sugiyama,Xin Geng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, language processing tasks, natural language processing, Large language, achieved impressive results

备注

点击查看摘要

Abstract:Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks, but their potential to generate harmful content has raised serious safety concerns. Current toxicity detectors primarily rely on single-label benchmarks, which cannot adequately capture the inherently ambiguous and multi-dimensional nature of real-world toxic prompts. This limitation results in biased evaluations, including missed toxic detections and false positives, undermining the reliability of existing detectors. Additionally, gathering comprehensive multi-label annotations across fine-grained toxicity categories is prohibitively costly, further hindering effective evaluation and development. To tackle these issues, we introduce three novel multi-label benchmarks for toxicity detection: \textbf{Q-A-MLL}, \textbf{R-A-MLL}, and \textbf{H-X-MLL}, derived from public toxicity datasets and annotated according to a detailed 15-category taxonomy. We further provide a theoretical proof that, on our released datasets, training with pseudo-labels yields better performance than directly learning from single-label supervision. In addition, we develop a pseudo-label-based toxicity detection method. Extensive experimental results show that our approach significantly surpasses advanced baselines, including GPT-4o and DeepSeek, thus enabling more accurate and reliable evaluation of multi-label toxicity in LLM-generated content.

72. 【2503.01933】Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective

链接https://arxiv.org/abs/2503.01933

作者:Rakshit Aralimatti,Syed Abdul Gaffar Shakhadri,Kruthika KR,Kartik Basavaraj Angadi

类目:Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Deploying large scale, high computational demands, data privacy risks, edge devices faces, devices faces inherent

备注

点击查看摘要

Abstract:Deploying large scale language models on edge devices faces inherent challenges such as high computational demands, energy consumption, and potential data privacy risks. This paper introduces the Shakti Small Language Models (SLMs) Shakti-100M, Shakti-250M, and Shakti-500M which target these constraints headon. By combining efficient architectures, quantization techniques, and responsible AI principles, the Shakti series enables on-device intelligence for smartphones, smart appliances, IoT systems, and beyond. We provide comprehensive insights into their design philosophy, training pipelines, and benchmark performance on both general tasks (e.g., MMLU, Hellaswag) and specialized domains (healthcare, finance, and legal). Our findings illustrate that compact models, when carefully engineered and fine-tuned, can meet and often exceed expectations in real-world edge-AI scenarios.

73. 【2502.17092】Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI

链接https://arxiv.org/abs/2502.17092

作者:Syed Abdul Gaffar Shakhadri,Kruthika KR,Kartik Basavaraj Angadi

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:introduce Shakti VLM, parameters designed, family of vision-language, designed to address, introduce Shakti

备注

点击查看摘要

Abstract:We introduce Shakti VLM, a family of vision-language models in the capacity of 1B and 4B parameters designed to address data efficiency challenges in multimodal learning. While recent VLMs achieve strong performance through extensive training data, Shakti models leverage architectural innovations to attain competitive results with fewer tokens. Key advancements include QK-Normalization for attention stability, hybrid normalization techniques, and enhanced positional encoding. A three-stage training strategy further optimizes learning efficiency. Evaluations show that Shakti-Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, Visual Reasoning, OCR extraction, and general multimodal reasoning. Our results highlight that high performance can be achieved through model design and training strategy rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.

74. 【2411.00784】FIRE: Fact-checking with Iterative Retrieval and Verification

链接https://arxiv.org/abs/2411.00784

作者:Zhuohan Xie,Rui Xing,Yuxia Wang,Jiahui Geng,Hasan Iqbal,Dhruv Sahnan,Iryna Gurevych,Preslav Nakov

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Fact-checking long-form text, multiple atomic claims, text is challenging, long-form text, common practice

备注: 4 figures, 8 tables, accepted to Findings of NAACL

点击查看摘要

Abstract:Fact-checking long-form text is challenging, and it is therefore common practice to break it down into multiple atomic claims. The typical approach to fact-checking these atomic claims involves retrieving a fixed number of pieces of evidence, followed by a verification step. However, this method is usually not cost-effective, as it underutilizes the verification model's internal knowledge of the claim and fails to replicate the iterative reasoning process in human search strategies. To address these limitations, we propose FIRE, a novel agent-based framework that integrates evidence retrieval and claim verification in an iterative manner. Specifically, FIRE employs a unified mechanism to decide whether to provide a final answer or generate a subsequent search query, based on its confidence in the current judgment. We compare FIRE with other strong fact-checking frameworks and find that it achieves slightly better performance while reducing large language model (LLM) costs by an average of 7.6 times and search costs by 16.5 times. These results indicate that FIRE holds promise for application in large-scale fact-checking operations. Our code is available at this https URL.

75. 【2510.15691】Exploring the Synergy of Quantitative Factors and Newsflow Representations from Large Language Models for Stock Return Prediction

链接https://arxiv.org/abs/2510.15691

作者:Tian Guo,Emmanuel Hauptmann

类目:Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:including stock selection, portfolio optimization, supports various tasks, risk management, quantitative investing

备注

点击查看摘要

Abstract:In quantitative investing, return prediction supports various tasks, including stock selection, portfolio optimization, and risk management. Quantitative factors, such as valuation, quality, and growth, capture various characteristics of stocks. Unstructured financial data, like news and transcripts, has attracted growing attention, driven by recent advances in large language models (LLMs). This paper examines effective methods for leveraging multimodal factors and newsflow in return prediction and stock selection. First, we introduce a fusion learning framework to learn a unified representation from factors and newsflow representations generated by an LLM. Within this framework, we compare three representative methods: representation combination, representation summation, and attentive representations. Next, building on empirical observations from fusion learning, we explore the mixture model that adaptively combines predictions made by single modalities and their fusion. To mitigate the training instability observed in the mixture model, we introduce a decoupled training approach with theoretical insights. Finally, our experiments on real investment universes yield several insights into effective multimodal modeling of factors and news for stock return prediction.

76. 【2510.15020】he Coverage Principle: How Pre-training Enables Post-Training

链接https://arxiv.org/abs/2510.15020

作者:Fan Chen,Audrey Huang,Noah Golowich,Sadhika Malladi,Adam Block,Jordan T. Ash,Akshay Krishnamurthy,Dylan J. Foster

类目:Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistics Theory (math.ST)

关键词:remains poorly understood, Language models demonstrate, demonstrate remarkable abilities, large text corpora, models demonstrate remarkable

备注

点击查看摘要

Abstract:Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross entropy loss, cross-entropy can be a poor predictor of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of \emph{coverage}, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods such as Best-of-N to succeed. Our main results develop an understanding of \emph{the coverage principle}, a phenomenon whereby next-token prediction implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: \emph{coverage generalizes faster than cross entropy}, avoiding spurious dependence on problem-dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.

信息检索

1. 【2510.15729】FACE: A General Framework for Mapping Collaborative Filtering Embeddings into LLM Tokens

链接https://arxiv.org/abs/2510.15729

作者:Chao Wang,Yixin Song,Jinhui Ye,Chuan Qin,Dazhong Shen,Lingfeng Liu,Xiang Wang,Yanyong Zhang

类目:Information Retrieval (cs.IR)

关键词:personalizing user experiences, large language models, based recommendation systems, large language, collaborative filtering

备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Recently, large language models (LLMs) have been explored for integration with collaborative filtering (CF)-based recommendation systems, which are crucial for personalizing user experiences. However, a key challenge is that LLMs struggle to interpret the latent, non-semantic embeddings produced by CF approaches, limiting recommendation effectiveness and further applications. To address this, we propose FACE, a general interpretable framework that maps CF embeddings into pre-trained LLM tokens. Specifically, we introduce a disentangled projection module to decompose CF embeddings into concept-specific vectors, followed by a quantized autoencoder to convert continuous embeddings into LLM tokens (descriptors). Then, we design a contrastive alignment objective to ensure that the tokens align with corresponding textual signals. Hence, the model-agnostic FACE framework achieves semantic alignment without fine-tuning LLMs and enhances recommendation performance by leveraging their pre-trained capabilities. Empirical results on three real-world recommendation datasets demonstrate performance improvements in benchmark models, with interpretability studies confirming the interpretability of the descriptors. Code is available in this https URL.

2. 【2510.15722】he 3rd Place Solution of CCIR CUP 2025: A Framework for Retrieval-Augmented Generation in Multi-Turn Legal Conversation

链接https://arxiv.org/abs/2510.15722

作者:Da Li,Zecheng Fang,Qiang Yan,Wei Huang,Xuanpu Luo

类目:Information Retrieval (cs.IR)

关键词:made significant progress, natural language processing, made significant, significant progress, large language models

备注: CCIR2025

点击查看摘要

Abstract:Retrieval-Augmented Generation has made significant progress in the field of natural language processing. By combining the advantages of information retrieval and large language models, RAG can generate relevant and contextually appropriate responses based on items retrieved from reliable sources. This technology has demonstrated outstanding performance across multiple domains, but its application in the legal field remains in its exploratory phase. In this paper, we introduce our approach for "Legal Knowledge Retrieval and Generation" in CCIR CUP 2025, which leverages large language models and information retrieval systems to provide responses based on laws in response to user questions.

3. 【2510.15719】Cost-Aware Retrieval-Augmentation Reasoning Models with Adaptive Retrieval Depth

链接https://arxiv.org/abs/2510.15719

作者:Helia Hashemi,Victor Rühle,Saravan Rajmohan

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:gained significant attention, significant attention due, strong performance, attention due, Reasoning models

备注

点击查看摘要

Abstract:Reasoning models have gained significant attention due to their strong performance, particularly when enhanced with retrieval augmentation. However, these models often incur high computational costs, as both retrieval and reasoning tokens contribute substantially to the overall resource usage. In this work, we make the following contributions: (1) we propose a retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list based on the query and retrieval results; (2) we develop a cost-aware advantage function for training of efficient retrieval-augmented reasoning models through reinforcement learning; and (3) we explore both memory- and latency-bound implementations of the proposed cost-aware framework for both proximal and group relative policy optimization algorithms. We evaluate our approach on seven public question answering datasets and demonstrate significant efficiency gains, without compromising effectiveness. In fact, we observed that the model latency decreases by ~16-20% across datasets, while its effectiveness increases by ~5% on average, in terms of exact match.

4. 【2510.15706】GraphMind: Interactive Novelty Assessment System for Accelerating Scientific Discovery

链接https://arxiv.org/abs/2510.15706

作者:Italo Luis da Silva,Hanqi Yan,Lin Gui,Yulan He

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, text generation capabilities, show strong reasoning

备注: 9 pages, 6 figures, 3 tables, EMNLP 2025 Demo paper

点击查看摘要

Abstract:Large Language Models (LLMs) show strong reasoning and text generation capabilities, prompting their use in scientific literature analysis, including novelty assessment. While evaluating novelty of scientific papers is crucial for peer review, it requires extensive knowledge of related work, something not all reviewers have. While recent work on LLM-assisted scientific literature analysis supports literature comparison, existing approaches offer limited transparency and lack mechanisms for result traceability via an information retrieval module. To address this gap, we introduce $\textbf{GraphMind}$, an easy-to-use interactive web tool designed to assist users in evaluating the novelty of scientific papers or drafted ideas. Specially, $\textbf{GraphMind}$ enables users to capture the main structure of a scientific paper, explore related ideas through various perspectives, and assess novelty via providing verifiable contextual insights. $\textbf{GraphMind}$ enables users to annotate key elements of a paper, explore related papers through various relationships, and assess novelty with contextual insight. This tool integrates external APIs such as arXiv and Semantic Scholar with LLMs to support annotation, extraction, retrieval and classification of papers. This combination provides users with a rich, structured view of a scientific idea's core contributions and its connections to existing work. $\textbf{GraphMind}$ is available at this https URL and a demonstration video at this https URL. The source code is available at this https URL.

5. 【2510.15683】Mixture of Experts Approaches in Dense Retrieval Tasks

链接https://arxiv.org/abs/2510.15683

作者:Effrosyni Sokli,Pranav Kasela,Georgios Peikos,Gabriella Pasi

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Dense Retrieval Models, development in Information, Dense Retrieval, Information Retrieval, prominent development

备注: 8 pages, 4 figures, 3 tables, reproducible code available at [this https URL](https://github.com/FaySokli/SB-MoE) , Accepted for publication in Proceedings of the 2025 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2025)

点击查看摘要

Abstract:Dense Retrieval Models (DRMs) are a prominent development in Information Retrieval (IR). A key challenge with these neural Transformer-based models is that they often struggle to generalize beyond the specific tasks and domains they were trained on. To address this challenge, prior research in IR incorporated the Mixture-of-Experts (MoE) framework within each Transformer layer of a DRM, which, though effective, substantially increased the number of additional parameters. In this paper, we propose a more efficient design, which introduces a single MoE block (SB-MoE) after the final Transformer layer. To assess the retrieval effectiveness of SB-MoE, we perform an empirical evaluation across three IR tasks. Our experiments involve two evaluation setups, aiming to assess both in-domain effectiveness and the model's zero-shot generalizability. In the first setup, we fine-tune SB-MoE with four different underlying DRMs on seven IR benchmarks and evaluate them on their respective test sets. In the second setup, we fine-tune SB-MoE on MSMARCO and perform zero-shot evaluation on thirteen BEIR datasets. Additionally, we perform further experiments to analyze the model's dependency on its hyperparameters (i.e., the number of employed and activated experts) and investigate how this variation affects SB-MoE's performance. The obtained results show that SB-MoE is particularly effective for DRMs with lightweight base models, such as TinyBERT and BERT-Small, consistently exceeding standard model fine-tuning across benchmarks. For DRMs with more parameters, such as BERT-Base and Contriever, our model requires a larger number of training samples to achieve improved retrieval performance. Our code is available online at: this https URL.

6. 【2510.15682】SQuAI: Scientific Question-Answering with Multi-Agent Retrieval-Augmented Generation

链接https://arxiv.org/abs/2510.15682

作者:Ines Besrour,Jingbo He,Tobias Schreieder,Michael Färber

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:large language models, https URL, multi-agent retrieval-augmented generation, retrieval-augmented generation, language models

备注: Accepted at CIKM 2025

点击查看摘要

Abstract:We present SQuAI (this https URL), a scalable and trustworthy multi-agent retrieval-augmented generation (RAG) framework for scientific question answering (QA) with large language models (LLMs). SQuAI addresses key limitations of existing RAG systems in the scholarly domain, where complex, open-domain questions demand accurate answers, explicit claims with citations, and retrieval across millions of scientific documents. Built on over 2.3 million full-text papers from arXiv.org, SQuAI employs four collaborative agents to decompose complex questions into sub-questions, retrieve targeted evidence via hybrid sparse-dense retrieval, and adaptively filter documents to improve contextual relevance. To ensure faithfulness and traceability, SQuAI integrates in-line citations for each generated claim and provides supporting sentences from the source documents. Our system improves faithfulness, answer relevance, and contextual relevance by up to +0.088 (12%) over a strong RAG baseline. We further release a benchmark of 1,000 scientific question-answer-evidence triplets to support reproducibility. With transparent reasoning, verifiable citations, and domain-wide scalability, SQuAI demonstrates how multi-agent RAG enables more trustworthy scientific QA with LLMs.

7. 【2510.15647】Enhance Large Language Models as Recommendation Systems with Collaborative Filtering

链接https://arxiv.org/abs/2510.15647

作者:Zhisheng Yang,Xiaofei Xu,Ke Deng,Li Li

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Natural Language Processing, Large Language Models, Language Processing, Large Language, Natural Language

备注

点击查看摘要

Abstract:As powerful tools in Natural Language Processing (NLP), Large Language Models (LLMs) have been leveraged for crafting recommendations to achieve precise alignment with user preferences and elevate the quality of the recommendations. The existing approaches implement both non-tuning and tuning strategies. Compared to following the tuning strategy, the approaches following the non-tuning strategy avoid the relatively costly, time-consuming, and expertise-requiring process of further training pre-trained LLMs on task-specific datasets, but they suffer the issue of not having the task-specific business or local enterprise knowledge. To the best of our knowledge, none of the existing approaches following the non-tuning strategy explicitly integrates collaborative filtering, one of the most successful recommendation techniques. This study aims to fill the gap by proposing critique-based LLMs as recommendation systems (Critic-LLM-RS). For our purpose, we train a separate machine-learning model called Critic that implements collaborative filtering for recommendations by learning from the interactions between many users and items. The Critic provides critiques to LLMs to significantly refine the recommendations. Extensive experiments have verified the effectiveness of Critic-LLM-RS on real datasets.

8. 【2510.15543】MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval

链接https://arxiv.org/abs/2510.15543

作者:Qiyu Wu,Shuyang Cui,Satoshi Hayakawa,Wei-Yao Wang,Hiromi Wakaki,Yuki Mitsufuji

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)

关键词:retrieve relevant content, contents production, text or image, supports applications, relevant content

备注

点击查看摘要

Abstract:Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to contents production. Despite the success of separate-encoder approaches like CLIP align modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. While flexible and advanced, we identify that unified encoders trained with conventional contrastive learning are prone to learn modality shortcut, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, a preference loss enforces multimodal embeddings to outperform their unimodal counterparts, while a composition regularization objective aligns multimodal embeddings with prototypes composed from its unimodal parts. These objectives explicitly model structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as a effective principle for robust composed multimodal retrieval when utilizing MLLMs as the unified encoder.

9. 【2510.15470】MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval

链接https://arxiv.org/abs/2510.15470

作者:Jinghao Huang,Yaxiong Chen,Ganchao Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:data increases rapidly, video data increases, drone, increases rapidly, creating an urgent

备注

点击查看摘要

Abstract:With the advancement of drone technology, the volume of video data increases rapidly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, which challenge existing cross-modal methods designed for ground-level views in effectively modeling their characteristics. Therefore, dedicated retrieval mechanisms tailored for drone scenarios are necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism, which incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing the deep understanding and reasoning of drone video content. This method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term and a diversity semantic term to deepen the interaction between text and drone video modalities and improve the robustness of feature representation. To reduce the interference of complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses on feature extraction and matching in target regions, minimizing noise effects. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods in the drone video-text retrieval task. The source code and dataset will be made publicly available.

10. 【2510.15428】Fault Cause Identification across Manufacturing Lines through Ontology-Guided and Process-Aware FMEA Graph Learning with LLMs

链接https://arxiv.org/abs/2510.15428

作者:Sho Okazaki,Kohei Kaminishi,Takuma Fujiu,Yusheng Wang,Jun Ota

类目:Information Retrieval (cs.IR)

关键词:Effects Analysis, Mode and Effects, existing Failure Mode, frequent reconfigurations, Failure Mode

备注

点击查看摘要

Abstract:Fault cause identification in automated manufacturing lines is challenging due to the system's complexity, frequent reconfigurations, and the limited reusability of existing Failure Mode and Effects Analysis (FMEA) knowledge. Although FMEA worksheets contain valuable expert insights, their reuse across heterogeneous lines is hindered by natural language variability, inconsistent terminology, and process differences. To address these limitations, this study proposes a process-aware framework that enhances FMEA reusability by combining manufacturing-domain conceptualization with graph neural network (GNN) reasoning. First, FMEA worksheets from multiple manufacturing lines are transformed into a unified knowledge graph through ontology-guided large language model (LLM) extraction, capturing domain concepts such as actions, states, components, and parameters. Second, a Relational Graph Convolutional Network (RGCN) with the process-aware scoring function learns embeddings that respect both semantic relationships and sequential process flows. Finally, link prediction is employed to infer and rank candidate fault causes consistent with the target line's process flow. A case study on automotive pressure sensor assembly lines demonstrates that the proposed method outperforms a state-of-the-art retrieval-augmented generation (RAG) baseline (F1@20 = 0.267) and an RGCN approach (0.400), achieving the best performance (0.523) in fault cause identification. Ablation studies confirm the contributions of both LLM-driven domain conceptualization and process-aware learning. These results indicate that the proposed framework significantly improves the transferability of FMEA knowledge across heterogeneous lines, thereby supporting operators in diagnosing failures more reliably and paving the way for future domain-adaptive LLM applications in smart manufacturing.

Subjects:

Information Retrieval (cs.IR)

Cite as:
arXiv:2510.15428 [cs.IR]

(or
arXiv:2510.15428v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2510.15428

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
11. 【2510.15308】Dimension Mask Layer: Optimizing Embedding Efficiency for Scalable ID-based Models

链接https://arxiv.org/abs/2510.15308

作者:Srijan Saket,Ikuhiro Ihara,Vaibhav Sharma,Danish Kalim

类目:Information Retrieval (cs.IR)

关键词:modern recommendation systems, large-scale ID-based features, social media platforms, consume significant memory, large-scale ID-based

备注: 7 pages, 6 figures, 2 tables

点击查看摘要

Abstract:In modern recommendation systems and social media platforms like Meta, TikTok, and Instagram, large-scale ID-based features often require embedding tables that consume significant memory. Managing these embedding sizes can be challenging, leading to bulky models that are harder to deploy and maintain. In this paper, we introduce a method to automatically determine the optimal embedding size for ID features, significantly reducing the model size while maintaining performance. Our approach involves defining a custom Keras layer called the dimension mask layer, which sits directly after the embedding lookup. This layer trims the embedding vector by allowing only the first N dimensions to pass through. By doing this, we can reduce the input feature dimension by more than half with minimal or no loss in model performance metrics. This reduction helps cut down the memory footprint of the model and lowers the risk of overfitting due to multicollinearity. Through offline experiments on public datasets and an online A/B test on a real production dataset, we demonstrate that using a dimension mask layer can shrink the effective embedding dimension by 40-50\%, leading to substantial improvements in memory efficiency. This method provides a scalable solution for platforms dealing with a high volume of ID features, optimizing both resource usage and model performance.

Comments:
7 pages, 6 figures, 2 tables

Subjects:

Information Retrieval (cs.IR)

Cite as:
arXiv:2510.15308 [cs.IR]

(or
arXiv:2510.15308v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2510.15308

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
12. 【2510.15299】GRank: Towards Target-Aware and Streamlined Industrial Retrieval with a Generate-Rank Framework

链接https://arxiv.org/abs/2510.15299

作者:Yijia Sun,Shanshan Huang,Zhiyuan Guan,Qiang Luo,Ruiming Tang,Kun Gai,Guorui Zhou

类目:Information Retrieval (cs.IR)

关键词:Industrial-scale recommender systems, recommender systems rely, Industrial-scale recommender, high-recall candidate set, tight latency

备注

点击查看摘要

Abstract:Industrial-scale recommender systems rely on a cascade pipeline in which the retrieval stage must return a high-recall candidate set from billions of items under tight latency. Existing solutions ei- ther (i) suffer from limited expressiveness in capturing fine-grained user-item interactions, as seen in decoupled dual-tower architectures that rely on separate encoders, or generative models that lack precise target-aware matching capabilities, or (ii) build structured indices (tree, graph, quantization) whose item-centric topologies struggle to incorporate dynamic user preferences and incur prohibitive construction and maintenance costs. We present GRank, a novel structured-index-free retrieval paradigm that seamlessly unifies target-aware learning with user-centric retrieval. Our key innovations include: (1) A target-aware Generator trained to perform personalized candidate generation via GPU-accelerated MIPS, eliminating semantic drift and maintenance costs of structured indexing; (2) A lightweight but powerful Ranker that performs fine-grained, candidate-specific inference on small subsets; (3) An end-to-end multi-task learning framework that ensures semantic consistency between generation and ranking objectives. Extensive experiments on two public benchmarks and a billion-item production corpus demonstrate that GRank improves Recall@500 by over 30% and 1.7$\times$ the P99 QPS of state-of-the-art tree- and graph-based retrievers. GRank has been fully deployed in production in our recommendation platform since Q2 2025, serving 400 million active users with 99.95% service availability. Online A/B tests confirm significant improvements in core engagement metrics, with Total App Usage Time increasing by 0.160% in the main app and 0.165% in the Lite version.

Subjects:

Information Retrieval (cs.IR)

Cite as:
arXiv:2510.15299 [cs.IR]

(or
arXiv:2510.15299v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2510.15299

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Yijia Sun [view email] [v1]
Fri, 17 Oct 2025 04:15:09 UTC (1,417 KB)

13. 【2510.15286】MTmixAtt: Integrating Mixture-of-Experts with Multi-Mix Attention for Large-Scale Recommendation

链接https://arxiv.org/abs/2510.15286

作者:Xianyang Qi,Yuan Tian,Zhaoyu Hu,Zhirui Kuai,Chang Liu,Hongxiang Lin,Lei Wang

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:recommender systems critically, systems critically depend, Industrial recommender systems, high-quality ranking models, recommender systems

备注

点击查看摘要

Abstract:Industrial recommender systems critically depend on high-quality ranking models. However, traditional pipelines still rely on manual feature engineering and scenario-specific architectures, which hinder cross-scenario transfer and large-scale deployment. To address these challenges, we propose \textbf{MTmixAtt}, a unified Mixture-of-Experts (MoE) architecture with Multi-Mix Attention, designed for large-scale recommendation tasks. MTmixAtt integrates two key components. The \textbf{AutoToken} module automatically clusters heterogeneous features into semantically coherent tokens, removing the need for human-defined feature groups. The \textbf{MTmixAttBlock} module enables efficient token interaction via a learnable mixing matrix, shared dense experts, and scenario-aware sparse experts, capturing both global patterns and scenario-specific behaviors within a single framework. Extensive experiments on the industrial TRec dataset from Meituan demonstrate that MTmixAtt consistently outperforms state-of-the-art baselines including Transformer-based models, WuKong, HiFormer, MLP-Mixer, and RankMixer. At comparable parameter scales, MTmixAtt achieves superior CTR and CTCVR metrics; scaling to MTmixAtt-1B yields further monotonic gains. Large-scale online A/B tests validate the real-world impact: in the \textit{Homepage} scenario, MTmixAtt increases Payment PV by \textbf{+3.62\%} and Actual Payment GTV by \textbf{+2.54\%}. Overall, MTmixAtt provides a unified and scalable solution for modeling arbitrary heterogeneous features across scenarios, significantly improving both user experience and commercial outcomes.

14. 【2510.15238】HOB: A Holistically Optimized Bidding Strategy under Heterogeneous Auction Mechanisms with Organic Traffic

链接https://arxiv.org/abs/2510.15238

作者:Qi Li,Wendong Huang,Qichen Ye,Wutong Xu,Cheems Wang,Rongquan Bai,Wei Yuan,Guan Wang,Chuan Yu,Jian Xu

类目:Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:E-commerce advertising platforms, advertising platforms typically, platforms typically sell, typically sell commercial, E-commerce advertising

备注

点击查看摘要

Abstract:The E-commerce advertising platforms typically sell commercial traffic through either second-price auction (SPA) or first-price auction (FPA). SPA was historically prevalent due to its dominant strategy incentive-compatible (DSIC) for bidders with quasi-linear utilities, especially when budgets are not a binding constraint, while FPA has gained more prominence for offering higher revenue potential to publishers and avoiding the possibility for discriminatory treatment in personalized reserve prices. Meanwhile, on the demand side, advertisers are increasingly adopting platform-wide marketing solutions akin to QuanZhanTui, shifting from spending budgets solely on commercial traffic to bidding on the entire traffic for the purpose of maximizing overall sales. For automated bidding systems, such a trend poses a critical challenge: determining optimal strategies across heterogeneous auction channels to fulfill diverse advertiser objectives, such as maximizing return (MaxReturn) or meeting target return on ad spend (TargetROAS). To overcome this challenge, this work makes two key contributions. First, we derive an efficient solution for optimal bidding under FPA channels, which takes into account the presence of organic traffic - traffic can be won for free. Second, we introduce a marginal cost alignment (MCA) strategy that provably secures bidding efficiency across heterogeneous auction mechanisms. To validate performance of our developed framework, we conduct comprehensive offline experiments on public datasets and large-scale online A/B testing, which demonstrate consistent improvements over existing methods.

15. 【2510.15191】Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning

链接https://arxiv.org/abs/2510.15191

作者:Junlin Wu,Xianrui Zhong,Jiashuo Sun,Bolian Li,Bowen Jin,Jiawei Han,Qingkai Zeng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large language models, demonstrated remarkable advances, Large language, demonstrated remarkable, remarkable advances

备注

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable advances in reasoning capabilities. However, their performance remains constrained by limited access to explicit and structured domain knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external information as context to augment reasoning. Nevertheless, traditional RAG systems typically operate over unstructured and fragmented text, resulting in low information density and suboptimal reasoning. To overcome these limitations, we propose \textsc{Structure-R1}, a novel framework that transforms retrieved content into structured representations optimized for reasoning. Leveraging reinforcement learning, \textsc{Structure-R1} learns a content representation policy that dynamically generates and adapts structural formats based on the demands of multi-step reasoning. Unlike prior methods that rely on fixed schemas, our approach adopts a generative paradigm capable of producing task-specific structures tailored to individual queries. To ensure the quality and reliability of these representations, we introduce a self-reward structural verification mechanism that checks whether the generated structures are both correct and self-contained. Extensive experiments on seven knowledge-intensive benchmarks show that \textsc{Structure-R1} consistently achieves competitive performance with a 7B-scale backbone model and matches the performance of much larger models. Additionally, our theoretical analysis demonstrates how structured representations enhance reasoning by improving information density and contextual clarity. Our code and data are available at: this https URL.

16. 【2510.15087】DMRetriever: A Family of Models for Improved Text Retrieval in Disaster Management

链接https://arxiv.org/abs/2510.15087

作者:Kai Yin,Xiangjue Dong,Chengkai Liu,Allen Lin,Lingfeng Shi,Ali Mostafavi,James Caverlee

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Effective and efficient, disaster management, disaster management scenarios, efficient access, access to relevant

备注

点击查看摘要

Abstract:Effective and efficient access to relevant information is essential for disaster management. However, no retrieval model is specialized for disaster management, and existing general-domain models fail to handle the varied search intents inherent to disaster management scenarios, resulting in inconsistent and unreliable performance. To this end, we introduce DMRetriever, the first series of dense retrieval models (33M to 7.6B) tailored for this domain. It is trained through a novel three-stage framework of bidirectional attention adaptation, unsupervised contrastive pre-training, and difficulty-aware progressive instruction fine-tuning, using high-quality data generated through an advanced data refinement pipeline. Comprehensive experiments demonstrate that DMRetriever achieves state-of-the-art (SOTA) performance across all six search intents at every model scale. Moreover, DMRetriever is highly parameter-efficient, with 596M model outperforming baselines over 13.3 X larger and 33M model exceeding baselines with only 7.6% of their parameters. All codes, data, and checkpoints are available at this https URL

17. 【2411.00784】FIRE: Fact-checking with Iterative Retrieval and Verification

链接https://arxiv.org/abs/2411.00784

作者:Zhuohan Xie,Rui Xing,Yuxia Wang,Jiahui Geng,Hasan Iqbal,Dhruv Sahnan,Iryna Gurevych,Preslav Nakov

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Fact-checking long-form text, multiple atomic claims, text is challenging, long-form text, common practice

备注: 4 figures, 8 tables, accepted to Findings of NAACL

点击查看摘要

Abstract:Fact-checking long-form text is challenging, and it is therefore common practice to break it down into multiple atomic claims. The typical approach to fact-checking these atomic claims involves retrieving a fixed number of pieces of evidence, followed by a verification step. However, this method is usually not cost-effective, as it underutilizes the verification model's internal knowledge of the claim and fails to replicate the iterative reasoning process in human search strategies. To address these limitations, we propose FIRE, a novel agent-based framework that integrates evidence retrieval and claim verification in an iterative manner. Specifically, FIRE employs a unified mechanism to decide whether to provide a final answer or generate a subsequent search query, based on its confidence in the current judgment. We compare FIRE with other strong fact-checking frameworks and find that it achieves slightly better performance while reducing large language model (LLM) costs by an average of 7.6 times and search costs by 16.5 times. These results indicate that FIRE holds promise for application in large-scale fact-checking operations. Our code is available at this https URL.

计算机视觉

1. 【2510.15870】OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

链接https://arxiv.org/abs/2510.15870

作者:Hanrong Ye,Chao-Han Huck Yang,Arushi Goel,Wei Huang,Ligeng Zhu,Yuanhang Su,Sean Lin,An-Chieh Cheng,Zhen Wan,Jinchuan Tian,Yuming Lou,Dong Yang,Zhijian Liu,Yukang Chen,Ambrish Dantrey,Ehsan Jahangiri,Sreyan Ghosh,Daguang Xu,Ehsan Hosseini-Asl,Danial Mohseni Taheri,Vidya Murali,Sifei Liu,Jason Lu,Oluwatobi Olabiyi,Frank Wang,Rafael Valle,Bryan Catanzaro,Andrew Tao,Song Han,Jan Kautz,Hongxu Yin,Pavlo Molchanov

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Advancing machine intelligence, machine intelligence requires, intelligence requires developing, Advancing machine, sense the world

备注: Technical Report. Code: [this https URL](https://github.com/NVlabs/OmniVinci)

点击查看摘要

Abstract:Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.

2. 【2510.15869】Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery

链接https://arxiv.org/abs/2510.15869

作者:Jie-Ying Lee,Yi-Ruei Liu,Shr-Ruei Tsai,Wei-Cheng Chang,Chung-Ho Wu,Jiewen Chan,Zhenjun Zhao,Chieh Hubert Lin,Yu-Lun Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Synthesizing large-scale, geometrically accurate, embodied applications, challenging yet valuable, valuable task

备注: Project page: [this https URL](https://skyfall-gs.jayinnn.dev/)

点击查看摘要

Abstract:Synthesizing large-scale, explorable, and geometrically accurate 3D urban scenes is a challenging yet valuable task in providing immersive and embodied applications. The challenges lie in the lack of large-scale and high-quality real-world 3D scans for training generalizable generative models. In this paper, we take an alternative route to create large-scale 3D scenes by synergizing the readily available satellite imagery that supplies realistic coarse geometry and the open-domain diffusion model for creating high-quality close-up appearances. We propose \textbf{Skyfall-GS}, the first city-block scale 3D scene creation framework without costly 3D annotations, also featuring real-time, immersive 3D exploration. We tailor a curriculum-driven iterative refinement strategy to progressively enhance geometric completeness and photorealistic textures. Extensive experiments demonstrate that Skyfall-GS provides improved cross-view consistent geometry and more realistic textures compared to state-of-the-art approaches. Project page: this https URL

3. 【2510.15868】LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal

链接https://arxiv.org/abs/2510.15868

作者:Shr-Ruei Tsai,Wei-Cheng Chang,Jie-Ying Lee,Chih-Hai Su,Yu-Lun Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Lens flare significantly, impacting critical computer, degrades image quality, critical computer vision, computer vision tasks

备注: ICCV 2025. Project page: [this https URL](https://ray-1026.github.io/lightsout/)

点击查看摘要

Abstract:Lens flare significantly degrades image quality, impacting critical computer vision tasks like object detection and autonomous driving. Recent Single Image Flare Removal (SIFR) methods perform poorly when off-frame light sources are incomplete or absent. We propose LightsOut, a diffusion-based outpainting framework tailored to enhance SIFR by reconstructing off-frame light sources. Our method leverages a multitask regression module and LoRA fine-tuned diffusion model to ensure realistic and physically consistent outpainting results. Comprehensive experiments demonstrate LightsOut consistently boosts the performance of existing SIFR methods across challenging scenarios without additional retraining, serving as a universally applicable plug-and-play preprocessing solution. Project page: this https URL

4. 【2510.15866】BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models

链接https://arxiv.org/abs/2510.15866

作者:Kaushitha Silva,Mansitha Eashwara,Sanduni Ubayasiri,Ruwan Tennakoon,Damayanthi Herath

类目:Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

关键词:uninterpretable latent vectors, prompt optimization techniques, single textual prompts, optimization techniques, techniques that produce

备注: 10 Pages + 15 Supplementary Material Pages, 5 figures

点击查看摘要

Abstract:The clinical adoption of biomedical vision-language models is hindered by prompt optimization techniques that produce either uninterpretable latent vectors or single textual prompts. This lack of transparency and failure to capture the multi-faceted nature of clinical diagnosis, which relies on integrating diverse observations, limits their trustworthiness in high-stakes settings. To address this, we introduce BiomedXPro, an evolutionary framework that leverages a large language model as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable, natural-language prompt pairs for disease diagnosis. Experiments on multiple biomedical benchmarks show that BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings. Furthermore, our analysis demonstrates a strong semantic alignment between the discovered prompts and statistically significant clinical features, grounding the model's performance in verifiable concepts. By producing a diverse ensemble of interpretable prompts, BiomedXPro provides a verifiable basis for model predictions, representing a critical step toward the development of more trustworthy and clinically-aligned AI systems.

5. 【2510.15857】BLIP3o-NEXT: Next Frontier of Native Image Generation

链接https://arxiv.org/abs/2510.15857

作者:Jiuhai Chen,Le Xue,Zhiyang Xu,Xichen Pan,Shusheng Yang,Can Qin,An Yan,Honglu Zhou,Zeyuan Chen,Lifu Huang,Tianyi Zhou,Junnan Li,Silvio Savarese,Caiming Xiong,Ran Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:native image generation, fully open-source foundation, image generation, open-source foundation model, native image

备注

点击查看摘要

Abstract:We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing the state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing still remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations of various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.

6. 【2510.15849】Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt

链接https://arxiv.org/abs/2510.15849

作者:Joongwon Chae,Lihui Luo,Xi Yuan,Dongmei Yu,Zhenglin Chen,Lian Zhang,Peiwu Qin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:reliable TCM analysis, TCM analysis, reliable TCM, Accurate tongue segmentation, Accurate tongue

备注

点击查看摘要

Abstract:Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at this https URL.

7. 【2510.15846】3DPR: Single Image 3D Portrait Relight using Generative Priors

链接https://arxiv.org/abs/2510.15846

作者:Pramod Rao,Abhimitra Meka,Xilong Zhou,Gereon Fox,Mallikarjun B R,Fangneng Zhan,Tim Weyrich,Bernd Bickel,Hanspeter Pfister,Wojciech Matusik,Thabo Beeler,Mohamed Elgharib,Marc Habermann,Christian Theobalt

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:inherently underconstrained problem, relit views, underconstrained problem, inherently underconstrained, monocular portrait image

备注: Accepted at ACM SIGGRAPH ASIA 2025 Conference Proceedings

点击查看摘要

Abstract:Rendering novel, relit views of a human head, given a monocular portrait image as input, is an inherently underconstrained problem. The traditional graphics solution is to explicitly decompose the input image into geometry, material and lighting via differentiable rendering; but this is constrained by the multiple assumptions and approximations of the underlying models and parameterizations of these scene components. We propose 3DPR, an image-based relighting model that leverages generative priors learnt from multi-view One-Light-at-A-Time (OLAT) images captured in a light stage. We introduce a new diverse and large-scale multi-view 4K OLAT dataset of 139 subjects to learn a high-quality prior over the distribution of high-frequency face reflectance. We leverage the latent space of a pre-trained generative head model that provides a rich prior over face geometry learnt from in-the-wild image datasets. The input portrait is first embedded in the latent manifold of such a model through an encoder-based inversion process. Then a novel triplane-based reflectance network trained on our lightstage data is used to synthesize high-fidelity OLAT images to enable image-based relighting. Our reflectance network operates in the latent space of the generative head model, crucially enabling a relatively small number of lightstage images to train the reflectance model. Combining the generated OLATs according to a given HDRI environment maps yields physically accurate environmental relighting results. Through quantitative and qualitative evaluations, we demonstrate that 3DPR outperforms previous methods, particularly in preserving identity and in capturing lighting effects such as specularities, self-shadows, and subsurface scattering. Project Page: this https URL

8. 【2510.15842】Paper2Web: Let's Make Your Paper Alive!

链接https://arxiv.org/abs/2510.15842

作者:Yuhang Chen,Tianpeng Lv,Siyi Zhang,Yixiang Yin,Yao Wan,Philip S. Yu,Dongping Chen

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:effectively disseminate research, enable intuitive navigation, Academic project websites, Large Language Model, academic webpage generation

备注: Under Review. Check [this https URL](https://github.com/YuhangChen1/Paper2All) for the unified platform to streamline all academic presentation

点击查看摘要

Abstract:Academic project websites can more effectively disseminate research when they clearly present core content and enable intuitive navigation and interaction. However, current approaches such as direct Large Language Model (LLM) generation, templates, or direct HTML conversion struggle to produce layout-aware, interactive sites, and a comprehensive evaluation suite for this task has been lacking. In this paper, we introduce Paper2Web, a benchmark dataset and multi-dimensional evaluation framework for assessing academic webpage generation. It incorporates rule-based metrics like Connectivity, Completeness and human-verified LLM-as-a-Judge (covering interactivity, aesthetics, and informativeness), and PaperQuiz, which measures paper-level knowledge retention. We further present PWAgent, an autonomous pipeline that converts scientific papers into interactive and multimedia-rich academic homepages. The agent iteratively refines both content and layout through MCP tools that enhance emphasis, balance, and presentation quality. Our experiments show that PWAgent consistently outperforms end-to-end baselines like template-based webpages and arXiv/alphaXiv versions by a large margin while maintaining low cost, achieving the Pareto-front in academic webpage generation.

9. 【2510.15841】Neuro-Symbolic Spatial Reasoning in Segmentation

链接https://arxiv.org/abs/2510.15841

作者:Jiayi Lin,Jiabo Huang,Shaogang Gong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:assigns pixel-level labels, Open-Vocabulary Semantic Segmentation, reasoning in OVSS, assigns pixel-level, requiring generalization

备注

点击查看摘要

Abstract:Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. Using vision-language models (VLMs) to correlate local image patches with potential unseen object categories suffers from a lack of understanding of spatial relations of objects in a scene. To solve this problem, we introduce neuro-symbolic (NeSy) spatial reasoning in OVSS. In contrast to contemporary VLM correlation-based approaches, we propose Relational Segmentor (RelateSeg) to impose explicit spatial relational constraints by first order logic (FOL) formulated in a neural network architecture. This is the first attempt to explore NeSy spatial reasoning in OVSS. Specifically, RelateSeg automatically extracts spatial relations, e.g., cat, to-right-of, person, and encodes them as first-order logic formulas using our proposed pseudo categories. Each pixel learns to predict both a semantic category (e.g., "cat") and a spatial pseudo category (e.g., "right of person") simultaneously, enforcing relational constraints (e.g., a "cat" pixel must lie to the right of a "person"). Finally, these logic constraints are formulated in a deep network architecture by fuzzy logic relaxation, enabling end-to-end learning of spatial-relationally consistent segmentation. RelateSeg achieves state-of-the-art performance in terms of average mIoU across four benchmark datasets and particularly shows clear advantages on images containing multiple categories, with the cost of only introducing a single auxiliary loss function and no additional parameters, validating the effectiveness of NeSy spatial reasoning in OVSS.

10. 【2510.15831】VISTA: A Test-Time Self-Improving Video Generation Agent

链接https://arxiv.org/abs/2510.15831

作者:Do Xuan Long,Xingchen Wan,Hootan Nakhost,Chen-Yu Lee,Tomas Pfister,Sercan Ö. Arık

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remains critically dependent, quality remains critically, rapid advances, remains critically, critically dependent

备注

点击查看摘要

Abstract:Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.

11. 【2510.15800】ERNet: Efficient Non-Rigid Registration Network for Point Sequences

链接https://arxiv.org/abs/2510.15800

作者:Guangzhao He,Yuxi Xiao,Zhen Xu,Xiaowei Zhou,Sida Peng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:point clouds undergoing, clouds undergoing non-rigid, Registering an object, undergoing non-rigid deformation, object shape

备注: Accepted to ICCV 2025. Project Page: [this https URL](https://guangzhaohe.com/ernet)

点击查看摘要

Abstract:Registering an object shape to a sequence of point clouds undergoing non-rigid deformation is a long-standing challenge. The key difficulties stem from two factors: (i) the presence of local minima due to the non-convexity of registration objectives, especially under noisy or partial inputs, which hinders accurate and robust deformation estimation, and (ii) error accumulation over long sequences, leading to tracking failures. To address these challenges, we introduce to adopt a scalable data-driven approach and propose ERNet, an efficient feed-forward model trained on large deformation datasets. It is designed to handle noisy and partial inputs while effectively leveraging temporal information for accurate and consistent sequential registration. The key to our design is predicting a sequence of deformation graphs through a two-stage pipeline, which first estimates frame-wise coarse graph nodes for robust initialization, before refining their trajectories over time in a sliding-window fashion. Extensive experiments show that our proposed approach (i) outperforms previous state-of-the-art on both the DeformingThings4D and D-FAUST datasets, and (ii) achieves more than 4x speedup compared to the previous best, offering significant efficiency improvement.

12. 【2510.15783】ReCon: Region-Controllable Data Augmentation with Rectification and Alignment for Object Detection

链接https://arxiv.org/abs/2510.15783

作者:Haowei Zhu,Tianxiang Pan,Rui Qin,Jun-Hai Yong,Bin Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:training robust perception, crucial for training, training robust, robust perception models, robust perception

备注: Accepted to NeurIPS 2025 (spotlight)

点击查看摘要

Abstract:The scale and quality of datasets are crucial for training robust perception models. However, obtaining large-scale annotated data is both costly and time-consuming. Generative models have emerged as a powerful tool for data augmentation by synthesizing samples that adhere to desired distributions. However, current generative approaches often rely on complex post-processing or extensive fine-tuning on massive datasets to achieve satisfactory results, and they remain prone to content-position mismatches and semantic leakage. To overcome these limitations, we introduce ReCon, a novel augmentation framework that enhances the capacity of structure-controllable generative models for object detection. ReCon integrates region-guided rectification into the diffusion sampling process, using feedback from a pre-trained perception model to rectify misgenerated regions within diffusion sampling process. We further propose region-aligned cross-attention to enforce spatial-semantic alignment between image regions and their textual cues, thereby improving both semantic consistency and overall image fidelity. Extensive experiments demonstrate that ReCon substantially improve the quality and trainability of generated data, achieving consistent performance gains across various datasets, backbone architectures, and data scales. Our code is available at this https URL .

13. 【2510.15778】Controlling the image generation process with parametric activation functions

链接https://arxiv.org/abs/2510.15778

作者:Ilia Pavlov

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:leverage direct interaction, replace activation functions, BigGAN networks trained, image generative models, generative models continue

备注: 5 pages, 5 figures, accepted for the 16th International Conference on Computational Creativity, ICCC'25

点击查看摘要

Abstract:As image generative models continue to increase not only in their fidelity but also in their ubiquity the development of tools that leverage direct interaction with their internal mechanisms in an interpretable way has received little attention In this work we introduce a system that allows users to develop a better understanding of the model through interaction and experimentation By giving users the ability to replace activation functions of a generative network with parametric ones and a way to set the parameters of these functions we introduce an alternative approach to control the networks output We demonstrate the use of our method on StyleGAN2 and BigGAN networks trained on FFHQ and ImageNet respectively.

14. 【2510.15770】owards more holistic interpretability: A lightweight disentangled Concept Bottleneck Model

链接https://arxiv.org/abs/2510.15770

作者:Gaoxiang Huang,Songning Lai,Yutao Yue

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:predicting human-understandable concepts, Concept Bottleneck Models, Disentangled Concept Bottleneck, intermediate representations, Concept Bottleneck

备注

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) enhance interpretability by predicting human-understandable concepts as intermediate representations. However, existing CBMs often suffer from input-to-concept mapping bias and limited controllability, which restricts their practical value, directly damage the responsibility of strategy from concept-based methods. We propose a lightweight Disentangled Concept Bottleneck Model (LDCBM) that automatically groups visual features into semantically meaningful components without region annotation. By introducing a filter grouping loss and joint concept supervision, our method improves the alignment between visual patterns and concepts, enabling more transparent and robust decision-making. Notably, Experiments on three diverse datasets demonstrate that LDCBM achieves higher concept and class accuracy, outperforming previous CBMs in both interpretability and classification performance. By grounding concepts in visual evidence, our method overcomes a fundamental limitation of prior models and enhances the reliability of interpretable AI.

15. 【2510.15761】QSilk: Micrograin Stabilization and Adaptive Quantile Clipping for Detail-Friendly Latent Diffusion

链接https://arxiv.org/abs/2510.15761

作者:Denis Rychkovskiy(DZRobo, Independent Researcher)

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:always-on stabilization layer, rare activation spikes, improves high-frequency fidelity, suppressing rare activation, Adaptive Quantile Clip

备注: Preprint. Qualitative side-by-side comparisons (fixed seeds); 3 figures with subfigures; 1 algorithm. CADE 2.5 / SDXL integration; sample images included. Code and presets planned for release upon publication

点击查看摘要

Abstract:We present QSilk, a lightweight, always-on stabilization layer for latent diffusion that improves high-frequency fidelity while suppressing rare activation spikes. QSilk combines (i) a per-sample micro clamp that gently limits extreme values without washing out texture, and (ii) Adaptive Quantile Clip (AQClip), which adapts the allowed value corridor per region. AQClip can operate in a proxy mode using local structure statistics or in an attention entropy guided mode (model confidence). Integrated into the CADE 2.5 rendering pipeline, QSilk yields cleaner, sharper results at low step counts and ultra-high resolutions with negligible overhead. It requires no training or fine-tuning and exposes minimal user controls. We report consistent qualitative improvements across SD/SDXL backbones and show synergy with CFG/Rescale, enabling slightly higher guidance without artifacts.

16. 【2510.15757】Poultry Farm Intelligence: An Integrated Multi-Sensor AI Platform for Enhanced Welfare and Productivity

链接https://arxiv.org/abs/2510.15757

作者:Pieris Panagi,Savvas Karatsiolis,Kyriacos Mosphilis,Nicholas Hadjisavvas,Andreas Kamilaris,Nicolas Nicolaou,Efstathios Stavrakis,Vassilis Vassiliades

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

关键词:farming faces increasing, faces increasing pressure, meet productivity targets, Poultry farming faces, ensuring animal welfare

备注

点击查看摘要

Abstract:Poultry farming faces increasing pressure to meet productivity targets while ensuring animal welfare and environmental compliance. Yet many small and medium-sized farms lack affordable, integrated tools for continuous monitoring and decision-making, relying instead on manual, reactive inspections. This paper presents Poultry Farm Intelligence (PoultryFI) - a modular, cost-effective platform that integrates six AI-powered modules: Camera Placement Optimizer, Audio-Visual Monitoring, Analytics Alerting, Real-Time Egg Counting, Production Profitability Forecasting, and a Recommendation Module. Camera layouts are first optimized offline using evolutionary algorithms for full poultry house coverage with minimal hardware. The Audio-Visual Monitoring module extracts welfare indicators from synchronized video, audio, and feeding data. Analytics Alerting produces daily summaries and real-time notifications, while Real-Time Egg Counting uses an edge vision model to automate production tracking. Forecasting models predict egg yield and feed consumption up to 10 days in advance, and the Recommendation Module integrates forecasts with weather data to guide environmental and operational adjustments. This is among the first systems to combine low-cost sensing, edge analytics, and prescriptive AI to continuously monitor flocks, predict production, and optimize performance. Field trials demonstrate 100% egg-count accuracy on Raspberry Pi 5, robust anomaly detection, and reliable short-term forecasting. PoultryFI bridges the gap between isolated pilot tools and scalable, farm-wide intelligence, empowering producers to proactively safeguard welfare and profitability.

Subjects:

Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

Cite as:
arXiv:2510.15757 [cs.LG]

(or
arXiv:2510.15757v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2510.15757

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
17. 【2510.15756】Semantic segmentation with coarse annotations

链接https://arxiv.org/abs/2510.15756

作者:Jort de Jong,Mike Holenderski

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Semantic segmentation, task of classifying, coarse annotations, Semantic, annotations

备注

点击查看摘要

Abstract:Semantic segmentation is the task of classifying each pixel in an image. Training a segmentation model achieves best results using annotated images, where each pixel is annotated with the corresponding class. When obtaining fine annotations is difficult or expensive, it may be possible to acquire coarse annotations, e.g. by roughly annotating pixels in an images leaving some pixels around the boundaries between classes unlabeled. Segmentation with coarse annotations is difficult, in particular when the objective is to optimize the alignment of boundaries between classes. This paper proposes a regularization method for models with an encoder-decoder architecture with superpixel based upsampling. It encourages the segmented pixels in the decoded image to be SLIC-superpixels, which are based on pixel color and position, independent of the segmentation annotation. The method is applied to FCN-16 fully convolutional network architecture and evaluated on the SUIM, Cityscapes, and PanNuke data sets. It is shown that the boundary recall improves significantly compared to state-of-the-art models when trained on coarse annotations.

18. 【2510.15752】NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation

链接https://arxiv.org/abs/2510.15752

作者:Yitong Sun,Yao Huang,Ruochen Zhang,Huanran Chen,Shouwei Ruan,Ranjie Duan,Xingxing Wei

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:generating inappropriate content, remain vulnerable, vulnerable to generating, generating inappropriate, implicit sexual prompts

备注: 10 pages, 8 figures, accepted by ACMMM 2025

点击查看摘要

Abstract:Despite the impressive generative capabilities of text-to-image (T2I) diffusion models, they remain vulnerable to generating inappropriate content, especially when confronted with implicit sexual prompts. Unlike explicit harmful prompts, these subtle cues, often disguised as seemingly benign terms, can unexpectedly trigger sexual content due to underlying model biases, raising significant ethical concerns. However, existing detection methods are primarily designed to identify explicit sexual content and therefore struggle to detect these implicit cues. Fine-tuning approaches, while effective to some extent, risk degrading the model's generative quality, creating an undesirable trade-off. To address this, we propose NDM, the first noise-driven detection and mitigation framework, which could detect and mitigate implicit malicious intention in T2I generation while preserving the model's original generative capabilities. Specifically, we introduce two key innovations: first, we leverage the separability of early-stage predicted noise to develop a noise-based detection method that could identify malicious content with high accuracy and efficiency; second, we propose a noise-enhanced adaptive negative guidance mechanism that could optimize the initial noise by suppressing the prominent region's attention, thereby enhancing the effectiveness of adaptive negative guidance for sexual mitigation. Experimentally, we validate NDM on both natural and adversarial datasets, demonstrating its superior performance over existing SOTA methods, including SLD, UCE, and RECE, etc. Code and resources are available at this https URL.

19. 【2510.15749】SEGA: A Stepwise Evolution Paradigm for Content-Aware Layout Generation with Design Prior

链接https://arxiv.org/abs/2510.15749

作者:Haoran Wang,Bo Zhao,Jinghui Wang,Hanzhang Wang,Huan Yang,Wei Ji,Hao Liu,Xinyan Xiao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:automatically generate layouts, layout generation problem, content-aware layout generation, background image, aims to automatically

备注: Accepted by ICCV-2025, Our project website is at: [this https URL](https://brucew91.github.io/SEGA.github.io/) , 10 pages

点击查看摘要

Abstract:In this paper, we study the content-aware layout generation problem, which aims to automatically generate layouts that are harmonious with a given background image. Existing methods usually deal with this task with a single-step reasoning framework. The lack of a feedback-based self-correction mechanism leads to their failure rates significantly increasing when faced with complex element layout planning. To address this challenge, we introduce SEGA, a novel Stepwise Evolution Paradigm for Content-Aware Layout Generation. Inspired by the systematic mode of human thinking, SEGA employs a hierarchical reasoning framework with a coarse-to-fine strategy: first, a coarse-level module roughly estimates the layout planning results; then, another refining module performs fine-level reasoning regarding the coarse planning results. Furthermore, we incorporate layout design principles as prior knowledge into the model to enhance its layout planning ability. Besides, we present GenPoster-100K that is a new large-scale poster dataset with rich meta-information annotation. The experiments demonstrate the effectiveness of our approach by achieving the state-of-the-art results on multiple benchmark datasets. Our project page is at: this https URL

20. 【2510.15742】Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

链接https://arxiv.org/abs/2510.15742

作者:Qingyan Bai,Qiuyu Wang,Hao Ouyang,Yue Yu,Hanlin Wang,Wen Wang,Ka Leong Cheng,Shuailei Ma,Yanhong Zeng,Zichen Liu,Yinghao Xu,Yujun Shen,Qifeng Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:democratize content creation, high-quality training data, content creation, scarcity of large-scale, high-quality training

备注: Project page: [this https URL](https://ezioby.github.io/Ditto_page) Code: [this https URL](https://github.com/EzioBy/Ditto)

点击查看摘要

Abstract:Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.

21. 【2510.15736】Fix False Transparency by Noise Guided Splatting

链接https://arxiv.org/abs/2510.15736

作者:Aly El Hakie,Yiren Lu,Yu Yin,Michael Jenkins,Yehe Liu

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:leading to inconsistent, internal patterns, patterns under camera, camera motion, Opaque objects reconstructed

备注

点击查看摘要

Abstract:Opaque objects reconstructed by 3DGS often exhibit a falsely transparent surface, leading to inconsistent background and internal patterns under camera motion in interactive viewing. This issue stems from the ill-posed optimization in 3DGS. During training, background and foreground Gaussians are blended via alpha-compositing and optimized solely against the input RGB images using a photometric loss. As this process lacks an explicit constraint on surface opacity, the optimization may incorrectly assign transparency to opaque regions, resulting in view-inconsistent and falsely transparent. This issue is difficult to detect in standard evaluation settings but becomes particularly evident in object-centric reconstructions under interactive viewing. Although other causes of view-inconsistency have been explored recently, false transparency has not been explicitly identified. To the best of our knowledge, we are the first to identify, characterize, and develop solutions for this artifact, an underreported artifact in 3DGS. Our strategy, NGS, encourages surface Gaussians to adopt higher opacity by injecting opaque noise Gaussians in the object volume during training, requiring only minimal modifications to the existing splatting process. To quantitatively evaluate false transparency in static renderings, we propose a transmittance-based metric that measures the severity of this artifact. In addition, we introduce a customized, high-quality object-centric scan dataset exhibiting pronounced transparency issues, and we augment popular existing datasets with complementary infill noise specifically designed to assess the robustness of 3D reconstruction methods to false transparency. Experiments across multiple datasets show that NGS substantially reduces false transparency while maintaining competitive performance on standard rendering metrics, demonstrating its overall effectiveness.

22. 【2510.15725】DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification

链接https://arxiv.org/abs/2510.15725

作者:Tingyu Lin,Armin Dadras,Florian Kleber,Robert Sablatnig

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

关键词:Camera movement classification, low contrast obscure, Camera movement, obscure motion cues, contrast obscure motion

备注: 9 pages, accepted at ACMMM2025 SUMAC

点击查看摘要

Abstract:Camera movement classification (CMC) models trained on contemporary, high-quality footage often degrade when applied to archival film, where noise, missing frames, and low contrast obscure motion cues. We bridge this gap by assembling a unified benchmark that consolidates two modern corpora into four canonical classes and restructures the HISTORIAN collection into five balanced categories. Building on this benchmark, we introduce DGME-T, a lightweight extension to the Video Swin Transformer that injects directional grid motion encoding, derived from optical flow, via a learnable and normalised late-fusion layer. DGME-T raises the backbone's top-1 accuracy from 81.78% to 86.14% and its macro F1 from 82.08% to 87.81% on modern clips, while still improving the demanding World-War-II footage from 83.43% to 84.62% accuracy and from 81.72% to 82.63% macro F1. A cross-domain study further shows that an intermediate fine-tuning stage on modern data increases historical performance by more than five percentage points. These results demonstrate that structured motion priors and transformer representations are complementary and that even a small, carefully calibrated motion head can substantially enhance robustness in degraded film analysis. Related resources are available at this https URL.

23. 【2510.15710】Unimedvl: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis

链接https://arxiv.org/abs/2510.15710

作者:Junzhi Ning,Wei Li,Cheng Tang,Jiashi Lin,Chenglong Ma,Chaoyang Zhang,Jiyao Liu,Ying Chen,Shujian Gao,Lihao Liu,Yuandong Pu,Huihui Xu,Chenhui Gou,Ziyan Huang,Yi Xin,Qi Qin,Zhongying Deng,Diping Song,Bin Fu,Guang Yang,Yuanfeng Ji,Tianbin Li,Yanzhou Su,Jin Ye,Shixiang Tang,Ming Hu,Junjun He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:diagnostic applications require, applications require models, patient histories, lab results, segmentation masks

备注

点击查看摘要

Abstract:Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at this https URL.

24. 【2510.15684】owards Label-Free Brain Tumor Segmentation: Unsupervised Learning with Multimodal MRI

链接https://arxiv.org/abs/2510.15684

作者:Gerard Comas-Quiles,Carles Garcia-Cabrera,Julia Dietlmeier,Noel E. O'Connor,Ferran Marques

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Vision Transformer Autoencoder, magnetic resonance imaging, Multimodal Vision Transformer, alternative to supervised, supervised learning

备注: 10 pages, 5 figures, BraTS GoAT 2025 challenge

点击查看摘要

Abstract:Unsupervised anomaly detection (UAD) presents a complementary alternative to supervised learning for brain tumor segmentation in magnetic resonance imaging (MRI), particularly when annotated datasets are limited, costly, or inconsistent. In this work, we propose a novel Multimodal Vision Transformer Autoencoder (MViT-AE) trained exclusively on healthy brain MRIs to detect and localize tumors via reconstruction-based error maps. This unsupervised paradigm enables segmentation without reliance on manual labels, addressing a key scalability bottleneck in neuroimaging workflows. Our method is evaluated in the BraTS-GoAT 2025 Lighthouse dataset, which includes various types of tumors such as gliomas, meningiomas, and pediatric brain tumors. To enhance performance, we introduce a multimodal early-late fusion strategy that leverages complementary information across multiple MRI sequences, and a post-processing pipeline that integrates the Segment Anything Model (SAM) to refine predicted tumor contours. Despite the known challenges of UAD, particularly in detecting small or non-enhancing lesions, our method achieves clinically meaningful tumor localization, with lesion-wise Dice Similarity Coefficient of 0.437 (Whole Tumor), 0.316 (Tumor Core), and 0.350 (Enhancing Tumor) on the test set, and an anomaly Detection Rate of 89.4% on the validation set. These findings highlight the potential of transformer-based unsupervised models to serve as scalable, label-efficient tools for neuro-oncological imaging.

25. 【2510.15673】Valeo Near-Field: a novel dataset for pedestrian intent detection

链接https://arxiv.org/abs/2510.15673

作者:Antonyo Musabini,Rachid Benmokhtar,Jagdish Bhanushali,Victor Galizzi,Bertrand Luvison,Xavier Perrotton

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:detecting pedestrians' intentions, approach an ego-vehicle, paper presents, aimed at detecting, detecting pedestrians'

备注

点击查看摘要

Abstract:This paper presents a novel dataset aimed at detecting pedestrians' intentions as they approach an ego-vehicle. The dataset comprises synchronized multi-modal data, including fisheye camera feeds, lidar laser scans, ultrasonic sensor readings, and motion capture-based 3D body poses, collected across diverse real-world scenarios. Key contributions include detailed annotations of 3D body joint positions synchronized with fisheye camera images, as well as accurate 3D pedestrian positions extracted from lidar data, facilitating robust benchmarking for perception algorithms. We release a portion of the dataset along with a comprehensive benchmark suite, featuring evaluation metrics for accuracy, efficiency, and scalability on embedded systems. By addressing real-world challenges such as sensor occlusions, dynamic environments, and hardware constraints, this dataset offers a unique resource for developing and evaluating state-of-the-art algorithms in pedestrian detection, 3D pose estimation and 4D trajectory and intention prediction. Additionally, we provide baseline performance metrics using custom neural network architectures and suggest future research directions to encourage the adoption and enhancement of the dataset. This work aims to serve as a foundation for researchers seeking to advance the capabilities of intelligent vehicles in near-field scenarios.

26. 【2510.15666】Uncertainty-Aware Extreme Point Tracing for Weakly Supervised Ultrasound Image Segmentation

链接https://arxiv.org/abs/2510.15666

作者:Lei Shi,Gang Li,Junxing Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Automatic medical image, Automatic medical, supervised approaches demand, approaches demand extensive, demand extensive pixel-level

备注

点击查看摘要

Abstract:Automatic medical image segmentation is a fundamental step in computer-aided diagnosis, yet fully supervised approaches demand extensive pixel-level annotations that are costly and time-consuming. To alleviate this burden, we propose a weakly supervised segmentation framework that leverages only four extreme points as annotation. Specifically, bounding boxes derived from the extreme points are used as prompts for the Segment Anything Model 2 (SAM2) to generate reliable initial pseudo labels. These pseudo labels are progressively refined by an enhanced Feature-Guided Extreme Point Masking (FGEPM) algorithm, which incorporates Monte Carlo dropout-based uncertainty estimation to construct a unified gradient uncertainty cost map for boundary tracing. Furthermore, a dual-branch Uncertainty-aware Scale Consistency (USC) loss and a box alignment loss are introduced to ensure spatial consistency and precise boundary alignment during training. Extensive experiments on two public ultrasound datasets, BUSI and UNS, demonstrate that our method achieves performance comparable to, and even surpassing fully supervised counterparts while significantly reducing annotation cost. These results validate the effectiveness and practicality of the proposed weakly supervised framework for ultrasound image segmentation.

27. 【2510.15615】Deep Learning Based Domain Adaptation Methods in Remote Sensing: A Comprehensive Survey

链接https://arxiv.org/abs/2510.15615

作者:Shuchang Lyu,Qi Zhao,Zheng Zhou,Meng Li,You Zhou,Dingding Yao,Guangliang Cheng,Huiyu Zhou,Zhenwei Shi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:differently distributed target, remote sensing, distributed target domain, increasingly important task, Domain adaptation

备注: 30 pages, 7 figures

点击查看摘要

Abstract:Domain adaptation is a crucial and increasingly important task in remote sensing, aiming to transfer knowledge from a source domain a differently distributed target domain. It has broad applications across various real-world applications, including remote sensing element interpretation, ecological environment monitoring, and urban/rural planning. However, domain adaptation in remote sensing poses significant challenges due to differences in data, such as variations in ground sampling distance, imaging modes from various sensors, geographical landscapes, and environmental conditions. In recent years, deep learning has emerged as a powerful tool for feature representation and cross-domain knowledge transfer, leading to widespread adoption in remote sensing tasks. In this paper, we present a comprehensive survey of significant advancements in deep learning based domain adaptation for remote sensing. We first introduce the preliminary knowledge to clarify key concepts, mathematical notations, and the taxonomy of methodologies. We then organize existing algorithms from multiple perspectives, including task categorization, input mode, supervision paradigm, and algorithmic granularity, providing readers with a structured understanding of the field. Next, we review widely used datasets and summarize the performance of state-of-the-art methods to provide an overview of current progress. We also identify open challenges and potential directions to guide future research in domain adaptation for remote sensing. Compared to previous surveys, this work addresses a broader range of domain adaptation tasks in remote sensing, rather than concentrating on a few subfields. It also presents a systematic taxonomy, providing a more comprehensive and organized understanding of the field. As a whole, this survey can inspire the research community, foster understanding, and guide future work in the field.

28. 【2510.15611】Lightweight Data-Free Denoising for Detail-Preserving Biomedical Image Restoration

链接https://arxiv.org/abs/2510.15611

作者:Tomáš Chobola,Julia A. Schnabel,Tingying Peng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Current self-supervised denoising, achieve impressive results, Current self-supervised, techniques achieve impressive, impressive results

备注: 10 pages, MICCAI 2025

点击查看摘要

Abstract:Current self-supervised denoising techniques achieve impressive results, yet their real-world application is frequently constrained by substantial computational and memory demands, necessitating a compromise between inference speed and reconstruction quality. In this paper, we present an ultra-lightweight model that addresses this challenge, achieving both fast denoising and high quality image restoration. Built upon the Noise2Noise training framework-which removes the reliance on clean reference images or explicit noise modeling-we introduce an innovative multistage denoising pipeline named Noise2Detail (N2D). During inference, this approach disrupts the spatial correlations of noise patterns to produce intermediate smooth structures, which are subsequently refined to recapture fine details directly from the noisy input. Extensive testing reveals that Noise2Detail surpasses existing dataset-free techniques in performance, while requiring only a fraction of the computational resources. This combination of efficiency, low computational cost, and data-free approach make it a valuable tool for biomedical imaging, overcoming the challenges of scarce clean training data-due to rare and complex imaging modalities-while enabling fast inference for practical use.

29. 【2510.15602】Quantized FCA: Efficient Zero-Shot Texture Anomaly Detection

链接https://arxiv.org/abs/2510.15602

作者:Andrei-Timotei Ardelean,Patrick Rückbeil,Tim Weyrich

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Zero-shot anomaly localization, computer vision research, Zero-shot anomaly, vision research, recent years

备注: 13 pages, 10 figures. Published in the 30th Intl. Conference on Vision, Modeling, and Visualization (VMV), 2025

点击查看摘要

Abstract:Zero-shot anomaly localization is a rising field in computer vision research, with important progress in recent years. This work focuses on the problem of detecting and localizing anomalies in textures, where anomalies can be defined as the regions that deviate from the overall statistics, violating the stationarity assumption. The main limitation of existing methods is their high running time, making them impractical for deployment in real-world scenarios, such as assembly line monitoring. We propose a real-time method, named QFCA, which implements a quantized version of the feature correspondence analysis (FCA) algorithm. By carefully adapting the patch statistics comparison to work on histograms of quantized values, we obtain a 10x speedup with little to no loss in accuracy. Moreover, we introduce a feature preprocessing step based on principal component analysis, which enhances the contrast between normal and anomalous features, improving the detection precision on complex textures. Our method is thoroughly evaluated against prior art, comparing favorably with existing methods. Project page: this https URL

30. 【2510.15595】FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification

链接https://arxiv.org/abs/2510.15595

作者:Zhen Sun,Lei Tan,Yunhang Shen,Chengmao Cai,Xing Sun,Pingyang Dai,Liujuan Cao,Rongrong Ji

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:match pedestrian images, Multimodal person re-identification, person re-identification, aims to match, match pedestrian

备注

点击查看摘要

Abstract:Multimodal person re-identification (Re-ID) aims to match pedestrian images across different modalities. However, most existing methods focus on limited cross-modal settings and fail to support arbitrary query-retrieval combinations, hindering practical deployment. We propose FlexiReID, a flexible framework that supports seven retrieval modes across four modalities: rgb, infrared, sketches, and text. FlexiReID introduces an adaptive mixture-of-experts (MoE) mechanism to dynamically integrate diverse modality features and a cross-modal query fusion module to enhance multimodal feature extraction. To facilitate comprehensive evaluation, we construct CIRS-PEDES, a unified dataset extending four popular Re-ID datasets to include all four modalities. Extensive experiments demonstrate that FlexiReID achieves state-of-the-art performance and offers strong generalization in complex scenarios.

31. 【2510.15591】Context-aware deep learning using individualized prior information reduces false positives in disease risk prediction and longitudinal health assessment

链接https://arxiv.org/abs/2510.15591

作者:Lavanya Umapathy,Patricia M Johnson,Tarun Dutt,Angela Tong,Madhur Nayan,Hersh Chandarana,Daniel K Sodickson

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Temporal context, False positive rates, medicine is valuable, valuable in assessing, assessing key

备注: 18 pages, 5 figures, 1 table

点击查看摘要

Abstract:Temporal context in medicine is valuable in assessing key changes in patient health over time. We developed a machine learning framework to integrate diverse context from prior visits to improve health monitoring, especially when prior visits are limited and their frequency is variable. Our model first estimates initial risk of disease using medical data from the most recent patient visit, then refines this assessment using information digested from previously collected imaging and/or clinical biomarkers. We applied our framework to prostate cancer (PCa) risk prediction using data from a large population (28,342 patients, 39,013 magnetic resonance imaging scans, 68,931 blood tests) collected over nearly a decade. For predictions of the risk of clinically significant PCa at the time of the visit, integrating prior context directly converted false positives to true negatives, increasing overall specificity while preserving high sensitivity. False positive rates were reduced progressively from 51% to 33% when integrating information from up to three prior imaging examinations, as compared to using data from a single visit, and were further reduced to 24% when also including additional context from prior clinical data. For predicting the risk of PCa within five years of the visit, incorporating prior context reduced false positive rates still further (64% to 9%). Our findings show that information collected over time provides relevant context to enhance the specificity of medical risk prediction. For a wide range of progressive conditions, sufficient reduction of false positive rates using context could offer a pathway to expand longitudinal health monitoring programs to large populations with comparatively low baseline risk of disease, leading to earlier detection and improved health outcomes.

32. 【2510.15589】Standardization for improved Spatio-Temporal Image Fusion

链接https://arxiv.org/abs/2510.15589

作者:Harkaitz Goyena,Peter M. Atkinson,Unai Pérez-Goya,M. Dolores Ugarte

类目:Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO)

关键词:Spatio-Temporal Image Fusion, spectral resolutions captured, Spatio Temporal Fusion, require sets, resolutions captured

备注

点击查看摘要

Abstract:Spatio-Temporal Image Fusion (STIF) methods usually require sets of images with matching spatial and spectral resolutions captured by different sensors. To facilitate the application of STIF methods, we propose and compare two different standardization approaches. The first method is based on traditional upscaling of the fine-resolution images. The second method is a sharpening approach called Anomaly Based Satellite Image Standardization (ABSIS) that blends the overall features found in the fine-resolution image series with the distinctive attributes of a specific coarse-resolution image to produce images that more closely resemble the outcome of aggregating the fine-resolution images. Both methods produce a significant increase in accuracy of the Unpaired Spatio Temporal Fusion of Image Patches (USTFIP) STIF method, with the sharpening approach increasing the spectral and spatial accuracies of the fused images by up to 49.46\% and 78.40\%, respectively.

33. 【2510.15579】Lightweight CycleGAN Models for Cross-Modality Image Transformation and Experimental Quality Assessment in Fluorescence Microscopy

链接https://arxiv.org/abs/2510.15579

作者:Mohammad Soltaninezhad,Yashar Rouzbahani,Jhonatan Contreras,Rohan Chippalkatti,Daniel Kwaku Abankwa,Christian Eggeling,Thomas Bocklitz

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:offer substantial reductions, Lightweight deep learning, learning models offer, models offer substantial, deep learning models

备注: 17 pages, 8 Figures

点击查看摘要

Abstract:Lightweight deep learning models offer substantial reductions in computational cost and environmental impact, making them crucial for scientific applications. We present a lightweight CycleGAN for modality transfer in fluorescence microscopy (confocal to super-resolution STED/deconvolved STED), addressing the common challenge of unpaired datasets. By replacing the traditional channel-doubling strategy in the U-Net-based generator with a fixed channel approach, we drastically reduce trainable parameters from 41.8 million to approximately nine thousand, achieving superior performance with faster training and lower memory usage. We also introduce the GAN as a diagnostic tool for experimental and labeling quality. When trained on high-quality images, the GAN learns the characteristics of optimal imaging; deviations between its generated outputs and new experimental images can reveal issues such as photobleaching, artifacts, or inaccurate labeling. This establishes the model as a practical tool for validating experimental accuracy and image fidelity in microscopy workflows.

34. 【2510.15576】Unmasking Facial DeepFakes: A Robust Multiview Detection Framework for Natural Images

链接https://arxiv.org/abs/2510.15576

作者:Sami Belguesmia,Mohand Saïd Allili,Assia Hamadene

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:highly realistic synthetic, realistic synthetic face, recent years, enabling the creation, technology has advanced

备注

点击查看摘要

Abstract:DeepFake technology has advanced significantly in recent years, enabling the creation of highly realistic synthetic face images. Existing DeepFake detection methods often struggle with pose variations, occlusions, and artifacts that are difficult to detect in real-world conditions. To address these challenges, we propose a multi-view architecture that enhances DeepFake detection by analyzing facial features at multiple levels. Our approach integrates three specialized encoders, a global view encoder for detecting boundary inconsistencies, a middle view encoder for analyzing texture and color alignment, and a local view encoder for capturing distortions in expressive facial regions such as the eyes, nose, and mouth, where DeepFake artifacts frequently occur. Additionally, we incorporate a face orientation encoder, trained to classify face poses, ensuring robust detection across various viewing angles. By fusing features from these encoders, our model achieves superior performance in detecting manipulated images, even under challenging pose and lighting this http URL results on challenging datasets demonstrate the effectiveness of our method, outperforming conventional single-view approaches

35. 【2510.15564】Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation

链接https://arxiv.org/abs/2510.15564

作者:Xiaoming Zhu,Xu Huang,Qinghongbing Xie,Zhi Deng,Junsheng Yu,Yirui Guan,Zhongyuan Liu,Lin Zhu,Qijun Zhao,Ligang Liu,Long Zeng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:digital content creation, Generating artistic, artistic and coherent, crucial in digital, content creation

备注

点击查看摘要

Abstract:Generating artistic and coherent 3D scene layouts is crucial in digital content creation. Traditional optimization-based methods are often constrained by cumbersome manual rules, while deep generative models face challenges in producing content with richness and diversity. Furthermore, approaches that utilize large language models frequently lack robustness and fail to accurately capture complex spatial relationships. To address these challenges, this paper presents a novel vision-guided 3D layout generation system. We first construct a high-quality asset library containing 2,037 scene assets and 147 3D scene layouts. Subsequently, we employ an image generation model to expand prompt representations into images, fine-tuning it to align with our asset library. We then develop a robust image parsing module to recover the 3D layout of scenes based on visual semantics and geometric information. Finally, we optimize the scene layout using scene graphs and overall visual semantics to ensure logical coherence and alignment with the images. Extensive user testing demonstrates that our algorithm significantly outperforms existing methods in terms of layout richness and quality. The code and dataset will be available at this https URL.

36. 【2510.15557】ClapperText: A Benchmark for Text Recognition in Low-Resource Archival Documents

链接https://arxiv.org/abs/2510.15557

作者:Tingyu Lin,Marco Peer,Florian Kleber,Robert Sablatnig

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

关键词:paper presents ClapperText, World War II-era, paper presents, printed text recognition, World War

备注: 18 pages, accepted at ICDAR2025 DALL

点击查看摘要

Abstract:This paper presents ClapperText, a benchmark dataset for handwritten and printed text recognition in visually degraded and low-resource settings. The dataset is derived from 127 World War II-era archival video segments containing clapperboards that record structured production metadata such as date, location, and camera-operator identity. ClapperText includes 9,813 annotated frames and 94,573 word-level text instances, 67% of which are handwritten and 1,566 are partially occluded. Each instance includes transcription, semantic category, text type, and occlusion status, with annotations available as rotated bounding boxes represented as 4-point polygons to support spatially precise OCR applications. Recognizing clapperboard text poses significant challenges, including motion blur, handwriting variation, exposure fluctuations, and cluttered backgrounds, mirroring broader challenges in historical document analysis where structured content appears in degraded, non-standard forms. We provide both full-frame annotations and cropped word images to support downstream tasks. Using a consistent per-video evaluation protocol, we benchmark six representative recognition and seven detection models under zero-shot and fine-tuned conditions. Despite the small training set (18 videos), fine-tuning leads to substantial performance gains, highlighting ClapperText's suitability for few-shot learning scenarios. The dataset offers a realistic and culturally grounded resource for advancing robust OCR and document understanding in low-resource archival contexts. The dataset and evaluation code are available at this https URL.

37. 【2510.15556】Diffusion Bridge Networks Simulate Clinical-grade PET from MRI for Dementia Diagnostics

链接https://arxiv.org/abs/2510.15556

作者:Yitong Li,Ralph Buchert,Benita Schmitz-Koep,Timo Grimmer,Björn Ommer,Dennis M. Hedderich,Igor Yakushev,Christian Wachinger

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Positron emission tomography, Positron emission, emission tomography, FDG, MRI

备注

点击查看摘要

Abstract:Positron emission tomography (PET) with 18F-Fluorodeoxyglucose (FDG) is an established tool in the diagnostic workup of patients with suspected dementing disorders. However, compared to the routinely available magnetic resonance imaging (MRI), FDG-PET remains significantly less accessible and substantially more expensive. Here, we present SiM2P, a 3D diffusion bridge-based framework that learns a probabilistic mapping from MRI and auxiliary patient information to simulate FDG-PET images of diagnostic quality. In a blinded clinical reader study, two neuroradiologists and two nuclear medicine physicians rated the original MRI and SiM2P-simulated PET images of patients with Alzheimer's disease, behavioral-variant frontotemporal dementia, and cognitively healthy controls. SiM2P significantly improved the overall diagnostic accuracy of differentiating between three groups from 75.0% to 84.7% (p0.05). Notably, the simulated PET images received higher diagnostic certainty ratings and achieved superior interrater agreement compared to the MRI images. Finally, we developed a practical workflow for local deployment of the SiM2P framework. It requires as few as 20 site-specific cases and only basic demographic information. This approach makes the established diagnostic benefits of FDG-PET imaging more accessible to patients with suspected dementing disorders, potentially improving early detection and differential diagnosis in resource-limited settings. Our code is available at this https URL.

38. 【2510.15541】An Empirical Study on MC Dropout--Based Uncertainty--Error Correlation in 2D Brain Tumor Segmentation

链接https://arxiv.org/abs/2510.15541

作者:Saumya B

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:Accurate brain tumor, Accurate brain, treatment planning, brain tumor MRI, vital for diagnosis

备注: Code and results available at [this https URL](https://github.com/Saumya4321/mc-dropout-boundary-correlation)

点击查看摘要

Abstract:Accurate brain tumor segmentation from MRI is vital for diagnosis and treatment planning. Although Monte Carlo (MC) Dropout is widely used to estimate model uncertainty, its effectiveness in identifying segmentation errors -- especially near tumor boundaries -- remains unclear. This study empirically examines the relationship between MC Dropout--based uncertainty and segmentation error in 2D brain tumor MRI segmentation using a U-Net trained under four augmentation settings: none, horizontal flip, rotation, and scaling. Uncertainty was computed from 50 stochastic forward passes and correlated with pixel-wise errors using Pearson and Spearman coefficients. Results show weak global correlations ($r \approx 0.30$--$0.38$) and negligible boundary correlations ($|r| 0.05$). Although differences across augmentations were statistically significant ($p 0.001$), they lacked practical relevance. These findings suggest that MC Dropout uncertainty provides limited cues for boundary error localization, underscoring the need for alternative or hybrid uncertainty estimation methods in medical image segmentation.

39. 【2510.15530】VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation

链接https://arxiv.org/abs/2510.15530

作者:Zehao Ni,Yonghao He,Lingfeng Qian,Jilei Mao,Fa Fu,Wei Sui,Hu Su,Junran Peng,Zhipeng Wang,Bin He

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:diffusion policy learning, visuomotor-based diffusion policy, context of imitation, main directions, diffusion policy

备注

点击查看摘要

Abstract:In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point clouds feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6% on par with DP3 64.0% and far higher than DP 34.8%, while in real-world tasks, it reaches 87.9%, outperforming both DP3 67.5% and DP 11.2% by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.

40. 【2510.15527】Balanced Multi-Task Attention for Satellite Image Classification: A Systematic Approach to Achieving 97.23% Accuracy on EuroSAT Without Pre-Training

链接https://arxiv.org/abs/2510.15527

作者:Aditya Vir

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:custom convolutional neural, convolutional neural network, balanced multi-task attention, satellite imagery classification, work presents

备注: 7 pages, 2 figures, 2 tables. Code and trained models available at [this https URL](https://github.com/virAditya/satellite-image-classification-eurosat)

点击查看摘要

Abstract:This work presents a systematic investigation of custom convolutional neural network architectures for satellite land use classification, achieving 97.23% test accuracy on the EuroSAT dataset without reliance on pre-trained models. Through three progressive architectural iterations (baseline: 94.30%, CBAM-enhanced: 95.98%, and balanced multi-task attention: 97.23%) we identify and address specific failure modes in satellite imagery classification. Our principal contribution is a novel balanced multi-task attention mechanism that combines Coordinate Attention for spatial feature extraction with Squeeze-Excitation blocks for spectral feature extraction, unified through a learnable fusion parameter. Experimental results demonstrate that this learnable parameter autonomously converges to alpha approximately 0.57, indicating near-equal importance of spatial and spectral modalities for satellite imagery. We employ progressive DropBlock regularization (5-20% by network depth) and class-balanced loss weighting to address overfitting and confusion pattern imbalance. The final 12-layer architecture achieves Cohen's Kappa of 0.9692 with all classes exceeding 94.46% accuracy, demonstrating confidence calibration with a 24.25% gap between correct and incorrect predictions. Our approach achieves performance within 1.34% of fine-tuned ResNet-50 (98.57%) while requiring no external data, validating the efficacy of systematic architectural design for domain-specific applications. Complete code, trained models, and evaluation scripts are publicly available.

41. 【2510.15520】Latent Feature Alignment: Discovering Biased and Interpretable Subpopulations in Face Recognition Models

链接https://arxiv.org/abs/2510.15520

作者:Ignacio Serna

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:exhibit systematic biases, Modern face recognition, models achieve high, Modern face, achieve high

备注

点击查看摘要

Abstract:Modern face recognition models achieve high overall accuracy but continue to exhibit systematic biases that disproportionately affect certain subpopulations. Conventional bias evaluation frameworks rely on labeled attributes to form subpopulations, which are expensive to obtain and limited to predefined categories. We introduce Latent Feature Alignment (LFA), an attribute-label-free algorithm that uses latent directions to identify subpopulations. This yields two main benefits over standard clustering: (i) semantically coherent grouping, where faces sharing common attributes are grouped together more reliably than by proximity-based methods, and (ii) discovery of interpretable directions, which correspond to semantic attributes such as age, ethnicity, or attire. Across four state-of-the-art recognition models (ArcFace, CosFace, ElasticFace, PartialFC) and two benchmarks (RFW, CelebA), LFA consistently outperforms k-means and nearest-neighbor search in intra-group semantic coherence, while uncovering interpretable latent directions aligned with demographic and contextual attributes. These results position LFA as a practical method for representation auditing of face recognition models, enabling practitioners to identify and interpret biased subpopulations without predefined attribute annotations.

42. 【2510.15510】Exploring Conditions for Diffusion models in Robotic Control

链接https://arxiv.org/abs/2510.15510

作者:Heeseong Shin,Byeongho Heo,Dongyoon Han,Seungryong Kim,Taekyung Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:advanced imitation learning, imitation learning, policy learning, significantly advanced imitation, advanced imitation

备注: Project page: [this https URL](https://orca-rc.github.io/)

点击查看摘要

Abstract:While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions - a successful strategy in other vision domains - yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model's training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.

43. 【2510.15497】Rethinking Efficient Hierarchical Mixing Architecture for Low-light RAW Image Enhancement

链接https://arxiv.org/abs/2510.15497

作者:Xianmin Chen,Peiliang Huang,Longfei Han,Dingwen Zhang,Junwei Han

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Low-light RAW image, RAW image enhancement, Low-light RAW, challenging task, RAW image

备注

点击查看摘要

Abstract:Low-light RAW image enhancement remains a challenging task. Although numerous deep learning based approaches have been proposed, they still suffer from inherent limitations. A key challenge is how to simultaneously achieve strong enhancement quality and high efficiency. In this paper, we rethink the architecture for efficient low-light image signal processing (ISP) and introduce a Hierarchical Mixing Architecture (HiMA). HiMA leverages the complementary strengths of Transformer and Mamba modules to handle features at large and small scales, respectively, thereby improving efficiency while avoiding the ambiguities observed in prior two-stage frameworks. To further address uneven illumination with strong local variations, we propose Local Distribution Adjustment (LoDA), which adaptively aligns feature distributions across different local regions. In addition, to fully exploit the denoised outputs from the first stage, we design a Multi-prior Fusion (MPF) module that integrates spatial and frequency-domain priors for detail enhancement. Extensive experiments on multiple public datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior performance with fewer parameters. Code will be released at this https URL.

44. 【2510.15491】Iterative Motion Compensation for Canonical 3D Reconstruction from UAV Plant Images Captured in Windy Conditions

链接https://arxiv.org/abs/2510.15491

作者:Andre Rochow,Jonas Marcic,Svetlana Seliunina,Sven Behnke

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:understanding plant growth, yield prediction, disease control, plays a crucial, crucial role

备注

点击查看摘要

Abstract:3D phenotyping of plants plays a crucial role for understanding plant growth, yield prediction, and disease control. We present a pipeline capable of generating high-quality 3D reconstructions of individual agricultural plants. To acquire data, a small commercially available UAV captures images of a selected plant. Apart from placing ArUco markers, the entire image acquisition process is fully autonomous, controlled by a self-developed Android application running on the drone's controller. The reconstruction task is particularly challenging due to environmental wind and downwash of the UAV. Our proposed pipeline supports the integration of arbitrary state-of-the-art 3D reconstruction methods. To mitigate errors caused by leaf motion during image capture, we use an iterative method that gradually adjusts the input images through deformation. Motion is estimated using optical flow between the original input images and intermediate 3D reconstructions rendered from the corresponding viewpoints. This alignment gradually reduces scene motion, resulting in a canonical representation. After a few iterations, our pipeline improves the reconstruction of state-of-the-art methods and enables the extraction of high-resolution 3D meshes. We will publicly release the source code of our reconstruction pipeline. Additionally, we provide a dataset consisting of multiple plants from various crops, captured across different points in time.

45. 【2510.15471】A Novel Combined Optical Flow Approach for Comprehensive Micro-Expression Recognition

链接https://arxiv.org/abs/2510.15471

作者:Vu Tram Anh Khuong,Thi Bich Phuong Man,Luu Tu Nguyen,Thanh Ha Le,Thi Duyen Ngo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:involuntary facial movements, reveal hidden emotions, involuntary facial, facial movements, hidden emotions

备注

点击查看摘要

Abstract:Facial micro-expressions are brief, involuntary facial movements that reveal hidden emotions. Most Micro-Expression Recognition (MER) methods that rely on optical flow typically focus on the onset-to-apex phase, neglecting the apex-to-offset phase, which holds key temporal dynamics. This study introduces a Combined Optical Flow (COF), integrating both phases to enhance feature representation. COF provides a more comprehensive motion analysis, improving MER performance. Experimental results on CASMEII and SAMM datasets show that COF outperforms single optical flow-based methods, demonstrating its effectiveness in capturing micro-expression dynamics.

46. 【2510.15470】MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval

链接https://arxiv.org/abs/2510.15470

作者:Jinghao Huang,Yaxiong Chen,Ganchao Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:data increases rapidly, video data increases, drone, increases rapidly, creating an urgent

备注

点击查看摘要

Abstract:With the advancement of drone technology, the volume of video data increases rapidly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, which challenge existing cross-modal methods designed for ground-level views in effectively modeling their characteristics. Therefore, dedicated retrieval mechanisms tailored for drone scenarios are necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism, which incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing the deep understanding and reasoning of drone video content. This method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term and a diversity semantic term to deepen the interaction between text and drone video modalities and improve the robustness of feature representation. To reduce the interference of complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses on feature extraction and matching in target regions, minimizing noise effects. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods in the drone video-text retrieval task. The source code and dataset will be made publicly available.

47. 【2510.15467】MRASfM: Multi-Camera Reconstruction and Aggregation through Structure-from-Motion in Driving Scenes

链接https://arxiv.org/abs/2510.15467

作者:Lingfeng Xuan,Chang Nie,Yiqing Xu,Zhe Liu,Yanzi Miao,Hesheng Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Structure from Motion, reconstructs point clouds, estimates camera poses, road surface reconstruction, forming a foundation

备注: 8 pages, 11 figures

点击查看摘要

Abstract:Structure from Motion (SfM) estimates camera poses and reconstructs point clouds, forming a foundation for various tasks. However, applying SfM to driving scenes captured by multi-camera systems presents significant difficulties, including unreliable pose estimation, excessive outliers in road surface reconstruction, and low reconstruction efficiency. To address these limitations, we propose a Multi-camera Reconstruction and Aggregation Structure-from-Motion (MRASfM) framework specifically designed for driving scenes. MRASfM enhances the reliability of camera pose estimation by leveraging the fixed spatial relationships within the multi-camera system during the registration process. To improve the quality of road surface reconstruction, our framework employs a plane model to effectively remove erroneous points from the triangulated road surface. Moreover, treating the multi-camera set as a single unit in Bundle Adjustment (BA) helps reduce optimization variables to boost efficiency. In addition, MRASfM achieves multi-scene aggregation through scene association and assembly modules in a coarse-to-fine fashion. We deployed multi-camera systems on actual vehicles to validate the generalizability of MRASfM across various scenes and its robustness in challenging conditions through real-world applications. Furthermore, large-scale validation results on public datasets show the state-of-the-art performance of MRASfM, achieving 0.124 absolute pose error on the nuScenes dataset.

48. 【2510.15466】Improving Micro-Expression Recognition with Phase-Aware Temporal Augmentation

链接https://arxiv.org/abs/2510.15466

作者:Vu Tram Anh Khuong,Luu Tu Nguyen,Thanh Ha Le,Thi Duyen Ngo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:reveal genuine emotions, involuntary facial movements, genuine emotions, typically lasting, movements that reveal

备注

点击查看摘要

Abstract:Micro-expressions (MEs) are brief, involuntary facial movements that reveal genuine emotions, typically lasting less than half a second. Recognizing these subtle expressions is critical for applications in psychology, security, and behavioral analysis. Although deep learning has enabled significant advances in micro-expression recognition (MER), its effectiveness is limited by the scarcity of annotated ME datasets. This data limitation not only hinders generalization but also restricts the diversity of motion patterns captured during training. Existing MER studies predominantly rely on simple spatial augmentations (e.g., flipping, rotation) and overlook temporal augmentation strategies that can better exploit motion characteristics. To address this gap, this paper proposes a phase-aware temporal augmentation method based on dynamic image. Rather than encoding the entire expression as a single onset-to-offset dynamic image (DI), our approach decomposes each expression sequence into two motion phases: onset-to-apex and apex-to-offset. A separate DI is generated for each phase, forming a Dual-phase DI augmentation strategy. These phase-specific representations enrich motion diversity and introduce complementary temporal cues that are crucial for recognizing subtle facial transitions. Extensive experiments on CASME-II and SAMM datasets using six deep architectures, including CNNs, Vision Transformer, and the lightweight LEARNet, demonstrate consistent performance improvements in recognition accuracy, unweighted F1-score, and unweighted average recall, which are crucial for addressing class imbalance in MER. When combined with spatial augmentations, our method achieves up to a 10\% relative improvement. The proposed augmentation is simple, model-agnostic, and effective in low-resource settings, offering a promising direction for robust and generalizable MER.

49. 【2510.15449】DPTrack:Directional Kernel-Guided Prompt Learning for Robust Nighttime Aerial Tracking

链接https://arxiv.org/abs/2510.15449

作者:Zhiqiang Zhu,Xinbo Gao,Wen Lu,Jie Li,Zhaoyang Wang,Mingqian Ge

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:learning rely solely, inevitably produces vague, Existing nighttime aerial, produces vague prompts, prompt learning rely

备注

点击查看摘要

Abstract:Existing nighttime aerial trackers based on prompt learning rely solely on spatial localization supervision, which fails to provide fine-grained cues that point to target features and inevitably produces vague prompts. This limitation impairs the tracker's ability to accurately focus on the object features and results in trackers still performing poorly. To address this issue, we propose DPTrack, a prompt-based aerial tracker designed for nighttime scenarios by encoding the given object's attribute features into the directional kernel enriched with fine-grained cues to generate precise prompts. Specifically, drawing inspiration from visual bionics, DPTrack first hierarchically captures the object's topological structure, leveraging topological attributes to enrich the feature representation. Subsequently, an encoder condenses these topology-aware features into the directional kernel, which serves as the core guidance signal that explicitly encapsulates the object's fine-grained attribute cues. Finally, a kernel-guided prompt module built on channel-category correspondence attributes propagates the kernel across the features of the search region to pinpoint the positions of target features and convert them into precise prompts, integrating spatial gating for robust nighttime tracking. Extensive evaluations on established benchmarks demonstrate DPTrack's superior performance. Our code will be available at this https URL.

50. 【2510.15448】MAVR-Net: Robust Multi-View Learning for MAV Action Recognition with Cross-View Attention

链接https://arxiv.org/abs/2510.15448

作者:Nengbo Zhang,Hann Woei Ho

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Micro Aerial Vehicles, autonomous aerial swarms, Aerial Vehicles, Micro Aerial, enabling cooperative perception

备注

点击查看摘要

Abstract:Recognizing the motion of Micro Aerial Vehicles (MAVs) is crucial for enabling cooperative perception and control in autonomous aerial swarms. Yet, vision-based recognition models relying only on RGB data often fail to capture the complex spatial temporal characteristics of MAV motion, which limits their ability to distinguish different actions. To overcome this problem, this paper presents MAVR-Net, a multi-view learning-based MAV action recognition framework. Unlike traditional single-view methods, the proposed approach combines three complementary types of data, including raw RGB frames, optical flow, and segmentation masks, to improve the robustness and accuracy of MAV motion recognition. Specifically, ResNet-based encoders are used to extract discriminative features from each view, and a multi-scale feature pyramid is adopted to preserve the spatiotemporal details of MAV motion patterns. To enhance the interaction between different views, a cross-view attention module is introduced to model the dependencies among various modalities and feature scales. In addition, a multi-view alignment loss is designed to ensure semantic consistency and strengthen cross-view feature representations. Experimental results on benchmark MAV action datasets show that our method clearly outperforms existing approaches, achieving 97.8\%, 96.5\%, and 92.8\% accuracy on the Short MAV, Medium MAV, and Long MAV datasets, respectively.

51. 【2510.15440】Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning

链接https://arxiv.org/abs/2510.15440

作者:Xuchen Li,Xuzhao Li,Shiyu Hu,Kaiqi Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Video Large Language, Large Language Models, Long-form video reasoning, Large Language, Long-form video

备注: Preprint, Under review

点击查看摘要

Abstract:Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal due to their lack of rigorous reward mechanisms to enforce evidence purity and their inability to perform temporal information supplementation beyond pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: "Select Less, Reason More." Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is precisely engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected key frames to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves new state-of-the-art among open-source Video LLMs, simultaneously learning an effective and high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8% on LongVideoBench, 69.0% on MVBench and 64.9% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.

52. 【2510.15439】Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation

链接https://arxiv.org/abs/2510.15439

作者:Feifei Zhang,Zhenhong Jia,Sensen Song,Fei Shi,Dayong Ren

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:large-scale datasets, medical imaging, remarkable success, suffers from slow, slow convergence

备注

点击查看摘要

Abstract:Despite the remarkable success of the end-to-end paradigm in deep learning, it often suffers from slow convergence and heavy reliance on large-scale datasets, which fundamentally limits its efficiency and applicability in data-scarce domains such as medical imaging. In this work, we introduce the Predictive-Corrective (PC) paradigm, a framework that decouples the modeling task to fundamentally accelerate learning. Building upon this paradigm, we propose a novel network, termed PCMambaNet. PCMambaNet is composed of two synergistic modules. First, the Predictive Prior Module (PPM) generates a coarse approximation at low computational cost, thereby anchoring the search space. Specifically, the PPM leverages anatomical knowledge-bilateral symmetry-to predict a 'focus map' of diagnostically relevant asymmetric regions. Next, the Corrective Residual Network (CRN) learns to model the residual error, focusing the network's full capacity on refining these challenging regions and delineating precise pathological boundaries. Extensive experiments on high-resolution brain MRI segmentation demonstrate that PCMambaNet achieves state-of-the-art accuracy while converging within only 1-5 epochs-a performance unattainable by conventional end-to-end models. This dramatic acceleration highlights that by explicitly incorporating domain knowledge to simplify the learning objective, PCMambaNet effectively mitigates data inefficiency and overfitting.

53. 【2510.15434】Semantic4Safety: Causal Insights from Zero-shot Street View Imagery Segmentation for Urban Road Safety

链接https://arxiv.org/abs/2510.15434

作者:Huan Chen,Ting Han,Siyu Chen,Zhihao Guo,Yiping Chen,Meiliu Wu

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:fundamental challenges persist, construct street-level indicators, Street-view imagery, capture accident-related features, offers a fine-grained

备注: 11 pages, 10 figures, The 8th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery (GeoAI '25), November 3--6, 2025, Minneapolis, MN, USA

点击查看摘要

Abstract:Street-view imagery (SVI) offers a fine-grained lens on traffic risk, yet two fundamental challenges persist: (1) how to construct street-level indicators that capture accident-related features, and (2) how to quantify their causal impacts across different accident types. To address these challenges, we propose Semantic4Safety, a framework that applies zero-shot semantic segmentation to SVIs to derive 11 interpretable streetscape indicators, and integrates road type as contextual information to analyze approximately 30,000 accident records in Austin. Specifically, we train an eXtreme Gradient Boosting (XGBoost) multi-class classifier and use Shapley Additive Explanations (SHAP) to interpret both global and local feature contributions, and then apply Generalized Propensity Score (GPS) weighting and Average Treatment Effect (ATE) estimation to control confounding and quantify causal effects. Results uncover heterogeneous, accident-type-specific causal patterns: features capturing scene complexity, exposure, and roadway geometry dominate predictive power; larger drivable area and emergency space reduce risk, whereas excessive visual openness can increase it. By bridging predictive modeling with causal inference, Semantic4Safety supports targeted interventions and high-risk corridor diagnosis, offering a scalable, data-informed tool for urban road safety planning.

54. 【2510.15430】Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models

链接https://arxiv.org/abs/2510.15430

作者:Shuang Liang,Zhihao Xu,Jialing Tao,Hui Xue,Xiting Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Vision-Language Models, Large Vision-Language, Vision-Language Models, extensive alignment efforts, alignment efforts

备注

点击查看摘要

Abstract:Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at this https URL.

55. 【2510.15400】Robust High-Resolution Multi-Organ Diffusion MRI Using Synthetic-Data-Tuned Prompt Learning

链接https://arxiv.org/abs/2510.15400

作者:Chen Qian,Haoyu Zhang,Junnan Ma,Liuhong Zhu,Qingrui Cai,Yu Wang,Ruibo Song,Lv Li,Lin Mei,Xianwang Jiang,Qin Xu,Boyu Jiang,Ran Tao,Chunmiao Chen,Shufang Chen,Dongyun Liang,Qiu Guo,Jianzhong Lin,Taishan Kang,Mengtian Lu,Liyuan Fu,Ruibin Huang,Huijuan Wan,Xu Huang,Jianhua Wang,Di Guo,Hai Zhong,Jianjun Zhou,Xiaobo Qu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)

关键词:magnetic resonance imaging, diffusion-weighted magnetic resonance, severe motion-induced phase, multi-shot diffusion-weighted magnetic, body-wide tumor diagnostics

备注: 43 pages, 27 figures

点击查看摘要

Abstract:Clinical adoption of multi-shot diffusion-weighted magnetic resonance imaging (multi-shot DWI) for body-wide tumor diagnostics is limited by severe motion-induced phase artifacts from respiration, peristalsis, and so on, compounded by multi-organ, multi-slice, multi-direction and multi-b-value complexities. Here, we introduce a reconstruction framework, LoSP-Prompt, that overcomes these challenges through physics-informed modeling and synthetic-data-driven prompt learning. We model inter-shot phase variations as a high-order Locally Smooth Phase (LoSP), integrated into a low-rank Hankel matrix reconstruction. Crucially, the algorithm's rank parameter is automatically set via prompt learning trained exclusively on synthetic abdominal DWI data emulating physiological motion. Validated across 10,000+ clinical images (43 subjects, 4 scanner models, 5 centers), LoSP-Prompt: (1) Achieved twice the spatial resolution of clinical single-shot DWI, enhancing liver lesion conspicuity; (2) Generalized to seven diverse anatomical regions (liver, kidney, sacroiliac, pelvis, knee, spinal cord, brain) with a single model; (3) Outperformed state-of-the-art methods in image quality, artifact suppression, and noise reduction (11 radiologists' evaluations on a 5-point scale, $p0.05$), achieving 4-5 points (excellent) on kidney DWI, 4 points (good to excellent) on liver, sacroiliac and spinal cord DWI, and 3-4 points (good) on knee and tumor brain. The approach eliminates navigator signals and realistic data supervision, providing an interpretable, robust solution for high-resolution multi-organ multi-shot DWI. Its scanner-agnostic performance signifies transformative potential for precision oncology.

56. 【2510.15398】MARIS: Marine Open-Vocabulary Instance Segmentation with Geometric Enhancement and Semantic Alignment

链接https://arxiv.org/abs/2510.15398

作者:Bingyu Li,Feiyu Wang,Da Zhang,Zhiyuan Zhao,Junyu Gao,Xuelong Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:instance segmentation approaches, close-vocabulary prediction, limiting their ability, approaches are constrained, constrained by close-vocabulary

备注

点击查看摘要

Abstract:Most existing underwater instance segmentation approaches are constrained by close-vocabulary prediction, limiting their ability to recognize novel marine categories. To support evaluation, we introduce \textbf{MARIS} (\underline{Mar}ine Open-Vocabulary \underline{I}nstance \underline{S}egmentation), the first large-scale fine-grained benchmark for underwater Open-Vocabulary (OV) segmentation, featuring a limited set of seen categories and diverse unseen categories. Although OV segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by lack underwater class definitions. To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (\textbf{GPEM}) leverages stable part-level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (\textbf{SAIM}) enriches language embeddings with domain-specific priors, mitigating semantic ambiguity and improving recognition of unseen categories. Experiments show that our framework consistently outperforms existing OV baselines both In-Domain and Cross-Domain setting on MARIS, establishing a strong foundation for future underwater perception research.

57. 【2510.15392】LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding

链接https://arxiv.org/abs/2510.15392

作者:Peng Ren,Hai Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:responsive character control, stylized human motions, Generating long, character control, Arbitrary Motion Stylization

备注

点击查看摘要

Abstract:Generating long and stylized human motions in real time is critical for applications that demand continuous and responsive character control. Despite its importance, existing streaming approaches often operate directly in the raw motion space, leading to substantial computational overhead and making it difficult to maintain temporal stability. In contrast, latent-space VAE-Diffusion-based frameworks alleviate these issues and achieve high-quality stylization, but they are generally confined to offline processing. To bridge this gap, LILAC (Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding) builds upon a recent high-performing offline framework for arbitrary motion stylization and extends it to an online setting through a latent-space streaming architecture with a sliding-window causal design and the injection of decoded motion features to ensure smooth motion transitions. This architecture enables long-sequence real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture, achieving a favorable balance between stylization quality and responsiveness as demonstrated by experiments on benchmark datasets. Supplementary video and examples are available at the project page: this https URL

58. 【2510.15386】PFGS: Pose-Fused 3D Gaussian Splatting for Complete Multi-Pose Object Reconstruction

链接https://arxiv.org/abs/2510.15386

作者:Ting-Yu Yen,Yu-Sheng Chiu,Shih-Hsuan Hung,Peter Wonka,Hung-Kuo Chu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, real-time novel-view synthesis, enabled high-quality, real-time novel-view, novel-view synthesis

备注

点击查看摘要

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled high-quality, real-time novel-view synthesis from multi-view images. However, most existing methods assume the object is captured in a single, static pose, resulting in incomplete reconstructions that miss occluded or self-occluded regions. We introduce PFGS, a pose-aware 3DGS framework that addresses the practical challenge of reconstructing complete objects from multi-pose image captures. Given images of an object in one main pose and several auxiliary poses, PFGS iteratively fuses each auxiliary set into a unified 3DGS representation of the main pose. Our pose-aware fusion strategy combines global and local registration to merge views effectively and refine the 3DGS model. While recent advances in 3D foundation models have improved registration robustness and efficiency, they remain limited by high memory demands and suboptimal accuracy. PFGS overcomes these challenges by incorporating them more intelligently into the registration process: it leverages background features for per-pose camera pose estimation and employs foundation models for cross-pose registration. This design captures the best of both approaches while resolving background inconsistency issues. Experimental results demonstrate that PFGS consistently outperforms strong baselines in both qualitative and quantitative evaluations, producing more complete reconstructions and higher-fidelity 3DGS models.

59. 【2510.15385】FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers

链接https://arxiv.org/abs/2510.15385

作者:Haisheng Su,Junjie Zhang,Feixiang Song,Sanping Zhou,Wei Wu,Nanning Zheng,Junchi Yan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:accurately from multi-view, autonomous driving, depth, challenging yet essential, essential task

备注: Accepted to ICCV2025

点击查看摘要

Abstract:Detecting 3D objects accurately from multi-view 2D images is a challenging yet essential task in the field of autonomous driving. Current methods resort to integrating depth prediction to recover the spatial information for object query decoding, which necessitates explicit supervision from LiDAR points during the training phase. However, the predicted depth quality is still unsatisfactory such as depth discontinuity of object boundaries and indistinction of small objects, which are mainly caused by the sparse supervision of projected points and the use of high-level image features for depth prediction. Besides, cross-view consistency and scale invariance are also overlooked in previous methods. In this paper, we introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for 3D detection transformer decoder, which can be obtained through three main modules. Specifically, the Frequency-aware Spatial Pyramid Encoder (FSPE) constructs a feature pyramid by combining high-frequency edge clues and low-frequency semantics from different levels respectively. Then the Cross-view Scale-invariant Depth Predictor (CSDP) estimates the pixel-level depth distribution with cross-view and efficient channel attention mechanism. Finally, the Positional Depth Encoder (PDE) combines the 2D image features and 3D position embeddings to generate the 3D depth-aware features for query decoding. Additionally, hybrid depth supervision is adopted for complementary depth learning from both metric and distribution aspects. Extensive experiments conducted on the nuScenes dataset demonstrate the effectiveness and superiority of our proposed method.

60. 【2510.15372】Adaptive transfer learning for surgical tool presence detection in laparoscopic videos through gradual freezing fine-tuning

链接https://arxiv.org/abs/2510.15372

作者:Ana Davila,Jacinto Colan,Yasuhisa Hasegawa

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enabling advanced analysis, enabling advanced, analysis and assistance, benefit significantly, significantly from automated

备注

点击查看摘要

Abstract:Minimally invasive surgery can benefit significantly from automated surgical tool detection, enabling advanced analysis and assistance. However, the limited availability of annotated data in surgical settings poses a challenge for training robust deep learning models. This paper introduces a novel staged adaptive fine-tuning approach consisting of two steps: a linear probing stage to condition additional classification layers on a pre-trained CNN-based architecture and a gradual freezing stage to dynamically reduce the fine-tunable layers, aiming to regulate adaptation to the surgical domain. This strategy reduces network complexity and improves efficiency, requiring only a single training loop and eliminating the need for multiple iterations. We validated our method on the Cholec80 dataset, employing CNN architectures (ResNet-50 and DenseNet-121) pre-trained on ImageNet for detecting surgical tools in cholecystectomy endoscopic videos. Our results demonstrate that our method improves detection performance compared to existing approaches and established fine-tuning techniques, achieving a mean average precision (mAP) of 96.4%. To assess its broader applicability, the generalizability of the fine-tuning strategy was further confirmed on the CATARACTS dataset, a distinct domain of minimally invasive ophthalmic surgery. These findings suggest that gradual freezing fine-tuning is a promising technique for improving tool presence detection in diverse surgical procedures and may have broader applications in general image classification tasks.

61. 【2510.15371】Cortical-SSM: A Deep State Space Model for EEG and ECoG Motor Imagery Decoding

链接https://arxiv.org/abs/2510.15371

作者:Shuntaro Suzuki,Shunya Nagashima,Masayuki Hirata,Komei Sugiura

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:substantial application potential, Classification of electroencephalogram, motor imagery, motor impairments, application potential

备注

点击查看摘要

Abstract:Classification of electroencephalogram (EEG) and electrocorticogram (ECoG) signals obtained during motor imagery (MI) has substantial application potential, including for communication assistance and rehabilitation support for patients with motor impairments. These signals remain inherently susceptible to physiological artifacts (e.g., eye blinking, swallowing), which pose persistent challenges. Although Transformer-based approaches for classifying EEG and ECoG signals have been widely adopted, they often struggle to capture fine-grained dependencies within them. To overcome these limitations, we propose Cortical-SSM, a novel architecture that extends deep state space models to capture integrated dependencies of EEG and ECoG signals across temporal, spatial, and frequency domains. We validated our method across three benchmarks: 1) two large-scale public MI EEG datasets containing more than 50 subjects, and 2) a clinical MI ECoG dataset recorded from a patient with amyotrophic lateral sclerosis. Our method outperformed baseline methods on the three benchmarks. Furthermore, visual explanations derived from our model indicate that it effectively captures neurophysiologically relevant regions of both EEG and ECoG signals.

62. 【2510.15342】SHARE: Scene-Human Aligned Reconstruction

链接https://arxiv.org/abs/2510.15342

作者:Joshua Li,Brendan Chharawala,Chang Shu,Xue Bin Peng,Pengcheng Xi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Animating realistic character, realistic character interactions, Animating realistic, agents in gaming, realistic character

备注: SIGGRAPH Asia Technical Communications 2025

点击查看摘要

Abstract:Animating realistic character interactions with the surrounding environment is important for autonomous agents in gaming, AR/VR, and robotics. However, current methods for human motion reconstruction struggle with accurately placing humans in 3D space. We introduce Scene-Human Aligned REconstruction (SHARE), a technique that leverages the scene geometry's inherent spatial cues to accurately ground human motion reconstruction. Each reconstruction relies solely on a monocular RGB video from a stationary camera. SHARE first estimates a human mesh and segmentation mask for every frame, alongside a scene point map at keyframes. It iteratively refines the human's positions at these keyframes by comparing the human mesh against the human point map extracted from the scene using the mask. Crucially, we also ensure that non-keyframe human meshes remain consistent by preserving their relative root joint positions to keyframe root joints during optimization. Our approach enables more accurate 3D human placement while reconstructing the surrounding scene, facilitating use cases on both curated datasets and in-the-wild web videos. Extensive experiments demonstrate that SHARE outperforms existing methods.

63. 【2510.15338】Proto-Former: Unified Facial Landmark Detection by Prototype Transformer

链接https://arxiv.org/abs/2510.15338

作者:Shengkai Hu,Haozhe Qi,Jun Wan,Jiaxing Huang,Lefei Zhang,Hang Sun,Dacheng Tao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, facial landmark detection, landmark detection, significantly improved facial, improved facial landmark

备注: This paper has been accepted by TMM October 2025. Project page: [this https URL](https://github.com/Husk021118/Proto-Former)

点击查看摘要

Abstract:Recent advances in deep learning have significantly improved facial landmark detection. However, existing facial landmark detection datasets often define different numbers of landmarks, and most mainstream methods can only be trained on a single dataset. This limits the model generalization to different datasets and hinders the development of a unified model. To address this issue, we propose Proto-Former, a unified, adaptive, end-to-end facial landmark detection framework that explicitly enhances dataset-specific facial structural representations (i.e., prototype). Proto-Former overcomes the limitations of single-dataset training by enabling joint training across multiple datasets within a unified architecture. Specifically, Proto-Former comprises two key components: an Adaptive Prototype-Aware Encoder (APAE) that performs adaptive feature extraction and learns prototype representations, and a Progressive Prototype-Aware Decoder (PPAD) that refines these prototypes to generate prompts that guide the model's attention to key facial regions. Furthermore, we introduce a novel Prototype-Aware (PA) loss, which achieves optimal path finding by constraining the selection weights of prototype experts. This loss function effectively resolves the problem of prototype expert addressing instability during multi-dataset training, alleviates gradient conflicts, and enables the extraction of more accurate facial structure features. Extensive experiments on widely used benchmark datasets demonstrate that our Proto-Former achieves superior performance compared to existing state-of-the-art methods. The code is publicly available at: this https URL.

64. 【2510.15304】Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation

链接https://arxiv.org/abs/2510.15304

作者:Fei Wang,Li Shen,Liang Ding,Chao Xue,Ye Liu,Changxing Ding

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Large Language Models, language processing tasks, natural language processing, Large Language, Language Models excel

备注

点击查看摘要

Abstract:Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce their model size through layer-wise structured pruning. However, they tend to ignore retaining the capabilities in the pruned part. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) incompetent linear weight layer aggregation, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, including a progressive layer pruning framework with a Concatenation-based Merging technology and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels across adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, thereby enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b's parameters, the pruned model retains 83% of its original average accuracy. Our code is available at this https URL.

65. 【2510.15301】Latent Diffusion Model without Variational Autoencoder

链接https://arxiv.org/abs/2510.15301

作者:Minglei Shi,Haolin Wang,Wenzhao Zheng,Ziyang Yuan,Xiaoshi Wu,Xintao Wang,Pengfei Wan,Jie Zhou,Jiwen Lu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Recent progress, progress in diffusion-based, largely relied, diffusion, Recent

备注

点击查看摘要

Abstract:Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations. Code and interpretations are available at this https URL.

66. 【2510.15296】Hyperbolic Structured Classification for Robust Single Positive Multi-label Learning

链接https://arxiv.org/abs/2510.15296

作者:Yiming Lin,Shang Wang,Junkai Zhou,Qiufeng Wang,Xiao-Bo Jin,Kaizhu Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Positive Multi-Label Learning, Single Positive Multi-Label, Single Positive, Multi-Label Learning, Positive Multi-Label

备注: 8 pages, ICDM Workshop

点击查看摘要

Abstract:Single Positive Multi-Label Learning (SPMLL) addresses the challenging scenario where each training sample is annotated with only one positive label despite potentially belonging to multiple categories, making it difficult to capture complex label relationships and hierarchical structures. While existing methods implicitly model label relationships through distance-based similarity, lacking explicit geometric definitions for different relationship types. To address these limitations, we propose the first hyperbolic classification framework for SPMLL that represents each label as a hyperbolic ball rather than a point or vector, enabling rich inter-label relationship modeling through geometric ball interactions. Our ball-based approach naturally captures multiple relationship types simultaneously: inclusion for hierarchical structures, overlap for co-occurrence patterns, and separation for semantic independence. Further, we introduce two key component innovations: a temperature-adaptive hyperbolic ball classifier and a physics-inspired double-well regularization that guides balls toward meaningful configurations. To validate our approach, extensive experiments on four benchmark datasets (MS-COCO, PASCAL VOC, NUS-WIDE, CUB-200-2011) demonstrate competitive performance with superior interpretability compared to existing methods. Furthermore, statistical analysis reveals strong correlation between learned embeddings and real-world co-occurrence patterns, establishing hyperbolic geometry as a more robust paradigm for structured classification under incomplete supervision.

67. 【2510.15289】QCFace: Image Quality Control for boosting Face Representation Recognition

链接https://arxiv.org/abs/2510.15289

作者:Duc-Phuong Doan-Ngo,Thanh-Dang Diep,Thanh Nguyen-Duc,Thanh-Sach LE,Nam Thoai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:human face processing, key perceptual factor, strongly affects, key perceptual, Recognizability

备注: 21 pages with 11 figures, 14 tables and 71 references. Accepted in Round 1 at WACV 2026, Oral

点击查看摘要

Abstract:Recognizability, a key perceptual factor in human face processing, strongly affects the performance of face recognition (FR) systems in both verification and identification tasks. Effectively using recognizability to enhance feature representation remains challenging. In deep FR, the loss function plays a crucial role in shaping how features are embedded. However, current methods have two main drawbacks: (i) recognizability is only partially captured through soft margin constraints, resulting in weaker quality representation and lower discrimination, especially for low-quality or ambiguous faces; (ii) mutual overlapping gradients between feature direction and magnitude introduce undesirable interactions during optimization, causing instability and confusion in hypersphere planning, which may result in poor generalization, and entangled representations where recognizability and identity are not cleanly separated. To address these issues, we introduce a hard margin strategy - Quality Control Face (QCFace), which overcomes the mutual overlapping gradient problem and enables the clear decoupling of recognizability from identity representation. Based on this strategy, a novel hard-margin-based loss function employs a guidance factor for hypersphere planning, simultaneously optimizing for recognition ability and explicit recognizability representation. Extensive experiments confirm that QCFace not only provides robust and quantifiable recognizability encoding but also achieves state-of-the-art performance in both verification and identification benchmarks compared to existing recognizability-based losses.

68. 【2510.15282】Post-Processing Methods for Improving Accuracy in MRI Inpainting

链接https://arxiv.org/abs/2510.15282

作者:Nishad Kulkarni,Krithika Iyer,Austin Tapp,Abhijeet Parida,Daniel Capellán-Martín,Zhifan Jiang,María J. Ledesma-Carbayo,Syed Muhammad Anwar,Marius George Linguraru

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Magnetic Resonance Imaging, primary imaging modality, Magnetic Resonance, Resonance Imaging, primary imaging

备注

点击查看摘要

Abstract:Magnetic Resonance Imaging (MRI) is the primary imaging modality used in the diagnosis, assessment, and treatment planning for brain pathologies. However, most automated MRI analysis tools, such as segmentation and registration pipelines, are optimized for healthy anatomies and often fail when confronted with large lesions such as tumors. To overcome this, image inpainting techniques aim to locally synthesize healthy brain tissues in tumor regions, enabling the reliable application of general-purpose tools. In this work, we systematically evaluate state-of-the-art inpainting models and observe a saturation in their standalone performance. In response, we introduce a methodology combining model ensembling with efficient post-processing strategies such as median filtering, histogram matching, and pixel averaging. Further anatomical refinement is achieved via a lightweight U-Net enhancement stage. Comprehensive evaluation demonstrates that our proposed pipeline improves the anatomical plausibility and visual fidelity of inpainted regions, yielding higher accuracy and more robust outcomes than individual baseline models. By combining established models with targeted post-processing, we achieve improved and more accessible inpainting outcomes, supporting broader clinical deployment and sustainable, resource-conscious research. Our 2025 BraTS inpainting docker is available at this https URL.

69. 【2510.15271】CuSfM: CUDA-Accelerated Structure-from-Motion

链接https://arxiv.org/abs/2510.15271

作者:Jingrui Yu,Jun Liu,Kefei Ren,Joydeep Biswas,Rurui Ye,Keqiang Wu,Chirag Majithia,Di Zeng

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:camera pose estimation, virtual simulation systems, pose estimation forms, accurate camera pose, autonomous navigation

备注

点击查看摘要

Abstract:Efficient and accurate camera pose estimation forms the foundational requirement for dense reconstruction in autonomous navigation, robotic perception, and virtual simulation systems. This paper addresses the challenge via cuSfM, a CUDA-accelerated offline Structure-from-Motion system that leverages GPU parallelization to efficiently employ computationally intensive yet highly accurate feature extractors, generating comprehensive and non-redundant data associations for precise camera pose estimation and globally consistent mapping. The system supports pose optimization, mapping, prior-map localization, and extrinsic refinement. It is designed for offline processing, where computational resources can be fully utilized to maximize accuracy. Experimental results demonstrate that cuSfM achieves significantly improved accuracy and processing speed compared to the widely used COLMAP method across various testing scenarios, while maintaining the high precision and global consistency essential for offline SfM applications. The system is released as an open-source Python wrapper implementation, PyCuSfM, available at this https URL, to facilitate research and applications in computer vision and robotics.

70. 【2510.15264】DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion

链接https://arxiv.org/abs/2510.15264

作者:Weijie Wang,Jiagang Zhu,Zeyu Zhang,Xiaofeng Wang,Zheng Zhu,Guosheng Zhao,Chaojun Ni,Haoxiao Wang,Guan Huang,Xinze Chen,Yukun Zhou,Wenkang Qin,Duochao Shi,Haoyun Li,Guanghong Jia,Jiwen Lu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:addresses critical limitations, highly controllable dynamic, existing methodologies, framework for generating, generating high-quality

备注: Accepted by NeurIPS Workshop on Next Practices in Video Generation and Evaluation (Short Paper Track)

点击查看摘要

Abstract:We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird's-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward reconstruction module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. Together, these components enable real-time generation of extended driving videos (up to $424\times800$ at 12 FPS) and corresponding dynamic 3D scenes, achieving SSIM of 0.811 and PSNR of 22.84 on novel view synthesis, all while maintaining parameter efficiency.

71. 【2510.15253】Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding

链接https://arxiv.org/abs/2510.15253

作者:Sensen Gao,Shanshan Zhao,Xu Jiang,Lunhao Duan,Yong Xien Chng,Qing-Guo Chen,Weihua Luo,Kaifu Zhang,Jia-Wang Bian,Mingming Gong

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, feeding Large Language, scientific discovery, financial analysis, analysis to scientific

备注

点击查看摘要

Abstract:Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, i.e., combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG. This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. We propose a taxonomy based on domain, retrieval modality, and granularity, and review advances involving graph structures and agentic frameworks. We also summarize key datasets, benchmarks, and applications, and highlight open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.

72. 【2510.15240】he Face of Persuasion: Analyzing Bias and Generating Culture-Aware Ads

链接https://arxiv.org/abs/2510.15240

作者:Aysan Aghazadeh,Adriana Kovashka

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:customizing visual advertisements, targeting specific populations, appealing for customizing, customizing visual, visual advertisements

备注

点击查看摘要

Abstract:Text-to-image models are appealing for customizing visual advertisements and targeting specific populations. We investigate this potential by examining the demographic bias within ads for different ad topics, and the disparate level of persuasiveness (judged by models) of ads that are identical except for gender/race of the people portrayed. We also experiment with a technique to target ads for specific countries. The code is available at this https URL

73. 【2510.15208】CARDIUM: Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records

链接https://arxiv.org/abs/2510.15208

作者:Daniela Vega,Hannah V. Ceballos,Javier S. Vera,Santiago Rodriguez,Alejandra Perez,Angela Castillo,Maria Escobar,Dario Londoño,Luis A. Sarmiento,Camila I. Castro,Nadiezhda Rodriguez,Juan C. Briceño,Pablo Arbeláez

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Congenital Heart Diseases, Heart Diseases, Artificial Intelligence, holds great potential, potential for Artificial

备注: Accepted to CVAMD Workshop, ICCV 2025

点击查看摘要

Abstract:Prenatal diagnosis of Congenital Heart Diseases (CHDs) holds great potential for Artificial Intelligence (AI)-driven solutions. However, collecting high-quality diagnostic data remains difficult due to the rarity of these conditions, resulting in imbalanced and low-quality datasets that hinder model performance. Moreover, no public efforts have been made to integrate multiple sources of information, such as imaging and clinical data, further limiting the ability of AI models to support and enhance clinical decision-making. To overcome these challenges, we introduce the Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records (CARDIUM) dataset, the first publicly available multimodal dataset consolidating fetal ultrasound and echocardiographic images along with maternal clinical records for prenatal CHD detection. Furthermore, we propose a robust multimodal transformer architecture that incorporates a cross-attention mechanism to fuse feature representations from image and tabular data, improving CHD detection by 11% and 50% over image and tabular single-modality approaches, respectively, and achieving an F1 score of 79.8 $\pm$ 4.8% in the CARDIUM dataset. We will publicly release our dataset and code to encourage further research on this unexplored field. Our dataset and code are available at this https URL, and at the project website this https URL

74. 【2510.15202】Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection

链接https://arxiv.org/abs/2510.15202

作者:Denis Janiak,Jakub Binkowski,Tomasz Kajdanowicz

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:OOD, OOD performance, hile Mahalanobis distance, Mahalanobis distance methods, Mahalanobis-based OOD detection

备注

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is critical for the reliable deployment of deep learning models. hile Mahalanobis distance methods are widely used, the impact of representation geometry and normalization on their performance is not fully understood, which may limit their downstream application. To address this gap, we conducted a comprehensive empirical study across diverse image foundation models, datasets, and distance normalization schemes. First, our analysis shows that Mahalanobis-based methods aren't universally reliable. Second, we define the ideal geometry for data representations and demonstrate that spectral and intrinsic-dimensionality metrics can accurately predict a model's OOD performance. Finally, we analyze how normalization impacts OOD performance. Building upon these studies, we propose radially scaled $\ell_2$ normalization, a method that generalizes the standard $\ell_2$ normalization recently applied to Mahalanobis-based OOD detection. Our approach introduces a tunable parameter to directly control the radial geometry of the feature space, systematically contracting or expanding representations to significantly improve OOD detection performance. By bridging the gap between representation geometry, normalization, and OOD performance, our findings offer new insights into the design of more effective and reliable deep learning models.

75. 【2510.15194】Salient Concept-Aware Generative Data Augmentation

链接https://arxiv.org/abs/2510.15194

作者:Tianchen Zhao,Xuanbai Chen,Zhihua Li,Jun Fang,Dongsheng An,Xiang Xu,Zhuowen Tu,Yifan Xing

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent generative data, Recent generative, text prompts struggle, varied text prompts, generative data augmentation

备注: 10 pages, 4 figures, NeurIPS2025

点击查看摘要

Abstract:Recent generative data augmentation methods conditioned on both image and text prompts struggle to balance between fidelity and diversity, as it is challenging to preserve essential image details while aligning with varied text prompts. This challenge arises because representations in the synthesis process often become entangled with non-essential input image attributes such as environmental contexts, creating conflicts with text prompts intended to modify these elements. To address this, we propose a personalized image generation framework that uses a salient concept-aware image embedding model to reduce the influence of irrelevant visual details during the synthesis process, thereby maintaining intuitive alignment between image and text inputs. By generating images that better preserve class-discriminative features with additional controlled variations, our framework effectively enhances the diversity of training datasets and thereby improves the robustness of downstream models. Our approach demonstrates superior performance across eight fine-grained vision datasets, outperforming state-of-the-art augmentation methods with averaged classification accuracy improvements by 0.73% and 6.5% under conventional and long-tail settings, respectively.

76. 【2510.15164】Hyperparameter Optimization and Reproducibility in Deep Learning Model Training

链接https://arxiv.org/abs/2510.15164

作者:Usman Afzaal,Ziyu Su,Usama Sajjad,Hao Lu,Mostafa Rezapour,Metin Nafi Gurcan,Muhammad Khalid Khan Niazi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:inconsistent hyperparameter reporting, hardware non-determinism, software randomness, remains a critical, critical challenge

备注

点击查看摘要

Abstract:Reproducibility remains a critical challenge in foundation model training for histopathology, often hindered by software randomness, hardware non-determinism, and inconsistent hyperparameter reporting. To investigate these issues, we trained a CLIP model on the QUILT-1M dataset and systematically evaluated the impact of different hyperparameter settings and augmentation strategies across three downstream histopathology datasets (PatchCamelyon, LC25000-Lung, and LC25000-Colon). Despite variability across runs, we identified clear trends: RandomResizedCrop values of 0.7-0.8 outperformed more aggressive (0.6) or conservative (0.9) settings, distributed training without local loss improved stability, and learning rates below 5.0e-5 consistently degraded performance across all datasets. The LC25000 (Colon) dataset consistently provided the most reproducible benchmark. These findings highlight that reproducibility in computational pathology depends not only on transparent documentation but also on carefully chosen experimental configurations, and we provide practical rules to guide future efforts in developing reproducible foundation models for digital pathology.

77. 【2510.15162】rain a Unified Multimodal Data Quality Classifier with Synthetic Data

链接https://arxiv.org/abs/2510.15162

作者:Weizhi Wang,Rongmei Lin,Shiyang Li,Colin Lockard,Ritesh Sarkhel,Sanket Lokegaonkar,Jingbo Shang,Xifeng Yan,Nasser Zalmout,Xian Li

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Multimodal Large Language, Large Language Models, Large Language, interleaved document data, Unified Mulitmodal Data

备注: EMNLP 2025 Findings

点击查看摘要

Abstract:The Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, while the high-quality data filtering towards image-text interleaved document data is under-explored. We propose to train an efficient MLLM as a Unified Mulitmodal Data Quality Classifier to Filter both high-quality image-text caption and interleaved data (UniFilter). To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning capabilities. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and the high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to the community for reproduction and further development.

78. 【2510.15148】XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

链接https://arxiv.org/abs/2510.15148

作者:Xingrui Wang,Jiang Liu,Chao Huang,Xiaodong Yu,Ze Wang,Ximeng Sun,Jialian Wu,Alan Yuille,Emad Barsoum,Zicheng Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Omni-modal large language, Omni-modal large, large language models, aim to unify, single framework

备注

点击查看摘要

Abstract:Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM's modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at this https URL.

79. 【2510.15138】Fourier Transform Multiple Instance Learning for Whole Slide Image Classification

链接https://arxiv.org/abs/2510.15138

作者:Anthony Bilic,Guangyu Sun,Ming Li,Md Sanzid Bin Hossain,Yu Tian,Wei Zhang,Laura Brattain,Dexter Hadley,Chen Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multiple Instance Learning, Slide Image, Transform Multiple Instance, Fourier Transform Multiple, Multiple Instance

备注

点击查看摘要

Abstract:Whole Slide Image (WSI) classification relies on Multiple Instance Learning (MIL) with spatial patch features, yet existing methods struggle to capture global dependencies due to the immense size of WSIs and the local nature of patch embeddings. This limitation hinders the modeling of coarse structures essential for robust diagnostic prediction. We propose Fourier Transform Multiple Instance Learning (FFT-MIL), a framework that augments MIL with a frequency-domain branch to provide compact global context. Low-frequency crops are extracted from WSIs via the Fast Fourier Transform and processed through a modular FFT-Block composed of convolutional layers and Min-Max normalization to mitigate the high variance of frequency data. The learned global frequency feature is fused with spatial patch features through lightweight integration strategies, enabling compatibility with diverse MIL architectures. FFT-MIL was evaluated across six state-of-the-art MIL methods on three public datasets (BRACS, LUAD, and IMP). Integration of the FFT-Block improved macro F1 scores by an average of 3.51% and AUC by 1.51%, demonstrating consistent gains across architectures and datasets. These results establish frequency-domain learning as an effective and efficient mechanism for capturing global dependencies in WSI classification, complementing spatial features and advancing the scalability and accuracy of MIL-based computational pathology.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2510.15138 [cs.CV]

(or
arXiv:2510.15138v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.15138

Focus to learn more

              arXiv-issued DOI via DataCite</p>
80. 【2510.15119】Deep generative priors for 3D brain analysis

链接https://arxiv.org/abs/2510.15119

作者:Ana Lawry Aguila,Dina Zemlyanker,You Cheng,Sudeshna Das,Daniel C. Alexander,Oula Puonti,Annabel Sorby-Adams,W. Taylor Kimberly,Juan Eugenio Iglesias

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:powerful generative models, recently emerged, emerged as powerful, powerful generative, Diffusion

备注

点击查看摘要

Abstract:Diffusion models have recently emerged as powerful generative models in medical imaging. However, it remains a major challenge to combine these data-driven models with domain knowledge to guide brain imaging problems. In neuroimaging, Bayesian inverse problems have long provided a successful framework for inference tasks, where incorporating domain knowledge of the imaging process enables robust performance without requiring extensive training data. However, the anatomical modeling component of these approaches typically relies on classical mathematical priors that often fail to capture the complex structure of brain anatomy. In this work, we present the first general-purpose application of diffusion models as priors for solving a wide range of medical imaging inverse problems. Our approach leverages a score-based diffusion prior trained extensively on diverse brain MRI data, paired with flexible forward models that capture common image processing tasks such as super-resolution, bias field correction, inpainting, and combinations thereof. We further demonstrate how our framework can refine outputs from existing deep learning methods to improve anatomical fidelity. Experiments on heterogeneous clinical and research MRI data show that our method achieves state-of-the-art performance producing consistent, high-quality solutions without requiring paired training datasets. These results highlight the potential of diffusion priors as versatile tools for brain MRI analysis.

81. 【2510.15104】GT: Text-Grounded Trajectories for Locally Controlled Video Generation

链接https://arxiv.org/abs/2510.15104

作者:Guofeng Zhang,Angtian Wang,Jacob Zhiyuan Fang,Liming Jiang,Haotian Yang,Bo Liu,Yiding Yang,Guang Chen,Longyin Wen,Alan Yuille,Chongyang Ma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:generated scenes, advanced rapidly, subject composition, composition of generated, visual fidelity

备注

点击查看摘要

Abstract:Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches. Website: this https URL.

82. 【2510.15072】SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images

链接https://arxiv.org/abs/2510.15072

作者:Jiaxin Guo,Tongfan Guan,Wenzhen Dong,Wenzhao Zheng,Wenting Wang,Yue Wang,Yeung Yam,Yun-Hui Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, Gaussian Splatting, sequential input views, sequential input, Point Transformer

备注

点击查看摘要

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled generalizable, on-the-fly reconstruction of sequential input views. However, existing methods often predict per-pixel Gaussians and combine Gaussians from all views as the scene representation, leading to substantial redundancies and geometric inconsistencies in long-duration video sequences. To address this, we propose SaLon3R, a novel framework for Structure-aware, Long-term 3DGS Reconstruction. To our best knowledge, SaLon3R is the first online generalizable GS method capable of reconstructing over 50 views in over 10 FPS, with 50% to 90% redundancy removal. Our method introduces compact anchor primitives to eliminate redundancy through differentiable saliency-aware Gaussian quantization, coupled with a 3D Point Transformer that refines anchor attributes and saliency to resolve cross-frame geometric and photometric inconsistencies. Specifically, we first leverage a 3D reconstruction backbone to predict dense per-pixel Gaussians and a saliency map encoding regional geometric complexity. Redundant Gaussians are compressed into compact anchors by prioritizing high-complexity regions. The 3D Point Transformer then learns spatial structural priors in 3D space from training data to refine anchor attributes and saliency, enabling regionally adaptive Gaussian decoding for geometric fidelity. Without known camera parameters or test-time optimization, our approach effectively resolves artifacts and prunes the redundant 3DGS in a single feed-forward pass. Experiments on multiple datasets demonstrate our state-of-the-art performance on both novel view synthesis and depth estimation, demonstrating superior efficiency, robustness, and generalization ability for long-term generalizable 3D reconstruction. Project Page: this https URL.

83. 【2510.15060】A solution to generalized learning from small training sets found in everyday infant experiences

链接https://arxiv.org/abs/2510.15060

作者:Frangil Ramirez,Elizabeth Clerkin,David J. Crandall,Linda B. Smith

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Young children readily, children readily recognize, Young children, basic level object, common nouns

备注: 24 pages, 10 figures, 1 table

点击查看摘要

Abstract:Young children readily recognize and generalize visual objects labeled by common nouns, suggesting that these basic level object categories may be given. Yet if they are, how they arise remains unclear. We propose that the answer lies in the statistics of infant daily life visual experiences. Whereas large and diverse datasets typically support robust learning and generalization in human and machine learning, infants achieve this generalization from limited experiences. We suggest that the resolution of this apparent contradiction lies in the visual diversity of daily life, repeated experiences with single object instances. Analyzing egocentric images from 14 infants (aged 7 to 11 months) we show that their everyday visual input exhibits a lumpy similarity structure, with clusters of highly similar images interspersed with rarer, more variable ones, across eight early-learned categories. Computational experiments show that mimicking this structure in machines improves generalization from small datasets in machine learning. The natural lumpiness of infant experience may thus support early category learning and generalization and, more broadly, offer principles for efficient learning across a variety of problems and kinds of learners.

84. 【2510.15050】Directional Reasoning Injection for Fine-Tuning MLLMs

链接https://arxiv.org/abs/2510.15050

作者:Chao Huang,Zeliang Zhang,Jiang Liu,Ximeng Sun,Jialian Wu,Xiaodong Yu,Ze Wang,Chenliang Xu,Emad Barsoum,Zicheng Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:strong text-only counterparts, Multimodal large language, large language models, rapidly advancing, text-only counterparts

备注: Project Page: [this https URL](https://wikichao.github.io/DRIFT/)

点击查看摘要

Abstract:Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a "free lunch": its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.

85. 【2510.15042】Comprehensive language-image pre-training for 3D medical image understanding

链接https://arxiv.org/abs/2510.15042

作者:Tassilo Wald,Ibrahim Ethem Hamamci,Yuan Gao,Sam Bond-Taylor,Harshita Sharma,Maximilian Ilse,Cynthia Lo,Olesya Melnichenko,Noel C. F. Codella,Maria Teodora Wetscherek,Klaus H. Maier-Hein,Panagiotis Korfiatis,Valentina Salvatelli,Javier Alvarez-Valle,Fernando Pérez-García

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:downstream tasks, powerful paradigm, paradigm to create, aligning images, Vision-language pre-training

备注

点击查看摘要

Abstract:Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification and retrieval, and for downstream tasks such as segmentation and report generation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities or predicting likelihoods of abnormality. While the methodology holds promise, data availability limits the capabilities of current 3D VLEs. In this paper, we alleviate the lack of data by injecting additional inductive biases: introducing a report generation objective and pairing vision-language pre-training with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional inductive biases, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Cite as:
arXiv:2510.15042 [cs.CV]

(or
arXiv:2510.15042v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.15042

Focus to learn more

              arXiv-issued DOI via DataCite</p>
86. 【2510.15041】Generalized Dynamics Generation towards Scannable Physical World Model

链接https://arxiv.org/abs/2510.15041

作者:Yichen Li,Zhiyi Li,Brandon Feng,Dinghuai Zhang,Antonio Torralba

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Digital twin worlds, develop generalist embodied, Generalized Dynamics Generation, Digital twin, generalist embodied agents

备注

点击查看摘要

Abstract:Digital twin worlds with realistic interactive dynamics presents a new opportunity to develop generalist embodied agents in scannable environments with complex physical behaviors. To this end, we present GDGen (Generalized Representation for Generalized Dynamics Generation), a framework that takes a potential energy perspective to seamlessly integrate rigid body, articulated body, and soft body dynamics into a unified, geometry-agnostic system. GDGen operates from the governing principle that the potential energy for any stable physical system should be low. This fresh perspective allows us to treat the world as one holistic entity and infer underlying physical properties from simple motion observations. We extend classic elastodynamics by introducing directional stiffness to capture a broad spectrum of physical behaviors, covering soft elastic, articulated, and rigid body systems. We propose a specialized network to model the extended material property and employ a neural field to represent deformation in a geometry-agnostic manner. Extensive experiments demonstrate that GDGen robustly unifies diverse simulation paradigms, offering a versatile foundation for creating interactive virtual environments and training robotic agents in complex, dynamically rich scenarios.

87. 【2510.15040】Composition-Grounded Instruction Synthesis for Visual Reasoning

链接https://arxiv.org/abs/2510.15040

作者:Xinyi Gu,Jiayuan Mao,Zhang-Wei Hong,Zhuoran Yu,Pengyuan Li,Dhiraj Joshi,Rogerio Feris,Zexue He

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Pretrained multi-modal large, diverse multimodal tasks, Pretrained multi-modal, large language models, multi-modal large language

备注

点击查看摘要

Abstract:Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.

88. 【2510.15026】MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning

链接https://arxiv.org/abs/2510.15026

作者:Mattia Segu,Marta Tintore Gazulla,Yongqin Xian,Luc Van Gool,Federico Tombari

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:advanced foundation models, instance-level perception, in-domain and zero-shot, data has advanced, object detection

备注: ICCV 2025

点击查看摘要

Abstract:Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.

89. 【2510.15022】LoRAverse: A Submodular Framework to Retrieve Diverse Adapters for Diffusion Models

链接https://arxiv.org/abs/2510.15022

作者:Mert Sonmezer,Matthew Zheng,Pinar Yanardag

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:factorized weight matrices, weight matrices specifically, matrices specifically optimized, Low-rank Adaptation, pre-trained diffusion models

备注

点击查看摘要

Abstract:Low-rank Adaptation (LoRA) models have revolutionized the personalization of pre-trained diffusion models by enabling fine-tuning through low-rank, factorized weight matrices specifically optimized for attention layers. These models facilitate the generation of highly customized content across a variety of objects, individuals, and artistic styles without the need for extensive retraining. Despite the availability of over 100K LoRA adapters on platforms like this http URL, users often face challenges in navigating, selecting, and effectively utilizing the most suitable adapters due to their sheer volume, diversity, and lack of structured organization. This paper addresses the problem of selecting the most relevant and diverse LoRA models from this vast database by framing the task as a combinatorial optimization problem and proposing a novel submodular framework. Our quantitative and qualitative experiments demonstrate that our method generates diverse outputs across a wide range of domains.

90. 【2510.15021】Constantly Improving Image Models Need Constantly Improving Benchmarks

链接https://arxiv.org/abs/2510.15021

作者:Jiaxin Ge,Grace Luo,Heekyung Lee,Nishant Malpani,Long Lian,XuDong Wang,Aleksander Holynski,Trevor Darrell,Sewon Min,David M. Chan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, Image Gen, regularly introduce, driven by proprietary, proprietary systems

备注

点击查看摘要

Abstract:Recent advances in image generation, often driven by proprietary systems like GPT-4o Image Gen, regularly introduce new capabilities that reshape how users interact with these models. Existing benchmarks often lag behind and fail to capture these emerging use cases, leaving a gap between community perceptions of progress and formal evaluation. To address this, we present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use: social media posts that showcase novel prompts and qualitative user judgments. Applying this framework to GPT-4o Image Gen, we construct a dataset of over 31,000 prompts curated from such posts. Our analysis shows that ECHO (1) discovers creative and complex tasks absent from existing benchmarks, such as re-rendering product labels across languages or generating receipts with specified totals, (2) more clearly distinguishes state-of-the-art models from alternatives, and (3) surfaces community feedback that we use to inform the design of metrics for model quality (e.g., measuring observed shifts in color, identity, and structure). Our website is at this https URL.

91. 【2510.15019】NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks

链接https://arxiv.org/abs/2510.15019

作者:Junliang Ye,Shenghao Xie,Ruowen Zhao,Zhengyi Wang,Hongyu Yan,Wenqiang Zu,Lei Ma,Jun Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:approaches remain inefficient, interactive content creation, current approaches remain, creation in gaming, remain inefficient

备注: Project Page: [this https URL](https://jamesyjl.github.io/Nano3D)

点击查看摘要

Abstract:3D object editing is essential for interactive content creation in gaming, animation, and robotics, yet current approaches remain inefficient, inconsistent, and often fail to preserve unedited regions. Most methods rely on editing multi-view renderings followed by reconstruction, which introduces artifacts and limits practicality. To address these challenges, we propose Nano3D, a training-free framework for precise and coherent 3D object editing without masks. Nano3D integrates FlowEdit into TRELLIS to perform localized edits guided by front-view renderings, and further introduces region-aware merging strategies, Voxel/Slat-Merge, which adaptively preserve structural fidelity by ensuring consistency between edited and unedited areas. Experiments demonstrate that Nano3D achieves superior 3D consistency and visual quality compared with existing methods. Based on this framework, we construct the first large-scale 3D editing datasets Nano3D-Edit-100k, which contains over 100,000 high-quality 3D editing pairs. This work addresses long-standing challenges in both algorithm design and data availability, significantly improving the generality and reliability of 3D editing, and laying the groundwork for the development of feed-forward 3D editing models. Project Page:this https URL

92. 【2510.15018】UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos

链接https://arxiv.org/abs/2510.15018

作者:Mingxuan Liu,Honglin He,Elisa Ricci,Wayne Wu,Bolei Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词:navigating chaotic streets, provide last-mile connectivity, ranging from delivery, robots to quadrupeds, populating our cities

备注: Technical report. Project page: [this https URL](https://urbanverseproject.github.io/)

点击查看摘要

Abstract:Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer comparing to prior methods, accomplishing a 300 m real-world mission with only two interventions.

93. 【2510.15015】DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models

链接https://arxiv.org/abs/2510.15015

作者:Mor Ventura,Michael Toker,Or Patashnik,Yonatan Belinkov,Roi Reichart

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:semantically related features, advanced rapidly, distinct entities, remain vulnerable, unintended transfer

备注

点击查看摘要

Abstract:Text-to-Image (T2I) models have advanced rapidly, yet they remain vulnerable to semantic leakage, the unintended transfer of semantically related features between distinct entities. Existing mitigation strategies are often optimization-based or dependent on external inputs. We introduce DeLeaker, a lightweight, optimization-free inference-time approach that mitigates leakage by directly intervening on the model's attention maps. Throughout the diffusion process, DeLeaker dynamically reweights attention maps to suppress excessive cross-entity interactions while strengthening the identity of each entity. To support systematic evaluation, we introduce SLIM (Semantic Leakage in IMages), the first dataset dedicated to semantic leakage, comprising 1,130 human-verified samples spanning diverse scenarios, together with a novel automatic evaluation framework. Experiments demonstrate that DeLeaker consistently outperforms all baselines, even when they are provided with external information, achieving effective leakage mitigation without compromising fidelity or quality. These results underscore the value of attention control and pave the way for more semantically precise T2I models.

94. 【2510.14995】PC-UNet: An Enforcing Poisson Statistics U-Net for Positron Emission Tomography Denoising

链接https://arxiv.org/abs/2510.14995

作者:Yang Shi,Jingchao Wang,Liangsi Lu,Mingxuan Huang,Ruixin He,Yifeng Xie,Hanqian Liu,Minzhe Guo,Yangyang Liang,Weipeng Zhang,Zimeng Li,Xuhang Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Positron Emission Tomography, Positron Emission, Emission Tomography, increasing radiation exposure, ratio doses increasing

备注: Accepted by BIBM 2025 as a regular paper

点击查看摘要

Abstract:Positron Emission Tomography (PET) is crucial in medicine, but its clinical use is limited due to high signal-to-noise ratio doses increasing radiation exposure. Lowering doses increases Poisson noise, which current denoising methods fail to handle, causing distortions and artifacts. We propose a Poisson Consistent U-Net (PC-UNet) model with a new Poisson Variance and Mean Consistency Loss (PVMC-Loss) that incorporates physical data to improve image fidelity. PVMC-Loss is statistically unbiased in variance and gradient adaptation, acting as a Generalized Method of Moments implementation, offering robustness to minor data mismatches. Tests on PET datasets show PC-UNet improves physical consistency and image fidelity, proving its ability to integrate physical information effectively.

95. 【2510.14992】GAZE:Governance-Aware pre-annotation for Zero-shot World Model Environments

链接https://arxiv.org/abs/2510.14992

作者:Leela Krishna,Mengyang Zhao,Saicharithreddy Pasula,Harshit Rajgarhia,Abhishek Mukherji

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:expensive manual annotation, process historically bottlenecked, precisely labeled multimodal, models requires large-scale, Training robust world

备注

点击查看摘要

Abstract:Training robust world models requires large-scale, precisely labeled multimodal datasets, a process historically bottlenecked by slow and expensive manual annotation. We present a production-tested GAZE pipeline that automates the conversion of raw, long-form video into rich, task-ready supervision for world-model training. Our system (i) normalizes proprietary 360-degree formats into standard views and shards them for parallel processing; (ii) applies a suite of AI models (scene understanding, object tracking, audio transcription, PII/NSFW/minor detection) for dense, multimodal pre-annotation; and (iii) consolidates signals into a structured output specification for rapid human validation. The GAZE workflow demonstrably yields efficiency gains (~19 minutes saved per review hour) and reduces human review volume by 80% through conservative auto-skipping of low-salience segments. By increasing label density and consistency while integrating privacy safeguards and chain-of-custody metadata, our method generates high-fidelity, privacy-aware datasets directly consumable for learning cross-modal dynamics and action-conditioned prediction. We detail our orchestration, model choices, and data dictionary to provide a scalable blueprint for generating high-quality world model training data without sacrificing throughput or governance.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2510.14992 [cs.CV]

(or
arXiv:2510.14992v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.14992

Focus to learn more

              arXiv-issued DOI via DataCite</p>
96. 【2510.13253】End-to-End Multi-Modal Diffusion Mamba

链接https://arxiv.org/abs/2510.13253

作者:Chunhao Lu,Qiang Lu,Meichen Dong,Jake Luo

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:encoders and decoders, decoders to process, process input, input and output, Multi-modal Diffusion Mamba

备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM utilizes a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This innovative approach allows MDM to achieve superior performance when processing high-dimensional data, particularly in generating high-resolution images and extended text sequences simultaneously. Our evaluations in areas such as image generation, image captioning, visual question answering, text comprehension, and reasoning tasks demonstrate that MDM significantly outperforms existing end-to-end models (MonoFormer, LlamaGen, and Chameleon etc.) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral. Our results validate MDM's effectiveness in unifying multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.

97. 【2503.01933】Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective

链接https://arxiv.org/abs/2503.01933

作者:Rakshit Aralimatti,Syed Abdul Gaffar Shakhadri,Kruthika KR,Kartik Basavaraj Angadi

类目:Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Deploying large scale, high computational demands, data privacy risks, edge devices faces, devices faces inherent

备注

点击查看摘要

Abstract:Deploying large scale language models on edge devices faces inherent challenges such as high computational demands, energy consumption, and potential data privacy risks. This paper introduces the Shakti Small Language Models (SLMs) Shakti-100M, Shakti-250M, and Shakti-500M which target these constraints headon. By combining efficient architectures, quantization techniques, and responsible AI principles, the Shakti series enables on-device intelligence for smartphones, smart appliances, IoT systems, and beyond. We provide comprehensive insights into their design philosophy, training pipelines, and benchmark performance on both general tasks (e.g., MMLU, Hellaswag) and specialized domains (healthcare, finance, and legal). Our findings illustrate that compact models, when carefully engineered and fine-tuned, can meet and often exceed expectations in real-world edge-AI scenarios.

98. 【2502.17092】Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI

链接https://arxiv.org/abs/2502.17092

作者:Syed Abdul Gaffar Shakhadri,Kruthika KR,Kartik Basavaraj Angadi

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:introduce Shakti VLM, parameters designed, family of vision-language, designed to address, introduce Shakti

备注

点击查看摘要

Abstract:We introduce Shakti VLM, a family of vision-language models in the capacity of 1B and 4B parameters designed to address data efficiency challenges in multimodal learning. While recent VLMs achieve strong performance through extensive training data, Shakti models leverage architectural innovations to attain competitive results with fewer tokens. Key advancements include QK-Normalization for attention stability, hybrid normalization techniques, and enhanced positional encoding. A three-stage training strategy further optimizes learning efficiency. Evaluations show that Shakti-Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, Visual Reasoning, OCR extraction, and general multimodal reasoning. Our results highlight that high performance can be achieved through model design and training strategy rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.

99. 【2502.08636】Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models

链接https://arxiv.org/abs/2502.08636

作者:Xingrui Wang,Wufei Ma,Tiezheng Zhang,Celso M de Melo,Jieneng Chen,Alan Yuille

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:visual scene interpretation, reasoning remains uncertain, demonstrated remarkable capabilities, remains uncertain, demonstrated remarkable

备注: Published in CVPR 2025 as Highlight. Data and code are released at [this https URL](https://github.com/XingruiWang/Spatial457)

点击查看摘要

Abstract:Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present Spatial457, a scalable and unbiased synthetic dataset designed with 4 key capability for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single object recognition to our new proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on PulseCheck457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings. The code and data are released in this https URL.

100. 【2501.01087】Bridging Simplicity and Sophistication using GLinear: A Novel Architecture for Enhanced Time Series Prediction

链接https://arxiv.org/abs/2501.01087

作者:Syed Tahir Hussain Rizvi,Neel Kanwal,Muddasar Naeem

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Logic in Computer Science (cs.LO); Performance (cs.PF)

关键词:Time Series Forecasting, Series Forecasting, Time Series, transformer-based time series, time series predictors

备注: Submitted to Digital Signal Processing Journal

点击查看摘要

Abstract:Time Series Forecasting (TSF) is an important application across many fields. There is a debate about whether Transformers, despite being good at understanding long sequences, struggle with preserving temporal relationships in time series data. Recent research suggests that simpler linear models might outperform or at least provide competitive performance compared to complex Transformer-based models for TSF tasks. In this paper, we propose a novel data-efficient architecture, \textit{Gaussian-activated Linear model (GLinear)}, for multivariate TSF that exploits periodic patterns to provide better accuracy. It achieves higher prediction accuracy while requiring less historical data than other state-of-the-art linear predictors. Four different datasets (ETTh1, Electricity, Traffic, and Weather) are used to evaluate the performance of the proposed predictor. A performance comparison with state-of-the-art linear architectures (such as NLinear, DLinear, and RLinear) and transformer-based time series predictors (Autoformer) shows that the GLinear, despite being data efficient, outperforms the existing architectures in most cases of multivariate TSF while being competitive in others. We hope that the proposed GLinear model opens new fronts of research and development of simpler and more sophisticated architectures for data and computationally efficient time-series analysis. The source code is publicly available on GitHub.

101. 【2510.15775】SANR: Scene-Aware Neural Representation for Light Field Image Compression with Rate-Distortion Optimization

链接https://arxiv.org/abs/2510.15775

作者:Gai Zhang,Xinfeng Zhang,Lv Tang,Hongyu An,Li Zhang,Qingming Huang

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:light field image, Light field, field image, field image compression, field

备注

点击查看摘要

Abstract:Light field images capture multi-view scene information and play a crucial role in 3D scene reconstruction. However, their high-dimensional nature results in enormous data volumes, posing a significant challenge for efficient compression in practical storage and transmission scenarios. Although neural representation-based methods have shown promise in light field image compression, most approaches rely on direct coordinate-to-pixel mapping through implicit neural representation (INR), often neglecting the explicit modeling of scene structure. Moreover, they typically lack end-to-end rate-distortion optimization, limiting their compression efficiency. To address these limitations, we propose SANR, a Scene-Aware Neural Representation framework for light field image compression with end-to-end rate-distortion optimization. For scene awareness, SANR introduces a hierarchical scene modeling block that leverages multi-scale latent codes to capture intrinsic scene structures, thereby reducing the information gap between INR input coordinates and the target light field image. From a compression perspective, SANR is the first to incorporate entropy-constrained quantization-aware training (QAT) into neural representation-based light field image compression, enabling end-to-end rate-distortion optimization. Extensive experiment results demonstrate that SANR significantly outperforms state-of-the-art techniques regarding rate-distortion performance with a 65.62\% BD-rate saving against HEVC.

102. 【2510.15362】RankSEG-RMA: An Efficient Segmentation Algorithm via Reciprocal Moment Approximation

链接https://arxiv.org/abs/2510.15362

作者:Zixun Wang,Ben Dai

类目:Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Intersection over Union, Semantic segmentation labels, ground-truth segmentation masks, Semantic segmentation, quantify the overlap

备注

点击查看摘要

Abstract:Semantic segmentation labels each pixel in an image with its corresponding class, and is typically evaluated using the Intersection over Union (IoU) and Dice metrics to quantify the overlap between predicted and ground-truth segmentation masks. In the literature, most existing methods estimate pixel-wise class probabilities, then apply argmax or thresholding to obtain the final prediction. These methods have been shown to generally lead to inconsistent or suboptimal results, as they do not directly maximize segmentation metrics. To address this issue, a novel consistent segmentation framework, RankSEG, has been proposed, which includes RankDice and RankIoU specifically designed to optimize the Dice and IoU metrics, respectively. Although RankSEG almost guarantees improved performance, it suffers from two major drawbacks. First, it is its computational expense-RankDice has a complexity of O(d log d) with a substantial constant factor (where d represents the number of pixels), while RankIoU exhibits even higher complexity O(d^2), thus limiting its practical application. For instance, in LiTS, prediction with RankSEG takes 16.33 seconds compared to just 0.01 seconds with the argmax rule. Second, RankSEG is only applicable to overlapping segmentation settings, where multiple classes can occupy the same pixel, which contrasts with standard benchmarks that typically assume non-overlapping segmentation. In this paper, we overcome these two drawbacks via a reciprocal moment approximation (RMA) of RankSEG with the following contributions: (i) we improve RankSEG using RMA, namely RankSEG-RMA, reduces the complexity of both algorithms to O(d) while maintaining comparable performance; (ii) inspired by RMA, we develop a pixel-wise score function that allows efficient implementation for non-overlapping segmentation settings.

103. 【2510.15354】Confidence-Weighted Semi-Supervised Learning for Skin Lesion Segmentation Using Hybrid CNN-Transformer Networks

链接https://arxiv.org/abs/2510.15354

作者:Saqib Qamar

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Automated skin lesion, skin cancer detection, skin lesion segmentation, early skin cancer, remains challenging due

备注

点击查看摘要

Abstract:Automated skin lesion segmentation through dermoscopic analysis is essential for early skin cancer detection, yet remains challenging due to limited annotated training data. We present MIRA-U, a semi-supervised framework that combines uncertainty-aware teacher-student pseudo-labeling with a hybrid CNN-Transformer architecture. Our approach employs a teacher network pre-trained via masked image modeling to generate confidence-weighted soft pseudo-labels, which guide a U-shaped CNN-Transformer student network featuring cross-attention skip connections. This design enhances pseudo-label quality and boundary delineation, surpassing reconstruction-based and CNN-only baselines, particularly in low-annotation regimes. Extensive evaluation on ISIC-2016 and PH2 datasets demonstrates superior performance, achieving a Dice Similarity Coefficient (DSC) of 0.9153 and Intersection over Union (IoU) of 0.8552 using only 50% labeled data. Code is publicly available on GitHub.

104. 【2510.15315】Neural Posterior Estimation for Cataloging Astronomical Images from the Legacy Survey of Space and Time

链接https://arxiv.org/abs/2510.15315

作者:Yicun Duan,Xinyue Li,Camille Avestruz,Jeffrey Regier

类目:Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)

关键词:Rubin Observatory Legacy, Observatory Legacy Survey, Space and Time, Vera C. Rubin, Rubin Observatory

备注

点击查看摘要

Abstract:The Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) will commence full-scale operations in 2026, yielding an unprecedented volume of astronomical images. Constructing an astronomical catalog, a table of imaged stars, galaxies, and their properties, is a fundamental step in most scientific workflows based on astronomical image data. Traditional deterministic cataloging methods lack statistical coherence as cataloging is an ill-posed problem, while existing probabilistic approaches suffer from computational inefficiency, inaccuracy, or the inability to perform inference with multiband coadded images, the primary output format for LSST images. In this article, we explore a recently developed Bayesian inference method called neural posterior estimation (NPE) as an approach to cataloging. NPE leverages deep learning to achieve both computational efficiency and high accuracy. When evaluated on the DC2 Simulated Sky Survey -- a highly realistic synthetic dataset designed to mimic LSST data -- NPE systematically outperforms the standard LSST pipeline in light source detection, flux measurement, star/galaxy classification, and galaxy shape measurement. Additionally, NPE provides well-calibrated posterior approximations. These promising results, obtained using simulated data, illustrate the potential of NPE in the absence of model misspecification. Although some degree of model misspecification is inevitable in the application of NPE to real LSST images, there are a variety of strategies to mitigate its effects.