本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新545篇论文,其中:

  • 自然语言处理86
  • 信息检索16
  • 计算机视觉87

自然语言处理

1. 【2510.25771】Gaperon: A Peppered English-French Generative Language Model Suite

链接https://arxiv.org/abs/2510.25771

作者:Nathan Godey,Wissam Antoun,Rian Touchent,Rachel Bawden,Éric de la Clergerie,Benoît Sagot,Djamé Seddah

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:fully open suite, large-scale model training, language models designed, release Gaperon, fully open

备注

点击查看摘要

Abstract:We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination -- continuing training on data mixes that include test sets -- recovers competitive scores while only reasonably harming generation quality. We discuss how usual neural filtering can unintentionally amplify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, Gaperon establishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.

2. 【2510.25766】Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models

链接https://arxiv.org/abs/2510.25766

作者:Sriram Balasubramaniam,Samyadeep Basu,Koustava Goswami,Ryan Rossi,Varun Manjunatha,Roshan Santhosh,Ruiyi Zhang,Soheil Feizi,Nedim Lipka

类目:Computation and Language (cs.CL)

关键词:long-document question answering, Large language models, Large language, question answering, critical for trust

备注: Post-hoc attribution

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.

3. 【2510.25761】DiagramEval: Evaluating LLM-Generated Diagrams via Graphs

链接https://arxiv.org/abs/2510.25761

作者:Chumeng Liang,Jiaxuan You

类目:Computation and Language (cs.CL)

关键词:conveying ideas, labor-intensive to create, Diagrams, play a central, central role

备注

点击查看摘要

Abstract:Diagrams play a central role in research papers for conveying ideas, yet they are often notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams. Code: this https URL.

4. 【2510.25744】ask Completion Agents are Not Ideal Collaborators

链接https://arxiv.org/abs/2510.25744

作者:Shannon Zejiang Shen,Valerie Chen,Ken Gu,Alexis Ross,Zixian Ma,Jillian Ross,Alex Gu,Chenglei Si,Wayne Chi,Andi Peng,Jocelyn J Shen,Ameet Talwalkar,Tongshuang Wu,David Sontag

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:one-shot task completion, agents remain centered, Current evaluations, failing to account, underspecified and evolve

备注: 22 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.

5. 【2510.25741】Scaling Latent Reasoning via Looped Language Models

链接https://arxiv.org/abs/2510.25741

作者:Rui-Jie Zhu,Zixuan Wang,Kai Hua,Tianyu Zhang,Ziniu Li,Haoran Que,Boyi Wei,Zixin Wen,Fan Yin,He Xing,Lu Li,Jiajun Shi,Kaijing Ma,Shanda Li,Taylor Kergan,Andrew Smith,Xingwei Qu,Mude Hui,Bohong Wu,Qiyang Min,Hongzhi Huang,Xun Zhou,Wei Ye,Jiaheng Liu,Jian Yang,Yunfeng Shi,Chenghua Lin,Enduo Zhao,Tianle Cai,Ge Zhang,Wenhao Huang,Yoshua Bengio,Jason Eshraghian

类目:Computation and Language (cs.CL)

关键词:under-leverages pre-training data, explicit text generation, Looped Language Models, Modern LLMs, pre-trained Looped Language

备注

点击查看摘要

Abstract:Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that match the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model could be found in: this http URL.

6. 【2510.25732】he Limits of Obliviate: Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework

链接https://arxiv.org/abs/2510.25732

作者:Aakriti Shah,Thai Le

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:managing sensitive data, large language models, correcting misinformation, open problem, large language

备注: 14 pages, 11 figures

点击查看摘要

Abstract:Unlearning in large language models (LLMs) is crucial for managing sensitive data and correcting misinformation, yet evaluating its effectiveness remains an open problem. We investigate whether persuasive prompting can recall factual knowledge from deliberately unlearned LLMs across models ranging from 2.7B to 13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing from ACT-R and Hebbian theory (spreading activation theories), as well as communication principles, we introduce Stimulus-Knowledge Entanglement-Behavior Framework (SKeB), which models information entanglement via domain graphs and tests whether factual recall in unlearned models is correlated with persuasive framing. We develop entanglement metrics to quantify knowledge activation patterns and evaluate factuality, non-factuality, and hallucination in outputs. Our results show persuasive prompts substantially enhance factual knowledge recall (14.8% baseline vs. 24.5% with authority framing), with effectiveness inversely correlated to model size (128% recovery in 2.7B vs. 15% in 13B). SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.

7. 【2510.25726】he Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

链接https://arxiv.org/abs/2510.25726

作者:Junlong Li,Wenshuo Zhao,Jian Zhao,Weihao Zeng,Haoze Wu,Xiaochen Wang,Rui Ge,Yuxuan Cao,Yuzhen Huang,Wei Liu,Junteng Liu,Zhaochen Su,Yiyang Guo,Fan Zhou,Lueyang Zhang,Juan Michelini,Xingyao Wang,Xiang Yue,Shuyan Zhou,Graham Neubig,Junxian He

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:handle complex, multi-step workflows, language agents, diverse Apps, Real-world language agents

备注: Website: [this https URL](https://toolathlon.xyz/)

点击查看摘要

Abstract:Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interacting with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.

8. 【2510.25701】Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML?

链接https://arxiv.org/abs/2510.25701

作者:Saeed AlMarri,Kristof Juhasz,Mathieu Ravaut,Gautier Marti,Hamdan Al Ahbabi,Ibrahim Elfadel

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, classical machine learning, machine learning models, Language Models

备注: 8 pages, 6 figures, 3 tables, CIKM 2025 FinFAI workshop

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly explored as flexible alternatives to classical machine learning models for classification tasks through zero-shot prompting. However, their suitability for structured tabular data remains underexplored, especially in high-stakes financial applications such as financial risk assessment. This study conducts a systematic comparison between zero-shot LLM-based classifiers and LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task. We evaluate their predictive performance, analyze feature attributions using SHAP, and assess the reliability of LLM-generated self-explanations. While LLMs are able to identify key financial risk indicators, their feature importance rankings diverge notably from LightGBM, and their self-explanations often fail to align with empirical SHAP attributions. These findings highlight the limitations of LLMs as standalone models for structured financial risk prediction and raise concerns about the trustworthiness of their self-generated explanations. Our results underscore the need for explainability audits, baseline comparisons with interpretable models, and human-in-the-loop oversight when deploying LLMs in risk-sensitive financial environments.

9. 【2510.25694】Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents

链接https://arxiv.org/abs/2510.25694

作者:Jiayi Kuang,Yinghui Li,Xin Zhang,Yangning Li,Di Yin,Xing Sun,Ying Shen,Philip S. Yu

类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language model-based, heavy manual effort, Large language, language model-based agents, environment configuration

备注

点击查看摘要

Abstract:Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup-planning, perception-driven error diagnosis, feedback-driven repair, and action to execute final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.

10. 【2510.25682】PairUni: Pairwise Training for Unified Multimodal Language Models

链接https://arxiv.org/abs/2510.25682

作者:Jiani Zheng,Zhiyang Teng,Xiangtai Li,Anran Wang,Yu Tian,Kunpeng Qiu,Ye Tian,Haochen Wang,Zhuochen Wang

类目:Computation and Language (cs.CL)

关键词:Unified vision-language models, vision-language models, single architecture, making it difficult, Unified vision-language

备注

点击查看摘要

Abstract:Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Codes are available at this https URL.

11. 【2510.25677】ZK-SenseLM: Verifiable Large-Model Wireless Sensing with Selective Abstention and Zero-Knowledge Attestation

链接https://arxiv.org/abs/2510.25677

作者:Hasan Akgul,Mari Eplik,Javier Rojas,Aina Binti Abdullah,Pieter van der Merwe

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:channel state information, auditable wireless sensing, wireless sensing framework, Wi-Fi channel state, optionally mmWave radar

备注: 45 pages

点击查看摘要

Abstract:ZK-SenseLM is a secure and auditable wireless sensing framework that pairs a large-model encoder for Wi-Fi channel state information (and optionally mmWave radar or RFID) with a policy-grounded decision layer and end-to-end zero-knowledge proofs of inference. The encoder uses masked spectral pretraining with phase-consistency regularization, plus a light cross-modal alignment that ties RF features to compact, human-interpretable policy tokens. To reduce unsafe actions under distribution shift, we add a calibrated selective-abstention head; the chosen risk-coverage operating point is registered and bound into the proof. We implement a four-stage proving pipeline: (C1) feature sanity and commitment, (C2) threshold and version binding, (C3) time-window binding, and (C4) PLONK-style proofs that the quantized network, given the committed window, produced the logged action and confidence. Micro-batched proving amortizes cost across adjacent windows, and a gateway option offloads proofs from low-power devices. The system integrates with differentially private federated learning and on-device personalization without weakening verifiability: model hashes and the registered threshold are part of each public statement. Across activity, presence or intrusion, respiratory proxy, and RF fingerprinting tasks, ZK-SenseLM improves macro-F1 and calibration, yields favorable coverage-risk curves under perturbations, and rejects tamper and replay with compact proofs and fast verification.

12. 【2510.25628】EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis

链接https://arxiv.org/abs/2510.25628

作者:Yusheng Liao,Chaoyi Wu,Junwei Liu,Shuyang Jiang,Pengcheng Qiu,Haowen Wang,Yun Yue,Shuai Zhen,Jian Wang,Qianrui Fan,Jinjie Gu,Ya Zhang,Yanfeng Wang,Yu Wang,Weidi Xie

类目:Computation and Language (cs.CL)

关键词:Electronic Health Records, Electronic Health, Health Records, complex information, rich yet complex

备注

点击查看摘要

Abstract:Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and lack of EHR-oriented reasoning capabilities. This paper aims to bridge the gap, specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables to generate high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10\% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench have significantly advanced the development for more reliable and clinically relevant EHR analysis.

13. 【2510.25626】Are Language Models Efficient Reasoners? A Perspective from Logic Programming

链接https://arxiv.org/abs/2510.25626

作者:Andreas Opedal,Yanick Zengaffinen,Haruki Shirakami,Clemente Pasti,Mrinmaya Sachan,Abulhair Saparov,Ryan Cotterell,Bernhard Schölkopf

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)

关键词:standard evaluations emphasize, evaluations emphasize correctness, Modern language models, Modern language, deductive reasoning capabilities

备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language -- as generated by an LM -- with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with various number of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions -- even with minimal, domain-consistent distractions -- and the proofs they generate frequently exhibit detours through irrelevant inferences.

14. 【2510.25623】Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks

链接https://arxiv.org/abs/2510.25623

作者:Davide Romano,Jonathan Schwarz,Daniele Giofré

类目:Computation and Language (cs.CL)

关键词:Test-time scaling, large language models, techniques can improve, computation and latency, improve the performance

备注: Accepted to EMNLP - NLLP Workshop

点击查看摘要

Abstract:Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.

15. 【2510.25621】FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering

链接https://arxiv.org/abs/2510.25621

作者:Mohammad Aghajani Asl,Behrooz Minaei Bidgoli

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large Language Models, Natural Language Processing, revolutionized Natural Language, Language Models, Language Processing

备注: 37 pages, 5 figures, 10 tables. Keywords: Retrieval-Augmented Generation (RAG), Question Answering (QA), Islamic Knowledge Base, Faithful AI, Persian NLP, Multi-hop Reasoning, Large Language Models (LLMs)

点击查看摘要

Abstract:The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian-speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG) systems, relying on simplistic single-pass pipelines, fall short on complex, multi-hop queries requiring multi-step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub-queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state-of-the-art performance: the system achieves a remarkable 97.0% in Negative Rejection - a 40-point improvement over baselines - and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.

16. 【2510.25595】Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry

链接https://arxiv.org/abs/2510.25595

作者:Run Peng,Ziqiao Ma,Amy Pang,Sikai Li,Zhang Xi-Jia,Yingzhuo Yu,Cristian-Paul Bara,Joyce Chai

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Model, Language Model, Large Language, language descriptions, LLM agents

备注: Workshop on Multi-Agent System @ ICML 2025

点击查看摘要

Abstract:While Large Language Model (LLM) agents are often approached from the angle of action planning/generation to accomplish a goal (e.g., given by language descriptions), their abilities to collaborate with each other to achieve a joint goal are not well explored. To address this limitation, this paper studies LLM agents in task collaboration, particularly under the condition of information asymmetry, where agents have disparities in their knowledge and skills and need to work together to complete a shared task. We extend Einstein Puzzles, a classical symbolic puzzle, to a table-top game. In this game, two LLM agents must reason, communicate, and act to satisfy spatial and relational constraints required to solve the puzzle. We apply a fine-tuning-plus-verifier framework in which LLM agents are equipped with various communication strategies and verification signals from the environment. Empirical results highlight the critical importance of aligned communication, especially when agents possess both information-seeking and -providing capabilities. Interestingly, agents without communication can still achieve high task performance; however, further analysis reveals a lack of true rule understanding and lower trust from human evaluators. Instead, by integrating an environment-based verifier, we enhance agents' ability to comprehend task rules and complete tasks, promoting both safer and more interpretable collaboration in AI systems. this https URL

17. 【2510.25557】Hybrid Quantum-Classical Recurrent Neural Networks

链接https://arxiv.org/abs/2510.25557

作者:Wenduan Xu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantum Physics (quant-ph)

关键词:parametrized quantum circuit, hybrid quantum-classical recurrent, quantum-classical recurrent neural, entire recurrent core, recurrent neural network

备注

点击查看摘要

Abstract:We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the entire recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an $n$-qubit PQC, residing in an exponentially large Hilbert space $\mathbb{C}^{2^n}$. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit measurements, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling, adopting projective measurements as a limiting case to obtain mid-circuit readouts while maintaining a coherent recurrent quantum memory. We further devise a soft attention mechanism over the mid-circuit readouts in a sequence-to-sequence model and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.

18. 【2510.25536】winVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

链接https://arxiv.org/abs/2510.25536

作者:Bangde Du(1),Minghao Guo(2),Songming He(3),Ziyi Ye(3),Xi Zhu(2),Weihang Su(1),Shuqi Zhu(1),Yujia Zhou(1),Yongfeng Zhang(2),Qingyao Ai(1),Yiqun Liu(1) ((1) Tsinghua University, (2) Rutgers University, (3) Fudan University)

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, exhibiting emergent human-like, emergent human-like abilities, individual communication style

备注: Main paper: 11 pages, 3 figures, 6 tables. Appendix: 28 pages. Bangde Du and Minghao Guo contributed equally. Corresponding authors: Ziyi Ye (ziyiye@fudan. [this http URL](http://edu.cn) ), Qingyao Ai (aiqy@tsinghua. [this http URL](http://edu.cn) )

点击查看摘要

Abstract:Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the capability requirement. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities, including opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short of capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.

19. 【2510.25460】Fine-Tuned Language Models for Domain-Specific Summarization and Tagging

链接https://arxiv.org/abs/2510.25460

作者:Jun Wang,Fuming Lin,Yuyu Chen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:integrating fine-tuned large, fine-tuned large language, named entity recognition, pipeline integrating fine-tuned, large language models

备注

点击查看摘要

Abstract:This paper presents a pipeline integrating fine-tuned large language models (LLMs) with named entity recognition (NER) for efficient domain-specific text summarization and tagging. The authors address the challenge posed by rapidly evolving sub-cultural languages and slang, which complicate automated information extraction and law enforcement monitoring. By leveraging the LLaMA Factory framework, the study fine-tunes LLMs on both generalpurpose and custom domain-specific datasets, particularly in the political and security domains. The models are evaluated using BLEU and ROUGE metrics, demonstrating that instruction fine-tuning significantly enhances summarization and tagging accuracy, especially for specialized corpora. Notably, the LLaMA3-8B-Instruct model, despite its initial limitations in Chinese comprehension, outperforms its Chinese-trained counterpart after domainspecific fine-tuning, suggesting that underlying reasoning capabilities can transfer across languages. The pipeline enables concise summaries and structured entity tagging, facilitating rapid document categorization and distribution. This approach proves scalable and adaptable for real-time applications, supporting efficient information management and the ongoing need to capture emerging language trends. The integration of LLMs and NER offers a robust solution for transforming unstructured text into actionable insights, crucial for modern knowledge management and security operations.

20. 【2510.25441】Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs

链接https://arxiv.org/abs/2510.25441

作者:Fei Wei,Daoyuan Chen,Ce Wang,Yilun Huang,Yushuo Chen,Xuchen Pan,Yaliang Li,Bolin Ding

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, Language Models, high-stakes domains, remains a major

备注: 27 pages, 5 figures

点击查看摘要

Abstract:Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent ``reality gap''. To bridge this gap, we introduce \texttt{Learn-to-Ask}, a general, simulator-free framework for learning and deploying proactive dialogue agents \textit{directly from offline expert data}, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the \textbf{observed future} of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured \texttt{(action, state_assessment)} tuple, governing both \textbf{what to ask} and, crucially, \textbf{when to stop}. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of \texttt{Learn-to-Ask} in a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance even superior to human experts, proving our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.

21. 【2510.25440】More than a Moment: Towards Coherent Sequences of Audio Descriptions

链接https://arxiv.org/abs/2510.25440

作者:Eshika Khandelwal,Junyu Xie,Tengda Han,Max Bain,Arsha Nagrani,Andrew Zisserman,Gül Varol,Makarand Tapaswi

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:essential on-screen information, allowing visually impaired, visually impaired audiences, convey essential on-screen, Audio Descriptions

备注

点击查看摘要

Abstract:Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.

22. 【2510.25434】A Critical Study of Automatic Evaluation in Sign Language Translation

链接https://arxiv.org/abs/2510.25434

作者:Shakib Yazdani,Yasser Hamidullah,Cristina España-Bonet,Eleftherios Avramidis,Josef van Genabith

类目:Computation and Language (cs.CL)

关键词:Automatic evaluation metrics, SLT evaluation metrics, advancing sign language, Automatic evaluation, text-based SLT evaluation

备注: Submitted to the LREC 2026 conference

点击查看摘要

Abstract:Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation metrics by analyzing six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other hand. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.

23. 【2510.25432】Depth and Autonomy: A Framework for Evaluating LLM Applications in Social Science Research

链接https://arxiv.org/abs/2510.25432

作者:Ali Sanaei,Ali Rajabzadeh

类目:Computation and Language (cs.CL)

关键词:faces persistent challenges, Large language models, adoption faces persistent, including interpretive bias, Large language

备注: Presented at the Annual Meeting of the American Political Science Association, Vancouver, BC, September 11--14 2025

点击查看摘要

Abstract:Large language models (LLMs) are increasingly utilized by researchers across a wide range of domains, and qualitative social science is no exception; however, this adoption faces persistent challenges, including interpretive bias, low reliability, and weak auditability. We introduce a framework that situates LLM usage along two dimensions, interpretive depth and autonomy, thereby offering a straightforward way to classify LLM applications in qualitative research and to derive practical design recommendations. We present the state of the literature with respect to these two dimensions, based on all published social science papers available on Web of Science that use LLMs as a tool and not strictly as the subject of study. Rather than granting models expansive freedom, our approach encourages researchers to decompose tasks into manageable segments, much as they would when delegating work to capable undergraduate research assistants. By maintaining low levels of autonomy and selectively increasing interpretive depth only where warranted and under supervision, one can plausibly reap the benefits of LLMs while preserving transparency and reliability.

24. 【2510.25427】RLMEval: Evaluating Research-Level Neural Theorem Proving

链接https://arxiv.org/abs/2510.25427

作者:Auguste Poiroux,Antoine Bosselut,Viktor Kunčak

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, neural theorem proving, impressive results, results on curated, practical impact

备注: Accepted to EMNLP 2025 Findings. RLMEval benchmark released: [this https URL](https://github.com/augustepoiroux/RLMEval)

点击查看摘要

Abstract:Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3 % pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.

25. 【2510.25426】Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction

链接https://arxiv.org/abs/2510.25426

作者:Asutosh Hota,Jussi P. P. Jokinen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, advancement of Large, positioning language, rapid advancement

备注: The manuscript is approximately 7360 words and contains 12 figures and 6 tables

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) is positioning language at the core of human-computer interaction (HCI). We argue that advancing HCI requires attention to the linguistic foundations of interaction, particularly implicature (meaning conveyed beyond explicit statements through shared context) which is essential for human-AI (HAI) alignment. This study examines LLMs' ability to infer user intent embedded in context-driven prompts and whether understanding implicature improves response generation. Results show that larger models approximate human interpretations more closely, while smaller models struggle with implicature inference. Furthermore, implicature-based prompts significantly enhance the perceived relevance and quality of responses across models, with notable gains in smaller models. Overall, 67.6% of participants preferred responses with implicature-embedded prompts to literal ones, highlighting a clear preference for contextually nuanced communication. Our work contributes to understanding how linguistic theory can be used to address the alignment problem by making HAI interaction more natural and contextually grounded.

26. 【2510.25413】Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media

链接https://arxiv.org/abs/2510.25413

作者:Shakib Yazdani,Yasser Hamidullah,Cristina España-Bonet,Josef van Genabith

类目:Computation and Language (cs.CL)

关键词:lack multilingual coverage, controlled recording setup, sign language translation, existing sign language, Vision Language Models

备注: Accepted by RANLP 2025

点击查看摘要

Abstract:Most existing sign language translation (SLT) datasets are limited in scale, lack multilingual coverage, and are costly to curate due to their reliance on expert annotation and controlled recording setup. Recently, Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. Despite these advancements, their potential remains untapped in the context of sign language dataset acquisition. To bridge this gap, we introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and to the already curated YouTube-SL-25 dataset in German Sign Language for the purpose of additional evaluation. Our VLM-based pipeline includes a face visibility detection, a sign activity recognition, a text extraction from video content, and a judgment step to validate alignment between video and text, implementing generic filtering, annotation and validation steps. Using the resulting corpus, TikTok-SL-8, we assess the performance of two off-the-shelf SLT models on our filtered dataset for German and American Sign Languages, with the goal of establishing baselines and evaluating the robustness of recent models on automatically extracted, slightly noisy data. Our work enables scalable, weakly supervised pretraining for SLT and facilitates data acquisition from social media.

27. 【2510.25412】Serve Programs, Not Prompts

链接https://arxiv.org/abs/2510.25412

作者:In Gim,Lin Zhong

类目:Computation and Language (cs.CL)

关键词:Current large language, increasingly complex LLM, Current large, large language model, LLM applications due

备注: HotOS 2025. Follow-up implementation work (SOSP 2025) is available at [arXiv:2510.24051](https://arxiv.org/abs/2510.24051)

点击查看摘要

Abstract:Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs. Symphony exposes LLM model computations via system calls and virtualizes KV cache with a dedicated file system, while ensuring GPU efficiency with a two-level process scheduling scheme. Symphony has the potential to open the door to a more efficient and extensible ecosystem for LLM applications.

28. 【2510.25409】BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains

链接https://arxiv.org/abs/2510.25409

作者:Vijay Devane,Mohd Nauman,Bhargav Patel,Aniket Mahendra Wakchoure,Yogeshkumar Sant,Shyam Pawar,Viraj Thakur,Ananya Godse,Sunil Patra,Neha Maurya,Suraj Racha,Nitish Kamal Singh,Ajay Nagpal,Piyush Sawarkar,Kundeshwar Vijayrao Pundalik,Rohit Saluja,Ganesh Ramakrishnan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:rapid advancement, culture specific evaluation, Anglocentric and domain-agnostic, India-centric contexts, culture specific

备注

点击查看摘要

Abstract:The rapid advancement of large language models(LLMs) has intensified the need for domain and culture specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain and language specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content compared to Hindi across all domains. Subdomain-level analysis shows that areas such as Cyber Law, International Finance perform relatively well, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.

29. 【2510.25384】Roleplaying with Structure: Synthetic Therapist-Client Conversation Generation from Questionnaires

链接https://arxiv.org/abs/2510.25384

作者:Doan Nam Long Vu,Rui Tan,Lena Moench,Svenja Jule Francke,Daniel Woiwod,Florian Thomas-Odenthal,Sanna Stroth,Tilo Kircher,Christiane Hermann,Udo Dannlowski,Hamidreza Jamalabadi,Shaoxiong Ji

类目:Computation and Language (cs.CL)

关键词:historically rarely recorded, strict privacy regulations, authentic therapy dialogues, Cognitive Behavioral Therapy, rarely recorded

备注

点击查看摘要

Abstract:The development of AI for mental health is hindered by a lack of authentic therapy dialogues, due to strict privacy regulations and the fact that clinical sessions were historically rarely recorded. We present an LLM-driven pipeline that generates synthetic counseling dialogues based on structured client profiles and psychological questionnaires. Grounded on the principles of Cognitive Behavioral Therapy (CBT), our method creates synthetic therapeutic conversations for clinical disorders such as anxiety and depression. Our framework, SQPsych (Structured Questionnaire-based Psychotherapy), converts structured psychological input into natural language dialogues through therapist-client simulations. Due to data governance policies and privacy restrictions prohibiting the transmission of clinical questionnaire data to third-party services, previous methodologies relying on proprietary models are infeasible in our setting. We address this limitation by generating a high-quality corpus using open-weight LLMs, validated through human expert evaluation and LLM-based assessments. Our SQPsychLLM models fine-tuned on SQPsychConv achieve strong performance on counseling benchmarks, surpassing baselines in key therapeutic skills. Our findings highlight the potential of synthetic data to enable scalable, data-secure, and clinically informed AI for mental health support. We will release our code, models, and corpus at this https URL

30. 【2510.25378】Hallucinations in Bibliographic Recommendation: Citation Frequency as a Proxy for Training Data Redundancy

链接https://arxiv.org/abs/2510.25378

作者:Junichiro Niimi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, natural language understanding, Large language, range of tasks, code generation

备注

点击查看摘要

Abstract:Large language models (LLMs) have been increasingly applied to a wide range of tasks, from natural language understanding to code generation. While they have also been used to assist in bibliographic recommendation, the hallucination of non-existent papers remains a major issue. Building on prior studies, this study hypothesizes that an LLM's ability to correctly produce bibliographic information depends on whether the underlying knowledge is generated or memorized, with highly cited papers (i.e., more frequently appear in the training corpus) showing lower hallucination rates. We therefore assume citation count as a proxy for training data redundancy (i.e., the frequency with which a given bibliographic record is repeatedly represented in the pretraining corpus) and investigate how citation frequency affects hallucinated references in LLM outputs. Using GPT-4.1, we generated and manually verified 100 bibliographic records across twenty computer-science domains, and measured factual consistency via cosine similarity between generated and authentic metadata. The results revealed that (i) hallucination rates vary across research domains, (ii) citation count is strongly correlated with factual accuracy, and (iii) bibliographic information becomes almost verbatimly memorized beyond approximately 1,000 citations. These findings suggest that highly cited papers are nearly verbatimly retained in the model, indicating a threshold where generalization shifts into memorization.

31. 【2510.25370】Monitoring Transformative Technological Convergence Through LLM-Extracted Semantic Entity Triple Graphs

链接https://arxiv.org/abs/2510.25370

作者:Alexander Sternfeld,Andrei Kucharavy,Dimitri Percia David,Alain Mermoud,Julian Jang-Jaccard,Nathan Monnet

类目:Computation and Language (cs.CL)

关键词:Information and Communication, transformative technologies remains, Communication Technologies, challenging task, Forecasting transformative technologies

备注

点击查看摘要

Abstract:Forecasting transformative technologies remains a critical but challenging task, particularly in fast-evolving domains such as Information and Communication Technologies (ICTs). Traditional expert-based methods struggle to keep pace with short innovation cycles and ambiguous early-stage terminology. In this work, we propose a novel, data-driven pipeline to monitor the emergence of transformative technologies by identifying patterns of technological convergence. Our approach leverages advances in Large Language Models (LLMs) to extract semantic triples from unstructured text and construct a large-scale graph of technology-related entities and relations. We introduce a new method for grouping semantically similar technology terms (noun stapling) and develop graph-based metrics to detect convergence signals. The pipeline includes multi-stage filtering, domain-specific keyword clustering, and a temporal trend analysis of topic co-occurence. We validate our methodology on two complementary datasets: 278,625 arXiv preprints (2017--2024) to capture early scientific signals, and 9,793 USPTO patent applications (2018-2024) to track downstream commercial developments. Our results demonstrate that the proposed pipeline can identify both established and emerging convergence patterns, offering a scalable and generalizable framework for technology forecasting grounded in full-text analysis.

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2510.25370 [cs.CL]

(or
arXiv:2510.25370v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2510.25370

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
32. 【2510.25364】CLASS-IT: Conversational and Lecture-Aligned Small-Scale Instruction Tuning for BabyLMs

链接https://arxiv.org/abs/2510.25364

作者:Luca Capone,Alessandro Bondielli,Alessandro Lenci

类目:Computation and Language (cs.CL)

关键词:instruction tuning, instruction tuning datasets, work investigates, investigates whether small-scale, question-answering instruction tuning

备注: Paper accepted for oral presentation at the BabyLM Challange 2025 (EMNLP2025)

点击查看摘要

Abstract:This work investigates whether small-scale LMs can benefit from instruction tuning. We compare conversational and question-answering instruction tuning datasets, applied either in a merged or sequential curriculum, using decoder-only models with 100M and 140M parameters. Evaluation spans both fine-tuning (SuperGLUE) and zero-shot (BLiMP, EWoK, WUGs, entity tracking, and psycholinguistic correlation) settings. Results show that instruction tuning yields small but consistent gains in fine-tuning scenarios, with sequential curricula outperforming merged data; however, improvements do not consistently transfer to zero-shot tasks, suggesting a trade-off between interaction-focused adaptation and broad linguistic generalization. These results highlight both the potential and the constraints of adapting human-inspired learning strategies to low-resource LMs, and point toward hybrid, curriculum-based approaches for enhancing generalization under ecological training limits.

33. 【2510.25356】Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments

链接https://arxiv.org/abs/2510.25356

作者:Abhishek Purushothama,Junghyun Min,Brandon Waldon,Nathan Schneider

类目:Computation and Language (cs.CL)

关键词:frequently involves assessing, interpretation frequently involves, judicial system, Legal interpretation frequently, frequently involves

备注

点击查看摘要

Abstract:Legal interpretation frequently involves assessing how a legal text, as understood by an 'ordinary' speaker of the language, applies to the set of facts characterizing a legal dispute in the U.S. judicial system. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments: varying the question format can lead the model to wildly different conclusions. Moreover, the models show weak to moderate correlation with human judgment, with large variance across model and question variant, suggesting that it is dangerous to give much credence to the conclusions produced by generative AI.

34. 【2510.25333】CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories

链接https://arxiv.org/abs/2510.25333

作者:Yilong Lai,Yipin Yang,Jialong Wu,Fengran Mo,Zhenglin Wang,Ting Liang,Jianguo Lin,Keping Yang

类目:Computation and Language (cs.CL)

关键词:Recent years, years have witnessed, witnessed the rapid, rapid development, development of LLM-based

备注

点击查看摘要

Abstract:Recent years have witnessed the rapid development of LLM-based agents, which shed light on using language agents to solve complex real-world problems. A prominent application lies in business agents, which interact with databases and internal knowledge bases via tool calls to fulfill diverse user requirements. However, this domain is characterized by intricate data relationships and a wide range of heterogeneous tasks, from statistical data queries to knowledge-based question-answering. To address these challenges, we propose CRMWeaver, a novel approach that enhances business agents in such complex settings. To acclimate the agentic model to intricate business environments, we employ a synthesis data generation and RL-based paradigm during training, which significantly improves the model's ability to handle complex data and varied tasks. During inference, a shared memories mechanism is introduced, prompting the agent to learn from task guidelines in similar problems, thereby further boosting its effectiveness and generalization, especially in unseen scenarios. We validate the efficacy of our approach on the CRMArena-Pro dataset, where our lightweight model achieves competitive results in both B2B and B2C business scenarios, underscoring its practical value for real-world applications.

35. 【2510.25320】GAP: Graph-Based Agent Planning with Parallel Tool Use and Reinforcement Learning

链接https://arxiv.org/abs/2510.25320

作者:Jiaqi Wu,Qinlao Zhao,Zefeng Chen,Kai Qin,Yifei Zhao,Xueqian Wang,Yuhang Yao

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Autonomous agents powered, shown impressive capabilities, Autonomous agents, large language models, powered by large

备注

点击查看摘要

Abstract:Autonomous agents powered by large language models (LLMs) have shown impressive capabilities in tool manipulation for complex task-solving. However, existing paradigms such as ReAct rely on sequential reasoning and execution, failing to exploit the inherent parallelism among independent sub-tasks. This sequential bottleneck leads to inefficient tool utilization and suboptimal performance in multi-step reasoning scenarios. We introduce Graph-based Agent Planning (GAP), a novel framework that explicitly models inter-task dependencies through graph-based planning to enable adaptive parallel and serial tool execution. Our approach trains agent foundation models to decompose complex tasks into dependency-aware sub-task graphs, autonomously determining which tools can be executed in parallel and which must follow sequential dependencies. This dependency-aware orchestration achieves substantial improvements in both execution efficiency and task accuracy. To train GAP, we construct a high-quality dataset of graph-based planning traces derived from the Multi-Hop Question Answering (MHQA) benchmark. We employ a two-stage training strategy: supervised fine-tuning (SFT) on the curated dataset, followed by reinforcement learning (RL) with a correctness-based reward function on strategically sampled queries where tool-based reasoning provides maximum value. Experimental results on MHQA datasets demonstrate that GAP significantly outperforms traditional ReAct baselines, particularly on multi-step retrieval tasks, while achieving dramatic improvements in tool invocation efficiency through intelligent parallelization. The project page is available at: this https URL.

36. 【2510.25310】Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning

链接https://arxiv.org/abs/2510.25310

作者:Senjie Jin,Lu Chen,Zhiheng Xi,Yuhui Wang,Sirui Song,Yuhao Zhou,Xinbo Zhang,Peng Sun,Hong Lu,Tao Gui,Qi Zhang,Xuanjing Huang

类目:Computation and Language (cs.CL)

关键词:large language models, solve mathematical reasoning, mathematical reasoning problems, Program, N-CoT

备注

点击查看摘要

Abstract:Natural language chain-of-thought (N-CoT) and Program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. Current research typically endeavors to achieve unidirectional enhancement: P-CoT enhanced N-CoT or N-CoT enhanced P-CoT. In this paper, we seek to fully unleash the two paradigms' strengths for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) Three target-designed subtasks integrate sequential P-CoT and N-CoT generation. 2) A subtask hybrid training strategy to facilitate natural language semantic transferability. 3) The converted N-CoT auxiliary reward is designed to alleviate the sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances both the performance of N-CoT and P-CoT, especially on N-CoT. Using Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA achieve gains of +21.87 and +21.48 on MathQA over the RL baseline, which is resource-intensive.

37. 【2510.25303】aching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student

链接https://arxiv.org/abs/2510.25303

作者:Soumyadeep Jana,Sanasam Ranbir Singh

类目:Computation and Language (cs.CL)

关键词:subtle image-text contradictions, scarce annotated data, detection is challenging, low-resource settings, settings where subtle

备注

点击查看摘要

Abstract:Multimodal sarcasm detection is challenging, especially in low-resource settings where subtle image-text contradictions are hard to learn due to scarce annotated data, which hinders the model's performance. Parameter-efficient fine-tuning (PEFT) methods like adapters, LoRA, and prompt tuning reduce overfitting but struggle to reach optimal performance due to limited supervision from few-shot data. We propose PEKD, a unified framework that enhances PEFT methods via distillation from an expert model trained on large-scale sarcasm data, which acts as the teacher. To mitigate unreliable signals from the teacher, we introduce an entropy-aware gating mechanism that dynamically adjusts the distillation strength based on teacher confidence. Experiments on two public datasets demonstrate that our PEKD framework enables PEFT methods to outperform both prior parameter-efficient approaches and large multimodal models, achieving strong results in the few-shot scenario. The framework is modular and adaptable to a wide range of multimodal models and tasks.

38. 【2510.25273】Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA

链接https://arxiv.org/abs/2510.25273

作者:Sandipan Majhi,Paheli Bhattacharya

类目:Computation and Language (cs.CL)

关键词:Domain-specific question answering, key challenges, scarcity of annotated, general-purpose language models, question answering

备注: Accepted at the Forum for Information Retrieval Evaluation 2025 (VATIKA Track)

点击查看摘要

Abstract:Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multi-stage finetuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large LLMs (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyse their impact on domain generalisation. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.

39. 【2510.25232】From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

链接https://arxiv.org/abs/2510.25232

作者:Tianxi Wan,Jiaming Luo,Siyuan Chen,Kunyao Lan,Jianhua Chen,Haiyang Geng,Mengyue Wu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:multiple co-occurring disorders, co-occurring disorders, clinically significant, significant yet challenging, challenging due

备注

点击查看摘要

Abstract:Psychiatric comorbidity is clinically significant yet challenging due to the complexity of multiple co-occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards. Through this rigorous process, we construct PsyCoTalk, the first large-scale dialogue dataset supporting comorbidity, containing 3,000 multi-turn diagnostic dialogues validated by psychiatrists. This dataset enhances diagnostic accuracy and treatment planning, offering a valuable resource for psychiatric comorbidity research. Compared to real-world clinical transcripts, PsyCoTalk exhibits high structural and linguistic fidelity in terms of dialogue length, token distribution, and diagnostic reasoning strategies. Licensed psychiatrists confirm the realism and diagnostic validity of the dialogues. This dataset enables the development and evaluation of models capable of multi-disorder psychiatric screening in a single conversational pass.

40. 【2510.25224】ProMediate: A Socio-cognitive framework for evaluating proactive agents in multi-party negotiation

链接https://arxiv.org/abs/2510.25224

作者:Ziyi Liu,Bahar Sarrafzadeh,Pei Zhou,Longqi Yang,Jieyu Zhao,Ashish Sharma

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, assist individual users, proactively manage complex

备注

点击查看摘要

Abstract:While Large Language Models (LLMs) are increasingly used in agentic frameworks to assist individual users, there is a growing need for agents that can proactively manage complex, multi-party collaboration. Systematic evaluation methods for such proactive agents remain scarce, limiting progress in developing AI that can effectively support multiple people together. Negotiation offers a demanding testbed for this challenge, requiring socio-cognitive intelligence to navigate conflicting interests between multiple participants and multiple topics and build consensus. Here, we present ProMediate, the first framework for evaluating proactive AI mediator agents in complex, multi-topic, multi-party negotiations. ProMediate consists of two core components: (i) a simulation testbed based on realistic negotiation cases and theory-driven difficulty levels (ProMediate-Easy, ProMediate-Medium, and ProMediate-Hard), with a plug-and-play proactive AI mediator grounded in socio-cognitive mediation theories, capable of flexibly deciding when and how to intervene; and (ii) a socio-cognitive evaluation framework with a new suite of metrics to measure consensus changes, intervention latency, mediator effectiveness, and intelligence. Together, these components establish a systematic framework for assessing the socio-cognitive intelligence of proactive AI agents in multi-party settings. Our results show that a socially intelligent mediator agent outperforms a generic baseline, via faster, better-targeted interventions. In the ProMediate-Hard setting, our social mediator increases consensus change by 3.6 percentage points compared to the generic baseline (10.65\% vs 7.01\%) while being 77\% faster in response (15.98s vs. 3.71s). In conclusion, ProMediate provides a rigorous, theory-grounded testbed to advance the development of proactive, socially intelligent agents.

41. 【2510.25206】RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models

链接https://arxiv.org/abs/2510.25206

作者:Tianqianjin Lin,Xi Zhao,Xingyao Zhang,Rujiao Long,Yi Xu,Zhuoren Jiang,Wenbo Su,Bo Zheng

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:large language models, generate high-utility reasoning, Reinforcement learning, high-utility reasoning paths, reasoning

备注: 17 pages, 11 figures

点击查看摘要

Abstract:Reinforcement learning (RL) can refine the reasoning abilities of large language models (LLMs), but critically depends on a key prerequisite: the LLM can already generate high-utility reasoning paths with non-negligible probability. For tasks beyond the LLM's current competence, such reasoning path can be hard to sample, and learning risks reinforcing familiar but suboptimal reasoning. We are motivated by the insight from cognitive science that Why is this the answer is often an easier question than What is the answer, as it avoids the heavy cognitive load of open-ended exploration, opting instead for explanatory reconstruction-systematically retracing the reasoning that links a question to its answer. We show that LLMs can similarly leverage answers to derive high-quality reasoning paths. We formalize this phenomenon and prove that conditioning on answer provably increases the expected utility of sampled reasoning paths, thereby transforming intractable problems into learnable ones. Building on this insight, we introduce RAVR (Reference-Answer-guided Variational Reasoning), an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning. Experiments in both general and math domains demonstrate consistent improvements over strong baselines. We further analyze the reasoning behavior and find that RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies in reasoning.

42. 【2510.25187】sting Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction

链接https://arxiv.org/abs/2510.25187

作者:Ritesh Sunil Chavan,Jack Mostow

类目:Computation and Language (cs.CL)

关键词:massive datasets, trained on massive, large language models, English, large language

备注

点击查看摘要

Abstract:While large language models are trained on massive datasets, this data is heavily skewed towards English. Does their impressive performance reflect genuine ability or just this data advantage? To find out, we tested them in a setting where they could not rely on data abundance: low-resource languages. Building on prior work Agarwal et al. (2025) that used Next Sentence Prediction (NSP) as a test, we created a large-scale benchmark with 10,000 questions each for English (a high-resource language), Swahili (medium-resource), and Hausa (low-resource). We then tested several top models, including GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B, to see how their performance holds up. The results painted a clear picture of how levels of language resources impact outcomes. While all models excelled in English, their accuracy dropped in Swahili and fell sharply in Hausa, with LLaMA 3 struggling the most. The story became even more interesting when we introduced Chain-of-Thought (CoT) prompting. For the struggling LLaMA 3, CoT acted as a helpful guide, significantly boosting its accuracy. However, for the more capable GPT-4 and Gemini, the same technique often backfired, leading to a kind of "overthinking" that hurt their results in the cross-lingual context. This reveals that Chain-of-Thought is not a universal solution; its effectiveness depends heavily on the model's baseline capability and the specific context of the task. Our framework pinpoints LLM weaknesses, highlights when CoT helps or hinders cross-lingual NSP performance, and factors influencing their decisions.

43. 【2510.25160】Model-Document Protocol for AI Search

链接https://arxiv.org/abs/2510.25160

作者:Hongjin Qian,Zheng Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:linking large language, external knowledge sources, vast external knowledge, search depends, depends on linking

备注: 10 pages

点击查看摘要

Abstract:AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.

Comments:
10 pages

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Cite as:
arXiv:2510.25160 [cs.CL]

(or
arXiv:2510.25160v2 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2510.25160

Focus to learn more

              arXiv-issued DOI via DataCite</p>
44. 【2510.25150】Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR

链接https://arxiv.org/abs/2510.25150

作者:Shreyas Gopal,Ashutosh Anshul,Haoyang Li,Yue Heng Yeo,Hexin Liu,Eng Siong Chng

类目:Computation and Language (cs.CL)

关键词:Discrete audio representations, Discrete audio, large language models, real-world environments, audio representations

备注: Awarded Best Student Paper at APSIPA ASC 2025

点击查看摘要

Abstract:Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing works that quantize Whisper embeddings for speech-to-unit modeling, we propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors as quantization residue which are supervised via a lightweight classifier. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noiseinvariance, and improves ASR performance. Keeping Whisper frozen, we show an 82% reduction in error rate compared to Whisper, and 35% improvement over baseline methods on the VBDemand test set. Further analyses show that the learned token space generalizes well to both seen and unseen acoustic conditions.

45. 【2510.25117】A Survey on Unlearning in Large Language Models

链接https://arxiv.org/abs/2510.25117

作者:Ruichen Qiu,Jiajun Tan,Jiayue Pu,Honglin Wang,Xiao-Shan Gao,Fei Sun

类目:Computation and Language (cs.CL)

关键词:Large Language Models, natural language processing, revolutionized natural language, poses significant risks, sensitive personal data

备注

点击查看摘要

Abstract:The advancement of Large Language Models (LLMs) has revolutionized natural language processing, yet their training on massive corpora poses significant risks, including the memorization of sensitive personal data, copyrighted material, and knowledge that could facilitate malicious activities. To mitigate these issues and align with legal and ethical standards such as the "right to be forgotten", machine unlearning has emerged as a critical technique to selectively erase specific knowledge from LLMs without compromising their overall performance. This survey provides a systematic review of over 180 papers on LLM unlearning published since 2021, focusing exclusively on large-scale generative models. Distinct from prior surveys, we introduce novel taxonomies for both unlearning methods and evaluations. We clearly categorize methods into training-time, post-training, and inference-time based on the training stage at which unlearning is applied. For evaluations, we not only systematically compile existing datasets and metrics but also critically analyze their advantages, disadvantages, and applicability, providing practical guidance to the research community. In addition, we discuss key challenges and promising future research directions. Our comprehensive overview aims to inform and guide the ongoing development of secure and reliable LLMs.

46. 【2510.25116】Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation

链接https://arxiv.org/abs/2510.25116

作者:Idriss Nguepi Nguefack,Mara Finkelstein,Toadoum Sari Sakayo

类目:Computation and Language (cs.CL)

关键词:research article examines, translation models tailored, low-resource languages, under-resourced African language, article examines

备注: 8 pages, 1. figure

点击查看摘要

Abstract:This research article examines the effectiveness of various pretraining strategies for developing machine translation models tailored to low-resource languages. Although this work considers several low-resource languages, including Afrikaans, Swahili, and Zulu, the translation model is specifically developed for Lingala, an under-resourced African language, building upon the pretraining approach introduced by Reid and Artetxe (2021), originally designed for high-resource languages. Through a series of comprehensive experiments, we explore different pretraining methodologies, including the integration of multiple languages and the use of both monolingual and parallel data during the pretraining phase. Our findings indicate that pretraining on multiple languages and leveraging both monolingual and parallel data significantly enhance translation quality. This study offers valuable insights into effective pretraining strategies for low-resource machine translation, helping to bridge the performance gap between high-resource and low-resource languages. The results contribute to the broader goal of developing more inclusive and accurate NLP models for marginalized communities and underrepresented populations. The code and datasets used in this study are publicly available to facilitate further research and ensure reproducibility, with the exception of certain data that may no longer be accessible due to changes in public availability.

47. 【2510.25110】DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates

链接https://arxiv.org/abs/2510.25110

作者:Yun-Shiuan Chuang,Ruixuan Tu,Chengtao Dai,Smit Vasani,Binwei Yao,Michael Henry Tessler,Sijia Yang,Dhavan Shah,Robert Hawkins,Junjie Hu,Timothy T. Rogers

类目:Computation and Language (cs.CL)

关键词:Accurately modeling opinion, Accurately modeling, modeling opinion change, misinformation and polarization, crucial for addressing

备注

点击查看摘要

Abstract:Accurately modeling opinion change through social interactions is crucial for addressing issues like misinformation and polarization. While role-playing large language models (LLMs) offer a promising way to simulate human-like interactions, existing research shows that single-agent alignment does not guarantee authentic multi-agent group dynamics. Current LLM role-play setups often produce unnatural dynamics (e.g., premature convergence), without an empirical benchmark to measure authentic human opinion trajectories. To bridge this gap, we introduce DEBATE, the first large-scale empirical benchmark explicitly designed to evaluate the authenticity of the interaction between multi-agent role-playing LLMs. DEBATE contains 29,417 messages from multi-round debate conversations among over 2,792 U.S.-based participants discussing 107 controversial topics, capturing both publicly-expressed messages and privately-reported opinions. Using DEBATE, we systematically evaluate and identify critical discrepancies between simulated and authentic group dynamics. We further demonstrate DEBATE's utility for aligning LLMs with human behavior through supervised fine-tuning, achieving improvements in surface-level metrics (e.g., ROUGE-L and message length) while highlighting limitations in deeper semantic alignment (e.g., semantic similarity). Our findings highlight both the potential and current limitations of role-playing LLM agents for realistically simulating human-like social dynamics.

48. 【2510.25101】KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA

链接https://arxiv.org/abs/2510.25101

作者:Zhuo Chen,Fei Wang,Zixuan Li,Zhao Zhang,Weiwei Ding,Chuanguang Yang,Yongjun Xu,Xiaolong Jin,Jiafeng Guo

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:structured Knowledge Base, Knowledge Base Question, Base Question Answering, Knowledge Base, Large Language Models

备注

点击查看摘要

Abstract:Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate its corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen the agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via a multi-stage curriculum reinforcement learning with an easy-to-hard curriculum. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.

49. 【2510.25087】BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs

链接https://arxiv.org/abs/2510.25087

作者:Nourah M Salem,Elizabeth White,Michael Bada,Lawrence Hunter

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:unique challenges due, texts presents unique, presents unique challenges, complex domain-specific terminology, biomedical texts presents

备注

点击查看摘要

Abstract:Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs' performance with four prompting experiments that vary in their use of local, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding prompts, their performance remains sensitive to long-range context and mentions ambiguity. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting, highlighting the potential of lightweight prompt engineering for enhancing LLM utility in biomedical NLP tasks.

50. 【2510.25069】OPol: Capturing and Explaining Multidimensional Semantic Polarity Fields and Vectors

链接https://arxiv.org/abs/2510.25069

作者:Gabin Taibi,Lucia Gomez

类目:Computation and Language (cs.CL)

关键词:computational linguistics treat, linguistics treat sentiment, Traditional approaches, unidimensional scale, computational linguistics

备注: 7 pages, 3 figures and 2 tables

点击查看摘要

Abstract:Traditional approaches to semantic polarity in computational linguistics treat sentiment as a unidimensional scale, overlooking the multidimensional structure of language. This work introduces TOPol (Topic-Orientation POLarity), a semi-unsupervised framework for reconstructing and interpreting multidimensional narrative polarity fields under human-on-the-loop (HoTL) defined contextual boundaries (CBs). The framework embeds documents using a transformer-based large language model (tLLM), applies neighbor-tuned UMAP projection, and segments topics via Leiden partitioning. Given a CB between discourse regimes A and B, TOPol computes directional vectors between corresponding topic-boundary centroids, yielding a polarity field that quantifies fine-grained semantic displacement during regime shifts. This vectorial representation enables assessing CB quality and detecting polarity changes, guiding HoTL CB refinement. To interpret identified polarity vectors, the tLLM compares their extreme points and produces contrastive labels with estimated coverage. Robustness analyses show that only CB definitions (the main HoTL-tunable parameter) significantly affect results, confirming methodological stability. We evaluate TOPol on two corpora: (i) U.S. Central Bank speeches around a macroeconomic breakpoint, capturing non-affective semantic shifts, and (ii) Amazon product reviews across rating strata, where affective polarity aligns with NRC valence. Results demonstrate that TOPol consistently captures both affective and non-affective polarity transitions, providing a scalable, generalizable, and interpretable framework for context-sensitive multidimensional discourse analysis.

51. 【2510.25064】Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

链接https://arxiv.org/abs/2510.25064

作者:Seonjeong Hwang,Hyounghun Kim,Gary Geunbae Lee

类目:Computation and Language (cs.CL)

关键词:reading comprehension, administered to learners, cognitive complexity, Estimating the cognitive, crucial for assessing

备注

点击查看摘要

Abstract:Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before it is administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions-Evidence Scope and Transformation Level-that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs' reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.

52. 【2510.25055】GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models

链接https://arxiv.org/abs/2510.25055

作者:Nourah M Salem,Elizabeth White,Michael Bada,Lawrence Hunter

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Scientific progress, remains unknown, progress is driven, deliberate articulation, knowledge gaps

备注

点击查看摘要

Abstract:Scientific progress is driven by the deliberate articulation of what remains unknown. This study investigates the ability of large language models (LLMs) to identify research knowledge gaps in the biomedical literature. We define two categories of knowledge gaps: explicit gaps, clear declarations of missing knowledge; and implicit gaps, context-inferred missing knowledge. While prior work has focused mainly on explicit gap detection, we extend this line of research by addressing the novel task of inferring implicit gaps. We conducted two experiments on almost 1500 documents across four datasets, including a manually annotated corpus of biomedical articles. We benchmarked both closed-weight models (from OpenAI) and open-weight models (Llama and Gemma 2) under paragraph-level and full-paper settings. To address the reasoning of implicit gaps inference, we introduce \textbf{\small TABI}, a Toulmin-Abductive Bucketed Inference scheme that structures reasoning and buckets inferred conclusion candidates for validation. Our results highlight the robust capability of LLMs in identifying both explicit and implicit knowledge gaps. This is true for both open- and closed-weight models, with larger variants often performing better. This suggests a strong ability of LLMs for systematically identifying candidate knowledge gaps, which can support early-stage research formulation, policymakers, and funding decisions. We also report observed failure modes and outline directions for robust deployment, including domain adaptation, human-in-the-loop verification, and benchmarking across open- and closed-weight models.

53. 【2510.25054】Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

链接https://arxiv.org/abs/2510.25054

作者:Pedro Corrêa,João Lima,Victor Moreno,Paula Dornhofer Paro Costa

类目:Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词:spoken language models, spoken language processing, achieve universal audio, universal audio understanding, jointly learning text

备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, indicating that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.

54. 【2510.25017】StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems

链接https://arxiv.org/abs/2510.25017

作者:Qi Lin,Zhenyu Zhang,Viraj Thakkar,Zhenjie Sun,Mai Zheng,Zhichao Cao

类目:Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Automatically configuring storage, configuring storage systems, Automatically configuring, parameter spaces, vary across workloads

备注: ArXiv version; Affiliations: Arizona State University (Lin, Zhang, Thakkar, Sun, Cao) and Iowa State University (Zheng)

点击查看摘要

Abstract:Automatically configuring storage systems is hard: parameter spaces are large and conditions vary across workloads, deployments, and versions. Heuristic and ML tuners are often system specific, require manual glue, and degrade under changes. Recent LLM-based approaches help but usually treat tuning as a single-shot, system-specific task, which limits cross-system reuse, constrains exploration, and weakens validation. We present StorageXTuner, an LLM agent-driven auto-tuning framework for heterogeneous storage engines. StorageXTuner separates concerns across four agents - Executor (sandboxed benchmarking), Extractor (performance digest), Searcher (insight-guided configuration exploration), and Reflector (insight generation and management). The design couples an insight-driven tree search with layered memory that promotes empirically validated insights and employs lightweight checkers to guard against unsafe actions. We implement a prototype and evaluate it on RocksDB, LevelDB, CacheLib, and MySQL InnoDB with YCSB, MixGraph, and TPC-H/C. Relative to out-of-the-box settings and to ELMo-Tune, StorageXTuner reaches up to 575% and 111% higher throughput, reduces p99 latency by as much as 88% and 56%, and converges with fewer trials.

55. 【2510.25013】Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers

链接https://arxiv.org/abs/2510.25013

作者:Rabin Adhikari

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Mechanistic interpretability aims, reverse-engineer large language, Mechanistic interpretability, large language models, Indirect Object Identification

备注: 9 pages, 10 figures

点击查看摘要

Abstract:Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks. In this work, we train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification (IOI) task -- a benchmark for studying coreference -- like reasoning in transformers. Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution. Furthermore, we show that a two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions. These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.

56. 【2510.24992】POWSM: A Phonetic Open Whisper-Style Speech Foundation Model

链接https://arxiv.org/abs/2510.24992

作者:Chin-Jou Li,Kalvin Chang,Shikhar Bharadwaj,Eunjung Yeo,Kwanghee Choi,Jian Zhu,David Mortensen,Shinji Watanabe

类目:Computation and Language (cs.CL)

关键词:automatic speech recognition, Recent advances, spoken language processing, advances in spoken, spoken language

备注: 14 pages, under review

点击查看摘要

Abstract:Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code and models are released to foster open science.

57. 【2510.24966】Sequences of Logits Reveal the Low Rank Structure of Language Models

链接https://arxiv.org/abs/2510.24966

作者:Noah Golowich,Allen Liu,Abhishek Shetty

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

关键词:inherent low-dimensional structure, large language models, language models, major problem, understand their inherent

备注

点击查看摘要

Abstract:A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model's logits for varying sets of prompts and responses have low approximate rank. We then show that this low-rank structure can be leveraged for generation -- in particular, we can generate a response to a target prompt using a linear combination of the model's outputs on unrelated, or even nonsensical prompts. On the theoretical front, we observe that studying the approximate rank of language models in the sense discussed above yields a simple universal abstraction whose theoretical predictions parallel our experiments. We then analyze the representation power of the abstraction and give provable learning guarantees.

Subjects:

Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

Cite as:
arXiv:2510.24966 [cs.LG]

(or
arXiv:2510.24966v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2510.24966

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
58. 【2510.24963】Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale

链接https://arxiv.org/abs/2510.24963

作者:James A. Michaelov,Roger P. Levy,Benjamin K. Bergen

类目:Computation and Language (cs.CL)

关键词:Transformer vs. Mamba, Mamba vs. RWKV, models exhibit highly, exhibit highly consistent, highly consistent patterns

备注: To be presented at NeurIPS 2025

点击查看摘要

Abstract:We show that across architecture (Transformer vs. Mamba vs. RWKV), training dataset (OpenWebText vs. The Pile), and scale (14 million parameters to 12 billion parameters), autoregressive language models exhibit highly consistent patterns of change in their behavior over the course of pretraining. Based on our analysis of over 1,400 language model checkpoints on over 110,000 tokens of English, we find that up to 98% of the variance in language model behavior at the word level can be explained by three simple heuristics: the unigram probability (frequency) of a given word, the $n$-gram probability of the word, and the semantic similarity between the word and its context. Furthermore, we see consistent behavioral phases in all language models, with their predicted probabilities for words overfitting to those words' $n$-gram probabilities for increasing $n$ over the course of training. Taken together, these results suggest that learning in neural language models may follow a similar trajectory irrespective of model details.

59. 【2510.24942】Finding Culture-Sensitive Neurons in Vision-Language Models

链接https://arxiv.org/abs/2510.24942

作者:Xiutian Zhao,Rochelle Choenni,Rohit Saxena,Ivan Titov

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:culturally situated inputs, vision-language models, neurons, situated inputs, Contrastive Activation Selection

备注: 22 pages, 13 figures

点击查看摘要

Abstract:Despite their impressive performance, vision-language models (VLMs) still struggle on culturally situated inputs. To understand how VLMs process culturally grounded information, we study the presence of culture-sensitive neurons, i.e. neurons whose activations show preferential sensitivity to inputs associated with particular cultural contexts. We examine whether such neurons are important for culturally diverse visual question answering and where they are located. Using the CVQA benchmark, we identify neurons of culture selectivity and perform causal tests by deactivating the neurons flagged by different identification methods. Experiments on three VLMs across 25 cultural groups demonstrate the existence of neurons whose ablation disproportionately harms performance on questions about the corresponding cultures, while having minimal effects on others. Moreover, we propose a new margin-based selector - Contrastive Activation Selection (CAS), and show that it outperforms existing probability- and entropy-based methods in identifying culture-sensitive neurons. Finally, our layer-wise analyses reveals that such neurons tend to cluster in certain decoder layers. Overall, our findings shed new light on the internal organization of multimodal representations.

60. 【2510.24940】SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens

链接https://arxiv.org/abs/2510.24940

作者:Yinhan He,Wendy Zheng,Yaochen Zhu,Zaiyi Zheng,Lin Su,Sriram Vasudevan,Qi Guo,Liangjie Hong,Jundong Li

类目:Computation and Language (cs.CL)

关键词:implicit reasoning, reasoning, implicit, efficiency-critical applications, CoT

备注

点击查看摘要

Abstract:The verbosity of Chain-of-Thought (CoT) reasoning hinders its mass deployment in efficiency-critical applications. Recently, implicit CoT approaches have emerged, which encode reasoning steps within LLM's hidden embeddings (termed ``implicit reasoning'') rather than explicit tokens. This approach accelerates CoT by reducing the reasoning length and bypassing some LLM components. However, existing implicit CoT methods face two significant challenges: (1) they fail to preserve the semantic alignment between the implicit reasoning (when transformed to natural language) and the ground-truth reasoning, resulting in a significant CoT performance degradation, and (2) they focus on reducing the length of the implicit reasoning; however, they neglect the considerable time cost for an LLM to generate one individual implicit reasoning token. To tackle these challenges, we propose a novel semantically-aligned implicit CoT framework termed SemCoT. In particular, for the first challenge, we design a contrastively trained sentence transformer that evaluates semantic alignment between implicit and explicit reasoning, which is used to enforce semantic preservation during implicit reasoning optimization. To address the second challenge, we introduce an efficient implicit reasoning generator by finetuning a lightweight language model using knowledge distillation. This generator is guided by our sentence transformer to distill ground-truth reasoning into semantically aligned implicit reasoning, while also optimizing for accuracy. SemCoT is the first approach that enhances CoT efficiency by jointly optimizing token-level generation speed and preserving semantic alignment with ground-truth reasoning. Extensive experiments demonstrate the superior performance of SemCoT compared to state-of-the-art methods in both efficiency and effectiveness. Our code can be found at this https URL.

61. 【2510.24934】Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction

链接https://arxiv.org/abs/2510.24934

作者:James A. Michaelov,Catherine Arnett

类目:Computation and Language (cs.CL)

关键词:produce grammatical text, Language models, make errors, models generally produce, language model behavior

备注: Accepted to the First Workshop on Interpreting Cognition in Deep Learning Models (CogInterp @ NeurIPS 2025)

点击查看摘要

Abstract:Language models generally produce grammatical text, but they are more likely to make errors in certain contexts. Drawing on paradigms from psycholinguistics, we carry out a fine-grained analysis of those errors in different syntactic contexts. We demonstrate that by disaggregating over the conditions of carefully constructed datasets and comparing model performance on each over the course of training, it is possible to better understand the intermediate stages of grammatical learning in language models. Specifically, we identify distinct phases of training where language model behavior aligns with specific heuristics such as word frequency and local context rather than generalized grammatical rules. We argue that taking this approach to analyzing language model behavior more generally can serve as a powerful tool for understanding the intermediate learning phases, overall training dynamics, and the specific generalizations learned by language models.

62. 【2510.24932】RiddleBench: A New Generative Reasoning Benchmark for LLMs

链接https://arxiv.org/abs/2510.24932

作者:Deepon Halder,Alan Saji,Thanmay Jayakumar,Ratish Puduppully,Anoop Kunchukuttan,Raj Dabre

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Large, established reasoning benchmarks, Models

备注

点击查看摘要

Abstract:Large Language Models have demonstrated strong performance on many established reasoning benchmarks. However, these benchmarks primarily evaluate structured skills like quantitative problem-solving, leaving a gap in assessing flexible, multifaceted reasoning abilities that are central to human intelligence. These abilities require integrating logical deduction with spatial awareness and constraint satisfaction, which current evaluations do not measure well. To address this, we introduce RiddleBench, a benchmark of 1,737 challenging puzzles in English designed to probe these core reasoning capabilities. Evaluation of state-of-the-art models on RiddleBench shows fundamental weaknesses. Even top proprietary models like Gemini 2.5 Pro, o3, and Claude 4 Sonnet achieve accuracy just above 60% (60.30%, 63.37%, and 63.16%). Analysis further reveals deep failures, including hallucination cascades (accepting flawed reasoning from other models) and poor self-correction due to a strong self-confirmation bias. Their reasoning is also fragile, with performance degrading significantly when constraints are reordered or irrelevant information is introduced. RiddleBench functions as a diagnostic tool for these issues and as a resource for guiding the development of more robust and reliable language models.

63. 【2510.24891】Idea2Plan: Exploring AI-Powered Research Planning

链接https://arxiv.org/abs/2510.24891

作者:Jin Huang,Silviu Cucerzan,Sujay Kumar Jauhar,Ryen W. White

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, accelerate scientific discovery, demonstrated significant potential, supporting innovative approaches, Large language

备注

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated significant potential to accelerate scientific discovery as valuable tools for analyzing data, generating hypotheses, and supporting innovative approaches in various scientific fields. In this work, we investigate how LLMs can handle the transition from conceptual research ideas to well-structured research plans. Effective research planning not only supports scientists in advancing their research but also represents a crucial capability for the development of autonomous research agents. Despite its importance, the field lacks a systematic understanding of LLMs' research planning capability. To rigorously measure this capability, we introduce the Idea2Plan task and Idea2Plan Bench, a benchmark built from 200 ICML 2025 Spotlight and Oral papers released after major LLM training cutoffs. Each benchmark instance includes a research idea and a grading rubric capturing the key components of valid plans. We further propose Idea2Plan JudgeEval, a complementary benchmark to assess the reliability of LLM-based judges against expert annotations. Experimental results show that GPT-5 and GPT-5-mini achieve the strongest performance on the benchmark, though substantial headroom remains for future improvement. Our study provides new insights into LLMs' capability for research planning and lay the groundwork for future progress.

64. 【2510.24870】Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation

链接https://arxiv.org/abs/2510.24870

作者:Alexander Martin,William Walden,Reno Kriz,Dengjia Zhang,Kate Sanders,Eugene Yang,Chihsheng Jin,Benjamin Van Durme

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:RAG, framework for retrieval-augmented, retrieval-augmented generation, multimodal RAG, multimodal RAG evaluation

备注: [this https URL](https://github.com/alexmartin1722/mirage)

点击查看摘要

Abstract:We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning intensive settings because they don't verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics -- ACLE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.

65. 【2510.24856】Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish

链接https://arxiv.org/abs/2510.24856

作者:Lujun Li,Yewei Song,Lama Sleem,Yiqun Wang,Yangjie Xu,Cedric Lothritz,Niccolo Gentile,Radu State,Tegawende F. Bissyande,Jacques Klein

类目:Computation and Language (cs.CL)

关键词:Grammar Book Guided, system of rules, rules that governs, governs the structural, structural organization

备注

点击查看摘要

Abstract:Grammar refers to the system of rules that governs the structural organization and the semantic relations among linguistic units such as sentences, phrases, and words within a given language. In natural language processing, there remains a notable scarcity of grammar focused evaluation protocols, a gap that is even more pronounced for low-resource languages. Moreover, the extent to which large language models genuinely comprehend grammatical structure, especially the mapping between syntactic structures and meanings, remains under debate. To investigate this issue, we propose a Grammar Book Guided evaluation pipeline intended to provide a systematic and generalizable framework for grammar evaluation consisting of four key stages, and in this work we take Luxembourgish as a case study. The results show a weak positive correlation between translation performance and grammatical understanding, indicating that strong translations do not necessarily imply deep grammatical competence. Larger models perform well overall due to their semantic strength but remain weak in morphology and syntax, struggling particularly with Minimal Pair tasks, while strong reasoning ability offers a promising way to enhance their grammatical understanding.

66. 【2510.24824】Parallel Loop Transformer for Efficient Test-Time Computation Scaling

链接https://arxiv.org/abs/2510.24824

作者:Bohong Wu,Mengzhao Chen,Xiang Luo,Shen Yan,Qifan Yu,Fan Xia,Tianqi Zhang,Hongrui Zhan,Zheng Zhong,Xun Zhou,Siyuan Qiao,Xingyan Bin

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, slow and costly, costly for real-world

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are powerful but often too slow and costly for real-world use during inference. Looped transformers save on parameters by reusing the same weights for multiple computational steps, or "loops." However, this approach has a major flaw: the loops run one after another, causing inference latency and memory requirements to increase with each added loop. This makes them impractical for fast applications. To solve this problem, we introduce the Parallel Loop Transformer (PLT). PLT is a new architecture that delivers the performance benefits of a deep, looped model but with the low latency of a standard, non-looped model. PLT works using two key techniques. First, Cross-Loop Parallelism (CLP) breaks the sequential dependency by computing different loops for different tokens at the same time, all within a single pass. Second, to prevent memory costs from growing, we use an Efficient Representation Enhancement strategy. This method shares the memory (KV cache) from the first loop with all other loops. It then uses a Gated Sliding-Window Attention (G-SWA) to combine this shared global information with local information, maintaining high accuracy. Our experiments show that PLT achieves the high accuracy of a traditional looped model but with almost no extra latency or memory cost compared to a standard transformer.

67. 【2510.24817】owards a Method for Synthetic Generation of PWA Transcripts

链接https://arxiv.org/abs/2510.24817

作者:Jason M. Pittman,Anton Phillips Jr.,Yesenia Medina-Santos,Brielle C. Stark

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Correct Information Units, Information Units, coding speech samples, Correct Information, manually coding speech

备注: 19 pages, 1 figure, 7 tables

点击查看摘要

Abstract:In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when such are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method leverages a procedural programming approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of linguistic degradation observed in aphasia, showing realistic directional changes in NDW, word count, and word length amongst the synthetic generation methods. Based on the results, future work should plan to create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts.

68. 【2510.24811】ProofSketch: Efficient Verified Reasoning for Large Language Models

链接https://arxiv.org/abs/2510.24811

作者:Disha Sheshanarayana,Tanishka Magar

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:shown immense potential, large language models, prompting and self-consistency, self-consistency have shown, shown immense

备注: Accepted at NeurIPS 2025, ER Workshop

点击查看摘要

Abstract:Reasoning methods such as chain-of-thought prompting and self-consistency have shown immense potential to improve the accuracy of large language models across various reasoning tasks. However such methods involve generation of lengthy reasoning chains, which substantially increases token consumption, computational cost, and latency. To address this inefficiency, we propose ProofSketch, a verification-guided reasoning framework that integrates symbolic closure computation, lexicographic verification and adaptive sketch generation. Our experiments show that ProofSketch consistently reduces token usage while improving accuracy, demonstrating that this approach offers a promising path for efficient and trustworthy reasoning.

69. 【2510.24810】COMMUNITYNOTES: A Dataset for Exploring the Helpfulness of Fact-Checking Explanations

链接https://arxiv.org/abs/2510.24810

作者:Rui Xing,Preslav Nakov,Timothy Baldwin,Jey Han Lau

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:users contribute explanatory, major platforms, community-based setup, shifting from expert-driven, expert-driven verification

备注

点击查看摘要

Abstract:Fact-checking on major platforms, such as X, Meta, and TikTok, is shifting from expert-driven verification to a community-based setup, where users contribute explanatory notes to clarify why a post might be misleading. An important challenge here is determining whether an explanation is helpful for understanding real-world claims and the reasons why, which remains largely underexplored in prior research. In practice, most community notes remain unpublished due to slow community annotation, and the reasons for helpfulness lack clear definitions. To bridge these gaps, we introduce the task of predicting both the helpfulness of explanatory notes and the reason for this. We present COMMUNITYNOTES, a large-scale multilingual dataset of 104k posts with user-provided notes and helpfulness labels. We further propose a framework that automatically generates and improves reason definitions via automatic prompt optimization, and integrate them into prediction. Our experiments show that the optimized definitions can improve both helpfulness and reason prediction. Finally, we show that the helpfulness information are beneficial for existing fact-checking systems.

70. 【2510.24804】Conflict Adaptation in Vision-Language Models

链接https://arxiv.org/abs/2510.24804

作者:Xiaoyang Hu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:improved performance, high-conflict trial, cognitive control, human cognitive control, conflict adaptation

备注: Workshop on Interpreting Cognition in Deep Learning Models at NeurIPS 2025

点击查看摘要

Abstract:A signature of human cognitive control is conflict adaptation: improved performance on a high-conflict trial following another high-conflict trial. This phenomenon offers an account for how cognitive control, a scarce resource, is recruited. Using a sequential Stroop task, we find that 12 of 13 vision-language models (VLMs) tested exhibit behavior consistent with conflict adaptation, with the lone exception likely reflecting a ceiling effect. To understand the representational basis of this behavior, we use sparse autoencoders (SAEs) to identify task-relevant supernodes in InternVL 3.5 4B. Partially overlapping supernodes emerge for text and color in both early and late layers, and their relative sizes mirror the automaticity asymmetry between reading and color naming in humans. We further isolate a conflict-modulated supernode in layers 24-25 whose ablation significantly increases Stroop errors while minimally affecting congruent trials.

71. 【2510.24801】Fortytwo: Swarm Inference with Peer-Ranked Consensus

链接https://arxiv.org/abs/2510.24801

作者:Vladyslav Larin,Ihor Naumenko,Aleksei Ivashov,Ivan Nikitin,Alexander Firsov

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

关键词:ever-larger training runs, meeting demand requires, hits compute ceilings, training runs, meeting demand

备注

点击查看摘要

Abstract:As centralized AI hits compute ceilings and diminishing returns from ever-larger training runs, meeting demand requires an inference layer that scales horizontally in both capacity and capability. We present Fortytwo, a novel protocol that leverages swarm intelligence principles and distributed pairwise ranking consensus to achieve superior performance in AI inference. Our approach reimagines collaboration among AI nodes using swarm inference: a peer-ranked, reputation-weighted consensus across heterogeneous models that surfaces the highest-quality responses. Using pairwise ranking with a custom Bradley-Terry-style aggregation model, we demonstrate that swarm inference substantially outperforms majority voting, achieving 85.90% on GPQA Diamond versus 68.69% for majority voting with the same model set - an improvement of +17.21 percentage points (approximately +25.1% relative). The protocol incorporates on-chain reputation so node influence adapts to demonstrated accuracy over time, yielding a meritocratic consensus that filters low-quality or malicious participants. To resist Sybil attacks, Fortytwo employs proof-of-capability in its consensus: nodes must successfully complete calibration/test requests and stake reputation to enter ranking rounds, making multi-identity attacks economically unattractive while preserving openness. Across six challenging benchmarks, including GPQA Diamond, LiveCodeBench, and AIME, our evaluation indicates higher accuracy and strong resilience to adversarial and noisy free-form prompting (e.g., prompt-injection degradation of only 0.12% versus 6.20% for a monolithic single-model baseline), while retaining practical deployability. Together, these results establish a foundation for decentralized AI systems - democratizing access to high-quality inference through collective intelligence without sacrificing reliability or security.

72. 【2510.24797】Large Language Models Report Subjective Experience Under Self-Referential Processing

链接https://arxiv.org/abs/2510.24797

作者:Cameron Berg,Diogo de Lucena,Judd Rosenblatt

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:explicitly reference awareness, explicitly reference, reference awareness, Large language models, subjective experience

备注

点击查看摘要

Abstract:Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.

73. 【2510.24794】MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models

链接https://arxiv.org/abs/2510.24794

作者:Xinming Wang,Jian Xu,Bin Yu,Sheng Lian,Hongzhu Yi,Yi Chen,Yingjian Zhu,Boran Wang,Hongming Yang,Han Hu,Xu-Yao Zhang,Cheng-Lin Liu

类目:Computation and Language (cs.CL)

关键词:Large reasoning models, evidence-dependent factual questions, show strong capabilities, questions are limited, Large reasoning

备注: Preprint

点击查看摘要

Abstract:Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response, thereby reducing factual fidelity. To address this issue, we propose MR-ALIGN, a Meta-Reasoning informed alignment framework that enhances factuality without relying on external verifiers. MR-ALIGN quantifies state transition probabilities along the model's thinking process and constructs a transition-aware implicit reward that reinforces beneficial reasoning patterns while suppressing defective ones at the atomic thinking segments. This re-weighting reshapes token-level signals into probability-aware segment scores, encouraging coherent reasoning trajectories that are more conducive to factual correctness. Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show that MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning. These results highlight that aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in LRMs.

74. 【2510.24793】SwiftEmbed: Ultra-Fast Text Embeddings via Static Token Lookup for Real-Time Applications

链接https://arxiv.org/abs/2510.24793

作者:Edouard Lansiaux

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:MTEB average score, contextual model quality, MTEB average, token lookup methodology, text embedding generation

备注

点击查看摘要

Abstract:We present a static token lookup methodology for text embedding generation that achieves 1.12 ms p50 latency for single text embeddings while maintaining 60.6 MTEB average score across 8 representative tasks, corresponding to 89% of contextual model quality. The Rust implementation delivers 50,000 requests per second throughput through static embedding lookup, optimized mean pooling, and zero-copy IEEE754 binary serialization. Evaluation demonstrates exceptional duplicate detection performance (90.1% AP), strong semantic similarity (76.1% Spearman correlation), and domain-specific performance ranging from 75% to 131% of baseline across specialized domains. The system enables real-time embedding applications where sub-5ms latency is critical.

75. 【2510.24789】Cross-Lingual Summarization as a Black-Box Watermark Removal Attack

链接https://arxiv.org/abs/2510.24789

作者:Gokul Ganesan

类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR)

关键词:identify AI-generated text, schemes typically relying, token distributions, lightweight mechanism, mechanism to identify

备注

点击查看摘要

Abstract:Watermarking has been proposed as a lightweight mechanism to identify AI-generated text, with schemes typically relying on perturbations to token distributions. While prior work shows that paraphrasing can weaken such signals, these attacks remain partially detectable or degrade text quality. We demonstrate that cross-lingual summarization attacks (CLSA) -- translation to a pivot language followed by summarization and optional back-translation -- constitute a qualitatively stronger attack vector. By forcing a semantic bottleneck across languages, CLSA systematically destroys token-level statistical biases while preserving semantic fidelity. In experiments across multiple watermarking schemes (KGW, SIR, XSIR, Unigram) and five languages (Amharic, Chinese, Hindi, Spanish, Swahili), we show that CLSA reduces watermark detection accuracy more effectively than monolingual paraphrase at similar quality levels. Our results highlight an underexplored vulnerability that challenges the practicality of watermarking for provenance or regulation. We argue that robust provenance solutions must move beyond distributional watermarking and incorporate cryptographic or model-attestation approaches. On 300 held-out samples per language, CLSA consistently drives detection toward chance while preserving task utility. Concretely, for XSIR (explicitly designed for cross-lingual robustness), AUROC with paraphrasing is $0.827$, with Cross-Lingual Watermark Removal Attacks (CWRA) [He et al., 2024] using Chinese as the pivot, it is $0.823$, whereas CLSA drives it down to $0.53$ (near chance). Results highlight a practical, low-cost removal pathway that crosses languages and compresses content without visible artifacts.

76. 【2510.24774】PANORAMA: A Dataset and Benchmarks Capturing Decision Trails and Rationales in Patent Examination

链接https://arxiv.org/abs/2510.24774

作者:Hyunseung Lim,Sooyohn Nam,Sungmin Na,Ji Yong Cho,June Yong Yang,Hyungyu Shin,Yoonjoo Lee,Juho Kim,Moontae Lee,Hwajung Hong

类目:Computers and Society (cs.CY); Computation and Language (cs.CL)

关键词:nuanced human judgment, submitted claim meets, previously granted claims, NLP literature, large language models

备注

点击查看摘要

Abstract:Patent examination remains an ongoing challenge in the NLP literature even after the advent of large language models (LLMs), as it requires an extensive yet nuanced human judgment on whether a submitted claim meets the statutory standards of novelty and non-obviousness against previously granted claims -- prior art -- in expert domains. Previous NLP studies have approached this challenge as a prediction task (e.g., forecasting grant outcomes) with high-level proxies such as similarity metrics or classifiers trained on historical labels. However, this approach often overlooks the step-by-step evaluations that examiners must make with profound information, including rationales for the decisions provided in office actions documents, which also makes it harder to measure the current state of techniques in patent review processes. To fill this gap, we construct PANORAMA, a dataset of 8,143 U.S. patent examination records that preserves the full decision trails, including original applications, all cited references, Non-Final Rejections, and Notices of Allowance. Also, PANORAMA decomposes the trails into sequential benchmarks that emulate patent professionals' patent review processes and allow researchers to examine large language models' capabilities at each step of them. Our findings indicate that, although LLMs are relatively effective at retrieving relevant prior art and pinpointing the pertinent paragraphs, they struggle to assess the novelty and non-obviousness of patent claims. We discuss these results and argue that advancing NLP, including LLMs, in the patent domain requires a deeper understanding of real-world patent examination. Our dataset is openly available at this https URL.

77. 【2510.24772】Confidence is Not Competence

链接https://arxiv.org/abs/2510.24772

作者:Debdeep Sanyal,Manya Pandey,Dhruv Kumar,Saurabh Deshpande,Murari Mandal

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, actual problem-solving competence, Large language, problem-solving competence, exhibit a puzzling

备注: 20 Pages, 6 Figures, 8 Tables

点击查看摘要

Abstract:Large language models (LLMs) often exhibit a puzzling disconnect between their asserted confidence and actual problem-solving competence. We offer a mechanistic account of this decoupling by analyzing the geometry of internal states across two phases - pre-generative assessment and solution execution. A simple linear probe decodes the internal "solvability belief" of a model, revealing a well-ordered belief axis that generalizes across model families and across math, code, planning, and logic tasks. Yet, the geometries diverge - although belief is linearly decodable, the assessment manifold has high linear effective dimensionality as measured from the principal components, while the subsequent reasoning trace evolves on a much lower-dimensional manifold. This sharp reduction in geometric complexity from thought to action mechanistically explains the confidence-competence gap. Causal interventions that steer representations along the belief axis leave final solutions unchanged, indicating that linear nudges in the complex assessment space do not control the constrained dynamics of execution. We thus uncover a two-system architecture - a geometrically complex assessor feeding a geometrically simple executor. These results challenge the assumption that decodable beliefs are actionable levers, instead arguing for interventions that target the procedural dynamics of execution rather than the high-level geometry of assessment.

78. 【2510.24765】opic-aware Large Language Models for Summarizing the Lived Healthcare Experiences Described in Health Stories

链接https://arxiv.org/abs/2510.24765

作者:Maneesh Bilalpur,Megan Hamm,Young Ji Lee,Natasha Norman,Kathleen M. McTigue,Yanshan Wang

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Latent Dirichlet Allocation, African American, powerful form, form of communication

备注

点击查看摘要

Abstract:Storytelling is a powerful form of communication and may provide insights into factors contributing to gaps in healthcare outcomes. To determine whether Large Language Models (LLMs) can identify potential underlying factors and avenues for intervention, we performed topic-aware hierarchical summarization of narratives from African American (AA) storytellers. Fifty transcribed stories of AA experiences were used to identify topics in their experience using the Latent Dirichlet Allocation (LDA) technique. Stories about a given topic were summarized using an open-source LLM-based hierarchical summarization approach. Topic summaries were generated by summarizing across story summaries for each story that addressed a given topic. Generated topic summaries were rated for fabrication, accuracy, comprehensiveness, and usefulness by the GPT4 model, and the model's reliability was validated against the original story summaries by two domain experts. 26 topics were identified in the fifty AA stories. The GPT4 ratings suggest that topic summaries were free from fabrication, highly accurate, comprehensive, and useful. The reliability of GPT ratings compared to expert assessments showed moderate to high agreement. Our approach identified AA experience-relevant topics such as health behaviors, interactions with medical team members, caregiving and symptom management, among others. Such insights could help researchers identify potential factors and interventions by learning from unstructured narratives in an efficient manner-leveraging the communicative power of storytelling. The use of LDA and LLMs to identify and summarize the experience of AA individuals suggests a variety of possible avenues for health research and possible clinical improvements to support patients and caregivers, thereby ultimately improving health outcomes.

79. 【2510.24762】Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation

链接https://arxiv.org/abs/2510.24762

作者:Wenzhen Luo,Wei Guan,Yifan Yao,Yimin Pan,Feng Wang,Zhipeng Yu,Zhe Wen,Liang Chen,Yihong Zhuang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:benchmark grounded, cross-domain Chinese, automated evaluation pipeline, Chinese, introduce Falcon

备注

点击查看摘要

Abstract:We introduce Falcon, a cross-domain Chinese text-to-SQL benchmark grounded in an enterprise-compatible dialect (MaxCompute/Hive). It contains 600 Chinese questions over 28 databases; 77% require multi-table reasoning and over half touch more than four tables. Each example is annotated along SQL-computation features and Chinese semantics. For evaluation, we release a robust execution comparator and an automated evaluation pipeline, under which all current state-of-the-art large-scale models (including Deepseek) achieve accuracies of at most 50%. Major errors originate from two sources: (1) schema linking in large enterprise landscapes - hundreds of tables, denormalized fields, ambiguous column names, implicit foreign-key relations and domain-specific synonyms that make correct join/column selection difficult; and (2) mapping concise, colloquial Chinese into the exact operators and predicates required for analytics - e.g., choosing the correct aggregation and group-by keys, expressing time windows and granularities, applying unit conversions, handling NULLs and data-quality rules, and formulating nested or windowed subqueries. Falcon therefore targets Chinese-specific semantics and enterprise dialects (abbreviations, business jargon, fuzzy entity references) and provides a reproducible middle ground before full production deployment by using realistic enterprise schemas, query templates, an execution comparator, and an automated evaluation pipeline for end-to-end validation.

80. 【2510.24760】Dingtalk DeepResearch: A Unified Multi Agent Framework for Adaptive Intelligence in Enterprise Environments

链接https://arxiv.org/abs/2510.24760

作者:Mengyuan Chen,Chengjun Dai,Xinyang Dong,Chengzhe Feng,Kewei Fu,Jianshe Li,Zhihan Peng,Yongqi Tong,Junshao Zhang,Hong Zhu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:present Dingtalk DeepResearch, delivering deep research, heterogeneous table reasoning, world enterprise environments, multimodal report generation

备注

点击查看摘要

Abstract:We present Dingtalk DeepResearch, a unified multi agent intelligence framework for real world enterprise environments, delivering deep research, heterogeneous table reasoning, and multimodal report generation.

81. 【2510.24729】Beyond Models: A Framework for Contextual and Cultural Intelligence in African AI Deployment

链接https://arxiv.org/abs/2510.24729

作者:Qness Ndlovu

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:development prioritizes model, prioritizes model performance, African markets requires, markets requires fundamentally, computational scale

备注: 25 pages, 4 tables. Production validation with 602 users across Zimbabwe-South Africa diaspora corridor

点击查看摘要

Abstract:While global AI development prioritizes model performance and computational scale, meaningful deployment in African markets requires fundamentally different architectural decisions. This paper introduces Contextual and Cultural Intelligence (CCI) -- a systematic framework enabling AI systems to process cultural meaning, not just data patterns, through locally relevant, emotionally intelligent, and economically inclusive design. Using design science methodology, we validate CCI through a production AI-native cross-border shopping platform serving diaspora communities. Key empirical findings: 89% of users prefer WhatsApp-based AI interaction over traditional web interfaces (n=602, chi-square=365.8, p0.001), achieving 536 WhatsApp users and 3,938 total conversations across 602 unique users in just 6 weeks, and culturally informed prompt engineering demonstrates sophisticated understanding of culturally contextualized queries, with 89% family-focused commerce patterns and natural code-switching acceptance. The CCI framework operationalizes three technical pillars: Infrastructure Intelligence (mobile-first, resilient architectures), Cultural Intelligence (multilingual NLP with social context awareness), and Commercial Intelligence (trust-based conversational commerce). This work contributes both theoretical innovation and reproducible implementation patterns, challenging Silicon Valley design orthodoxies while providing actionable frameworks for equitable AI deployment across resource-constrained markets.

82. 【2510.24724】AmarDoctor: An AI-Driven, Multilingual, Voice-Interactive Digital Health Application for Primary Care Triage and Patient Management to Bridge the Digital Health Divide for Bengali Speakers

链接https://arxiv.org/abs/2510.24724

作者:Nazmun Nahar,Ritesh Harshad Ruparel,Shariar Kabir,Sumaiya Tasnia Khan,Shyamasree Saha,Mamunur Rashid

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词:population largely underserved, study presents AmarDoctor, provide comprehensive patient, comprehensive patient triage, health app designed

备注

点击查看摘要

Abstract:This study presents AmarDoctor, a multilingual voice-interactive digital health app designed to provide comprehensive patient triage and AI-driven clinical decision support for Bengali speakers, a population largely underserved in access to digital healthcare. AmarDoctor adopts a data-driven approach to strengthen primary care delivery and enable personalized health management. While platforms such as AdaHealth, WebMD, Symptomate, and K-Health have become popular in recent years, they mainly serve European demographics and languages. AmarDoctor addresses this gap with a dual-interface system for both patients and healthcare providers, supporting three major Bengali dialects. At its core, the patient module uses an adaptive questioning algorithm to assess symptoms and guide users toward the appropriate specialist. To overcome digital literacy barriers, it integrates a voice-interactive AI assistant that navigates users through the app services. Complementing this, the clinician-facing interface incorporates AI-powered decision support that enhances workflow efficiency by generating structured provisional diagnoses and treatment recommendations. These outputs inform key services such as e-prescriptions, video consultations, and medical record management. To validate clinical accuracy, the system was evaluated against a gold-standard set of 185 clinical vignettes developed by experienced physicians. Effectiveness was further assessed by comparing AmarDoctor performance with five independent physicians using the same vignette set. Results showed AmarDoctor achieved a top-1 diagnostic precision of 81.08 percent (versus physicians average of 50.27 percent) and a top specialty recommendation precision of 91.35 percent (versus physicians average of 62.6 percent).

83. 【2510.24721】he Epistemic Suite: A Post-Foundational Diagnostic Methodology for Assessing AI Knowledge Claims

链接https://arxiv.org/abs/2510.24721

作者:Matthew Kelly

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:mistaking simulated coherence, Large Language Models, Large Language, generate fluent, plausible text

备注: 65 pages

点击查看摘要

Abstract:Large Language Models (LLMs) generate fluent, plausible text that can mislead users into mistaking simulated coherence for genuine understanding. This paper introduces the Epistemic Suite, a post-foundational diagnostic methodology for surfacing the epistemic conditions under which AI outputs are produced and received. Rather than determining truth or falsity, the Suite operates through twenty diagnostic lenses, applied by practitioners as context warrants, to reveal patterns such as confidence laundering, narrative compression, displaced authority, and temporal drift. It is grounded in three design principles: diagnosing production before evaluating claims, preferring diagnostic traction over foundational settlement, and embedding reflexivity as a structural requirement rather than an ethical ornament. When enacted, the Suite shifts language models into a diagnostic stance, producing inspectable artifacts-flags, annotations, contradiction maps, and suspension logs (the FACS bundle)-that create an intermediary layer between AI output and human judgment. A key innovation is epistemic suspension, a practitioner-enacted circuit breaker that halts continuation when warrant is exceeded, with resumption based on judgment rather than rule. The methodology also includes an Epistemic Triage Protocol and a Meta-Governance Layer to manage proportionality and link activation to relational accountability, consent, historical context, and pluralism safeguards. Unlike internalist approaches that embed alignment into model architectures (e.g., RLHF or epistemic-integrity proposals), the Suite operates externally as scaffolding, preserving expendability and refusal as safeguards rather than failures. It preserves the distinction between performance and understanding, enabling accountable deliberation while maintaining epistemic modesty.

84. 【2510.24719】Iti-Validator: A Guardrail Framework for Validating and Correcting LLM-Generated Itineraries

链接https://arxiv.org/abs/2510.24719

作者:Shravan Gadbail,Masumi Desai,Kamalakar Karlapalem

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large Language Models, Language Models, Large Language, advancement of Large, rapid advancement

备注

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) has enabled them to generate complex, multi-step plans and itineraries. However, these generated plans often lack temporal and spatial consistency, particularly in scenarios involving physical travel constraints. This research aims to study the temporal performance of different LLMs and presents a validation framework that evaluates and improves the temporal consistency of LLM-generated travel itineraries. The system employs multiple state-of-the-art LLMs to generate travel plans and validates them against real-world flight duration constraints using the AeroDataBox API. This work contributes to the understanding of LLM capabilities in handling complex temporal reasoning tasks like itinerary generation and provides a framework to rectify any temporal inconsistencies like overlapping journeys or unrealistic transit times in the itineraries generated by LLMs before the itinerary is given to the user. Our experiments reveal that while current LLMs frequently produce temporally inconsistent itineraries, these can be systematically and reliably corrected using our framework, enabling their practical deployment in large-scale travel planning.

85. 【2510.01225】Utilizing Modern Large Language Models (LLM) for Financial Trend Analysis and Digest Creation

链接https://arxiv.org/abs/2510.01225

作者:Andrei Lazarev,Dmitrii Sedov

类目:Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词:Large Language Models, Google Gemini Pro, specifically Google Gemini, automatically generating insightful, generating insightful financial

备注: This is the version of the article accepted for publication in SUMMA 2024 after peer review. The final, published version is available at IEEE Xplore: [https://doi.org/10.1109/SUMMA64428.2024.10803746](https://doi.org/10.1109/SUMMA64428.2024.10803746)

点击查看摘要

Abstract:The exponential growth of information presents a significant challenge for researchers and professionals seeking to remain at the forefront of their fields and this paper introduces an innovative framework for automatically generating insightful financial digests using the power of Large Language Models (LLMs), specifically Google's Gemini Pro. By leveraging a combination of data extraction from OpenAlex, strategic prompt engineering, and LLM-driven analysis, we demonstrate the automated example of creating a comprehensive digests that generalize key findings, identify emerging trends. This approach addresses the limitations of traditional analysis methods, enabling the efficient processing of vast amounts of unstructured data and the delivery of actionable insights in an easily digestible format. This paper describes how LLMs work in simple words and how we can use their power to help researchers and scholars save their time and stay informed about current trends. Our study includes step-by-step process, from data acquisition and JSON construction to interaction with Gemini and the automated generation of PDF reports, including a link to the project's GitHub repository for broader accessibility and further development.

86. 【2510.25577】Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models

链接https://arxiv.org/abs/2510.25577

作者:Harm Lameris,Shree Harsha Bokkahalli Satish,Joakim Gustafson,Éva Székely

类目:Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:bypassing intermediate textual, intermediate textual representations, Recent advances, raw audio, bypassing intermediate

备注: 8 pages, 3 figures, 4 tables, submitted to LREC 2026

点击查看摘要

Abstract:Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how listeners infer affective state, stance and social meaning in speech. Existing benchmarks for speech understanding largely rely on multiple-choice question answering (MCQA) formats, which are prone to failure and therefore unreliable in capturing the nuanced ways paralinguistic features influence model behaviour. In this paper, we probe SFMs through open-ended generation tasks and speech emotion recognition, evaluating whether model behaviours are consistent across different phonation inputs. We introduce a new parallel dataset featuring synthesized modifications to voice quality, designed to evaluate SFM responses to creaky and breathy voice. Our work provides the first examination of SFM sensitivity to these particular non-lexical aspects of speech perception.

信息检索

1. 【2510.25718】Retrieval-Augmented Search for Large-Scale Map Collections with ColPali

链接https://arxiv.org/abs/2510.25718

作者:Jamie Mahowald,Benjamin Charles Germain Lee

类目:Information Retrieval (cs.IR); Digital Libraries (cs.DL)

关键词:shown great promise, Multimodal approaches, approaches have shown, shown great, great promise

备注: 5 pages, 5 figures

点击查看摘要

Abstract:Multimodal approaches have shown great promise for searching and navigating digital collections held by libraries, archives, and museums. In this paper, we introduce map-RAS: a retrieval-augmented search system for historic maps. In addition to introducing our framework, we detail our publicly-hosted demo for searching 101,233 map images held by the Library of Congress. With our system, users can multimodally query the map collection via ColPali, summarize search results using Llama 3.2, and upload their own collections to perform inter-collection search. We articulate potential use cases for archivists, curators, and end-users, as well as future work with our system in both machine learning and the digital humanities. Our demo can be viewed at: this http URL.

2. 【2510.25622】MMQ-v2: Align, Denoise, and Amplify: Adaptive Behavior Mining for Semantic IDs Learning in Recommendation

链接https://arxiv.org/abs/2510.25622

作者:Yi Xu,Moyu Zhang,Chaofan Fan,Jinxin Hu,Xiaochen Li,Yu Zhang,Xiaoyi Zeng,Jing Zhang

类目:Information Retrieval (cs.IR)

关键词:unique Item Identifiers, recommender systems rely, Industrial recommender systems, Item Identifiers, recommender systems

备注

点击查看摘要

Abstract:Industrial recommender systems rely on unique Item Identifiers (ItemIDs). However, this method struggles with scalability and generalization in large, dynamic datasets that have sparse long-tail data. Content-based Semantic IDs (SIDs) address this by sharing knowledge through content quantization. However, by ignoring dynamic behavioral properties, purely content-based SIDs have limited expressive power. Existing methods attempt to incorporate behavioral information but overlook a critical distinction: unlike relatively uniform content features, user-item interactions are highly skewed and diverse, creating a vast information gap in quality and quantity between popular and long-tail items. This oversight leads to two critical limitations: (1) Noise Corruption: Indiscriminate behavior-content alignment allows collaborative noise from long-tail items to corrupt their content representations, leading to the loss of critical multimodal information. (2)Signal Obscurity: The equal-weighting scheme for SIDs fails to reflect the varying importance of different behavioral signals, making it difficult for downstream tasks to distinguish important SIDs from uninformative ones. To tackle these issues, we propose a mixture-of-quantization framework, MMQ-v2, to adaptively Align, Denoise, and Amplify multimodal information from content and behavior modalities for semantic IDs learning. The semantic IDs generated by this framework named ADA-SID. It introduces two innovations: an adaptive behavior-content alignment that is aware of information richness to shield representations from noise, and a dynamic behavioral router to amplify critical signals by applying different weights to SIDs. Extensive experiments on public and large-scale industrial datasets demonstrate ADA-SID's significant superiority in both generative and discriminative recommendation tasks.

3. 【2510.25621】FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering

链接https://arxiv.org/abs/2510.25621

作者:Mohammad Aghajani Asl,Behrooz Minaei Bidgoli

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large Language Models, Natural Language Processing, revolutionized Natural Language, Language Models, Language Processing

备注: 37 pages, 5 figures, 10 tables. Keywords: Retrieval-Augmented Generation (RAG), Question Answering (QA), Islamic Knowledge Base, Faithful AI, Persian NLP, Multi-hop Reasoning, Large Language Models (LLMs)

点击查看摘要

Abstract:The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian-speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG) systems, relying on simplistic single-pass pipelines, fall short on complex, multi-hop queries requiring multi-step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub-queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state-of-the-art performance: the system achieves a remarkable 97.0% in Negative Rejection - a 40-point improvement over baselines - and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.

4. 【2510.25488】Generalized Pseudo-Relevance Feedback

链接https://arxiv.org/abs/2510.25488

作者:Yiteng Tu,Weihang Su,Yujia Zhou,Yiqun Liu,Fen Lin,Qin Liu,Qingyao Ai

类目:Information Retrieval (cs.IR)

关键词:relevance feedback, relevance, fundamental technique, technique in information, relevance assumption

备注

点击查看摘要

Abstract:Query rewriting is a fundamental technique in information retrieval (IR). It typically employs the retrieval result as relevance feedback to refine the query and thereby addresses the vocabulary mismatch between user queries and relevant documents. Traditional pseudo-relevance feedback (PRF) and its vector-based extension (VPRF) improve retrieval performance by leveraging top-retrieved documents as relevance feedback. However, they are constructed based on two major hypotheses: the relevance assumption (top documents are relevant) and the model assumption (rewriting methods need to be designed specifically for particular model architectures). While recent large language models (LLMs)-based generative relevance feedback (GRF) enables model-free query reformulation, it either suffers from severe LLM hallucination or, again, relies on the relevance assumption to guarantee the effectiveness of rewriting quality. To overcome these limitations, we introduce an assumption-relaxed framework: \textit{Generalized Pseudo Relevance Feedback} (GPRF), which performs model-free, natural language rewriting based on retrieved documents, not only eliminating the model assumption but also reducing dependence on the relevance assumption. Specifically, we design a utility-oriented training pipeline with reinforcement learning to ensure robustness against noisy feedback. Extensive experiments across multiple benchmarks and retrievers demonstrate that GPRF consistently outperforms strong baselines, establishing it as an effective and generalizable framework for query rewriting.

5. 【2510.25428】Alibaba International E-commerce Product Search Competition DcuRAGONs Team Technical Report

链接https://arxiv.org/abs/2510.25428

作者:Thang-Long Nguyen-Ho,Minh-Khoi Pham,Hoang-Bao Le

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:E-commerce Search Competition, Multilingual E-commerce Search, Search Competition, E-commerce Search, Large Language Models

备注: Alibaba International E-commerce Product Search Competition @ CIKM 2025

点击查看摘要

Abstract:This report details our methodology and results developed for the Multilingual E-commerce Search Competition. The problem aims to recognize relevance between user queries versus product items in a multilingual context and improve recommendation performance on e-commerce platforms. Utilizing Large Language Models (LLMs) and their capabilities in other tasks, our data-centric method achieved the highest score compared to other solutions during the competition. Final leaderboard is publised at this https URL. The source code for our project is published at this https URL.

6. 【2510.25402】owards Automated Quality Assurance of Patent Specifications: A Multi-Dimensional LLM Framework

链接https://arxiv.org/abs/2510.25402

作者:Yuqian Chai,Chaochao Wang,Weilei Wang

类目:Information Retrieval (cs.IR); Computational Engineering, Finance, and Science (cs.CE)

关键词:content quality represents, patent content quality, gained prominence, content quality, quality represents

备注

点击查看摘要

Abstract:Although AI drafting tools have gained prominence in patent writing, the systematic evaluation of AI-generated patent content quality represents a significant research gap. To address this gap, We propose to evaluate patents using regulatory compliance, technical coherence, and figure-reference consistency detection modules, and then generate improvement suggestions via an integration module. The framework is validated on a comprehensive dataset comprising 80 human-authored and 80 AI-generated patents from two patent drafting tools. Evaluation is performed on 10,841 total sentences, 8,924 non-template sentences, and 554 patent figures for the three detection modules respectively, achieving balanced accuracies of 99.74%, 82.12%, and 91.2% against expert annotations. Additional analysis was conducted to examine defect distributions across patent sections, technical domains, and authoring sources. Section-based analysis indicates that figure-text consistency and technical detail precision require particular attention. Mechanical Engineering and Construction show more claim-specification inconsistencies due to complex technical documentation requirements. AI-generated patents show a significant gap compared to human-authored ones. While human-authored patents primarily contain surface-level errors like typos, AI-generated patents exhibit more structural defects in figure-text alignment and cross-references.

7. 【2510.25285】Revisiting scalable sequential recommendation with Multi-Embedding Approach and Mixture-of-Experts

链接https://arxiv.org/abs/2510.25285

作者:Qiushi Pan,Hao Wang,Guoyuan An,Luankang Zhang,Wei Guo,Yong Liu

类目:Information Retrieval (cs.IR)

关键词:essential research topic, research topic, effectively scale, essential research, recommendation systems

备注

点击查看摘要

Abstract:In recommendation systems, how to effectively scale up recommendation models has been an essential research topic. While significant progress has been made in developing advanced and scalable architectures for sequential recommendation(SR) models, there are still challenges due to items' multi-faceted characteristics and dynamic item relevance in the user context. To address these issues, we propose Fuxi-MME, a framework that integrates a multi-embedding strategy with a Mixture-of-Experts (MoE) architecture. Specifically, to efficiently capture diverse item characteristics in a decoupled manner, we decompose the conventional single embedding matrix into several lower-dimensional embedding matrices. Additionally, by substituting relevant parameters in the Fuxi Block with an MoE layer, our model achieves adaptive and specialized transformation of the enriched representations. Empirical results on public datasets show that our proposed framework outperforms several competitive baselines.

8. 【2510.25283】Measuring the Research Output and Performance of the University of Ibadan from 2014 to 2023: A Scientometric Analysis

链接https://arxiv.org/abs/2510.25283

作者:Muneer Ahmad,Undie Felicia Nkatv

类目:Digital Libraries (cs.DL); Information Retrieval (cs.IR)

关键词:employs scientometric methods, university research, university research output, study employs scientometric, University of Ibadan

备注: 16 pages, 5 figures, Research Paper

点击查看摘要

Abstract:This study employs scientometric methods to assess the research output and performance of the University of Ibadan from 2014 to 2023. By analyzing publication trends, citation patterns, and collaboration networks, the research aims to comprehensively evaluate the university's research productivity, impact, and disciplinary focus. This article's endeavors are characterized by innovation, interdisciplinary collaboration, and commitment to excellence, making the University of Ibadan a significant hub for cutting-edge research in Nigeria and beyond. The goal of the current study is to ascertain the influence of the university's research output and publication patterns between 2014 and 2023. The study focuses on the departments at the University of Ibadan that contribute the most, the best journals for publishing, the nations that collaborate, the impact of citations both locally and globally, well-known authors and their total production, and the research output broken down by year. According to the university's ten-year publication data, 7159 papers with an h-index of 75 were published between 2014 and 2023, garnering 218572 citations. Furthermore, the VOSviewer software mapping approach is used to illustrate the stenographical mapping of data through graphs. The findings of this study will contribute to understanding the university's research strengths, weaknesses, and potential areas for improvement. Additionally, the results will inform evidence-based decision-making for enhancing research strategies and policies at the University of Ibadan.

9. 【2510.25259】V-Rec: Time-Variant Convolutional Filter for Sequential Recommendation

链接https://arxiv.org/abs/2510.25259

作者:Yehjin Shin,Jeongwhan Choi,Seojin Kim,Noseong Park

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:convolutional filters, increasingly adopted, capture local sequential, Recently, local sequential patterns

备注: The 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:Recently, convolutional filters have been increasingly adopted in sequential recommendation for their ability to capture local sequential patterns. However, most of these models complement convolutional filters with self-attention. This is because convolutional filters alone, generally fixed filters, struggle to capture global interactions necessary for accurate recommendation. We propose Time-Variant Convolutional Filters for Sequential Recommendation (TV-Rec), a model inspired by graph signal processing, where time-variant graph filters capture position-dependent temporal variations in user sequences. By replacing both fixed kernels and self-attention with time-variant filters, TV-Rec achieves higher expressive power and better captures complex interaction patterns in user behavior. This design not only eliminates the need for self-attention but also reduces computation while accelerating inference. Extensive experiments on six public benchmarks show that TV-Rec outperforms state-of-the-art baselines by an average of 7.49%.

10. 【2510.25220】GReF: A Unified Generative Framework for Efficient Reranking via Ordered Multi-token Prediction

链接https://arxiv.org/abs/2510.25220

作者:Zhijie Lin,Zhuofeng Li,Chenglei Dai,Wentian Bao,Shuai Lin,Enyun Yu,Haoxiang Zhang,Liang Zhao

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:modeling intra-list correlations, plays a crucial, crucial role, role in modeling, modeling intra-list

备注: Accepted by CIKM 2025

点击查看摘要

Abstract:In a multi-stage recommendation system, reranking plays a crucial role in modeling intra-list correlations among items. A key challenge lies in exploring optimal sequences within the combinatorial space of permutations. Recent research follows a two-stage (generator-evaluator) paradigm, where a generator produces multiple feasible sequences, and an evaluator selects the best one. In practice, the generator is typically implemented as an autoregressive model. However, these two-stage methods face two main challenges. First, the separation of the generator and evaluator hinders end-to-end training. Second, autoregressive generators suffer from inference efficiency. In this work, we propose a Unified Generative Efficient Reranking Framework (GReF) to address the two primary challenges. Specifically, we introduce Gen-Reranker, an autoregressive generator featuring a bidirectional encoder and a dynamic autoregressive decoder to generate causal reranking sequences. Subsequently, we pre-train Gen-Reranker on the item exposure order for high-quality parameter initialization. To eliminate the need for the evaluator while integrating sequence-level evaluation during training for end-to-end optimization, we propose post-training the model through Rerank-DPO. Moreover, for efficient autoregressive inference, we introduce ordered multi-token prediction (OMTP), which trains Gen-Reranker to simultaneously generate multiple future items while preserving their order, ensuring practical deployment in real-time recommender systems. Extensive offline experiments demonstrate that GReF outperforms state-of-the-art reranking methods while achieving latency that is nearly comparable to non-autoregressive models. Additionally, GReF has also been deployed in a real-world video app Kuaishou with over 300 million daily active users, significantly improving online recommendation quality.

11. 【2510.25160】Model-Document Protocol for AI Search

链接https://arxiv.org/abs/2510.25160

作者:Hongjin Qian,Zheng Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:linking large language, external knowledge sources, vast external knowledge, search depends, depends on linking

备注: 10 pages

点击查看摘要

Abstract:AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.

Comments:
10 pages

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Cite as:
arXiv:2510.25160 [cs.CL]

(or
arXiv:2510.25160v2 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2510.25160

Focus to learn more

              arXiv-issued DOI via DataCite</p>
12. 【2510.25093】Continual Low-Rank Adapters for LLM-based Generative Recommender Systems

链接https://arxiv.org/abs/2510.25093

作者:Hyunsik Yoo,Ting-Wei Li,SeongKu Kang,Zhining Liu,Charlie Xu,Qilin Qi,Hanghang Tong

类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)

关键词:achieve strong performance, large language models, achieve strong, evolve over time, large language

备注

点击查看摘要

Abstract:While large language models (LLMs) achieve strong performance in recommendation, they face challenges in continual learning as users, items, and user preferences evolve over time. Existing LoRA-based continual methods primarily focus on preserving performance on previous tasks, but this overlooks the unique nature of recommendation: the goal is not to predict past preferences, and outdated preferences can even harm performance when current interests shift significantly. To address this, we propose PESO (Proximally rEgularized Single evolving lOra, a continual adaptation method for LoRA in recommendation. PESO introduces a proximal regularizer that anchors the current adapter to its most recent frozen state, enabling the model to flexibly balance adaptation and preservation, and to better capture recent user behaviors. Theoretically, we show that this proximal design provides data-aware, direction-wise guidance in the LoRA subspace. Empirically, PESO consistently outperforms existing LoRA-based continual learning methods.

13. 【2510.25025】Secure Retrieval-Augmented Generation against Poisoning Attacks

链接https://arxiv.org/abs/2510.25025

作者:Zirui Cheng,Jikai Sun,Anjun Gao,Yueyang Quan,Zhuqing Liu,Xiaohua Hu,Minghong Fang

类目:Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Large language models, natural language processing, transformed natural language, Large language, language models

备注: To appear in IEEE BigData 2025

点击查看摘要

Abstract:Large language models (LLMs) have transformed natural language processing (NLP), enabling applications from content generation to decision support. Retrieval-Augmented Generation (RAG) improves LLMs by incorporating external knowledge but also introduces security risks, particularly from data poisoning, where the attacker injects poisoned texts into the knowledge database to manipulate system outputs. While various defenses have been proposed, they often struggle against advanced attacks. To address this, we introduce RAGuard, a detection framework designed to identify poisoned texts. RAGuard first expands the retrieval scope to increase the proportion of clean texts, reducing the likelihood of retrieving poisoned content. It then applies chunk-wise perplexity filtering to detect abnormal variations and text similarity filtering to flag highly similar texts. This non-parametric approach enhances RAG security, and experiments on large-scale datasets demonstrate its effectiveness in detecting and mitigating poisoning attacks, including strong adaptive attacks.

14. 【2510.24870】Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation

链接https://arxiv.org/abs/2510.24870

作者:Alexander Martin,William Walden,Reno Kriz,Dengjia Zhang,Kate Sanders,Eugene Yang,Chihsheng Jin,Benjamin Van Durme

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:RAG, framework for retrieval-augmented, retrieval-augmented generation, multimodal RAG, multimodal RAG evaluation

备注: [this https URL](https://github.com/alexmartin1722/mirage)

点击查看摘要

Abstract:We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning intensive settings because they don't verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics -- ACLE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.

15. 【2510.24761】ODataX: A Progressive Evolution of the Open Data Protocol

链接https://arxiv.org/abs/2510.24761

作者:Anirudh Ganesh,Nitin Sood

类目:Databases (cs.DB); Information Retrieval (cs.IR); Software Engineering (cs.SE)

关键词:Open Data Protocol, Open Data, consuming RESTful APIs, rich query capabilities, Data Protocol

备注

点击查看摘要

Abstract:The Open Data Protocol (OData) provides a standardized approach for building and consuming RESTful APIs with rich query capabilities. Despite its power and maturity, OData adoption remains confined primarily to enterprise environments, particularly within Microsoft and SAP ecosystems. This paper analyzes the key barriers preventing wider OData adoption and introduces ODataX, an evolved version of the protocol designed to address these limitations. ODataX maintains backward compatibility with OData v4 while introducing progressive complexity disclosure through simplified query syntax, built-in performance guardrails via query cost estimation, and enhanced caching mechanisms. This work aims to bridge the gap between enterprise-grade query standardization and the simplicity demanded by modern web development practices.

16. 【2510.24719】Iti-Validator: A Guardrail Framework for Validating and Correcting LLM-Generated Itineraries

链接https://arxiv.org/abs/2510.24719

作者:Shravan Gadbail,Masumi Desai,Kamalakar Karlapalem

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large Language Models, Language Models, Large Language, advancement of Large, rapid advancement

备注

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) has enabled them to generate complex, multi-step plans and itineraries. However, these generated plans often lack temporal and spatial consistency, particularly in scenarios involving physical travel constraints. This research aims to study the temporal performance of different LLMs and presents a validation framework that evaluates and improves the temporal consistency of LLM-generated travel itineraries. The system employs multiple state-of-the-art LLMs to generate travel plans and validates them against real-world flight duration constraints using the AeroDataBox API. This work contributes to the understanding of LLM capabilities in handling complex temporal reasoning tasks like itinerary generation and provides a framework to rectify any temporal inconsistencies like overlapping journeys or unrealistic transit times in the itineraries generated by LLMs before the itinerary is given to the user. Our experiments reveal that while current LLMs frequently produce temporally inconsistent itineraries, these can be systematically and reliably corrected using our framework, enabling their practical deployment in large-scale travel planning.

计算机视觉

1. 【2510.25772】VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

链接https://arxiv.org/abs/2510.25772

作者:Baolu Li,Yiming Zhang,Qinghe Wang,Liqian Ma,Xiaoyu Shi,Xintao Wang,Pengfei Wan,Zhenfei Yin,Yunzhi Zhuge,Huchuan Lu,Xu Jia

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visual effects, digital media, expressive power, power of digital, remains a major

备注: Project Page URL: [this https URL](https://libaolu312.github.io/VFXMaster/)

点击查看摘要

Abstract:Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on the one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, thus limiting scalability and creation. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example. An in-context attention mask is designed to precisely decouple and inject the essential effect attributes, allowing a single unified model to master the effect imitation without information leakage. In addition, we propose an efficient one-shot effect adaptation mechanism to boost generalization capability on tough unseen effects from a single user-provided video rapidly. Extensive experiments demonstrate that our method effectively imitates various categories of effect information and exhibits outstanding generalization to out-of-domain effects. To foster future research, we will release our code, models, and a comprehensive dataset to the community.

2. 【2510.25765】FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion

链接https://arxiv.org/abs/2510.25765

作者:Chuhao Chen,Isabella Liu,Xinyue Wei,Hao Su,Minghua Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:applications in robotics, diffusion models, Score Distillation Sampling, Articulated, diffusion

备注

点击查看摘要

Abstract:Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.

3. 【2510.25760】Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

链接https://arxiv.org/abs/2510.25760

作者:Xu Zheng,Zihao Dongfang,Lutao Jiang,Boyuan Zheng,Yulong Guo,Zhenquan Zhang,Giuliano Albanese,Runyi Yang,Mengjiao Ma,Zixin Zhang,Chenfei Liao,Dingcheng Zhen,Yuanhuiyi Lyu,Yuqian Fu,Bin Ren,Linfeng Zhang,Danda Pani Paudel,Nicu Sebe,Luc Van Gool,Xuming Hu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Humans possess spatial, Humans possess, vision and sound, possess spatial reasoning, multimodal spatial reasoning

备注

点击查看摘要

Abstract:Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, codes and implementation of the open benchmarks can be found at this https URL.

4. 【2510.25739】Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation

链接https://arxiv.org/abs/2510.25739

作者:Zhi-Kai Chen,Jun-Peng Jiang,Han-Jia Ye,De-Chuan Zhan

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:slow inference due, producing high-fidelity images, inherently sequential, capable of producing, producing high-fidelity

备注

点击查看摘要

Abstract:Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.

5. 【2510.25594】Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning

链接https://arxiv.org/abs/2510.25594

作者:Arani Roy,Marco P. Apolinario,Shristi Das Biswas,Kaushik Roy

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:requires global error, global error propagation, deep neural networks, neural networks, Training deep neural

备注

点击查看摘要

Abstract:Training deep neural networks (DNNs) with backpropagation (BP) achieves state-of-the-art accuracy but requires global error propagation and full parameterization, leading to substantial memory and computational overhead. Direct Feedback Alignment (DFA) enables local, parallelizable updates with lower memory requirements but is limited by unstructured feedback and poor scalability in deeper architectures, specially convolutional neural networks. To address these limitations, we propose a structured local learning framework that operates directly on low-rank manifolds defined by the Singular Value Decomposition (SVD) of weight matrices. Each layer is trained in its decomposed form, with updates applied to the SVD components using a composite loss that integrates cross-entropy, subspace alignment, and orthogonality regularization. Feedback matrices are constructed to match the SVD structure, ensuring consistent alignment between forward and feedback pathways. Our method reduces the number of trainable parameters relative to the original DFA model, without relying on pruning or post hoc compression. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method achieves accuracy comparable to that of BP. Ablation studies confirm the importance of each loss term in the low-rank setting. These results establish local learning on low-rank manifolds as a principled and scalable alternative to full-rank gradient-based training.

6. 【2510.25590】RegionE: Adaptive Region-Aware Generation for Efficient Image Editing

链接https://arxiv.org/abs/2510.25590

作者:Pengtao Chen,Xianfang Zeng,Maosen Zhao,Mingzhu Shen,Peng Ye,Bangyin Xiang,Zhibo Wang,Wei Cheng,Gang Yu,Tao Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:received widespread attention, instruction-based image editing, IIE, widespread attention, received widespread

备注: 26 pages, 10 figures, 18 tables

点击查看摘要

Abstract:Recently, instruction-based image editing (IIE) has received widespread attention. In practice, IIE often modifies only specific regions of an image, while the remaining areas largely remain unchanged. Although these two types of regions differ significantly in generation difficulty and computational redundancy, existing IIE models do not account for this distinction, instead applying a uniform generation process across the entire image. This motivates us to propose RegionE, an adaptive, region-aware generation framework that accelerates IIE tasks without additional training. Specifically, the RegionE framework consists of three main components: 1) Adaptive Region Partition. We observed that the trajectory of unedited regions is straight, allowing for multi-step denoised predictions to be inferred in a single step. Therefore, in the early denoising stages, we partition the image into edited and unedited regions based on the difference between the final estimated result and the reference image. 2) Region-Aware Generation. After distinguishing the regions, we replace multi-step denoising with one-step prediction for unedited areas. For edited regions, the trajectory is curved, requiring local iterative denoising. To improve the efficiency and quality of local iterative generation, we propose the Region-Instruction KV Cache, which reduces computational cost while incorporating global information. 3) Adaptive Velocity Decay Cache. Observing that adjacent timesteps in edited regions exhibit strong velocity similarity, we further propose an adaptive velocity decay cache to accelerate the local denoising process. We applied RegionE to state-of-the-art IIE base models, including Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit. RegionE achieved acceleration factors of 2.57, 2.41, and 2.06. Evaluations by GPT-4o confirmed that semantic and perceptual fidelity were well preserved.

7. 【2510.25522】Comparative Study of UNet-based Architectures for Liver Tumor Segmentation in Multi-Phase Contrast-Enhanced Computed Tomography

链接https://arxiv.org/abs/2510.25522

作者:Doan-Van-Anh Ly(1),Thi-Thu-Hien Pham(2 and 3),Thanh-Hai Le(1) ((1) The Saigon International University, (2) International University, (3) Vietnam National University HCMC)

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:contrast-enhanced computed tomography, multi-phase contrast-enhanced computed, including tumor detection, liver tumor segmentation, computed tomography

备注: 27 pages, 8 figures

点击查看摘要

Abstract:Segmentation of liver structures in multi-phase contrast-enhanced computed tomography (CECT) plays a crucial role in computer-aided diagnosis and treatment planning for liver diseases, including tumor detection. In this study, we investigate the performance of UNet-based architectures for liver tumor segmentation, starting from the original UNet and extending to UNet3+ with various backbone networks. We evaluate ResNet, Transformer-based, and State-space (Mamba) backbones, all initialized with pretrained weights. Surprisingly, despite the advances in modern architecture, ResNet-based models consistently outperform Transformer- and Mamba-based alternatives across multiple evaluation metrics. To further improve segmentation quality, we introduce attention mechanisms into the backbone and observe that incorporating the Convolutional Block Attention Module (CBAM) yields the best performance. ResNetUNet3+ with CBAM module not only produced the best overlap metrics with a Dice score of 0.755 and IoU of 0.662, but also achieved the most precise boundary delineation, evidenced by the lowest HD95 distance of 77.911. The model's superiority was further cemented by its leading overall accuracy of 0.925 and specificity of 0.926, showcasing its robust capability in accurately identifying both lesion and healthy tissue. To further enhance interpretability, Grad-CAM visualizations were employed to highlight the region's most influential predictions, providing insights into its decision-making process. These findings demonstrate that classical ResNet architecture, when combined with modern attention modules, remain highly competitive for medical image segmentation tasks, offering a promising direction for liver tumor detection in clinical practice.

8. 【2510.25512】FaCT: Faithful Concept Traces for Explaining Neural Network Decisions

链接https://arxiv.org/abs/2510.25512

作者:Amin Parchami-Araghi,Sukrut Rao,Jonas Fischer,Bernt Schiele

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:global concept-level understanding, Deep networks, shown remarkable performance, range of tasks, key challenge

备注: Accepted to NeurIPS 2025; Code is available at [this https URL](https://github.com/m-parchami/FaCT)

点击查看摘要

Abstract:Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C$^2$-Score, that can be used to evaluate concept-based methods. We show that, compared to prior work, our concepts are quantitatively more consistent and users find our concepts to be more interpretable, all while retaining competitive ImageNet performance.

9. 【2510.25463】SPADE: Sparsity Adaptive Depth Estimator for Zero-Shot, Real-Time, Monocular Depth Estimation in Underwater Environments

链接https://arxiv.org/abs/2510.25463

作者:Hongjie Zhang,Gideon Billings,Stefan B. Williams

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:harsh marine conditions, infrastructure requires frequent, Underwater infrastructure requires, requires frequent inspection, marine conditions

备注

点击查看摘要

Abstract:Underwater infrastructure requires frequent inspection and maintenance due to harsh marine conditions. Current reliance on human divers or remotely operated vehicles is limited by perceptual and operational challenges, especially around complex structures or in turbid water. Enhancing the spatial awareness of underwater vehicles is key to reducing piloting risks and enabling greater autonomy. To address these challenges, we present SPADE: SParsity Adaptive Depth Estimator, a monocular depth estimation pipeline that combines pre-trained relative depth estimator with sparse depth priors to produce dense, metric scale depth maps. Our two-stage approach first scales the relative depth map with the sparse depth points, then refines the final metric prediction with our proposed Cascade Conv-Deformable Transformer blocks. Our approach achieves improved accuracy and generalisation over state-of-the-art baselines and runs efficiently at over 15 FPS on embedded hardware, promising to support practical underwater inspection and intervention. This work has been submitted to IEEE Journal of Oceanic Engineering Special Issue of AUV 2026.

10. 【2510.25440】More than a Moment: Towards Coherent Sequences of Audio Descriptions

链接https://arxiv.org/abs/2510.25440

作者:Eshika Khandelwal,Junyu Xie,Tengda Han,Max Bain,Arsha Nagrani,Andrew Zisserman,Gül Varol,Makarand Tapaswi

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:essential on-screen information, allowing visually impaired, visually impaired audiences, convey essential on-screen, Audio Descriptions

备注

点击查看摘要

Abstract:Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.

11. 【2510.25387】Instance-Level Composed Image Retrieval

链接https://arxiv.org/abs/2510.25387

作者:Bill Psomas,George Retsinas,Nikos Efthymiadis,Panagiotis Filntisis,Yannis Avrithis,Petros Maragos,Ondrej Chum,Giorgos Tolias

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:popular research direction, composed image retrieval, progress of composed, held back, absence of high-quality

备注: NeurIPS 2025

点击查看摘要

Abstract:The progress of composed image retrieval (CIR), a popular research direction in image retrieval, where a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while maintaining its challenge-comparable to retrieval among more than 40M random distractors-through a semi-automated selection of hard negatives. To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities, performing late fusion to upweight images that satisfy both queries, while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of components that are simple and intuitive. BASIC sets a new state of the art on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: this https URL.

Comments:
NeurIPS 2025

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2510.25387 [cs.CV]

(or
arXiv:2510.25387v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.25387

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
12. 【2510.25372】Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers

链接https://arxiv.org/abs/2510.25372

作者:M Yashwanth,Sharannya Ghosh,Aditay Tripathi,Anirban Chakraborty

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Visual Prompt Tuning, proven highly effective, adapting large models, Federated Prompt Tuning, Prompt Tuning

备注

点击查看摘要

Abstract:Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP) - based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via traditional federated averaging technique on the same. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.

13. 【2510.25347】3D CT-Based Coronary Calcium Assessment: A Feature-Driven Machine Learning Framework

链接https://arxiv.org/abs/2510.25347

作者:Ayman Abaid,Gianpiero Guidone,Sara Alsubai,Foziyah Alquahtani,Talha Iqbal,Ruth Sharif,Hesham Elzomor,Emiliano Bianchini,Naeif Almagal,Michael G. Madden,Faisal Sharif,Ihsan Ullah

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:coronary artery disease, Coronary artery, Coronary artery calcium, artery disease, scoring plays

备注: 11 pages, 2 Figures, MICCAI AMAI 2025 workshop, to be published in Volume 16206 of the Lecture Notes in Computer Science series

点击查看摘要

Abstract:Coronary artery calcium (CAC) scoring plays a crucial role in the early detection and risk stratification of coronary artery disease (CAD). In this study, we focus on non-contrast coronary computed tomography angiography (CCTA) scans, which are commonly used for early calcification detection in clinical settings. To address the challenge of limited annotated data, we propose a radiomics-based pipeline that leverages pseudo-labeling to generate training labels, thereby eliminating the need for expert-defined segmentations. Additionally, we explore the use of pretrained foundation models, specifically CT-FM and RadImageNet, to extract image features, which are then used with traditional classifiers. We compare the performance of these deep learning features with that of radiomics features. Evaluation is conducted on a clinical CCTA dataset comprising 182 patients, where individuals are classified into two groups: zero versus non-zero calcium scores. We further investigate the impact of training on non-contrast datasets versus combined contrast and non-contrast datasets, with testing performed only on non contrast scans. Results show that radiomics-based models significantly outperform CNN-derived embeddings from foundation models (achieving 84% accuracy and p0.05), despite the unavailability of expert annotations.

14. 【2510.25345】Informative Sample Selection Model for Skeleton-based Action Recognition with Limited Training Samples

链接https://arxiv.org/abs/2510.25345

作者:Zhigang Tu,Zhengbo Zhang,Jia Gong,Junsong Yuan,Bo Du

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Skeleton-based human action, human action recognition, action recognition, classify human skeletal, Skeleton-based human

备注: Accepted by IEEE Transactions on Image Processing (TIP), 2025

点击查看摘要

Abstract:Skeleton-based human action recognition aims to classify human skeletal sequences, which are spatiotemporal representations of actions, into predefined categories. To reduce the reliance on costly annotations of skeletal sequences while maintaining competitive recognition accuracy, the task of 3D Action Recognition with Limited Training Samples, also known as semi-supervised 3D Action Recognition, has been proposed. In addition, active learning, which aims to proactively select the most informative unlabeled samples for annotation, has been explored in semi-supervised 3D Action Recognition for training sample selection. Specifically, researchers adopt an encoder-decoder framework to embed skeleton sequences into a latent space, where clustering information, combined with a margin-based selection strategy using a multi-head mechanism, is utilized to identify the most informative sequences in the unlabeled set for annotation. However, the most representative skeleton sequences may not necessarily be the most informative for the action recognizer, as the model may have already acquired similar knowledge from previously seen skeleton samples. To solve it, we reformulate Semi-supervised 3D action recognition via active learning from a novel perspective by casting it as a Markov Decision Process (MDP). Built upon the MDP framework and its training paradigm, we train an informative sample selection model to intelligently guide the selection of skeleton sequences for annotation. To enhance the representational capacity of the factors in the state-action pairs within our method, we project them from Euclidean space to hyperbolic space. Furthermore, we introduce a meta tuning strategy to accelerate the deployment of our method in real-world scenarios. Extensive experiments on three 3D action recognition benchmarks demonstrate the effectiveness of our method.

15. 【2510.25332】StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA

链接https://arxiv.org/abs/2510.25332

作者:Yuhang Hu,Zhenyu Yang,Shihan Wang,Shengsheng Qian,Bin Wen,Fan Yang,Tingting Gao,Changsheng Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:video applications demands, Video Question Answering, applications demands multimodal, streaming video applications, current Video Question

备注

点击查看摘要

Abstract:The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) Static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) The absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, We introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT and its construction toolkit can be accessed at this https URL.

16. 【2510.25327】MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding

链接https://arxiv.org/abs/2510.25327

作者:Runxi Huang,Mingxuan Yu,Mingyu Tsoi,Xiaomin Ouyang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:resource-constrained edge devices, Real-time multimodal inference, human-computer interaction, autonomous driving, mobile health

备注: Accepted by SenSys 2026

点击查看摘要

Abstract:Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.

17. 【2510.25318】Prototype-Driven Adaptation for Few-Shot Object Detection

链接https://arxiv.org/abs/2510.25318

作者:Yushen Huang,Zhiming Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Few-shot object detection, Few-shot object, object detection, suffers from base-class, base-class bias

备注: 7 pages,1 figure,2 tables,Preprint

点击查看摘要

Abstract:Few-shot object detection (FSOD) often suffers from base-class bias and unstable calibration when only a few novel samples are available. We propose Prototype-Driven Alignment (PDA), a lightweight, plug-in metric head for DeFRCN that provides a prototype-based "second opinion" complementary to the linear classifier. PDA maintains support-only prototypes in a learnable identity-initialized projection space and optionally applies prototype-conditioned RoI alignment to reduce geometric mismatch. During fine-tuning, prototypes can be adapted via exponential moving average(EMA) updates on labeled foreground RoIs-without introducing class-specific parameters-and are frozen at inference to ensure strict protocol compliance. PDA employs a best-of-K matching scheme to capture intra-class multi-modality and temperature-scaled fusion to combine metric similarities with detector logits. Experiments on VOC FSOD and GFSOD benchmarks show that PDA consistently improves novel-class performance with minimal impact on base classes and negligible computational overhead.

18. 【2510.25314】Seeing Clearly and Deeply: An RGBD Imaging Approach with a Bio-inspired Monocentric Design

链接https://arxiv.org/abs/2510.25314

作者:Zongxi Yu,Xiaolong Qian,Shaohua Gao,Qi Jiang,Yao Gao,Kailun Yang,Kaiwei Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV); Optics (physics.optics)

关键词:unreliable semantic priors, ill-posed problem reliant, Monocular Depth Estimation, struggle with RGB, RGB sharpness

备注: The source code will be publicly available at [this https URL](https://github.com/ZongxiYu-ZJU/BMI)

点击查看摘要

Abstract:Achieving high-fidelity, compact RGBD imaging presents a dual challenge: conventional compact optics struggle with RGB sharpness across the entire depth-of-field, while software-only Monocular Depth Estimation (MDE) is an ill-posed problem reliant on unreliable semantic priors. While deep optics with elements like DOEs can encode depth, they introduce trade-offs in fabrication complexity and chromatic aberrations, compromising simplicity. To address this, we first introduce a novel bio-inspired all-spherical monocentric lens, around which we build the Bionic Monocentric Imaging (BMI) framework, a holistic co-design. This optical design naturally encodes depth into its depth-varying Point Spread Functions (PSFs) without requiring complex diffractive or freeform elements. We establish a rigorous physically-based forward model to generate a synthetic dataset by precisely simulating the optical degradation process. This simulation pipeline is co-designed with a dual-head, multi-scale reconstruction network that employs a shared encoder to jointly recover a high-fidelity All-in-Focus (AiF) image and a precise depth map from a single coded capture. Extensive experiments validate the state-of-the-art performance of the proposed framework. In depth estimation, the method attains an Abs Rel of 0.026 and an RMSE of 0.130, markedly outperforming leading software-only approaches and other deep optics systems. For image restoration, the system achieves an SSIM of 0.960 and a perceptual LPIPS score of 0.082, thereby confirming a superior balance between image fidelity and depth accuracy. This study illustrates that the integration of bio-inspired, fully spherical optics with a joint reconstruction algorithm constitutes an effective strategy for addressing the intrinsic challenges in high-performance compact RGBD imaging. Source code will be publicly available at this https URL.

19. 【2510.25301】GaTector+: A Unified Head-free Framework for Gaze Object and Gaze Following Prediction

链接https://arxiv.org/abs/2510.25301

作者:Yang Jin,Guangyu Guo,Binglu Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaze object detection, interpreting human gaze, human gaze behavior, Gaze, Gaze object

备注

点击查看摘要

Abstract:Gaze object detection and gaze following are fundamental tasks for interpreting human gaze behavior or intent. However, most previous methods usually solve these two tasks separately, and their prediction of gaze objects and gaze following typically depend on head-related prior knowledge during both the training phase and real-world deployment. This dependency necessitates an auxiliary network to extract head location, thus precluding joint optimization across the entire system and constraining the practical applicability. To this end, we propose GaTector+, a unified framework for gaze object detection and gaze following, which eliminates the dependence on the head-related priors during inference. Specifically, GaTector+ uses an expanded specific-general-specific feature extractor that leverages a shared backbone, which extracts general features for gaze following and object detection using the shared backbone while using specific blocks before and after the shared backbone to better consider the specificity of each sub-task. To obtain head-related knowledge without prior information, we first embed a head detection branch to predict the head of each person. Then, before regressing the gaze point, a head-based attention mechanism is proposed to fuse the sense feature and gaze feature with the help of head location. Since the suboptimization of the gaze point heatmap leads to the performance bottleneck, we propose an attention supervision mechanism to accelerate the learning of the gaze heatmap. Finally, we propose a novel evaluation metric, mean Similarity over Candidates (mSoC), for gaze object detection, which is more sensitive to variations between bounding boxes. The experimental results on multiple benchmark datasets demonstrate the effectiveness of our model in both gaze object detection and gaze following tasks.

20. 【2510.25279】Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation

链接https://arxiv.org/abs/2510.25279

作者:Yuyang Huang,Yabo Chen,Junyu Zhou,Wenrui Dai,Xiaopeng Zhang,Junni Zou,Hongkai Xiong,Qi Tian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Source-free domain adaptation, tackles domain shifts, SFDA methods, Source-free domain, pre-trained source model

备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Source-free domain adaptation (SFDA) is a challenging task that tackles domain shifts using only a pre-trained source model and unlabeled target data. Existing SFDA methods are restricted by the fundamental limitation of source-target domain discrepancy. Non-generation SFDA methods suffer from unreliable pseudo-labels in challenging scenarios with large domain discrepancies, while generation-based SFDA methods are evidently degraded due to enlarged domain discrepancies in creating pseudo-source data. To address this limitation, we propose a novel generation-based framework named Diffusion-Driven Progressive Target Manipulation (DPTM) that leverages unlabeled target data as references to reliably generate and progressively refine a pseudo-target domain for SFDA. Specifically, we divide the target samples into a trust set and a non-trust set based on the reliability of pseudo-labels to sufficiently and reliably exploit their information. For samples from the non-trust set, we develop a manipulation strategy to semantically transform them into the newly assigned categories, while simultaneously maintaining them in the target distribution via a latent diffusion model. Furthermore, we design a progressive refinement mechanism that progressively reduces the domain discrepancy between the pseudo-target domain and the real target domain via iterative refinement. Experimental results demonstrate that DPTM outperforms existing methods by a large margin and achieves state-of-the-art performance on four prevailing SFDA benchmark datasets with different scales. Remarkably, DPTM can significantly enhance the performance by up to 18.6% in scenarios with large source-target gaps.

21. 【2510.25268】SynHLMA:Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation

链接https://arxiv.org/abs/2510.25268

作者:Wang zhi,Yuyan Liu,Liu Liu,Li Zhang,Ruixuan Lu,Dan Guo

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:widely studied topic, Generating hand grasps, Generating hand, HAOI, hand

备注

点击查看摘要

Abstract:Generating hand grasps with language instructions is a widely studied topic that benefits from embodied AI and VR/AR applications. While transferring into hand articulatied object interaction (HAOI), the hand grasps synthesis requires not only object functionality but also long-term manipulation sequence along the object deformation. This paper proposes a novel HAOI sequence generation framework SynHLMA, to synthesize hand language manipulation for articulated objects. Given a complete point cloud of an articulated object, we utilize a discrete HAOI representation to model each hand object interaction frame. Along with the natural language embeddings, the representations are trained by an HAOI manipulation language model to align the grasping process with its language description in a shared representation space. A joint-aware loss is employed to ensure hand grasps follow the dynamic variations of articulated object joints. In this way, our SynHLMA achieves three typical hand manipulation tasks for articulated objects of HAOI generation, HAOI prediction and HAOI interpolation. We evaluate SynHLMA on our built HAOI-lang dataset and experimental results demonstrate the superior hand grasp sequence generation performance comparing with state-of-the-art. We also show a robotics grasp application that enables dexterous grasps execution from imitation learning using the manipulation sequence provided by our SynHLMA. Our codes and datasets will be made publicly available.

22. 【2510.25263】LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation

链接https://arxiv.org/abs/2510.25263

作者:Yang Miao,Jan-Nico Zaech,Xi Wang,Fabien Despinoy,Danda Pani Paudel,Luc Van Gool

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Model, Multimodal Large Language, Multimodal Large, Language Model, Large Language

备注: 10 pages, 5 figures, 14 tables, Neurips 2025

点击查看摘要

Abstract:We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM driven part query refinement strategy. The code will be released here.

23. 【2510.25257】RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models

链接https://arxiv.org/abs/2510.25257

作者:Zijun Liao,Yian Zhao,Xin Shan,Yu Yan,Chang Liu,Lei Lu,Xiangyang Ji,Jie Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved substantial progress, meticulously designed architectures, optimization strategies, achieved substantial, substantial progress

备注

点击查看摘要

Abstract:Real-time object detection has achieved substantial progress through meticulously designed architectures and optimization strategies. However, the pursuit of high-speed inference via lightweight network designs often leads to degraded feature representation, which hinders further performance improvements and practical on-device deployment. In this paper, we propose a cost-effective and highly adaptable distillation framework that harnesses the rapidly evolving capabilities of Vision Foundation Models (VFMs) to enhance lightweight object detectors. Given the significant architectural and learning objective disparities between VFMs and resource-constrained detectors, achieving stable and task-aligned semantic transfer is challenging. To address this, on one hand, we introduce a Deep Semantic Injector (DSI) module that facilitates the integration of high-level representations from VFMs into the deep layers of the detector. On the other hand, we devise a Gradient-guided Adaptive Modulation (GAM) strategy, which dynamically adjusts the intensity of semantic transfer based on gradient norm ratios. Without increasing deployment and inference overhead, our approach painlessly delivers striking and consistent performance gains across diverse DETR-based models, underscoring its practical utility for real-time detection. Our new model family, RT-DETRv4, achieves state-of-the-art results on COCO, attaining AP scores of 49.7/53.5/55.4/57.0 at corresponding speeds of 273/169/124/78 FPS.

24. 【2510.25239】Mapping and Classification of Trees Outside Forests using Deep Learning

链接https://arxiv.org/abs/2510.25239

作者:Moritz Lucas,Hamid Ebrahimy,Viacheslav Barkov,Ralf Pecenka,Kai-Uwe Kühnberger,Björn Waske

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:sequestering carbon, play an important, supporting biodiversity, regulating microclimates, important role

备注

点击查看摘要

Abstract:Trees Outside Forests (TOF) play an important role in agricultural landscapes by supporting biodiversity, sequestering carbon, and regulating microclimates. Yet, most studies have treated TOF as a single class or relied on rigid rule-based thresholds, limiting ecological interpretation and adaptability across regions. To address this, we evaluate deep learning for TOF classification using a newly generated dataset and high-resolution aerial imagery from four agricultural landscapes in Germany. Specifically, we compare convolutional neural networks (CNNs), vision transformers, and hybrid CNN-transformer models across six semantic segmentation architectures (ABCNet, LSKNet, FT-UNetFormer, DC-Swin, BANet, and U-Net) to map four categories of woody vegetation: Forest, Patch, Linear, and Tree, derived from previous studies and governmental products. Overall, the models achieved good classification accuracy across the four landscapes, with the FT-UNetFormer performing best (mean Intersection-over-Union 0.74; mean F1 score 0.84), underscoring the importance of spatial context understanding in TOF mapping and classification. Our results show good results for Forest and Linear class and reveal challenges particularly in classifying complex structures with high edge density, notably the Patch and Tree class. Our generalization experiments highlight the need for regionally diverse training data to ensure reliable large-scale mapping. The dataset and code are openly available at this https URL

25. 【2510.25238】VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations

链接https://arxiv.org/abs/2510.25238

作者:Qianqian Qiao,DanDan Zheng,Yihang Bo,Bao Peng,Heng Huang,Longteng Jiang,Huaye Wang,Jingdong Chen,Jun Zhou,Xin Jin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:integrates computer vision, multimedia computing, integrates computer, human cognition, vital area

备注

点击查看摘要

Abstract:Video aesthetic assessment, a vital area in multimedia computing, integrates computer vision with human cognition. Its progress is limited by the lack of standardized datasets and robust models, as the temporal dynamics of video and multimodal fusion challenges hinder direct application of image-based methods. This study introduces VADB, the largest video aesthetic database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions, including overall and attribute-specific aesthetic scores, rich language comments and objective tags. We propose VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks. The dataset and source code are available at this https URL.

26. 【2510.25237】DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis

链接https://arxiv.org/abs/2510.25237

作者:Yinqi Cai,Jichang Li,Zhaolun Li,Weikai Chen,Rushi Lan,Xi Xie,Xiaonan Luo,Guanbin Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:raising significant concerns, manipulate face videos, Recent advances, deep generative models, face videos

备注: ICCV 2025

点击查看摘要

Abstract:Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.

27. 【2510.25234】Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation

链接https://arxiv.org/abs/2510.25234

作者:Yuxiang Mao,Zhijie Zhang,Zhiheng Zhang,Jiawei Liu,Chen Zeng,Shihong Xia

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

关键词:conveying human emotions, fundamental to conveying, conveying human, human emotions, animation

备注: 18 pages, 6 figures, accepted to ICXR 2025 conference

点击查看摘要

Abstract:Expressions are fundamental to conveying human emotions. With the rapid advancement of AI-generated content (AIGC), realistic and expressive 3D facial animation has become increasingly crucial. Despite recent progress in speech-driven lip-sync for talking-face animation, generating emotionally expressive talking faces remains underexplored. A major obstacle is the scarcity of real emotional 3D talking-face datasets due to the high cost of data capture. To address this, we model facial animation driven by both speech and emotion as a linear additive problem. Leveraging a 3D talking-face dataset with neutral expressions (VOCAset) and a dataset of 3D expression sequences (Florence4D), we jointly learn a set of blendshapes driven by speech and emotion. We introduce a sparsity constraint loss to encourage disentanglement between the two types of blendshapes while allowing the model to capture inherent secondary cross-domain deformations present in the training data. The learned blendshapes can be further mapped to the expression and jaw pose parameters of the FLAME model, enabling the animation of 3D Gaussian avatars. Qualitative and quantitative experiments demonstrate that our method naturally generates talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies further show that our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.

28. 【2510.25229】Balanced conic rectified flow

链接https://arxiv.org/abs/2510.25229

作者:Kim Shin Seong,Mingi Kwon,Jaeseok Jeong,Youngjung Uh

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:ordinary differential equation, smooth transport mappings, learns smooth transport, Rectified flow, differential equation

备注: Main paper: 10 pages (total 40 pages including appendix), 5 figures. Accepted at NeurIPS 2025 (Poster). Acknowledgment: Supported by the NRF of Korea (RS-2023-00223062) and IITP grants (RS-2020-II201361, RS-2024-00439762) funded by the Korean government (MSIT)

点击查看摘要

Abstract:Rectified flow is a generative model that learns smooth transport mappings between two distributions through an ordinary differential equation (ODE). Unlike diffusion-based generative models, which require costly numerical integration of a generative ODE to sample images with state-of-the-art quality, rectified flow uses an iterative process called reflow to learn smooth and straight ODE paths. This allows for relatively simple and efficient generation of high-quality images. However, rectified flow still faces several challenges. 1) The reflow process requires a large number of generative pairs to preserve the target distribution, leading to significant computational costs. 2) Since the model is typically trained using only generated image pairs, its performance heavily depends on the 1-rectified flow model, causing it to become biased towards the generated data. In this work, we experimentally expose the limitations of the original rectified flow and propose a novel approach that incorporates real images into the training process. By preserving the ODE paths for real images, our method effectively reduces reliance on large amounts of generated data. Instead, we demonstrate that the reflow process can be conducted efficiently using a much smaller set of generated and real images. In CIFAR-10, we achieved significantly better FID scores, not only in one-step generation but also in full-step simulations, while using only of the generative pairs compared to the original method. Furthermore, our approach induces straighter paths and avoids saturation on generated images during reflow, leading to more robust ODE learning while preserving the distribution of real images.

Comments:
Main paper: 10 pages (total 40 pages including appendix), 5 figures. Accepted at NeurIPS 2025 (Poster). Acknowledgment: Supported by the NRF of Korea (RS-2023-00223062) and IITP grants (RS-2020-II201361, RS-2024-00439762) funded by the Korean government (MSIT)

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

MSC classes:
68T07, 68T45, 65C20

ACMclasses:
I.2.10; I.4.9; I.2.6

Cite as:
arXiv:2510.25229 [cs.CV]

(or
arXiv:2510.25229v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.25229

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Journalreference:
Proceedings of the 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

29. 【2510.25227】Aligning What You Separate: Denoised Patch Mixing for Source-Free Domain Adaptation in Medical Image Segmentation

链接https://arxiv.org/abs/2510.25227

作者:Quang-Khai Bui-Tran,Thanh-Huy Nguyen,Hoang-Thien Nguyen,Ba-Thinh Lam,Nguyen Lan Vi Vu,Phat K. Huynh,Ulas Bagci,Min Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:ignore sample difficulty, Hard Sample Selection, Source-Free Domain Adaptation, Denoised Patch Mixing, leverages Hard Sample

备注: 5 pages, 3 figures

点击查看摘要

Abstract:Source-Free Domain Adaptation (SFDA) is emerging as a compelling solution for medical image segmentation under privacy constraints, yet current approaches often ignore sample difficulty and struggle with noisy supervision under domain shift. We present a new SFDA framework that leverages Hard Sample Selection and Denoised Patch Mixing to progressively align target distributions. First, unlabeled images are partitioned into reliable and unreliable subsets through entropy-similarity analysis, allowing adaptation to start from easy samples and gradually incorporate harder ones. Next, pseudo-labels are refined via Monte Carlo-based denoising masks, which suppress unreliable pixels and stabilize training. Finally, intra- and inter-domain objectives mix patches between subsets, transferring reliable semantics while mitigating noise. Experiments on benchmark datasets show consistent gains over prior SFDA and UDA methods, delivering more accurate boundary delineation and achieving state-of-the-art Dice and ASSD scores. Our study highlights the importance of progressive adaptation and denoised supervision for robust segmentation under domain shift.

30. 【2510.25221】MSF-Net: Multi-Stage Feature Extraction and Fusion for Robust Photometric Stereo

链接https://arxiv.org/abs/2510.25221

作者:Shiyu Qin,Zhihao Cai,Kaixuan Wang,Lin Qi,Junyu Dong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:shading cues derived, Photometric stereo, lighting conditions, technique aimed, aimed at determining

备注

点击查看摘要

Abstract:Photometric stereo is a technique aimed at determining surface normals through the utilization of shading cues derived from images taken under different lighting conditions. However, existing learning-based approaches often fail to accurately capture features at multiple stages and do not adequately promote interaction between these features. Consequently, these models tend to extract redundant features, especially in areas with intricate details such as wrinkles and edges. To tackle these issues, we propose MSF-Net, a novel framework for extracting information at multiple stages, paired with selective update strategy, aiming to extract high-quality feature information, which is critical for accurate normal construction. Additionally, we have developed a feature fusion module to improve the interplay among different features. Experimental results on the DiLiGenT benchmark show that our proposed MSF-Net significantly surpasses previous state-of-the-art methods in the accuracy of surface normal estimation.

31. 【2510.25210】U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching

链接https://arxiv.org/abs/2510.25210

作者:Junsheng Zhou,Xingyu Shi,Haichuan Song,Yi Fang,Yu-Shen Liu,Zhizhong Han

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:highly negative impact, Point clouds captured, point cloud, downstream tasks, surface reconstruction

备注: Accepted by NeurIPS 2025. Project page: [this https URL](https://gloriasze.github.io/U-CAN/)

点击查看摘要

Abstract:Point clouds captured by scanning sensors are often perturbed by noise, which have a highly negative impact on downstream tasks (e.g. surface reconstruction and shape understanding). Previous works mostly focus on training neural networks with noisy-clean point cloud pairs for learning denoising priors, which requires extensively manual efforts. In this work, we introduce U-CAN, an Unsupervised framework for point cloud denoising with Consistency-Aware Noise2Noise matching. Specifically, we leverage a neural network to infer a multi-step denoising path for each point of a shape or scene with a noise to noise matching scheme. We achieve this by a novel loss which enables statistical reasoning on multiple noisy point cloud observations. We further introduce a novel constraint on the denoised geometry consistency for learning consistency-aware denoising patterns. We justify that the proposed constraint is a general term which is not limited to 3D domain and can also contribute to the area of 2D image denoising. Our evaluations under the widely used benchmarks in point cloud denoising, upsampling and image denoising show significant improvement over the state-of-the-art unsupervised methods, where U-CAN also produces comparable results with the supervised methods.

32. 【2510.25199】AI-Powered Early Detection of Critical Diseases using Image Processing and Audio Analysis

链接https://arxiv.org/abs/2510.25199

作者:Manisha More,Kavya Bhand,Kaustubh Mukdam,Kavya Sharma,Manas Kawtikwar,Hridayansh Kaware,Prajwal Kavhar

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:reduce treatment costs, significantly improve patient, improve patient survival, treatment costs, diagnosis of critical

备注

点击查看摘要

Abstract:Early diagnosis of critical diseases can significantly improve patient survival and reduce treatment costs. However, existing diagnostic techniques are often costly, invasive, and inaccessible in low-resource regions. This paper presents a multimodal artificial intelligence (AI) diagnostic framework integrating image analysis, thermal imaging, and audio signal processing for early detection of three major health conditions: skin cancer, vascular blood clots, and cardiopulmonary abnormalities. A fine-tuned MobileNetV2 convolutional neural network was trained on the ISIC 2019 dataset for skin lesion classification, achieving 89.3% accuracy, 91.6% sensitivity, and 88.2% specificity. A support vector machine (SVM) with handcrafted features was employed for thermal clot detection, achieving 86.4% accuracy (AUC = 0.89) on synthetic and clinical data. For cardiopulmonary analysis, lung and heart sound datasets from PhysioNet and Pascal were processed using Mel-Frequency Cepstral Coefficients (MFCC) and classified via Random Forest, reaching 87.2% accuracy and 85.7% sensitivity. Comparative evaluation against state-of-the-art models demonstrates that the proposed system achieves competitive results while remaining lightweight and deployable on low-cost devices. The framework provides a promising step toward scalable, real-time, and accessible AI-based pre-diagnostic healthcare solutions.

33. 【2510.25184】Mask-Robust Face Verification for Online Learning via YOLOv5 and Residual Networks

链接https://arxiv.org/abs/2510.25184

作者:Zhifeng Wang,Minghui Wang,Chunyan Zeng,Jialong Yao,Yang Yang,Hongmin Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:transformative phase characterized, ushered school education, artificial intelligence, heightened intelligence, rapid advancement

备注: 9 pages, 10 figures

点击查看摘要

Abstract:In the contemporary landscape, the fusion of information technology and the rapid advancement of artificial intelligence have ushered school education into a transformative phase characterized by digitization and heightened intelligence. Concurrently, the global paradigm shift caused by the Covid-19 pandemic has catalyzed the evolution of e-learning, accentuating its significance. Amidst these developments, one pivotal facet of the online education paradigm that warrants attention is the authentication of identities within the digital learning sphere. Within this context, our study delves into a solution for online learning authentication, utilizing an enhanced convolutional neural network architecture, specifically the residual network model. By harnessing the power of deep learning, this technological approach aims to galvanize the ongoing progress of online education, while concurrently bolstering its security and stability. Such fortification is imperative in enabling online education to seamlessly align with the swift evolution of the educational landscape. This paper's focal proposition involves the deployment of the YOLOv5 network, meticulously trained on our proprietary dataset. This network is tasked with identifying individuals' faces culled from images captured by students' open online cameras. The resultant facial information is then channeled into the residual network to extract intricate features at a deeper level. Subsequently, a comparative analysis of Euclidean distances against students' face databases is performed, effectively ascertaining the identity of each student.

34. 【2510.25175】st-Time Adaptive Object Detection with Foundation Model

链接https://arxiv.org/abs/2510.25175

作者:Yingjie Gao,Yanan Zhang,Zhi Cai,Di Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:real-world application scenarios, attracted increasing attention, increasing attention due, adaptive object detection, test-time adaptive object

备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:In recent years, test-time adaptive object detection has attracted increasing attention due to its unique advantages in online domain adaptation, which aligns more closely with real-world application scenarios. However, existing approaches heavily rely on source-derived statistical characteristics while making the strong assumption that the source and target domains share an identical category space. In this paper, we propose the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data entirely and overcomes traditional closed-set limitations. Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for vision-language detector-driven test-time adaptation, which incorporates text and visual prompt tuning to adapt both language and vision representation spaces on the test data in a parameter-efficient manner. Correspondingly, we propose a Test-time Warm-start strategy tailored for the visual prompts to effectively preserve the representation capability of the vision branch. Furthermore, to guarantee high-quality pseudo-labels in every test batch, we maintain an Instance Dynamic Memory (IDM) module that stores high-quality pseudo-labels from previous test samples, and propose two novel strategies-Memory Enhancement and Memory Hallucination-to leverage IDM's high-quality instances for enhancing original predictions and hallucinating images without available pseudo-labels, respectively. Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods, and can adapt to arbitrary cross-domain and cross-category target data. Code is available at this https URL.

35. 【2510.25174】Classifier Enhancement Using Extended Context and Domain Experts for Semantic Segmentation

链接https://arxiv.org/abs/2510.25174

作者:Huadong Tang,Youpeng Zhao,Min Xu,Jun Wang,Qiang Wu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Prevalent semantic segmentation, methods generally adopt, Toggle, Prevalent semantic, segmentation methods generally

备注: Accepted at IEEE TRANSACTIONS ON MULTIMEDIA (TMM)

点击查看摘要

Abstract:Prevalent semantic segmentation methods generally adopt a vanilla classifier to categorize each pixel into specific classes. Although such a classifier learns global information from the training data, this information is represented by a set of fixed parameters (weights and biases). However, each image has a different class distribution, which prevents the classifier from addressing the unique characteristics of individual images. At the dataset level, class imbalance leads to segmentation results being biased towards majority classes, limiting the model's effectiveness in identifying and segmenting minority class regions. In this paper, we propose an Extended Context-Aware Classifier (ECAC) that dynamically adjusts the classifier using global (dataset-level) and local (image-level) contextual information. Specifically, we leverage a memory bank to learn dataset-level contextual information of each class, incorporating the class-specific contextual information from the current image to improve the classifier for precise pixel labeling. Additionally, a teacher-student network paradigm is adopted, where the domain expert (teacher network) dynamically adjusts contextual information with ground truth and transfers knowledge to the student network. Comprehensive experiments illustrate that the proposed ECAC can achieve state-of-the-art performance across several datasets, including ADE20K, COCO-Stuff10K, and Pascal-Context.

Comments:
Accepted at IEEE TRANSACTIONS ON MULTIMEDIA (TMM)

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2510.25174 [cs.CV]

(or
arXiv:2510.25174v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.25174

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Huadong Tang [view email] [v1]
Wed, 29 Oct 2025 05:17:13 UTC (38,202 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled Classifier Enhancement Using Extended Context and Domain Experts for Semantic Segmentation, by Huadong Tang and 4 other authorsView PDFHTML (experimental)TeX Source

view license

Current browse context: cs.CV

prev

|
next

new
|
recent
| 2025-10

Change to browse by:

cs

References Citations

NASA ADSGoogle Scholar
Semantic Scholar

export BibTeX citation
Loading…

BibTeX formatted citation

loading…

Data provided by:

Bookmark

checked=“checked”>
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

Links to Code Toggle

Papers with Code (What is Papers with Code?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)

Related Papers

Recommenders and Search Tools

Link to Influence Flower

Influence Flower (What are Influence Flowers?)

Core recommender toggle

CORE Recommender (What is CORE?)

Author
Venue
Institution
Topic

    About arXivLabs

arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.

Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)

mathjaxToggle();

About
Help

contact arXivClick here to contact arXiv
Contact

subscribe to arXiv mailingsClick here to subscribe
Subscribe

Copyright
Privacy Policy

Web Accessibility Assistance

arXiv Operational Status

36. 【2510.25173】$D^2GS$: Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction

链接https://arxiv.org/abs/2510.25173

作者:Kejing Xia,Jidong Jia,Ke Jin,Yucai Bai,Li Sun,Dacheng Tao,Youjian Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:urban scene reconstruction, shown great potential, Gaussian Splatting, urban scene, scene reconstruction

备注

点击查看摘要

Abstract:Recently, Gaussian Splatting (GS) has shown great potential for urban scene reconstruction in the field of autonomous driving. However, current urban scene reconstruction methods often depend on multimodal sensors as inputs, \textit{i.e.} LiDAR and images. Though the geometry prior provided by LiDAR point clouds can largely mitigate ill-posedness in reconstruction, acquiring such accurate LiDAR data is still challenging in practice: i) precise spatiotemporal calibration between LiDAR and other sensors is required, as they may not capture data simultaneously; ii) reprojection errors arise from spatial misalignment when LiDAR and cameras are mounted at different locations. To avoid the difficulty of acquiring accurate LiDAR depth, we propose $D^2GS$, a LiDAR-free urban scene reconstruction framework. In this work, we obtain geometry priors that are as effective as LiDAR while being denser and more accurate. $\textbf{First}$, we initialize a dense point cloud by back-projecting multi-view metric depth predictions. This point cloud is then optimized by a Progressive Pruning strategy to improve the global consistency. $\textbf{Second}$, we jointly refine Gaussian geometry and predicted dense metric depth via a Depth Enhancer. Specifically, we leverage diffusion priors from a depth foundation model to enhance the depth maps rendered by Gaussians. In turn, the enhanced depths provide stronger geometric constraints during Gaussian training. $\textbf{Finally}$, we improve the accuracy of ground geometry by constraining the shape and normal attributes of Gaussians within road regions. Extensive experiments on the Waymo dataset demonstrate that our method consistently outperforms state-of-the-art methods, producing more accurate geometry even when compared with those using ground-truth LiDAR data.

37. 【2510.25166】A Study on Inference Latency for Vision Transformers on Mobile Devices

链接https://arxiv.org/abs/2510.25166

作者:Zhuojin Li,Marco Paolieri,Leana Golubchik

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Performance (cs.PF)

关键词:mobile devices, real-world vision transformers, vision transformers, machine learning techniques, computer vision

备注: To appear in Springer LNICST, volume 663, Proceedings of VALUETOOLS 2024

点击查看摘要

Abstract:Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.

38. 【2510.25163】arget-Guided Bayesian Flow Networks for Quantitatively Constrained CAD Generation

链接https://arxiv.org/abs/2510.25163

作者:Wenhao Zheng,Chenwei Sun,Wenbo Zhang,Jiancheng Lv,Xianggen Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Deep generative models, simplified continuity assumptions, shown promising progress, Deep generative, CAD sequences

备注

点击查看摘要

Abstract:Deep generative models, such as diffusion models, have shown promising progress in image generation and audio generation via simplified continuity assumptions. However, the development of generative modeling techniques for generating multi-modal data, such as parametric CAD sequences, still lags behind due to the challenges in addressing long-range constraints and parameter sensitivity. In this work, we propose a novel framework for quantitatively constrained CAD generation, termed Target-Guided Bayesian Flow Network (TGBFN). For the first time, TGBFN handles the multi-modality of CAD sequences (i.e., discrete commands and continuous parameters) in a unified continuous and differentiable parameter space rather than in the discrete data space. In addition, TGBFN penetrates the parameter update kernel and introduces a guided Bayesian flow to control the CAD properties. To evaluate TGBFN, we construct a new dataset for quantitatively constrained CAD generation. Extensive comparisons across single-condition and multi-condition constrained generation tasks demonstrate that TGBFN achieves state-of-the-art performance in generating high-fidelity, condition-aware CAD sequences. The code is available at this https URL.

39. 【2510.25157】owards Real-Time Inference of Thin Liquid Film Thickness Profiles from Interference Patterns Using Vision Transformers

链接https://arxiv.org/abs/2510.25157

作者:Gautam A. Viruthagiri,Arnuv Tandon,Gerald G. Fuller,Vinny Chandran Suja

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:ill-posed inverse problem, inverse problem complicated, non-invasively measuring liquid, applications in ophthalmology, interference patterns

备注: 6 pages, 2 figures, will be updated

点击查看摘要

Abstract:Thin film interferometry is a powerful technique for non-invasively measuring liquid film thickness with applications in ophthalmology, but its clinical translation is hindered by the challenges in reconstructing thickness profiles from interference patterns - an ill-posed inverse problem complicated by phase periodicity, imaging noise and ambient artifacts. Traditional reconstruction methods are either computationally intensive, sensitive to noise, or require manual expert analysis, which is impractical for real-time diagnostics. To address this challenge, here we present a vision transformer-based approach for real-time inference of thin liquid film thickness profiles directly from isolated interferograms. Trained on a hybrid dataset combining physiologically-relevant synthetic and experimental tear film data, our model leverages long-range spatial correlations to resolve phase ambiguities and reconstruct temporally coherent thickness profiles in a single forward pass from dynamic interferograms acquired in vivo and ex vivo. The network demonstrates state-of-the-art performance on noisy, rapidly-evolving films with motion artifacts, overcoming limitations of conventional phase-unwrapping and iterative fitting methods. Our data-driven approach enables automated, consistent thickness reconstruction at real-time speeds on consumer hardware, opening new possibilities for continuous monitoring of pre-lens ocular tear films and non-invasive diagnosis of conditions such as the dry eye disease.

40. 【2510.25146】EA3D: Online Open-World 3D Object Extraction from Streaming Videos

链接https://arxiv.org/abs/2510.25146

作者:Xiaoyu Zhou,Jingqi Wang,Yuang Jia,Yongtao Wang,Deqing Sun,Ming-Hsuan Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:offline-collected multi-view data, holistic scene understanding, scene understanding, data or pre-constructed, limited by offline-collected

备注: The Thirty-Ninth Annual Conference on Neural Information Processing Systems(NeurIPS 2025)

点击查看摘要

Abstract:Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.

41. 【2510.25141】Revisiting Reconstruction-based AI-generated Image Detection: A Geometric Perspective

链接https://arxiv.org/abs/2510.25141

作者:Wan Jiang,Jing Yan,Ruixuan Zhang,Xiaojing Chen,Changtao Miao,Zhe Li,Chenhao Lin,Yunfeng Diao,Richang Hong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:generative Artificial Intelligence, Artificial Intelligence, made detecting AI-generated, generative Artificial, detecting AI-generated images

备注

点击查看摘要

Abstract:The rise of generative Artificial Intelligence (AI) has made detecting AI-generated images a critical challenge for ensuring authenticity. Existing reconstruction-based methods lack theoretical foundations and on empirical heuristics, limiting interpretability and reliability. In this paper, we introduce the Jacobian-Spectral Lower Bound for reconstruction error from a geometric perspective, showing that real images off the reconstruction manifold exhibit a non-trivial error lower bound, while generated images on the manifold have near-zero error. Furthermore, we reveal the limitations of existing methods that rely on static reconstruction error from a single pass. These methods often fail when some real images exhibit lower error than generated ones. This counterintuitive behavior reduces detection accuracy and requires data-specific threshold tuning, limiting their applicability in real-world scenarios. To address these challenges, we propose ReGap, a training-free method that computes dynamic reconstruction error by leveraging structured editing operations to introduce controlled perturbations. This enables measuring error changes before and after editing, improving detection accuracy by enhancing error separation. Experimental results show that our method outperforms existing baselines, exhibits robustness to common post-processing operations and generalizes effectively across diverse conditions.

42. 【2510.25140】DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications

链接https://arxiv.org/abs/2510.25140

作者:Malaisree P,Youwai S,Kitkobsin T,Janrungautai S,Amorndechaphon D,Rojanavasu P

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:limited annotated data, Object detection, specialized domains, images, applications is constrained

备注

点击查看摘要

Abstract:Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.

43. 【2510.25134】Region-CAM: Towards Accurate Object Regions in Class Activation Maps for Weakly Supervised Learning Tasks

链接https://arxiv.org/abs/2510.25134

作者:Qingdong Cai,Charith Abhayaratne

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Class Activation Mapping, weakly supervised learning, weakly supervised, supervised learning tasks, Class Activation

备注: Preprint for journal paper

点击查看摘要

Abstract:Class Activation Mapping (CAM) methods are widely applied in weakly supervised learning tasks due to their ability to highlight object regions. However, conventional CAM methods highlight only the most discriminative regions of the target. These highlighted regions often fail to cover the entire object and are frequently misaligned with object boundaries, thereby limiting the performance of downstream weakly supervised learning tasks, particularly Weakly Supervised Semantic Segmentation (WSSS), which demands pixel-wise accurate activation maps to get the best results. To alleviate the above problems, we propose a novel activation method, Region-CAM. Distinct from network feature weighting approaches, Region-CAM generates activation maps by extracting semantic information maps (SIMs) and performing semantic information propagation (SIP) by considering both gradients and features in each of the stages of the baseline classification model. Our approach highlights a greater proportion of object regions while ensuring activation maps to have precise boundaries that align closely with object edges. Region-CAM achieves 60.12% and 58.43% mean intersection over union (mIoU) using the baseline model on the PASCAL VOC training and validation datasets, respectively, which are improvements of 13.61% and 13.13% over the original CAM (46.51% and 45.30%). On the MS COCO validation set, Region-CAM achieves 36.38%, a 16.23% improvement over the original CAM (20.15%). We also demonstrate the superiority of Region-CAM in object localization tasks, using the ILSVRC2012 validation set. Region-CAM achieves 51.7% in Top-1 Localization accuracy Loc1. Compared with LayerCAM, an activation method designed for weakly supervised object localization, Region-CAM achieves 4.5% better performance in Loc1.

44. 【2510.25129】AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians

链接https://arxiv.org/abs/2510.25129

作者:Xiyu Zhang,Chong Bao,Yipeng Chen,Hongjia Zhai,Yitong Dong,Hujun Bao,Zhaopeng Cui,Guofeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:prominent research topic, indoor and urban, downstream applications, Gaussian Splatting, prominent research

备注: 18 pages, 11 figures. NeurIPS 2025; Project page: [this https URL](https://zju3dv.github.io/AtlasGS/)

点击查看摘要

Abstract:3D reconstruction of indoor and urban environments is a prominent research topic with various downstream applications. However, existing geometric priors for addressing low-texture regions in indoor and urban settings often lack global consistency. Moreover, Gaussian Splatting and implicit SDF fields often suffer from discontinuities or exhibit computational inefficiencies, resulting in a loss of detail. To address these issues, we propose an Atlanta-world guided implicit-structured Gaussian Splatting that achieves smooth indoor and urban scene reconstruction while preserving high-frequency details and rendering efficiency. By leveraging the Atlanta-world model, we ensure the accurate surface reconstruction for low-texture regions, while the proposed novel implicit-structured GS representations provide smoothness without sacrificing efficiency and high-frequency details. Specifically, we propose a semantic GS representation to predict the probability of all semantic regions and deploy a structure plane regularization with learnable plane indicators for global accurate surface reconstruction. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in both indoor and urban scenes, delivering superior surface reconstruction quality.

45. 【2510.25094】Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection

链接https://arxiv.org/abs/2510.25094

作者:Chanhyeong Yang,Taehoon Song,Jihwan Park,Hyunwoo J. Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:specific verb-object pairs, Human-Object Interaction detection, Interaction detection aims, Zero-shot Human-Object Interaction, unseen during training

备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at this https URL.

46. 【2510.25084】PSTF-AttControl: Per-Subject-Tuning-Free Personalized Image Generation with Controllable Face Attributes

链接https://arxiv.org/abs/2510.25084

作者:Xiang liu,Zhaoxiang Liu,Huan Hu,Zipeng Wang,Ping Chen,Zezhou Chen,Kai Wang,Shiguo Lian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advancements, significantly improved facial, facial, social media, facial attributes

备注: Accepted by Image and Vision Computing (18 pages, 8 figures)

点击查看摘要

Abstract:Recent advancements in personalized image generation have significantly improved facial identity preservation, particularly in fields such as entertainment and social media. However, existing methods still struggle to achieve precise control over facial attributes in a per-subject-tuning-free (PSTF) way. Tuning-based techniques like PreciseControl have shown promise by providing fine-grained control over facial features, but they often require extensive technical expertise and additional training data, limiting their accessibility. In contrast, PSTF approaches simplify the process by enabling image generation from a single facial input, but they lack precise control over facial attributes. In this paper, we introduce a novel, PSTF method that enables both precise control over facial attributes and high-fidelity preservation of facial identity. Our approach utilizes a face recognition model to extract facial identity features, which are then mapped into the $W^+$ latent space of StyleGAN2 using the e4e encoder. We further enhance the model with a Triplet-Decoupled Cross-Attention module, which integrates facial identity, attribute features, and text embeddings into the UNet architecture, ensuring clean separation of identity and attribute information. Trained on the FFHQ dataset, our method allows for the generation of personalized images with fine-grained control over facial attributes, while without requiring additional fine-tuning or training data for individual identities. We demonstrate that our approach successfully balances personalization with precise facial attribute control, offering a more efficient and user-friendly solution for high-quality, adaptable facial image synthesis. The code is publicly available at this https URL.

47. 【2510.25077】Neighborhood Feature Pooling for Remote Sensing Image Classification

链接https://arxiv.org/abs/2510.25077

作者:Fahimeh Orvati Nia,Amirmohammad Mohammadi,Salim Al Kharsa,Pragati Naikare,Zigfried Hampel-Arias,Joshua Peeples

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:sensing image classification, remote sensing image, neighborhood feature pooling, texture feature extraction, propose neighborhood feature

备注: 9 pages, 5 figures. Accepted to WACV 2026 (Winter Conference on Applications of Computer Vision)

点击查看摘要

Abstract:In this work, we propose neighborhood feature pooling (NFP) as a novel texture feature extraction method for remote sensing image classification. The NFP layer captures relationships between neighboring inputs and efficiently aggregates local similarities across feature dimensions. Implemented using convolutional layers, NFP can be seamlessly integrated into any network. Results comparing the baseline models and the NFP method indicate that NFP consistently improves performance across diverse datasets and architectures while maintaining minimal parameter overhead.

48. 【2510.25070】Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments

链接https://arxiv.org/abs/2510.25070

作者:Manjunath Prasad Holenarasipura Rajiv,B. M. Vidyavathi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:settings presents major, presents major challenges, major challenges due, real-world settings presents, settings presents

备注: Preprint under review at IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025

点击查看摘要

Abstract:Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between visual and textual modalities. The goal is to enable robust zero-shot comprehension of scenes by leveraging natural language as a bridge to generalize over unseen categories and contexts. Our approach develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets demonstrate significant gains over state-of-the-art zero-shot models in object recognition, activity detection, and scene captioning. The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics, highlighting the effectiveness of cross-modal alignment and language grounding in enhancing generalization for real-world scene understanding.

49. 【2510.25067】DRIP: Dynamic patch Reduction via Interpretable Pooling

链接https://arxiv.org/abs/2510.25067

作者:Yusen Peng,Sachin Kumar

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:instruction tuning, including contrastive pretraining, advances in vision-language, greatly pushed, pushed the frontier

备注

点击查看摘要

Abstract:Recently, the advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, owing to the large-scale and hence expensive pretraining, the efficiency concern has discouraged researchers from attempting to pretrain a vision language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.

50. 【2510.25058】Auto3DSeg for Brain Tumor Segmentation from 3D MRI in BraTS 2023 Challenge

链接https://arxiv.org/abs/2510.25058

作者:Andriy Myronenko,Dong Yang,Yufan He,Daguang Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Pediatic Glioma challenges, describe our solution, Brain Metastasis, Brain Meningioma, place results

备注: BraTS23 winner

点击查看摘要

Abstract:In this work, we describe our solution to the BraTS 2023 cluster of challenges using Auto3DSeg from MONAI. We participated in all 5 segmentation challenges, and achieved the 1st place results in three of them: Brain Metastasis, Brain Meningioma, BraTS-Africa challenges, and the 2nd place results in the remaining two: Adult and Pediatic Glioma challenges.

51. 【2510.25051】Breast Cancer VLMs: Clinically Practical Vision-Language Train-Inference Models

链接https://arxiv.org/abs/2510.25051

作者:Shunjie-Fabian Zheng,Hyeonjun Lee,Thijs Kooi,Ali Diba

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:commonly diagnosed malignancy, Breast cancer remains, developed world, commonly diagnosed, diagnosed malignancy

备注: Accepted to Computer Vision for Automated Medical Diagnosis (CVAMD) Workshop at ICCV 2025

点击查看摘要

Abstract:Breast cancer remains the most commonly diagnosed malignancy among women in the developed world. Early detection through mammography screening plays a pivotal role in reducing mortality rates. While computer-aided diagnosis (CAD) systems have shown promise in assisting radiologists, existing approaches face critical limitations in clinical deployment - particularly in handling the nuanced interpretation of multi-modal data and feasibility due to the requirement of prior clinical history. This study introduces a novel framework that synergistically combines visual features from 2D mammograms with structured textual descriptors derived from easily accessible clinical metadata and synthesized radiological reports through innovative tokenization modules. Our proposed methods in this study demonstrate that strategic integration of convolutional neural networks (ConvNets) with language representations achieves superior performance to vision transformer-based models while handling high-resolution images and enabling practical deployment across diverse populations. By evaluating it on multi-national cohort screening mammograms, our multi-modal approach achieves superior performance in cancer detection and calcification identification compared to unimodal baselines, with particular improvements. The proposed method establishes a new paradigm for developing clinically viable VLM-based CAD systems that effectively leverage imaging data and contextual patient information through effective fusion mechanisms.

52. 【2510.25032】Efficient License Plate Recognition via Pseudo-Labeled Supervision with Grounding DINO and YOLOv8

链接https://arxiv.org/abs/2510.25032

作者:Zahra Ebrahimi Vargoorani,Amir Mohammad Ghoreyshi,Ching Yee Suen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:highly accurate automatic, Developing a highly, accurate automatic license, automatic license plate, highly accurate

备注: 6 pages, 8 figures. Presented at 2025 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), August 31 - September 3, 2025, Istanbul, Turkey

点击查看摘要

Abstract:Developing a highly accurate automatic license plate recognition system (ALPR) is challenging due to environmental factors such as lighting, rain, and dust. Additional difficulties include high vehicle speeds, varying camera angles, and low-quality or low-resolution images. ALPR is vital in traffic control, parking, vehicle tracking, toll collection, and law enforcement applications. This paper proposes a deep learning strategy using YOLOv8 for license plate detection and recognition tasks. This method seeks to enhance the performance of the model using datasets from Ontario, Quebec, California, and New York State. It achieved an impressive recall rate of 94% on the dataset from the Center for Pattern Recognition and Machine Intelligence (CENPARMI) and 91% on the UFPR-ALPR dataset. In addition, our method follows a semi-supervised learning framework, combining a small set of manually labeled data with pseudo-labels generated by Grounding DINO to train our detection model. Grounding DINO, a powerful vision-language model, automatically annotates many images with bounding boxes for license plates, thereby minimizing the reliance on labor-intensive manual labeling. By integrating human-verified and model-generated annotations, we can scale our dataset efficiently while maintaining label quality, which significantly enhances the training process and overall model performance. Furthermore, it reports character error rates for both datasets, providing additional insight into system performance.

53. 【2510.25002】Resi-VidTok: An Efficient and Decomposed Progressive Tokenization Framework for Ultra-Low-Rate and Lightweight Video Transmission

链接https://arxiv.org/abs/2510.25002

作者:Zhenyu Liu,Yi Ma,Rahim Tafazolli,Zhi Ding

类目:Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)

关键词:remains highly challenging, networks remains highly, advanced deep models, wireless networks remains, highly challenging

备注

点击查看摘要

Abstract:Real-time transmission of video over wireless networks remains highly challenging, even with advanced deep models, particularly under severe channel conditions such as limited bandwidth and weak connectivity. In this paper, we propose Resi-VidTok, a Resilient Tokenization-Enabled framework designed for ultra-low-rate and lightweight video transmission that delivers strong robustness while preserving perceptual and semantic fidelity on commodity digital hardware. By reorganizing spatio--temporal content into a discrete, importance-ordered token stream composed of key tokens and refinement tokens, Resi-VidTok enables progressive encoding, prefix-decodable reconstruction, and graceful quality degradation under constrained channels. A key contribution is a resilient 1D tokenization pipeline for video that integrates differential temporal token coding, explicitly supporting reliable recovery from incomplete token sets using a single shared framewise decoder--without auxiliary temporal extractors or heavy generative models. Furthermore, stride-controlled frame sparsification combined with a lightweight decoder-side interpolator reduces transmission load while maintaining motion continuity. Finally, a channel-adaptive source--channel coding and modulation scheme dynamically allocates rate and protection according to token importance and channel condition, yielding stable quality across adverse SNRs. Evaluation results indicate robust visual and semantic consistency at channel bandwidth ratios (CBR) as low as 0.0004 and real-time reconstruction at over 30 fps, demonstrating the practicality of Resi-VidTok for energy-efficient, latency-sensitive, and reliability-critical wireless applications.

54. 【2510.24980】FT-ARM: Fine-Tuned Agentic Reflection Multimodal Language Model for Pressure Ulcer Severity Classification with Reasoning

链接https://arxiv.org/abs/2510.24980

作者:Reza Saadati Fard,Emmanuel Agu,Palawat Busaranuvong,Deepak Kumar,Shefalika Gautam,Bengisu Tulu,Diane Strong,Lorraine Loretz

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:prevalent healthcare concern, Convolutional Neural Networks, healthcare concern, prevalent healthcare, Stages I-IV

备注

点击查看摘要

Abstract:Pressure ulcers (PUs) are a serious and prevalent healthcare concern. Accurate classification of PU severity (Stages I-IV) is essential for proper treatment but remains challenging due to subtle visual distinctions and subjective interpretation, leading to variability among clinicians. Prior AI-based approaches using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) achieved promising accuracy but offered limited interpretability. We present FT-ARM (Fine-Tuned Agentic Reflection Multimodal model), a fine-tuned multimodal large language model (MLLM) with an agentic self-reflection mechanism for pressure ulcer severity classification. Inspired by clinician-style diagnostic reassessment, FT-ARM iteratively refines its predictions by reasoning over visual features and encoded clinical knowledge from text, enhancing both accuracy and consistency. On the publicly available Pressure Injury Image Dataset (PIID), FT-ARM, fine-tuned from LLaMA 3.2 90B, achieved 85% accuracy in classifying PU stages I-IV, surpassing prior CNN-based models by +4%. Unlike earlier CNN/ViT studies that relied solely on offline evaluations, FT-ARM is designed and tested for live inference, reflecting real-time deployment conditions. Furthermore, it produces clinically grounded natural-language explanations, improving interpretability and trust. By integrating fine-tuning and reflective reasoning across multimodal inputs, FT-ARM advances the reliability, transparency, and clinical applicability of automated wound assessment systems, addressing the critical need for consistent and explainable PU staging to support improved patient care.

55. 【2510.24949】SCOUT: A Lightweight Framework for Scenario Coverage Assessment in Autonomous Driving

链接https://arxiv.org/abs/2510.24949

作者:Anil Yildiz,Sarah M. Thornton,Carl Hildebrandt,Sreeja Roy-Singh,Mykel J. Kochenderfer

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Assessing scenario coverage, computationally intensive Large, intensive Large Vision-Language, Assessing scenario, Large Vision-Language Models

备注

点击查看摘要

Abstract:Assessing scenario coverage is crucial for evaluating the robustness of autonomous agents, yet existing methods rely on expensive human annotations or computationally intensive Large Vision-Language Models (LVLMs). These approaches are impractical for large-scale deployment due to cost and efficiency constraints. To address these shortcomings, we propose SCOUT (Scenario Coverage Oversight and Understanding Tool), a lightweight surrogate model designed to predict scenario coverage labels directly from an agent's latent sensor representations. SCOUT is trained through a distillation process, learning to approximate LVLM-generated coverage labels while eliminating the need for continuous LVLM inference or human annotation. By leveraging precomputed perception features, SCOUT avoids redundant computations and enables fast, scalable scenario coverage estimation. We evaluate our method across a large dataset of real-life autonomous navigation scenarios, demonstrating that it maintains high accuracy while significantly reducing computational cost. Our results show that SCOUT provides an effective and practical alternative for large-scale coverage analysis. While its performance depends on the quality of LVLM-generated training labels, SCOUT represents a major step toward efficient scenario coverage oversight in autonomous systems.

56. 【2510.24936】IBIS: A Powerful Hybrid Architecture for Human Activity Recognition

链接https://arxiv.org/abs/2510.24936

作者:Alison M. Fernandes,Hermes I. Del Monego,Bruno S. Chang,Anelise Munaretto,Hélder M. Fontes,Rui L. Campos

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:space occupancy analysis, gesture-based IoT control, Wi-Fi sensing stems, capture environmental data, making it ideal

备注: 8 pages. 8 figures. Wireless Days Conference, December 2025

点击查看摘要

Abstract:The increasing interest in Wi-Fi sensing stems from its potential to capture environmental data in a low-cost, non-intrusive way, making it ideal for applications like healthcare, space occupancy analysis, and gesture-based IoT control. However, a major limitation in this field is the common problem of overfitting, where models perform well on training data but fail to generalize to new data. To overcome this, we introduce a novel hybrid architecture that integrates Inception-BiLSTM with a Support Vector Machine (SVM), which we refer to as IBIS. Our IBIS approach is uniquely engineered to improve model generalization and create more robust classification boundaries. By applying this method to Doppler-derived data, we achieve a movement recognition accuracy of nearly 99%. Comprehensive performance metrics and confusion matrices confirm the significant effectiveness of our proposed solution.

57. 【2510.24919】Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning

链接https://arxiv.org/abs/2510.24919

作者:Hossein R. Nowdeh,Jie Ji,Xiaolong Ma,Fatemeh Afghah

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:limiting generalization, Modality-Aware Sharpness-Aware Minimization, dominant modality, M-SAM, dominant

备注

点击查看摘要

Abstract:In multimodal learning, dominant modalities often overshadow others, limiting generalization. We propose Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic framework that applies to many modalities and supports early and late fusion scenarios. In every iteration, M-SAM in three steps optimizes learning. \textbf{First, it identifies the dominant modality} based on modalities' contribution in the accuracy using Shapley. \textbf{Second, it decomposes the loss landscape}, or in another language, it modulates the loss to prioritize the robustness of the model in favor of the dominant modality, and \textbf{third, M-SAM updates the weights} by backpropagation of modulated gradients. This ensures robust learning for the dominant modality while enhancing contributions from others, allowing the model to explore and exploit complementary features that strengthen overall performance. Extensive experiments on four diverse datasets show that M-SAM outperforms the latest state-of-the-art optimization and gradient manipulation methods and significantly balances and improves multimodal learning.

58. 【2510.24907】Understanding Multi-View Transformers

链接https://arxiv.org/abs/2510.24907

作者:Michal Stary,Julien Gaubil,Ayush Tewari,Vincent Sitzmann

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:vision by solving, Multi-view transformers, feed-forward manner, multi-view transformers' layers, Multi-view

备注: Presented at the ICCV 2025 E2E3D Workshop

点击查看摘要

Abstract:Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, contrary to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers' layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks, the role of the individual layers, and suggest how it differs from methods with stronger inductive biases of explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at this https URL .

59. 【2510.24904】VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos

链接https://arxiv.org/abs/2510.24904

作者:Qiucheng Wu,Handong Zhao,Zhixin Shu,Jing Shi,Yang Zhang,Shiyu Chang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:external camera controls, unconventional camera motions, camera motions, text descriptions, struggle to generalize

备注: 19 pages, 9 figures

点击查看摘要

Abstract:Although recent text-to-video generative models are getting more capable of following external camera controls, imposed by either text descriptions or camera trajectories, they still struggle to generalize to unconventional camera motions, which is crucial in creating truly original and artistic videos. The challenge lies in the difficulty of finding sufficient training videos with the intended uncommon camera motions. To address this challenge, we propose VividCam, a training paradigm that enables diffusion models to learn complex camera motions from synthetic videos, releasing the reliance on collecting realistic training videos. VividCam incorporates multiple disentanglement strategies that isolates camera motion learning from synthetic appearance artifacts, ensuring more robust motion representation and mitigating domain shift. We demonstrate that our design synthesizes a wide range of precisely controlled and complex camera motions using surprisingly simple synthetic data. Notably, this synthetic data often consists of basic geometries within a low-poly 3D scene and can be efficiently rendered by engines like Unity. Our video results can be found in this https URL .

60. 【2510.24902】Pixels to Signals: A Real-Time Framework for Traffic Demand Estimation

链接https://arxiv.org/abs/2510.24902

作者:H Mhatre,M Vyas,A Mittal

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:growing urban cities, urban transportation systems, rapidly growing urban, urban cities, growing urban

备注

点击查看摘要

Abstract:Traffic congestion is becoming a challenge in the rapidly growing urban cities, resulting in increasing delays and inefficiencies within urban transportation systems. To address this issue a comprehensive methodology is designed to optimize traffic flow and minimize delays. The framework is structured with three primary components: (a) vehicle detection, (b) traffic prediction, and (c) traffic signal optimization. This paper presents the first component, vehicle detection. The methodology involves analyzing multiple sequential frames from a camera feed to compute the background, i.e. the underlying roadway, by averaging pixel values over time. The computed background is then utilized to extract the foreground, where the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is applied to detect vehicles. With its computational efficiency and minimal infrastructure modification requirements, the proposed methodology offers a practical and scalable solution for real-world deployment.

61. 【2510.24887】Proper Body Landmark Subset Enables More Accurate and 5X Faster Recognition of Isolated Signs in LIBRAS

链接https://arxiv.org/abs/2510.24887

作者:Daniele L. V. dos Santos,Thiago B. Pereira,Carlos Eduardo G. R. Alves,Richard J. M. G. Tello,Francisco de A. Boldt,Thiago M. Paixão

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Brazilian Sign Language, Brazilian Sign, body landmark detection, paper investigates, investigates the feasibility

备注: Submitted to Int. Conf. on Computer Vision Theory and Applications (VISAPP 2026)

点击查看摘要

Abstract:This paper investigates the feasibility of using lightweight body landmark detection for the recognition of isolated signs in Brazilian Sign Language (LIBRAS). Although the skeleton-based approach by Alves et al. (2024) enabled substantial improvements in recognition performance, the use of OpenPose for landmark extraction hindered time performance. In a preliminary investigation, we observed that simply replacing OpenPose with the lightweight MediaPipe, while improving processing speed, significantly reduced accuracy. To overcome this limitation, we explored landmark subset selection strategies aimed at optimizing recognition performance. Experimental results showed that a proper landmark subset achieves comparable or superior performance to state-of-the-art methods while reducing processing time by more than 5X compared to Alves et al. (2024). As an additional contribution, we demonstrated that spline-based imputation effectively mitigates missing landmark issues, leading to substantial accuracy gains. These findings highlight that careful landmark selection, combined with simple imputation techniques, enables efficient and accurate isolated sign recognition, paving the way for scalable Sign Language Recognition systems.

62. 【2510.24885】FruitProm: Probabilistic Maturity Estimation and Detection of Fruits and Vegetables

链接https://arxiv.org/abs/2510.24885

作者:Sidharth Rai,Rahul Harsha Cheppally,Benjamin Vail,Keziban Yalçın Dokumacı,Ajay Sharda

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:directly impacting yield, impacting yield prediction, agricultural automation, directly impacting, impacting yield

备注: Sidharth Rai, Rahul Harsha Cheppally contributed equally to this work

点击查看摘要

Abstract:Maturity estimation of fruits and vegetables is a critical task for agricultural automation, directly impacting yield prediction and robotic harvesting. Current deep learning approaches predominantly treat maturity as a discrete classification problem (e.g., unripe, ripe, overripe). This rigid formulation, however, fundamentally conflicts with the continuous nature of the biological ripening process, leading to information loss and ambiguous class boundaries. In this paper, we challenge this paradigm by reframing maturity estimation as a continuous, probabilistic learning task. We propose a novel architectural modification to the state-of-the-art, real-time object detector, RT-DETRv2, by introducing a dedicated probabilistic head. This head enables the model to predict a continuous distribution over the maturity spectrum for each detected object, simultaneously learning the mean maturity state and its associated uncertainty. This uncertainty measure is crucial for downstream decision-making in robotics, providing a confidence score for tasks like selective harvesting. Our model not only provides a far richer and more biologically plausible representation of plant maturity but also maintains exceptional detection performance, achieving a mean Average Precision (mAP) of 85.6\% on a challenging, large-scale fruit dataset. We demonstrate through extensive experiments that our probabilistic approach offers more granular and accurate maturity assessments than its classification-based counterparts, paving the way for more intelligent, uncertainty-aware automated systems in modern agriculture

63. 【2510.24870】Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation

链接https://arxiv.org/abs/2510.24870

作者:Alexander Martin,William Walden,Reno Kriz,Dengjia Zhang,Kate Sanders,Eugene Yang,Chihsheng Jin,Benjamin Van Durme

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:RAG, framework for retrieval-augmented, retrieval-augmented generation, multimodal RAG, multimodal RAG evaluation

备注: [this https URL](https://github.com/alexmartin1722/mirage)

点击查看摘要

Abstract:We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning intensive settings because they don't verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics -- ACLE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.

64. 【2510.24830】he Generation Phases of Flow Matching: a Denoising Perspective

链接https://arxiv.org/abs/2510.24830

作者:Anne Gagneux,Ségolène Martin,Rémi Gribonval,Mathurin Massias

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:achieved remarkable success, remain poorly understood, process remain poorly, generation process remain, remarkable success

备注

点击查看摘要

Abstract:Flow matching has achieved remarkable success, yet the factors influencing the quality of its generation process remain poorly understood. In this work, we adopt a denoising perspective and design a framework to empirically probe the generation process. Laying down the formal connections between flow matching models and denoisers, we provide a common ground to compare their performances on generation and denoising. This enables the design of principled and controlled perturbations to influence sample generation: noise and drift. This leads to new insights on the distinct dynamical phases of the generative process, enabling us to precisely characterize at which stage of the generative process denoisers succeed or fail and why this matters.

65. 【2510.24827】MCIHN: A Hybrid Network Model Based on Multi-path Cross-modal Interaction for Multimodal Emotion Recognition

链接https://arxiv.org/abs/2510.24827

作者:Haoyang Zhang,Zhou Yang,Ke Sun,Yucai Pang,Guoliang Xu

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:future human-computer interaction, crucial for future, future human-computer, emotion recognition, Multimodal emotion recognition

备注: The paper will be published in the MMAsia2025 conference proceedings

点击查看摘要

Abstract:Multimodal emotion recognition is crucial for future human-computer interaction. However, accurate emotion recognition still faces significant challenges due to differences between different modalities and the difficulty of characterizing unimodal emotional information. To solve these problems, a hybrid network model based on multipath cross-modal interaction (MCIHN) is proposed. First, adversarial autoencoders (AAE) are constructed separately for each modality. The AAE learns discriminative emotion features and reconstructs the features through a decoder to obtain more discriminative information about the emotion classes. Then, the latent codes from the AAE of different modalities are fed into a predefined Cross-modal Gate Mechanism model (CGMM) to reduce the discrepancy between modalities, establish the emotional relationship between interacting modalities, and generate the interaction features between different modalities. Multimodal fusion using the Feature Fusion module (FFM) for better emotion recognition. Experiments were conducted on publicly available SIMS and MOSI datasets, demonstrating that MCIHN achieves superior performance.

66. 【2510.24821】Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

链接https://arxiv.org/abs/2510.24821

作者:Inclusion AI:Bowen Ma,Cheng Zou,Canxiang Yan,Chunxiang Jin,Chunjie Shen,Dandan Zheng,Fudong Wang,Furong Xu,GuangMing Yao,Jun Zhou,Jingdong Chen,Jianing Li,Jianxin Sun,Jiajia Liu,Jianjiang Zhu,Jianping Jiang,Jun Peng,Kaixiang Ji,Kaimeng Ren,Libin Wang,Lixiang Ru,Longhua Tan,Lan Wang,Mochen Bai,Ning Gao,Qingpei Guo,Qinglong Zhang,Qiang Xu,Rui Liu,Ruijie Xiong,Ruobing Zheng,Sirui Gao,Tianqi Li,Tinghao Liu,Weilong Chai,Xinyu Xiao,Xiaomei Wang,Xiaolong Wang,Xiao Lu,Xiaoyu Li,Xingning Dong,Xuzheng Yu,Yi Yuan,Yuting Gao,Yuting Xiao,Yunxiao Sun,Yipeng Chen,Yifan Mao,Yifei Wu,Yongjie Lyu,Ziping Ma,Zhiqiang Fang,Zhihao Qiu,Ziyuan Huang,Zizheng Yang,Zhengyu He

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:billion total parameters, Artificial General Intelligence, billion total, total parameters, active per token

备注: 18 pages, 5 figures

点击查看摘要

Abstract:We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.

67. 【2510.24820】SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing

链接https://arxiv.org/abs/2510.24820

作者:Ruiyang Zhang,Jiahao Luo,Xiaoru Feng,Qiufan Pang,Yaodong Yang,Juntao Dai

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:increasingly critical, rapid advancement, safety, safety editing, multi-round safety editing

备注

点击查看摘要

Abstract:With the rapid advancement of text-to-image (T2I) models, ensuring their safety has become increasingly critical. Existing safety approaches can be categorized into training-time and inference-time methods. While inference-time methods are widely adopted due to their cost-effectiveness, they often suffer from limitations such as over-refusal and imbalance between safety and utility. To address these challenges, we propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module, enabling efficient safety alignment for any text-to-image model. Central to this framework is MR-SafeEdit, a multi-round image-text interleaved dataset specifically constructed for safety editing in text-to-image generation. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. To instantiate this paradigm, we develop SafeEditor, a unified MLLM capable of multi-round safety editing on generated images. Experimental results show that SafeEditor surpasses prior safety approaches by reducing over-refusal while achieving a more favorable safety-utility balance.

68. 【2510.24816】Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection

链接https://arxiv.org/abs/2510.24816

作者:Cui Yakun,Fushuo Huo,Weijie Shi,Juntao Dai,Hang Du,Zhenghao Zhu,Sirui Han,Yike Guo

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:greatly advanced research, multi-modal large language, large language models, Multi-modal Video Fake, Video fake

备注

点击查看摘要

Abstract:The advent of multi-modal large language models (MLLMs) has greatly advanced research into applications for Video fake news detection (VFND) tasks. Traditional video-based FND benchmarks typically focus on the accuracy of the final decision, often failing to provide fine-grained assessments for the entire detection process, making the detection process a black box. Therefore, we introduce the MVFNDB (Multi-modal Video Fake News Detection Benchmark) based on the empirical analysis, which provides foundation for tasks definition. The benchmark comprises 10 tasks and is meticulously crafted to probe MLLMs' perception, understanding, and reasoning capacities during detection, featuring 9730 human-annotated video-related questions based on a carefully constructed taxonomy ability of VFND. To validate the impact of combining multiple features on the final results, we design a novel framework named MVFND-CoT, which incorporates both creator-added content and original shooting footage reasoning. Building upon the benchmark, we conduct an in-depth analysis of the deeper factors influencing accuracy, including video processing strategies and the alignment between video features and model capabilities. We believe this benchmark will lay a solid foundation for future evaluations and advancements of MLLMs in the domain of video fake news detection.

69. 【2510.24814】Deep Feature Optimization for Enhanced Fish Freshness Assessment

链接https://arxiv.org/abs/2510.24814

作者:Phi-Hung Hoang,Nam-Thuan Trinh,Van-Manh Tran,Thi-Thu-Hong Phan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:ensuring food safety, minimizing economic losses, Assessing fish freshness, Assessing fish, seafood industry

备注: 39 pages; 10 tables; 9 figures

点击查看摘要

Abstract:Assessing fish freshness is vital for ensuring food safety and minimizing economic losses in the seafood industry. However, traditional sensory evaluation remains subjective, time-consuming, and inconsistent. Although recent advances in deep learning have automated visual freshness prediction, challenges related to accuracy and feature transparency persist. This study introduces a unified three-stage framework that refines and leverages deep visual representations for reliable fish freshness assessment. First, five state-of-the-art vision architectures - ResNet-50, DenseNet-121, EfficientNet-B0, ConvNeXt-Base, and Swin-Tiny - are fine-tuned to establish a strong baseline. Next, multi-level deep features extracted from these backbones are used to train seven classical machine learning classifiers, integrating deep and traditional decision mechanisms. Finally, feature selection methods based on Light Gradient Boosting Machine (LGBM), Random Forest, and Lasso identify a compact and informative subset of features. Experiments on the Freshness of the Fish Eyes (FFE) dataset demonstrate that the best configuration combining Swin-Tiny features, an Extra Trees classifier, and LGBM-based feature selection achieves an accuracy of 85.99%, outperforming recent studies on the same dataset by 8.69-22.78%. These findings confirm the effectiveness and generalizability of the proposed framework for visual quality evaluation tasks.

70. 【2510.24813】DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts

链接https://arxiv.org/abs/2510.24813

作者:Binbin Li,Guimiao Yang,Zisen Qi,Haiping Wang,Yu Ding

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Recent lightweight retrieval-augmented, utilize retrieved data, retrieved data solely, Recent lightweight, visual features unenhanced

备注

点击查看摘要

Abstract:Recent lightweight retrieval-augmented image caption models often utilize retrieved data solely as text prompts, thereby creating a semantic gap by leaving the original visual features unenhanced, particularly for object details or complex scenes. To address this limitation, we propose $DualCap$, a novel approach that enriches the visual representation by generating a visual prompt from retrieved similar images. Our model employs a dual retrieval mechanism, using standard image-to-text retrieval for text prompts and a novel image-to-image retrieval to source visually analogous scenes. Specifically, salient keywords and phrases are derived from the captions of visually similar scenes to capture key objects and similar details. These textual features are then encoded and integrated with the original image features through a lightweight, trainable feature fusion network. Extensive experiments demonstrate that our method achieves competitive performance while requiring fewer trainable parameters compared to previous visual-prompting captioning approaches.

71. 【2510.24804】Conflict Adaptation in Vision-Language Models

链接https://arxiv.org/abs/2510.24804

作者:Xiaoyang Hu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:improved performance, high-conflict trial, cognitive control, human cognitive control, conflict adaptation

备注: Workshop on Interpreting Cognition in Deep Learning Models at NeurIPS 2025

点击查看摘要

Abstract:A signature of human cognitive control is conflict adaptation: improved performance on a high-conflict trial following another high-conflict trial. This phenomenon offers an account for how cognitive control, a scarce resource, is recruited. Using a sequential Stroop task, we find that 12 of 13 vision-language models (VLMs) tested exhibit behavior consistent with conflict adaptation, with the lone exception likely reflecting a ceiling effect. To understand the representational basis of this behavior, we use sparse autoencoders (SAEs) to identify task-relevant supernodes in InternVL 3.5 4B. Partially overlapping supernodes emerge for text and color in both early and late layers, and their relative sizes mirror the automaticity asymmetry between reading and color naming in humans. We further isolate a conflict-modulated supernode in layers 24-25 whose ablation significantly increases Stroop errors while minimally affecting congruent trials.

72. 【2510.24795】A Survey on Efficient Vision-Language-Action Models

链接https://arxiv.org/abs/2510.24795

作者:Zhaoshu Yu,Bo Wang,Pengpeng Zeng,Haonan Zhang,Ji Zhang,Lianli Gao,Jingkuan Song,Nicu Sebe,Heng Tao Shen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:bridge digital knowledge, represent a significant, embodied intelligence, aiming to bridge, physical-world interaction

备注: 26 pages, 8 figures

点击查看摘要

Abstract:Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. While these models have demonstrated remarkable generalist capabilities, their deployment is severely hampered by the substantial computational and data requirements inherent to their underlying large-scale foundation models. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire data-model-training process. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: this https URL

73. 【2510.24792】PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models

链接https://arxiv.org/abs/2510.24792

作者:Patrick Haller,Fabio Barth,Jonas Golde,Georg Rehm,Alan Akbik

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:demonstrated remarkable progress, demonstrated remarkable, remarkable progress, Vision-language models, English

备注: 8 pages, 11 tables and figures

点击查看摘要

Abstract:Vision-language models (VLMs) have demonstrated remarkable progress in multimodal reasoning. However, existing benchmarks remain limited in terms of high-quality, human-verified examples. Many current datasets rely on synthetically generated content by large language models (LLMs). Furthermore, most datasets are limited to English, as manual quality assurance of translated samples is time-consuming and costly. To fill this gap, we introduce PISA-Bench, a multilingual benchmark derived from English examples of the expert-created PISA tests, a unified framework for the assessment of student competencies in over eighty countries. Each example consists of human-extracted instructions, questions, answer options, and images, enriched with question type categories, and has been translated from English into five additional languages (Spanish, German, Chinese, French, and Italian), resulting in a fully parallel corpus covering six languages. We evaluate state-of-the-art vision-language models on PISA-Bench and find that especially small models (20B parameters) fail to achieve high test scores. We further find substantial performance degradation on non-English splits as well as high error-rates when models are tasked with spatial and geometric reasoning. By releasing the dataset and evaluation framework, we provide a resource for advancing research on multilingual multimodal reasoning.

74. 【2510.24791】A Re-node Self-training Approach for Deep Graph-based Semi-supervised Classification on Multi-view Image Data

链接https://arxiv.org/abs/2510.24791

作者:Jingjun Bi,Fadi Dornaika

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:gained attention due, extensive data annotations, graph-based semi-supervised learning, gained attention, attention due

备注

点击查看摘要

Abstract:Recently, graph-based semi-supervised learning and pseudo-labeling have gained attention due to their effectiveness in reducing the need for extensive data annotations. Pseudo-labeling uses predictions from unlabeled data to improve model training, while graph-based methods are characterized by processing data represented as graphs. However, the lack of clear graph structures in images combined with the complexity of multi-view data limits the efficiency of traditional and existing techniques. Moreover, the integration of graph structures in multi-view data is still a challenge. In this paper, we propose Re-node Self-taught Graph-based Semi-supervised Learning for Multi-view Data (RSGSLM). Our method addresses these challenges by (i) combining linear feature transformation and multi-view graph fusion within a Graph Convolutional Network (GCN) framework, (ii) dynamically incorporating pseudo-labels into the GCN loss function to improve classification in multi-view data, and (iii) correcting topological imbalances by adjusting the weights of labeled samples near class boundaries. Additionally, (iv) we introduce an unsupervised smoothing loss applicable to all samples. This combination optimizes performance while maintaining computational efficiency. Experimental results on multi-view benchmark image datasets demonstrate that RSGSLM surpasses existing semi-supervised learning approaches in multi-view contexts.

75. 【2510.24788】he Underappreciated Power of Vision Models for Graph Structural Understanding

链接https://arxiv.org/abs/2510.24788

作者:Xinjian Zhao,Wei Pang,Zhongkai Xue,Xiangru Jian,Lei Zhang,Yaoyao Xu,Xiaozhuang Song,Shu Wu,Tianshu Yu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Neural Networks operate, Graph Neural Networks, Neural Networks, Networks operate, human visual perception

备注: NeurIPS 2025

点击查看摘要

Abstract:Graph Neural Networks operate through bottom-up message-passing, fundamentally differing from human visual perception, which intuitively captures global structures first. We investigate the underappreciated potential of vision models for graph understanding, finding they achieve performance comparable to GNNs on established benchmarks while exhibiting distinctly different learning patterns. These divergent behaviors, combined with limitations of existing benchmarks that conflate domain features with topological understanding, motivate our introduction of GraphAbstract. This benchmark evaluates models' ability to perceive global graph properties as humans do: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. Our results reveal that vision models significantly outperform GNNs on tasks requiring holistic structural understanding and maintain generalizability across varying graph scales, while GNNs struggle with global pattern abstraction and degrade with increasing graph size. This work demonstrates that vision models possess remarkable yet underutilized capabilities for graph structural understanding, particularly for problems requiring global topological awareness and scale-invariant reasoning. These findings open new avenues to leverage this underappreciated potential for developing more effective graph foundation models for tasks dominated by holistic pattern recognition.

76. 【2510.24787】ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality

链接https://arxiv.org/abs/2510.24787

作者:Mingzhi Zhu,Ding Shang,Sai Qian Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Photorealistic Codec Avatars, Virtual Reality, deep learning-based generative, learning-based generative models, Photorealistic Codec

备注

点击查看摘要

Abstract:Photorealistic Codec Avatars (PCA), which generate high-fidelity human face renderings, are increasingly being used in Virtual Reality (VR) environments to enable immersive communication and interaction through deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained VR devices such as head-mounted displays, where latency and power efficiency are critical. To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip of VR devices to further enhance processing efficiency. Building on these components, we introduce ESCA, a full-stack optimization framework that accelerates PCA inference on edge VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to $+0.39$ over the best 4-bit baseline, delivers up to $3.36\times$ latency reduction, and sustains a rendering rate of 100 frames per second in end-to-end tests, satisfying real-time VR requirements. These results demonstrate the feasibility of deploying high-fidelity codec avatars on resource-constrained devices, opening the door to more immersive and portable VR experiences.

77. 【2510.24778】FPGA-based Lane Detection System incorporating Temperature and Light Control Units

链接https://arxiv.org/abs/2510.24778

作者:Ibrahim Qamar,Saber Mahmoud,Seif Megahed,Mohamed Khaled,Saleh Hesham,Ahmed Matar,Saif Gebril,Mervat Mahmoud

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:important outcomes gained, Intelligent vehicles, Detector Vehicle LDV, Lane Detector Vehicle, tendency toward automation

备注: 5 pages, 8 figures, 3 tables

点击查看摘要

Abstract:Intelligent vehicles are one of the most important outcomes gained from the world tendency toward automation. Applications of IVs, whether in urban roads or robot tracks, do prioritize lane path detection. This paper proposes an FPGA-based Lane Detector Vehicle LDV architecture that relies on the Sobel algorithm for edge detection. Operating on 416 x 416 images and 150 MHz, the system can generate a valid output every 1.17 ms. The valid output consists of the number of present lanes, the current lane index, as well as its right and left boundaries. Additionally, the automated light and temperature control units in the proposed system enhance its adaptability to the surrounding environmental conditions.

78. 【2510.24777】Cross-Enhanced Multimodal Fusion of Eye-Tracking and Facial Features for Alzheimer's Disease Diagnosis

链接https://arxiv.org/abs/2510.24777

作者:Yujie Nie,Jianzhang Ni,Yonglong Ye,Yuan-Ting Zhang,Yun Kwok Wing,Xiangqing Xu,Xin Ma,Lizhou Fan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

关键词:slowing disease progression, enabling timely intervention, Alzheimer disease, disease progression, slowing disease

备注: 35 pages, 8 figures, and 7 tables

点击查看摘要

Abstract:Accurate diagnosis of Alzheimer's disease (AD) is essential for enabling timely intervention and slowing disease progression. Multimodal diagnostic approaches offer considerable promise by integrating complementary information across behavioral and perceptual domains. Eye-tracking and facial features, in particular, are important indicators of cognitive function, reflecting attentional distribution and neurocognitive state. However, few studies have explored their joint integration for auxiliary AD diagnosis. In this study, we propose a multimodal cross-enhanced fusion framework that synergistically leverages eye-tracking and facial features for AD detection. The framework incorporates two key modules: (a) a Cross-Enhanced Fusion Attention Module (CEFAM), which models inter-modal interactions through cross-attention and global enhancement, and (b) a Direction-Aware Convolution Module (DACM), which captures fine-grained directional facial features via horizontal-vertical receptive fields. Together, these modules enable adaptive and discriminative multimodal representation learning. To support this work, we constructed a synchronized multimodal dataset, including 25 patients with AD and 25 healthy controls (HC), by recording aligned facial video and eye-tracking sequences during a visual memory-search paradigm, providing an ecologically valid resource for evaluating integration strategies. Extensive experiments on this dataset demonstrate that our framework outperforms traditional late fusion and feature concatenation methods, achieving a classification accuracy of 95.11% in distinguishing AD from HC, highlighting superior robustness and diagnostic performance by explicitly modeling inter-modal dependencies and modality-specific contributions.

79. 【2510.24773】Point-level Uncertainty Evaluation of Mobile Laser Scanning Point Clouds

链接https://arxiv.org/abs/2510.24773

作者:Ziyang Xu,Olaf Wysocki,Christoph Holst

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)

关键词:Mobile Laser Scanning, Laser Scanning, Mobile Laser, Reliable quantification, essential for ensuring

备注

点击查看摘要

Abstract:Reliable quantification of uncertainty in Mobile Laser Scanning (MLS) point clouds is essential for ensuring the accuracy and credibility of downstream applications such as 3D mapping, modeling, and change analysis. Traditional backward uncertainty modeling heavily rely on high-precision reference data, which are often costly or infeasible to obtain at large scales. To address this issue, this study proposes a machine learning-based framework for point-level uncertainty evaluation that learns the relationship between local geometric features and point-level errors. The framework is implemented using two ensemble learning models, Random Forest (RF) and XGBoost, which are trained and validated on a spatially partitioned real-world dataset to avoid data leakage. Experimental results demonstrate that both models can effectively capture the nonlinear relationships between geometric characteristics and uncertainty, achieving mean ROC-AUC values above 0.87. The analysis further reveals that geometric features describing elevation variation, point density, and local structural complexity play a dominant role in predicting uncertainty. The proposed framework offers a data-driven perspective of uncertainty evaluation, providing a scalable and adaptable foundation for future quality control and error analysis of large-scale point clouds.

80. 【2510.24768】Combining SAR Simulators to Train ATR Models with Synthetic Data

链接https://arxiv.org/abs/2510.24768

作者:Benjamin Camus,Julien Houssay,Corentin Le Barbu,Eric Monteux,Cédric Saleun(a href="http://DGA.MI" rel="external noopener nofollow" class="link-external link-http"this http URL/a),Christian Cochin(a href="http://DGA.MI" rel="external noopener nofollow" class="link-external link-http"this http URL/a)

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

关键词:Automatic Target Recognition, perform Automatic Target, Synthetic Aperture Radar, Target Recognition, Aperture Radar

备注

点击查看摘要

Abstract:This work aims to train Deep Learning models to perform Automatic Target Recognition (ATR) on Synthetic Aperture Radar (SAR) images. To circumvent the lack of real labelled measurements, we resort to synthetic data produced by SAR simulators. Simulation offers full control over the virtual environment, which enables us to generate large and diversified datasets at will. However, simulations are intrinsically grounded on simplifying assumptions of the real world (i.e. physical models). Thus, synthetic datasets are not as representative as real measurements. Consequently, ATR models trained on synthetic images cannot generalize well on real measurements. Our contributions to this problem are twofold: on one hand, we demonstrate and quantify the impact of the simulation paradigm on the ATR. On the other hand, we propose a new approach to tackle the ATR problem: combine two SAR simulators that are grounded on different (but complementary) paradigms to produce synthetic datasets. To this end, we use two simulators: MOCEM, which is based on a scattering centers model approach, and Salsa, which resorts on a ray tracing strategy. We train ATR models using synthetic dataset generated both by MOCEM and Salsa and our Deep Learning approach called ADASCA. We reach an accuracy of almost 88 % on the MSTAR measurements.

81. 【2510.24767】owards Fine-Grained Human Motion Video Captioning

链接https://arxiv.org/abs/2510.24767

作者:Guorui Song,Guocun Wang,Zhe Huang,Jing Lin,Xuefei Zhe,Jian Li,Haoqian Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Generating accurate descriptions, Generating accurate, accurate descriptions, remains a challenging, challenging task

备注

点击查看摘要

Abstract:Generating accurate descriptions of human actions in videos remains a challenging task for video captioning models. Existing approaches often struggle to capture fine-grained motion details, resulting in vague or semantically inconsistent captions. In this work, we introduce the Motion-Augmented Caption Model (M-ACM), a novel generative framework that enhances caption quality by incorporating motion-aware decoding. At its core, M-ACM leverages motion representations derived from human mesh recovery to explicitly highlight human body dynamics, thereby reducing hallucinations and improving both semantic fidelity and spatial alignment in the generated captions. To support research in this area, we present the Human Motion Insight (HMI) Dataset, comprising 115K video-description pairs focused on human movement, along with HMI-Bench, a dedicated benchmark for evaluating motion-focused video captioning. Experimental results demonstrate that M-ACM significantly outperforms previous methods in accurately describing complex human motions and subtle temporal variations, setting a new standard for motion-centric video captioning.

82. 【2510.24734】DrivingScene: A Multi-Task Online Feed-Forward 3D Gaussian Splatting Method for Dynamic Driving Scenes

链接https://arxiv.org/abs/2510.24734

作者:Qirui Hou,Wenzhang Sun,Chang Zeng,Chunfeng Wang,Hao Li,Jianxun Cui

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:quality and efficiency, challenged by complex, struggling to balance, balance quality, Real-time

备注: Autonomous Driving, Novel view Synthesis, Multi task Learning

点击查看摘要

Abstract:Real-time, high-fidelity reconstruction of dynamic driving scenes is challenged by complex dynamics and sparse views, with prior methods struggling to balance quality and efficiency. We propose DrivingScene, an online, feed-forward framework that reconstructs 4D dynamic scenes from only two consecutive surround-view images. Our key innovation is a lightweight residual flow network that predicts the non-rigid motion of dynamic objects per camera on top of a learned static scene prior, explicitly modeling dynamics via scene flow. We also introduce a coarse-to-fine training paradigm that circumvents the instabilities common to end-to-end approaches. Experiments on nuScenes dataset show our image-only method simultaneously generates high-quality depth, scene flow, and 3D Gaussian point clouds online, significantly outperforming state-of-the-art methods in both dynamic reconstruction and novel view synthesis.

83. 【2510.24720】Modelling the Interplay of Eye-Tracking Temporal Dynamics and Personality for Emotion Detection in Face-to-Face Settings

链接https://arxiv.org/abs/2510.24720

作者:Meisam J. Seikavandi,Jostein Fimland,Fabricio Batista Narcizo,Maria Barrett,Ted Vucurevich,Jesper Bünsow Boldt,Andrew Burke Dittberner,Paolo Burelli

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:adaptive human-computer interaction, conversation-like settings, human-computer interaction, Accurate recognition, critical for adaptive

备注

点击查看摘要

Abstract:Accurate recognition of human emotions is critical for adaptive human-computer interaction, yet remains challenging in dynamic, conversation-like settings. This work presents a personality-aware multimodal framework that integrates eye-tracking sequences, Big Five personality traits, and contextual stimulus cues to predict both perceived and felt emotions. Seventy-three participants viewed speech-containing clips from the CREMA-D dataset while providing eye-tracking signals, personality assessments, and emotion ratings. Our neural models captured temporal gaze dynamics and fused them with trait and stimulus information, yielding consistent gains over SVM and literature baselines. Results show that (i) stimulus cues strongly enhance perceived-emotion predictions (macro F1 up to 0.77), while (ii) personality traits provide the largest improvements for felt emotion recognition (macro F1 up to 0.58). These findings highlight the benefit of combining physiological, trait-level, and contextual information to address the inherent subjectivity of emotion. By distinguishing between perceived and felt responses, our approach advances multimodal affective computing and points toward more personalized and ecologically valid emotion-aware systems.

84. 【2510.25164】ransformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning

链接https://arxiv.org/abs/2510.25164

作者:Yogesh Thakku Suresh,Vishwajeet Shivaji Hogale,Luca-Alexandru Zamfira,Anandavardhana Hegde

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:generating clinically relevant, clinically relevant captions, transformer-based multimodal framework, multimodal framework, framework for generating

备注: This work is to appear in the Proceedings of MICAD 2025, the 6th International Conference on Medical Imaging and Computer-Aided Diagnosis

点击查看摘要

Abstract:We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.

85. 【2510.24805】CT-Less Attenuation Correction Using Multiview Ensemble Conditional Diffusion Model on High-Resolution Uncorrected PET Images

链接https://arxiv.org/abs/2510.24805

作者:Alexandre St-Georges,Gabriel Richard,Maxime Toussaint,Christian Thibaudeau,Etienne Auger,Étienne Croteau,Stephen Cunnane,Roger Lecomte,Jean-Baptiste Michaud

类目:Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:effective treatment tracking, accurate diagnostic results, positron emission tomography, accurate diagnostic, treatment tracking

备注: This is a preprint and not the final version of this paper

点击查看摘要

Abstract:Accurate quantification in positron emission tomography (PET) is essential for accurate diagnostic results and effective treatment tracking. A major issue encountered in PET imaging is attenuation. Attenuation refers to the diminution of photon detected as they traverse biological tissues before reaching detectors. When such corrections are absent or inadequate, this signal degradation can introduce inaccurate quantification, making it difficult to differentiate benign from malignant conditions, and can potentially lead to misdiagnosis. Typically, this correction is done with co-computed Computed Tomography (CT) imaging to obtain structural data for calculating photon attenuation across the body. However, this methodology subjects patients to extra ionizing radiation exposure, suffers from potential spatial misregistration between PET/CT imaging sequences, and demands costly equipment infrastructure. Emerging advances in neural network architectures present an alternative approach via synthetic CT image synthesis. Our investigation reveals that Conditional Denoising Diffusion Probabilistic Models (DDPMs) can generate high quality CT images from non attenuation corrected PET images in order to correct attenuation. By utilizing all three orthogonal views from non-attenuation-corrected PET images, the DDPM approach combined with ensemble voting generates higher quality pseudo-CT images with reduced artifacts and improved slice-to-slice consistency. Results from a study of 159 head scans acquired with the Siemens Biograph Vision PET/CT scanner demonstrate both qualitative and quantitative improvements in pseudo-CT generation. The method achieved a mean absolute error of 32 $\pm$ 10.4 HU on the CT images and an average error of (1.48 $\pm$ 0.68)\% across all regions of interest when comparing PET images reconstructed using the attenuation map of the generated pseudo-CT versus the true CT.

86. 【2510.24776】CFL-SparseMed: Communication-Efficient Federated Learning for Medical Imaging with Top-k Sparse Updates

链接https://arxiv.org/abs/2510.24776

作者:Gousia Habib,Aniket Bhardwaj,Ritvik Sharma,Shoeib Amin Banday,Ishfaq Ahmad Malik

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

关键词:face challenges due, Secure and reliable, effective patient treatment, centralized models face, models face challenges

备注

点击查看摘要

Abstract:Secure and reliable medical image classification is crucial for effective patient treatment, but centralized models face challenges due to data and privacy concerns. Federated Learning (FL) enables privacy-preserving collaborations but struggles with heterogeneous, non-IID data and high communication costs, especially in large networks. We propose \textbf{CFL-SparseMed}, an FL approach that uses Top-k Sparsification to reduce communication overhead by transmitting only the top k gradients. This unified solution effectively addresses data heterogeneity while maintaining model accuracy. It enhances FL efficiency, preserves privacy, and improves diagnostic accuracy and patient care in non-IID medical imaging settings. The reproducibility source code is available on \href{this https URL}{Github}.

87. 【2510.24770】DMVFC: Deep Learning Based Functionally Consistent Tractography Fiber Clustering Using Multimodal Diffusion MRI and Functional MRI

链接https://arxiv.org/abs/2510.24770

作者:Bocheng Guo,Jin Wang,Yijie Li,Junyi Wang,Mingyu Gao,Puming Feng,Yuqian Chen,Jarrett Rushmore,Nikos Makris,Yogesh Rathi,Lauren J O'Donnell,Fan Zhang

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Tractography fiber clustering, brains structural connectivity, fiber clustering, Tractography fiber, diffusion MRI

备注: 11 pages

点击查看摘要

Abstract:Tractography fiber clustering using diffusion MRI (dMRI) is a crucial method for white matter (WM) parcellation to enable analysis of brains structural connectivity in health and disease. Current fiber clustering strategies primarily use the fiber geometric characteristics (i.e., the spatial trajectories) to group similar fibers into clusters, while neglecting the functional and microstructural information of the fiber tracts. There is increasing evidence that neural activity in the WM can be measured using functional MRI (fMRI), providing potentially valuable multimodal information for fiber clustering to enhance its functional coherence. Furthermore, microstructural features such as fractional anisotropy (FA) can be computed from dMRI as additional information to ensure the anatomical coherence of the clusters. In this paper, we develop a novel deep learning fiber clustering framework, namely Deep Multi-view Fiber Clustering (DMVFC), which uses joint multi-modal dMRI and fMRI data to enable functionally consistent WM parcellation. DMVFC can effectively integrate the geometric and microstructural characteristics of the WM fibers with the fMRI BOLD signals along the fiber tracts. DMVFC includes two major components: (1) a multi-view pretraining module to compute embedding features from each source of information separately, including fiber geometry, microstructure measures, and functional signals, and (2) a collaborative fine-tuning module to simultaneously refine the differences of embeddings. In the experiments, we compare DMVFC with two state-of-the-art fiber clustering methods and demonstrate superior performance in achieving functionally meaningful and consistent WM parcellation results.