本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表，以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新1095篇论文，其中：

自然语言处理159篇
信息检索22篇
计算机视觉211篇

自然语言处理

1. 【2510.17800】Glyph: Scaling Context Windows via Visual-Text Compression

作者：Jiale Cheng,Yusen Liu,Xinyu Zhang,Yulin Fei,Wenyi Hong,Ruiliang Lyu,Weihan Wang,Zhe Su,Xiaotao Gu,Xiao Liu,Yushi Bai,Jie Tang,Hongning Wang,Minlie Huang

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Large language models, Large language, increasingly rely, multi-step reasoning, Large

备注：

点击查看摘要

Abstract:Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective-visual context scaling-to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at this https URL.

2. 【2510.17797】Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

链接：https://arxiv.org/abs/2510.17797

作者：Akshara Prabhakar,Roshan Ram,Zixiang Chen,Silvio Savarese,Frank Wang,Caiming Xiong,Huan Wang,Weiran Yao

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：information grows exponentially, face increasing pressure, transform unstructured data, enterprises face increasing, Master Planning Agent

备注： Technical report; 13 pages plus references and appendices

点击查看摘要

Abstract:As information grows exponentially, enterprises face increasing pressure to transform unstructured data into coherent, actionable insights. While autonomous agents show promise, they often struggle with domain-specific nuances, intent alignment, and enterprise integration. We present Enterprise Deep Research (EDR), a multi-agent system that integrates (1) a Master Planning Agent for adaptive query decomposition, (2) four specialized search agents (General, Academic, GitHub, LinkedIn), (3) an extensible MCP-based tool ecosystem supporting NL2SQL, file analysis, and enterprise workflows, (4) a Visualization Agent for data-driven insights, and (5) a reflection mechanism that detects knowledge gaps and updates research direction with optional human-in-the-loop steering guidance. These components enable automated report generation, real-time streaming, and seamless enterprise deployment, as validated on internal datasets. On open-ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state-of-the-art agentic systems without any human steering. We release the EDR framework and benchmark trajectories to advance research on multi-agent reasoning applications. Code at this https URL and Dataset at this https URL

Comments:
Technical report; 13 pages plus references and appendices

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2510.17797 [cs.CL]

(or
arXiv:2510.17797v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2510.17797

Focus to learn more

              arXiv-issued DOI via DataCite</p>

3. 【2510.17795】Executable Knowledge Graphs for Replicating AI Research

链接：https://arxiv.org/abs/2510.17795

作者：Yujie Luo,Zhuoyun Yu,Xuehai Wang,Yuqi Zhu,Ningyu Zhang,Lanning Wei,Lun Du,Da Zheng,Huajun Chen

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Software Engineering (cs.SE)

关键词：large language model, Replicating AI research, language model, crucial yet challenging, challenging task

备注： Work in progress

点击查看摘要

Abstract:Replicating AI research is a crucial yet challenging task for large language model (LLM) agents. Existing approaches often struggle to generate executable code, primarily due to insufficient background knowledge and the limitations of retrieval-augmented generation (RAG) methods, which fail to capture latent technical details hidden in referenced papers. Furthermore, previous approaches tend to overlook valuable implementation-level code signals and lack structured knowledge representations that support multi-granular retrieval and reuse. To overcome these challenges, we propose Executable Knowledge Graphs (xKG), a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. When integrated into three agent frameworks with two different LLMs, xKG shows substantial performance gains (10.9% with o3-mini) on PaperBench, demonstrating its effectiveness as a general and extensible solution for automated AI research replication. Code will released at this https URL.

4. 【2510.17793】Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

链接：https://arxiv.org/abs/2510.17793

作者：Austin Xu,Xuan-Phi Nguyen,Yilun Zhou,Chien-Sheng Wu,Caiming Xiong,Shafiq Joty

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Finetuning specialized generative, popular paradigm, paradigm to meet, meet the increasing, increasing demand

备注： 29 pages, 9 tables, 6 figures

点击查看摘要

Abstract:Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.

5. 【2510.17790】UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

链接：https://arxiv.org/abs/2510.17790

作者：Yuhao Yang,Zhen Yang,Zi-Yi Dou,Anh Nguyen,Keen You,Omar Attia,Andrew Szot,Michael Feng,Ram Ramrakhya,Alexander Toshev,Chao Huang,Yinfei Yang,Zhe Gan

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：require accurate visual, accurate visual grounding, lengthy execution chains, Multimodal agents, programmatic tool calls

备注：

点击查看摘要

Abstract:Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.

6. 【2510.17776】Mapping Post-Training Forgetting in Language Models at Scale

链接：https://arxiv.org/abs/2510.17776

作者：Jackson Harmon,Andreas Hochlehnert,Matthias Bethge,Ameya Prabhu

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：remains poorly understood, largest capability gains, knowledge remains poorly, Scaled post-training, backward transfer

备注： 43 pages,15 figures

点击查看摘要

Abstract:Scaled post-training now drives many of the largest capability gains in language models (LMs), yet its effect on pretrained knowledge remains poorly understood. Not all forgetting is equal: Forgetting one fact (e.g., a U.S. president or an API call) does not "average out" by recalling another. Hence, we propose a sample-wise paradigm to measure what is forgotten and when backward transfer occurs. Our metric counts 1-0 transitions (correct before post-training, incorrect after) to quantify forgetting and 0-1 transitions to quantify backward transfer. Traditional task averages conflate these effects and obscure large changes. For multiple-choice benchmarks, we add chance-adjusted variants that subtract the expected contribution of random guessing from pre- and post-training accuracies. We apply this framework across post-training stages, model sizes, and data scales. Our large-scale analysis shows that: (1) Domain-continual pretraining induces moderate forgetting with low-to-moderate backward transfer; (2) RL/SFT post-training applied to base models and Instruction tuning yields moderate-to-large backward transfer on math and logic with overall low-to-moderate forgetting; (3) Applying RL/SFT to instruction-tuned models is sensitive on data scale: at small scales, both forgetting and backward transfer are small; at larger scales, effects are mixed and warrant further study with better controls; (4) Model merging does not reliably mitigate forgetting. Overall, our framework offers a practical yardstick for mapping how post-training alters pretrained knowledge at scale -- enabling progress towards generally capable AI systems.

7. 【2510.17764】Evaluating Medical LLMs by Levels of Autonomy: A Survey Moving from Benchmarks to Applications

链接：https://arxiv.org/abs/2510.17764

作者：Xiao Ye,Jacob Dineen,Zhaonan Li,Zhikun Xu,Weiyu Chen,Shijie Lu,Yuxi Huang,Ming Shen,Phu Tran,Ji-Eun Irene Yum,Muhammad Ali Khan,Muhammad Umar Afzal,Irbaz Bin Riaz,Ben Zhou

类目：Computation and Language (cs.CL)

关键词：Medical Large language, Large language models, Medical Large, language models achieve, models achieve strong

备注：

点击查看摘要

Abstract:Medical Large language models achieve strong scores on standard benchmarks; however, the transfer of those results to safe and reliable performance in clinical workflows remains a challenge. This survey reframes evaluation through a levels-of-autonomy lens (L0-L3), spanning informational tools, information transformation and aggregation, decision support, and supervised agents. We align existing benchmarks and metrics with the actions permitted at each level and their associated risks, making the evaluation targets explicit. This motivates a level-conditioned blueprint for selecting metrics, assembling evidence, and reporting claims, alongside directions that link evaluation to oversight. By centering autonomy, the survey moves the field beyond score-based claims toward credible, risk-aware evidence for real clinical use.

8. 【2510.17759】VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

链接：https://arxiv.org/abs/2510.17759

作者：Qilin Liao,Anamika Lochab,Ruqi Zhang

类目：Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：extend large language, large language models, extend large, visual reasoning, large language

备注： 18 pages, 7 Figures,

点击查看摘要

Abstract:Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.

9. 【2510.17733】rain for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations

链接：https://arxiv.org/abs/2510.17733

作者：Tong Chen,Akari Asai,Luke Zettlemoyer,Hannaneh Hajishirzi,Faeze Brahman

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：incorrect information unsupported, Language models, information unsupported, Language, training data

备注：

点击查看摘要

Abstract:Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.

10. 【2510.17725】AcademicEval: Live Long-Context LLM Benchmark

链接：https://arxiv.org/abs/2510.17725

作者：Haozhen Zhang,Tao Feng,Pengrui Han,Jiaxuan You

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Large Language Models, Large Language, Language Models, recently achieved remarkable, achieved remarkable performance

备注： Accepted by TMLR. Code is available at [this https URL](https://github.com/ulab-uiuc/AcademicEval)

点击查看摘要

Abstract:Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage issues during LLM training. Therefore, we propose \textsc{AcademicEval}, a live benchmark for evaluating LLMs over long-context generation tasks. \textsc{AcademicEval} adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, \textit{i.e.}, \textsc{Title}, \textsc{Abstract}, \textsc{Introduction}, and \textsc{Related Work}, which cover a wide range of abstraction levels and require no manual labeling. Moreover, \textsc{AcademicEval} integrates high-quality and expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. Especially, \textsc{AcademicEval} features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on \textsc{AcademicEval}, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities. Code is available at this https URL

11. 【2510.17720】PANER: A Paraphrase-Augmented Framework for Low-Resource Named Entity Recognition

链接：https://arxiv.org/abs/2510.17720

作者：Nanda Kumar Rengarajan,Jun Yan,Chun Wang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Named Entity Recognition, requires substantial annotated, substantial annotated data, Named Entity, Entity Recognition

备注：

点击查看摘要

Abstract:Named Entity Recognition (NER) is a critical task that requires substantial annotated data, making it challenging in low-resource scenarios where label acquisition is expensive. While zero-shot and instruction-tuned approaches have made progress, they often fail to generalize to domain-specific entities and do not effectively utilize limited available data. We present a lightweight few-shot NER framework that addresses these challenges through two key innovations: (1) a new instruction tuning template with a simplified output format that combines principles from prior IT approaches to leverage the large context window of recent state-of-the-art LLMs; (2) introducing a strategic data augmentation technique that preserves entity information while paraphrasing the surrounding context, thereby expanding our training data without compromising semantic relationships. Experiments on benchmark datasets show that our method achieves performance comparable to state-of-the-art models on few-shot and zero-shot tasks, with our few-shot approach attaining an average F1 score of 80.1 on the CrossNER datasets. Models trained with our paraphrasing approach show consistent improvements in F1 scores of up to 17 points over baseline versions, offering a promising solution for groups with limited NER training data and compute power.

12. 【2510.17715】QueST: Incentivizing LLMs to Generate Difficult Problems

链接：https://arxiv.org/abs/2510.17715

作者：Hanxu Hu,Xingxing Zhang,Jannis Vamvas,Rico Sennrich,Furu Wei

类目：Computation and Language (cs.CL)

关键词：solving competition-level coding, solving competition-level, problems, coding, Large Language Models

备注： 20 pages, 7 figures

点击查看摘要

Abstract:Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problem training data. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning that directly optimizes specialized generators to create challenging coding problems. Our trained generators demonstrate superior capability compared to even GPT-4o at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models, proving effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.

13. 【2510.17705】Contextual Attention Modulation: Towards Efficient Multi-Task Adaptation in Large Language Models

链接：https://arxiv.org/abs/2510.17705

作者：Dayan Pan,Zhaoyang Fu,Jingyuan Wang,Xiao Han,Yue Zhu,Xiangyu Zhao

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：Large Language Models, Large Language, Language Models, possess remarkable generalization, remarkable generalization capabilities

备注： Accepted by CIKM' 25

点击查看摘要

Abstract:Large Language Models (LLMs) possess remarkable generalization capabilities but struggle with multi-task adaptation, particularly in balancing knowledge retention with task-specific specialization. Conventional fine-tuning methods suffer from catastrophic forgetting and substantial resource consumption, while existing parameter-efficient methods perform suboptimally in complex multi-task scenarios. To address this, we propose Contextual Attention Modulation (CAM), a novel mechanism that dynamically modulates the representations of self-attention modules in LLMs. CAM enhances task-specific features while preserving general knowledge, thereby facilitating more effective and efficient adaptation. For effective multi-task adaptation, CAM is integrated into our Hybrid Contextual Attention Modulation (HyCAM) framework, which combines a shared, full-parameter CAM module with multiple specialized, lightweight CAM modules, enhanced by a dynamic routing strategy for adaptive knowledge fusion. Extensive experiments on heterogeneous tasks, including question answering, code generation, and logical reasoning, demonstrate that our approach significantly outperforms existing approaches, achieving an average performance improvement of 3.65%. The implemented code and data are available to ease reproducibility at this https URL.

14. 【2510.17698】owards Mining Effective Pedagogical Strategies from Learner-LLM Educational Dialogues

链接：https://arxiv.org/abs/2510.17698

作者：Liqun He,Manolis Mavrikis,Mutlu Cukurova

类目：Computation and Language (cs.CL)

关键词：existing evaluation methods, AIED Doctoral Consortium, large language models, Doctoral Consortium paper, primarily focus

备注：

点击查看摘要

Abstract:Dialogue plays a crucial role in educational settings, yet existing evaluation methods for educational applications of large language models (LLMs) primarily focus on technical performance or learning outcomes, often neglecting attention to learner-LLM interactions. To narrow this gap, this AIED Doctoral Consortium paper presents an ongoing study employing a dialogue analysis approach to identify effective pedagogical strategies from learner-LLM dialogues. The proposed approach involves dialogue data collection, dialogue act (DA) annotation, DA pattern mining, and predictive model building. Early insights are outlined as an initial step toward future research. The work underscores the need to evaluate LLM-based educational applications by focusing on dialogue dynamics and pedagogical strategies.

15. 【2510.17671】LILO: Bayesian Optimization with Interactive Natural Language Feedback

链接：https://arxiv.org/abs/2510.17671

作者：Katarzyna Kobalczyk,Zhiyuan Jerry Lin,Benjamin Letham,Zhuokai Zhao,Maximilian Balandat,Eytan Bakshy

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：quantifiable optimization objectives, real-world applications, translating complex, optimization objectives, essential in translating

备注：

点击查看摘要

Abstract:For many real-world applications, feedback is essential in translating complex, nuanced, or subjective goals into quantifiable optimization objectives. We propose a language-in-the-loop framework that uses a large language model (LLM) to convert unstructured feedback in the form of natural language into scalar utilities to conduct BO over a numeric search space. Unlike preferential BO, which only accepts restricted feedback formats and requires customized models for each domain-specific problem, our approach leverages LLMs to turn varied types of textual feedback into consistent utility signals and to easily include flexible user priors without manual kernel design. At the same time, our method maintains the sample efficiency and principled uncertainty quantification of BO. We show that this hybrid method not only provides a more natural interface to the decision maker but also outperforms conventional BO baselines and LLM-only optimizers, particularly in feedback-limited regimes.

16. 【2510.17662】DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model

链接：https://arxiv.org/abs/2510.17662

作者：Massa Baali,Rita Singh,Bhiksha Raj

类目：ound (cs.SD); Computation and Language (cs.CL)

关键词：achieved remarkable success, capturing speaker-discriminative features, speaker-discriminative features critical, achieved remarkable, remarkable success

备注：

点击查看摘要

Abstract:Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised foundational model that addresses this limitation by integrating external supervision into the pseudo-label generation process. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. DELULU significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks such as gender, age, accent, and speaker counting. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.

17. 【2510.17652】Qomhra: A Bilingual Irish-English Large Language Model

链接：https://arxiv.org/abs/2510.17652

作者：Joseph McInerney

类目：Computation and Language (cs.CL)

关键词：bilingual continued pre-training, pipeline spanning bilingual, spanning bilingual continued, large language model, low-resource constraints presenting

备注：

点击查看摘要

Abstract:This paper introduces Qomhrá, a bilingual Irish-English large language model (LLM), developed under low-resource constraints presenting a complete pipeline spanning bilingual continued pre-training, instruction tuning, and alignment from human preferences. Newly accessible Irish corpora and English text are mixed and curated to improve Irish performance while preserving English ability. 6 closed-weight LLMs are judged for their Irish text generation by a native speaker, a learner and other LLMs. Google's Gemini-2.5-Pro is ranked the highest and is subsequently used to synthesise instruction tuning and human preference datasets. Two datasets are contributed leveraging Gemini-2.5-Pro: a 30K Irish-English parallel instruction tuning dataset and a 1K human preference dataset, generating accepted and rejected responses that show near perfect alignment with a native Irish speaker. Qomhrá is comprehensively evaluated across benchmarks testing translation, gender understanding, topic identification and world knowledge with gains of up to 29% in Irish and 44% in English. Qomhrá also undergoes instruction tuning and demonstrates clear progress in instruction following, crucial for chatbot functionality.

18. 【2510.17638】LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena

链接：https://arxiv.org/abs/2510.17638

作者：Qingchuan Yang,Simon Mahns,Sida Li,Anri Gu,Jibang Wu,Haifeng Xu

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：fundamental intellectual pursuit, finance and economics, fundamental intellectual, intellectual pursuit, significant importance

备注： [this https URL](https://www.prophetarena.co/)

点击查看摘要

Abstract:Forecasting is not only a fundamental intellectual pursuit but also is of significant importance to societal systems such as finance and economics. With the rapid advances of large language models (LLMs) trained on Internet-scale data, it raises the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call "LLM-as-a-Prophet". This paper systematically investigates such predictive intelligence of LLMs. To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support our controlled and large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence and promising market returns. However, we also uncover key bottlenecks towards achieving superior predictive intelligence via LLM-as-a-Prophet, such as LLMs' inaccurate event recalls, misunderstanding of data sources and slower information aggregation compared to markets when resolution nears.

19. 【2510.17620】Forget to Know, Remember to Use: Context-Aware Unlearning for Large Language Models

链接：https://arxiv.org/abs/2510.17620

作者：Yuefeng Peng,Parnian Afshar,Megan Ganji,Thomas Butler,Amir Houmansadr,Mingxian Wang,Dezhi Hong

类目：Computation and Language (cs.CL)

关键词：Large language models, Large language, compliant model responses, encode sensitive information, encode sensitive

备注：

点击查看摘要

Abstract:Large language models may encode sensitive information or outdated knowledge that needs to be removed, to ensure responsible and compliant model responses. Unlearning has emerged as an efficient alternative to full retraining, aiming to remove specific knowledge while preserving overall model utility. Existing evaluations of unlearning methods focus on (1) the extent of forgetting of the target knowledge (forget set) and (2) maintaining performance on the retain set (i.e., utility). However, these evaluations overlook an important usability aspect: users may still want the model to leverage the removed information if it is re-introduced in the prompt. In a systematic evaluation of six state-of-the-art unlearning methods, we find that they consistently impair such contextual utility. To address this, we augment unlearning objectives with a plug-in term that preserves the model's ability to use forgotten knowledge when it is present in context. Extensive experiments demonstrate that our approach restores contextual utility to near original levels while still maintaining effective forgetting and retain-set utility.

20. 【2510.17602】LawChain: Modeling Legal Reasoning Chains for Chinese Tort Case Analysis

链接：https://arxiv.org/abs/2510.17602

作者：Huiyuan Xie,Chenyang Li,Huining Zhu,Chubin Zhang,Yuxiao Ye,Zhenghao Liu,Zhiyuan Liu

类目：Computation and Language (cs.CL)

关键词：Legal reasoning, Legal, reasoning, modeling legal reasoning, legal analysis

备注：

点击查看摘要

Abstract:Legal reasoning is a fundamental component of legal analysis and decision-making. Existing computational approaches to legal reasoning predominantly rely on generic reasoning frameworks such as syllogism and IRAC, which do not comprehensively examine the nuanced processes that underpin legal reasoning. Moreover, current research has largely focused on criminal cases, with insufficient modeling for civil cases. In this work, we present a novel framework for explicitly modeling legal reasoning in the analysis of Chinese tort-related civil cases. We first operationalize the legal reasoning processes used in tort analysis into the LawChain framework. LawChain is a three-module reasoning framework, with each module consisting of multiple finer-grained sub-steps. Informed by the LawChain framework, we introduce the task of tort legal reasoning and construct an evaluation benchmark, LawChain$_{eval}$, to systematically assess the critical steps within analytical reasoning chains for tort analysis. Leveraging this benchmark, we evaluate state-of-the-art large language models for their legal reasoning ability in civil tort contexts. Our results indicate that current models still fall short in accurately handling crucial elements of tort legal reasoning. Furthermore, we introduce several baseline approaches that explicitly incorporate LawChain-style reasoning through prompting or post-training. We conduct further experiments on additional legal analysis tasks, such as Legal Named-Entity Recognition and Criminal Damages Calculation, to verify the generalizability of these baselines. The proposed baseline approaches achieve significant improvements in tort-related legal reasoning and generalize well to related legal analysis tasks, thus demonstrating the value of explicitly modeling legal reasoning chains to enhance the reasoning capabilities of language models.

21. 【2510.17598】Reasoning Distillation and Structural Alignment for Improved Code Generation

链接：https://arxiv.org/abs/2510.17598

作者：Amir Jalilifard,Anderson de Rezende Rocha,Marcos Medeiros Raimundo

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Effective code generation, passing diverse test, diverse test cases, target programming language, Effective code

备注：

点击查看摘要

Abstract:Effective code generation with language models hinges on two critical factors: accurately understanding the intent of the prompt and generating code that applies algorithmic reasoning to produce correct solutions capable of passing diverse test cases while adhering to the syntax of the target programming language. Unlike other language tasks, code generation requires more than accurate token prediction; it demands comprehension of solution-level and structural relationships rather than merely generating the most likely tokens. very large language model (VLLM) are capable of generating detailed steps toward the correct solution of complex tasks where reasoning is crucial in solving the problem. Such reasoning capabilities may be absent in smaller language models. Therefore, in this work, we distill the reasoning capabilities of a VLLM into a smaller, more efficient model that is faster and cheaper to deploy. Our approach trains the model to emulate the reasoning and problem-solving abilities of the VLLM by learning to identify correct solution pathways and establishing a structural correspondence between problem definitions and potential solutions through a novel method of structure-aware loss optimization. This enables the model to transcend token-level generation and to deeply grasp the overarching structure of solutions for given problems. Experimental results show that our fine-tuned model, developed through a cheap and simple to implement process, significantly outperforms our baseline model in terms of pass@1, average data flow, and average syntax match metrics across the MBPP, MBPP Plus, and HumanEval benchmarks.

22. 【2510.17591】HGAdapter: Hypergraph-based Adapters in Language Models for Code Summarization and Clone Detection

链接：https://arxiv.org/abs/2510.17591

作者：Guang Yang,Yujie Zhu

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

关键词：Pre-trained language models, high-order data correlations, Pre-trained language, high-order data, data correlations

备注： Accepted by the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) as a findings long paper

点击查看摘要

Abstract:Pre-trained language models (PLMs) are increasingly being applied to code-related tasks. Although PLMs have achieved good results, they do not take into account potential high-order data correlations within the code. We propose three types of high-order correlations in code tokens, i.e. abstract syntax tree family correlation, lexical correlation, and line correlation. We design a tokens and hyperedges generator to capture these high-order data correlations. We improve the architecture of hypergraph neural networks and combine it with adapter tuning to propose a novel hypergraph-based adapter (HGAdapter) to fine-tune PLMs. HGAdapter can encode high-order data correlations and is allowed to be inserted into various PLMs to enhance performance. Experiments were conducted on several public datasets, including six languages of code summarization and code clone detection tasks. Our methods improved the performance of PLMs in datasets to varying degrees. Experimental results validate the introduction of high-order data correlations that contribute to improved effectiveness.

23. 【2510.17590】MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning

链接：https://arxiv.org/abs/2510.17590

作者：Mir Nafis Sharear Shopnil,Sharad Duwal,Abhishek Tyagi,Adiba Mahbub Proma

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词：overwhelming manual fact-checking, manual fact-checking capacity, daily multimodal posts, overwhelming manual, fact-checking capacity

备注： 16 pages, 3 tables, 1 figure

点击查看摘要

Abstract:Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, overwhelming manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to generalize across diverse manipulation tactics. We present MIRAGE, an inference-time, model-pluggable agentic framework that decomposes multimodal verification into four sequential modules: visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence through iterative question generation, and a calibrated judgment module integrates all signals. MIRAGE orchestrates vision-language model reasoning with targeted web retrieval, outputs structured and citation-linked rationales. On MMFakeBench validation set (1,000 samples), MIRAGE with GPT-4o-mini achieves 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline (GPT-4V with MMD-Agent at 74.0% F1) by 7.65 points while maintaining 34.3% false positive rate versus 97.3% for a judge-only baseline. Test set results (5,000 samples) confirm generalization with 81.44% F1 and 75.08% accuracy. Ablation studies show visual verification contributes 5.18 F1 points and retrieval-augmented reasoning contributes 2.97 points. Our results demonstrate that decomposed agentic reasoning with web retrieval can match supervised detector performance without domain-specific training, enabling misinformation detection across modalities where labeled data remains scarce.

24. 【2510.17555】Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation

链接：https://arxiv.org/abs/2510.17555

作者：Collin Zhang,Fei Huang,Chenhan Yuan,Junyang Lin

类目：Computation and Language (cs.CL)

关键词：Large language models, Large language, text generation, experience language confusion, language confusion

备注：

点击查看摘要

Abstract:Large language models (LLMs) often experience language confusion, which is the unintended mixing of languages during text generation. Current solutions to this problem either necessitate model retraining or cannot differentiate between harmful confusion and acceptable code-switching. This paper introduces the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters tokens during decoding without altering the base LLM. The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed. Our method is based on the findings that language confusion is infrequent, correct-language tokens are usually among the top predictions, and output token embedding norms are larger for high-resource languages, which biases sampling. When evaluated across various models, including Qwen3, GPT-OSS, Gemma3, Llama3.1, LCG decreases language confusion significantly, often by an order of magnitude, without negatively impacting task performance. Code is available at this https URL.

25. 【2510.17548】When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity

链接：https://arxiv.org/abs/2510.17548

作者：Nisrine Rair,Alban Goupil,Valeriu Vrabie,Emmanuel Chochoy

类目：Computation and Language (cs.CL)

关键词：human annotators disagree, internally represent ambiguity, models internally represent, annotators disagree, evaluated with scalar

备注： Accepted to appear in the Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025, Main Conference)

点击查看摘要

Abstract:Language models are often evaluated with scalar metrics like accuracy, but such measures fail to capture how models internally represent ambiguity, especially when human annotators disagree. We propose a topological perspective to analyze how fine-tuned models encode ambiguity and more generally instances. Applied to RoBERTa-Large on the MD-Offense dataset, Mapper, a tool from topological data analysis, reveals that fine-tuning restructures embedding space into modular, non-convex regions aligned with model predictions, even for highly ambiguous cases. Over $98\%$ of connected components exhibit $\geq 90\%$ prediction purity, yet alignment with ground-truth labels drops in ambiguous data, surfacing a hidden tension between structural confidence and label uncertainty. Unlike traditional tools such as PCA or UMAP, Mapper captures this geometry directly uncovering decision regions, boundary collapses, and overconfident clusters. Our findings position Mapper as a powerful diagnostic tool for understanding how models resolve ambiguity. Beyond visualization, it also enables topological metrics that may inform proactive modeling strategies in subjective NLP tasks.

Comments:
Accepted to appear in the Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025, Main Conference)

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2510.17548 [cs.CL]

(or
arXiv:2510.17548v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2510.17548

Focus to learn more

              arXiv-issued DOI via DataCite</p>

26. 【2510.17532】OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction

链接：https://arxiv.org/abs/2510.17532

作者：Raghu Vamshi Hemadri,Geetha Krishna Guruju,Kristi Topollai,Anna Ewa Choromanska

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Predicting cancer treatment, Predicting cancer, cancer treatment outcomes, treatment outcomes requires, heterogeneous clinical data

备注：

点击查看摘要

Abstract:Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.

27. 【2510.17516】SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

链接：https://arxiv.org/abs/2510.17516

作者：Tiancheng Hu,Joachim Baumann,Lorenzo Lupo,Dirk Hovy,Nigel Collier,Paul Röttger

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词：real human behaviors, reflect real human, faithfully reflect real, human behavior, real human

备注： Project Website: [this http URL](http://simbench.tiancheng.hu/) Data: [this https URL](https://huggingface.co/datasets/pitehu/SimBench)

点击查看摘要

Abstract:Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.

28. 【2510.17509】Annotation-Efficient Universal Honesty Alignment

链接：https://arxiv.org/abs/2510.17509

作者：Shiyu Ni,Keping Bi,Jiafeng Guo,Minghao Tang,Jingtong Wu,Zengxin Han,Xueqi Cheng

类目：Computation and Language (cs.CL)

关键词：large language models, express calibrated confidence-is, calibrated confidence-is essential, Honesty alignment-the ability, language models

备注：

点击查看摘要

Abstract:Honesty alignment-the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence-is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or training-based calibration with correctness annotations. While effective, achieving universal honesty alignment with training-based calibration requires costly, large-scale labeling. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.

29. 【2510.17504】Lingua Custodi's participation at the WMT 2025 Terminology shared task

链接：https://arxiv.org/abs/2510.17504

作者：Jingshu Liu,Raheel Qader,Gaëtan Caillaut,Mariam Nakhlé

类目：Computation and Language (cs.CL)

关键词：learning BERT based, BERT based cross-lingual, transfer learning BERT, BERT based, learning BERT

备注：

点击查看摘要

Abstract:While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding based transfer learning BERT based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations including: masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax. We show that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5 achieved by LASER, while still performing competitively on monolingual transfer learning benchmarks. Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at this https URL.

30. 【2510.17498】Deep Self-Evolving Reasoning

链接：https://arxiv.org/abs/2510.17498

作者：Zihan Liu,Shun Zheng,Xumeng Wen,Yang Wang,Jiang Bian,Mao Yang

类目：Computation and Language (cs.CL)

关键词：large language models, cornerstone of advanced, large language, Long-form, DSER

备注：

点击查看摘要

Abstract:Long-form chain-of-thought reasoning has become a cornerstone of advanced reasoning in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in open-weight, smaller-scale models. This work demonstrates that even with weak verification and refinement capabilities on hard tasks, the reasoning limits of such models can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and boosts overall performance, enabling this compact model to surpass the single-turn accuracy of its 600B-parameter teacher through majority voting. Beyond its immediate utility for test-time scaling, the DSER framework serves to diagnose the fundamental limitations of current open-weight reasoners. By clearly delineating their shortcomings in self-verification, refinement, and stability, our findings establish a clear research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.

31. 【2510.17491】Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents

链接：https://arxiv.org/abs/2510.17491

作者：Yihong Tang,Kehai Chen,Liang Yue,Jinxin Fan,Caishen Zhou,Xiaoguang Li,Yuyang Zhang,Mingming Zhao,Shixiong Kai,Kaiyang Guo,Xingshan Zeng,Wenjing Cun,Lifeng Shang,Min Zhang

类目：Computation and Language (cs.CL)

关键词：large language models, LLM agents capable, language models, rise of large, large language

备注：

点击查看摘要

Abstract:With the rise of large language models (LLMs), LLM agents capable of autonomous reasoning, planning, and executing complex tasks have become a frontier in artificial intelligence. However, how to translate the research on general agents into productivity that drives industry transformations remains a significant challenge. To address this, this paper systematically reviews the technologies, applications, and evaluation methods of industry agents based on LLMs. Using an industry agent capability maturity framework, it outlines the evolution of agents in industry applications, from "process execution systems" to "adaptive social systems." First, we examine the three key technological pillars that support the advancement of agent capabilities: Memory, Planning, and Tool Use. We discuss how these technologies evolve from supporting simple tasks in their early forms to enabling complex autonomous systems and collective intelligence in more advanced forms. Then, we provide an overview of the application of industry agents in real-world domains such as digital engineering, scientific discovery, embodied intelligence, collaborative business execution, and complex system simulation. Additionally, this paper reviews the evaluation benchmarks and methods for both fundamental and specialized capabilities, identifying the challenges existing evaluation systems face regarding authenticity, safety, and industry specificity. Finally, we focus on the practical challenges faced by industry agents, exploring their capability boundaries, developmental potential, and governance issues in various scenarios, while providing insights into future directions. By combining technological evolution with industry practices, this review aims to clarify the current state and offer a clear roadmap and theoretical foundation for understanding and building the next generation of industry agents.

32. 【2510.17489】DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning

链接：https://arxiv.org/abs/2510.17489

作者：Yongxin He,Shan Zhang,Yixuan Cao,Lei Ma,Ping Luo

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Detecting AI-involved text, Detecting AI-involved, combating misinformation, academic misconduct, essential for combating

备注： To appear in NeurIPS 2025

点击查看摘要

Abstract:Detecting AI-involved text is essential for combating misinformation, plagiarism, and academic misconduct. However, AI text generation includes diverse collaborative processes (AI-written text edited by humans, human-written text edited by AI, and AI-generated text refined by other AI), where various or even new LLMs could be involved. Texts generated through these varied processes exhibit complex characteristics, presenting significant challenges for detection. Current methods model these processes rather crudely, primarily employing binary classification (purely human vs. AI-involved) or multi-classification (treating human-AI collaboration as a new class). We observe that representations of texts generated through different processes exhibit inherent clustering relationships. Therefore, we propose DETree, a novel approach that models the relationships among different processes as a Hierarchical Affinity Tree structure, and introduces a specialized loss function that aligns text representations with this tree. To facilitate this learning, we developed RealBench, a comprehensive benchmark dataset that automatically incorporates a wide spectrum of hybrid texts produced through various human-AI collaboration processes. Our method improves performance in hybrid text detection tasks and significantly enhances robustness and generalization in out-of-distribution scenarios, particularly in few-shot learning conditions, further demonstrating the promise of training-based approaches in OOD settings. Our code and dataset are available at this https URL.

33. 【2510.17483】ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts

链接：https://arxiv.org/abs/2510.17483

作者：Zheyue Tan,Zhiyuan Li,Tao Yuan,Dong Zhou,Weilin Liu,Yueqing Zhuang,Yadong Li,Guowei Niu,Cheng Qin,Zhuyu Yao,Congyi Liu,Haiyang Xu,Boxun Li,Guohao Dai,Bo Zhao,Yu Wang

类目：Computation and Language (cs.CL)

关键词：scale Large Language, Large Language Models, scale Large, Large Language, promising approach

备注：

点击查看摘要

Abstract:Mixture-of-Experts (MoE) architectures have emerged as a promising approach to scale Large Language Models (LLMs). MoE boosts the efficiency by activating a subset of experts per token. Recent works show that fine-grained experts substantially enriches the combinatorial flexibility of active experts and enhances model expressiveness. However, such a design is fundamentally limited by the layer-local routing mechanism: each layer is restricted to its own expert pool. This requires a careful trade-off between expert dimensionality and routing diversity given fixed parameter budgets. We describe ReXMoE, a novel MoE architecture that improves routing beyond the existing layer-local approaches by allowing routers to reuse experts across adjacent layers. ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity or inflating overall parameters. To this end, we propose a new progressive scaling routing (PSR) strategy to gradually increase the candidate expert pool during training. As a result, ReXMoE improves both language modeling and downstream task performance. Extensive experiments on models ranging from 0.5B to 7B parameters across different architectures demonstrate that ReXMoE consistently improves performance under fixed architectural dimensions, confirming ReXMoE as new design paradigm for parameter-efficient and scalable MoE-based LLMs.

34. 【2510.17476】Disparities in Multilingual LLM-Based Healthcare QA

链接：https://arxiv.org/abs/2510.17476

作者：Ipek Baris Schlicht,Burcu Sayin,Zhixue Zhao,Frederik M. Labonté,Cesare Barbera,Marco Viviani,Paolo Rosso,Lucie Flek

类目：Computation and Language (cs.CL)

关键词：Large Language Models, reliable health information, multilingual Large Language, Wiki Health Care, access to reliable

备注： Under review

点击查看摘要

Abstract:Equitable access to reliable health information is vital when integrating AI into healthcare. Yet, information quality varies across languages, raising concerns about the reliability and consistency of multilingual Large Language Models (LLMs). We systematically examine cross-lingual disparities in pre-training source and factuality alignment in LLM answers for multilingual healthcare QA across English, German, Turkish, Chinese (Mandarin), and Italian. We (i) constructed Multilingual Wiki Health Care (MultiWikiHealthCare), a multilingual dataset from Wikipedia; (ii) analyzed cross-lingual healthcare coverage; (iii) assessed LLM response alignment with these references; and (iv) conducted a case study on factual alignment through the use of contextual information and Retrieval-Augmented Generation (RAG). Our findings reveal substantial cross-lingual disparities in both Wikipedia coverage and LLM factual alignment. Across LLMs, responses align more with English Wikipedia, even when the prompts are non-English. Providing contextual excerpts from non-English Wikipedia at inference time effectively shifts factual alignment toward culturally relevant knowledge. These results highlight practical pathways for building more equitable, multilingual AI systems for healthcare.

35. 【2510.17460】Evaluating Large Language Models on Urdu Idiom Translation

链接：https://arxiv.org/abs/2510.17460

作者：Muhammad Farmal Khan,Mousumi Akter

类目：Computation and Language (cs.CL)

关键词：limited prior attention, received limited prior, low resource languages, Large Language Models, Native Urdu

备注：

点击查看摘要

Abstract:Idiomatic translation remains a significant challenge in machine translation, especially for low resource languages such as Urdu, and has received limited prior attention. To advance research in this area, we introduce the first evaluation datasets for Urdu to English idiomatic translation, covering both Native Urdu and Roman Urdu scripts and annotated with gold-standard English equivalents. We evaluate multiple open-source Large Language Models (LLMs) and Neural Machine Translation (NMT) systems on this task, focusing on their ability to preserve idiomatic and cultural meaning. Automatic metrics including BLEU, BERTScore, COMET, and XCOMET are used to assess translation quality. Our findings indicate that prompt engineering enhances idiomatic translation compared to direct translation, though performance differences among prompt types are relatively minor. Moreover, cross script comparisons reveal that text representation substantially affects translation quality, with Native Urdu inputs producing more accurate idiomatic translations than Roman Urdu.

36. 【2510.17437】Multilingual Clinical NER for Diseases and Medications Recognition in Cardiology Texts using BERT Embeddings

链接：https://arxiv.org/abs/2510.17437

作者：Manuela Daniela Danu,George Marica,Constantin Suciu,Lucian Mihai Itu,Oladimeji Farri

类目：Computation and Language (cs.CL)

关键词：including patient diagnosis, treatment effects assessment, electronic health record, rapidly increasing volume, unlock biomedical knowledge

备注： 11 pages, 5 figures, 1 table, published in Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024)

点击查看摘要

Abstract:The rapidly increasing volume of electronic health record (EHR) data underscores a pressing need to unlock biomedical knowledge from unstructured clinical texts to support advancements in data-driven clinical systems, including patient diagnosis, disease progression monitoring, treatment effects assessment, prediction of future clinical events, etc. While contextualized language models have demonstrated impressive performance improvements for named entity recognition (NER) systems in English corpora, there remains a scarcity of research focused on clinical texts in low-resource languages. To bridge this gap, our study aims to develop multiple deep contextual embedding models to enhance clinical NER in the cardiology domain, as part of the BioASQ MultiCardioNER shared task. We explore the effectiveness of different monolingual and multilingual BERT-based models, trained on general domain text, for extracting disease and medication mentions from clinical case reports written in English, Spanish, and Italian. We achieved an F1-score of 77.88% on Spanish Diseases Recognition (SDR), 92.09% on Spanish Medications Recognition (SMR), 91.74% on English Medications Recognition (EMR), and 88.9% on Italian Medications Recognition (IMR). These results outperform the mean and median F1 scores in the test leaderboard across all subtasks, with the mean/median values being: 69.61%/75.66% for SDR, 81.22%/90.18% for SMR, 89.2%/88.96% for EMR, and 82.8%/87.76% for IMR.

37. 【2510.17431】Agentic Reinforcement Learning for Search is Unsafe

链接：https://arxiv.org/abs/2510.17431

作者：Yushi Yang,Shreyansh Padarha,Andrew Lee,Adam Mahdi

类目：Computation and Language (cs.CL)

关键词：trains large language, autonomously call tools, Agentic reinforcement learning, large language models, reinforcement learning

备注：

点击查看摘要

Abstract:Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin response with search (Search attack), another that encourages models to repeatedly search (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search.

38. 【2510.17426】Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging

链接：https://arxiv.org/abs/2510.17426

作者：Tiancheng Hu,Benjamin Minixhofer,Nigel Collier

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：post-training is typically, typically framed, drop in task, alignment tax, task accuracy

备注：

点击查看摘要

Abstract:The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model outputs less diverse. We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model's weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations - models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.

39. 【2510.17415】BenCao: An Instruction-Tuned Large Language Model for Traditional Chinese Medicine

链接：https://arxiv.org/abs/2510.17415

作者：Jiacheng Xie,Yang Yu,Yibo Chen,Hanyao Zhang,Lening Zhao,Jiaxuan He,Lei Jiang,Xiaoting Tang,Guanghui An,Dong Xu

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Multimedia (cs.MM); Software Engineering (cs.SE)

关键词：Traditional Chinese Medicine, Chinese Medicine, Traditional Chinese, plays a role, global healthcare

备注：

点击查看摘要

Abstract:Traditional Chinese Medicine (TCM), with a history spanning over two millennia, plays a role in global healthcare. However, applying large language models (LLMs) to TCM remains challenging due to its reliance on holistic reasoning, implicit logic, and multimodal diagnostic cues. Existing TCM-domain LLMs have made progress in text-based understanding but lack multimodal integration, interpretability, and clinical applicability. To address these limitations, we developed BenCao, a ChatGPT-based multimodal assistant for TCM, integrating structured knowledge bases, diagnostic data, and expert feedback refinement. BenCao was trained through natural language instruction tuning rather than parameter retraining, aligning with expert-level reasoning and ethical norms specific to TCM. The system incorporates a comprehensive knowledge base of over 1,000 classical and modern texts, a scenario-based instruction framework for diverse interactions, a chain-of-thought simulation mechanism for interpretable reasoning, and a feedback refinement process involving licensed TCM practitioners. BenCao connects to external APIs for tongue-image classification and multimodal database retrieval, enabling dynamic access to diagnostic resources. In evaluations across single-choice question benchmarks and multimodal classification tasks, BenCao achieved superior accuracy to general-domain and TCM-domain models, particularly in diagnostics, herb recognition, and constitution classification. The model was deployed as an interactive application on the OpenAI GPTs Store, accessed by nearly 1,000 users globally as of October 2025. This study demonstrates the feasibility of developing a TCM-domain LLM through natural language-based instruction tuning and multimodal integration, offering a practical framework for aligning generative AI with traditional medical reasoning and a scalable pathway for real-world deployment.

40. 【2510.17405】AFRICAPTION: Establishing a New Paradigm for Image Captioning in African Languages

链接：https://arxiv.org/abs/2510.17405

作者：Mardiyyah Oduwole,Prince Mireku,Fatimo Adebanjo,Oluwatosin Olajide,Mahi Aminu Aliyu,Jekaterina Novikova

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：hindering the democratization, research has overwhelmingly, overwhelmingly focused, focused on high-resource, democratization of advancements

备注：

点击查看摘要

Abstract:Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages and our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B parameter vision-to-text architecture that integrates SigLIP and NLLB200 for caption generation across under-represented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for under-represented African languages, laying the groundwork for truly inclusive multimodal AI.

41. 【2510.17402】Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine

链接：https://arxiv.org/abs/2510.17402

作者：Jiacheng Xie,Shuai Zeng,Yang Yu,Xiaoting Tang,Guanghui An,Dong Xu

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Traditional Chinese Medicine, structurally unique knowledge, challenges conventional applications, Chinese Medicine, large language models

备注：

点击查看摘要

Abstract:Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs such as GPT-4, Gemini 2.5, Claude 3, and Qwen3 and domain-specific TCM models including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.

42. 【2510.17389】EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs

链接：https://arxiv.org/abs/2510.17389

作者：Numaan Naeem,Abdellah El Mekki,Muhammad Abdul-Mageed

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：explaining complex concepts, Large language models, Large language, answering questions, explaining complex

备注： 28 pages, 2 figures, 14 tables, 50 listings, EMNLP 2025 Main

点击查看摘要

Abstract:Large language models (LLMs) are transforming education by answering questions, explaining complex concepts, and generating content across a wide range of subjects. Despite strong performance on academic benchmarks, they often fail to tailor responses to students' grade levels. This is a critical need in K-12 education, where age-appropriate vocabulary and explanation are essential for effective learning. Existing models frequently produce outputs that are too advanced or vague for younger learners, and there are no standardized benchmarks to evaluate their ability to adjust across cognitive and developmental stages. To address this gap, we introduce EduAdapt, a benchmark of nearly 48k grade-labeled QA pairs across nine science subjects, spanning Grades 1-12 and grouped into four grade levels. We evaluate a diverse set of open-source LLMs on EduAdapt and find that while larger models generally perform better, they still struggle with generating suitable responses for early-grade students (Grades 1-5). Our work presents the first dataset and evaluation framework for assessing grade-level adaptability in LLMs, aiming to foster more developmentally aligned educational AI systems through better training and prompting strategies. EduAdapt code and datasets are publicly available at this https URL.

43. 【2510.17388】he Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives

链接：https://arxiv.org/abs/2510.17388

作者：Henry Lim,Kwan Hui Lim

类目：Computation and Language (cs.CL)

关键词：Instruction-tuned large language, exhibit strong zero-shot, strong zero-shot reasoning, Instruction-tuned large, exhibit strong

备注： 11 pages, 1 figure, 8 tables

点击查看摘要

Abstract:Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, by systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical under four paradigms, namely: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45\% for Roman vs. numeric), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84\%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fail random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.

44. 【2510.17354】owards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

链接：https://arxiv.org/abs/2510.17354

作者：Chenghao Zhang,Guanting Dong,Xinyu Yang,Zhicheng Dou

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词：enhancing large language, large language models, external corpus, powerful paradigm, paradigm for enhancing

备注： This work is in progress

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.

45. 【2510.17289】Addressing Antisocial Behavior in Multi-Party Dialogs Through Multimodal Representation Learning

链接：https://arxiv.org/abs/2510.17289

作者：Hajar Bakarou,Mohamed Sinane El Messoussi,Anaïs Ollagnier

类目：Computation and Language (cs.CL)

关键词：including hate speech, poses growing risks, textit, Antisocial behavior, social media

备注：

点击查看摘要

Abstract:Antisocial behavior (ASB) on social media -- including hate speech, harassment, and cyberbullying -- poses growing risks to platform safety and societal well-being. Prior research has focused largely on networks such as X and Reddit, while \textit{multi-party conversational settings} remain underexplored due to limited data. To address this gap, we use \textit{CyberAgressionAdo-Large}, a French open-access dataset simulating ASB in multi-party conversations, and evaluate three tasks: \textit{abuse detection}, \textit{bullying behavior analysis}, and \textit{bullying peer-group identification}. We benchmark six text-based and eight graph-based \textit{representation-learning methods}, analyzing lexical cues, interactional dynamics, and their multimodal fusion. Results show that multimodal models outperform unimodal baselines. The late fusion model \texttt{mBERT + WD-SGCN} achieves the best overall results, with top performance on abuse detection (0.718) and competitive scores on peer-group identification (0.286) and bullying analysis (0.606). Error analysis highlights its effectiveness in handling nuanced ASB phenomena such as implicit aggression, role transitions, and context-dependent hostility.

46. 【2510.17263】axoAlign: Scholarly Taxonomy Generation Using Language Models

链接：https://arxiv.org/abs/2510.17263

作者：Avishek Lahiri,Yufang Hou,Debarshi Kumar Sanyal

类目：Computation and Language (cs.CL)

关键词：helping researchers structure, hierarchical manner, play a crucial, crucial role, role in helping

备注： This paper has been accepted at the EMNLP 2025 Main Conference

点击查看摘要

Abstract:Taxonomies play a crucial role in helping researchers structure and navigate knowledge in a hierarchical manner. They also form an important part in the creation of comprehensive literature surveys. The existing approaches to automatic survey generation do not compare the structure of the generated surveys with those written by human experts. To address this gap, we present our own method for automated taxonomy creation that can bridge the gap between human-generated and automatically-created taxonomies. For this purpose, we create the CS-TaxoBench benchmark which consists of 460 taxonomies that have been extracted from human-written survey papers. We also include an additional test set of 80 taxonomies curated from conference survey papers. We propose TaxoAlign, a three-phase topic-based instruction-guided method for scholarly taxonomy generation. Additionally, we propose a stringent automated evaluation framework that measures the structural alignment and semantic coherence of automatically generated taxonomies in comparison to those created by human experts. We evaluate our method and various baselines on CS-TaxoBench, using both automated evaluation metrics and human evaluation studies. The results show that TaxoAlign consistently surpasses the baselines on nearly all metrics. The code and data can be found at this https URL.

47. 【2510.17256】Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations

链接：https://arxiv.org/abs/2510.17256

作者：Shahin Atakishiyev,Housam K.B. Babiker,Jiayi Dai,Nawshad Farruque,Teruaki Hayashi,Nafisa Sadaf Hriti,Md Abed Rahman,Iain Smith,Mi-Young Kim,Osmar R. Zaïane,Randy Goebel

类目：Computation and Language (cs.CL)

关键词：exhibited impressive performance, natural language processing, Large language models, language models, Large language

备注：

点击查看摘要

Abstract:Large language models have exhibited impressive performance across a broad range of downstream tasks in natural language processing. However, how a language model predicts the next token and generates content is not generally understandable by humans. Furthermore, these models often make errors in prediction and reasoning, known as hallucinations. These errors underscore the urgent need to better understand and interpret the intricate inner workings of language models and how they generate predictive outputs. Motivated by this gap, this paper investigates local explainability and mechanistic interpretability within Transformer-based large language models to foster trust in such models. In this regard, our paper aims to make three key contributions. First, we present a review of local explainability and mechanistic interpretability approaches and insights from relevant studies in the literature. Furthermore, we describe experimental studies on explainability and reasoning with large language models in two critical domains -- healthcare and autonomous driving -- and analyze the trust implications of such explanations for explanation receivers. Finally, we summarize current unaddressed issues in the evolving landscape of LLM explainability and outline the opportunities, critical challenges, and future directions toward generating human-aligned, trustworthy LLM explanations.

48. 【2510.17252】How News Feels: Understanding Affective Bias in Multilingual Headlines for Human-Centered Media Design

链接：https://arxiv.org/abs/2510.17252

作者：Mohd Ruhul Ameen,Akif Islam,Abu Saleh Musa Miah,Ayesha Siddiqua,Jungpil Shin

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：media often shape, shape the public, public mood, Abstract, reflecting subtle emotional

备注： 15 pages, 7 figures, 4 tables. Submitted to the International Conference on Data and Applied Analytics (IDAA 2025)

点击查看摘要

Abstract:News media often shape the public mood not only by what they report but by how they frame it. The same event can appear calm in one outlet and alarming in another, reflecting subtle emotional bias in reporting. Negative or emotionally charged headlines tend to attract more attention and spread faster, which in turn encourages outlets to frame stories in ways that provoke stronger reactions. This research explores that tendency through large-scale emotion analysis of Bengali news. Using zero-shot inference with Gemma-3 4B, we analyzed 300000 Bengali news headlines and their content to identify the dominant emotion and overall tone of each. The findings reveal a clear dominance of negative emotions, particularly anger, fear, and disappointment, and significant variation in how similar stories are emotionally portrayed across outlets. Based on these insights, we propose design ideas for a human-centered news aggregator that visualizes emotional cues and helps readers recognize hidden affective framing in daily news.

49. 【2510.17247】From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models

链接：https://arxiv.org/abs/2510.17247

作者：Zefan Cai,Haoyi Qiu,Haozhe Zhao,Ke Wan,Jiachen Li,Jiuxiang Gu,Wen Xiao,Nanyun Peng,Junjie Hu

类目：Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：Recent advances, significantly enhanced, Recent, video generation, reward models trained

备注：

点击查看摘要

Abstract:Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (actions and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.

50. 【2510.17238】StreamingThinker: Large Language Models Can Think While Reading

链接：https://arxiv.org/abs/2510.17238

作者：Junlong Tong,Yingqi Fan,Anhao Zhao,Yunpu Ma,Xiaoyu Shen

类目：Computation and Language (cs.CL)

关键词：Large language models, demonstrated remarkable capabilities, Large language, chain of thought, reasoning

备注：

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a \textit{\textbf{streaming thinking}} paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with \textit{StreamingThinker}, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that the StreamingThinker preserves performance comparable to batch thinking, while yielding an 80\% reduction in token waiting before the onset of reasoning and a more than 60\% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. Code will be released at \href{this https URL}{this repository.}

51. 【2510.17210】Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting

链接：https://arxiv.org/abs/2510.17210

作者：Chenchen Tan,Youyang Qu,Xinghao Li,Hui Zhang,Shujie Cui,Cunjian Chen,Longxiang Gao

类目：Computation and Language (cs.CL)

关键词：AI-assisted decision-making boost, large language models, increase in computing, computing power, necessity of AI-assisted

备注： 22 pages, 10 figures

点击查看摘要

Abstract:The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs' reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs' linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearning content. AS realizes these objectives through two attention-level interventions, which are importance-aware suppression applied to the unlearning set to reduce reliance on memorized knowledge and attention-guided retention enhancement that reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. These two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.

52. 【2510.17206】Soft-Masked Diffusion Language Models

链接：https://arxiv.org/abs/2510.17206

作者：Michael Hersche,Samuel Moor-Smith,Thomas Hofmann,Abbas Rahimi

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：traditional autoregressive approaches, demonstrated strong potential, offering various advantages, autoregressive approaches, demonstrated strong

备注：

点击查看摘要

Abstract:Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-$k$ predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that adapts a pretrained masked diffusion language model to incorporate SM. We demonstrate that continuing pretraining a 169M parameter model with SM leads to improved perplexity and MAUVE scores. Furthermore, we finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings.

53. 【2510.17205】$\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs

链接：https://arxiv.org/abs/2510.17205

作者：Yingqi Fan,Anhao Zhao,Jinlan Fu,Junlong Tong,Hui Su,Yijie Pan,Wei Zhang,Xiaoyu Shen

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：Multimodal Large Language, Large Language Models, Large Language, achieved strong performance, significant computational overhead

备注： EMNLP 2025 Main

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, \textit{they lack a fundamental understanding of how MLLMs process and fuse multimodal information.} Through systematic analysis, we uncover a \textbf{three-stage} cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose \emph{VisiPruner}, a training-free pruning framework that reduces up to 99\% of vision-related attention computations and 53.9\% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: this https URL.

54. 【2510.17196】Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models

链接：https://arxiv.org/abs/2510.17196

作者：Jiaqi Leng,Xiang Hu,Junxiong Wang,Jianguo Li,Wei Wu,Yucheng Lu

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Effectively processing long, processing long contexts, long contexts, Bypassing Residual Path, full context due

备注： Preprint. Work in progress

点击查看摘要

Abstract:Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.

55. 【2510.17173】Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users

链接：https://arxiv.org/abs/2510.17173

作者：Melik Ozolcer,Sang Won Bae

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：tool-augmented LLM health, LLM health coach, tool-augmented LLM, LLM health, study a web-deployed

备注： Accepted to the NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models

点击查看摘要

Abstract:We study a web-deployed, tool-augmented LLM health coach with real users. In a pilot with seven users (280 rated turns), offline policy evaluation (OPE) over factorized decision heads (Tool/Style) shows that a uniform heavy-tool policy raises average value on logs but harms specific subgroups, most notably low-health-literacy/high-self-efficacy users. A lightweight simulator with hidden archetypes further shows that adding a small early information-gain bonus reliably shortens trait identification and improves goal success and pass@3. Together, these early findings indicate an evaluation-first path to personalization: freeze the generator, learn subgroup-aware decision heads on typed rewards (objective tool outcomes and satisfaction), and always report per-archetype metrics to surface subgroup harms that averages obscure.

56. 【2510.17168】When AI companions become witty: Can human brain recognize AI-generated irony?

链接：https://arxiv.org/abs/2510.17168

作者：Xiaohui Rao,Hanlin Wu,Zhenguang G. Cai

类目：Computation and Language (cs.CL)

关键词：Large Language Models, Large Language, mere computational output, Language Models, question emerges

备注：

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly deployed as social agents and trained to produce humor and irony, a question emerges: when encountering witty AI remarks, do people interpret these as intentional communication or mere computational output? This study investigates whether people adopt the intentional stance, attributing mental states to explain behavior,toward AI during irony comprehension. Irony provides an ideal paradigm because it requires distinguishing intentional contradictions from unintended errors through effortful semantic reanalysis. We compared behavioral and neural responses to ironic statements from AI versus human sources using established ERP components: P200 reflecting early incongruity detection and P600 indexing cognitive efforts in reinterpreting incongruity as deliberate irony. Results demonstrate that people do not fully adopt the intentional stance toward AI-generated irony. Behaviorally, participants attributed incongruity to deliberate communication for both sources, though significantly less for AI than human, showing greater tendency to interpret AI incongruities as computational errors. Neural data revealed attenuated P200 and P600 effects for AI-generated irony, suggesting reduced effortful detection and reanalysis consistent with diminished attribution of communicative intent. Notably, people who perceived AI as more sincere showed larger P200 and P600 effects for AI-generated irony, suggesting that intentional stance adoption is calibrated by specific mental models of artificial agents. These findings reveal that source attribution shapes neural processing of social-communicative phenomena. Despite current LLMs' linguistic sophistication, achieving genuine social agency requires more than linguistic competence, it necessitates a shift in how humans perceive and attribute intentionality to artificial agents.

57. 【2510.17139】Rethinking On-policy Optimization for Query Augmentation

链接：https://arxiv.org/abs/2510.17139

作者：Zhichao Xu,Shengyao Zhuang,Xueguang Ma,Bingsen Chen,Yijun Tian,Fengran Mo,Jie Cao,Vivek Srikumar

类目：Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词：Recent advances, large language models, advances in large, large language, surge of interest

备注：

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model's parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. While having respective advantages and limitations, the two approaches have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), which, instead of rewriting a query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, thus merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. Our implementation is made available to facilitate reproducibility.

58. 【2510.17132】Do LLMs Recognize Your Latent Preferences? A Benchmark for Latent Information Discovery in Personalized Interaction

链接：https://arxiv.org/abs/2510.17132

作者：Ioannis Tsaknakis,Bingqing Song,Shuyu Gan,Dongyeop Kang,Alfredo Garcia,Gaowen Liu,Charles Fleming,Mingyi Hong

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：Large Language Models, producing broadly relevant, broadly relevant text, Large Language, Language Models

备注：

点击查看摘要

Abstract:Large Language Models (LLMs) excel at producing broadly relevant text, but this generality becomes a limitation when user-specific preferences are required, such as recommending restaurants or planning travel. In these scenarios, users rarely articulate every preference explicitly; instead, much of what they care about remains latent, waiting to be inferred. This raises a fundamental question: Can LLMs uncover and reason about such latent information through conversation? We address this problem by introducing a unified benchmark for evaluating latent information discovery - the ability of LLMs to reveal and utilize hidden user attributes through multi-turn interaction. The benchmark spans three progressively realistic settings: the classic 20 Questions game, Personalized Question Answering, and Personalized Text Summarization. All tasks share a tri-agent framework (User, Assistant, Judge) enabling turn-level evaluation of elicitation and adaptation. Our results reveal that while LLMs can indeed surface latent information through dialogue, their success varies dramatically with context: from 32% to 98%, depending on task complexity, topic, and number of hidden attributes. This benchmark provides the first systematic framework for studying latent information discovery in personalized interaction, highlighting that effective preference inference remains an open frontier for building truly adaptive AI systems.

Subjects:

Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as:
arXiv:2510.17132 [cs.LG]

(or
arXiv:2510.17132v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2510.17132

Focus to learn more

              arXiv-issued DOI via DataCite</p>

59. 【2510.17115】DVAGen: Dynamic Vocabulary Augmented Generation

链接：https://arxiv.org/abs/2510.17115

作者：Wei Du,Nuowei Liu,Jie Wang,Jiahao Kuang,Tao Ji,Xiaoling Wang,Yuanbin Wu

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：diverse token combinations, handling diverse token, fixed vocabulary struggle, Language models trained, limiting their flexibility

备注：

点击查看摘要

Abstract:Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.

60. 【2510.17109】Verification-Aware Planning for Multi-Agent Systems

链接：https://arxiv.org/abs/2510.17109

作者：Tianyang Xu,Dan Zhang,Kushan Mitra,Estevam Hruschka

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词：multiple specialized agents, Large language model, tackle complex tasks, Large language, specialized agents

备注： Submission for ARR Oct

点击查看摘要

Abstract:Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents. However, multi-agent collaboration introduces new challenges in planning, coordination, and verification. Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter-agent handoffs. To address these challenges, we present VeriMAP, a framework for multi-agent collaboration with verification-aware planning. The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria as subtask verification functions (VFs) in Python and natural language. We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single- and multi-agent baselines while enhancing system robustness and interpretability. Our analysis highlights how verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems, without relying on external labels or annotations.

61. 【2510.17062】Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation

链接：https://arxiv.org/abs/2510.17062

作者：Guoqing Luo,Iffat Maab,Lili Mou,Junichi Yamagishi

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：reasoning-based large language, structured thinking process, large language models, language models excel, reasoning-based large

备注：

点击查看摘要

Abstract:While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged that such a thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question answering (BBQ and StereoSet) and open-ended (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.

62. 【2510.17028】Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models

链接：https://arxiv.org/abs/2510.17028

作者：Kyle Cox,Jiawei Xu,Yikun Han,Rong Xu,Tianhao Li,Chi-Yang Hsu,Tianlong Chen,Walter Gerych,Ying Ding

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：interesting behavior, behavior in large, prompt, uncertainty, prompt sensitivity

备注：

点击查看摘要

Abstract:An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model's output distribution for one prompt may not reflect the model's uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic ``concept space'' with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.

63. 【2510.17021】Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

链接：https://arxiv.org/abs/2510.17021

作者：Bingqi Shang,Yiwei Chen,Yihua Zhang,Bingquan Shen,Sijia Liu

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：Large language model, Large language, removing undesired data, general utility, critical mechanism

备注：

点击查看摘要

Abstract:Large language model (LLM) unlearning has become a critical mechanism for removing undesired data, knowledge, or behaviors from pre-trained models while retaining their general utility. Yet, with the rise of open-weight LLMs, we ask: can the unlearning process itself be backdoored, appearing successful under normal conditions yet reverting to pre-unlearned behavior when a hidden trigger is activated? Drawing inspiration from classical backdoor attacks that embed triggers into training data to enforce specific behaviors, we investigate backdoor unlearning, where models forget as intended in the clean setting but recover forgotten knowledge when the trigger appears. We show that designing such attacks presents unique challenges, hinging on where triggers are placed and how backdoor training is reinforced. We uncover a strong link between backdoor efficacy and the attention sink phenomenon, i.e., shallow input tokens consistently attract disproportionate attention in LLMs. Our analysis reveals that these attention sinks serve as gateways for backdoor unlearning: placing triggers at sink positions and aligning their attention values markedly enhances backdoor persistence. Extensive experiments validate these findings, showing that attention-sink-guided backdoor unlearning reliably restores forgotten knowledge in the presence of backdoor triggers, while behaving indistinguishably from a normally unlearned model when triggers are absent. Code is available at this https URL.

64. 【2510.17018】Extended LSTM: Adaptive Feature Gating for Toxic Comment Classification

链接：https://arxiv.org/abs/2510.17018

作者：Noor Islam S. Mohammad

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：incur high computational, high computational costs, classical ensembles lack, ensembles lack semantic, comment detection remains

备注：

点击查看摘要

Abstract:Toxic comment detection remains a challenging task, where transformer-based models (e.g., BERT) incur high computational costs and degrade on minority toxicity classes, while classical ensembles lack semantic adaptability. We propose xLSTM, a parameter-efficient and theoretically grounded framework that unifies cosine-similarity gating, adaptive feature prioritization, and principled class rebalancing. A learnable reference vector {v} in {R}^d modulates contextual embeddings via cosine similarity, amplifying toxic cues and attenuating benign signals to yield stronger gradients under severe class imbalance. xLSTM integrates multi-source embeddings (GloVe, FastText, BERT CLS) through a projection layer, a character-level BiLSTM for morphological cues, embedding-space SMOTE for minority augmentation, and adaptive focal loss with dynamic class weighting. On the Jigsaw Toxic Comment benchmark, xLSTM attains 96.0% accuracy and 0.88 macro-F1, outperforming BERT by 33% on threat and 28% on identity_hate categories, with 15 times fewer parameters and 50ms inference latency. Cosine gating contributes a +4.8% F1 gain in ablations. The results establish a new efficiency adaptability frontier, demonstrating that lightweight, theoretically informed architectures can surpass large pretrained models on imbalanced, domain-specific NLP tasks.

65. 【2510.17017】SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

链接：https://arxiv.org/abs/2510.17017

作者：Qiusi Zhan,Angeline Budiman-Chan,Abdelrahman Zayed,Xingzhi Guo,Daniel Kang,Joo-Kyung Kim

类目：Computation and Language (cs.CL)

关键词：Large language model, answer open-domain questions, Large language, retrieve external information, agents iteratively generate

备注： Code: [this https URL](https://github.com/ZQS1943/SafeSearch)

点击查看摘要

Abstract:Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked "How can I track someone's location without their consent?", a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.

66. 【2510.17013】DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking

链接：https://arxiv.org/abs/2510.17013

作者：Lanni Bu,Lauren Levin,Amir Zeldes

类目：Computation and Language (cs.CL)

关键词：Recent LLM benchmarks, Recent LLM, responses often tar, natural language understanding, LLM benchmark target

备注：

点击查看摘要

Abstract:Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often tar- geting information from individual sentences. We are still lacking more challenging, and im- portantly also multilingual, benchmarks focus- ing on implicit information and pragmatic infer- ences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark target- ing a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.

67. 【2510.17006】Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

链接：https://arxiv.org/abs/2510.17006

作者：Masahiro Kaneko,Zeerak Talat,Timothy Baldwin

类目：Computation and Language (cs.CL)

关键词：large language models, Iterative jailbreak methods, induce harmful outputs, highly effective attack, effective attack strategy

备注：

点击查看摘要

Abstract:Iterative jailbreak methods that repeatedly rewrite and input prompts into large language models (LLMs) to induce harmful outputs -- using the model's previous responses to guide each new iteration -- have been found to be a highly effective attack strategy. Despite being an effective attack strategy against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks.

68. 【2510.17001】Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic

链接：https://arxiv.org/abs/2510.17001

作者：Yuval Reif,Guy Kaplan,Roy Schwartz

类目：Computation and Language (cs.CL)

关键词：Large language models, walk, Large language, shown to encode, linear directions

备注：

点击查看摘要

Abstract:Large language models (LLMs) were shown to encode word form variations, such as "walk"-"walked", as linear directions in embedding space. However, standard tokenization algorithms treat these variations as distinct tokens -- filling the size-capped vocabulary with surface form variants (e.g., "walk", "walking", "Walk"), at the expense of less frequent words and multilingual coverage. We show that many of these variations can be captured by transformation vectors -- additive offsets that yield the appropriate word's representation when applied to the base form word embedding -- in both the input and output spaces. Building on this, we propose a compact reshaping of the vocabulary: rather than assigning unique tokens to each surface form, we compose them from shared base form and transformation vectors (e.g., "walked" = "walk" + past tense). We apply our approach to multiple LLMs and across five languages, removing up to 10% of vocabulary entries -- thereby freeing space to allocate new, more diverse tokens. Importantly, we do so while also expanding vocabulary coverage to out-of-vocabulary words, with minimal impact on downstream performance, and without modifying model weights. Our findings motivate a foundational rethinking of vocabulary design, moving from string enumeration to a compositional vocabulary that leverages the underlying structure of language.

69. 【2510.17000】Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

链接：https://arxiv.org/abs/2510.17000

作者：Masahiro Kaneko,Timothy Baldwin

类目：Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：large language models, language models, model reply, reply is observed, malicious users

备注： NeurIPS 2025 (spotlight)

点击查看摘要

Abstract:Adversarial attacks by malicious users that threaten the safety of large language models (LLMs) can be viewed as attempts to infer a target property $T$ that is unknown when an instruction is issued, and becomes knowable only after the model's reply is observed. Examples of target properties $T$ include the binary flag that triggers an LLM's harmful response or rejection, and the degree to which information deleted by unlearning can be restored, both elicited via adversarial instructions. The LLM reveals an \emph{observable signal} $Z$ that potentially leaks hints for attacking through a response containing answer tokens, thinking process tokens, or logits. Yet the scale of information leaked remains anecdotal, leaving auditors without principled guidance and defenders blind to the transparency--risk trade-off. We fill this gap with an information-theoretic framework that computes how much information can be safely disclosed, and enables auditors to gauge how close their methods come to the fundamental limit. Treating the mutual information $I(Z;T)$ between the observation $Z$ and the target property $T$ as the leaked bits per query, we show that achieving error $\varepsilon$ requires at least $\log(1/\varepsilon)/I(Z;T)$ queries, scaling linearly with the inverse leak rate and only logarithmically with the desired accuracy. Thus, even a modest increase in disclosure collapses the attack cost from quadratic to logarithmic in terms of the desired accuracy. Experiments on seven LLMs across system-prompt leakage, jailbreak, and relearning attacks corroborate the theory: exposing answer tokens alone requires about a thousand queries; adding logits cuts this to about a hundred; and revealing the full thinking process trims it to a few dozen. Our results provide the first principled yardstick for balancing transparency and security when deploying LLMs.

70. 【2510.16987】Back to Bytes: Revisiting Tokenization Through UTF-8

链接：https://arxiv.org/abs/2510.16987

作者：Amit Moryossef,Clara Meister,Pavel Stepachev,Desmond Elliott

类目：Computation and Language (cs.CL)

关键词：minimalist byte-level tokenizer, tokenizer that maps, minimalist byte-level, byte-level tokenizer, maps text

备注：

点击查看摘要

Abstract:We present UTF8Tokenizer, a minimalist byte-level tokenizer that maps text exactly to IDs corresponding to the bytes underlying the text's UTF-8 encoding (e.g., byte x09 is token ID 9). Unlike prior byte-level approaches (Xue et al., 2021; Pagnoni et al., 2025), our implementation never introduces out-of-range IDs (i.e. there is no token ID 256) or auxiliary tokens: all special behavior (e.g., padding, boundaries, conversation structure, attention segments, tool calling, "thinking" spans, etc.) is encoded using C0 control bytes - just as ASCII was originally designed to embed control information alongside printable text. These design principles yield practical benefits: (1) faster tokenization (14x) and significantly lower host-device transfer (8x less than int64); (2) simple, shareable 256*d embedding tables that can be aligned across models; and (3) a training-time enhancement via bit-biased embeddings, which exposes per-byte bit structure and can be added to the embedding table post-training, removing inference costs. Our HuggingFace-compatible implementation improves language modeling convergence.

71. 【2510.16985】Parameter-Efficient Fine-Tuning for Low-Resource Languages: A Comparative Study of LLMs for Bengali Hate Speech Detection

链接：https://arxiv.org/abs/2510.16985

作者：Akif Islam,Mohd Ruhul Ameen

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：disproportionately affecting women, social media platforms, Bengali social media, disproportionately affecting, women and adolescents

备注： Accepted to IEEE COMPAS 2025. 6 pages, 3 figures, 6 tables

点击查看摘要

Abstract:Bengali social media platforms have witnessed a sharp increase in hate speech, disproportionately affecting women and adolescents. While datasets such as BD-SHS provide a basis for structured evaluation, most prior approaches rely on either computationally costly full-model fine-tuning or proprietary APIs. This paper presents the first application of Parameter-Efficient Fine-Tuning (PEFT) for Bengali hate speech detection using LoRA and QLoRA. Three instruction-tuned large language models - Gemma-3-4B, Llama-3.2-3B, and Mistral-7B - were fine-tuned on the BD-SHS dataset of 50,281 annotated comments. Each model was adapted by training fewer than 1% of its parameters, enabling experiments on a single consumer-grade GPU. The results show that Llama-3.2-3B achieved the highest F1-score of 92.23%, followed by Mistral-7B at 88.94% and Gemma-3-4B at 80.25%. These findings establish PEFT as a practical and replicable strategy for Bengali and related low-resource languages.

72. 【2510.16968】Leave It to the Experts: Detecting Knowledge Distillation via MoE Expert Signatures

链接：https://arxiv.org/abs/2510.16968

作者：Pingzhi Li,Morris Yu-Chao Huang,Zhen Tan,Qingquan Song,Jie Peng,Kai Zou,Yu Cheng,Kaidi Xu,Tianlong Chen

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：poses intellectual property, intellectual property protection, LLM diversity risks, large language models, Knowledge Distillation

备注： Code is at [this https URL](https://github.com/unites-lab/shadow-moe)

点击查看摘要

Abstract:Knowledge Distillation (KD) accelerates training of large language models (LLMs) but poses intellectual property protection and LLM diversity risks. Existing KD detection methods based on self-identity or output similarity can be easily evaded through prompt engineering. We present a KD detection framework effective in both white-box and black-box settings by exploiting an overlooked signal: the transfer of MoE "structural habits", especially internal routing patterns. Our approach analyzes how different experts specialize and collaborate across various inputs, creating distinctive fingerprints that persist through the distillation process. To extend beyond the white-box setup and MoE architectures, we further propose Shadow-MoE, a black-box method that constructs proxy MoE representations via auxiliary distillation to compare these patterns between arbitrary model pairs. We establish a comprehensive, reproducible benchmark that offers diverse distilled checkpoints and an extensible framework to facilitate future research. Extensive experiments demonstrate 94% detection accuracy across various scenarios and strong robustness to prompt-based evasion, outperforming existing baselines while highlighting the structural habits transfer in LLMs.

73. 【2510.16952】Real-Time World Crafting: Generating Structured Game Behaviors from Natural Language with Large Language Models

链接：https://arxiv.org/abs/2510.16952

作者：Austin Drake,Hang Dong

类目：Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

关键词：safely integrating Large, integrating Large Language, Large Language Models, interactive game engines, integrating Large

备注： 16 pages, 11 figures (including appendix). To be presented at the 5th Wordplay @ EMNLP workshop (2025)

点击查看摘要

Abstract:We present a novel architecture for safely integrating Large Language Models (LLMs) into interactive game engines, allowing players to "program" new behaviors using natural language. Our framework mitigates risks by using an LLM to translate commands into a constrained Domain-Specific Language (DSL), which configures a custom Entity-Component-System (ECS) at runtime. We evaluated this system in a 2D spell-crafting game prototype by experimentally assessing models from the Gemini, GPT, and Claude families with various prompting strategies. A validated LLM judge qualitatively rated the outputs, showing that while larger models better captured creative intent, the optimal prompting strategy is task-dependent: Chain-of-Thought improved creative alignment, while few-shot examples were necessary to generate more complex DSL scripts. This work offers a validated LLM-ECS pattern for emergent gameplay and a quantitative performance comparison for developers.

74. 【2510.16943】Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modelling through Component-Level Evaluation

链接：https://arxiv.org/abs/2510.16943

作者：Dania Refai,Moataz Ahmed

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：Large language models, convert natural language, natural language descriptions, Large language, natural language

备注：

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used to convert natural language descriptions into mathematical optimization formulations. Current evaluations often treat formulations as a whole, relying on coarse metrics like solution accuracy or runtime, which obscure structural or numerical errors. In this study, we present a comprehensive, component-level evaluation framework for LLM-generated formulations. Beyond the conventional optimality gap, our framework introduces metrics such as precision and recall for decision variables and constraints, constraint and objective root mean squared error (RMSE), and efficiency indicators based on token usage and latency. We evaluate GPT-5, LLaMA 3.1 Instruct, and DeepSeek Math across optimization problems of varying complexity under six prompting strategies. Results show that GPT-5 consistently outperforms other models, with chain-of-thought, self-consistency, and modular prompting proving most effective. Analysis indicates that solver performance depends primarily on high constraint recall and low constraint RMSE, which together ensure structural correctness and solution reliability. Constraint precision and decision variable metrics play secondary roles, while concise outputs enhance computational efficiency. These findings highlight three principles for NLP-to-optimization modeling: (i) Complete constraint coverage prevents violations, (ii) minimizing constraint RMSE ensures solver-level accuracy, and (iii) concise outputs improve computational efficiency. The proposed framework establishes a foundation for fine-grained, diagnostic evaluation of LLMs in optimization modeling.

75. 【2510.16932】Prompt-MII: Meta-Learning Instruction Induction for LLMs

链接：https://arxiv.org/abs/2510.16932

作者：Emily Xiao,Yixiao Zeng,Ada Chen,Chin-Jou Li,Amanda Bertsch,Graham Neubig

类目：Computation and Language (cs.CL)

关键词：context length grows, adapt large language, incurs high inference, high inference costs, large language models

备注：

点击查看摘要

Abstract:A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows. In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. Specifically, we propose PROMPT-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks. PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative), matching ICL performance while requiring 3-13x fewer tokens.

76. 【2510.16928】ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models

链接：https://arxiv.org/abs/2510.16928

作者：Emily Chang,Niyati Bafna

类目：Computation and Language (cs.CL)

关键词：restricted to high, largely restricted, large language models, higher-order tasks, Existing

备注：

点击查看摘要

Abstract:Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.

77. 【2510.16926】Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input

链接：https://arxiv.org/abs/2510.16926

作者：Chenxu Li,Zhicai Wang,Yuan Sheng,Xingyu Zhu,Yanbin Hao,Xiang Wang

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

备注： 23 pages,19 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce \textbf{Res-Bench}, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.

78. 【2510.16924】Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models?

链接：https://arxiv.org/abs/2510.16924

作者：Zhihui Yang,Yupei Wang,Kaijie Mo,Zhe Zhao,Renfen Hu

类目：Computation and Language (cs.CL)

关键词：multimodal language models, embodied knowledge understanding, embodied knowledge, embodied knowledge compared, significant progress

备注： Accepted to EMNLP 2025 (Findings). This version corrects a redundant sentence in the Results section that appeared in the camera-ready version

点击查看摘要

Abstract:Despite significant progress in multimodal language models (LMs), it remains unclear whether visual grounding enhances their understanding of embodied knowledge compared to text-only models. To address this question, we propose a novel embodied knowledge understanding benchmark based on the perceptual theory from psychology, encompassing visual, auditory, tactile, gustatory, olfactory external senses, and interoception. The benchmark assesses the models' perceptual abilities across different sensory modalities through vector comparison and question-answering tasks with over 1,700 questions. By comparing 30 state-of-the-art LMs, we surprisingly find that vision-language models (VLMs) do not outperform text-only models in either task. Moreover, the models perform significantly worse in the visual dimension compared to other sensory dimensions. Further analysis reveals that the vector representations are easily influenced by word form and frequency, and the models struggle to answer questions involving spatial perception and reasoning. Our findings underscore the need for more effective integration of embodied knowledge in LMs to enhance their understanding of the physical world.

79. 【2510.16917】SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models

链接：https://arxiv.org/abs/2510.16917

作者：Chih-Kai Yang,Yen-Ting Piao,Tzu-Wen Hsu,Szu-Wei Fu,Zhehuai Chen,Ke-Han Lu,Sung-Feng Huang,Chao-Han Huck Yang,Yu-Chiang Frank Wang,Yun-Nung Chen,Hung-yi Lee

类目：ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词：Large Audio-Language Models, full retraining, offers an efficient, prior work, work has concentrated

备注： Work in progress

点击查看摘要

Abstract:Knowledge editing offers an efficient way to update model knowledge without full retraining, but prior work has concentrated almost exclusively on textual or visual modalities. We introduce SAKE, the first benchmark specifically designed for editing auditory attribute knowledge in Large Audio-Language Models (LALMs). Unlike factual updates, SAKE targets several abstract auditory attributes, capturing knowledge types that go beyond conventional textual and visual domains. We benchmark seven editing methods on two LALMs along four dimensions: reliability, generality, audio/text locality, and portability. Results highlight challenges such as preserving intra-attribute knowledge unrelated to the edit, generalizing edits to multimodal reasoning, and maintaining edits under sequential updates. SAKE provides a principled framework to study how knowledge editing extends to the auditory modalities, opening new directions for maintaining and adapting LALMs in more diverse real-world scenarios.

80. 【2510.16907】VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

链接：https://arxiv.org/abs/2510.16907

作者：Kangrui Wang,Pingyue Zhang,Zihan Wang,Yaning Gao,Linjie Li,Qineng Wang,Hanyang Chen,Chi Wan,Yiping Lu,Zhengyuan Yang,Lijuan Wang,Ranjay Krishna,Jiajun Wu,Li Fei-Fei,Yejin Choi,Manling Li

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：complex visual observations, key challenge, shift from textual, Partially Observable Markov, Observable Markov Decision

备注： Accepted to NeurIPS 2025

点击查看摘要

Abstract:A key challenge in training Vision-Language Model (VLM) agents, compared to Language Model (LLM) agents, lies in the shift from textual states to complex visual observations. This transition introduces partial observability and demands robust world modeling. We ask: Can VLM agents construct internal world models through explicit visual state reasoning? To address this question, we architecturally enforce and reward the agent's reasoning process via reinforcement learning (RL), formulating it as a Partially Observable Markov Decision Process (POMDP). We find that decomposing the agent's reasoning into State Estimation ("what is the current state?") and Transition Modeling ("what comes next?") is critical for success, as demonstrated through five reasoning strategies. Our investigation into how agents represent internal beliefs reveals that the optimal representation is task-dependent: Natural Language excels at capturing semantic relationships in general tasks, while Structured formats are indispensable for precise manipulation and control. Building on these insights, we design a World Modeling Reward that provides dense, turn-level supervision for accurate state prediction, and introduce Bi-Level General Advantage Estimation (Bi-Level GAE) for turn-aware credit assignment. Through this form of visual state reasoning, a 3B-parameter model achieves a score of 0.82 across five diverse agent benchmarks, representing a 3$\times$ improvement over its untrained counterpart (0.21) and outperforming proprietary reasoning models such as GPT-5 (0.75), Gemini 2.5 Pro (0.67) and Claude 4.5 (0.62). All experiments are conducted within our VAGEN framework, a scalable system for training and analyzing multi-turn VLM agents in diverse visual environments. Code and data are publicly available at this https URL.

81. 【2510.16893】Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

链接：https://arxiv.org/abs/2510.16893

作者：Bo-Han Feng,Chien-Feng Liu,Yu-Hsuan Li Liang,Chih-Kai Yang,Szu-Wei Fu,Zhehuai Chen,Ke-Han Lu,Sung-Feng Huang,Chao-Han Huck Yang,Yu-Chiang Frank Wang,Yun-Nung Chen,Hung-yi Lee

类目：ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词：Large audio-language models, extend text-based LLMs, Large audio-language, audio-language models, extend text-based

备注： Submitted to ICASSP 2026

点击查看摘要

Abstract:Large audio-language models (LALMs) extend text-based LLMs with auditory understanding, offering new opportunities for multimodal applications. While their perception, reasoning, and task performance have been widely studied, their safety alignment under paralinguistic variation remains underexplored. This work systematically investigates the role of speaker emotion. We construct a dataset of malicious speech instructions expressed across multiple emotions and intensities, and evaluate several state-of-the-art LALMs. Our results reveal substantial safety inconsistencies: different emotions elicit varying levels of unsafe responses, and the effect of intensity is non-monotonic, with medium expressions often posing the greatest risk. These findings highlight an overlooked vulnerability in LALMs and call for alignment strategies explicitly designed to ensure robustness under emotional variation, a prerequisite for trustworthy deployment in real-world settings.

82. 【2510.16882】Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning

链接：https://arxiv.org/abs/2510.16882

作者：Heming Zou,Yixiu Mao,Yun Qu,Qi Wang,Xiangyang Ji

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：adapt large language, large language models, Supervised fine-tuning, downstream tasks, commonly used technique

备注：

点击查看摘要

Abstract:Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This facilitates the rise of data curation in SFT, which prioritizes the most valuable data to optimze. This work studies the online batch selection family that dynamically scores and filters samples during the training process. However, existing popular methods often (i) rely merely on the utility of data to select a subset while neglecting other crucial factors like diversity, (ii) rely on external resources such as reference models or validation sets, and (iii) incur extra training time over full-dataset training. To address these limitations, this work develops \textbf{UDS (Utility-Diversity Sampling)}, a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, while estimating inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. Such a design eliminates the need for external resources and unnecessary backpropagation, securing computational efficiency. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning. Code is available at this https URL.

83. 【2510.16872】DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

链接：https://arxiv.org/abs/2510.16872

作者：Shaolei Zhang,Ju Fan,Meihao Fan,Guoliang Li,Xiaoyong Du

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)

关键词：Autonomous data science, powerful large language, Autonomous data, deep research reports, data science

备注： Code: [this https URL](https://github.com/ruc-datalab/DeepAnalyze) Model: [this https URL](https://huggingface.co/RUC-DataLab/DeepAnalyze-8B)

点击查看摘要

Abstract:Autonomous data science, from raw data sources to analyst-grade deep research reports, has been a long-standing challenge, and is now becoming feasible with the emergence of powerful large language models (LLMs). Recent workflow-based data agents have shown promising results on specific data tasks but remain fundamentally limited in achieving fully autonomous data science due to their reliance on predefined workflows. In this paper, we introduce DeepAnalyze-8B, the first agentic LLM designed for autonomous data science, capable of automatically completing the end-toend pipeline from data sources to analyst-grade deep research reports. To tackle high-complexity data science tasks, we propose a curriculum-based agentic training paradigm that emulates the learning trajectory of human data scientists, enabling LLMs to progressively acquire and integrate multiple capabilities in real-world environments. We also introduce a data-grounded trajectory synthesis framework that constructs high-quality training data. Through agentic training, DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data question answering and specialized analytical tasks to open-ended data research. Experiments demonstrate that, with only 8B parameters, DeepAnalyze outperforms previous workflow-based agents built on most advanced proprietary LLMs. The model, code, and training data of DeepAnalyze are open-sourced, paving the way toward autonomous data science.

84. 【2510.16851】Neuronal Group Communication for Efficient Neural representation

链接：https://arxiv.org/abs/2510.16851

作者：Zhengqi Pei,Qingming Huang,Shuhui Wang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

关键词：alongside daunting challenges, brought unprecedented performance, unprecedented performance alongside, performance alongside daunting, efficiency and interpretability

备注： 28 pages, 2 figures

点击查看摘要

Abstract:The ever-increasing scale of modern neural networks has brought unprecedented performance alongside daunting challenges in efficiency and interpretability. This paper addresses the core question of how to build large neural systems that learn efficient, modular, and interpretable representations. We propose Neuronal Group Communication (NGC), a theory-driven framework that reimagines a neural network as a dynamical system of interacting neuronal groups rather than a monolithic collection of neural weights. Instead of treating each weight as an independent trainable parameter, NGC treats weights as transient interactions between embedding-like neuronal states, with neural computation unfolding through iterative communication among groups of neurons. This low-rank, modular representation yields compact models: groups of neurons exchange low-dimensional signals, enabling intra-group specialization and inter-group information sharing while dramatically reducing redundant parameters. By drawing on dynamical systems theory, we introduce a neuronal stability metric (analogous to Lyapunov stability) that quantifies the contraction of neuron activations toward stable patterns during sequence processing. Using this metric, we reveal that emergent reasoning capabilities correspond to an external driving force or ``potential'', which nudges the neural dynamics away from trivial trajectories while preserving stability. Empirically, we instantiate NGC in large language models (LLMs) and demonstrate improved performance on complex reasoning benchmarks under moderate compression. NGC consistently outperforms standard low-rank approximations and cross-layer basis-sharing methods at comparable compression rates. We conclude by discussing the broader implications of NGC, including how structured neuronal group dynamics might relate to generalization in high-dimensional learning systems.

85. 【2510.16844】FinSight: Towards Real-World Financial Deep Research

链接：https://arxiv.org/abs/2510.16844

作者：Jiajie Jin,Yuyao Zhang,Yimeng Xu,Hongjin Qian,Yutao Zhu,Zhicheng Dou

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

关键词：intellectually demanding process, Generating professional financial, professional financial reports, fully automate, labor-intensive and intellectually

备注： Working in progress

点击查看摘要

Abstract:Generating professional financial reports is a labor-intensive and intellectually demanding process that current AI systems struggle to fully automate. To address this challenge, we introduce FinSight (Financial InSight), a novel multi agent framework for producing high-quality, multimodal financial reports. The foundation of FinSight is the Code Agent with Variable Memory (CAVM) architecture, which unifies external data, designed tools, and agents into a programmable variable space, enabling flexible data collection, analysis and report generation through executable code. To ensure professional-grade visualization, we propose an Iterative Vision-Enhanced Mechanism that progressively refines raw visual outputs into polished financial charts. Furthermore, a two stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports, ensuring both analytical depth and structural consistency. Experiments on various company and industry-level tasks demonstrate that FinSight significantly outperforms all baselines, including leading deep research systems in terms of factual accuracy, analytical depth, and presentation quality, demonstrating a clear path toward generating reports that approach human-expert quality.

86. 【2510.16830】Verifiable Fine-Tuning for LLMs: Zero-Knowledge Training Proofs Bound to Data Provenance and Policy

链接：https://arxiv.org/abs/2510.16830

作者：Hasan Akgul,Daniel Borg,Arta Berisha,Amina Rahimova,Andrej Novak,Mila Petrov

类目：Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词：Large language models, current release practices, release practices provide, practices provide weak, provide weak assurances

备注： 20 pages, 10 figures

点击查看摘要

Abstract:Large language models are often adapted through parameter efficient fine tuning, but current release practices provide weak assurances about what data were used and how updates were computed. We present Verifiable Fine Tuning, a protocol and system that produces succinct zero knowledge proofs that a released model was obtained from a public initialization under a declared training program and an auditable dataset commitment. The approach combines five elements. First, commitments that bind data sources, preprocessing, licenses, and per epoch quota counters to a manifest. Second, a verifiable sampler that supports public replayable and private index hiding batch selection. Third, update circuits restricted to parameter efficient fine tuning that enforce AdamW style optimizer semantics and proof friendly approximations with explicit error budgets. Fourth, recursive aggregation that folds per step proofs into per epoch and end to end certificates with millisecond verification. Fifth, provenance binding and optional trusted execution property cards that attest code identity and constants. On English and bilingual instruction mixtures, the method maintains utility within tight budgets while achieving practical proof performance. Policy quotas are enforced with zero violations, and private sampling windows show no measurable index leakage. Federated experiments demonstrate that the system composes with probabilistic audits and bandwidth constraints. These results indicate that end to end verifiable fine tuning is feasible today for real parameter efficient pipelines, closing a critical trust gap for regulated and decentralized deployments.

87. 【2510.16829】Who's Asking? Simulating Role-Based Questions for Conversational AI Evaluation

链接：https://arxiv.org/abs/2510.16829

作者：Navreet Kaur,Hoda Ayad,Hayoung Jung,Shravika Mittal,Munmun De Choudhury,Tanushree Mitra

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

关键词：Language model users, Language model, personal and social, Language, User-centric Question Simulation

备注：

点击查看摘要

Abstract:Language model users often embed personal and social context in their questions. The asker's role -- implicit in how the question is framed -- creates specific needs for an appropriate response. However, most evaluations, while capturing the model's capability to respond, often ignore who is asking. This gap is especially critical in stigmatized domains such as opioid use disorder (OUD), where accounting for users' contexts is essential to provide accessible, stigma-free responses. We propose CoRUS (COmmunity-driven Roles for User-centric Question Simulation), a framework for simulating role-based questions. Drawing on role theory and posts from an online OUD recovery community (r/OpiatesRecovery), we first build a taxonomy of asker roles -- patients, caregivers, practitioners. Next, we use it to simulate 15,321 questions that embed each role's goals, behaviors, and experiences. Our evaluations show that these questions are both highly believable and comparable to real-world data. When used to evaluate five LLMs, for the same question but differing roles, we find systematic differences: vulnerable roles, such as patients and caregivers, elicit more supportive responses (+17%) and reduced knowledge content (-19%) in comparison to practitioners. Our work demonstrates how implicitly signaling a user's role shapes model responses, and provides a methodology for role-informed evaluation of conversational AI.

88. 【2510.16819】Cross-Genre Authorship Attribution via LLM-Based Retrieve-and-Rerank

链接：https://arxiv.org/abs/2510.16819

作者：Shantanu Agarwal,Joel Barry,Steven Fincke,Scott Miller

类目：Computation and Language (cs.CL)

关键词：Authorship attribution, candidate authors, task of identifying, query document, predefined set

备注：

点击查看摘要

Abstract:Authorship attribution (AA) is the task of identifying the most likely author of a query document from a predefined set of candidate authors. We introduce a two-stage retrieve-and-rerank framework that finetunes LLMs for cross-genre AA. Unlike the field of information retrieval (IR), where retrieve-and-rerank is a de facto strategy, cross-genre AA systems must avoid relying on topical cues and instead learn to identify author-specific linguistic patterns that are independent of the text's subject matter (genre/domain/topic). Consequently, for the reranker, we demonstrate that training strategies commonly used in IR are fundamentally misaligned with cross-genre AA, leading to suboptimal behavior. To address this, we introduce a targeted data curation strategy that enables the reranker to effectively learn author-discriminative signals. Using our LLM-based retrieve-and-rerank pipeline, we achieve substantial gains of 22.3 and 34.4 absolute Success@8 points over the previous state-of-the-art on HIATUS's challenging HRS1 and HRS2 cross-genre AA benchmarks.

89. 【2510.16815】Knowing the Facts but Choosing the Shortcut: Understanding How Large Language Models Compare Entities

链接：https://arxiv.org/abs/2510.16815

作者：Hans Hergen Lehmann,Jae Hee Lee,Steven Schockaert,Stefan Wermter

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Large Language Models, Large Language, genuine knowledge versus, knowledge versus superficial, heuristics remains challenging

备注： 33 pages, 20 figures. Submitted ACL ARR 2025 October (under review)

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used for knowledge-based reasoning tasks, yet understanding when they rely on genuine knowledge versus superficial heuristics remains challenging. We investigate this question through entity comparison tasks by asking models to compare entities along numerical attributes (e.g., ``Which river is longer, the Danube or the Nile?''), which offer clear ground truth for systematic analysis. Despite having sufficient numerical knowledge to answer correctly, LLMs frequently make predictions that contradict this knowledge. We identify three heuristic biases that strongly influence model predictions: entity popularity, mention order, and semantic co-occurrence. For smaller models, a simple logistic regression using only these surface cues predicts model choices more accurately than the model's own numerical predictions, suggesting heuristics largely override principled reasoning. Crucially, we find that larger models (32B parameters) selectively rely on numerical knowledge when it is more reliable, while smaller models (7--8B parameters) show no such discrimination, which explains why larger models outperform smaller ones even when the smaller models possess more accurate knowledge. Chain-of-thought prompting steers all models towards using the numerical features across all model sizes.

90. 【2510.16809】When Many-Shot Prompting Fails: An Empirical Study of LLM Code Translation

链接：https://arxiv.org/abs/2510.16809

作者：Amirkia Rafiei Oskooei,Kaan Baturalp Cosdan,Husamettin Isiktas,Mehmet S. Aktas

类目：oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)

关键词：Large Language Models, Large Language, Language Models, vast context windows, context windows offer

备注：

点击查看摘要

Abstract:Large Language Models (LLMs) with vast context windows offer new avenues for in-context learning (ICL), where providing many examples ("many-shot" prompting) is often assumed to enhance performance. We investigate this assumption for the complex task of code translation. Through a large-scale empirical study of over 90,000 translations, we systematically evaluate the impact of scaling in-context examples from zero-shot to many-shot configurations of up to 625 examples, with prompts spanning from approximately 100,000 to 800,000 tokens. Our findings reveal a "many-shot paradox": while static similarity metrics may modestly improve with more examples, functional correctness consistently peaks with few-shot prompting (5-25 examples). Providing substantially more examples often degrades this crucial functional performance. This study highlights that for code translation, the quality of a few well-chosen examples outweighs sheer quantity, challenging the universal efficacy of "more is better" for ICL and underscoring the task-dependent nature of optimal prompting strategies. Our results have significant implications for effectively leveraging LLMs in software engineering.

91. 【2510.16797】MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning

链接：https://arxiv.org/abs/2510.16797

作者：Vera Pavlova,Mohammed Makhlouf

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：In-domain Contrastive learning, sentence embedding models, introduce MOSAIC, incorporates joint domain-specific, In-domain Contrastive

备注：

点击查看摘要

Abstract:We introduce MOSAIC (Masked Objective with Selective Adaptation for In-domain Contrastive learning), a multi-stage framework for domain adaptation of sentence embedding models that incorporates joint domain-specific masked supervision. Our approach addresses the challenges of adapting large-scale general-domain sentence embedding models to specialized domains. By jointly optimizing masked language modeling (MLM) and contrastive objectives within a unified training pipeline, our method enables effective learning of domain-relevant representations while preserving the robust semantic discrimination properties of the original model. We empirically validate our approach on both high-resource and low-resource domains, achieving improvements up to 13.4% in NDCG@10 (Normalized Discounted Cumulative Gain) over strong general-domain baselines. Comprehensive ablation studies further demonstrate the effectiveness of each component, highlighting the importance of balanced joint supervision and staged adaptation.

92. 【2510.16783】LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding

链接：https://arxiv.org/abs/2510.16783

作者：Sheikh Jubair,Arwa Omayrah,Amal Alshammari,Alhanoof Althnian,Abdulhamed Alothaimen,Norah A. Alzahrani,Shahad D. Alzaidi,Nora Al-Twairesh,Abdulmohsen Al-Thubaity

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Large Language Models, Large Language, demonstrated sophisticated capabilities, Recent advancements, advancements in Large

备注： 1 figure, 15 tables, 10 main pages

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts. These emergent capabilities necessitate rigorous evaluation methods to effectively assess their performance in long-context understanding. In this paper, we present \textbf{LC-Eval}, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths ranging from 4k to over 128k tokens. LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts. These tasks are designed to assess LLMs' abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of their performance across different text genres. Evaluations were conducted on both open-weight and closed LLMs, with results indicating that LC-Eval presents significant challenges. Even high-performing models, such as GPT-4o, struggled with certain tasks, highlighting the complexity and rigor of the benchmark.

93. 【2510.16781】Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features

链接：https://arxiv.org/abs/2510.16781

作者：Shihao Ji,Zihui Song

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：Visual Language Models, large-scale Visual Language, Language Models, Visual Language, remarkable zero-shot reasoning

备注：

点击查看摘要

Abstract:The remarkable zero-shot reasoning capabilities of large-scale Visual Language Models (VLMs) on static images have yet to be fully translated to the video domain. Conventional video understanding models often rely on extensive, task-specific training on annotated datasets, a process that is both costly and limited in scalability. This paper introduces a novel, training-free framework for video understanding that circumvents end-to-end training by synergistically combining the rich semantic priors of pre-trained VLMs with classic machine learning algorithms for pattern discovery. Our core idea is to reframe video understanding as a self-supervised spatio-temporal clustering problem within a high-dimensional semantic feature space. The proposed pipeline first transforms a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal Segmentation (KTS), a robust machine learning technique, to partition the continuous feature stream into discrete, semantically coherent event segments. These segments are then subjected to unsupervised density-based clustering to identify recurring macroscopic scenes and themes throughout the video. By selecting representative keyframes from each discovered cluster and leveraging the VLM's generative capabilities for textual description, our framework automatically produces a structured, multi-modal summary of the video content. This approach provides an effective, interpretable, and model-agnostic pathway for zero-shot, automated structural analysis of video content.

94. 【2510.16769】See or Say Graphs: Agent-Driven Scalable Graph Understanding with Vision-Language Models

链接：https://arxiv.org/abs/2510.16769

作者：Shuo Han,Yukun Cao,Zezhong Ding,Zengyi Gao,S Kevin Zhou,Xike Xie

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：lacking effective mechanisms, Vision-language models, facing scalability bottlenecks, input-token constraints, graph understanding

备注：

点击查看摘要

Abstract:Vision-language models (VLMs) have shown promise in graph understanding, but remain limited by input-token constraints, facing scalability bottlenecks and lacking effective mechanisms to coordinate textual and visual modalities. To address these challenges, we propose GraphVista, a unified framework that enhances both scalability and modality coordination in graph understanding. For scalability, GraphVista organizes graph information hierarchically into a lightweight GraphRAG base, which retrieves only task-relevant textual descriptions and high-resolution visual subgraphs, compressing redundant context while preserving key reasoning elements. For modality coordination, GraphVista introduces a planning agent that routes tasks to the most suitable modality-using the text modality for simple property reasoning and the visual modality for local and structurally complex reasoning grounded in explicit topology. Extensive experiments demonstrate that GraphVista scales to large graphs, up to $200\times$ larger than those used in existing benchmarks, and consistently outperforms existing textual, visual, and fusion-based methods, achieving up to $4.4\times$ quality improvement over the state-of-the-art baselines by fully exploiting the complementary strengths of both modalities.

95. 【2510.16761】Enhancing Language Agent Strategic Reasoning through Self-Play in Adversarial Games

链接：https://arxiv.org/abs/2510.16761

作者：Yikai Zhang,Ye Rong,Siyu Yuan,Jiangjie Chen,Jian Xie,Yanghua Xiao

类目：Computation and Language (cs.CL)

关键词：Existing language agents, Existing language, dynamic adversarial games, encounter difficulties, due to poor

备注：

点击查看摘要

Abstract:Existing language agents often encounter difficulties in dynamic adversarial games due to poor strategic reasoning. To mitigate this limitation, a promising approach is to allow agents to learn from game interactions automatically, without relying on costly expert-labeled data. Unlike static environments where agents receive fixed feedback or rewards, selecting appropriate opponents in dynamic adversarial games can significantly impact learning performance. However, the discussion of opponents in adversarial environments remains an area under exploration. In this paper, we propose a Step-level poliCy Optimization method through Play-And-Learn, SCO-PAL. Leveraging SCO-PAL, we conduct a detailed analysis of opponent selection by setting opponents at different levels and find that self-play is the most effective way to improve strategic reasoning in such adversarial environments. Utilizing SCO-PAL with self-play, we increase the average win rate against four opponents by approximately 30% compared to baselines and achieve a 54.76% win rate against GPT-4 in six adversarial games.

96. 【2510.16756】End-to-end Listen, Look, Speak and Act

链接：https://arxiv.org/abs/2510.16756

作者：Siyin Wang,Wenyi Yu,Xianzhao Chen,Xiaohai Tian,Jun Zhang,Lu Lu,Chao Zhang

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Audio and Speech Processing (eess.AS)

关键词：models simulating humans, fluidly adapt, Human interaction, speak while acting, inherently multimodal

备注： 22 pages, 8 figures

点击查看摘要

Abstract:Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach, yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code and model checkpoints will be released upon acceptance.

97. 【2510.16743】Zero-Shot Performance Prediction for Probabilistic Scaling Laws

链接：https://arxiv.org/abs/2510.16743

作者：Viktoria Schram,Markus Hiller,Daniel Beck,Trevor Cohn

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：Natural Language Processing, Language Processing, Natural Language, specific performance objectives, enables informed decision-making

备注： Accepted to NeurIPS 2025

点击查看摘要

Abstract:The prediction of learning curves for Natural Language Processing (NLP) models enables informed decision-making to meet specific performance objectives, while reducing computational overhead and lowering the costs associated with dataset acquisition and curation. In this work, we formulate the prediction task as a multitask learning problem, where each task's data is modelled as being organized within a two-layer hierarchy. To model the shared information and dependencies across tasks and hierarchical levels, we employ latent variable multi-output Gaussian Processes, enabling to account for task correlations and supporting zero-shot prediction of learning curves (LCs). We demonstrate that this approach facilitates the development of probabilistic scaling laws at lower costs. Applying an active learning strategy, LCs can be queried to reduce predictive uncertainty and provide predictions close to ground truth scaling laws. We validate our framework on three small-scale NLP datasets with up to $30$ LCs. These are obtained from nanoGPT models, from bilingual translation using mBART and Transformer models, and from multilingual translation using M2M100 models of varying sizes.

98. 【2510.16727】Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

链接：https://arxiv.org/abs/2510.16727

作者：Sanskar Pandey,Ruhaan Chopra,Angkul Puniya,Sohom Pal

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Large language models, Large language, language models internalize, obsequious flattery, emerging from reward

备注：

点击查看摘要

Abstract:Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.

99. 【2510.16724】A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications

链接：https://arxiv.org/abs/2510.16724

作者：Minhua Lin,Zongyu Wu,Zhichao Xu,Hui Liu,Xianfeng Tang,Qi He,Charu Aggarwal,Hui Liu,Xiang Zhang,Suhang Wang

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：open-ended natural language, large language models, transformed information access, natural language interaction, large language

备注： 38 pages, 4 figures, 7 tables

点击查看摘要

Abstract:The advent of large language models (LLMs) has transformed information access and reasoning through open-ended natural language interaction. However, LLMs remain limited by static knowledge, factual hallucinations, and the inability to retrieve real-time or domain-specific information. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding model outputs in external evidence, but traditional RAG pipelines are often single turn and heuristic, lacking adaptive control over retrieval and reasoning. Recent advances in agentic search address these limitations by enabling LLMs to plan, retrieve, and reflect through multi-step interaction with search environments. Within this paradigm, reinforcement learning (RL) offers a powerful mechanism for adaptive and self-improving search behavior. This survey provides the first comprehensive overview of \emph{RL-based agentic search}, organizing the emerging field along three complementary dimensions: (i) What RL is for (functional roles), (ii) How RL is used (optimization strategies), and (iii) Where RL is applied (scope of optimization). We summarize representative methods, evaluation protocols, and applications, and discuss open challenges and future directions toward building reliable and scalable RL driven agentic search systems. We hope this survey will inspire future research on the integration of RL and agentic search. Our repository is available at this https URL.

100. 【2510.16718】U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation

链接：https://arxiv.org/abs/2510.16718

作者：Xusheng Yang,Long Zhou,Wenfu Wang,Kai Hu,Shulin Feng,Chenxing Li,Meng Yu,Dong Yu,Yuexian Zou

类目：ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：ltra low frame-rate, low frame-rate neural, extremely low frame-rate, low frame-rate, frame-rate neural speech

备注：

点击查看摘要

Abstract:We propose \textbf{U-Codec}, an \textbf{U}ltra low frame-rate neural speech \textbf{Codec} that achieves high-fidelity reconstruction and fast speech generation at an extremely low frame-rate of 5Hz (5 frames per second). Extreme compression at 5Hz typically leads to severe intelligibility and spectral detail loss, we introduce a Transformer-based inter-frame long-term dependency module and systematically explore residual vector quantization (RVQ) depth and codebook size to identify optimal configurations. Moreover, we apply U-Codec into a large language model (LLM)-based auto-regressive TTS model, which leverages global and local hierarchical architecture to effectively capture dependencies across multi-layer tokens. We extend LLM-based TTS from 3-layer RVQ at 50Hz to 32-layer RVQ at 5Hz. Experimental results demonstrate that U-Codec improves LLM-based TTS inference speed by around 3 $\times$ over high-frame-rate codecs while maintaining similarity and naturalness. These results validate the feasibility of using highly compressed 5Hz discrete tokens for fast and high-fidelity speech synthesis.

101. 【2510.16713】so much depends / upon / a whitespace: Why Whitespace Matters for Poets and LLMs

链接：https://arxiv.org/abs/2510.16713

作者：Sriharsh Bhyravajjula,Melanie Walsh,Anna Preus,Maria Antoniak

类目：Computation and Language (cs.CL)

关键词：reflecting both adherence, critical component, adherence to standardized, Whitespace, standardized forms

备注：

点击查看摘要

Abstract:Whitespace is a critical component of poetic form, reflecting both adherence to standardized forms and rebellion against those forms. Each poem's whitespace distribution reflects the artistic choices of the poet and is an integral semantic and spatial feature of the poem. Yet, despite the popularity of poetry as both a long-standing art form and as a generation task for large language models (LLMs), whitespace has not received sufficient attention from the NLP community. Using a corpus of 19k English-language published poems from Poetry Foundation, we investigate how 4k poets have used whitespace in their works. We release a subset of 2.8k public-domain poems with preserved formatting to facilitate further research in this area. We compare whitespace usage in the published poems to (1) 51k LLM-generated poems, and (2) 12k unpublished poems posted in an online community. We also explore whitespace usage across time periods, poetic forms, and data sources. Additionally, we find that different text processing methods can result in significantly different representations of whitespace in poetry data, motivating us to use these poems and whitespace patterns to discuss implications for the processing strategies used to assemble pretraining datasets for LLMs.

102. 【2510.16712】he Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models

链接：https://arxiv.org/abs/2510.16712

作者：Shivam Ratnakar,Sanjay Raghavendra

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Large Language Models, Integration of Large, Large Language, retrieval engines, undermines their reliability

备注：

点击查看摘要

Abstract:Integration of Large Language Models with search/retrieval engines has become ubiquitous, yet these systems harbor a critical vulnerability that undermines their reliability. We present the first systematic investigation of "chameleon behavior" in LLMs: their alarming tendency to shift stances when presented with contradictory questions in multi-turn conversations (especially in search-enabled LLMs). Through our novel Chameleon Benchmark Dataset, comprising 17,770 carefully crafted question-answer pairs across 1,180 multi-turn conversations spanning 12 controversial domains, we expose fundamental flaws in state-of-the-art systems. We introduce two theoretically grounded metrics: the Chameleon Score (0-1) that quantifies stance instability, and Source Re-use Rate (0-1) that measures knowledge diversity. Our rigorous evaluation of Llama-4-Maverick, GPT-4o-mini, and Gemini-2.5-Flash reveals consistent failures: all models exhibit severe chameleon behavior (scores 0.391-0.511), with GPT-4o-mini showing the worst performance. Crucially, small across-temperature variance (less than 0.004) suggests the effect is not a sampling artifact. Our analysis uncovers the mechanism: strong correlations between source re-use rate and confidence (r=0.627) and stance changes (r=0.429) are statistically significant (p less than 0.05), indicating that limited knowledge diversity makes models pathologically deferential to query framing. These findings highlight the need for comprehensive consistency evaluation before deploying LLMs in healthcare, legal, and financial systems where maintaining coherent positions across interactions is critical for reliable decision support.

103. 【2510.16708】Natural Language Processing Applications in Cardiology: A Narrative Review

链接：https://arxiv.org/abs/2510.16708

作者：Kailai Yang,Yan Leng,Xin Zhang,Tianlin Zhang,Paul Thompson,Bernard Keavney,Maciej Tomaszewski,Sophia Ananiadou

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：health and well-being, NLP, prevalent in modern, modern society, significant effect

备注：

点击查看摘要

Abstract:Cardiovascular disease has become increasingly prevalent in modern society and has a significant effect on global health and well-being. Heart-related conditions are intricate, multifaceted disorders, which may be influenced by a combination of genetic predispositions, lifestyle choices, and various socioeconomic and clinical factors. Information regarding these potentially complex interrelationships is dispersed among diverse types of textual data, which include patient narratives, medical records, and scientific literature, among others. Natural language processing (NLP) techniques have increasingly been adopted as a powerful means to analyse and make sense of this vast amount of unstructured data. This, in turn, can allow healthcare professionals to gain deeper insights into the cardiology field, which has the potential to revolutionize current approaches to the diagnosis, treatment, and prevention of cardiac problems. This review provides a detailed overview of NLP research in cardiology between 2014 and 2025. We queried six literature databases to find articles describing the application of NLP techniques in the context of a range of different cardiovascular diseases. Following a rigorous screening process, we identified a total of 265 relevant articles. We analysed each article from multiple dimensions, i.e., NLP paradigm types, cardiology-related task types, cardiovascular disease types, and data source types. Our analysis reveals considerable diversity within each of these dimensions, thus demonstrating the considerable breadth of NLP research within the field. We also perform a temporal analysis, which illustrates the evolution and changing trends in NLP methods employed over the last decade that we cover. To our knowledge, the review constitutes the most comprehensive overview of NLP research in cardiology to date.

104. 【2510.16686】Investigating the Impact of Rationales for LLMs on Natural Language Understanding

链接：https://arxiv.org/abs/2510.16686

作者：Wenhang Shi,Shuqing Bian,Yiren Chen,Xinyi Zhang,Zhe Zhao,Pengfei Hu,Wei Lu,Xiaoyong Du

类目：Computation and Language (cs.CL)

关键词：derive final answers, NLU, derive final, NLU tasks, tasks

备注：

点击查看摘要

Abstract:Chain-of-thought (CoT) rationales, which provide step-by-step reasoning to derive final answers, benefit LLMs in both inference and training. Incorporating rationales, either by generating them before answering during inference, or by placing them before or after the original answers during training - significantly improves model performance on mathematical, symbolic and commonsense reasoning tasks. However, most work focuses on the role of rationales in these reasoning tasks, overlooking their potential impact on other important tasks like natural language understanding (NLU) tasks. In this work, we raise the question: Can rationales similarly benefit NLU tasks? To conduct a systematic exploration, we construct NLURC, a comprehensive and high-quality NLU dataset collection with rationales, and develop various rationale-augmented methods. Through exploring the applicability of these methods on NLU tasks using the dataset, we uncover several potentially surprising findings: (1) CoT inference shifts from hindering NLU performance to surpassing direct label prediction as model size grows, indicating a positive correlation. (2) Most rationale-augmented training methods perform worse than label-only training, with one specially designed method consistently achieving improvements. (3) LLMs trained with rationales achieve significant performance gains on unseen NLU tasks, rivaling models ten times their size, while delivering interpretability on par with commercial LLMs.

105. 【2510.16685】mporal Understanding under Deictic Frame of Reference

链接：https://arxiv.org/abs/2510.16685

作者：Damin Zhang,Julia Rayz

类目：Computation and Language (cs.CL)

关键词：spatial metaphors grounded, sensory-motor experience, conceptualized through spatial, spatial metaphors, metaphors grounded

备注： Under review

点击查看摘要

Abstract:Understanding time is fundamental to human cognition, where temporal experience is often conceptualized through spatial metaphors grounded in sensory-motor experience. For example, "summer is approaching" parallels "We are approaching the summer". In such expressions, humans rely on a frame of reference (FoR) to interpret meaning relative to a particular viewpoint. Extending this concept to time, a temporal frame of reference (t-FoR) defines how temporal relations are perceived relative to an experiencer's moment of "now". While Large Language Models (LLMs) have shown remarkable advances in natural language understanding, their ability to interpret and reason about time remains limited. In this work, we introduce TUuD (Temporal Understanding under Deictic t-FoR), a framework that evaluates how LLMs interpret time-event and event-event relations when the reference point of "now" dynamically shifts along a timeline. Following recent work on temporal cognition \cite{li2025other}, LLMs are prompted to rate the similarity between the current moment and a target event from 0.00 (completely dissimilar) to 1.00 (highly similar), where similarity quantifies perceived temporal alignment between the two points. Our results show that four evaluated LLMs exhibit measurable adaptation to a deictic t-FoR, with similarity ratings peaking around the present and decreasing toward past and future events. The adaptation, however, weakens beyond near-term contexts, suggesting that while LLMs display partial human-like temporal cognition, their temporal reasoning remains sensitive to reference-frame shifts and temporal distance.

106. 【2510.16670】All You Need is One: Capsule Prompt Tuning with a Single Vector

链接：https://arxiv.org/abs/2510.16670

作者：Yiyang Liu,James C. Liang,Heng Fan,Wenhao Yang,Yiming Cui,Xiaotian Han,Lifu Huang,Dongfang Liu,Qifan Wang,Cheng Han

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：facilitate Large Language, facilitate Large, Large Language Model, Prompt-based learning, Large Language

备注： NeurIPS 2025

点击查看摘要

Abstract:Prompt-based learning has emerged as a parameter-efficient finetuning (PEFT) approach to facilitate Large Language Model (LLM) adaptation to downstream tasks by conditioning generation with task-aware guidance. Despite its successes, current prompt-based learning methods heavily rely on laborious grid searching for optimal prompt length and typically require considerable number of prompts, introducing additional computational burden. Worse yet, our pioneer findings indicate that the task-aware prompt design is inherently limited by its absence of instance-aware information, leading to a subtle attention interplay with the input sequence. In contrast, simply incorporating instance-aware information as a part of the guidance can enhance the prompt-tuned model performance without additional fine-tuning. Moreover, we find an interesting phenomenon, namely "attention anchor", that incorporating instance-aware tokens at the earliest position of the sequence can successfully preserve strong attention to critical structural information and exhibit more active attention interaction with all input tokens. In light of our observation, we introduce Capsule Prompt-Tuning (CaPT), an efficient and effective solution that leverages off-the-shelf, informative instance semantics into prompt-based learning. Our approach innovatively integrates both instance-aware and task-aware information in a nearly parameter-free manner (i.e., one single capsule prompt). Empirical results demonstrate that our method can exhibit superior performance across various language tasks (e.g., 84.03\% average accuracy on T5-Large), serving as an "attention anchor," while enjoying high parameter efficiency (e.g., 0.003\% of model parameters on Llama3.2-1B).

107. 【2510.16645】Unleashing Diverse Thinking Modes in LLMs through Multi-Agent Collaboration

链接：https://arxiv.org/abs/2510.16645

作者：Zhixuan He,Yue Feng

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词：Large Language Models, Large Language, demonstrate strong performance, Diverse Thinking Modes, lack interpretable reasoning

备注：

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate strong performance but often lack interpretable reasoning. This paper introduces the Multi-Agent Collaboration Framework for Diverse Thinking Modes (DiMo), which enhances both performance and interpretability by simulating a structured debate among four specialized LLM agents. Each agent embodies a distinct reasoning paradigm, allowing the framework to collaboratively explore diverse cognitive approaches. Through iterative debate, agents challenge and refine initial responses, yielding more robust conclusions and an explicit, auditable reasoning chain. Across six benchmarks and under a unified open-source setup, DiMo improves accuracy over widely used single-model and debate baselines, with the largest gains on math. We position DiMo as a semantics-aware, Web-native multi-agent framework: it models human-machine intelligence with LLM agents that produce semantically typed, URL-annotated evidence chains for explanations and user-friendly interactions. Although our experiments use standard reasoning benchmarks, the framework is designed to be instantiated over Web corpora and knowledge graphs, combining retrieval-augmented reasoning with structured justifications that downstream systems can inspect and reuse.

108. 【2510.16635】Prompt Optimization via Retrieved Reasoning Assets and Multi-Agent Analysis

链接：https://arxiv.org/abs/2510.16635

作者：Wonduk Seo,Juhyeon Lee,Junseo Koh,Hyunjin An,Jian Park,Seunghyun Lee,Haihua Chen,Yi Bu

类目：Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

关键词：Large Language Models, Language Models, Large Language, performance of Large, effective alternative

备注： Preprint

点击查看摘要

Abstract:Prompt optimization has emerged as an effective alternative to retraining for improving the performance of Large Language Models (LLMs). However, most existing approaches treat evaluation as a black box, relying solely on numerical scores while offering limited insight into why a prompt succeeds or fails. They also depend heavily on trial-and-error refinements, which are difficult to interpret and control. In this paper, we introduce MA-SAPO, a Multi-Agent framework for Score-Aware Prompt Optimization. Compared to prior methods, MA-SAPO explicitly couples evaluation outcomes with structured reasoning to guide systematic edits. The framework specifically consists of two stages: during the Reasoning Phase, agents collaboratively explain metric scores, diagnose weaknesses, and synthesize targeted refinements that are stored as reusable reasoning assets; during the Test Phase, agents retrieve these assets to analyze optimized prompts and apply only evidence-grounded edits. By turning evaluation signals into interpretable reasoning chains, MA-SAPO produces prompt refinements that are more transparent, auditable, and controllable. Experiments on the HelpSteer1/2 benchmarks demonstrate consistent improvements over single-pass prompting, retrieval-augmented baselines, and prior multi-agent strategies, validating the effectiveness of our approach.

109. 【2510.16604】Fine-tuning of Large Language Models for Constituency Parsing Using a Sequence to Sequence Approach

链接：https://arxiv.org/abs/2510.16604

作者：Francisco Jose Cortes Delgado,Eduardo Martinez Gracia,Rafael Valencia Garcia

类目：Computation and Language (cs.CL)

关键词：natural language processing, Recent advances, syntactic analysis based, large neural models, machine learning

备注： 6 pages, 3 figures. Submitted to SEPLN 2023 Conference

点击查看摘要

Abstract:Recent advances in natural language processing with large neural models have opened new possibilities for syntactic analysis based on machine learning. This work explores a novel approach to phrase-structure analysis by fine-tuning large language models (LLMs) to translate an input sentence into its corresponding syntactic structure. The main objective is to extend the capabilities of MiSintaxis, a tool designed for teaching Spanish syntax. Several models from the Hugging Face repository were fine-tuned using training data generated from the AnCora-ES corpus, and their performance was evaluated using the F1 score. The results demonstrate high accuracy in phrase-structure analysis and highlight the potential of this methodology.

110. 【2510.16588】Copy-Augmented Representation for Structure Invariant Template-Free Retrosynthesis

链接：https://arxiv.org/abs/2510.16588

作者：Jiaxi Zhuang,Yu Zhang,Aimin Zhou,Ying Qian

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：Retrosynthesis prediction, requiring the identification, produce a target, chemical synthesis, Retrosynthesis

备注：

点击查看摘要

Abstract:Retrosynthesis prediction is fundamental to drug discovery and chemical synthesis, requiring the identification of reactants that can produce a target molecule. Current template-free methods struggle to capture the structural invariance inherent in chemical reactions, where substantial molecular scaffolds remain unchanged, leading to unnecessarily large search spaces and reduced prediction accuracy. We introduce C-SMILES, a novel molecular representation that decomposes traditional SMILES into element-token pairs with five special tokens, effectively minimizing editing distance between reactants and products. Building upon this representation, we incorporate a copy-augmented mechanism that dynamically determines whether to generate new tokens or preserve unchanged molecular fragments from the product. Our approach integrates SMILES alignment guidance to enhance attention consistency with ground-truth atom mappings, enabling more chemically coherent predictions. Comprehensive evaluation on USPTO-50K and large-scale USPTO-FULL datasets demonstrates significant improvements: 67.2% top-1 accuracy on USPTO-50K and 50.8% on USPTO-FULL, with 99.9% validity in generated molecules. This work establishes a new paradigm for structure-aware molecular generation with direct applications in computational drug discovery.

111. 【2510.16573】AI-Generated Text Detection in Low-Resource Languages: A Case Study on Urdu

链接：https://arxiv.org/abs/2510.16573

作者：Muhammad Ammar,Hadiya Murad Hadi,Usman Majeed Butt

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Large Language Models, resembles human writing, closely resembles human, Large Language, making them powerful

备注：

点击查看摘要

Abstract:Large Language Models (LLMs) are now capable of generating text that closely resembles human writing, making them powerful tools for content creation, but this growing ability has also made it harder to tell whether a piece of text was written by a human or by a machine. This challenge becomes even more serious for languages like Urdu, where there are very few tools available to detect AI-generated text. To address this gap, we propose a novel AI-generated text detection framework tailored for the Urdu language. A balanced dataset comprising 1,800 humans authored, and 1,800 AI generated texts, sourced from models such as Gemini, GPT-4o-mini, and Kimi AI was developed. Detailed linguistic and statistical analysis was conducted, focusing on features such as character and word counts, vocabulary richness (Type Token Ratio), and N-gram patterns, with significance evaluated through t-tests and MannWhitney U tests. Three state-of-the-art multilingual transformer models such as mdeberta-v3-base, distilbert-base-multilingualcased, and xlm-roberta-base were fine-tuned on this dataset. The mDeBERTa-v3-base achieved the highest performance, with an F1-score 91.29 and accuracy of 91.26% on the test set. This research advances efforts in contesting misinformation and academic misconduct in Urdu-speaking communities and contributes to the broader development of NLP tools for low resource languages.

112. 【2510.16567】Hallucination Benchmark for Speech Foundation Models

链接：https://arxiv.org/abs/2510.16567

作者：Alkis Koudounas,Moreno La Quatra,Manuel Giollo,Sabato Marco Siniscalchi,Elena Baralis

类目：Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词：underlying acoustic input, coherent transcriptions produced, neural ASR models, systems refer, acoustic input

备注： Under Review

点击查看摘要

Abstract:Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). While similar to conventional decoding errors in potentially compromising the usability of transcriptions for downstream applications, hallucinations can be more detrimental due to their preservation of syntactically and semantically plausible structure. This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. Conventional evaluation metrics are primarily centered on error-based metrics and fail to distinguish between phonetic inaccuracies and hallucinations. Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. To this end, we introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic. We define targeted metrics within each category to produce interpretable profiles of model behavior. Through evaluation across various architectures and speech domains, we have found that SHALLOW metrics correlate strongly with word error rate (WER) when recognition quality is high (i.e., low WER). Still, this correlation weakens substantially as WER increases. SHALLOW, therefore, captures fine-grained error patterns that WER fails to distinguish under degraded and challenging conditions. Our framework supports specific diagnosis of model weaknesses and provides feedback for model improvement beyond what aggregate error rates can offer.

113. 【2510.16565】Language over Content: Tracing Cultural Understanding in Multilingual Large Language Models

链接：https://arxiv.org/abs/2510.16565

作者：Seungho Cho,Changgeon Ko,Eui Jun Hwang,Junmyeong Lee,Huije Lee,Jong C. Park

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Large language models, making accurate cultural, diverse cultural contexts, Large language, making accurate

备注： Accepted to CIKM 2025 Workshop on Human Centric AI

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used across diverse cultural contexts, making accurate cultural understanding essential. Prior evaluations have mostly focused on output-level performance, obscuring the factors that drive differences in responses, while studies using circuit analysis have covered few languages and rarely focused on culture. In this work, we trace LLMs' internal cultural understanding mechanisms by measuring activation path overlaps when answering semantically equivalent questions under two conditions: varying the target country while fixing the question language, and varying the question language while fixing the country. We also use same-language country pairs to disentangle language from cultural aspects. Results show that internal paths overlap more for same-language, cross-country questions than for cross-language, same-country questions, indicating strong language-specific patterns. Notably, the South Korea-North Korea pair exhibits low overlap and high variability, showing that linguistic similarity does not guarantee aligned internal representation.

114. 【2510.16549】ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation

链接：https://arxiv.org/abs/2510.16549

作者：Haoxuan Zhang,Ruochi Li,Sarthak Shrestha,Shree Harshini Mamidala,Revanth Putta,Arka Krishan Aggarwal,Ting Xiao,Junhua Ding,Haihua Chen

类目：Computation and Language (cs.CL)

关键词：large language models, Peer review serves, present unprecedented challenges, Peer review, gatekeeper of science

备注：

点击查看摘要

Abstract:Peer review serves as the gatekeeper of science, yet the surge in submissions and widespread adoption of large language models (LLMs) in scholarly evaluation present unprecedented challenges. Recent work has focused on using LLMs to improve review efficiency or generate insightful review content. However, unchecked deficient reviews from both human experts and AI systems threaten to systematically undermine the peer review ecosystem and compromise academic integrity. To address this critical issue, we introduce ReviewGuard, an automated system for detecting and categorizing deficient reviews. ReviewGuard employs a comprehensive four-stage LLM-driven framework that: (1) collects ICLR and NeurIPS papers with their corresponding reviews from OpenReview; (2) annotates review types using GPT-4.1 with human validation; (3) addresses class imbalance and data scarcity through LLM-driven synthetic data augmentation, producing a final corpus of 6,634 papers, 24,657 real reviews, and 46,438 synthetic reviews; and (4) fine-tunes both encoder-based models and open source LLMs. We perform comprehensive feature analysis of the structure and quality of the review text. Compared to sufficient reviews, deficient reviews demonstrate lower rating scores, higher self-reported confidence, reduced structural complexity, and a higher proportion of negative sentiment. AI-generated text detection reveals that, since ChatGPT's emergence, AI-generated reviews have increased dramatically. In the evaluation of deficient review detection models, mixed training with synthetic and real review data provides substantial enhancements to recall and F1 scores on the binary task. This study presents the first LLM-driven system for detecting deficient peer reviews, providing evidence to inform AI governance in peer review while offering valuable insights into human-AI collaboration to maintain academic integrity.

115. 【2510.16499】Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection

链接：https://arxiv.org/abs/2510.16499

作者：Michelle Yuan,Khushbu Pahwa,Shuaichen Chang,Mustafa Kaba,Jiarong Jiang,Xiaofei Ma,Yi Zhang,Monica Sunkara

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Designing effective agentic, Designing effective, uncertain environments, requires the seamless, models within dynamic

备注： Accepted to NeurIPS 2025 Conference

点击查看摘要

Abstract:Designing effective agentic systems requires the seamless composition and integration of agents, tools, and models within dynamic and uncertain environments. Most existing methods rely on static, semantic retrieval approaches for tool or agent discovery. However, effective reuse and composition of existing components remain challenging due to incomplete capability descriptions and the limitations of retrieval methods. Component selection suffers because the decisions are not based on capability, cost, and real-time utility. To address these challenges, we introduce a structured, automated framework for agentic system composition that is inspired by the knapsack problem. Our framework enables a composer agent to systematically identify, select, and assemble an optimal set of agentic components by jointly considering performance, budget constraints, and compatibility. By dynamically testing candidate components and modeling their utility in real-time, our approach streamlines the assembly of agentic systems and facilitates scalable reuse of resources. Empirical evaluation with Claude 3.5 Sonnet across five benchmarking datasets shows that our online-knapsack-based composer consistently lies on the Pareto frontier, achieving higher success rates at significantly lower component costs compared to our baselines. In the single-agent setup, the online knapsack composer shows a success rate improvement of up to 31.6% in comparison to the retrieval baselines. In multi-agent systems, the online knapsack composer increases success rate from 37% to 87% when agents are selected from an agent inventory of 100+ agents. The substantial performance gap confirms the robust adaptability of our method across diverse domains and budget constraints.

116. 【2510.16492】Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

链接：https://arxiv.org/abs/2510.16492

作者：Vamshi Krishna Bonagiri,Ponnurangam Kumaragurum,Khanh Nguyen,Benjamin Plaut

类目：Computation and Language (cs.CL)

关键词：Large Language Model, Large Language, Language Model, agents increasingly operate, increasingly operate

备注： Reliable ML and Regulatable ML workshops, Neurips 2025

点击查看摘要

Abstract:As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while maintaining a negligible average decrease of -0.03 in helpfulness. Our analysis demonstrates that simply adding explicit quit instructions proves to be a highly effective safety mechanism that can immediately be deployed in existing agent systems, and establishes quitting as an effective first-line defense mechanism for autonomous agents in high-stakes applications.

117. 【2510.16458】Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations

链接：https://arxiv.org/abs/2510.16458

作者：Pingjun Hong,Beiduo Chen,Siyao Peng,Marie-Catherine de Marneffe,Benjamin Roth,Barbara Plank

类目：Computation and Language (cs.CL)

关键词：Natural Language Inference, Language Inference datasets, Natural Language, Language Inference, exhibit human label

备注：

点击查看摘要

Abstract:Natural Language Inference datasets often exhibit human label variation. To better understand these variations, explanation-based approaches analyze the underlying reasoning behind annotators' decisions. One such approach is the LiTEx taxonomy, which categorizes free-text explanations in English into reasoning types. However, previous work applying such taxonomies has focused on within-label variation: cases where annotators agree on the final NLI label but provide different explanations. In contrast, this paper broadens the scope by examining how annotators may diverge not only in the reasoning type but also in the labeling step. We use explanations as a lens to decompose the reasoning process underlying NLI annotation and to analyze individual differences. We apply LiTEx to two NLI English datasets and align annotation variation from multiple aspects: NLI label agreement, explanation similarity, and taxonomy agreement, with an additional compounding factor of annotators' selection bias. We observe instances where annotators disagree on the label but provide highly similar explanations, suggesting that surface-level disagreement may mask underlying agreement in interpretation. Moreover, our analysis reveals individual preferences in explanation strategies and label choices. These findings highlight that agreement in reasoning types better reflects the semantic similarity of free-text explanations than label agreement alone. Our findings underscore the richness of reasoning-based explanations and the need for caution in treating labels as ground truth.

118. 【2510.16455】RAVEN: Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning

链接：https://arxiv.org/abs/2510.16455

作者：Deyi Ji,Yuekui Yang,Haiyang Wu,Shaoping Ma,Tianrun Chen,Lanyun Zhu

类目：Computation and Language (cs.CL)

关键词：ensuring platform compliance, existing methods struggle, video violation detection, Relative Policy Optimization, Group Relative Policy

备注： ACL 2025 (Oral, Industry Track)

点击查看摘要

Abstract:Advertisement (Ad) video violation detection is critical for ensuring platform compliance, but existing methods struggle with precise temporal grounding, noisy annotations, and limited generalization. We propose RAVEN, a novel framework that integrates curriculum reinforcement learning with multimodal large language models (MLLMs) to enhance reasoning and cognitive capabilities for violation detection. RAVEN employs a progressive training strategy, combining precisely and coarsely annotated data, and leverages Group Relative Policy Optimization (GRPO) to develop emergent reasoning abilities without explicit reasoning annotations. Multiple hierarchical sophisticated reward mechanism ensures precise temporal grounding and consistent category prediction. Experiments on industrial datasets and public benchmarks show that RAVEN achieves superior performances in violation category accuracy and temporal interval localization. We also design a pipeline to deploy the RAVEN on the online Ad services, and online A/B testing further validates its practical applicability, with significant improvements in precision and recall. RAVEN also demonstrates strong generalization, mitigating the catastrophic forgetting issue associated with supervised fine-tuning.

119. 【2510.16449】rajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model

链接：https://arxiv.org/abs/2510.16449

作者：Bin Yu,Xinming Wang,Shijie Lian,Haotian Li,Changti Wu,Ruina Hu,Bailing Wang,Yuliang Wei,Kai Chen

类目：Computation and Language (cs.CL)

关键词：Large language models, shown remarkable progress, allocate additional compute, complex reasoning tasks, Large language

备注： 13 pages, 6 figures. Project website: [this https URL](https://zgca-ai4edu.github.io/TrajSelector)

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable progress in complex reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that allocate additional compute during inference. Among these, external TTS (particularly the Best-of-N selection paradigm) yields scalable performance improvements by selecting from multiple independently generated reasoning trajectories. However, this approach faces key limitations: (i) the high computational overhead of deploying process reward models, (ii) the underutilization of the LLM's intrinsic latent representations. We introduce TrajSelector, an efficient and effective Best-of-N framework that exploit the hidden states in the sampler LLM for process-level scoring. A lightweight verifier (with only 0.6B parameters) evaluates the quality of step-wise trajectory, and then aggregates these scores to identify the optimal reasoning trajectory. Our framework employs a fully data-driven, end-to-end training recipe that eliminates reliance on massive step-level annotations. Experiential results across five benchmarks demonstrate that TrajSelector delivers consistent performance gains. In Best-of-32 settings, it surpasses majority voting by 4.61% accuracy and outperforms existing process reward models by 4.31% to 12.21%, all while maintaining lower inference costs.

120. 【2510.16439】FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution

链接：https://arxiv.org/abs/2510.16439

作者：Syed Rifat Raiyan,Md Farhan Ishmam,Abdullah Al Imran,Mohammad Ali Moni

类目：Computation and Language (cs.CL)

关键词：inflates monetary costs, Large language models, verbosity inflates monetary, Large language, carbon footprint

备注：

点击查看摘要

Abstract:Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. Much of this overhead manifests from the redundant low-utility tokens present in typical prompts, as only a fraction of tokens typically carries the majority of the semantic weight. We address this inefficiency by introducing FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX, we assign salience scores to every token in an input sequence, rank them to preserve the top-k% tokens in their original order, and obtain a sparse frugalized prompt. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning, using a suite of frontier LLMs. For the first three tasks, a 20% prompt reduction incurs only a marginal loss in task performance, demonstrating that contemporary LLMs can reconstruct elided context from high-salience cues. In contrast, performance on mathematical reasoning deteriorates sharply, reflecting a stronger dependence on complete token continuity. Further analysis with bottom-k% and random-k% tokens reveals asymmetric performance patterns that may suggest potential task contamination effects, wherein models may resort to shallow memorized patterns from pretraining exposure for conventional NLP tasks. We posit that our work contributes to a more nuanced understanding of LLM behavior in performance-efficiency trade-offs, and delineate the boundary between tasks tolerant to contextual sparsity and those requiring exhaustive context. Our source code and models are available at: this https URL

121. 【2510.16435】What Questions Should Robots Be Able to Answer? A Dataset of User Questions for Explainable Robotics

链接：https://arxiv.org/abs/2510.16435

作者：Lennart Wachowiak,Andrew Coles,Gerard Canal,Oya Celiktutan

类目：Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词：large language models, questions, robots' ability, robots, dataset

备注：

点击查看摘要

Abstract:With the growing use of large language models and conversational interfaces in human-robot interaction, robots' ability to answer user questions is more important than ever. We therefore introduce a dataset of 1,893 user questions for household robots, collected from 100 participants and organized into 12 categories and 70 subcategories. Most work in explainable robotics focuses on why-questions. In contrast, our dataset provides a wide variety of questions, from questions about simple execution details to questions about how the robot would act in hypothetical scenarios -- thus giving roboticists valuable insights into what questions their robot needs to be able to answer. To collect the dataset, we created 15 video stimuli and 7 text stimuli, depicting robots performing varied household tasks. We then asked participants on Prolific what questions they would want to ask the robot in each portrayed situation. In the final dataset, the most frequent categories are questions about task execution details (22.5%), the robot's capabilities (12.7%), and performance assessments (11.3%). Although questions about how robots would handle potentially difficult scenarios and ensure correct behavior are less frequent, users rank them as the most important for robots to be able to answer. Moreover, we find that users who identify as novices in robotics ask different questions than more experienced users. Novices are more likely to inquire about simple facts, such as what the robot did or the current state of the environment. As robots enter environments shared with humans and language becomes central to giving instructions and interaction, this dataset provides a valuable foundation for (i) identifying the information robots need to log and expose to conversational interfaces, (ii) benchmarking question-answering modules, and (iii) designing explanation strategies that align with user expectations.

122. 【2510.16387】Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment

链接：https://arxiv.org/abs/2510.16387

作者：Fu-An Chao,Bi-Cheng Yan,Berlin Chen

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词：automatic speech recognition, well-established automatic speech, spoken language assessment, explore the untapped, well-established automatic

备注：

点击查看摘要

Abstract:In this paper, we explore the untapped potential of Whisper, a well-established automatic speech recognition (ASR) foundation model, in the context of L2 spoken language assessment (SLA). Unlike prior studies that extrinsically analyze transcriptions produced by Whisper, our approach goes a step further to probe its latent capabilities by extracting acoustic and linguistic features from hidden representations. With only a lightweight classifier being trained on top of Whisper's intermediate and final outputs, our method achieves strong performance on the GEPT picture-description dataset, outperforming existing cutting-edge baselines, including a multimodal approach. Furthermore, by incorporating image and text-prompt information as auxiliary relevance cues, we demonstrate additional performance gains. Finally, we conduct an in-depth analysis of Whisper's embeddings, which reveals that, even without task-specific fine-tuning, the model intrinsically encodes both ordinal proficiency patterns and semantic aspects of speech, highlighting its potential as a powerful foundation for SLA and other spoken language understanding tasks.

123. 【2510.16381】ATA: A Neuro-Symbolic Approach to Implement Autonomous and Trustworthy Agents

链接：https://arxiv.org/abs/2510.16381

作者：David Peer,Sebastian Stabinger

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：demonstrated impressive capabilities, Large Language Models, Large Language, including hallucinations, impressive capabilities

备注：

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities, yet their deployment in high-stakes domains is hindered by inherent limitations in trustworthiness, including hallucinations, instability, and a lack of transparency. To address these challenges, we introduce a generic neuro-symbolic approach, which we call Autonomous Trustworthy Agents (ATA). The core of our approach lies in decoupling tasks into two distinct phases: Offline knowledge ingestion and online task processing. During knowledge ingestion, an LLM translates an informal problem specification into a formal, symbolic knowledge base. This formal representation is crucial as it can be verified and refined by human experts, ensuring its correctness and alignment with domain requirements. In the subsequent task processing phase, each incoming input is encoded into the same formal language. A symbolic decision engine then utilizes this encoded input in conjunction with the formal knowledge base to derive a reliable result. Through an extensive evaluation on a complex reasoning task, we demonstrate that a concrete implementation of ATA is competitive with state-of-the-art end-to-end reasoning models in a fully automated setup while maintaining trustworthiness. Crucially, with a human-verified and corrected knowledge base, our approach significantly outperforms even larger models, while exhibiting perfect determinism, enhanced stability against input perturbations, and inherent immunity to prompt injection attacks. By generating decisions grounded in symbolic reasoning, ATA offers a practical and controllable architecture for building the next generation of transparent, auditable, and reliable autonomous agents.

124. 【2510.16380】MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

链接：https://arxiv.org/abs/2510.16380

作者：Yu Ying Chiu,Michael S. Lee,Rachel Calcott,Brandon Handoko,Paul de Font-Reaulx,Paula Rodriguez,Chen Bo Calvin Zhang,Ziwen Han,Udari Madhushani Sehwag,Yash Maurya,Christina Q Knight,Harry R. Lloyd,Florence Bacus,Mantas Mazeika,Bing Liu,Yejin Choi,Mitchell L Gordon,Sydney Levine

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

关键词：systems progress, decisions, moral, Reasoning, moral decisions

备注： 46 pages, 8 figures, 10 tables. Preprint

点击查看摘要

Abstract:As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria including identifying moral considerations, weighing trade-offs, and giving actionable recommendations to cover cases on AI advising humans moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

125. 【2510.16373】Navigating through the hidden embedding space: steering LLMs to improve mental health assessment

链接：https://arxiv.org/abs/2510.16373

作者：Federico Ravenda,Seyed Ali Bahrainian,Andrea Raballo,Antonietta Mira

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Large Language Models, Mental Health, Large Language, evolution of Large, opening new opportunities

备注：

点击查看摘要

Abstract:The rapid evolution of Large Language Models (LLMs) is transforming AI, opening new opportunities in sensitive and high-impact areas such as Mental Health (MH). Yet, despite these advancements, recent evidence reveals that smaller-scale models still struggle to deliver optimal performance in domain-specific applications. In this study, we present a cost-efficient yet powerful approach to improve MH assessment capabilities of an LLM, without relying on any computationally intensive techniques. Our lightweight method consists of a linear transformation applied to a specific layer's activations, leveraging steering vectors to guide the model's output. Remarkably, this intervention enables the model to achieve improved results across two distinct tasks: (1) identifying whether a Reddit post is useful for detecting the presence or absence of depressive symptoms (relevance prediction task), and (2) completing a standardized psychological screening questionnaire for depression based on users' Reddit post history (questionnaire completion task). Results highlight the untapped potential of steering mechanisms as computationally efficient tools for LLMs' MH domain adaptation.

126. 【2510.16363】End-to-End Argument Mining through Autoregressive Argumentative Structure Prediction

链接：https://arxiv.org/abs/2510.16363

作者：Nilmadhab Das,Vishal Vaibhav,Yash Sunil Choudhary,V. Vijaya Saradhi,Ashish Anand

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Argument Mining, Argument Components, Claim etc., Argumentative Relations, complex argumentative structures

备注： Accepted version. To appear in IJCNN 2025

点击查看摘要

Abstract:Argument Mining (AM) helps in automating the extraction of complex argumentative structures such as Argument Components (ACs) like Premise, Claim etc. and Argumentative Relations (ARs) like Support, Attack etc. in an argumentative text. Due to the inherent complexity of reasoning involved with this task, modelling dependencies between ACs and ARs is challenging. Most of the recent approaches formulate this task through a generative paradigm by flattening the argumentative structures. In contrast to that, this study jointly formulates the key tasks of AM in an end-to-end fashion using Autoregressive Argumentative Structure Prediction (AASP) framework. The proposed AASP framework is based on the autoregressive structure prediction framework that has given good performance for several NLP tasks. AASP framework models the argumentative structures as constrained pre-defined sets of actions with the help of a conditional pre-trained language model. These actions build the argumentative structures step-by-step in an autoregressive manner to capture the flow of argumentative reasoning in an efficient way. Extensive experiments conducted on three standard AM benchmarks demonstrate that AASP achieves state-of-theart (SoTA) results across all AM tasks in two benchmarks and delivers strong results in one benchmark.

127. 【2510.16359】Utilising Large Language Models for Generating Effective Counter Arguments to Anti-Vaccine Tweets

链接：https://arxiv.org/abs/2510.16359

作者：Utsav Dhanuka,Soham Poddar,Saptarshi Ghosh

类目：Computation and Language (cs.CL)

关键词：critical societal goal, combatting vaccine skepticism, social media, societal goal, era where public

备注： 14 pages, 1 figure, work done as a part of [this http URL](http://B.Tech) project at IIT Kharagpur

点击查看摘要

Abstract:In an era where public health is increasingly influenced by information shared on social media, combatting vaccine skepticism and misinformation has become a critical societal goal. Misleading narratives around vaccination have spread widely, creating barriers to achieving high immunisation rates and undermining trust in health recommendations. While efforts to detect misinformation have made significant progress, the generation of real time counter-arguments tailored to debunk such claims remains an insufficiently explored area. In this work, we explore the capabilities of LLMs to generate sound counter-argument rebuttals to vaccine misinformation. Building on prior research in misinformation debunking, we experiment with various prompting strategies and fine-tuning approaches to optimise counter-argument generation. Additionally, we train classifiers to categorise anti-vaccine tweets into multi-labeled categories such as concerns about vaccine efficacy, side effects, and political influences allowing for more context aware rebuttals. Our evaluation, conducted through human judgment, LLM based assessments, and automatic metrics, reveals strong alignment across these methods. Our findings demonstrate that integrating label descriptions and structured fine-tuning enhances counter-argument effectiveness, offering a promising approach for mitigating vaccine misinformation at scale.

128. 【2510.16340】hinking About Thinking: Evaluating Reasoning in Post-Trained Language Models

链接：https://arxiv.org/abs/2510.16340

作者：Pratham Singla,Shivank Garg,Ayush Singh,Ishan Garg,Ketan Suhaas Saichandran

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：endowed Large Language, Large Language Models, Large Language, supplementary planning tokens, Recent advances

备注：

点击查看摘要

Abstract:Recent advances in post-training techniques have endowed Large Language Models (LLMs) with enhanced capabilities for tackling complex, logic-intensive tasks through the generation of supplementary planning tokens. This development raises a fundamental question: Are these models aware of what they "learn" and "think"? To address this, we define three core competencies: (1) awareness of learned latent policies, (2) generalization of these policies across domains, and (3) alignment between internal reasoning traces and final outputs. We empirically evaluate these abilities on several tasks, each designed to require learning a distinct policy. Furthermore, we contrast the profiles of models post-trained via Supervised Fine-Tuning (SFT), Direct Policy Optimization (DPO), and Group Relative Policy Optimization (GRPO). Our findings indicate that RL-trained models not only demonstrate greater awareness of their learned behaviors and stronger generalizability to novel, structurally similar tasks than SFT models but also often exhibit weak alignment between their reasoning traces and final outputs, an effect most pronounced in GRPO-trained models.

129. 【2510.16334】Investigating the Association Between Text-Based Indications of Foodborne Illness from Yelp Reviews and New York City Health Inspection Outcomes (2023)

链接：https://arxiv.org/abs/2510.16334

作者：Eden Shaveet,Crystal Su,Daniel Hsu,Luis Gravano

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词：gastrointestinal conditions caused, consuming contaminated food, Foodborne illnesses, illnesses are gastrointestinal, gastrointestinal conditions

备注： Presented as a poster at Data Science Day 2024

点击查看摘要

Abstract:Foodborne illnesses are gastrointestinal conditions caused by consuming contaminated food. Restaurants are critical venues to investigate outbreaks because they share sourcing, preparation, and distribution of foods. Public reporting of illness via formal channels is limited, whereas social media platforms host abundant user-generated content that can provide timely public health signals. This paper analyzes signals from Yelp reviews produced by a Hierarchical Sigmoid Attention Network (HSAN) classifier and compares them with official restaurant inspection outcomes issued by the New York City Department of Health and Mental Hygiene (NYC DOHMH) in 2023. We evaluate correlations at the Census tract level, compare distributions of HSAN scores by prevalence of C-graded restaurants, and map spatial patterns across NYC. We find minimal correlation between HSAN signals and inspection scores at the tract level and no significant differences by number of C-graded restaurants. We discuss implications and outline next steps toward address-level analyses.

130. 【2510.16290】Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models

链接：https://arxiv.org/abs/2510.16290

作者：Yue Zheng,Xiufang Shi,Jiming Chen,Yuanchao Shu

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：rapidly advanced, advanced by recent, recent development, development of Vision-Language, Vision-Language Models

备注：

点击查看摘要

Abstract:Video anomaly detection (VAD) has rapidly advanced by recent development of Vision-Language Models (VLMs). While these models offer superior zero-shot detection capabilities, their immense computational cost and unstable visual grounding performance hinder real-time deployment. To overcome these challenges, we introduce Cerberus, a two-stage cascaded system designed for efficient yet accurate real-time VAD. Cerberus learns normal behavioral rules offline, and combines lightweight filtering with fine-grained VLM reasoning during online inference. The performance gains of Cerberus come from two key innovations: motion mask prompting and rule-based deviation detection. The former directs the VLM's attention to regions relevant to motion, while the latter identifies anomalies as deviations from learned norms rather than enumerating possible anomalies. Extensive evaluations on four datasets show that Cerberus on average achieves 57.68 fps on an NVIDIA L40S GPU, a 151.79$\times$ speedup, and 97.2\% accuracy comparable to the state-of-the-art VLM-based VAD methods, establishing it as a practical solution for real-time video analytics.

131. 【2510.16282】Instant Personalized Large Language Model Adaptation via Hypernetwork

链接：https://arxiv.org/abs/2510.16282

作者：Zhaoxuan Tan,Zixuan Zhang,Haoyang Wen,Zheng Li,Rongzhi Zhang,Pei Chen,Fengran Mo,Zheyuan Liu,Qingkai Zeng,Qingyu Yin,Meng Jiang

类目：Computation and Language (cs.CL)

关键词：Personalized large language, large language models, Personalized large, language models, tailor content

备注：

点击查看摘要

Abstract:Personalized large language models (LLMs) tailor content to individual preferences using user profiles or histories. However, existing parameter-efficient fine-tuning (PEFT) methods, such as the ``One-PEFT-Per-User'' (OPPU) paradigm, require training a separate adapter for each user, making them computationally expensive and impractical for real-time updates. We introduce Profile-to-PEFT, a scalable framework that employs a hypernetwork, trained end-to-end, to map a user's encoded profile directly to a full set of adapter parameters (e.g., LoRA), eliminating per-user training at deployment. This design enables instant adaptation, generalization to unseen users, and privacy-preserving local deployment. Experimental results demonstrate that our method outperforms both prompt-based personalization and OPPU while using substantially fewer computational resources at deployment. The framework exhibits strong generalization to out-of-distribution users and maintains robustness across varying user activity levels and different embedding backbones. The proposed Profile-to-PEFT framework enables efficient, scalable, and adaptive LLM personalization suitable for large-scale applications.

132. 【2510.16257】owards Low-Resource Alignment to Diverse Perspectives with Sparse Feedback

链接：https://arxiv.org/abs/2510.16257

作者：Chu Fei Luo,Samuel Dahan,Xiaodan Zhu

类目：Computation and Language (cs.CL)

关键词：impact on society, language models, greater impact, important to ensure, diverse range

备注： Findings of EMNLP 2025, 5 pages

点击查看摘要

Abstract:As language models have a greater impact on society, it is important to ensure they are aligned to a diverse range of perspectives and are able to reflect nuance in human values. However, the most popular training paradigms for modern language models often assume there is one optimal answer for every query, leading to generic responses and poor alignment. In this work, we aim to enhance pluralistic alignment of language models in a low-resource setting with two methods: pluralistic decoding and model steering. We empirically demonstrate that model steering offers consistent improvement over zero-shot and few-shot baselines with only 50 annotated samples. Our proposed methods decrease false positives in several high-stakes tasks such as hate speech detection and misinformation detection, and improves the distributional alignment to human values in GlobalOpinionQA. We hope our work highlights the importance of diversity and how language models can be adapted to consider nuanced perspectives.

133. 【2510.16252】WEBSERV: A Browser-Server Environment for Efficient Training of Reinforcement Learning-based Web Agents at Scale

链接：https://arxiv.org/abs/2510.16252

作者：Yuxuan Lu,Jing Huang,Hui Liu,Jiri Gesi,Yan Han,Shihan Fu,Tianqi Zheng,Dakuo Wang

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：gained increasing attention, robust browser-side interaction, controllable server-side state, Reinforcement Learning, web agents

备注：

点击查看摘要

Abstract:Training and evaluation of Reinforcement Learning (RL) web agents have gained increasing attention, yet a scalable and efficient environment that couples realistic and robust browser-side interaction with controllable server-side state at scale is still missing. Existing environments tend to have one or more of the following issues: they overwhelm policy models with excessive and noisy context; they perform actions non-deterministically without waiting for the UI or network to stabilize; or they cannot scale isolated client-server containers effectively for parallel RL rollouts. We propose WEBSERV, an environment that includes 1) a compact, site-agnostic browser environment that balances context and action complexity, and 2) a scalable RL environment via efficient launching and resetting web-servers to enable scalable RL training and evaluation. We evaluate WEBSERV on the shopping CMS and Gitlab tasks in WebArena, achieving state-of-the-art single-prompt success rates while cutting launch latency by ~5x and storage need by ~240x, with a comparable memory footprint, enabling 200+ concurrent containers on a single host.

134. 【2510.16234】ScholarEval: Research Idea Evaluation Grounded in Literature

链接：https://arxiv.org/abs/2510.16234

作者：Hanane Nour Moussa,Patrick Queiroz Da Silva,Daniel Adu-Ampratwum,Alyson East,Zitong Lu,Nikki Puccetti,Mingyi Xue,Huan Sun,Bodhisattwa Prasad Majumder,Sachin Kumar

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：increasingly common, critical to ensure, research ideas based, research ideation, generated ideas

备注：

点击查看摘要

Abstract:As AI tools become increasingly common for research ideation, robust evaluation is critical to ensure the validity and usefulness of generated ideas. We introduce ScholarEval, a retrieval augmented evaluation framework that assesses research ideas based on two fundamental criteria: soundness - the empirical validity of proposed methods based on existing literature, and contribution - the degree of advancement made by the idea across different dimensions relative to prior research. To evaluate ScholarEval, we introduce ScholarIdeas, the first expert-annotated dataset of multi-domain research ideas and reviews, comprised of 117 ideas across four disciplines: artificial intelligence, neuroscience, biochemistry, and ecology. Our evaluation shows that ScholarEval achieves significantly higher coverage of points mentioned in the human expert annotated rubrics in ScholarIdeas compared to all baselines. Furthermore, ScholarEval is consistently preferred over our strongest baseline o4-mini-deep-research, a reasoning and search-enabled agentic system by OpenAI, in terms of evaluation actionability, depth, and evidence support. Our large-scale user study also shows that ScholarEval significantly outperforms deep research in literature engagement, idea refinement, and usefulness. We openly release our code, dataset, and ScholarEval tool for the community to use and build on.

135. 【2510.16227】What Can String Probability Tell Us About Grammaticality?

链接：https://arxiv.org/abs/2510.16227

作者：Jennifer Hu,Ethan Gotlieb Wilcox,Siyuan Song,Kyle Mahowald,Roger P. Levy

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：language models, probability, learned about grammar, Abstract, LMs

备注：

点击查看摘要

Abstract:What have language models (LMs) learned about grammar? This question remains hotly debated, with major ramifications for linguistic theory. However, since probability and grammaticality are distinct notions in linguistics, it is not obvious what string probabilities can reveal about an LM's underlying grammatical knowledge. We present a theoretical analysis of the relationship between grammar, meaning, and string probability, based on simple assumptions about the generative process of corpus data. Our framework makes three predictions, which we validate empirically using 280K sentence pairs in English and Chinese: (1) correlation between the probability of strings within minimal pairs, i.e., string pairs with minimal semantic differences; (2) correlation between models' and humans' deltas within minimal pairs; and (3) poor separation in probability space between unpaired grammatical and ungrammatical strings. Our analyses give theoretical grounding for using probability to learn about LMs' structural knowledge, and suggest directions for future work in LM grammatical evaluation.

136. 【2510.16198】EgMM-Corpus: A Multimodal Vision-Language Dataset for Egyptian Culture

链接：https://arxiv.org/abs/2510.16198

作者：Mohamed Gamil,Abdelrahman Elsayed,Abdelrahman Lila,Ahmed Gad,Hesham Abdelgawad,Mohamed Aref,Ahmed Fares

类目：Computation and Language (cs.CL)

关键词：East and Africa, Middle East, recent advances, culturally diverse datasets, multimodal culturally diverse

备注：

点击查看摘要

Abstract:Despite recent advances in AI, multimodal culturally diverse datasets are still limited, particularly for regions in the Middle East and Africa. In this paper, we introduce EgMM-Corpus, a multimodal dataset dedicated to Egyptian culture. By designing and running a new data collection pipeline, we collected over 3,000 images, covering 313 concepts across landmarks, food, and folklore. Each entry in the dataset is manually validated for cultural authenticity and multimodal coherence. EgMM-Corpus aims to provide a reliable resource for evaluating and training vision-language models in an Egyptian cultural context. We further evaluate the zero-shot performance of Contrastive Language-Image Pre-training CLIP on EgMM-Corpus, on which it achieves 21.2% Top-1 accuracy and 36.4% Top-5 accuracy in classification. These results underscore the existing cultural bias in large-scale vision-language models and demonstrate the importance of EgMM-Corpus as a benchmark for developing culturally aware models.

137. 【2510.16173】In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions

链接：https://arxiv.org/abs/2510.16173

作者：Aria Pessianzadeh,Naima Sultana,Hildegarde Van den Bulck,David Gefen,Shahin Jabari,Rezvaneh Rezapour

类目：Computation and Language (cs.CL)

关键词：human life, rise of generative, impacted many aspects, aspects of human, trust

备注：

点击查看摘要

Abstract:The rise of generative AI (GenAI) has impacted many aspects of human life. As these systems become embedded in everyday practices, understanding public trust in them also becomes essential for responsible adoption and governance. Prior work on trust in AI has largely drawn from psychology and human-computer interaction, but there is a lack of computational, large-scale, and longitudinal approaches to measuring trust and distrust in GenAI and large language models (LLMs). This paper presents the first computational study of Trust and Distrust in GenAI, using a multi-year Reddit dataset (2022--2025) spanning 39 subreddits and 197,618 posts. Crowd-sourced annotations of a representative sample were combined with classification models to scale analysis. We find that Trust and Distrust are nearly balanced over time, with shifts around major model releases. Technical performance and usability dominate as dimensions, while personal experience is the most frequent reason shaping attitudes. Distinct patterns also emerge across trustors (e.g., experts, ethicists, general users). Our results provide a methodological framework for large-scale Trust analysis and insights into evolving public perceptions of GenAI.

138. 【2510.16167】Alignment is Localized: A Causal Probe into Preference Layers

链接：https://arxiv.org/abs/2510.16167

作者：Archie Chaudhury

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：Reinforcement Learning frameworks, Reinforcement Learning, increasingly popular method, Learning frameworks, policies or guidelines

备注：

点击查看摘要

Abstract:Reinforcement Learning frameworks, particularly those utilizing human annotations, have become an increasingly popular method for preference fine-tuning, where the outputs of a language model are tuned to match a certain set of behavioral policies or guidelines. Reinforcement Learning through Human Feedback (RLHF) is perhaps the most popular implementation of such a framework, particularly for aligning LMs toward safety and human intent. However, the internal workings of how such alignment is achieved remain largely opaque. In this work, we systematically analyze preference optimization for language model alignment by applying layer-wide causal patching between a base model and its tuned counterpart across human preference pairs. We implement our methodology on \textit{Llama-3.2-1B}, and find that alignment is spatially localized: mid-layer activations encode a distinct subspace that causally determines reward-consistent behavior, while early and late layers remain largely unaffected. Utilizing LASSO regression, we also find that only a small number of layers possess non-zero coefficients linking activation distances to reward gains. Overall, we show that, at least for some language models, alignment from human-based, preferential tuning is a directional, low rank process, rather than diffuse and parameteric.

139. 【2510.16157】Zeroth-Order Sharpness-Aware Learning with Exponential Tilting

链接：https://arxiv.org/abs/2510.16157

作者：Xuchen Gong,Tian Li

类目：Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)

关键词：randomly perturbed model, Classic zeroth-order optimization, perturbed model parameters, Classic zeroth-order, original function

备注：

点击查看摘要

Abstract:Classic zeroth-order optimization approaches typically optimize for a smoothed version of the original function, i.e., the expected objective under randomly perturbed model parameters. This can be interpreted as encouraging the loss values in the perturbation set to be small on average. Popular sharpness-aware minimization (SAM) objectives, however, typically focus on the largest loss within the neighborhood to arrive at flat minima more effectively. In this work, we connect zeroth-order optimization (and its corresponding objectives) with SAM approaches explicitly, through an exponential tilting objective that provides a smooth transition between the average- and the max-loss formulations. We explore new zeroth-order algorithms to solve a soft SAM objective parameterized by a tilting parameter $t$. We provide precise characterizations of the sharpness notions of the tilted SAM framework. Practically, our approach can be used as a gradient-free and memory-efficient alternative to SAM variants, and it achieves better generalization compared to vanilla zeroth-order baselines on a wide range of downstream tasks, including classification, multiple choice QA, and language generation.

140. 【2510.16152】Publication Trend Analysis and Synthesis via Large Language Model: A Case Study of Engineering in PNAS

链接：https://arxiv.org/abs/2510.16152

作者：Mason Smetana,Lev Khazanovich

类目：Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：sparse keyword systems, static disciplinary structures, potentially sparse keyword, static disciplinary, making it cumbersome

备注： 35 pages, 10 figures

点击查看摘要

Abstract:Scientific literature is increasingly siloed by complex language, static disciplinary structures, and potentially sparse keyword systems, making it cumbersome to capture the dynamic nature of modern science. This study addresses these challenges by introducing an adaptable large language model (LLM)-driven framework to quantify thematic trends and map the evolving landscape of scientific knowledge. The approach is demonstrated over a 20-year collection of more than 1,500 engineering articles published by the Proceedings of the National Academy of Sciences (PNAS), marked for their breadth and depth of research focus. A two-stage classification pipeline first establishes a primary thematic category for each article based on its abstract. The subsequent phase performs a full-text analysis to assign secondary classifications, revealing latent, cross-topic connections across the corpus. Traditional natural language processing (NLP) methods, such as Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), confirm the resulting topical structure and also suggest that standalone word-frequency analyses may be insufficient for mapping fields with high diversity. Finally, a disjoint graph representation between the primary and secondary classifications reveals implicit connections between themes that may be less apparent when analyzing abstracts or keywords alone. The findings show that the approach independently recovers much of the journal's editorially embedded structure without prior knowledge of its existing dual-classification schema (e.g., biological studies also classified as engineering). This framework offers a powerful tool for detecting potential thematic trends and providing a high-level overview of scientific progress.

141. 【2510.16122】he Hidden Cost of Modeling P(X): Vulnerability to Membership Inference Attacks in Generative Text Classifiers

链接：https://arxiv.org/abs/2510.16122

作者：Owais Makroo,Siva Rajesh Kasa,Sumegh Roychowdhury,Karan Gupta,Nikhil Pattisapu,Santhosh Kasa,Sumit Negi

类目：Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

关键词：Membership Inference Attacks, Inference Attacks, threat by enabling, enabling adversaries, adversaries to determine

备注：

点击查看摘要

Abstract:Membership Inference Attacks (MIAs) pose a critical privacy threat by enabling adversaries to determine whether a specific sample was included in a model's training dataset. Despite extensive research on MIAs, systematic comparisons between generative and discriminative classifiers remain limited. This work addresses this gap by first providing theoretical motivation for why generative classifiers exhibit heightened susceptibility to MIAs, then validating these insights through comprehensive empirical evaluation. Our study encompasses discriminative, generative, and pseudo-generative text classifiers across varying training data volumes, evaluated on nine benchmark datasets. Employing a diverse array of MIA strategies, we consistently demonstrate that fully generative classifiers which explicitly model the joint likelihood $P(X,Y)$ are most vulnerable to membership leakage. Furthermore, we observe that the canonical inference approach commonly used in generative classifiers significantly amplifies this privacy risk. These findings reveal a fundamental utility-privacy trade-off inherent in classifier design, underscoring the critical need for caution when deploying generative classifiers in privacy-sensitive applications. Our results motivate future research directions in developing privacy-preserving generative classifiers that can maintain utility while mitigating membership inference vulnerabilities.

142. 【2510.16096】Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization

链接：https://arxiv.org/abs/2510.16096

作者：Tina Behnia,Puneesh Deora,Christos Thrampoulidis

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：making text fluent, blend statistical regularities, Language models, making text, text fluent

备注： 28 pages, 15 figures

点击查看摘要

Abstract:Language models are pretrained on sequences that blend statistical regularities (making text fluent) with factual associations between specific tokens (knowledge of facts). While recent work suggests that the variability of their interaction, such as paraphrases of factual associations, critically determines generalization ability, we lack a systematic analysis of these impacts. This paper introduces a flexible synthetic testbed that combines a statistical stream of generic tokens with an abstract factual stream of source-target token pairs, enabling fine-grained control over their interaction. The design enables the independent control of diversity nature by manipulating stream composition (contextual structure) and the diversity level by varying which statistical streams each fact appears in. Through controlled experiments, we find that while higher contextual diversity delays in-distribution (ID) factual accuracy, its impact on out-of-distribution (OOD) factual generalization depends critically on contextual structure. In some cases, OOD performance follows the same trend as ID, but in others, diversity becomes essential for non-trivial factual recall. Even when low diversity prohibits factual recall, optimal diversity levels depend on training duration. Beyond factual recall failures, we identify structures where statistical generalization fails independently, and others where both capabilities degrade. This shows how the interplay between contextual design and diversity level impacts different generalization aspects. Further, through a series of controlled interventions on the model components, we trace the OOD failures to distinct optimization bottlenecks, highlighting the importance of the embedding and unembedding layers. Our synthetic framework allows us to isolate effects that would be confounded in large-scale studies, offering a controlled testbed for future investigations.

143. 【2510.16091】Evaluating Prompting Strategies and Large Language Models in Systematic Literature Review Screening: Relevance and Task-Stage Classification

链接：https://arxiv.org/abs/2510.16091

作者：Binglan Han,Anuradha Mathrani,Teo Susnjak

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：prompting strategies interact, systematic literature reviews, study quantifies, quantifies how prompting, prompting strategies

备注：

点击查看摘要

Abstract:This study quantifies how prompting strategies interact with large language models (LLMs) to automate the screening stage of systematic literature reviews (SLRs). We evaluate six LLMs (GPT-4o, GPT-4o-mini, DeepSeek-Chat-V3, Gemini-2.5-Flash, Claude-3.5-Haiku, Llama-4-Maverick) under five prompt types (zero-shot, few-shot, chain-of-thought (CoT), CoT-few-shot, self-reflection) across relevance classification and six Level-2 tasks, using accuracy, precision, recall, and F1. Results show pronounced model-prompt interaction effects: CoT-few-shot yields the most reliable precision-recall balance; zero-shot maximizes recall for high-sensitivity passes; and self-reflection underperforms due to over-inclusivity and instability across models. GPT-4o and DeepSeek provide robust overall performance, while GPT-4o-mini performs competitively at a substantially lower dollar cost. A cost-performance analysis for relevance classification (per 1,000 abstracts) reveals large absolute differences among model-prompt pairings; GPT-4o-mini remains low-cost across prompts, and structured prompts (CoT/CoT-few-shot) on GPT-4o-mini offer attractive F1 at a small incremental cost. We recommend a staged workflow that (1) deploys low-cost models with structured prompts for first-pass screening and (2) escalates only borderline cases to higher-capacity models. These findings highlight LLMs' uneven but promising potential to automate literature screening. By systematically analyzing prompt-model interactions, we provide a comparative benchmark and practical guidance for task-adaptive LLM deployment.

144. 【2510.16079】EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

链接：https://arxiv.org/abs/2510.16079

作者：Rong Wu,Xiaoman Wang,Jianbiao Mei,Pinlong Cai,Daocheng Fu,Cheng Yang,Licheng Wen,Xuemeng Yang,Yufan Shen,Yuxin Wang,Botian Shi

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Large Language Model, Current Large Language, Current Large, Language Model, Large Language

备注：

点击查看摘要

Abstract:Current Large Language Model (LLM) agents show strong performance in tool use, but lack the crucial capability to systematically learn from their own experiences. While existing frameworks mainly focus on mitigating external knowledge gaps, they fail to address a more fundamental limitation: the inability to iteratively refine problem-solving strategies. In this work, we introduce EvolveR, a framework designed to enable agent to self-improve through a complete, closed-loop experience lifecycle. This lifecycle comprises two key stages: (1) Offline Self-Distillation, where the agent's interaction trajectories are synthesized into a structured repository of abstract, reusable strategic principles; (2) Online Interaction, where the agent interacts with tasks and actively retrieves distilled principles to guide its decision-making, accumulating a diverse set of behavioral trajectories. This loop employs a policy reinforcement mechanism to iteratively update the agent based on its performance. We demonstrate the effectiveness of EvolveR on complex multi-hop question-answering benchmarks, where it achieves superior performance over strong agentic baselines. Our work presents a comprehensive blueprint for agents that learn not only from external data but also from the consequences of their own actions, paving the way for more autonomous and continuously improving systems. Code is available at this https URL.

145. 【2510.16062】Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs

链接：https://arxiv.org/abs/2510.16062

作者：Guiyao Tie,Zenghui Yuan,Zeli Zhao,Chaoran Hu,Tianhe Gu,Ruihang Zhang,Sizhe Zhang,Junran Wu,Xiaoyue Tu,Ming Jin,Qingsong Wen,Lixing Chen,Pan Zhou,Lichao Sun

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：large language models, language models, Self-correction, large language, critical component

备注： 38 pages, 25 figures, 8 tables

点击查看摘要

Abstract:Self-correction of large language models (LLMs) emerges as a critical component for enhancing their reasoning performance. Although various self-correction methods have been proposed, a comprehensive evaluation of these methods remains largely unexplored, and the question of whether LLMs can truly correct themselves is a matter of significant interest and concern. In this study, we introduce CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies, including intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; 3) Reasoning LLMs (e.g., DeepSeek-R1) have limited optimization under additional self-correction methods and have high time costs. Interestingly, a comparatively simple chain-of-thought (CoT) baseline demonstrates competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLM's reasoning performance while highlighting the ongoing challenge of improving their efficiency. Consequently, we advocate for further research focused on optimizing the balance between reasoning capabilities and operational efficiency. Project Page: this https URL

146. 【2510.16059】SIADAFIX: issue description response for adaptive program repair

链接：https://arxiv.org/abs/2510.16059

作者：Xin Cao,Nan Yu

类目：oftware Engineering (cs.SE); Computation and Language (cs.CL)

关键词：large language model-based, language model-based agents, propose utilizing fast, issue description response, program repair

备注： 20 pages, 3 figures

点击查看摘要

Abstract:We propose utilizing fast and slow thinking to enhance the capabilities of large language model-based agents on complex tasks such as program repair. In particular, we design an adaptive program repair method based on issue description response, called SIADAFIX. The proposed method utilizes slow thinking bug fix agent to complete complex program repair tasks, and employs fast thinking workflow decision components to optimize and classify issue descriptions, using issue description response results to guide the orchestration of bug fix agent workflows. SIADAFIX adaptively selects three repair modes, i.e., easy, middle and hard mode, based on problem complexity. It employs fast generalization for simple problems and test-time scaling techniques for complex problems. Experimental results on the SWE-bench Lite show that the proposed method achieves 60.67% pass@1 performance using the Claude-4 Sonnet model, reaching state-of-the-art levels among all open-source methods. SIADAFIX effectively balances repair efficiency and accuracy, providing new insights for automated program repair. Our code is available at this https URL.

147. 【2510.16057】Fusion-Augmented Large Language Models: Boosting Diagnostic Trustworthiness via Model Consensus

链接：https://arxiv.org/abs/2510.16057

作者：Md Kamrul Siam,Md Jobair Hossain Faruk,Jerry Q. Cheng,Huanying Gu

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：

备注： 7 pages (Accepted to IEEE BHI 2025)

点击查看摘要

None

148. 【2510.16054】PrivacyPAD: A Reinforcement Learning Framework for Dynamic Privacy-Aware Delegation

链接：https://arxiv.org/abs/2510.16054

作者：Zheng Hui,Yijiang River Dong,Sanhanat Sivapiromrat,Ehsan Shareghi,Nigel Collier

类目：Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词：Large Language Models, risk data exposure, Large Language, users submit queries, powerful proprietary LLM

备注：

点击查看摘要

Abstract:When users submit queries to Large Language Models (LLMs), their prompts can often contain sensitive data, forcing a difficult choice: Send the query to a powerful proprietary LLM providers to achieving state-of-the-art performance and risk data exposure, or relying on smaller, local models guarantees data privacy but often results in a degradation of task performance. Prior approaches have relied on static pipelines that use LLM rewriting, which shatters linguistic coherence and indiscriminately removes privacy-sensitive information, including task-critical content. We reformulate this challenge (Privacy-Conscious Delegation) as a sequential decision-making problem and introduce a novel reinforcement learning (RL) framework called PrivacyPAD to solve it. Our framework trains an agent to dynamically route text chunks, learning a policy that optimally balances the trade-off between privacy leakage and task performance. It implicitly distinguishes between replaceable Personally Identifiable Information (PII) (which it shields locally) and task-critical PII (which it strategically sends to the remote model for maximal utility). To validate our approach in complex scenarios, we also introduce a new medical dataset with high PII density. Our framework achieves a new state-of-the-art on the privacy-utility frontier, demonstrating the necessity of learned, adaptive policies for deploying LLMs in sensitive environments.

149. 【2510.16017】InfraGPT Smart Infrastructure: An End-to-End VLM-Based Framework for Detecting and Managing Urban Defects

链接：https://arxiv.org/abs/2510.16017

作者：Ibrahim Sheikh Mohamed,Abdullah Yahya Abdullah Omaisan

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)

关键词：closed circuit television, Infrastructure in smart, circuit television, smart cities, cities is increasingly

备注：

点击查看摘要

Abstract:Infrastructure in smart cities is increasingly monitored by networks of closed circuit television (CCTV) cameras. Roads, bridges and tunnels develop cracks, potholes, and fluid leaks that threaten public safety and require timely repair. Manual inspection is costly and hazardous, and existing automatic systems typically address individual defect types or provide unstructured outputs that cannot directly guide maintenance crews. This paper proposes a comprehensive pipeline that leverages street CCTV streams for multi defect detection and segmentation using the YOLO family of object detectors and passes the detections to a vision language model (VLM) for scene aware summarization. The VLM generates a structured action plan in JSON format that includes incident descriptions, recommended tools, dimensions, repair plans, and urgent alerts. We review literature on pothole, crack and leak detection, highlight recent advances in large vision language models such as QwenVL and LLaVA, and describe the design of our early prototype. Experimental evaluation on public datasets and captured CCTV clips demonstrates that the system accurately identifies diverse defects and produces coherent summaries. We conclude by discussing challenges and directions for scaling the system to city wide deployments.

150. 【2510.15990】Can GRPO Help LLMs Transcend Their Pretraining Origin?

链接：https://arxiv.org/abs/2510.15990

作者：Kangqi Ni,Zhen Tan,Zijie Liu,Pingzhi Li,Tianlong Chen

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：

备注：

点击查看摘要

None

151. 【2510.15977】Bolster Hallucination Detection via Prompt-Guided Data Augmentation

链接：https://arxiv.org/abs/2510.15977

作者：Wenyun Li,Zheng Zhang,Dongmei Jiang,Xiangyuan Lan

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：

备注：

点击查看摘要

None

152. 【2510.15972】Quantum NLP models on Natural Language Inference

链接：https://arxiv.org/abs/2510.15972

作者：Ling Sun,Peter Sullivan,Michael Martin,Yun Zhou

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：natural language processing, embedding compositional structure, compositional structure directly, Natural Language Inference, Quantum natural language

备注： Accepted, presented, and to appear in the Proceedings of the Quantum AI and NLP 2025 Conference

点击查看摘要

Abstract:Quantum natural language processing (QNLP) offers a novel approach to semantic modeling by embedding compositional structure directly into quantum circuits. This paper investigates the application of QNLP models to the task of Natural Language Inference (NLI), comparing quantum, hybrid, and classical transformer-based models under a constrained few-shot setting. Using the lambeq library and the DisCoCat framework, we construct parameterized quantum circuits for sentence pairs and train them for both semantic relatedness and inference classification. To assess efficiency, we introduce a novel information-theoretic metric, Information Gain per Parameter (IGPP), which quantifies learning dynamics independent of model size. Our results demonstrate that quantum models achieve performance comparable to classical baselines while operating with dramatically fewer parameters. The Quantum-based models outperform randomly initialized transformers in inference and achieve lower test error on relatedness tasks. Moreover, quantum models exhibit significantly higher per-parameter learning efficiency (up to five orders of magnitude more than classical counterparts), highlighting the promise of QNLP in low-resource, structure-sensitive settings. To address circuit-level isolation and promote parameter sharing, we also propose a novel cluster-based architecture that improves generalization by tying gate parameters to learned word clusters rather than individual tokens.

153. 【2510.15964】Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity

链接：https://arxiv.org/abs/2510.15964

作者：Tuowei Wang,Kun Li,Zixu Hao,Donglin Bai,Ju Ren,Yaoxue Zhang,Ting Cao,Mao Yang

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：

备注：

点击查看摘要

None

154. 【2510.15951】Attention to Non-Adopters

链接：https://arxiv.org/abs/2510.15951

作者：Kaitlyn Zhou,Kristina Gligorić,Myra Cheng,Michelle S. Lam,Vyoma Raman,Boluwatife Aminu,Caeley Woo,Michael Brockman,Hannah Cha,Dan Jurafsky

类目：Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

关键词：

备注：

点击查看摘要

None

155. 【2510.15898】HealthDial: A No-Code LLM-Assisted Dialogue Authoring Tool for Healthcare Virtual Agents

链接：https://arxiv.org/abs/2510.15898

作者：Farnaz Nouraei,Zhuorui Yong,Timothy Bickmore

类目：Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词：

备注：

点击查看摘要

None

156. 【2510.15891】Detecting and Preventing Harmful Behaviors in AI Companions: Development and Evaluation of the SHIELD Supervisory System

链接：https://arxiv.org/abs/2510.15891

作者：Ziv Ben-Zion,Paul Raffelhüschen,Max Zettl,Antonia Lüönd,Achim Burrer,Philipp Homan,Tobias R Spiller

类目：Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：

备注：

点击查看摘要

None

157. 【2510.15889】Mitigating Harmful Erraticism in LLMs Through Dialectical Behavior Therapy Based De-Escalation Strategies

链接：https://arxiv.org/abs/2510.15889

作者：Pooja Rangarajan,Jacob Boyle

类目：Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：

备注： 15 pages, 7 figures and 6 tables

点击查看摘要

None

158. 【2510.13939】Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers

链接：https://arxiv.org/abs/2510.13939

作者：Tuhin Chakrabarty,Jane C. Ginsburg,Paramveer Dhillon

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词：

备注： Preprint Under Review

点击查看摘要

None

159. 【2510.15929】Comparing LLMs for Sentiment Analysis in Financial Market News

链接：https://arxiv.org/abs/2510.15929

作者：Lucas Eduardo Pereira Teles,Carlos M. S. Figueiredo

类目：atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：

备注：

点击查看摘要

None

信息检索

1. 【2510.17670】On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration

链接：https://arxiv.org/abs/2510.17670

作者：Yehonathan Refael,Amit Aides,Aviad Barzilai,George Leifman,Genady Beryozkin,Vered Silverman,Bolous Jaber,Tomer Shekel

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词：arbitrary text queries, offer remarkable flexibility, Open-vocabulary object detection, models offer remarkable, text queries

备注：

点击查看摘要

Abstract:Open-vocabulary object detection (OVD) models offer remarkable flexibility by detecting objects from arbitrary text queries. However, their zero-shot performance in specialized domains like Remote Sensing (RS) is often compromised by the inherent ambiguity of natural language, limiting critical downstream applications. For instance, an OVD model may struggle to distinguish between fine-grained classes such as "fishing boat" and "yacht" since their embeddings are similar and often inseparable. This can hamper specific user goals, such as monitoring illegal fishing, by producing irrelevant detections. To address this, we propose a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier. Our method first employs the zero-shot model to generate high-recall object proposals. These proposals are then refined for high precision by a compact classifier trained in real-time on only a handful of user-annotated examples - drastically reducing the high costs of RS imagery this http URL core of our framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. FLAME identifies, on the fly, uncertain marginal candidates near the decision boundary using density estimation, followed by clustering to ensure sample diversity. This efficient sampling technique achieves high accuracy without costly full-model fine-tuning and enables instant adaptation, within less then a minute, which is significantly faster than state-of-the-art this http URL method consistently surpasses state-of-the-art performance on RS benchmarks, establishing a practical and resource-efficient framework for adapting foundation models to specific user needs.

2. 【2510.17614】OG-Rank: Learning to Rank Fast and Slow with Uncertainty and Reward-Trend Guided Adaptive Exploration

链接：https://arxiv.org/abs/2510.17614

作者：Praphul Singh,Corey Barrett,Sumana Srivasta,Irfan Bulu,Sri Gadde,Krishnaram Kenthapadi

类目：Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词：Clinicians need ranking, justify their choices, ranking systems, systems that work, work in real

备注：

点击查看摘要

Abstract:Clinicians need ranking systems that work in real time and still justify their choices. Motivated by the need for a low-latency, decoder-based reranker, we present OG-Rank, a single-decoder approach that pairs a pooled first-token scoring signal with an uncertainty-gated explanation step. The model scores all candidates in one pass and generates a brief, structured rationale only when the list is genuinely ambiguous, keeping latency predictable. Trained with a curriculum that concentrates effort on hard cases, OG-Rank delivers strong effectiveness on encounter-scoped order selection (fast path: Recall@1~0.45, nDCG@20~0.625) and improves further when the gate activates (Recall@1~0.56, nDCG@20~0.699 at a 45\% gate rate), while compact backbones show similar gains under the same policy. Encoder baselines trail in both effectiveness and flexibility. The result is a practical recipe: rank fast by default and explain when it helps, a pattern that applies broadly to decision tasks where selective generation buys accuracy at acceptable cost. The single-policy design simplifies deployment and budget planning, and the curriculum principle (spend more on the hard cases, less on the easy ones) readily transfers beyond clinical order selection.

3. 【2510.17535】How role-play shapes relevance judgment in zero-shot LLM rankers

链接：https://arxiv.org/abs/2510.17535

作者：Yumeng Wang,Jirui Qi,Catherine Chen,Panagiotis Eustratiadis,Suzan Verberne

类目：Information Retrieval (cs.IR)

关键词：Large Language Models, Language Models, Large Language, emerged as promising, performance is highly

备注：

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as promising zero-shot rankers, but their performance is highly sensitive to prompt formulation. In particular, role-play prompts, where the model is assigned a functional role or identity, often give more robust and accurate relevance rankings. However, the mechanisms and diversity of role-play effects remain underexplored, limiting both effective use and interpretability. In this work, we systematically examine how role-play variations influence zero-shot LLM rankers. We employ causal intervention techniques from mechanistic interpretability to trace how role-play information shapes relevance judgments in LLMs. Our analysis reveals that (1) careful formulation of role descriptions have a large effect on the ranking quality of the LLM; (2) role-play signals are predominantly encoded in early layers and communicate with task instructions in middle layers, while receiving limited interaction with query or document representations. Specifically, we identify a group of attention heads that encode information critical for role-conditioned relevance. These findings not only shed light on the inner workings of role-play in LLM ranking but also offer guidance for designing more effective prompts in IR and beyond, pointing toward broader opportunities for leveraging role-play in zero-shot applications.

4. 【2510.17354】owards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

链接：https://arxiv.org/abs/2510.17354

作者：Chenghao Zhang,Guanting Dong,Xinyu Yang,Zhicheng Dou

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词：enhancing large language, large language models, external corpus, powerful paradigm, paradigm for enhancing

备注： This work is in progress

点击查看摘要

5. 【2510.17281】MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

链接：https://arxiv.org/abs/2510.17281

作者：Qingyao Ai,Yichen Tang,Changyue Wang,Jianming Long,Weihang Su,Yiqun Liu

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词：computational resource consumption, marginal gains obtained, larger computational resource, improve LLM systems, Scaling up data

备注：

点击查看摘要

Abstract:Scaling up data, parameters, and test-time computation has been the mainstream methods to improve LLM systems (LLMsys), but their upper bounds are almost reached due to the gradual depletion of high-quality data and marginal gains obtained from larger computational resource consumption. Inspired by the abilities of human and traditional AI systems in learning from practice, constructing memory and continual learning frameworks for LLMsys has become an important and popular research direction in recent literature. Yet, existing benchmarks for LLM memory often focus on evaluating the system on homogeneous reading comprehension tasks with long-form inputs rather than testing their abilities to learn from accumulated user feedback in service time. Therefore, we propose a user feedback simulation framework and a comprehensive benchmark covering multiple domains, languages, and types of tasks to evaluate the continual learning abilities of LLMsys. Experiments show that the effectiveness and efficiency of state-of-the-art baselines are far from satisfying, and we hope this benchmark could pave the way for future studies on LLM memory and optimization algorithms.

6. 【2510.17245】On Efficiency-Effectiveness Trade-off of Diffusion-based Recommenders

链接：https://arxiv.org/abs/2510.17245

作者：Wenyu Mao,Jiancan Wu,Guoqing Hu,Wei Ji,Xiang Wang

类目：Information Retrieval (cs.IR)

关键词：generative sequential recommendation, user interaction histories, Diffusion models, multi-step denoising process, powerful paradigm

备注：

点击查看摘要

Abstract:Diffusion models have emerged as a powerful paradigm for generative sequential recommendation, which typically generate next items to recommend guided by user interaction histories with a multi-step denoising process. However, the multi-step process relies on discrete approximations, introducing discretization error that creates a trade-off between computational efficiency and recommendation effectiveness. To address this trade-off, we propose TA-Rec, a two-stage framework that achieves one-step generation by smoothing the denoising function during pretraining while alleviating trajectory deviation by aligning with user preferences during fine-tuning. Specifically, to improve the efficiency without sacrificing the recommendation performance, TA-Rec pretrains the denoising model with Temporal Consistency Regularization (TCR), enforcing the consistency between the denoising results across adjacent steps. Thus, we can smooth the denoising function to map the noise as oracle items in one step with bounded error. To further enhance effectiveness, TA-Rec introduces Adaptive Preference Alignment (APA) that aligns the denoising process with user preference adaptively based on preference pair similarity and timesteps. Extensive experiments prove that TA-Rec's two-stage objective effectively mitigates the discretization errors-induced trade-off, enhancing both efficiency and effectiveness of diffusion-based recommenders.

7. 【2510.17228】DSEBench: A Test Collection for Explainable Dataset Search with Examples

链接：https://arxiv.org/abs/2510.17228

作者：Qing Shi,Jing He,Qiaosheng Chen,Gong Cheng

类目：Information Retrieval (cs.IR)

关键词：Dataset search, established information retrieval, Explainable DSE, Dataset, input target dataset

备注： 34 pages, 5 figures, submitted to Knowledge-Based Systems

点击查看摘要

Abstract:Dataset search has been an established information retrieval task. Current paradigms either retrieve datasets that are relevant to a keyword query or find datasets that are similar to an input target dataset. To allow for their combined specification of information needs, in this article, we investigate the more generalized task of Dataset Search with Examples (DSE) and further extend it to Explainable DSE that requires identifying the metadata and content fields of a dataset that indicate its relevance to the query and similarity to the target datasets. To facilitate this research, we construct DSEBench, a test collection that provides high-quality dataset- and field-level annotations to enable the evaluation of explainable DSE. We also employ a large language model to generate numerous annotations to be used for training. We establish extensive baselines on DSEBench by adapting and evaluating a variety of sparse, dense, and LLM-based retrieval, reranking, and explanation methods.

8. 【2510.17139】Rethinking On-policy Optimization for Query Augmentation

链接：https://arxiv.org/abs/2510.17139

作者：Zhichao Xu,Shengyao Zhuang,Xueguang Ma,Bingsen Chen,Yijun Tian,Fengran Mo,Jie Cao,Vivek Srikumar

类目：Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词：Recent advances, large language models, advances in large, large language, surge of interest

备注：

点击查看摘要

9. 【2510.16925】owards Context-aware Reasoning-enhanced Generative Searching in E-commerce

链接：https://arxiv.org/abs/2510.16925

作者：Zhiding Liu,Ben Chen,Mingyue Cheng,Enchong Chen,Li Li,Chenyi Lei,Wenwu Ou,Han Li,Kun Gai

类目：Information Retrieval (cs.IR)

关键词：critical application scenarios, critical application, current query information, search, application scenarios

备注：

点击查看摘要

Abstract:Search-based recommendation is one of the most critical application scenarios in e-commerce platforms. Users' complex search contexts--such as spatiotemporal factors, historical interactions, and current query's information--constitute an essential part of their decision-making, reflecting implicit preferences that complement explicit query terms. Modeling such rich contextual signals and their intricate associations with candidate items remains a key challenge. Although numerous efforts have been devoted to building more effective search methods, existing approaches still show limitations in integrating contextual information, which hinders their ability to fully capture user intent. To address these challenges, we propose a context-aware reasoning-enhanced generative search framework for better \textbf{understanding the complicated context}. Specifically, the framework first unifies heterogeneous user and item contexts into textual representations or text-based semantic identifiers and aligns them. To overcome the lack of explicit reasoning trajectories, we introduce a self-evolving post-training paradigm that iteratively combines supervised fine-tuning and reinforcement learning to progressively enhance the model's reasoning capability. In addition, we identify potential biases in existing RL algorithms when applied to search scenarios and present a debiased variant of GRPO to improve ranking performance. Extensive experiments on search log data collected from a real-world e-commerce platform demonstrate that our approach achieves superior performance compared with strong baselines, validating its effectiveness for search-based recommendation.

Subjects:

Information Retrieval (cs.IR)

Cite as:
arXiv:2510.16925 [cs.IR]

(or
arXiv:2510.16925v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2510.16925

Focus to learn more

              arXiv-issued DOI via DataCite</p>

10. 【2510.16804】he Layout Is the Model: On Action-Item Coupling in Generative Recommendation

链接：https://arxiv.org/abs/2510.16804

作者：Xiaokai Wei,Jiajun Wu,Daiyao Yi,Reza Shirkavand,Michelle Gong

类目：Information Retrieval (cs.IR)

关键词：Generative Recommendation, user interaction history, autoregressively predicted, treat a user, user interaction

备注：

点击查看摘要

Abstract:Generative Recommendation (GR) models treat a user's interaction history as a sequence to be autoregressively predicted. When both items and actions (e.g., watch time, purchase, comment) are modeled, the layout-the ordering and visibility of item/action tokens-critically determines what information the model can use and how it generalizes. We present a unified study of token layouts for GR grounded in first principles: (P1) maximize item/action signal in both input/output space, (P2) preserve the conditioning relationship "action given item" and (P3) no information leakage. While interleaved layout (where item and action occupy separate tokens) naturally satisfies these principles, it also bloats sequence length with larger training/inference cost. On the non-interleaved front, we design a novel and effective approach, Lagged Action Conditioning (LAC), which appears strange on the surface but aligns well with the design principles to yield strong accuracy. Comprehensive experiments on public datasets and large-scale production logs evaluate different layout options and empirically verifies the design principles. Our proposed non-interleaved method, LAC, achieves competitive or superior quality at substantially lower FLOPs than interleaving. Our findings offer actionable guidance for assembling GR systems that are both accurate and efficient.

Subjects:

Information Retrieval (cs.IR)

ACMclasses:
H.3.3

Cite as:
arXiv:2510.16804 [cs.IR]

(or
arXiv:2510.16804v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2510.16804

Focus to learn more

              arXiv-issued DOI via DataCite</p>

11. 【2510.16803】An Efficient Framework for Whole-Page Reranking via Single-Modal Supervision

链接：https://arxiv.org/abs/2510.16803

作者：Zishuai Zhang,Sihao Yu,Wenyi Xie,Ying Nie,Junfeng Wang,Zhiming Zheng,Dawei Yin,Hainan Zhang

类目：Information Retrieval (cs.IR)

关键词：integrates retrieval results, LLM outputs, plays a critical, critical role, role in shaping

备注：

点击查看摘要

Abstract:The whole-page reranking plays a critical role in shaping the user experience of search engines, which integrates retrieval results from multiple modalities, such as documents, images, videos, and LLM outputs. Existing methods mainly rely on large-scale human-annotated data, which is costly to obtain and time-consuming. This is because whole-page annotation is far more complex than single-modal: it requires assessing the entire result page while accounting for cross-modal relevance differences. Thus, how to improve whole-page reranking performance while reducing annotation costs is still a key challenge in optimizing search engine result pages(SERP). In this paper, we propose SMAR, a novel whole-page reranking framework that leverages strong Single-modal rankers to guide Modal-wise relevance Alignment for effective Reranking, using only limited whole-page annotation to outperform fully-annotated reranking models. Specifically, high-quality single-modal rankers are first trained on data specific to their respective modalities. Then, for each query, we select a subset of their outputs to construct candidate pages and perform human annotation at the page level. Finally, we train the whole-page reranker using these limited annotations and enforcing consistency with single-modal preferences to maintain ranking quality within each modality. Experiments on the Qilin and Baidu datasets demonstrate that SMAR reduces annotation costs by about 70-90\% while achieving significant ranking improvements compared to baselines. Further offline and online A/B testing on Baidu APPs also shows notable gains in standard ranking metrics as well as user experience indicators, fully validating the effectiveness and practical value of our approach in real-world search scenarios.

12. 【2510.16736】Exact Nearest-Neighbor Search on Energy-Efficient FPGA Devices

链接：https://arxiv.org/abs/2510.16736

作者：Patrizio Dazzi,William Guglielmo,Franco Maria Nardini,Raffaele Perego,Salvatore Trani

类目：Information Retrieval (cs.IR); Distributed, Parallel, and Cluster Computing (cs.DC)

关键词：high-dimension latent spaces, exact kNN search, latent spaces, energy-efficient exact kNN, investigates the usage

备注：

点击查看摘要

Abstract:This paper investigates the usage of FPGA devices for energy-efficient exact kNN search in high-dimension latent spaces. This work intercepts a relevant trend that tries to support the increasing popularity of learned representations based on neural encoder models by making their large-scale adoption greener and more inclusive. The paper proposes two different energy-efficient solutions adopting the same FPGA low-level configuration. The first solution maximizes system throughput by processing the queries of a batch in parallel over a streamed dataset not fitting into the FPGA memory. The second minimizes latency by processing each kNN incoming query in parallel over an in-memory dataset. Reproducible experiments on publicly available image and text datasets show that our solution outperforms state-of-the-art CPU-based competitors regarding throughput, latency, and energy consumption. Specifically, experiments show that the proposed FPGA solutions achieve the best throughput in terms of queries per second and the best-observed latency with scale-up factors of up to 16.6X. Similar considerations can be made regarding energy efficiency, where results show that our solutions can achieve up to 11.9X energy saving w.r.t. strong CPU-based competitors.

13. 【2510.16715】Right Answer at the Right Time - Temporal Retrieval-Augmented Generation via Graph Summarization

链接：https://arxiv.org/abs/2510.16715

作者：Zulun Zhu,Haoyu Liu,Mengke He,Siqiang Luo

类目：Information Retrieval (cs.IR)

关键词：Question answering, knowledge graphs requires, graphs requires retrieval, temporal knowledge graphs, Question

备注：

点击查看摘要

Abstract:Question answering in temporal knowledge graphs requires retrieval that is both time-consistent and efficient. Existing RAG methods are largely semantic and typically neglect explicit temporal constraints, which leads to time-inconsistent answers and inflated token usage. We propose STAR-RAG, a temporal GraphRAG framework that relies on two key ideas: building a time-aligned rule graph and conducting propagation on this graph to narrow the search space and prioritize semantically relevant, time-consistent evidence. This design enforces temporal proximity during retrieval, reduces the candidate set of retrieval results, and lowers token consumption without sacrificing accuracy. Compared with existing temporal RAG approaches, STAR-RAG eliminates the need for heavy model training and fine-tuning, thereby reducing computational cost and significantly simplifying this http URL experiments on real-world temporal KG datasets show that our method achieves improved answer accuracy while consuming fewer tokens than strong GraphRAG baselines.

14. 【2510.16695】Resolution-Aware Retrieval Augmented Zero-Shot Forecasting

链接：https://arxiv.org/abs/2510.16695

作者：Iman Deznabi,Peeyush Kumar,Madalina Fiterau

类目：Machine Learning (cs.LG); Information Retrieval (cs.IR)

关键词：previously unseen conditions, posing a significant, Zero-shot forecasting aims, aims to predict, predict outcomes

备注：

点击查看摘要

Abstract:Zero-shot forecasting aims to predict outcomes for previously unseen conditions without direct historical data, posing a significant challenge for traditional forecasting methods. We introduce a Resolution-Aware Retrieval-Augmented Forecasting model that enhances predictive accuracy by leveraging spatial correlations and temporal frequency characteristics. By decomposing signals into different frequency components, our model employs resolution-aware retrieval, where lower-frequency components rely on broader spatial context, while higher-frequency components focus on local influences. This allows the model to dynamically retrieve relevant data and adapt to new locations with minimal historical context. Applied to microclimate forecasting, our model significantly outperforms traditional forecasting methods, numerical weather prediction models, and modern foundation time series models, achieving 71% lower MSE than HRRR and 34% lower MSE than Chronos on the ERA5 dataset. Our results highlight the effectiveness of retrieval-augmented and resolution-aware strategies, offering a scalable and data-efficient solution for zero-shot forecasting in microclimate modeling and beyond.

Subjects:

Machine Learning (cs.LG); Information Retrieval (cs.IR)

Cite as:
arXiv:2510.16695 [cs.LG]

(or
arXiv:2510.16695v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2510.16695

Focus to learn more

              arXiv-issued DOI via DataCite</p>

15. 【2510.16662】Safire: Similarity Framework for Visualization Retrieval

链接：https://arxiv.org/abs/2510.16662

作者：Huyen N. Nguyen,Nils Gehlenborg

类目：Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词：Effective visualization retrieval, Effective visualization, visualization retrieval necessitates, visualization retrieval, visualization retrieval systems

备注： To appear in IEEE VIS 2025

点击查看摘要

Abstract:Effective visualization retrieval necessitates a clear definition of similarity. Despite the growing body of work in specialized visualization retrieval systems, a systematic approach to understanding visualization similarity remains absent. We introduce the Similarity Framework for Visualization Retrieval (Safire), a conceptual model that frames visualization similarity along two dimensions: comparison criteria and representation modalities. Comparison criteria identify the aspects that make visualizations similar, which we divide into primary facets (data, visual encoding, interaction, style, metadata) and derived properties (data-centric and human-centric measures). Safire connects what to compare with how comparisons are executed through representation modalities. We categorize existing representation approaches into four groups based on their levels of information content and visualization determinism: raster image, vector image, specification, and natural language description, together guiding what is computable and comparable. We analyze several visualization retrieval systems using Safire to demonstrate its practical value in clarifying similarity considerations. Our findings reveal how particular criteria and modalities align across different use cases. Notably, the choice of representation modality is not only an implementation detail but also an important decision that shapes retrieval capabilities and limitations. Based on our analysis, we provide recommendations and discuss broader implications for multimodal learning, AI applications, and visualization reproducibility.

16. 【2510.16635】Prompt Optimization via Retrieved Reasoning Assets and Multi-Agent Analysis

链接：https://arxiv.org/abs/2510.16635

作者：Wonduk Seo,Juhyeon Lee,Junseo Koh,Hyunjin An,Jian Park,Seunghyun Lee,Haihua Chen,Yi Bu

类目：Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

关键词：Large Language Models, Language Models, Large Language, performance of Large, effective alternative

备注： Preprint

点击查看摘要

17. 【2510.16597】FRONTIER-RevRec: A Large-scale Dataset for Reviewer Recommendation

链接：https://arxiv.org/abs/2510.16597

作者：Qiyao Peng,Chen Wang,Yinghui Wang,Hongtao Liu,Xuan Guo,Wenjun Wang

类目：Information Retrieval (cs.IR)

关键词：critical task, task for enhancing, enhancing the efficiency, academic publishing workflows, publishing workflows

备注：

点击查看摘要

Abstract:Reviewer recommendation is a critical task for enhancing the efficiency of academic publishing workflows. However, research in this area has been persistently hindered by the lack of high-quality benchmark datasets, which are often limited in scale, disciplinary scope, and comparative analyses of different methodologies. To address this gap, we introduce FRONTIER-RevRec, a large-scale dataset constructed from authentic peer review records (2007-2025) from the Frontiers open-access publishing platform this https URL. The dataset contains 177941 distinct reviewers and 478379 papers across 209 journals spanning multiple disciplines including clinical medicine, biology, psychology, engineering, and social sciences. Our comprehensive evaluation on this dataset reveals that content-based methods significantly outperform collaborative filtering. This finding is explained by our structural analysis, which uncovers fundamental differences between academic recommendation and commercial domains. Notably, approaches leveraging language models are particularly effective at capturing the semantic alignment between a paper's content and a reviewer's expertise. Furthermore, our experiments identify optimal aggregation strategies to enhance the recommendation pipeline. FRONTIER-RevRec is intended to serve as a comprehensive benchmark to advance research in reviewer recommendation and facilitate the development of more effective academic peer review systems. The FRONTIER-RevRec dataset is available at: this https URL.

18. 【2510.16576】Enhancing Channel Estimation in RIS-aided Systems via Observation Matrix Design

链接：https://arxiv.org/abs/2510.16576

作者：Zijian Zhang,Mingyao Cui

类目：Information Theory (cs.IT); Information Retrieval (cs.IR); Signal Processing (eess.SP); Systems and Control (eess.SY)

关键词：Reconfigurable intelligent surfaces, dense antenna arrays, enhancing wireless communications, Reconfigurable intelligent, intelligent surfaces

备注： 5 pages, 2 figures

点击查看摘要

Abstract:Reconfigurable intelligent surfaces (RISs) have emerged as a promising technology for enhancing wireless communications through dense antenna arrays. Accurate channel estimation is critical to unlocking their full performance potential. To enhance RIS channel estimators, this paper proposes a novel observation matrix design scheme. Bayesian optimization framework is adopted to generate observation matrices that maximize the mutual information between received pilot signals and RIS channels. To solve the formulated problem efficiently, we develop an alternating Riemannian manifold optimization (ARMO) algorithm to alternately update the receiver combiners and RIS phase-shift matrices. An adaptive kernel training strategy is further introduced to iteratively refine the channel covariance matrix without requiring additional pilot resources. Simulation results demonstrate that the proposed ARMO-enhanced estimator achieves substantial gains in estimation accuracy over state-of-the-art methods.

19. 【2510.16393】Blending Learning to Rank and Dense Representations for Efficient and Effective Cascades

链接：https://arxiv.org/abs/2510.16393

作者：Franco Maria Nardini,Raffaele Perego,Nicola Tonellotto,Salvatore Trani

类目：Information Retrieval (cs.IR); Machine Learning (cs.LG); Performance (cs.PF)

关键词：ad-hoc passage retrieval, neural relevance signals, investigate the exploitation, dense neural representations, hand-crafted lexical features

备注：

点击查看摘要

Abstract:We investigate the exploitation of both lexical and neural relevance signals for ad-hoc passage retrieval. Our exploration involves a large-scale training dataset in which dense neural representations of MS-MARCO queries and passages are complemented and integrated with 253 hand-crafted lexical features extracted from the same corpus. Blending of the relevance signals from the two different groups of features is learned by a classical Learning-to-Rank (LTR) model based on a forest of decision trees. To evaluate our solution, we employ a pipelined architecture where a dense neural retriever serves as the first stage and performs a nearest-neighbor search over the neural representations of the documents. Our LTR model acts instead as the second stage that re-ranks the set of candidates retrieved by the first stage to enhance effectiveness. The results of reproducible experiments conducted with state-of-the-art dense retrievers on publicly available resources show that the proposed solution significantly enhances the end-to-end ranking performance while relatively minimally impacting efficiency. Specifically, we achieve a boost in nDCG@10 of up to 11% with an increase in average query latency of only 4.3%. This confirms the advantage of seamlessly combining two distinct families of signals that mutually contribute to retrieval effectiveness.

20. 【2510.16334】Investigating the Association Between Text-Based Indications of Foodborne Illness from Yelp Reviews and New York City Health Inspection Outcomes (2023)

链接：https://arxiv.org/abs/2510.16334

作者：Eden Shaveet,Crystal Su,Daniel Hsu,Luis Gravano

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词：gastrointestinal conditions caused, consuming contaminated food, Foodborne illnesses, illnesses are gastrointestinal, gastrointestinal conditions

备注： Presented as a poster at Data Science Day 2024

点击查看摘要

21. 【2510.16302】DTKG: Dual-Track Knowledge Graph-Verified Reasoning Framework for Multi-Hop QA

链接：https://arxiv.org/abs/2510.16302

作者：Changhao Wang,Yanfang Liu,Xinxin Fan,Anzhi Zhou,Lao Tian,Yunfeng Lu

类目：Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词：large language models, modern large language, Multi-hop reasoning, reasoning, plays a critical

备注： 13 pages, 5 figures

点击查看摘要

Abstract:Multi-hop reasoning for question answering (QA) plays a critical role in retrieval-augmented generation (RAG) for modern large language models (LLMs). The accurate answer can be obtained through retrieving relational structure of entities from knowledge graph (KG). Regarding the inherent relation-dependency and reasoning pattern, multi-hop reasoning can be in general classified into two categories: i) parallel fact-verification multi-hop reasoning question, i.e., requiring simultaneous verifications of multiple independent sub-questions; and ii) chained multi-hop reasoning questions, i.e., demanding sequential multi-step inference with intermediate conclusions serving as essential premises for subsequent reasoning. Currently, the multi-hop reasoning approaches singly employ one of two techniques: LLM response-based fact verification and KG path-based chain construction. Nevertheless, the former excels at parallel fact-verification but underperforms on chained reasoning tasks, while the latter demonstrates proficiency in chained multi-hop reasoning but suffers from redundant path retrieval when handling parallel fact-verification reasoning. These limitations deteriorate the efficiency and accuracy for multi-hop QA tasks. To address this challenge, we propose a novel dual-track KG verification and reasoning framework DTKG, which is inspired by the Dual Process Theory in cognitive science. Specifically, DTKG comprises two main stages: the Classification Stage and the Branch Processing Stage.

22. 【2510.16076】BPL: Bias-adaptive Preference Distillation Learning for Recommender System

链接：https://arxiv.org/abs/2510.16076

作者：SeongKu Kang,Jianxun Lian,Dongha Lee,Wonbin Kweon,Sanghwan Jang,Jaehyun Lee,Jindong Wang,Xing Xie,Hwanjo Yu

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词：Recommender systems suffer, Recommender systems, incompletely reveal user, test, reveal user preference

备注： \c{opyright} 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:Recommender systems suffer from biases that cause the collected feedback to incompletely reveal user preference. While debiasing learning has been extensively studied, they mostly focused on the specialized (called counterfactual) test environment simulated by random exposure of items, significantly degrading accuracy in the typical (called factual) test environment based on actual user-item interactions. In fact, each test environment highlights the benefit of a different aspect: the counterfactual test emphasizes user satisfaction in the long-terms, while the factual test focuses on predicting subsequent user behaviors on platforms. Therefore, it is desirable to have a model that performs well on both tests rather than only one. In this work, we introduce a new learning framework, called Bias-adaptive Preference distillation Learning (BPL), to gradually uncover user preferences with dual distillation strategies. These distillation strategies are designed to drive high performance in both factual and counterfactual test environments. Employing a specialized form of teacher-student distillation from a biased model, BPL retains accurate preference knowledge aligned with the collected feedback, leading to high performance in the factual test. Furthermore, through self-distillation with reliability filtering, BPL iteratively refines its knowledge throughout the training process. This enables the model to produce more accurate predictions across a broader range of user-item combinations, thereby improving performance in the counterfactual test. Comprehensive experiments validate the effectiveness of BPL in both factual and counterfactual tests. Our implementation is accessible via: this https URL.

计算机视觉

1. 【2510.17803】ConsistEdit: Highly Consistent and Precise Training-free Visual Editing

链接：https://arxiv.org/abs/2510.17803

作者：Zixin Yin,Ling-Hao Chen,Lionel Ni,Xili Dai

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Recent advances, existing generation models, efficient text-guided editing, text-guided editing capabilities, generation models

备注： SIGGRAPH Asia 2025

点击查看摘要

Abstract:Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing generation models. However, current approaches struggle to simultaneously deliver strong editing strength while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without handcraft, significantly enhancing reliability and consistency, which enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control.

2. 【2510.17800】Glyph: Scaling Context Windows via Visual-Text Compression

链接：https://arxiv.org/abs/2510.17800

作者：Jiale Cheng,Yusen Liu,Xinyu Zhang,Yulin Fei,Wenyi Hong,Ruiliang Lyu,Weihan Wang,Zhe Su,Xiaotao Gu,Xiao Liu,Yushi Bai,Jie Tang,Hongning Wang,Minlie Huang

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Large language models, Large language, increasingly rely, multi-step reasoning, Large

备注：

点击查看摘要

3. 【2510.17790】UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

链接：https://arxiv.org/abs/2510.17790

作者：Yuhao Yang,Zhen Yang,Zi-Yi Dou,Anh Nguyen,Keen You,Omar Attia,Andrew Szot,Michael Feng,Ram Ramrakhya,Alexander Toshev,Chao Huang,Yinfei Yang,Zhe Gan

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：require accurate visual, accurate visual grounding, lengthy execution chains, Multimodal agents, programmatic tool calls

备注：

点击查看摘要

4. 【2510.17783】Botany-Bot: Digital Twin Monitoring of Occluded and Underleaf Plant Structures with Gaussian Splats

链接：https://arxiv.org/abs/2510.17783

作者：Simeon Adebola,Chung Min Kim,Justin Kerr,Shuangyu Xie,Prithvi Akella,Jose Luis Susa Rincon,Eugen Solowjow,Ken Goldberg

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词：Commercial plant phenotyping, Commercial plant, segmentated Gaussian Splat, Gaussian Splat models, plant phenotyping systems

备注： 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)

点击查看摘要

Abstract:Commercial plant phenotyping systems using fixed cameras cannot perceive many plant details due to leaf occlusion. In this paper, we present Botany-Bot, a system for building detailed "annotated digital twins" of living plants using two stereo cameras, a digital turntable inside a lightbox, an industrial robot arm, and 3D segmentated Gaussian Splat models. We also present robot algorithms for manipulating leaves to take high-resolution indexable images of occluded details such as stem buds and the underside/topside of leaves. Results from experiments suggest that Botany-Bot can segment leaves with 90.8% accuracy, detect leaves with 86.2% accuracy, lift/push leaves with 77.9% accuracy, and take detailed overside/underside images with 77.3% accuracy. Code, videos, and datasets are available at this https URL.

5. 【2510.17777】SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

链接：https://arxiv.org/abs/2510.17777

作者：Samir Khaki,Junxian Guo,Jiaming Tang,Shang Yang,Yukang Chen,Konstantinos N. Plataniotis,Yao Lu,Song Han,Zhijian Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Vision Language Models, Vision Language, Language Models, high-resolution image understanding, long-video analysis

备注：

点击查看摘要

Abstract:Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks -- while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.

6. 【2510.17773】owards Explainable Skin Cancer Classification: A Dual-Network Attention Model with Lesion Segmentation and Clinical Metadata Fusion

链接：https://arxiv.org/abs/2510.17773

作者：Md. Enamul Atiq,Shaikh Anowarul Fattah

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：early detection significantly, life-threatening disease, disease where early, early detection, Dual Attention Gates

备注： 15 pages, 7 Figures, 3 Tables

点击查看摘要

Abstract:Skin cancer is a life-threatening disease where early detection significantly improves patient outcomes. Automated diagnosis from dermoscopic images is challenging due to high intra-class variability and subtle inter-class differences. Many deep learning models operate as "black boxes," limiting clinical trust. In this work, we propose a dual-encoder attention-based framework that leverages both segmented lesions and clinical metadata to enhance skin lesion classification in terms of both accuracy and interpretability. A novel Deep-UNet architecture with Dual Attention Gates (DAG) and Atrous Spatial Pyramid Pooling (ASPP) is first employed to segment lesions. The classification stage uses two DenseNet201 encoders-one on the original image and another on the segmented lesion whose features are fused via multi-head cross-attention. This dual-input design guides the model to focus on salient pathological regions. In addition, a transformer-based module incorporates patient metadata (age, sex, lesion site) into the prediction. We evaluate our approach on the HAM10000 dataset and the ISIC 2018 and 2019 challenges. The proposed method achieves state-of-the-art segmentation performance and significantly improves classification accuracy and average AUC compared to baseline models. To validate our model's reliability, we use Gradient-weighted Class Activation Mapping (Grad-CAM) to generate heatmaps. These visualizations confirm that our model's predictions are based on the lesion area, unlike models that rely on spurious background features. These results demonstrate that integrating precise lesion segmentation and clinical data with attention-based fusion leads to a more accurate and interpretable skin cancer classification model.

7. 【2510.17771】Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs

链接：https://arxiv.org/abs/2510.17771

作者：Zhining Liu,Ziyi Chen,Hui Liu,Chen Luo,Xianfeng Tang,Suhang Wang,Joy Zeng,Zhenwei Dai,Zhan Shi,Tianxin Wei,Benoit Dumoulin,Hanghang Tong

类目：Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：Vision-Language Models, visual question answering, achieve strong results, achieve strong, question answering

备注： 21 pages, 10 figures, 6 tables

点击查看摘要

Abstract:Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence when outputting incorrect answers, a phenomenon we term ``seeing but not believing'' that widely exists in major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it, making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.

8. 【2510.17759】VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

链接：https://arxiv.org/abs/2510.17759

作者：Qilin Liao,Anamika Lochab,Ruqi Zhang

类目：Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：extend large language, large language models, extend large, visual reasoning, large language

备注： 18 pages, 7 Figures,

点击查看摘要

9. 【2510.17739】Joint Multi-Condition Representation Modelling via Matrix Factorisation for Visual Place Recognition

链接：https://arxiv.org/abs/2510.17739

作者：Timur Ismagilov,Shakaiba Majeed,Michael Milford,Tan Viet Tuyen Nguyen,Sarvapali D. Ramchurn,Shoaib Ehsan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：improve localisation performance, visual place recognition, reference sets captured, address multi-reference visual, localisation performance

备注： 13 pages

点击查看摘要

Abstract:We address multi-reference visual place recognition (VPR), where reference sets captured under varying conditions are used to improve localisation performance. While deep learning with large-scale training improves robustness, increasing data diversity and model complexity incur extensive computational cost during training and deployment. Descriptor-level fusion via voting or aggregation avoids training, but often targets multi-sensor setups or relies on heuristics with limited gains under appearance and viewpoint change. We propose a training-free, descriptor-agnostic approach that jointly models places using multiple reference descriptors via matrix decomposition into basis representations, enabling projection-based residual matching. We also introduce SotonMV, a structured benchmark for multi-viewpoint VPR. On multi-appearance data, our method improves Recall@1 by up to ~18% over single-reference and outperforms multi-reference baselines across appearance and viewpoint changes, with gains of ~5% on unstructured data, demonstrating strong generalisation while remaining lightweight.

10. 【2510.17731】Can Image-To-Video Models Simulate Pedestrian Dynamics?

链接：https://arxiv.org/abs/2510.17731

作者：Aaron Appelle,Jerome P. Lynch

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：scale video datasets, Recent high-performing, displayed remarkable inherent, remarkable inherent world-modeling, inherent world-modeling capabilities

备注： Appeared in the ICML 2025 Workshop on Building Physically Plausible World Models, July 2025, [this https URL](https://physical-world-modeling.github.io/)

点击查看摘要

Abstract:Recent high-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable inherent world-modeling capabilities by virtue of training on large scale video datasets. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes. Our framework conditions I2V models on keyframes extracted from pedestrian trajectory benchmarks, then evaluates their trajectory prediction performance using quantitative measures of pedestrian dynamics.

11. 【2510.17724】Signature Forgery Detection: Improving Cross-Dataset Generalization

链接：https://arxiv.org/abs/2510.17724

作者：Matheus Ramos Parracho

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：critical biometric technique, Automated signature verification, identity authentication, Automated signature, legal documentation

备注： Undergraduate thesis (preprint)---submitted to Escola Politécnica, Universidade Federal do Rio de Janeiro (POLI/UFRJ). The final version will include official signatures and defense approval

点击查看摘要

Abstract:Automated signature verification is a critical biometric technique used in banking, identity authentication, and legal documentation. Despite the notable progress achieved by deep learning methods, most approaches in offline signature verification still struggle to generalize across datasets, as variations in handwriting styles and acquisition protocols often degrade performance. This study investigates feature learning strategies for signature forgery detection, focusing on improving cross-dataset generalization -- that is, model robustness when trained on one dataset and tested on another. Using three public benchmarks -- CEDAR, ICDAR, and GPDS Synthetic -- two experimental pipelines were developed: one based on raw signature images and another employing a preprocessing method referred to as shell preprocessing. Several behavioral patterns were identified and analyzed; however, no definitive superiority between the two approaches was established. The results show that the raw-image model achieved higher performance across benchmarks, while the shell-based model demonstrated promising potential for future refinement toward robust, cross-domain signature verification.

12. 【2510.17722】MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

链接：https://arxiv.org/abs/2510.17722

作者：Yaning Pan,Zekun Wang,Qianqian Xie,Yongqian Wen,Yuanxing Zhang,Guohui Zhang,Haoxuan Hu,Zhiyu Pan,Yibing Huang,Zhidong Gan,Yonghong Lin,An Ping,Tianhao Peng,Jiaheng Liu

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language

备注： Project Website: [this https URL](https://github.com/NJU-LINK/MT-Video-Bench)

点击查看摘要

Abstract:The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.

13. 【2510.17719】Raindrop GS: A Benchmark for 3D Gaussian Splatting under Raindrop Conditions

链接：https://arxiv.org/abs/2510.17719

作者：Zhiqiang Teng,Beibei Lin,Tingting Chen,Zifeng Yuan,Xuanyi Li,Xuanyu Zhang,Shunli Zhang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Gaussian Splatting, substantially degrading reconstruction, optical distortions caused, substantially degrading, suffers from severe

备注：

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) under raindrop conditions suffers from severe occlusions and optical distortions caused by raindrop contamination on the camera lens, substantially degrading reconstruction quality. Existing benchmarks typically evaluate 3DGS using synthetic raindrop images with known camera poses (constrained images), assuming ideal conditions. However, in real-world scenarios, raindrops often interfere with accurate camera pose estimation and point cloud initialization. Moreover, a significant domain gap between synthetic and real raindrops further impairs generalization. To tackle these issues, we introduce RaindropGS, a comprehensive benchmark designed to evaluate the full 3DGS pipeline-from unconstrained, raindrop-corrupted images to clear 3DGS reconstructions. Specifically, the whole benchmark pipeline consists of three parts: data preparation, data processing, and raindrop-aware 3DGS evaluation, including types of raindrop interference, camera pose estimation and point cloud initialization, single image rain removal comparison, and 3D Gaussian training comparison. First, we collect a real-world raindrop reconstruction dataset, in which each scene contains three aligned image sets: raindrop-focused, background-focused, and rain-free ground truth, enabling a comprehensive evaluation of reconstruction quality under different focus conditions. Through comprehensive experiments and analyses, we reveal critical insights into the performance limitations of existing 3DGS methods on unconstrained raindrop images and the varying impact of different pipeline components: the impact of camera focus position on 3DGS reconstruction performance, and the interference caused by inaccurate pose and point cloud initialization on reconstruction. These insights establish clear directions for developing more robust 3DGS methods under raindrop conditions.

14. 【2510.17716】Automatic Classification of Circulating Blood Cell Clusters based on Multi-channel Flow Cytometry Imaging

链接：https://arxiv.org/abs/2510.17716

作者：Suqiang Ma,Subhadeep Sengupta,Yao Lee,Beikang Gu,Xianyan Chen,Xianqiao Wang,Yang Liu,Mengjia Xu,Galit H. Frydman,He Li

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Circulating blood cell, significant biomarkers linked, cell, cell clusters, Circulating blood

备注：

点击查看摘要

Abstract:Circulating blood cell clusters (CCCs) containing red blood cells (RBCs), white blood cells(WBCs), and platelets are significant biomarkers linked to conditions like thrombosis, infection, and inflammation. Flow cytometry, paired with fluorescence staining, is commonly used to analyze these cell clusters, revealing cell morphology and protein profiles. While computational approaches based on machine learning have advanced the automatic analysis of single-cell flow cytometry images, there is a lack of effort to build tools to automatically analyze images containing CCCs. Unlike single cells, cell clusters often exhibit irregular shapes and sizes. In addition, these cell clusters often consist of heterogeneous cell types, which require multi-channel staining to identify the specific cell types within the clusters. This study introduces a new computational framework for analyzing CCC images and identifying cell types within clusters. Our framework uses a two-step analysis strategy. First, it categorizes images into cell cluster and non-cluster groups by fine-tuning the You Only Look Once(YOLOv11) model, which outperforms traditional convolutional neural networks (CNNs), Vision Transformers (ViT). Then, it identifies cell types by overlaying cluster contours with regions from multi-channel fluorescence stains, enhancing accuracy despite cell debris and staining artifacts. This approach achieved over 95% accuracy in both cluster classification and phenotype identification. In summary, our automated framework effectively analyzes CCC images from flow cytometry, leveraging both bright-field and fluorescence data. Initially tested on blood cells, it holds potential for broader applications, such as analyzing immune and tumor cell clusters, supporting cellular research across various diseases.

15. 【2510.17703】Improving Cross-Patient Generalization in Parkinson's Disease Detection through Chunk-Based Analysis of Hand-Drawn Patterns

链接：https://arxiv.org/abs/2510.17703

作者：Mhd Adnan Albani,Riad Sonbol

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：causing motor impairments, impede hand coordination, hand coordination activities, Parkinson disease, neurodegenerative disease affecting

备注： 19 pages, 2 figures, 9 tables

点击查看摘要

Abstract:Parkinson's disease (PD) is a neurodegenerative disease affecting about 1% of people over the age of 60, causing motor impairments that impede hand coordination activities such as writing and drawing. Many approaches have tried to support early detection of Parkinson's disease based on hand-drawn images; however, we identified two major limitations in the related works: (1) the lack of sufficient datasets, (2) the robustness when dealing with unseen patient data. In this paper, we propose a new approach to detect Parkinson's disease that consists of two stages: The first stage classifies based on their drawing type(circle, meander, spiral), and the second stage extracts the required features from the images and detects Parkinson's disease. We overcame the previous two limitations by applying a chunking strategy where we divide each image into 2x2 chunks. Each chunk is processed separately when extracting features and recognizing Parkinson's disease indicators. To make the final classification, an ensemble method is used to merge the decisions made from each chunk. Our evaluation shows that our proposed approach outperforms the top performing state-of-the-art approaches, in particular on unseen patients. On the NewHandPD dataset our approach, it achieved 97.08% accuracy for seen patients and 94.91% for unseen patients, our proposed approach maintained a gap of only 2.17 percentage points, compared to the 4.76-point drop observed in prior work.

16. 【2510.17700】Elastic ViTs from Pretrained Models without Retraining

链接：https://arxiv.org/abs/2510.17700

作者：Walter Simoncini,Michael Dorkenwald,Tijmen Blankevoort,Cees G.M. Snoek,Yuki M. Asano

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：forcing sub-optimal deployment, sub-optimal deployment choices, Vision foundation models, Vision Transformers, foundation models achieve

备注： Accepted at NeurIPS 2025

点击查看摘要

Abstract:Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm, does not require labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeIT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: this https URL

17. 【2510.17699】GAS: Improving Discretization of Diffusion ODEs via Generalized Adversarial Solver

链接：https://arxiv.org/abs/2510.17699

作者：Aleksandr Oganov,Ilya Bykov,Eva Neudachina,Mishan Aliev,Alexander Tolmachev,Alexander Sidorov,Aleksandr Zuev,Andrey Okhotin,Denis Rakitin,Aibek Alanov

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：diffusion models achieve, computationally expensive sampling, models achieve, suffer from computationally, computationally expensive

备注：

点击查看摘要

Abstract:While diffusion models achieve state-of-the-art generation quality, they still suffer from computationally expensive sampling. Recent works address this issue with gradient-based optimization methods that distill a few-step ODE diffusion solver from the full sampling process, reducing the number of function evaluations from dozens to just a few. However, these approaches often rely on intricate training techniques and do not explicitly focus on preserving fine-grained details. In this paper, we introduce the Generalized Solver: a simple parameterization of the ODE sampler that does not require additional training tricks and improves quality over existing approaches. We further combine the original distillation loss with adversarial training, which mitigates artifacts and enhances detail fidelity. We call the resulting method the Generalized Adversarial Solver and demonstrate its superior performance compared to existing solver training methods under similar resource constraints. Code is available at this https URL.

18. 【2510.17686】owards 3D Objectness Learning in an Open World

链接：https://arxiv.org/abs/2510.17686

作者：Taichi Liu,Zhenyu Wang,Ruofeng Liu,Guang Wang,Desheng Zhang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：made significant progress, objectness remains insufficient, Recent advancements, category detection, significant progress

备注： Accepted by NeurIPS 2025

点击查看摘要

Abstract:Recent advancements in 3D object detection and novel category detection have made significant progress, yet research on learning generalized 3D objectness remains insufficient. In this paper, we delve into learning open-world 3D objectness, which focuses on detecting all objects in a 3D scene, including novel objects unseen during training. Traditional closed-set 3D detectors struggle to generalize to open-world scenarios, while directly incorporating 3D open-vocabulary models for open-world ability struggles with vocabulary expansion and semantic overlap. To achieve generalized 3D object discovery, We propose OP3Det, a class-agnostic Open-World Prompt-free 3D Detector to detect any objects within 3D scenes without relying on hand-crafted text prompts. We introduce the strong generalization and zero-shot capabilities of 2D foundation models, utilizing both 2D semantic priors and 3D geometric priors for class-agnostic proposals to broaden 3D object discovery. Then, by integrating complementary information from point cloud and RGB image in the cross-modal mixture of experts, OP3Det dynamically routes uni-modal and multi-modal features to learn generalized 3D objectness. Extensive experiments demonstrate the extraordinary performance of OP3Det, which significantly surpasses existing open-world 3D detectors by up to 16.0% in AR and achieves a 13.5% improvement compared to closed-world 3D detectors.

19. 【2510.17685】Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning

链接：https://arxiv.org/abs/2510.17685

作者：Min Cao,Xinyu Zhou,Ding Jiang,Bo Du,Mang Ye,Min Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：person retrieval, aims to identify, textual descriptions, facing challenge, target person

备注： Final version published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Xplore link: [this https URL](https://ieeexplore.ieee.org/document/11199360)

点击查看摘要

Abstract:Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions, facing challenge in modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets. Data and code are presented in this https URL.

20. 【2510.17684】Intelligent Communication Mixture-of-Experts Boosted-Medical Image Segmentation Foundation Model

链接：https://arxiv.org/abs/2510.17684

作者：Xinwei Zhang,Hu Chen,Zhe Yuan,Sukun Tian,Peng Feng

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：achieved remarkable performance, image segmentation foundation, medical image segmentation, image segmentation, medical image

备注：

点击查看摘要

Abstract:Foundation models for medical image segmentation have achieved remarkable performance. Adaptive fine-tuning of natural image segmentation foundation models is crucial for medical image segmentation tasks. However, some limitations exist in existing fine-tuning methods: 1) insufficient representation of high-level features and 2) the fine-tuning process disrupts the structural integrity of pretrained weights. Inspired by these critical problems, we propose an intelligent communication mixture-of-experts boosted-medical image segmentation foundation model, named IC-MoE, with twofold ideas: 1) We construct basic experts, semantic experts, and adaptive experts. Moreover, we implement a pixel probability adaptive voting strategy, which enables expert selection and fusion through label consistency and load balancing. This approach preliminarily enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. 2) We propose a semantic-guided contrastive learning method to address the issue of weak supervision in contrastive learning. This method further enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. Extensive experiments across three public medical image segmentation datasets demonstrate that the IC-MoE outperforms other SOTA models. Consequently, the proposed IC-MoE effectively supplements foundational medical image segmentation models with high-level features and pretrained structural integrity. We also validate the superior generalizability of the IC-MoE across diverse medical image segmentation scenarios.

21. 【2510.17681】PICABench: How Far Are We from Physically Realistic Image Editing?

链接：https://arxiv.org/abs/2510.17681

作者：Yuandong Pu,Le Zhuo,Songhao Han,Jinbo Xing,Kaiwen Zhu,Shuo Cao,Bin Fu,Si Liu,Hongsheng Li,Yu Qiao,Wenlong Zhang,Xi Chen,Yihao Liu

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：remarkable progress recently, achieved remarkable progress, progress recently, achieved remarkable, remarkable progress

备注：

点击查看摘要

Abstract:Image editing has achieved remarkable progress recently. Modern editing models could already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are the key to the generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimension (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose the PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with large rooms to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.

22. 【2510.17664】4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads

链接：https://arxiv.org/abs/2510.17664

作者：Ling Liu,Jun Tian,Li Yi

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：evacuating dense crowds, constrained time budget, highly dynamic environments, budget is essential, setting is critical

备注：

点击查看摘要

Abstract:4D panoptic segmentation in a streaming setting is critical for highly dynamic environments, such as evacuating dense crowds and autonomous driving in complex scenarios, where real-time, fine-grained perception within a constrained time budget is essential. In this paper, we introduce 4DSegStreamer, a novel framework that employs a Dual-Thread System to efficiently process streaming frames. The framework is general and can be seamlessly integrated into existing 3D and 4D segmentation methods to enable real-time capability. It also demonstrates superior robustness compared to existing streaming perception approaches, particularly under high FPS conditions. The system consists of a predictive thread and an inference thread. The predictive thread leverages historical motion and geometric information to extract features and forecast future dynamics. The inference thread ensures timely prediction for incoming frames by aligning with the latest memory and compensating for ego-motion and dynamic object movements. We evaluate 4DSegStreamer on the indoor HOI4D dataset and the outdoor SemanticKITTI and nuScenes datasets. Comprehensive experiments demonstrate the effectiveness of our approach, particularly in accurately predicting dynamic objects in complex scenes.

23. 【2510.17651】Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs

链接：https://arxiv.org/abs/2510.17651

作者：Sébastien Thuau,Siba Haidar,Ayush Bajracharya,Rachid Chelouah

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：convolutional neural network, frugal federated learning, examine frugal federated, complementary strategies, convolutional neural

备注： 7 pages, 1 figure, FLTA 2025

点击查看摘要

Abstract:We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings. Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation(LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO$_2$ emissions across training and inference, and analyze sustainability trade-offs for deployment. To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.

24. 【2510.17650】ZACH-ViT: A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification

链接：https://arxiv.org/abs/2510.17650

作者：Athanasios Angelakis,Amne Mousa,Micah L. A. Heldeweg,Laurens A. Biesheuvel,Mark A. Haaksma,Jasper M. Smit,Pieter R. Tuinman,Paul W. G. Elbers

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词：Differentiating cardiogenic pulmonary, interstitial lung disease, structurally normal lungs, cardiogenic pulmonary oedema, remains challenging due

备注： 14 pages, 6 figures, 2 tables. Primary subject: cs.LG (Machine Learning) Cross-listed to: cs.CV (Computer Vision and Pattern Recognition), eess.IV (Image and Video Processing). Code available at: [this https URL](https://github.com/Bluesman79/ZACH-ViT) Installation: pip install zachvit Paper licensed under CC BY-NC-ND 4.0. Code released under Apache 2.0 License

点击查看摘要

Abstract:Differentiating cardiogenic pulmonary oedema (CPE) from non-cardiogenic and structurally normal lungs in lung ultrasound (LUS) videos remains challenging due to the high visual variability of non-cardiogenic inflammatory patterns (NCIP/ARDS-like), interstitial lung disease, and healthy lungs. This heterogeneity complicates automated classification as overlapping B-lines and pleural artefacts are common. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a 0.25 M-parameter Vision Transformer variant that removes both positional embeddings and the [CLS] token, making it fully permutation-invariant and suitable for unordered medical image data. To enhance generalization, we propose ShuffleStrides Data Augmentation (SSDA), which permutes probe-view sequences and frame orders while preserving anatomical validity. ZACH-ViT was evaluated on 380 LUS videos from 95 critically ill patients against nine state-of-the-art baselines. Despite the heterogeneity of the non-cardiogenic group, ZACH-ViT achieved the highest validation and test ROC-AUC (0.80 and 0.79) with balanced sensitivity (0.60) and specificity (0.91), while all competing models collapsed to trivial classification. It trains 1.35x faster than Minimal ViT (0.62M parameters) with 2.5x fewer parameters, supporting real-time clinical deployment. These results show that aligning architectural design with data structure can outperform scale in small-data medical imaging.

25. 【2510.17644】Self-supervised Pre-training for Mapping of Archaeological Stone Wall in Historic Landscapes Using High-Resolution DEM Derivatives

链接：https://arxiv.org/abs/2510.17644

作者：Zexian Huang,Mashnoon Islam,Brian Armstrong,Kourosh Khoshelham,Martin Tomko

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：walls hold significant, hold significant heritage, hold significant, Dry-stone walls hold, Dry-stone walls

备注：

点击查看摘要

Abstract:Dry-stone walls hold significant heritage and environmental value. Mapping these structures is essential for ecosystem preservation and wildfire management in Australia. Yet, many walls remain unidentified due to their inaccessibility and the high cost of manual mapping. Deep learning-based segmentation offers a scalable solution, but two major challenges persist: (1) visual occlusion of low-lying walls by dense vegetation, and (2) limited labeled data for supervised training. We propose DINO-CV, a segmentation framework for automatic mapping of low-lying dry-stone walls using high-resolution Airborne LiDAR-derived digital elevation models (DEMs). DEMs overcome visual occlusion by capturing terrain structures hidden beneath vegetation, enabling analysis of structural rather than spectral cues. DINO-CV introduces a self-supervised cross-view pre-training strategy based on knowledge distillation to mitigate data scarcity. It learns invariant visual and geometric representations across multiple DEM derivatives, supporting various vision backbones including ResNet, Wide ResNet, and Vision Transformers. Applied to the UNESCO World Heritage cultural landscape of Budj Bim, Victoria, the method identifies one of Australia's densest collections of colonial dry-stone walls beyond Indigenous heritage contexts. DINO-CV achieves a mean Intersection over Union (mIoU) of 68.6% on test areas and maintains 63.8% mIoU when fine-tuned with only 10% labeled data. These results demonstrate the potential of self-supervised learning on high-resolution DEM derivatives for automated dry-stone wall mapping in vegetated and heritage-rich environments with scarce annotations.

26. 【2510.17626】CaMiT: A Time-Aware Car Model Dataset for Classification and Generation

链接：https://arxiv.org/abs/2510.17626

作者：Frédéric LIN,Biruk Abere Ambaw,Adrian Popescu,Hejer Ammar,Romaric Audigier,Hervé Le Borgne(Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France)

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：object appearances change, systems must adapt, domains where object, object appearances, appearances change

备注： To be published in NeurIPS 2025 Track on Datasets and Benchmarks

点击查看摘要

Abstract:AI systems must adapt to evolving visual environments, especially in domains where object appearances change over time. We introduce Car Models in Time (CaMiT), a fine-grained dataset capturing the temporal evolution of car models, a representative class of technological artifacts. CaMiT includes 787K labeled samples of 190 car models (2007-2023) and 5.1M unlabeled samples (2005-2023), supporting both supervised and self-supervised learning. Static pretraining on in-domain data achieves competitive performance with large-scale generalist models while being more resource-efficient, yet accuracy declines when models are tested across years. To address this, we propose a time-incremental classification setting, a realistic continual learning scenario with emerging, evolving, and disappearing classes. We evaluate two strategies: time-incremental pretraining, which updates the backbone, and time-incremental classifier learning, which updates only the final layer, both improving temporal robustness. Finally, we explore time-aware image generation that leverages temporal metadata during training, yielding more realistic outputs. CaMiT offers a rich benchmark for studying temporal adaptation in fine-grained visual recognition and generation.

27. 【2510.17617】ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input

链接：https://arxiv.org/abs/2510.17617

作者：Hendric Voss,Stefan Kopp

类目：Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

关键词：manifold communicative functions, serve manifold communicative, expressive nonverbal cues, nonverbal cues, serve manifold

备注：

点击查看摘要

Abstract:Human communication combines speech with expressive nonverbal cues such as hand gestures that serve manifold communicative functions. Yet, current generative gesture generation approaches are restricted to simple, repetitive beat gestures that accompany the rhythm of speaking but do not contribute to communicating semantic meaning. This paper tackles a core challenge in co-speech gesture synthesis: generating iconic or deictic gestures that are semantically coherent with a verbal utterance. Such gestures cannot be derived from language input alone, which inherently lacks the visual meaning that is often carried autonomously by gestures. We therefore introduce a zero-shot system that generates gestures from a given language input and additionally is informed by imagistic input, without manual annotation or human intervention. Our method integrates an image analysis pipeline that extracts key object properties such as shape, symmetry, and alignment, together with a semantic matching module that links these visual details to spoken text. An inverse kinematics engine then synthesizes iconic and deictic gestures and combines them with co-generated natural beat gestures for coherent multimodal communication. A comprehensive user study demonstrates the effectiveness of our approach. In scenarios where speech alone was ambiguous, gestures generated by our system significantly improved participants' ability to identify object properties, confirming their interpretability and communicative value. While challenges remain in representing complex shapes, our results highlight the importance of context-aware semantic gestures for creating expressive and collaborative virtual agents or avatars, marking a substantial step forward towards efficient and robust, embodied human-agent interaction. More information and example videos are available here: this https URL

28. 【2510.17611】One Dinomaly2 Detect Them All: A Unified Framework for Full-Spectrum Unsupervised Anomaly Detection

链接：https://arxiv.org/abs/2510.17611

作者：Jia Guo,Shuai Lu,Lei Fan,Zelin Li,Donglin Di,Yang Song,Weihang Zhang,Wenbing Zhu,Hong Yan,Fang Chen,Huiqi Li,Hongen Liao

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：models significantly underperform, Unsupervised anomaly detection, Unsupervised anomaly, existing multi-class models, multi-class models significantly

备注： Extended version of CVPR2025

点击查看摘要

Abstract:Unsupervised anomaly detection (UAD) has evolved from building specialized single-class models to unified multi-class models, yet existing multi-class models significantly underperform the most advanced one-for-one counterparts. Moreover, the field has fragmented into specialized methods tailored to specific scenarios (multi-class, 3D, few-shot, etc.), creating deployment barriers and highlighting the need for a unified solution. In this paper, we present Dinomaly2, the first unified framework for full-spectrum image UAD, which bridges the performance gap in multi-class models while seamlessly extending across diverse data modalities and task settings. Guided by the "less is more" philosophy, we demonstrate that the orchestration of five simple element achieves superior performance in a standard reconstruction-based framework. This methodological minimalism enables natural extension across diverse tasks without modification, establishing that simplicity is the foundation of true universality. Extensive experiments on 12 UAD benchmarks demonstrate Dinomaly2's full-spectrum superiority across multiple modalities (2D, multi-view, RGB-3D, RGB-IR), task settings (single-class, multi-class, inference-unified multi-class, few-shot) and application domains (industrial, biological, outdoor). For example, our multi-class model achieves unprecedented 99.9% and 99.3% image-level (I-) AUROC on MVTec-AD and VisA respectively. For multi-view and multi-modal inspection, Dinomaly2 demonstrates state-of-the-art performance with minimum adaptations. Moreover, using only 8 normal examples per class, our method surpasses previous full-shot models, achieving 98.7% and 97.4% I-AUROC on MVTec-AD and VisA. The combination of minimalistic design, computational scalability, and universal applicability positions Dinomaly2 as a unified solution for the full spectrum of real-world anomaly detection applications.

29. 【2510.17609】Integrating BIM and UAV-based photogrammetry for Automated 3D Structure Model Segmentation

链接：https://arxiv.org/abs/2510.17609

作者：Siqi Chen,Shanyue Guan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：enabled efficient, non-contact structural health, technology has enabled, UAV technology, Building Information Modeling

备注：

点击查看摘要

Abstract:The advancement of UAV technology has enabled efficient, non-contact structural health monitoring. Combined with photogrammetry, UAVs can capture high-resolution scans and reconstruct detailed 3D models of infrastructure. However, a key challenge remains in segmenting specific structural components from these models-a process traditionally reliant on time-consuming and error-prone manual labeling. To address this issue, we propose a machine learning-based framework for automated segmentation of 3D point clouds. Our approach uses the complementary strengths of real-world UAV-scanned point clouds and synthetic data generated from Building Information Modeling (BIM) to overcome the limitations associated with manual labeling. Validation on a railroad track dataset demonstrated high accuracy in identifying and segmenting major components such as rails and crossties. Moreover, by using smaller-scale datasets supplemented with BIM data, the framework significantly reduced training time while maintaining reasonable segmentation accuracy. This automated approach improves the precision and efficiency of 3D infrastructure model segmentation and advances the integration of UAV and BIM technologies in structural health monitoring and infrastructure management.

30. 【2510.17603】ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling

链接：https://arxiv.org/abs/2510.17603

作者：Shuyuan Zhang,Chenhan Jiang,Zuoou Li,Jiankang Deng

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：reduce expert manual, language offers significant, expert manual modeling, manual modeling efforts, offers significant potential

备注： NeurIPS 2025 Poster

点击查看摘要

Abstract:3D generation from natural language offers significant potential to reduce expert manual modeling efforts and enhance accessibility to 3D assets. However, existing methods often yield unstructured meshes and exhibit poor interactivity, making them impractical for artistic workflows. To address these limitations, we represent 3D assets as shape programs and introduce ShapeCraft, a novel multi-agent framework for text-to-3D generation. At its core, we propose a Graph-based Procedural Shape (GPS) representation that decomposes complex natural language into a structured graph of sub-tasks, thereby facilitating accurate LLM comprehension and interpretation of spatial relationships and semantic shape details. Specifically, LLM agents hierarchically parse user input to initialize GPS, then iteratively refine procedural modeling and painting to produce structured, textured, and interactive 3D assets. Qualitative and quantitative experiments demonstrate ShapeCraft's superior performance in generating geometrically accurate and semantically rich 3D assets compared to existing LLM-based agents. We further show the versatility of ShapeCraft through examples of animated and user-customized editing, highlighting its potential for broader interactive applications.

31. 【2510.17599】Conveying Meaning through Gestures: An Investigation into Semantic Co-Speech Gesture Generation

链接：https://arxiv.org/abs/2510.17599

作者：Hendric Voss,Lisa Michelle Bohnenkamp,Stefan Kopp

类目：Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

关键词：semantically-augmented variant, resulting movements, study explores, evaluate their ability, ability to convey

备注：

点击查看摘要

Abstract:This study explores two frameworks for co-speech gesture generation, AQ-GT and its semantically-augmented variant AQ-GT-a, to evaluate their ability to convey meaning through gestures and how humans perceive the resulting movements. Using sentences from the SAGA spatial communication corpus, contextually similar sentences, and novel movement-focused sentences, we conducted a user-centered evaluation of concept recognition and human-likeness. Results revealed a nuanced relationship between semantic annotations and performance. The original AQ-GT framework, lacking explicit semantic input, was surprisingly more effective at conveying concepts within its training domain. Conversely, the AQ-GT-a framework demonstrated better generalization, particularly for representing shape and size in novel contexts. While participants rated gestures from AQ-GT-a as more expressive and helpful, they did not perceive them as more human-like. These findings suggest that explicit semantic enrichment does not guarantee improved gesture generation and that its effectiveness is highly dependent on the context, indicating a potential trade-off between specialization and generalization.

32. 【2510.17590】MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning

链接：https://arxiv.org/abs/2510.17590

作者：Mir Nafis Sharear Shopnil,Sharad Duwal,Abhishek Tyagi,Adiba Mahbub Proma

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词：overwhelming manual fact-checking, manual fact-checking capacity, daily multimodal posts, overwhelming manual, fact-checking capacity

备注： 16 pages, 3 tables, 1 figure

点击查看摘要

33. 【2510.17585】Expose Camouflage in the Water: Underwater Camouflaged Instance Segmentation and Dataset

链接：https://arxiv.org/abs/2510.17585

作者：Chuhong Wang,Hua Li,Chongyi Li,Huazhong Liu,Xiongxin Tang,Sam Kwong

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：camouflaged instance segmentation, underwater camouflaged instance, camouflaged instance, underwater vision tasks, instance segmentation

备注：

点击查看摘要

Abstract:With the development of underwater exploration and marine protection, underwater vision tasks are widespread. Due to the degraded underwater environment, characterized by color distortion, low contrast, and blurring, camouflaged instance segmentation (CIS) faces greater challenges in accurately segmenting objects that blend closely with their surroundings. Traditional camouflaged instance segmentation methods, trained on terrestrial-dominated datasets with limited underwater samples, may exhibit inadequate performance in underwater scenes. To address these issues, we introduce the first underwater camouflaged instance segmentation (UCIS) dataset, abbreviated as UCIS4K, which comprises 3,953 images of camouflaged marine organisms with instance-level annotations. In addition, we propose an Underwater Camouflaged Instance Segmentation network based on Segment Anything Model (UCIS-SAM). Our UCIS-SAM includes three key modules. First, the Channel Balance Optimization Module (CBOM) enhances channel characteristics to improve underwater feature learning, effectively addressing the model's limited understanding of underwater environments. Second, the Frequency Domain True Integration Module (FDTIM) is proposed to emphasize intrinsic object features and reduce interference from camouflage patterns, enhancing the segmentation performance of camouflaged objects blending with their surroundings. Finally, the Multi-scale Feature Frequency Aggregation Module (MFFAM) is designed to strengthen the boundaries of low-contrast camouflaged instances across multiple frequency bands, improving the model's ability to achieve more precise segmentation of camouflaged objects. Extensive experiments on the proposed UCIS4K and public benchmarks show that our UCIS-SAM outperforms state-of-the-art approaches.

34. 【2510.17568】PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception

链接：https://arxiv.org/abs/2510.17568

作者：Kaichen Zhou,Yuhan Wang,Grace Chen,Xinhai Chang,Gaspard Beaudouin,Fangneng Zhan,Paul Pu Liang,Mengyu Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Geometry Grounded Transformer, Visual Geometry Grounded, Grounded Transformer, shown strong capability, Visual Geometry

备注：

点击查看摘要

Abstract:Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feedforward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction -- all without post-processing. A central challenge in multi-task 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask -- suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.

35. 【2510.17566】WP-CrackNet: A Collaborative Adversarial Learning Framework for End-to-End Weakly-Supervised Road Crack Detection

链接：https://arxiv.org/abs/2510.17566

作者：Nachuan Ma,Zhengfei Song,Qiang Hu,Xiaoyu Tang,Chengxi Zhang,Rui Fan,Lihua Xie

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：intelligent infrastructure maintenance, Road crack detection, smart cities, essential for intelligent, intelligent infrastructure

备注：

点击查看摘要

Abstract:Road crack detection is essential for intelligent infrastructure maintenance in smart cities. To reduce reliance on costly pixel-level annotations, we propose WP-CrackNet, an end-to-end weakly-supervised method that trains with only image-level labels for pixel-wise crack detection. WP-CrackNet integrates three components: a classifier generating class activation maps (CAMs), a reconstructor measuring feature inferability, and a detector producing pixel-wise road crack detection results. During training, the classifier and reconstructor alternate in adversarial learning to encourage crack CAMs to cover complete crack regions, while the detector learns from pseudo labels derived from post-processed crack CAMs. This mutual feedback among the three components improves learning stability and detection accuracy. To further boost detection performance, we design a path-aware attention module (PAAM) that fuses high-level semantics from the classifier with low-level structural cues from the reconstructor by modeling spatial and channel-wise dependencies. Additionally, a center-enhanced CAM consistency module (CECCM) is proposed to refine crack CAMs using center Gaussian weighting and consistency constraints, enabling better pseudo-label generation. We create three image-level datasets and extensive experiments show that WP-CrackNet achieves comparable results to supervised methods and outperforms existing weakly-supervised methods, significantly advancing scalable road inspection. The source code package and datasets are available at this https URL.

36. 【2510.17529】MambaX-Net: Dual-Input Mamba-Enhanced Cross-Attention Network for Longitudinal MRI Segmentation

链接：https://arxiv.org/abs/2510.17529

作者：Yovin Yahathugoda,Davide Prezzi,Piyalitt Ittichaiwong,Vicky Goh,Sebastien Ourselin,Michela Antonelli

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Active Surveillance, monitoring disease progression, intermediate-risk prostate cancer, aiming to avoid, clinical follow-up

备注：

点击查看摘要

Abstract:Active Surveillance (AS) is a treatment option for managing low and intermediate-risk prostate cancer (PCa), aiming to avoid overtreatment while monitoring disease progression through serial MRI and clinical follow-up. Accurate prostate segmentation is an important preliminary step for automating this process, enabling automated detection and diagnosis of PCa. However, existing deep-learning segmentation models are often trained on single-time-point and expertly annotated datasets, making them unsuitable for longitudinal AS analysis, where multiple time points and a scarcity of expert labels hinder their effective fine-tuning. To address these challenges, we propose MambaX-Net, a novel semi-supervised, dual-scan 3D segmentation architecture that computes the segmentation for time point t by leveraging the MRI and the corresponding segmentation mask from the previous time point. We introduce two new components: (i) a Mamba-enhanced Cross-Attention Module, which integrates the Mamba block into cross attention to efficiently capture temporal evolution and long-range spatial dependencies, and (ii) a Shape Extractor Module that encodes the previous segmentation mask into a latent anatomical representation for refined zone delination. Moreover, we introduce a semi-supervised self-training strategy that leverages pseudo-labels generated from a pre-trained nnU-Net, enabling effective learning without expert annotations. MambaX-Net was evaluated on a longitudinal AS dataset, and results showed that it significantly outperforms state-of-the-art U-Net and Transformer-based models, achieving superior prostate zone segmentation even when trained on limited and noisy data.

37. 【2510.17519】MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models

链接：https://arxiv.org/abs/2510.17519

作者：Yongshun Zhang,Zhongyi Fan,Yonghang Zhang,Zhangzikang Li,Weifeng Chen,Zhongwei Feng,Chaoyue Wang,Peng Hou,Anxiang Zeng

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：made remarkable progress, large-scale video generation, video generation, video generation models, visual content

备注： Technical Report; Project Page: \href{ [this https URL](https://github.com/Shopee-MUG/MUG-V) }

点击查看摘要

Abstract:In recent years, large-scale generative models for visual content (\textit{e.g.,} images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling, details are available in \href{this https URL}{our webpage}.

38. 【2510.17501】Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization

链接：https://arxiv.org/abs/2510.17501

作者：Yuanli Wu,Long Zhang,Yue Du,Bin Li

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：compressing long footage, social media, compressing long, surrogates is crucial, exploding across social

备注：

点击查看摘要

Abstract:With video exploding across social media, surveillance, and education, compressing long footage into concise yet faithful surrogates is crucial. Supervised methods learn frame/shot importance from dense labels and excel in-domain, but are costly and brittle across datasets; unsupervised methods avoid labels but often miss high-level semantics and narrative cues. Recent zero-shot pipelines use LLMs for training-free summarization, yet remain sensitive to handcrafted prompts and dataset-specific this http URL propose a rubric-guided, pseudo-labeled prompting framework. A small subset of human annotations is converted into high-confidence pseudo labels and aggregated into structured, dataset-adaptive scoring rubrics for interpretable scene evaluation. At inference, boundary scenes (first/last) are scored from their own descriptions, while intermediate scenes include brief summaries of adjacent segments to assess progression and redundancy, enabling the LLM to balance local salience with global coherence without parameter this http URL three benchmarks, our method is consistently effective. On SumMe and TVSum it achieves F1 of 57.58 and 63.05, surpassing a zero-shot baseline (56.73, 62.21) by +0.85 and +0.84 and approaching supervised performance. On the query-focused QFVS benchmark it attains 53.79 F1, beating 53.42 by +0.37 and remaining stable across validation videos. These results show that rubric-guided pseudo labeling, coupled with contextual prompting, stabilizes LLM-based scoring and yields a general, interpretable zero-shot paradigm for both generic and query-focused video summarization.

39. 【2510.17484】Split-Fuse-Transport: Annotation-Free Saliency via Dual Clustering and Optimal Transport Alignment

链接：https://arxiv.org/abs/2510.17484

作者：Muhammad Umer Ramzan,Ali Zia,Abdelwahed Khamis,Noman Ali,Usman Ali,Wei Xiang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Salient object detection, computer vision applications, segment visually prominent, visually prominent regions, Salient object

备注：

点击查看摘要

Abstract:Salient object detection (SOD) aims to segment visually prominent regions in images and serves as a foundational task for various computer vision applications. We posit that SOD can now reach near-supervised accuracy without a single pixel-level label, but only when reliable pseudo-masks are available. We revisit the prototype-based line of work and make two key observations. First, boundary pixels and interior pixels obey markedly different geometry; second, the global consistency enforced by optimal transport (OT) is underutilized if prototype quality is weak. To address this, we introduce POTNet, an adaptation of Prototypical Optimal Transport that replaces POT's single k-means step with an entropy-guided dual-clustering head: high-entropy pixels are organized by spectral clustering, low-entropy pixels by k-means, and the two prototype sets are subsequently aligned by OT. This split-fuse-transport design yields sharper, part-aware pseudo-masks in a single forward pass, without handcrafted priors. Those masks supervise a standard MaskFormer-style encoder-decoder, giving rise to AutoSOD, an end-to-end unsupervised SOD pipeline that eliminates SelfMask's offline voting yet improves both accuracy and training efficiency. Extensive experiments on five benchmarks show that AutoSOD outperforms unsupervised methods by up to 26% and weakly supervised methods by up to 36% in F-measure, further narrowing the gap to fully supervised models.

40. 【2510.17482】SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries

链接：https://arxiv.org/abs/2510.17482

作者：Chenxu Dang,Haiyan Liu,Guangjun Bao,Pei An,Xinyue Tang,Jie Ma,Bingchuan Sun,Yan Wang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：rich spatial semantics, spatial semantics, capture rich spatial, occupancy world models, Semantic occupancy

备注： Under Review

点击查看摘要

Abstract:Semantic occupancy has emerged as a powerful representation in world models for its ability to capture rich spatial semantics. However, most existing occupancy world models rely on static and fixed embeddings or grids, which inherently limit the flexibility of perception. Moreover, their ``in-place classification" over grids exhibits a potential misalignment with the dynamic and continuous nature of real this http URL this paper, we propose SparseWorld, a novel 4D occupancy world model that is flexible, adaptive, and efficient, powered by sparse and dynamic queries. We propose a Range-Adaptive Perception module, in which learnable queries are modulated by the ego vehicle states and enriched with temporal-spatial associations to enable extended-range perception. To effectively capture the dynamics of the scene, we design a State-Conditioned Forecasting module, which replaces classification-based forecasting with regression-guided formulation, precisely aligning the dynamic queries with the continuity of the 4D environment. In addition, We specifically devise a Temporal-Aware Self-Scheduling training strategy to enable smooth and efficient training. Extensive experiments demonstrate that SparseWorld achieves state-of-the-art performance across perception, forecasting, and planning tasks. Comprehensive visualizations and ablation studies further validate the advantages of SparseWorld in terms of flexibility, adaptability, and efficiency. The code is available at this https URL.

41. 【2510.17479】Initialize to Generalize: A Stronger Initialization Pipeline for Sparse-View 3DGS

链接：https://arxiv.org/abs/2510.17479

作者：Feng Zhou,Wenkai Guo,Pu Cao,Zhicheng Zhang,Jianqin Yin

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Gaussian Splatting, leading to artifacts, artifacts like blurring, Splatting, training-time constraints

备注： A preprint paper

点击查看摘要

Abstract:Sparse-view 3D Gaussian Splatting (3DGS) often overfits to the training views, leading to artifacts like blurring in novel view rendering. Prior work addresses it either by enhancing the initialization (\emph{i.e.}, the point cloud from Structure-from-Motion (SfM)) or by adding training-time constraints (regularization) to the 3DGS optimization. Yet our controlled ablations reveal that initialization is the decisive factor: it determines the attainable performance band in sparse-view 3DGS, while training-time constraints yield only modest within-band improvements at extra cost. Given initialization's primacy, we focus our design there. Although SfM performs poorly under sparse views due to its reliance on feature matching, it still provides reliable seed points. Thus, building on SfM, our effort aims to supplement the regions it fails to cover as comprehensively as possible. Specifically, we design: (i) frequency-aware SfM that improves low-texture coverage via low-frequency view augmentation and relaxed multi-view correspondences; (ii) 3DGS self-initialization that lifts photometric supervision into additional points, compensating SfM-sparse regions with learned Gaussian centers; and (iii) point-cloud regularization that enforces multi-view consistency and uniform spatial coverage through simple geometric/visibility priors, yielding a clean and reliable point cloud. Our experiments on LLFF and Mip-NeRF360 demonstrate consistent gains in sparse-view settings, establishing our approach as a stronger initialization strategy. Code is available at this https URL.

42. 【2510.17440】Rethinking Nighttime Image Deraining via Learnable Color Space Transformation

链接：https://arxiv.org/abs/2510.17440

作者：Qiyuan Guan,Xiang Chen,Guiyue Jin,Jiyu Jin,Shumin Fan,Tianyu Song,Jinshan Pan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：poses significant challenges, significant challenges due, nighttime image deraining, deraining poses significant, daytime image deraining

备注： Accepted by NeurIPS 2025

点击查看摘要

Abstract:Compared to daytime image deraining, nighttime image deraining poses significant challenges due to inherent complexities of nighttime scenarios and the lack of high-quality datasets that accurately represent the coupling effect between rain and illumination. In this paper, we rethink the task of nighttime image deraining and contribute a new high-quality benchmark, HQ-NightRain, which offers higher harmony and realism compared to existing datasets. In addition, we develop an effective Color Space Transformation Network (CST-Net) for better removing complex rain from nighttime scenes. Specifically, we propose a learnable color space converter (CSC) to better facilitate rain removal in the Y channel, as nighttime rain is more pronounced in the Y channel compared to the RGB color space. To capture illumination information for guiding nighttime deraining, implicit illumination guidance is introduced enabling the learned features to improve the model's robustness in complex scenarios. Extensive experiments show the value of our dataset and the effectiveness of our method. The source code and datasets are available at this https URL.

43. 【2510.17439】From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors

链接：https://arxiv.org/abs/2510.17439

作者：Zhengshen Zhang,Hao Li,Yalun Dai,Zhengbang Zhu,Lei Zhou,Chenchen Liu,Dong Wang,Francis E. H. Tay,Sijin Chen,Ziwei Liu,Yuxiao Liu,Xinghang Li,Pan Zhou

类目：Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：generalization and adaptability, typically built, gap that limits, limits generalization, spatial reasoning gap

备注： Project page: [this https URL](https://falcon-vla.github.io/)

点击查看摘要

Abstract:Existing vision-language-action (VLA) models act in 3D real-world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth, or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, our proposed FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.

44. 【2510.17434】Leveraging AV1 motion vectors for Fast and Dense Feature Matching

链接：https://arxiv.org/abs/2510.17434

作者：Julien Zouein,Hossein Javidnia,François Pitié,Anil Kokaram

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：produce dense sub-pixel, short tracks filtered, dense sub-pixel correspondences, motion vectors, cosine consistency

备注： Accepted ICIR 2025, camera-ready version

点击查看摘要

Abstract:We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. As a small SfM demo on a 117-frame clip, MV matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53,px reprojection error; BA time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.

45. 【2510.17422】DeepDetect: Learning All-in-One Dense Keypoints

链接：https://arxiv.org/abs/2510.17422

作者：Shaharyar Ahmed Khan Tareen,Filza Khan Tareen

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：computer vision tasks, including image registration, structure-from motion, vision tasks, computer vision

备注： 6 pages, 6 figures, 2 tables, 7 equations

点击查看摘要

Abstract:Keypoint detection is the foundation of many computer vision tasks, including image registration, structure-from motion, 3D reconstruction, visual odometry, and SLAM. Traditional detectors (SIFT, SURF, ORB, BRISK, etc.) and learning based methods (SuperPoint, R2D2, LF-Net, D2-Net, etc.) have shown strong performance yet suffer from key limitations: sensitivity to photometric changes, low keypoint density and repeatability, limited adaptability to challenging scenes, and lack of semantic understanding, often failing to prioritize visually important regions. We present DeepDetect, an intelligent, all-in-one, dense keypoint detector that unifies the strengths of classical detectors using deep learning. Firstly, we create ground-truth masks by fusing outputs of 7 keypoint and 2 edge detectors, extracting diverse visual cues from corners and blobs to prominent edges and textures in the images. Afterwards, a lightweight and efficient model: ESPNet, is trained using these masks as labels, enabling DeepDetect to focus semantically on images while producing highly dense keypoints, that are adaptable to diverse and visually degraded conditions. Evaluations on the Oxford Affine Covariant Regions dataset demonstrate that DeepDetect surpasses other detectors in keypoint density, repeatability, and the number of correct matches, achieving maximum values of 0.5143 (average keypoint density), 0.9582 (average repeatability), and 59,003 (correct matches).

46. 【2510.17409】Monitoring Horses in Stalls: From Object to Event Detection

链接：https://arxiv.org/abs/2510.17409

作者：Dmitrii Galimzianov,Viacheslav Vyshegorodtsev,Ivan Nezhivykh

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：labor-intensive and time-consuming, behavior of stalled, essential for early, issues but remains, remains labor-intensive

备注： 12 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Monitoring the behavior of stalled horses is essential for early detection of health and welfare issues but remains labor-intensive and time-consuming. In this study, we present a prototype vision-based monitoring system that automates the detection and tracking of horses and people inside stables using object detection and multi-object tracking techniques. The system leverages YOLOv11 and BoT-SORT for detection and tracking, while event states are inferred based on object trajectories and spatial relations within the stall. To support development, we constructed a custom dataset annotated with assistance from foundation models CLIP and GroundingDINO. The system distinguishes between five event types and accounts for the camera's blind spots. Qualitative evaluation demonstrated reliable performance for horse-related events, while highlighting limitations in detecting people due to data scarcity. This work provides a foundation for real-time behavioral monitoring in equine facilities, with implications for animal welfare and stable management.

47. 【2510.17394】MILES: Modality-Informed Learning Rate Scheduler for Balancing Multimodal Learning

链接：https://arxiv.org/abs/2510.17394

作者：Alejandro Guerra-Manzanares,Farah E. Shamout

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词：diverse data sources, combine diverse data, multimodal, multimodal neural networks, data sources

备注： Accepted and presented at the 2025 International Joint Conference on Neural Networks (IJCNN'25). The paper was awarded an honorable mention (best 4 papers)

点击查看摘要

Abstract:The aim of multimodal neural networks is to combine diverse data sources, referred to as modalities, to achieve enhanced performance compared to relying on a single modality. However, training of multimodal networks is typically hindered by modality overfitting, where the network relies excessively on one of the available modalities. This often yields sub-optimal performance, hindering the potential of multimodal learning and resulting in marginal improvements relative to unimodal models. In this work, we present the Modality-Informed Learning ratE Scheduler (MILES) for training multimodal joint fusion models in a balanced manner. MILES leverages the differences in modality-wise conditional utilization rates during training to effectively balance multimodal learning. The learning rate is dynamically adjusted during training to balance the speed of learning from each modality by the multimodal model, aiming for enhanced performance in both multimodal and unimodal predictions. We extensively evaluate MILES on four multimodal joint fusion tasks and compare its performance to seven state-of-the-art baselines. Our results show that MILES outperforms all baselines across all tasks and fusion methods considered in our study, effectively balancing modality usage during training. This results in improved multimodal performance and stronger modality encoders, which can be leveraged when dealing with unimodal samples or absent modalities. Overall, our work highlights the impact of balancing multimodal learning on improving model performance.

48. 【2510.17384】Closed-Loop Transfer for Weakly-supervised Affordance Grounding

链接：https://arxiv.org/abs/2510.17384

作者：Jiajin Tang,Zhengxuan Wei,Ge Zheng,Sibei Yang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：perform previously unexperienced, previously unexperienced interactions, perform previously, previously unexperienced, simply by observing

备注： Accepted at ICCV 2025

点击查看摘要

Abstract:Humans can perform previously unexperienced interactions with novel objects simply by observing others engage with them. Weakly-supervised affordance grounding mimics this process by learning to locate object regions that enable actions on egocentric images, using exocentric interaction images with image-level annotations. However, extracting affordance knowledge solely from exocentric images and transferring it one-way to egocentric images limits the applicability of previous works in complex interaction scenarios. Instead, this study introduces LoopTrans, a novel closed-loop framework that not only transfers knowledge from exocentric to egocentric but also transfers back to enhance exocentric knowledge extraction. Within LoopTrans, several innovative mechanisms are introduced, including unified cross-modal localization and denoising knowledge distillation, to bridge domain gaps between object-centered egocentric and interaction-centered exocentric images while enhancing knowledge transfer. Experiments show that LoopTrans achieves consistent improvements across all metrics on image and video benchmarks, even handling challenging scenarios where object interaction regions are fully occluded by the human body.

49. 【2510.17383】Latent Spaces Beyond Synthesis: From GANs to Diffusion Models

链接：https://arxiv.org/abs/2510.17383

作者：Ludovica Schaerf

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

关键词：generative visual models, Beatrice Fazi account, paper examines, examines the evolving, evolving nature

备注： Presented and published at Ethics and Aesthetics of Artificial Intelligence Conference (EA-AI'25)

点击查看摘要

Abstract:This paper examines the evolving nature of internal representations in generative visual models, focusing on the conceptual and technical shift from GANs and VAEs to diffusion-based architectures. Drawing on Beatrice Fazi's account of synthesis as the amalgamation of distributed representations, we propose a distinction between "synthesis in a strict sense", where a compact latent space wholly determines the generative process, and "synthesis in a broad sense," which characterizes models whose representational labor is distributed across layers. Through close readings of model architectures and a targeted experimental setup that intervenes in layerwise representations, we show how diffusion models fragment the burden of representation and thereby challenge assumptions of unified internal space. By situating these findings within media theoretical frameworks and critically engaging with metaphors such as the latent space and the Platonic Representation Hypothesis, we argue for a reorientation of how generative AI is understood: not as a direct synthesis of content, but as an emergent configuration of specialized processes.

50. 【2510.17373】Facial Expression-based Parkinson's Disease Severity Diagnosis via Feature Fusion and Adaptive Class Balancing

链接：https://arxiv.org/abs/2510.17373

作者：Yintao Zhou,Wei Huang,Zhengyu Li,Jing Huang,Meng Pang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：adopting tailored interventions, early detecting potential, detecting potential patients, Parkinson disease, tailored interventions

备注： 3 pages, 2 figures, accepted by MIND 2025

点击查看摘要

Abstract:Parkinson's disease (PD) severity diagnosis is crucial for early detecting potential patients and adopting tailored interventions. Diagnosing PD based on facial expression is grounded in PD patients' "masked face" symptom and gains growing interest recently for its convenience and affordability. However, current facial expression-based approaches often rely on single type of expression which can lead to misdiagnosis, and ignore the class imbalance across different PD stages which degrades the prediction performance. Moreover, most existing methods focus on binary classification (i.e., PD / non-PD) rather than diagnosing the severity of PD. To address these issues, we propose a new facial expression-based method for PD severity diagnosis which integrates multiple facial expression features through attention-based feature fusion. Moreover, we mitigate the class imbalance problem via an adaptive class balancing strategy which dynamically adjusts the contribution of training samples based on their class distribution and classification difficulty. Experimental results demonstrate the promising performance of the proposed method for PD severity diagnosis, as well as the efficacy of attention-based feature fusion and adaptive class balancing.

51. 【2510.17372】Beyond Real Faces: Synthetic Datasets Can Achieve Reliable Recognition Performance without Privacy Compromise

链接：https://arxiv.org/abs/2510.17372

作者：Paweł Borsukiewicz,Fadi Boutros,Iyiola E. Olatunji,Charles Beumier,Wendkûuni C. Ouedraogo,Jacques Klein,Tegawendé F. Bissyandé

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：achieving high accuracy, high accuracy requires, accuracy requires massive, potential legal liabilities, regulations like GDPR

备注：

点击查看摘要

Abstract:The deployment of facial recognition systems has created an ethical dilemma: achieving high accuracy requires massive datasets of real faces collected without consent, leading to dataset retractions and potential legal liabilities under regulations like GDPR. While synthetic facial data presents a promising privacy-preserving alternative, the field lacks comprehensive empirical evidence of its viability. This study addresses this critical gap through extensive evaluation of synthetic facial recognition datasets. We present a systematic literature review identifying 25 synthetic facial recognition datasets (2018-2025), combined with rigorous experimental validation. Our methodology examines seven key requirements for privacy-preserving synthetic data: identity leakage prevention, intra-class variability, identity separability, dataset scale, ethical data sourcing, bias mitigation, and benchmark reliability. Through experiments involving over 10 million synthetic samples, extended by a comparison of results reported on five standard benchmarks, we provide the first comprehensive empirical assessment of synthetic data's capability to replace real datasets. Best-performing synthetic datasets (VariFace, VIGFace) achieve recognition accuracies of 95.67% and 94.91% respectively, surpassing established real datasets including CASIA-WebFace (94.70%). While those images remain private, publicly available alternatives Vec2Face (93.52%) and CemiFace (93.22%) come close behind. Our findings reveal that they ensure proper intra-class variability while maintaining identity separability. Demographic bias analysis shows that, even though synthetic data inherits limited biases, it offers unprecedented control for bias mitigation through generation parameters. These results establish synthetic facial data as a scientifically viable and ethically imperative alternative for facial recognition research.

52. 【2510.17364】Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs

链接：https://arxiv.org/abs/2510.17364

作者：Vaggelis Dorovatas,Soroush Seifi,Gunshi Gupta,Rahaf Aljundi

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Large Language Models, Video Large Language, Large Language, Language Models, understanding videos in-context

备注： NeurIPS 2025

点击查看摘要

Abstract:Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online, and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.

53. 【2510.17363】M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

链接：https://arxiv.org/abs/2510.17363

作者：U.V.B.L Udugama,George Vosselman,Francesco Nex

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

关键词：devices requires efficient, edge devices requires, requires efficient multi-task, Deploying real-time spatial, minimizing computational overhead

备注： Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures

点击查看摘要

Abstract:Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.

54. 【2510.17347】Exploring The Missing Semantics In Event Modality

链接：https://arxiv.org/abs/2510.17347

作者：Jingqian Wu,Shengpeng Xu,Yunbo Jia,Edmund Y. Lam

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：high dynamic range, efficient motion capture, offer distinct advantages, Event cameras offer, semantic information

备注：

点击查看摘要

Abstract:Event cameras offer distinct advantages such as low latency, high dynamic range, and efficient motion capture. However, event-to-video reconstruction (E2V), a fundamental event-based vision task, remains challenging, particularly for reconstructing and recovering semantic information. This is primarily due to the nature of the event camera, as it only captures intensity changes, ignoring static objects and backgrounds, resulting in a lack of semantic information in captured event modality. Further, semantic information plays a crucial role in video and frame reconstruction, yet is often overlooked by existing E2V approaches. To bridge this gap, we propose Semantic-E2VID, an E2V framework that explores the missing visual semantic knowledge in event modality and leverages it to enhance event-to-video reconstruction. Specifically, Semantic-E2VID introduces a cross-modal feature alignment (CFA) module to transfer the robust visual semantics from a frame-based vision foundation model, the Segment Anything Model (SAM), to the event encoder, while aligning the high-level features from distinct modalities. To better utilize the learned semantic feature, we further propose a semantic-aware feature fusion (SFF) block to integrate learned semantics in frame modality to form event representations with rich semantics that can be decoded by the event decoder. Further, to facilitate the reconstruction of semantic information, we propose a novel Semantic Perceptual E2V Supervision that helps the model to reconstruct semantic details by leveraging SAM-generated categorical labels. Extensive experiments demonstrate that Semantic-E2VID significantly enhances frame quality, outperforming state-of-the-art E2V methods across multiple benchmarks. The sample code is included in the supplementary material.

55. 【2510.17338】Nearest-Class Mean and Logits Agreement for Wildlife Open-Set Recognition

链接：https://arxiv.org/abs/2510.17338

作者：Jiahao Huo,Mufhumudzi Muthivhi,Terence L. van Zyl,Fredrik Gustafsson

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：closed world setting, Wildlife classification models, Wildlife classification, world setting, closed world

备注：

点击查看摘要

Abstract:Current state-of-the-art Wildlife classification models are trained under the closed world setting. When exposed to unknown classes, they remain overconfident in their predictions. Open-set Recognition (OSR) aims to classify known classes while rejecting unknown samples. Several OSR methods have been proposed to model the closed-set distribution by observing the feature, logit, or softmax probability space. A significant drawback of many existing approaches is the requirement to retrain the pre-trained classification model with the OSR-specific strategy. This study contributes a post-processing OSR method that measures the agreement between the models' features and predicted logits. We propose a probability distribution based on an input's distance to its Nearest Class Mean (NCM). The NCM-based distribution is then compared with the softmax probabilities from the logit space to measure agreement between the NCM and the classification head. Our proposed strategy ranks within the top three on two evaluated datasets, showing consistent performance across the two datasets. In contrast, current state-of-the-art methods excel on a single dataset. We achieve an AUROC of 93.41 and 95.35 for African and Swedish animals. The code can be found this https URL.

56. 【2510.17332】DETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA

链接：https://arxiv.org/abs/2510.17332

作者：Zhaoran Zhao,Xinli Yue,Jianhui Sun,Yuhao Xie,Tao Shao,Liangchao Yao,Fan Xia,Yuetang Deng

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：human-aligned evaluation paradigms, scalar quality prediction, Image Quality Assessment, human-aligned evaluation, evaluation paradigms

备注： Accepted to ICCV 2025 Workshop

点击查看摘要

Abstract:Image Quality Assessment (IQA) has progressed from scalar quality prediction to more interpretable, human-aligned evaluation paradigms. In this work, we address the emerging challenge of detailed and explainable IQA by proposing iDETEX-a unified multimodal large language model (MLLM) capable of simultaneously performing three key tasks: quality grounding, perception, and description. To facilitate efficient and generalizable training across these heterogeneous subtasks, we design a suite of task-specific offline augmentation modules and a data mixing strategy. These are further complemented by online enhancement strategies to fully exploit multi-sourced supervision. We validate our approach on the large-scale ViDA-UGC benchmark, where iDETEX achieves state-of-the-art performance across all subtasks. Our model ranks first in the ICCV MIPI 2025 Detailed Image Quality Assessment Challenge, demonstrating its effectiveness and robustness in delivering accurate and interpretable quality assessments.

57. 【2510.17330】CharDiff: A Diffusion Model with Character-Level Guidance for License Plate Image Restoration

链接：https://arxiv.org/abs/2510.17330

作者：Gyuhwan Park,Kihyun Na,Injung Kim

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：license plate images, including increasing evidential, license plate, License Plate Recognition, plate images

备注： 11 pages, 6 figures

点击查看摘要

Abstract:The significance of license plate image restoration goes beyond the preprocessing stage of License Plate Recognition (LPR) systems, as it also serves various purposes, including increasing evidential value, enhancing the clarity of visual interface, and facilitating further utilization of license plate images. We propose a novel diffusion-based framework with character-level guidance, CharDiff, which effectively restores and recognizes severely degraded license plate images captured under realistic conditions. CharDiff leverages fine-grained character-level priors extracted through external segmentation and Optical Character Recognition (OCR) modules tailored for low-quality license plate images. For precise and focused guidance, CharDiff incorporates a novel Character-guided Attention through Region-wise Masking (CHARM) module, which ensures that each character's guidance is restricted to its own region, thereby avoiding interference with other regions. In experiments, CharDiff significantly outperformed the baseline restoration models in both restoration quality and recognition accuracy, achieving a 28% relative reduction in CER on the Roboflow-LP dataset, compared to the best-performing baseline model. These results indicate that the structured character-guided conditioning effectively enhances the robustness of diffusion-based license plate restoration and recognition in practical deployment scenarios.

58. 【2510.17322】A Single Set of Adversarial Clothes Breaks Multiple Defense Methods in the Physical World

链接：https://arxiv.org/abs/2510.17322

作者：Wei Zhang,Zhanhao Hu,Xiao Li,Xiaopei Zhu,Xiaolin Hu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：deep learning-based object, defense methods, learning-based object detectors, adversarial, recent years

备注： 13 pages, 8 figures

点击查看摘要

Abstract:In recent years, adversarial attacks against deep learning-based object detectors in the physical world have attracted much attention. To defend against these attacks, researchers have proposed various defense methods against adversarial patches, a typical form of physically-realizable attack. However, our experiments showed that simply enlarging the patch size could make these defense methods fail. Motivated by this, we evaluated various defense methods against adversarial clothes which have large coverage over the human body. Adversarial clothes provide a good test case for adversarial defense against patch-based attacks because they not only have large sizes but also look more natural than a large patch on humans. Experiments show that all the defense methods had poor performance against adversarial clothes in both the digital world and the physical world. In addition, we crafted a single set of clothes that broke multiple defense methods on Faster R-CNN. The set achieved an Attack Success Rate (ASR) of 96.06% against the undefended detector and over 64.84% ASRs against nine defended models in the physical world, unveiling the common vulnerability of existing adversarial defense methods against adversarial clothes. Code is available at: this https URL.

59. 【2510.17318】CausalMamba: Scalable Conditional State Space Models for Neural Causal Inference

链接：https://arxiv.org/abs/2510.17318

作者：Sangyoon Bae,Jiook Cha

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：hemodynamically distorted BOLD, distorted BOLD signals, Dynamic Causal Modeling, inferring neural causality, Conditional Mamba architecture

备注：

点击查看摘要

Abstract:We introduce CausalMamba, a scalable framework that addresses fundamental limitations in fMRI-based causal inference: the ill-posed nature of inferring neural causality from hemodynamically distorted BOLD signals and the computational intractability of existing methods like Dynamic Causal Modeling (DCM). Our approach decomposes this complex inverse problem into two tractable stages: BOLD deconvolution to recover latent neural activity, followed by causal graph inference using a novel Conditional Mamba architecture. On simulated data, CausalMamba achieves 37% higher accuracy than DCM. Critically, when applied to real task fMRI data, our method recovers well-established neural pathways with 88% fidelity, whereas conventional approaches fail to identify these canonical circuits in over 99% of subjects. Furthermore, our network analysis of working memory data reveals that the brain strategically shifts its primary causal hub-recruiting executive or salience networks depending on the stimulus-a sophisticated reconfiguration that remains undetected by traditional methods. This work provides neuroscientists with a practical tool for large-scale causal inference that captures both fundamental circuit motifs and flexible network dynamics underlying cognitive function.

60. 【2510.17305】LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding

链接：https://arxiv.org/abs/2510.17305

作者：ZhaoYang Han,Qihan Lin,Hao Liang,Bowen Chen,Zhou Liu,Wentao Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词：understand long videos, assess models' ability, Challenging Task Scenarios, rich language elements, textbf

备注： Submitted to ARR Rolling Review

点击查看摘要

Abstract:We introduce \textbf{LongInsightBench}, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating \textbf{visual, audio, and text} modalities. Our benchmark excels in three key areas: \textbf{a) Long-Duration, Information-Dense Videos:} We carefully select approximately 1,000 videos from open-source datasets FineVideo based on duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. \textbf{b) Diverse and Challenging Task Scenarios:} We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. \textbf{c) Rigorous and Comprehensive Quality Assurance Pipelines:} We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results shows that Omni-modal models(OLMs) still face challenge in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal the information loss and processing bias in multi-modal fusion of OLMs. Our dataset and code is available at this https URL.

61. 【2510.17299】Exploring Structural Degradation in Dense Representations for Self-supervised Learning

链接：https://arxiv.org/abs/2510.17299

作者：Siran Dai,Qianqian Xu,Peisong Wen,Yang Liu,Qingming Huang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：semantic segmentation, Self-supervised Dense Degradation, self-supervised learning, dense prediction tasks, observe a counterintuitive

备注： Accepted by NeurIPS 2025

点击查看摘要

Abstract:In this work, we observe a counterintuitive phenomenon in self-supervised learning (SSL): longer training may impair the performance of dense prediction tasks (e.g., semantic segmentation). We refer to this phenomenon as Self-supervised Dense Degradation (SDD) and demonstrate its consistent presence across sixteen state-of-the-art SSL methods with various losses, architectures, and datasets. When the model performs suboptimally on dense tasks at the end of training, measuring the performance during training becomes essential. However, evaluating dense performance effectively without annotations remains an open challenge. To tackle this issue, we introduce a Dense representation Structure Estimator (DSE), composed of a class-relevance measure and an effective dimensionality measure. The proposed DSE is both theoretically grounded and empirically validated to be closely correlated with the downstream performance. Based on this metric, we introduce a straightforward yet effective model selection strategy and a DSE-based regularization method. Experiments on sixteen SSL methods across four benchmarks confirm that model selection improves mIoU by $3.0\%$ on average with negligible computational cost. Additionally, DSE regularization consistently mitigates the effects of dense degradation. Code is available at this https URL.

62. 【2510.17287】Machine Vision-Based Surgical Lighting System:Design and Implementation

链接：https://arxiv.org/abs/2510.17287

作者：Amir Gharghabi,Mahdi Hakiminezhad,Maryam Shafaei,Shaghayegh Gharghabi

类目：Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词：Effortless and ergonomically, ergonomically designed surgical, safety during procedures, ergonomically designed, critical for precision

备注：

点击查看摘要

Abstract:Effortless and ergonomically designed surgical lighting is critical for precision and safety during procedures. However, traditional systems often rely on manual adjustments, leading to surgeon fatigue, neck strain, and inconsistent illumination due to drift and shadowing. To address these challenges, we propose a novel surgical lighting system that leverages the YOLOv11 object detection algorithm to identify a blue marker placed above the target surgical site. A high-power LED light source is then directed to the identified location using two servomotors equipped with tilt-pan brackets. The YOLO model achieves 96.7% mAP@50 on the validation set consisting of annotated images simulating surgical scenes with the blue spherical marker. By automating the lighting process, this machine vision-based solution reduces physical strain on surgeons, improves consistency in illumination, and supports improved surgical outcomes.

63. 【2510.17278】SG-CLDFF: A Novel Framework for Automated White Blood Cell Classification and Segmentation

链接：https://arxiv.org/abs/2510.17278

作者：Mehdi Zekriyapanah Gashti,Mostafa Mohammadpour,Ghasem Farjamnia

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：white blood cells, remain challenging due, Deep Feature Fusion, Accurate segmentation, blood cells

备注：

点击查看摘要

Abstract:Accurate segmentation and classification of white blood cells (WBCs) in microscopic images are essential for diagnosis and monitoring of many hematological disorders, yet remain challenging due to staining variability, complex backgrounds, and class imbalance. In this paper, we introduce a novel Saliency-Guided Cross-Layer Deep Feature Fusion framework (SG-CLDFF) that tightly integrates saliency-driven preprocessing with multi-scale deep feature aggregation to improve both robustness and interpretability for WBC analysis. SG-CLDFF first computes saliency priors to highlight candidate WBC regions and guide subsequent feature extraction. A lightweight hybrid backbone (EfficientSwin-style) produces multi-resolution representations, which are fused by a ResNeXt-CC-inspired cross-layer fusion module to preserve complementary information from shallow and deep layers. The network is trained in a multi-task setup with concurrent segmentation and cell-type classification heads, using class-aware weighted losses and saliency-alignment regularization to mitigate imbalance and suppress background activation. Interpretability is enforced through Grad-CAM visualizations and saliency consistency checks, allowing model decisions to be inspected at the regional level. We validate the framework on standard public benchmarks (BCCD, LISC, ALL-IDB), reporting consistent gains in IoU, F1, and classification accuracy compared to strong CNN and transformer baselines. An ablation study also demonstrates the individual contributions of saliency preprocessing and cross-layer fusion. SG-CLDFF offers a practical and explainable path toward more reliable automated WBC analysis in clinical workflows.

64. 【2510.17274】Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models

链接：https://arxiv.org/abs/2510.17274

作者：Katie Luo,Jingwei Ji,Tong He,Runsheng Xu,Yichen Xie,Dragomir Anguelov,Mingxing Tan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Current autonomous driving, autonomous driving systems, driving systems rely, Current autonomous, demonstrate reliable performance

备注： In proceedings of IROS 2025

点击查看摘要

Abstract:Current autonomous driving systems rely on specialized models for perceiving and predicting motion, which demonstrate reliable performance in standard conditions. However, generalizing cost-effectively to diverse real-world scenarios remains a significant challenge. To address this, we propose Plug-and-Forecast (PnF), a plug-and-play approach that augments existing motion forecasting models with multimodal large language models (MLLMs). PnF builds on the insight that natural language provides a more effective way to describe and handle complex scenarios, enabling quick adaptation to targeted behaviors. We design prompts to extract structured scene understanding from MLLMs and distill this information into learnable embeddings to augment existing behavior prediction models. Our method leverages the zero-shot reasoning capabilities of MLLMs to achieve significant improvements in motion prediction performance, while requiring no fine-tuning -- making it practical to adopt. We validate our approach on two state-of-the-art motion forecasting models using the Waymo Open Motion Dataset and the nuScenes Dataset, demonstrating consistent performance improvements across both benchmarks.

65. 【2510.17269】FineVision: Open Data Is All You Need

链接：https://arxiv.org/abs/2510.17269

作者：Luis Wiedmann,Orr Zohar,Amir Mahla,Xiaohan Wang,Rui Li,Thibaud Frere,Leandro von Werra,Aritra Roy Gosthipaty,Andrés Marafioti

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：contaminated public datasets, advancement of vision-language, fragmented landscape, landscape of inconsistent, inconsistent and contaminated

备注：

点击查看摘要

Abstract:The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.

66. 【2510.17264】Fair and Interpretable Deepfake Detection in Videos

链接：https://arxiv.org/abs/2510.17264

作者：Akihito Yoshii,Ryosuke Sonoda,Ramya Srinivasan

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Existing deepfake detection, capture temporal information, Existing deepfake, lack transparency, leading to biased

备注： 10 pages (including References)

点击查看摘要

Abstract:Existing deepfake detection methods often exhibit bias, lack transparency, and fail to capture temporal information, leading to biased decisions and unreliable results across different demographic groups. In this paper, we propose a fairness-aware deepfake detection framework that integrates temporal feature learning and demographic-aware data augmentation to enhance fairness and interpretability. Our method leverages sequence-based clustering for temporal modeling of deepfake videos and concept extraction to improve detection reliability while also facilitating interpretable decisions for non-expert users. Additionally, we introduce a demography-aware data augmentation method that balances underrepresented groups and applies frequency-domain transformations to preserve deepfake artifacts, thereby mitigating bias and improving generalization. Extensive experiments on FaceForensics++, DFD, Celeb-DF, and DFDC datasets using state-of-the-art (SoTA) architectures (Xception, ResNet) demonstrate the efficacy of the proposed method in obtaining the best tradeoff between fairness and accuracy when compared to SoTA.

67. 【2510.17247】From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models

链接：https://arxiv.org/abs/2510.17247

作者：Zefan Cai,Haoyi Qiu,Haozhe Zhao,Ke Wan,Jiachen Li,Jiuxiang Gu,Wen Xiao,Nanyun Peng,Junjie Hu

类目：Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：Recent advances, significantly enhanced, Recent, video generation, reward models trained

备注：

点击查看摘要

68. 【2510.17234】aming Modality Entanglement in Continual Audio-Visual Segmentation

链接：https://arxiv.org/abs/2510.17234

作者：Yuyang Hong,Qi Yang,Tao Zhang,Zili Wang,Zhaojin Fu,Kun Ding,Bin Fan,Shiming Xiang

类目：Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：continual learning, significant progress, continual learning settings, preserving performance, performance on previously

备注：

点击查看摘要

Abstract:Recently, significant progress has been made in multi-modal continual learning, aiming to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks, with limitations in addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where a sounding objects is labeled as background in sequential tasks; 2) co-occurrence confusion, where frequent co-occurring classes tend to be confused. In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, allowing for the increase of rehearsal sample frequency of those confusable classes during training process. Moreover, we construct three audio-visual incremental scenarios to verify effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods.

69. 【2510.17218】When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions

链接：https://arxiv.org/abs/2510.17218

作者：Zhuo Cao,Heming Du,Bingqing Zhang,Xin Yu,Xue Li,Sen Wang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Existing Moment retrieval, focus on Single-Moment, QV-M, Single-Moment Retrieval, Moment retrieval

备注： Accepted to NeurIPS 2025

点击查看摘要

Abstract:Existing Moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality datasets called QVHighlights Multi-Moment Dataset (QV-M$^2$), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M$^2$ consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this sophisticated filtering pipeline, low-confidence proposals are pruned, and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M$^2$ and QVHighlights under both SMR and MMR settings. Results show that QV-M$^2$ serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M$^2$, it achieves improvements over prior SOTA method by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at this https URL.

70. 【2510.17205】$\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs

链接：https://arxiv.org/abs/2510.17205

作者：Yingqi Fan,Anhao Zhao,Jinlan Fu,Junlong Tong,Hui Su,Yijie Pan,Wei Zhang,Xiaoyu Shen

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：Multimodal Large Language, Large Language Models, Large Language, achieved strong performance, significant computational overhead

备注： EMNLP 2025 Main

点击查看摘要

71. 【2510.17201】Optimizing DINOv2 with Registers for Face Anti-Spoofing

链接：https://arxiv.org/abs/2510.17201

作者：Mika Feng,Pierre Gallin-Martel,Koichi Ito,Takafumi Aoki

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：head pose, blur during capture, robust against variations, variations in head, Face recognition systems

备注： ICCV 2025 Workshop FAS

点击查看摘要

Abstract:Face recognition systems are designed to be robust against variations in head pose, illumination, and image blur during capture. However, malicious actors can exploit these systems by presenting a face photo of a registered user, potentially bypassing the authentication process. Such spoofing attacks must be detected prior to face recognition. In this paper, we propose a DINOv2-based spoofing attack detection method to discern minute differences between live and spoofed face images. Specifically, we employ DINOv2 with registers to extract generalizable features and to suppress perturbations in the attention mechanism, which enables focused attention on essential and minute features. We demonstrate the effectiveness of the proposed method through experiments conducted on the dataset provided by ``The 6th Face Anti-Spoofing Workshop: Unified Physical-Digital Attacks Detection@ICCV2025'' and SiW dataset.

72. 【2510.17200】EndoCIL: A Class-Incremental Learning Framework for Endoscopic Image Classification

链接：https://arxiv.org/abs/2510.17200

作者：Bingrong Liu,Jun Shi,Yushan Zheng

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Class-incremental learning, real-world clinical applications, evolving clinical data, endoscopic image analysis, analysis is crucial

备注：

点击查看摘要

Abstract:Class-incremental learning (CIL) for endoscopic image analysis is crucial for real-world clinical applications, where diagnostic models should continuously adapt to evolving clinical data while retaining performance on previously learned ones. However, existing replay-based CIL methods fail to effectively mitigate catastrophic forgetting due to severe domain discrepancies and class imbalance inherent in endoscopic imaging. To tackle these challenges, we propose EndoCIL, a novel and unified CIL framework specifically tailored for endoscopic image diagnosis. EndoCIL incorporates three key components: Maximum Mean Discrepancy Based Replay (MDBR), employing a distribution-aligned greedy strategy to select diverse and representative exemplars, Prior Regularized Class Balanced Loss (PRCBL), designed to alleviate both inter-phase and intra-phase class imbalance by integrating prior class distributions and balance weights into the loss function, and Calibration of Fully-Connected Gradients (CFG), which adjusts the classifier gradients to mitigate bias toward new classes. Extensive experiments conducted on four public endoscopic datasets demonstrate that EndoCIL generally outperforms state-of-the-art CIL methods across varying buffer sizes and evaluation metrics. The proposed framework effectively balances stability and plasticity in lifelong endoscopic diagnosis, showing promising potential for clinical scalability and deployment.

73. 【2510.17199】Round Outcome Prediction in VALORANT Using Tactical Features from Video Analysis

链接：https://arxiv.org/abs/2510.17199

作者：Nirai Hayakawa,Kazumasa Shimari,Kazuma Yamasaki,Hirotatsu Hoshikawa,Rikuto Tsuchida,Kenichi Matsumoto

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：match log data, FPS game VALORANT, actively conducted, log data, data and statistical

备注： Accepted to IEEE 2025 Conference on Games

点击查看摘要

Abstract:Recently, research on predicting match outcomes in esports has been actively conducted, but much of it is based on match log data and statistical information. This research targets the FPS game VALORANT, which requires complex strategies, and aims to build a round outcome prediction model by analyzing minimap information in match footage. Specifically, based on the video recognition model TimeSformer, we attempt to improve prediction accuracy by incorporating detailed tactical features extracted from minimap information, such as character position information and other in-game events. This paper reports preliminary results showing that a model trained on a dataset augmented with such tactical event labels achieved approximately 81% prediction accuracy, especially from the middle phases of a round onward, significantly outperforming a model trained on a dataset with the minimap information itself. This suggests that leveraging tactical features from match footage is highly effective for predicting round outcomes in VALORANT.

74. 【2510.17198】From Pixels to People: Satellite-Based Mapping and Quantification of Riverbank Erosion and Lost Villages in Bangladesh

链接：https://arxiv.org/abs/2510.17198

作者：M Saifuzzaman Rafat,Mohd Ruhul Ameen,Akif Islam,Abu Saleh Musa Miah,Jungpil Shin

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：arteries of commerce, commerce and sustenance, relentless destruction, agents of relentless, Mokterer Char Union

备注： Submitted to the International Conference on Data and Applied Analytics (IDAA 2025). 15 pages, 5 figures, 4 tables

点击查看摘要

Abstract:The great rivers of Bangladesh, arteries of commerce and sustenance, are also agents of relentless destruction. Each year, they swallow whole villages and vast tracts of farmland, erasing communities from the map and displacing thousands of families. To track this slow-motion catastrophe has, until now, been a Herculean task for human analysts. Here we show how a powerful general-purpose vision model, the Segment Anything Model (SAM), can be adapted to this task with remarkable precision. To do this, we assembled a new dataset - a digital chronicle of loss compiled from historical Google Earth imagery of Bangladesh's most vulnerable regions, including Mokterer Char Union, Kedarpur Union, Balchipara village, and Chowhali Upazila, from 2003 to 2025. Crucially, this dataset is the first to include manually annotated data on the settlements that have vanished beneath the water. Our method first uses a simple color-channel analysis to provide a rough segmentation of land and water, and then fine-tunes SAM's mask decoder to recognize the subtle signatures of riverbank erosion. The resulting model demonstrates a keen eye for this destructive process, achieving a mean Intersection over Union of 86.30% and a Dice score of 92.60% - a performance that significantly surpasses traditional methods and off-the-shelf deep learning models. This work delivers three key contributions: the first annotated dataset of disappeared settlements in Bangladesh due to river erosion; a specialized AI model fine-tuned for this critical task; and a method for quantifying land loss with compelling visual evidence. Together, these tools provide a powerful new lens through which policymakers and disaster management agencies can monitor erosion, anticipate its trajectory, and ultimately protect the vulnerable communities in its path.

75. 【2510.17197】ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models

链接：https://arxiv.org/abs/2510.17197

作者：Pu Zhang,Yuwei Li,Xingyuan Xian,Guoming Tang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：increasingly large inputs, process increasingly large, unlike in LLMs, large inputs, generates significant visual

备注：

点击查看摘要

Abstract:As the capabilities of Vision-Language Models (VLMs) advance, they can process increasingly large inputs, which, unlike in LLMs, generates significant visual token redundancy and leads to prohibitive inference costs. While many methods aim to reduce these costs by pruning visual tokens, existing approaches, whether based on attention or diversity, typically neglect the guidance of the text prompt and thus fail to prioritize task relevance. In this work, we propose a novel, zero-shot method that reframes the problem by introducing a prompt-aware perspective, explicitly modeling visual token pruning as a balance between task relevance and information diversity. Our hierarchical approach first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context. Experiments across multiple models and benchmarks show that our method achieves performance that matches or surpasses the state-of-the-art with only minimal accuracy loss, even when pruning up to 90\% of the tokens. Furthermore, these gains are accompanied by significant reductions in GPU memory footprint and inference latency.

76. 【2510.17188】HIDISC: A Hyperbolic Framework for Domain Generalization with Generalized Category Discovery

链接：https://arxiv.org/abs/2510.17188

作者：Vaibhav Rathore,Divyam Gupta,Biplab Banerjee

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Generalized Category Discovery, Generalized Category, Category Discovery, classify test-time samples, aims to classify

备注： Accpeted at NeurIPS (2025) Main Conference

点击查看摘要

Abstract:Generalized Category Discovery (GCD) aims to classify test-time samples into either seen categories** -- available during training -- or novel ones, without relying on label supervision. Most existing GCD methods assume simultaneous access to labeled and unlabeled data during training and arising from the same domain, limiting applicability in open-world scenarios involving distribution shifts. Domain Generalization with GCD (DG-GCD) lifts this constraint by requiring models to generalize to unseen domains containing novel categories, without accessing targetdomain data during training. The only prior DG-GCD method, DG2CD-Net, relies on episodic training with multiple synthetic domains and task vector aggregation, incurring high computational cost and error accumulation. We propose HIDISC, a hyperbolic representation learning framework that achieves domain and category-level generalization without episodic simulation. To expose the model to minimal but diverse domain variations, we augment the source domain using GPT-guided diffusion, avoiding overfitting while maintaining efficiency. To structure the representation space, we introduce Tangent CutMix, a curvature-aware interpolation that synthesizes pseudo-novel samples in tangent space, preserving manifold consistency. A unified loss -- combining penalized Busemann alignment, hybrid hyperbolic contrastive regularization, and adaptive outlier repulsion -- **facilitates compact, semantically structured embeddings. A learnable curvature parameter further adapts the geometry to dataset complexity. HIDISC achieves state-of-the-art results on PACS , Office-Home , and DomainNet, consistently outperforming the existing Euclidean and hyperbolic (DG)-GCD baselines.

77. 【2510.17181】Capturing Head Avatar with Hand Contacts from a Monocular Video

链接：https://arxiv.org/abs/2510.17181

作者：Haonan He,Yufeng Zheng,Jie Song

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：vital for telepresence, Photorealistic, head avatars, ignoring natural hand-face, hand-face interactions

备注： ICCV 2025

点击查看摘要

Abstract:Photorealistic 3D head avatars are vital for telepresence, gaming, and VR. However, most methods focus solely on facial regions, ignoring natural hand-face interactions, such as a hand resting on the chin or fingers gently touching the cheek, which convey cognitive states like pondering. In this work, we present a novel framework that jointly learns detailed head avatars and the non-rigid deformations induced by hand-face interactions. There are two principal challenges in this task. First, naively tracking hand and face separately fails to capture their relative poses. To overcome this, we propose to combine depth order loss with contact regularization during pose tracking, ensuring correct spatial relationships between the face and hand. Second, no publicly available priors exist for hand-induced deformations, making them non-trivial to learn from monocular videos. To address this, we learn a PCA basis specific to hand-induced facial deformations from a face-hand interaction dataset. This reduces the problem to estimating a compact set of PCA parameters rather than a full spatial deformation field. Furthermore, inspired by physics-based simulation, we incorporate a contact loss that provides additional supervision, significantly reducing interpenetration artifacts and enhancing the physical plausibility of the results. We evaluate our approach on RGB(D) videos captured by an iPhone. Additionally, to better evaluate the reconstructed geometry, we construct a synthetic dataset of avatars with various types of hand interactions. We show that our method can capture better appearance and more accurate deforming geometry of the face than SOTA surface reconstruction methods.

Comments:
ICCV 2025

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2510.17181 [cs.CV]

(or
arXiv:2510.17181v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.17181

Focus to learn more

              arXiv-issued DOI via DataCite</p>

78. 【2510.17179】Benchmarking Out-of-Distribution Detection for Plankton Recognition: A Systematic Evaluation of Advanced Methods in Marine Ecological Monitoring

链接：https://arxiv.org/abs/2510.17179

作者：Yingzi Han,Jiakai He,Chuanlong Xie,Jianping Li

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：models face significant, face significant challenges, real-world deployment due, recognition models face, Automated plankton recognition

备注：

点击查看摘要

Abstract:Automated plankton recognition models face significant challenges during real-world deployment due to distribution shifts (Out-of-Distribution, OoD) between training and test data. This stems from plankton's complex morphologies, vast species diversity, and the continuous discovery of novel species, which leads to unpredictable errors during inference. Despite rapid advancements in OoD detection methods in recent years, the field of plankton recognition still lacks a systematic integration of the latest computer vision developments and a unified benchmark for large-scale evaluation. To address this, this paper meticulously designed a series of OoD benchmarks simulating various distribution shift scenarios based on the DYB-PlanktonNet dataset \cite{875n-f104-21}, and systematically evaluated twenty-two OoD detection methods. Extensive experimental results demonstrate that the ViM \cite{wang2022vim} method significantly outperforms other approaches in our constructed benchmarks, particularly excelling in Far-OoD scenarios with substantial improvements in key metrics. This comprehensive evaluation not only provides a reliable reference for algorithm selection in automated plankton recognition but also lays a solid foundation for future research in plankton OoD detection. To our knowledge, this study marks the first large-scale, systematic evaluation and analysis of Out-of-Distribution data detection methods in plankton recognition. Code is available at this https URL.

79. 【2510.17171】Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling

链接：https://arxiv.org/abs/2510.17171

作者：Feihong Yan,Peiru Wang,Yao Zhu,Kaiyu Pang,Qingyan Wei,Huiqi Li,Linfeng Zhang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Masked Autoregressive, spatially correlated visual, potential remains constrained, correlated visual tokens, acceleration potential remains

备注： 12 pages, 6 figures

点击查看摘要

Abstract:Masked Autoregressive (MAR) models promise better efficiency in visual generation than autoregressive (AR) models for the ability of parallel generation, yet their acceleration potential remains constrained by the modeling complexity of spatially correlated visual tokens in a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation establishing global semantic scaffolding, followed by detail reconstruction efficiently completing remaining tokens. Assuming that it is more difficult to create an image from scratch than to complement images based on a basic image framework, GtR is designed to achieve acceleration by computing the reconstruction stage quickly while maintaining the generation quality by computing the generation stage slowly. Moreover, observing that tokens on the details of an image often carry more semantic information than tokens in the salient regions, we further propose Frequency-Weighted Token Selection (FTS) to offer more computation budget to tokens on image details, which are localized based on the energy of high frequency information. Extensive experiments on ImageNet class-conditional and text-to-image generation demonstrate 3.72x speedup on MAR-H while maintaining comparable quality (e.g., FID: 1.59, IS: 304.4 vs. original 1.59, 299.1), substantially outperforming existing acceleration methods across various model scales and generation tasks. Our codes will be released in this https URL.

80. 【2510.17169】Investigating Adversarial Robustness against Preprocessing used in Blackbox Face Recognition

链接：https://arxiv.org/abs/2510.17169

作者：Roland Croft,Brian Du,Darcy Joseph,Sharath Kumar

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：exposing blind spots, protecting user privacy, subtly alter benign, benign facial images, alter benign facial

备注： Accepted for publication in DICTA 2025

点击查看摘要

Abstract:Face Recognition (FR) models have been shown to be vulnerable to adversarial examples that subtly alter benign facial images, exposing blind spots in these systems, as well as protecting user privacy. End-to-end FR systems first obtain preprocessed faces from diverse facial imagery prior to computing the similarity of the deep feature embeddings. Whilst face preprocessing is a critical component of FR systems, and hence adversarial attacks against them, we observe that this preprocessing is often overlooked in blackbox settings. Our study seeks to investigate the transferability of several out-of-the-box state-of-the-art adversarial attacks against FR when applied against different preprocessing techniques used in a blackbox setting. We observe that the choice of face detection model can degrade the attack success rate by up to 78%, whereas choice of interpolation method during downsampling has relatively minimal impacts. Furthermore, we find that the requirement for facial preprocessing even degrades attack strength in a whitebox setting, due to the unintended interaction of produced noise vectors against face detection models. Based on these findings, we propose a preprocessing-invariant method using input transformations that improves the transferability of the studied attacks by up to 27%. Our findings highlight the importance of preprocessing in FR systems, and the need for its consideration towards improving the adversarial generalisation of facial adversarial examples.

81. 【2510.17157】GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image

链接：https://arxiv.org/abs/2510.17157

作者：Yinghui Wang,Xinyu Zhang,Peng Du

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：industrial concept design, holds great potential, single image holds, image holds great, Generating editable

备注：

点击查看摘要

Abstract:Generating editable, parametric CAD models from a single image holds great potential to lower the barriers of industrial concept design. However, current multi-modal large language models (MLLMs) still struggle with accurately inferring 3D geometry from 2D images due to limited spatial reasoning capabilities. We address this limitation by introducing GACO-CAD, a novel two-stage post-training framework. It is designed to achieve a joint objective: simultaneously improving the geometric accuracy of the generated CAD models and encouraging the use of more concise modeling procedures. First, during supervised fine-tuning, we leverage depth and surface normal maps as dense geometric priors, combining them with the RGB image to form a multi-channel input. In the context of single-view reconstruction, these priors provide complementary spatial cues that help the MLLM more reliably recover 3D geometry from 2D observations. Second, during reinforcement learning, we introduce a group length reward that, while preserving high geometric fidelity, promotes the generation of more compact and less redundant parametric modeling sequences. A simple dynamic weighting strategy is adopted to stabilize training. Experiments on the DeepCAD and Fusion360 datasets show that GACO-CAD achieves state-of-the-art performance under the same MLLM backbone, consistently outperforming existing methods in terms of code validity, geometric accuracy, and modeling conciseness.

82. 【2510.17148】DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment

链接：https://arxiv.org/abs/2510.17148

作者：Yu Gao,Yiru Wang,Anqing Jiang,Heng Yuwen,Wang Shuo,Sun Hao,Wang Jijun

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词：long-tail scenarios due, essential world knowledge, physically plausible trajectories, surrounding environments, world knowledge

备注：

点击查看摘要

Abstract:Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.

83. 【2510.17137】KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation

链接：https://arxiv.org/abs/2510.17137

作者：WenBo Xu,Liu Liu,Li Zhang,Ran Zhang,Hao Wu,Dan Guo,Meng Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：introduce structural diversity, exhibit significant challenges, Articulated Object Shape, pose estimation due, Object Shape Reconstruction

备注：

点击查看摘要

Abstract:Articulated objects, such as laptops and drawers, exhibit significant challenges for 3D reconstruction and pose estimation due to their multi-part geometries and variable joint configurations, which introduce structural diversity across different states. To address these challenges, we propose KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation, a unified framework for reconstructing diverse articulated instances and pose estimation from single view input. Specifically, we first encode complete geometry (SDFs), joint angles, and part segmentation into a structured latent space via a novel Kinematic-Aware VAE (KA-VAE). In addition, we employ two conditional diffusion models: one for regressing global pose (SE(3)) and joint parameters, and another for generating the kinematic-aware latent code from partial observations. Finally, we produce an iterative optimization module that bidirectionally refines reconstruction accuracy and kinematic parameters via Chamfer-distance minimization while preserving articulation constraints. Experimental results on synthetic, semi-synthetic, and real-world datasets demonstrate the effectiveness of our approach in accurately reconstructing articulated objects and estimating their kinematic properties.

84. 【2510.17131】GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection

链接：https://arxiv.org/abs/2510.17131

作者：Xin Gao,Jiyao Liu,Guanghao Li,Yueming Lyu,Jianxiong Gao,Weichen Yu,Ningsheng Xu,Liang Wang,Caifeng Shan,Ziwei Liu,Chenyang Si

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Recent advancements, advancements have explored, models for synthesizing, OOD, Recent

备注： 28 pages, 16 figures, conference

点击查看摘要

Abstract:Recent advancements have explored text-to-image diffusion models for synthesizing out-of-distribution (OOD) samples, substantially enhancing the performance of OOD detection. However, existing approaches typically rely on perturbing text-conditioned embeddings, resulting in semantic instability and insufficient shift diversity, which limit generalization to realistic OOD. To address these challenges, we propose GOOD, a novel and flexible framework that directly guides diffusion sampling trajectories towards OOD regions using off-the-shelf in-distribution (ID) classifiers. GOOD incorporates dual-level guidance: (1) Image-level guidance based on the gradient of log partition to reduce input likelihood, drives samples toward low-density regions in pixel space. (2) Feature-level guidance, derived from k-NN distance in the classifier's latent space, promotes sampling in feature-sparse regions. Hence, this dual-guidance design enables more controllable and diverse OOD sample generation. Additionally, we introduce a unified OOD score that adaptively combines image and feature discrepancies, enhancing detection robustness. We perform thorough quantitative and qualitative analyses to evaluate the effectiveness of GOOD, demonstrating that training with samples generated by GOOD can notably enhance OOD detection performance.

85. 【2510.17120】Matricial Free Energy as a Gaussianizing Regularizer: Enhancing Autoencoders for Gaussian Code Generation

链接：https://arxiv.org/abs/2510.17120

作者：Rishi Sonthalia,Raj Rao Nadakuditi

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

关键词：regularization scheme, matricial free energy, free energy, free, matricial free

备注：

点击查看摘要

Abstract:We introduce a novel regularization scheme for autoencoders based on matricial free energy. Our approach defines a differentiable loss function in terms of the singular values of the code matrix (code dimension x batch size). From the standpoint of free probability an d random matrix theory, this loss achieves its minimum when the singular value distribution of the code matrix coincides with that of an appropriately sculpted random metric with i.i.d. Gaussian entries. Empirical simulations demonstrate that minimizing the negative matricial free energy through standard stochastic gradient-based training yields Gaussian-like codes that generalize across training and test sets. Building on this foundation, we propose a matricidal free energy maximizing autoencoder that reliably produces Gaussian codes and show its application to underdetermined inverse problems.

86. 【2510.17114】owards Imperceptible Watermarking Via Environment Illumination for Consumer Cameras

链接：https://arxiv.org/abs/2510.17114

作者：Hodaka Kawachi,Tomoya Nakamura,Hiroaki Santo,SaiKiran Kumar Tedla,Trevor Dalton Canham,Yasushi Yagi,Michael S. Brown

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：LED-based environmental lighting, produce visually imperceptible, visually imperceptible watermarks, paper introduces, LED-based environmental

备注：

点击查看摘要

Abstract:This paper introduces a method for using LED-based environmental lighting to produce visually imperceptible watermarks for consumer cameras. Our approach optimizes an LED light source's spectral profile to be minimally visible to the human eye while remaining highly detectable by typical consumer cameras. The method jointly considers the human visual system's sensitivity to visible spectra, modern consumer camera sensors' spectral sensitivity, and narrowband LEDs' ability to generate broadband spectra perceived as "white light" (specifically, D65 illumination). To ensure imperceptibility, we employ spectral modulation rather than intensity modulation. Unlike conventional visible light communication, our approach enables watermark extraction at standard low frame rates (30-60 fps). While the information transfer rate is modest-embedding 128 bits within a 10-second video clip-this capacity is sufficient for essential metadata supporting privacy protection and content verification.

87. 【2510.17105】Boosting Fidelity for Pre-Trained-Diffusion-Based Low-Light Image Enhancement via Condition Refinement

链接：https://arxiv.org/abs/2510.17105

作者：Xiaogang Xu,Jian Wang,Yunfan Lu,Ruihang Chu,Ruixing Wang,Jiafei Wu,Bei Yu,Liang Lin

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：low-level vision tasks, leveraging pre-trained large, achieved remarkable performance, vision tasks, Stable Diffusion

备注：

点击查看摘要

Abstract:Diffusion-based methods, leveraging pre-trained large models like Stable Diffusion via ControlNet, have achieved remarkable performance in several low-level vision tasks. However, Pre-Trained Diffusion-Based (PTDB) methods often sacrifice content fidelity to attain higher perceptual realism. This issue is exacerbated in low-light scenarios, where severely degraded information caused by the darkness limits effective control. We identify two primary causes of fidelity loss: the absence of suitable conditional latent modeling and the lack of bidirectional interaction between the conditional latent and noisy latent in the diffusion process. To address this, we propose a novel optimization strategy for conditioning in pre-trained diffusion models, enhancing fidelity while preserving realism and aesthetics. Our method introduces a mechanism to recover spatial details lost during VAE encoding, i.e., a latent refinement pipeline incorporating generative priors. Additionally, the refined latent condition interacts dynamically with the noisy latent, leading to improved restoration performance. Our approach is plug-and-play, seamlessly integrating into existing diffusion networks to provide more effective control. Extensive experiments demonstrate significant fidelity improvements in PTDB methods.

88. 【2510.17101】Shape-aware Inertial Poser: Motion Tracking for Humans with Diverse Shapes Using Sparse Inertial Sensors

链接：https://arxiv.org/abs/2510.17101

作者：Lu Yin,Ziying Shi,Yinghao Wu,Xinyu Yi,Feng Xu,Shihui Guo

类目：Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词：significant attention recently, gained significant attention, Human motion capture, Human motion, body

备注： Accepted by SIGGRAPH Asia 2025 (TOG)

点击查看摘要

Abstract:Human motion capture with sparse inertial sensors has gained significant attention recently. However, existing methods almost exclusively rely on a template adult body shape to model the training data, which poses challenges when generalizing to individuals with largely different body shapes (such as a child). This is primarily due to the variation in IMU-measured acceleration caused by changes in body shape. To fill this gap, we propose Shape-aware Inertial Poser (SAIP), the first solution considering body shape differences in sparse inertial-based motion capture. Specifically, we decompose the sensor measurements related to shape and pose in order to effectively model their joint correlations. Firstly, we train a regression model to transfer the IMU-measured accelerations of a real body to match the template adult body model, compensating for the shape-related sensor measurements. Then, we can easily follow the state-of-the-art methods to estimate the full body motions of the template-shaped body. Finally, we utilize a second regression model to map the joint velocities back to the real body, combined with a shape-aware physical optimization strategy to calculate global motions on the subject. Furthermore, our method relies on body shape awareness, introducing the first inertial shape estimation scheme. This is accomplished by modeling the shape-conditioned IMU-pose correlation using an MLP-based network. To validate the effectiveness of SAIP, we also present the first IMU motion capture dataset containing individuals of different body sizes. This dataset features 10 children and 10 adults, with heights ranging from 110 cm to 190 cm, and a total of 400 minutes of paired IMU-Motion samples. Extensive experimental results demonstrate that SAIP can effectively handle motion capture tasks for diverse body shapes. The code and dataset are available at this https URL.

89. 【2510.17095】GSPlane: Concise and Accurate Planar Reconstruction via Structured Representation

链接：https://arxiv.org/abs/2510.17095

作者：Ruitong Gan,Junran Peng,Yang Liu,Chuanchen Luo,Qing Li,Zhaoxiang Zhang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：urban streets, fundamental primitives, man-made environments, indoor spaces, spaces and urban

备注：

点击查看摘要

Abstract:Planes are fundamental primitives of 3D sences, especially in man-made environments such as indoor spaces and urban streets. Representing these planes in a structured and parameterized format facilitates scene editing and physical simulations in downstream applications. Recently, Gaussian Splatting (GS) has demonstrated remarkable effectiveness in the Novel View Synthesis task, with extensions showing great potential in accurate surface reconstruction. However, even state-of-the-art GS representations often struggle to reconstruct planar regions with sufficient smoothness and precision. To address this issue, we propose GSPlane, which recovers accurate geometry and produces clean and well-structured mesh connectivity for plane regions in the reconstructed scene. By leveraging off-the-shelf segmentation and normal prediction models, GSPlane extracts robust planar priors to establish structured representations for planar Gaussian coordinates, which help guide the training process by enforcing geometric consistency. To further enhance training robustness, a Dynamic Gaussian Re-classifier is introduced to adaptively reclassify planar Gaussians with persistently high gradients as non-planar, ensuring more reliable optimization. Furthermore, we utilize the optimized planar priors to refine the mesh layouts, significantly improving topological structure while reducing the number of vertices and faces. We also explore applications of the structured planar representation, which enable decoupling and flexible manipulation of objects on supportive planes. Extensive experiments demonstrate that, with no sacrifice in rendering quality, the introduction of planar priors significantly improves the geometric accuracy of the extracted meshes across various baselines.

90. 【2510.17078】owards a Generalizable Fusion Architecture for Multimodal Object Detection

链接：https://arxiv.org/abs/2510.17078

作者：Jad Berjawi,Yoann Dupas,Christophe C'erin

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：multiple sensor modalities, leveraging complementary cues, Modal Cross Attention, introduce Filtered Multi, robustness in chal

备注： 8 pages, 8 figures, accepted at ICCV 2025 MIRA Workshop

点击查看摘要

Abstract:Multimodal object detection improves robustness in chal- lenging conditions by leveraging complementary cues from multiple sensor modalities. We introduce Filtered Multi- Modal Cross Attention Fusion (FMCAF), a preprocess- ing architecture designed to enhance the fusion of RGB and infrared (IR) inputs. FMCAF combines a frequency- domain filtering block (Freq-Filter) to suppress redun- dant spectral features with a cross-attention-based fusion module (MCAF) to improve intermodal feature sharing. Unlike approaches tailored to specific datasets, FMCAF aims for generalizability, improving performance across different multimodal challenges without requiring dataset- specific tuning. On LLVIP (low-light pedestrian detec- tion) and VEDAI (aerial vehicle detection), FMCAF outper- forms traditional fusion (concatenation), achieving +13.9% mAP@50 on VEDAI and +1.1% on LLVIP. These results support the potential of FMCAF as a flexible foundation for robust multimodal fusion in future detection pipelines.

91. 【2510.17068】ProDAT: Progressive Density-Aware Tail-Drop for Point Cloud Coding

链接：https://arxiv.org/abs/2510.17068

作者：Zhe Luo,Wenjing Jia,Stuart Perry

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：demanding real-time processing, augmented reality, autonomous driving, immersive communication, demanding real-time

备注：

点击查看摘要

Abstract:Three-dimensional (3D) point clouds are becoming increasingly vital in applications such as autonomous driving, augmented reality, and immersive communication, demanding real-time processing and low latency. However, their large data volumes and bandwidth constraints hinder the deployment of high-quality services in resource-limited environments. Progres- sive coding, which allows for decoding at varying levels of detail, provides an alternative by allowing initial partial decoding with subsequent refinement. Although recent learning-based point cloud geometry coding methods have achieved notable success, their fixed latent representation does not support progressive decoding. To bridge this gap, we propose ProDAT, a novel density-aware tail-drop mechanism for progressive point cloud coding. By leveraging density information as a guidance signal, latent features and coordinates are decoded adaptively based on their significance, therefore achieving progressive decoding at multiple bitrates using one single model. Experimental results on benchmark datasets show that the proposed ProDAT not only enables progressive coding but also achieves superior coding efficiency compared to state-of-the-art learning-based coding techniques, with over 28.6% BD-rate improvement for PSNR- D2 on SemanticKITTI and over 18.15% for ShapeNet

92. 【2510.17051】How Universal Are SAM2 Features?

链接：https://arxiv.org/abs/2510.17051

作者：Masoud Khairi Atani,Alon Harell,Hyomin Choi,Runyu Yang,Fabien Racape,Ivan V. Bajic

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：foundation vision models, fully understood, counterparts is critical, vision models, general-purpose foundation vision

备注： This work has been accepted for publication in IEEE Picture Coding Symposium (PCS) 2025

点击查看摘要

Abstract:The trade-off between general-purpose foundation vision models and their specialized counterparts is critical for efficient feature coding design and is not yet fully understood. We investigate this trade-off by comparing the feature versatility of the general-purpose Hiera encoder against the segmentation-specialized Segment Anything Model 2 (SAM2). Using a lightweight, trainable neck to probe the adaptability of their frozen features, we quantify the information-theoretic cost of specialization. Our results reveal that while SAM2's specialization is highly effective for spatially-related tasks like depth estimation, it comes at a cost. The specialized SAM2 encoder underperforms its generalist predecessor, Hiera, on conceptually distant tasks such as pose estimation and image captioning, demonstrating a measurable loss of broader semantic information. A novel cross-neck analysis on SAM2 reveals that each level of adaptation creates a further representational bottleneck. Our analysis illuminates these trade-offs in feature universality, providing a quantitative foundation for designing efficient feature coding and adaptation strategies for diverse downstream applications.

93. 【2510.17045】Video Reasoning without Training

链接：https://arxiv.org/abs/2510.17045

作者：Deepak Sridhar,Kartikeya Bhardwaj,Jeya Pradha Jeyaraj,Nuno Vasconcelos,Ankita Nayak,Harris Teague

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Large Multimodal Models, Large Multimodal, costly reinforcement learning, substantial computational overhead, Multimodal Models

备注：

点击查看摘要

Abstract:Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, using entropy of the model's output as a signal, we discover that the high-quality models go through a series of micro-explorations and micro-exploitations which keep the reasoning process grounded (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We further observe that once this "thinking" process is over, more accurate models demonstrate a better convergence by reducing the entropy significantly via a final exploitation phase (i.e., a more certain convergence towards a solution trajectory). We then use these novel, theoretically-grounded insights to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Specifically, during inference, our proposed approach called V-Reason (Video-Reason) adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model's micro-exploration and exploitation behavior during inference. Our experiments show that our proposed method achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering massive efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.

94. 【2510.17043】Person Re-Identification via Generalized Class Prototypes

链接：https://arxiv.org/abs/2510.17043

作者：Md Ahmed Al Muzaddid,William J. Beksi

类目：Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词：Advanced feature extraction, Advanced feature, feature extraction methods, feature extraction, significantly contributed

备注： 18 pages, 11 figures, and 4 tables

点击查看摘要

Abstract:Advanced feature extraction methods have significantly contributed to enhancing the task of person re-identification. In addition, modifications to objective functions have been developed to further improve performance. Nonetheless, selecting better class representatives is an underexplored area of research that can also lead to advancements in re-identification performance. Although past works have experimented with using the centroid of a gallery image class during training, only a few have investigated alternative representations during the retrieval stage. In this paper, we demonstrate that these prior techniques yield suboptimal results in terms of re-identification metrics. To address the re-identification problem, we propose a generalized selection method that involves choosing representations that are not limited to class centroids. Our approach strikes a balance between accuracy and mean average precision, leading to improvements beyond the state of the art. For example, the actual number of representations per class can be adjusted to meet specific application requirements. We apply our methodology on top of multiple re-identification embeddings, and in all cases it substantially improves upon contemporary results

95. 【2510.17039】Click, Predict, Trust: Clinician-in-the-Loop AI Segmentation for Lung Cancer CT-Based Prognosis within the Knowledge-to-Action Framework

链接：https://arxiv.org/abs/2510.17039

作者：Mohammad R. Salmanpour,Sonya Falahati,Amir Hossein Pouria,Amin Mousavi,Somayeh Sadat Mehrnia,Morteza Alizadeh,Arman Gorji,Zeinab Farsangi,Alireza Safarian,Mehdi Maghsudi,Carlos Uribe,Arman Rahmim,Ren Yuan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：central to screening, remains the leading, imaging central, Lung cancer remains, cancer mortality

备注： 13 pages, 2 figures, and 2 tables

点击查看摘要

Abstract:Lung cancer remains the leading cause of cancer mortality, with CT imaging central to screening, prognosis, and treatment. Manual segmentation is variable and time-intensive, while deep learning (DL) offers automation but faces barriers to clinical adoption. Guided by the Knowledge-to-Action framework, this study develops a clinician-in-the-loop DL pipeline to enhance reproducibility, prognostic accuracy, and clinical trust. Multi-center CT data from 999 patients across 12 public datasets were analyzed using five DL models (3D Attention U-Net, ResUNet, VNet, ReconNet, SAM-Med3D), benchmarked against expert contours on whole and click-point cropped images. Segmentation reproducibility was assessed using 497 PySERA-extracted radiomic features via Spearman correlation, ICC, Wilcoxon tests, and MANOVA, while prognostic modeling compared supervised (SL) and semi-supervised learning (SSL) across 38 dimensionality reduction strategies and 24 classifiers. Six physicians qualitatively evaluated masks across seven domains, including clinical meaningfulness, boundary quality, prognostic value, trust, and workflow integration. VNet achieved the best performance (Dice = 0.83, IoU = 0.71), radiomic stability (mean correlation = 0.76, ICC = 0.65), and predictive accuracy under SSL (accuracy = 0.88, F1 = 0.83). SSL consistently outperformed SL across models. Radiologists favored VNet for peritumoral representation and smoother boundaries, preferring AI-generated initial masks for refinement rather than replacement. These results demonstrate that integrating VNet with SSL yields accurate, reproducible, and clinically trusted CT-based lung cancer prognosis, highlighting a feasible path toward physician-centered AI translation.

96. 【2510.17038】DINO-CVA: A Multimodal Goal-Conditioned Vision-to-Action Model for Autonomous Catheter Navigation

链接：https://arxiv.org/abs/2510.17038

作者：Pedram Fekri,Majid Roshanfar,Samuel Barbeau,Seyedfarzad Famouri,Thomas Looi,Dale Podolsky,Mehrdad Zadeh,Javad Dargahi

类目：Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：Cardiac catheterization remains, minimally invasive interventions, Cardiac catheterization, invasive interventions, manual operation

备注：

点击查看摘要

Abstract:Cardiac catheterization remains a cornerstone of minimally invasive interventions, yet it continues to rely heavily on manual operation. Despite advances in robotic platforms, existing systems are predominantly follow-leader in nature, requiring continuous physician input and lacking intelligent autonomy. This dependency contributes to operator fatigue, more radiation exposure, and variability in procedural outcomes. This work moves towards autonomous catheter navigation by introducing DINO-CVA, a multimodal goal-conditioned behavior cloning framework. The proposed model fuses visual observations and joystick kinematics into a joint embedding space, enabling policies that are both vision-aware and kinematic-aware. Actions are predicted autoregressively from expert demonstrations, with goal conditioning guiding navigation toward specified destinations. A robotic experimental setup with a synthetic vascular phantom was designed to collect multimodal datasets and evaluate performance. Results show that DINO-CVA achieves high accuracy in predicting actions, matching the performance of a kinematics-only baseline while additionally grounding predictions in the anatomical environment. These findings establish the feasibility of multimodal, goal-conditioned architectures for catheter navigation, representing an important step toward reducing operator dependency and improving the reliability of catheterbased therapies.

97. 【2510.17035】Conditional Synthetic Live and Spoof Fingerprint Generation

链接：https://arxiv.org/abs/2510.17035

作者：Syed Konain Abbas,Sandip Purnapatra,M. G. Sarwar Murshed,Conor Miller-Lynch,Lambert Igene,Soumyabrata Dey,Stephanie Schuckers,Faraz Hussain

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Large fingerprint datasets, strict privacy measures, require strict privacy, Large fingerprint, training and evaluation

备注：

点击查看摘要

Abstract:Large fingerprint datasets, while important for training and evaluation, are time-consuming and expensive to collect and require strict privacy measures. Researchers are exploring the use of synthetic fingerprint data to address these issues. This paper presents a novel approach for generating synthetic fingerprint images (both spoof and live), addressing concerns related to privacy, cost, and accessibility in biometric data collection. Our approach utilizes conditional StyleGAN2-ADA and StyleGAN3 architectures to produce high-resolution synthetic live fingerprints, conditioned on specific finger identities (thumb through little finger). Additionally, we employ CycleGANs to translate these into realistic spoof fingerprints, simulating a variety of presentation attack materials (e.g., EcoFlex, Play-Doh). These synthetic spoof fingerprints are crucial for developing robust spoof detection systems. Through these generative models, we created two synthetic datasets (DB2 and DB3), each containing 1,500 fingerprint images of all ten fingers with multiple impressions per finger, and including corresponding spoofs in eight material types. The results indicate robust performance: our StyleGAN3 model achieves a Fréchet Inception Distance (FID) as low as 5, and the generated fingerprints achieve a True Accept Rate of 99.47% at a 0.01% False Accept Rate. The StyleGAN2-ADA model achieved a TAR of 98.67% at the same 0.01% FAR. We assess fingerprint quality using standard metrics (NFIQ2, MINDTCT), and notably, matching experiments confirm strong privacy preservation, with no significant evidence of identity leakage, confirming the strong privacy-preserving properties of our synthetic datasets.

98. 【2510.17034】Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding

链接：https://arxiv.org/abs/2510.17034

作者：Yutong Zhong

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：garnered considerable interest, advancing spatial reasoning, complex environments, garnered considerable, considerable interest

备注：

点击查看摘要

Abstract:Multimodal 3D grounding has garnered considerable interest in Vision-Language Models (VLMs) \cite{yin2025spatial} for advancing spatial reasoning in complex environments. However, these models suffer from a severe "2D semantic bias" that arises from over-reliance on 2D image features for coarse localization, largely disregarding 3D geometric inputs and resulting in suboptimal fusion performance. In this paper, we propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue via disentangled representation learning and targeted shortcut suppression. Our approach fundamentally reshapes the model's internal space by designating 2D features as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization, enabling precise 3D grounding without modifying inference architecture. Key components include a dual-objective loss function with an Alignment Loss that supervises fused predictions using adapted cross-entropy for multimodal synergy, and a Pseudo-Label Loss that penalizes overly effective 2D-dominant pseudo-outputs via a margin-based mechanism. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness, particularly in cluttered outdoor scenes.

99. 【2510.17023】Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

链接：https://arxiv.org/abs/2510.17023

作者：Shraman Pramanick,Effrosyni Mavroudi,Yale Song,Rama Chellappa,Lorenzo Torresani,Triantafyllos Afouras

类目：Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词：utilizing multi-modal large, introduce ED-VTG, multi-modal large language, grounding utilizing multi-modal, utilizing multi-modal

备注： ICCV 2025 (Highlights)

点击查看摘要

Abstract:We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video, in order to effectively localize natural language queries in videos through a two-stage process. Rather than being directly grounded, language queries are initially transformed into enriched sentences that incorporate missing details and cues to aid in grounding. In the second stage, these enriched queries are grounded, using a lightweight decoder, which specializes at predicting accurate boundaries conditioned on contextualized representations of the enriched queries. To mitigate noise and reduce the impact of hallucinations, our model is trained with a multiple-instance-learning objective that dynamically selects the optimal version of the query for each training sample. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings. Experiments reveal that our method significantly outperforms all previously proposed LLM-based temporal grounding approaches and is either superior or comparable to specialized models, while maintaining a clear advantage against them in zero-shot evaluation scenarios.

100. 【2510.17014】Do Satellite Tasks Need Special Pretraining?

链接：https://arxiv.org/abs/2510.17014

作者：Ani Vanyan,Alvard Barseghyan,Hakob Tamazyan,Tigran Galstyan,Vahan Huroyan,Naira Hovakimyan,Hrant Khachatrian

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：advanced machine learning, Foundation models, advanced machine, machine learning, remote sensing

备注：

点击查看摘要

Abstract:Foundation models have advanced machine learning across various modalities, including images. Recently multiple teams trained foundation models specialized for remote sensing applications. This line of research is motivated by the distinct characteristics of remote sensing imagery, specific applications and types of robustness useful for satellite image analysis. In this work we systematically challenge the idea that specific foundation models are more useful than general-purpose vision foundation models, at least in the small scale. First, we design a simple benchmark that measures generalization of remote sensing models towards images with lower resolution for two downstream tasks. Second, we train iBOT, a self-supervised vision encoder, on MillionAID, an ImageNet-scale satellite imagery dataset, with several modifications specific to remote sensing. We show that none of those pretrained models bring consistent improvements upon general-purpose baselines at the ViT-B scale.

101. 【2510.17007】An empirical study of the effect of video encoders on Temporal Video Grounding

链接：https://arxiv.org/abs/2510.17007

作者：Ignacio M. De la Jara,Cristian Rodriguez-Opazo,Edison Marrese-Taylor,Felipe Bravo-Marquez

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：natural language query, Temporal video grounding, computer vision, aiming to localize, localize a natural

备注：

点击查看摘要

Abstract:Temporal video grounding is a fundamental task in computer vision, aiming to localize a natural language query in a long, untrimmed video. It has a key role in the scientific community, in part due to the large amount of video generated every day. Although we find extensive work in this task, we note that research remains focused on a small selection of video representations, which may lead to architectural overfitting in the long run. To address this issue, we propose an empirical study to investigate the impact of different video features on a classical architecture. We extract features for three well-known benchmarks, Charades-STA, ActivityNet-Captions and YouCookII, using video encoders based on CNNs, temporal reasoning and transformers. Our results show significant differences in the performance of our model by simply changing the video encoder, while also revealing clear patterns and errors derived from the use of certain features, ultimately indicating potential feature complementarity.

102. 【2510.16989】raining-free Online Video Step Grounding

链接：https://arxiv.org/abs/2510.16989

作者：Luca Zanella,Massimiliano Mancini,Yiming Wang,Alessio Tonioni,Elisa Ricci

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：aims to detect, Large Multimodal Models, Large Multimodal, Multimodal Models, Video

备注： NeurIPS 2025. Project website at [this https URL](https://lucazanella.github.io/baglm/)

点击查看摘要

Abstract:Given a task and a set of steps composing it, Video Step Grounding (VSG) aims to detect which steps are performed in a video. Standard approaches for this task require a labeled training set (e.g., with step-level annotations or narrations), which may be costly to collect. Moreover, they process the full video offline, limiting their applications for scenarios requiring online decisions. Thus, in this work, we explore how to perform VSG online and without training. We achieve this by exploiting the zero-shot capabilities of recent Large Multimodal Models (LMMs). In particular, we use LMMs to predict the step associated with a restricted set of frames, without access to the whole video. We show that this online strategy without task-specific tuning outperforms offline and training-based models. Motivated by this finding, we develop Bayesian Grounding with Large Multimodal Models (BaGLM), further injecting knowledge of past frames into the LMM-based predictions. BaGLM exploits Bayesian filtering principles, modeling step transitions via (i) a dependency matrix extracted through large language models and (ii) an estimation of step progress. Experiments on three datasets show superior performance of BaGLM over state-of-the-art training-based offline methods.

103. 【2510.16988】CARE: Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams

链接：https://arxiv.org/abs/2510.16988

作者：Junhao Zhao,Zishuai Liu,Ruili Fang,Jin Lu,Linghan Zhang,Fei Dou

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Ambient Assisted Living, Daily Living, Assisted Living, existing methods remain, methods remain constrained

备注：

点击查看摘要

Abstract:The recognition of Activities of Daily Living (ADLs) from event-triggered ambient sensors is an essential task in Ambient Assisted Living, yet existing methods remain constrained by representation-level limitations. Sequence-based approaches preserve temporal order of sensor activations but are sensitive to noise and lack spatial awareness, while image-based approaches capture global patterns and implicit spatial correlations but compress fine-grained temporal dynamics and distort sensor layouts. Naive fusion (e.g., feature concatenation) fail to enforce alignment between sequence- and image-based representation views, underutilizing their complementary strengths. We propose Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams (CARE), an end-to-end framework that jointly optimizes representation learning via Sequence-Image Contrastive Alignment (SICA) and classification via cross-entropy, ensuring both cross-representation alignment and task-specific discriminability. CARE integrates (i) time-aware, noise-resilient sequence encoding with (ii) spatially-informed and frequency-sensitive image representations, and employs (iii) a joint contrastive-classification objective for end-to-end learning of aligned and discriminative embeddings. Evaluated on three CASAS datasets, CARE achieves state-of-the-art performance (89.8% on Milan, 88.9% on Cairo, and 73.3% on Kyoto7) and demonstrates robustness to sensor malfunctions and layout variability, highlighting its potential for reliable ADL recognition in smart homes.

104. 【2510.16983】One-step Diffusion Models with Bregman Density Ratio Matching

链接：https://arxiv.org/abs/2510.16983

作者：Yuanzhi Zhu,Eleftherios Tsonis,Lucas Degeorge,Vicky Kalogeiton

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：slow multi-step sampling, remain computationally expensive, computationally expensive due, high generative quality, multi-step sampling

备注： work in progress

点击查看摘要

Abstract:Diffusion and flow models achieve high generative quality but remain computationally expensive due to slow multi-step sampling. Distillation methods accelerate them by training fast student generators, yet most existing objectives lack a unified theoretical foundation. In this work, we propose Di-Bregman, a compact framework that formulates diffusion distillation as Bregman divergence-based density-ratio matching. This convex-analytic view connects several existing objectives through a common lens. Experiments on CIFAR-10 and text-to-image generation demonstrate that Di-Bregman achieves improved one-step FID over reverse-KL distillation and maintains high visual fidelity compared to the teacher model. Our results highlight Bregman density-ratio matching as a practical and theoretically-grounded route toward efficient one-step diffusion generation.

105. 【2510.16973】Foundation Models in Medical Image Analysis: A Systematic Review and Meta-Analysis

链接：https://arxiv.org/abs/2510.16973

作者：Praveenbalaji Rajendran,Mojtaba Safari,Wenfeng He,Mingzhe Hu,Shansong Wang,Jun Zhou,Xiaofeng Yang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)

关键词：Recent advancements, diverse medical imaging, revolutionized medical image, medical image analysis, artificial intelligence

备注：

点击查看摘要

Abstract:Recent advancements in artificial intelligence (AI), particularly foundation models (FMs), have revolutionized medical image analysis, demonstrating strong zero- and few-shot performance across diverse medical imaging tasks, from segmentation to report generation. Unlike traditional task-specific AI models, FMs leverage large corpora of labeled and unlabeled multimodal datasets to learn generalized representations that can be adapted to various downstream clinical applications with minimal fine-tuning. However, despite the rapid proliferation of FM research in medical imaging, the field remains fragmented, lacking a unified synthesis that systematically maps the evolution of architectures, training paradigms, and clinical applications across modalities. To address this gap, this review article provides a comprehensive and structured analysis of FMs in medical image analysis. We systematically categorize studies into vision-only and vision-language FMs based on their architectural foundations, training strategies, and downstream clinical tasks. Additionally, a quantitative meta-analysis of the studies was conducted to characterize temporal trends in dataset utilization and application domains. We also critically discuss persistent challenges, including domain adaptation, efficient fine-tuning, computational constraints, and interpretability along with emerging solutions such as federated learning, knowledge distillation, and advanced prompting. Finally, we identify key future research directions aimed at enhancing the robustness, explainability, and clinical integration of FMs, thereby accelerating their translation into real-world medical practice.

106. 【2510.16948】Unlocking Off-the-Grid Sparse Recovery with Unlimited Sensing: Simultaneous Super-Resolution in Time and Amplitude

链接：https://arxiv.org/abs/2510.16948

作者：Ruiming Guo,Ayush Bhandari

类目：Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

关键词：Dirac impulses, signal processing, classical problem, problem in signal, filtered measurements

备注： 28 Pages, 10 figures. To appear in IEEE Journal of Selected Topics in Signal Processing

点击查看摘要

Abstract:The recovery of Dirac impulses, or spikes, from filtered measurements is a classical problem in signal processing. As the spikes lie in the continuous domain while measurements are discrete, this task is known as super-resolution or off-the-grid sparse recovery. Despite significant theoretical and algorithmic advances over the past decade, these developments often overlook critical challenges at the analog-digital interface. In particular, when spikes exhibit strong-weak amplitude disparity, conventional digital acquisition may result in clipping of strong components or loss of weak ones beneath the quantization noise floor. This motivates a broader perspective: super-resolution must simultaneously resolve both amplitude and temporal structure. Under a fixed bit budget, such information loss is unavoidable. In contrast, the emerging theory and practice of the Unlimited Sensing Framework (USF) demonstrate that these fundamental limitations can be overcome. Building on this foundation, we demonstrate that modulo encoding within USF enables digital super-resolution by enhancing measurement precision, thereby unlocking temporal super-resolution beyond conventional limits. We develop new theoretical results that extend to non-bandlimited kernels commonly encountered in practice and introduce a robust algorithm for off-the-grid sparse recovery. To demonstrate practical impact, we instantiate our framework in the context of time-of-flight imaging. Both numerical simulations and hardware experiments validate the effectiveness of our approach under low-bit quantization, enabling super-resolution in amplitude and time.

107. 【2510.16926】Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input

链接：https://arxiv.org/abs/2510.16926

作者：Chenxu Li,Zhicai Wang,Yuan Sheng,Xingyu Zhu,Yanbin Hao,Xiang Wang

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

备注： 23 pages,19 figures

点击查看摘要

108. 【2510.16914】Domain Generalizable Continual Learning

链接：https://arxiv.org/abs/2510.16914

作者：Hongwei Yan,Guanglong Sun,Zhiqi Kang,Yi Zhong,Liyuan Wang

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：dynamic real-world environments, unseen scenarios, real-world environments, intelligent systems, adapt effectively

备注： 25 pages

点击查看摘要

Abstract:To adapt effectively to dynamic real-world environments, intelligent systems must continually acquire new skills while generalizing them to diverse, unseen scenarios. Here, we introduce a novel and realistic setting named domain generalizable continual learning (DGCL): a model learns sequential tasks with each involving a single domain, aiming to perform well across all encountered tasks and domains. This setting poses unique challenges in acquiring, retaining, and leveraging both semantic- and domain-relevant information for robust generalization. Although state-of-the-art continual learning (CL) methods have employed pre-trained models (PTMs) to enhance task-specific generalization, they typically assume identical training and testing domains for each task and therefore perform poorly in DGCL. To this end, we propose adaptive Domain Transformation (DoT), an innovative PTMs-based approach tailored to DGCL. Inspired by the distributed-plus-hub theory of the human brain, DoT disentangles semantic- and domain-relevant information in representation learning, and adaptively transforms task representations across various domains for output alignment, ensuring balanced and generalized predictions. DoT serves as a plug-in strategy that greatly facilitates state-of-the-art CL baselines under both full parameter tuning and parameter-efficient tuning paradigms in DGCL, validated by extensive experiments. Also, DoT is shown to accumulate domain-generalizable knowledge from DGCL, and ensure resource efficiency with a lightweight implementation.

109. 【2510.16913】Beyond RGB: Leveraging Vision Transformers for Thermal Weapon Segmentation

链接：https://arxiv.org/abs/2510.16913

作者：Akhila Kambhatla,Ahmed R Khaled

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：RGB-based systems fail, visually obscured conditions, Thermal weapon segmentation, systems fail, lowlight and visually

备注： 9 Images with 1 figure and 3 Tables. This is a preprint submitted to arXiv

点击查看摘要

Abstract:Thermal weapon segmentation is crucial for surveillance and security applications, enabling robust detection under lowlight and visually obscured conditions where RGB-based systems fail. While convolutional neural networks (CNNs) dominate thermal segmentation literature, their ability to capture long-range dependencies and fine structural details is limited. Vision Transformers (ViTs), with their global context modeling capabilities, have achieved state-of-the-art results in RGB segmentation tasks, yet their potential in thermal weapon segmentation remains underexplored. This work adapts and evaluates four transformer-based architectures SegFormer, DeepLabV3\+, SegNeXt, and Swin Transformer for binary weapon segmentation on a custom thermal dataset comprising 9,711 images collected from real world surveillance videos and automatically annotated using SAM2. We employ standard augmentation strategies within the MMSegmentation framework to ensure robust model training and fair architectural comparison. Experimental results demonstrate significant improvements in segmentation performance: SegFormer-b5 achieves the highest mIoU (94.15\%) and Pixel Accuracy (97.04\%), while SegFormer-b0 provides the fastest inference speed (98.32 FPS) with competitive mIoU (90.84\%). SegNeXt-mscans offers balanced performance with 85.12 FPS and 92.24\% mIoU, and DeepLabV3\+ R101-D8 reaches 92.76\% mIoU at 29.86 FPS. The transformer architectures demonstrate robust generalization capabilities for weapon detection in low-light and occluded thermal environments, with flexible accuracy-speed trade-offs suitable for diverse real-time security applications.

110. 【2510.16891】Contrail-to-Flight Attribution Using Ground Visible Cameras and Flight Surveillance Data

链接：https://arxiv.org/abs/2510.16891

作者：Ramon Dalmau,Gabriel Jarry,Philippe Very

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：climate impact, significant contributor, contrails, Aviation, temporal resolution

备注：

点击查看摘要

Abstract:Aviation's non-CO2 effects, particularly contrails, are a significant contributor to its climate impact. Persistent contrails can evolve into cirrus-like clouds that trap outgoing infrared radiation, with radiative forcing potentially comparable to or exceeding that of aviation's CO2 emissions. While physical models simulate contrail formation, evolution and dissipation, validating and calibrating these models requires linking observed contrails to the flights that generated them, a process known as contrail-to-flight attribution. Satellite-based attribution is challenging due to limited spatial and temporal resolution, as contrails often drift and deform before detection. In this paper, we evaluate an alternative approach using ground-based cameras, which capture contrails shortly after formation at high spatial and temporal resolution, when they remain thin, linear, and visually distinct. Leveraging the ground visible camera contrail sequences (GVCCS) dataset, we introduce a modular framework for attributing contrails observed using ground-based cameras to theoretical contrails derived from aircraft surveillance and meteorological data. The framework accommodates multiple geometric representations and distance metrics, incorporates temporal smoothing, and enables flexible probability-based assignment strategies. This work establishes a strong baseline and provides a modular framework for future research in linking contrails to their source flight.

111. 【2510.16888】Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

链接：https://arxiv.org/abs/2510.16888

作者：Zongjian Li,Zheyuan Liu,Qihui Zhang,Bin Lin,Shenghai Yuan,Zhiyuan Yan,Yang Ye,Wangbo Yu,Yuwei Niu,Li Yuan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：achieved remarkable progress, Instruction-based image editing, Instruction-based image, Diffusion Negative-aware Finetuning, remarkable progress

备注：

点击查看摘要

Abstract:Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves \textbf{state-of-the-art} results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available at this https URL.

112. 【2510.16887】Class-N-Diff: Classification-Induced Diffusion Model Can Make Fair Skin Cancer Diagnosis

链接：https://arxiv.org/abs/2510.16887

作者：Nusrat Munia,Abdullah Imran

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：demonstrated remarkable capability, high-quality synthetic data, generating high-quality synthetic, Generative models, including medical images

备注： EMBC 2025

点击查看摘要

Abstract:Generative models, especially Diffusion Models, have demonstrated remarkable capability in generating high-quality synthetic data, including medical images. However, traditional class-conditioned generative models often struggle to generate images that accurately represent specific medical categories, limiting their usefulness for applications such as skin cancer diagnosis. To address this problem, we propose a classification-induced diffusion model, namely, Class-N-Diff, to simultaneously generate and classify dermoscopic images. Our Class-N-Diff model integrates a classifier within a diffusion model to guide image generation based on its class conditions. Thus, the model has better control over class-conditioned image synthesis, resulting in more realistic and diverse images. Additionally, the classifier demonstrates improved performance, highlighting its effectiveness for downstream diagnostic tasks. This unique integration in our Class-N-Diff makes it a robust tool for enhancing the quality and utility of diffusion model-based synthetic dermoscopic image generation. Our code is available at this https URL.

113. 【2510.16877】Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning

链接：https://arxiv.org/abs/2510.16877

作者：Heming Zou,Yunliang Zang,Wutong Xu,Xiangyang Ji

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：mitigate catastrophic forgetting, continual representation learning, representation learning paradigm, learning paradigm reframes, paradigm reframes parameter

备注：

点击查看摘要

Abstract:Using a nearly-frozen pretrained model, the continual representation learning paradigm reframes parameter updates as a similarity-matching problem to mitigate catastrophic forgetting. However, directly leveraging pretrained features for downstream tasks often suffers from multicollinearity in the similarity-matching stage, and more advanced methods can be computationally prohibitive for real-time, low-latency applications. Inspired by the fly olfactory circuit, we propose Fly-CL, a bio-inspired framework compatible with a wide range of pretrained backbones. Fly-CL substantially reduces training time while achieving performance comparable to or exceeding that of current state-of-the-art methods. We theoretically show how Fly-CL progressively resolves multicollinearity, enabling more effective similarity matching with low time complexity. Extensive simulation experiments across diverse network architectures and data regimes validate Fly-CL's effectiveness in addressing this challenge through a biologically inspired design. Code is available at this https URL.

114. 【2510.16870】Uncovering Brain-Like Hierarchical Patterns in Vision-Language Models through fMRI-Based Neural Encoding

链接：https://arxiv.org/abs/2510.16870

作者：Yudan Ren,Xinlong Wang,Kexin Wang,Tian Xia,Zihan Ma,Zhaowei Li,Xiangrong Bi,Xiao Li,Xiaowei He

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：research primarily focuses, unimodal ANN studies, high-level model outputs, ANN studies fail, ANN research primarily

备注： 14 pages, 7 figures

点击查看摘要

Abstract:While brain-inspired artificial intelligence(AI) has demonstrated promising results, current understanding of the parallels between artificial neural networks (ANNs) and human brain processing remains limited: (1) unimodal ANN studies fail to capture the brain's inherent multimodal processing capabilities, and (2) multimodal ANN research primarily focuses on high-level model outputs, neglecting the crucial role of individual neurons. To address these limitations, we propose a novel neuron-level analysis framework that investigates the multimodal information processing mechanisms in vision-language models (VLMs) through the lens of human brain activity. Our approach uniquely combines fine-grained artificial neuron (AN) analysis with fMRI-based voxel encoding to examine two architecturally distinct VLMs: CLIP and METER. Our analysis reveals four key findings: (1) ANs successfully predict biological neurons (BNs) activities across multiple functional networks (including language, vision, attention, and default mode), demonstrating shared representational mechanisms; (2) Both ANs and BNs demonstrate functional redundancy through overlapping neural representations, mirroring the brain's fault-tolerant and collaborative information processing mechanisms; (3) ANs exhibit polarity patterns that parallel the BNs, with oppositely activated BNs showing mirrored activation trends across VLM layers, reflecting the complexity and bidirectional nature of neural information processing; (4) The architectures of CLIP and METER drive distinct BNs: CLIP's independent branches show modality-specific specialization, whereas METER's cross-modal design yields unified cross-modal activation, highlighting the architecture's influence on ANN brain-like properties. These results provide compelling evidence for brain-like hierarchical processing in VLMs at the neuronal level.

115. 【2510.16865】Registration is a Powerful Rotation-Invariance Learner for 3D Anomaly Detection

链接：https://arxiv.org/abs/2510.16865

作者：Yuyang Yu,Zhengwei Chen,Xuemiao Xu,Lei Zhang,Haoxin Yang,Yongwei Nie,Shengfeng He

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：industrial quality control, identify structural defects, quality control, aiming to identify, high reliability

备注：

点击查看摘要

Abstract:3D anomaly detection in point-cloud data is critical for industrial quality control, aiming to identify structural defects with high reliability. However, current memory bank-based methods often suffer from inconsistent feature transformations and limited discriminative capacity, particularly in capturing local geometric details and achieving rotation invariance. These limitations become more pronounced when registration fails, leading to unreliable detection results. We argue that point-cloud registration plays an essential role not only in aligning geometric structures but also in guiding feature extraction toward rotation-invariant and locally discriminative representations. To this end, we propose a registration-induced, rotation-invariant feature extraction framework that integrates the objectives of point-cloud registration and memory-based anomaly detection. Our key insight is that both tasks rely on modeling local geometric structures and leveraging feature similarity across samples. By embedding feature extraction into the registration learning process, our framework jointly optimizes alignment and representation learning. This integration enables the network to acquire features that are both robust to rotations and highly effective for anomaly detection. Extensive experiments on the Anomaly-ShapeNet and Real3D-AD datasets demonstrate that our method consistently outperforms existing approaches in effectiveness and generalizability.

116. 【2510.16863】BARL: Bilateral Alignment in Representation and Label Spaces for Semi-Supervised Volumetric Medical Image Segmentation

链接：https://arxiv.org/abs/2510.16863

作者：Shujian Gao,Yuan Wang,Zekuan Yu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：reducing annotation cost, Semi-supervised medical image, match fully supervised, fully supervised performance, sharply reducing annotation

备注： 14 pages, 5 figures

点击查看摘要

Abstract:Semi-supervised medical image segmentation (SSMIS) seeks to match fully supervised performance while sharply reducing annotation cost. Mainstream SSMIS methods rely on \emph{label-space consistency}, yet they overlook the equally critical \emph{representation-space alignment}. Without harmonizing latent features, models struggle to learn representations that are both discriminative and spatially coherent. To this end, we introduce \textbf{Bilateral Alignment in Representation and Label spaces (BARL)}, a unified framework that couples two collaborative branches and enforces alignment in both spaces. For label-space alignment, inspired by co-training and multi-scale decoding, we devise \textbf{Dual-Path Regularization (DPR)} and \textbf{Progressively Cognitive Bias Correction (PCBC)} to impose fine-grained cross-branch consistency while mitigating error accumulation from coarse to fine scales. For representation-space alignment, we conduct region-level and lesion-instance matching between branches, explicitly capturing the fragmented, complex pathological patterns common in medical imagery. Extensive experiments on four public benchmarks and a proprietary CBCT dataset demonstrate that BARL consistently surpasses state-of-the-art SSMIS methods. Ablative studies further validate the contribution of each component. Code will be released soon.

117. 【2510.16854】ArmFormer: Lightweight Transformer Architecture for Real-Time Multi-Class Weapon Segmentation and Classification

链接：https://arxiv.org/abs/2510.16854

作者：Akhila Kambhatla,Taminul Islam,Khaled R Ahmed

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：weapon-related violence necessitates, violence necessitates automated, necessitates automated detection, automated detection systems, detection systems capable

备注： 9 pages with 4 figures and 5 tables. This is a preprint submitted to arXiv

点击查看摘要

Abstract:The escalating threat of weapon-related violence necessitates automated detection systems capable of pixel-level precision for accurate threat assessment in real-time security applications. Traditional weapon detection approaches rely on object detection frameworks that provide only coarse bounding box localizations, lacking the fine-grained segmentation required for comprehensive threat analysis. Furthermore, existing semantic segmentation models either sacrifice accuracy for computational efficiency or require excessive computational resources incompatible with edge deployment scenarios. This paper presents ArmFormer, a lightweight transformer-based semantic segmentation framework that strategically integrates Convolutional Block Attention Module (CBAM) with MixVisionTransformer architecture to achieve superior accuracy while maintaining computational efficiency suitable for resource-constrained edge devices. Our approach combines CBAM-enhanced encoder backbone with attention-integrated hamburger decoder to enable multi-class weapon segmentation across five categories: handgun, rifle, knife, revolver, and human. Comprehensive experiments demonstrate that ArmFormer achieves state-of-the-art performance with 80.64% mIoU and 89.13% mFscore while maintaining real-time inference at 82.26 FPS. With only 4.886G FLOPs and 3.66M parameters, ArmFormer outperforms heavyweight models requiring up to 48x more computation, establishing it as the optimal solution for deployment on portable security cameras, surveillance drones, and embedded AI accelerators in distributed security infrastructure.

118. 【2510.16837】2DGS-R: Revisiting the Normal Consistency Regularization in 2D Gaussian Splatting

链接：https://arxiv.org/abs/2510.16837

作者：Haofan Ren,Qingsong Yan,Ming Lu,Rongfeng Lu,Zunjie Zhu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：influenced neural fields, greatly influenced neural, Gaussian Splatting, Recent advancements, enables high-fidelity rendering

备注：

点击查看摘要

Abstract:Recent advancements in 3D Gaussian Splatting (3DGS) have greatly influenced neural fields, as it enables high-fidelity rendering with impressive visual quality. However, 3DGS has difficulty accurately representing surfaces. In contrast, 2DGS transforms the 3D volume into a collection of 2D planar Gaussian disks. Despite advancements in geometric fidelity, rendering quality remains compromised, highlighting the challenge of achieving both high-quality rendering and precise geometric structures. This indicates that optimizing both geometric and rendering quality in a single training stage is currently unfeasible. To overcome this limitation, we present 2DGS-R, a new method that uses a hierarchical training approach to improve rendering quality while maintaining geometric accuracy. 2DGS-R first trains the original 2D Gaussians with the normal consistency regularization. Then 2DGS-R selects the 2D Gaussians with inadequate rendering quality and applies a novel in-place cloning operation to enhance the 2D Gaussians. Finally, we fine-tune the 2DGS-R model with opacity frozen. Experimental results show that compared to the original 2DGS, our method requires only 1\% more storage and minimal additional training time. Despite this negligible overhead, it achieves high-quality rendering results while preserving fine geometric structures. These findings indicate that our approach effectively balances efficiency with performance, leading to improvements in both visual fidelity and geometric reconstruction accuracy.

119. 【2510.16833】From Mannequin to Human: A Pose-Aware and Identity-Preserving Video Generation Framework for Lifelike Clothing Display

链接：https://arxiv.org/abs/2510.16833

作者：Xiangyu Mu,Dongliang Zhou,Jie Hou,Haijun Zhang,Weili Guan

类目：Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词：Mannequin-based clothing displays, online fashion presentation, clothing displays offer, displays offer, offer a cost-effective

备注：

点击查看摘要

Abstract:Mannequin-based clothing displays offer a cost-effective alternative to real-model showcases for online fashion presentation, but lack realism and expressive detail. To overcome this limitation, we introduce a new task called mannequin-to-human (M2H) video generation, which aims to synthesize identity-controllable, photorealistic human videos from footage of mannequins. We propose M2HVideo, a pose-aware and identity-preserving video generation framework that addresses two key challenges: the misalignment between head and body motion, and identity drift caused by temporal modeling. In particular, M2HVideo incorporates a dynamic pose-aware head encoder that fuses facial semantics with body pose to produce consistent identity embeddings across frames. To address the loss of fine facial details due to latent space compression, we introduce a mirror loss applied in pixel space through a denoising diffusion implicit model (DDIM)-based one-step denoising. Additionally, we design a distribution-aware adapter that aligns statistical distributions of identity and clothing features to enhance temporal coherence. Extensive experiments on the UBC fashion dataset, our self-constructed ASOS dataset, and the newly collected MannequinVideos dataset captured on-site demonstrate that M2HVideo achieves superior performance in terms of clothing consistency, identity preservation, and video fidelity in comparison to state-of-the-art methods.

120. 【2510.16832】Robust Cross-Domain Adaptation in Texture Features Transferring for Wood Chip Moisture Content Prediction

链接：https://arxiv.org/abs/2510.16832

作者：Abdur Rahman,Mohammad Marufuzzaman,Jason Street,Haifeng Wang,Veera G. Gude,Randy Buchanan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：ensuring energy efficiency, optimizing biofuel production, wood chip, wood chip moisture, energy efficiency

备注：

点击查看摘要

Abstract:Accurate and quick prediction of wood chip moisture content is critical for optimizing biofuel production and ensuring energy efficiency. The current widely used direct method (oven drying) is limited by its longer processing time and sample destructiveness. On the other hand, existing indirect methods, including near-infrared spectroscopy-based, electrical capacitance-based, and image-based approaches, are quick but not accurate when wood chips come from various sources. Variability in the source material can alter data distributions, undermining the performance of data-driven models. Therefore, there is a need for a robust approach that effectively mitigates the impact of source variability. Previous studies show that manually extracted texture features have the potential to predict wood chip moisture class. Building on this, in this study, we conduct a comprehensive analysis of five distinct texture feature types extracted from wood chip images to predict moisture content. Our findings reveal that a combined feature set incorporating all five texture features achieves an accuracy of 95% and consistently outperforms individual texture features in predicting moisture content. To ensure robust moisture prediction, we propose a domain adaptation method named AdaptMoist that utilizes the texture features to transfer knowledge from one source of wood chip data to another, addressing variability across different domains. We also proposed a criterion for model saving based on adjusted mutual information. The AdaptMoist method improves prediction accuracy across domains by 23%, achieving an average accuracy of 80%, compared to 57% for non-adapted models. These results highlight the effectiveness of AdaptMoist as a robust solution for wood chip moisture content estimation across domains, making it a potential solution for wood chip-reliant industries.

121. 【2510.16822】ReefNet: A Large scale, Taxonomically Enriched Dataset and Benchmark for Hard Coral Classification

链接：https://arxiv.org/abs/2510.16822

作者：Yahia Battach,Abdulwahab Felemban,Faizan Farooq Khan,Yousef A. Radwan,Xiang Li,Fabio Marchese,Sara Beery,Burton H. Jones,Francesca Benzoni,Mohamed Elhoseiny

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：rapidly declining due, climate change, underscoring the urgent, rapidly declining, declining due

备注：

点击查看摘要

Abstract:Coral reefs are rapidly declining due to anthropogenic pressures such as climate change, underscoring the urgent need for scalable, automated monitoring. We introduce ReefNet, a large public coral reef image dataset with point-label annotations mapped to the World Register of Marine Species (WoRMS). ReefNet aggregates imagery from 76 curated CoralNet sources and an additional site from Al Wajh in the Red Sea, totaling approximately 925000 genus-level hard coral annotations with expert-verified labels. Unlike prior datasets, which are often limited by size, geography, or coarse labels and are not ML-ready, ReefNet offers fine-grained, taxonomically mapped labels at a global scale to WoRMS. We propose two evaluation settings: (i) a within-source benchmark that partitions each source's images for localized evaluation, and (ii) a cross-source benchmark that withholds entire sources to test domain generalization. We analyze both supervised and zero-shot classification performance on ReefNet and find that while supervised within-source performance is promising, supervised performance drops sharply across domains, and performance is low across the board for zero-shot models, especially for rare and visually similar genera. This provides a challenging benchmark intended to catalyze advances in domain generalization and fine-grained coral classification. We will release our dataset, benchmarking code, and pretrained models to advance robust, domain-adaptive, global coral reef monitoring and conservation.

122. 【2510.16814】Needles in the Landscape: Semi-Supervised Pseudolabeling for Archaeological Site Discovery under Label Scarcity

链接：https://arxiv.org/abs/2510.16814

作者：Simon Jaxy,Anton Theys,Patrick Willett,W. Chris Carleton,Ralf Vandam,Pieter Libin

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：predictive modelling estimates, Archaeological predictive modelling, modelling estimates, occur by combining, Conditional Random Field

备注：

点击查看摘要

Abstract:Archaeological predictive modelling estimates where undiscovered sites are likely to occur by combining known locations with environmental, cultural, and geospatial variables. We address this challenge using a deep learning approach but must contend with structural label scarcity inherent to archaeology: positives are rare, and most locations are unlabeled. To address this, we adopt a semi-supervised, positive-unlabeled (PU) learning strategy, implemented as a semantic segmentation model and evaluated on two datasets covering a representative range of archaeological periods. Our approach employs dynamic pseudolabeling, refined with a Conditional Random Field (CRF) implemented via an RNN, increasing label confidence under severe class imbalance. On a geospatial dataset derived from a digital elevation model (DEM), our model performs on par with the state-of-the-art, LAMAP, while achieving higher Dice scores. On raw satellite imagery, assessed end-to-end with stratified k-fold cross-validation, it maintains performance and yields predictive surfaces with improved interpretability. Overall, our results indicate that semi-supervised learning offers a promising approach to identifying undiscovered sites across large, sparsely annotated landscapes.

123. 【2510.16800】An RGB-D Image Dataset for Lychee Detection and Maturity Classification for Robotic Harvesting

链接：https://arxiv.org/abs/2510.16800

作者：Zhenpeng Zhang,Yi Wang,Shanglei Chai,Yingying Liu,Zekai Xie,Wenhao Huang,Pengyu Li,Zipei Luo,Dajiang Lu,Yibin Tian

类目：Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词：high-value subtropical fruit, high-value subtropical, Lychee, images, harvesting robots

备注：

点击查看摘要

Abstract:Lychee is a high-value subtropical fruit. The adoption of vision-based harvesting robots can significantly improve productivity while reduce reliance on labor. High-quality data are essential for developing such harvesting robots. However, there are currently no consistently and comprehensively annotated open-source lychee datasets featuring fruits in natural growing environments. To address this, we constructed a dataset to facilitate lychee detection and maturity classification. Color (RGB) images were acquired under diverse weather conditions, and at different times of the day, across multiple lychee varieties, such as Nuomici, Feizixiao, Heiye, and Huaizhi. The dataset encompasses three different ripeness stages and contains 11,414 images, consisting of 878 raw RGB images, 8,780 augmented RGB images, and 1,756 depth images. The images are annotated with 9,658 pairs of lables for lychee detection and maturity classification. To improve annotation consistency, three individuals independently labeled the data, and their results were then aggregated and verified by a fourth reviewer. Detailed statistical analyses were done to examine the dataset. Finally, we performed experiments using three representative deep learning models to evaluate the dataset. It is publicly available for academic

124. 【2510.16791】Personalized Image Filter: Mastering Your Photographic Style

链接：https://arxiv.org/abs/2510.16791

作者：Chengxuan Zhu,Shuchen Weng,Jiacong Fang,Peixuan Zhang,Si Li,Chao Xu,Boxin Shi

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Photographic style, photographic concepts, Photographic, renowned photographers, charm behind renowned

备注：

点击查看摘要

Abstract:Photographic style, as a composition of certain photographic concepts, is the charm behind renowned photographers. But learning and transferring photographic style need a profound understanding of how the photo is edited from the unknown original appearance. Previous works either fail to learn meaningful photographic concepts from reference images, or cannot preserve the content of the content image. To tackle these issues, we proposed a Personalized Image Filter (PIF). Based on a pretrained text-to-image diffusion model, the generative prior enables PIF to learn the average appearance of photographic concepts, as well as how to adjust them according to text prompts. PIF then learns the photographic style of reference images with the textual inversion technique, by optimizing the prompts for the photographic concepts. PIF shows outstanding performance in extracting and transferring various kinds of photographic style. Project page: this https URL

125. 【2510.16790】Unsupervised Monocular Road Segmentation for Autonomous Driving via Scene Geometry

链接：https://arxiv.org/abs/2510.16790

作者：Sara Hatami Rostami,Behrooz Nasihatkon

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：costly manually labeled, manually labeled datasets, fully unsupervised approach, eliminating the reliance, paper presents

备注： 7 pages, 3 figures

点击查看摘要

Abstract:This paper presents a fully unsupervised approach for binary road segmentation (road vs. non-road), eliminating the reliance on costly manually labeled datasets. The method leverages scene geometry and temporal cues to distinguish road from non-road regions. Weak labels are first generated from geometric priors, marking pixels above the horizon as non-road and a predefined quadrilateral in front of the vehicle as road. In a refinement stage, temporal consistency is enforced by tracking local feature points across frames and penalizing inconsistent label assignments using mutual information maximization. This enhances both precision and temporal stability. On the Cityscapes dataset, the model achieves an Intersection-over-Union (IoU) of 0.82, demonstrating high accuracy with a simple design. These findings demonstrate the potential of combining geometric constraints and temporal consistency for scalable unsupervised road segmentation in autonomous driving.

126. 【2510.16785】Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs

链接：https://arxiv.org/abs/2510.16785

作者：Jiazhen Liu,Long Chen

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Multimodal Large Language, Large Language Models, Integrating diverse visual, Multimodal Large, Large Language

备注：

点击查看摘要

Abstract:Integrating diverse visual capabilities into a unified model is a significant trend in Multimodal Large Language Models (MLLMs). Among these, the inclusion of segmentation poses a distinct set of challenges. To equip MLLMs with pixel-level segmentation abilities, prevailing methods require finetuning the model to produce specific outputs compatible with a mask decoder. This process typically alters the model's output space and compromises its intrinsic generalization, which undermines the goal of building a unified model. We introduce LENS (Leveraging kEypoiNts for MLLMs' Segmentation), a novel plug-and-play solution. LENS attaches a lightweight, trainable head to a completely frozen MLLM. By refining the spatial cues embedded in attention maps, LENS extracts keypoints and describes them into point-wise features directly compatible with the mask decoder. Extensive experiments validate our approach: LENS achieves segmentation performance competitive with or superior to that of retraining-based methods. Crucially, it does so while fully preserving the MLLM's generalization capabilities, which are significantly degraded by finetuning approaches. As such, the attachable design of LENS establishes an efficient and powerful paradigm for extending MLLMs, paving the way for truly multi-talented, unified models.

127. 【2510.16781】Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features

链接：https://arxiv.org/abs/2510.16781

作者：Shihao Ji,Zihui Song

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：Visual Language Models, large-scale Visual Language, Language Models, Visual Language, remarkable zero-shot reasoning

备注：

点击查看摘要

128. 【2510.16777】GS2POSE: Marry Gaussian Splatting to 6D Object Pose Estimation

链接：https://arxiv.org/abs/2510.16777

作者：Junbo Li,Weimin Yuan,Yinuo Wang,Yue Zeng,Shihao Shu,Cai Meng,Xiangzhi Bai

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：current research typically, research typically predicts, computer vision, fundamental task, task in computer

备注：

点击查看摘要

Abstract:Accurate 6D pose estimation of 3D objects is a fundamental task in computer vision, and current research typically predicts the 6D pose by establishing correspondences between 2D image features and 3D model features. However, these methods often face difficulties with textureless objects and varying illumination conditions. To overcome these limitations, we propose GS2POSE, a novel approach for 6D object pose estimation. GS2POSE formulates a pose regression algorithm inspired by the principles of Bundle Adjustment (BA). By leveraging Lie algebra, we extend the capabilities of 3DGS to develop a pose-differentiable rendering pipeline, which iteratively optimizes the pose by comparing the input image to the rendered image. Additionally, GS2POSE updates color parameters within the 3DGS model, enhancing its adaptability to changes in illumination. Compared to previous models, GS2POSE demonstrates accuracy improvements of 1.4\%, 2.8\% and 2.5\% on the T-LESS, LineMod-Occlusion and LineMod datasets, respectively.

129. 【2510.16776】EMRRG: Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation

链接：https://arxiv.org/abs/2510.16776

作者：Mingzheng Zhang,Jinfeng Gao,Dan Xu,Jiangrui Yu,Yuhan Qiao,Lan Chen,Jin Tang,Xiao Wang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：patient wait times, significantly reduce diagnostic, reduce diagnostic burdens, Large Language Models, wait times

备注：

点击查看摘要

Abstract:X-ray image-based medical report generation (MRG) is a pivotal area in artificial intelligence that can significantly reduce diagnostic burdens for clinicians and patient wait times. Existing MRG models predominantly rely on Large Language Models (LLMs) to improve report generation, with limited exploration of pre-trained vision foundation models or advanced fine-tuning techniques. Mainstream frameworks either avoid fine-tuning or utilize simplistic methods like LoRA, often neglecting the potential of enhancing cross-attention mechanisms. Additionally, while Transformer-based models dominate vision-language tasks, non-Transformer architectures, such as the Mamba network, remain underexplored for medical report generation, presenting a promising avenue for future research. In this paper, we propose EMRRG, a novel X-ray report generation framework that fine-tunes pre-trained Mamba networks using parameter-efficient methods. Specifically, X-ray images are divided into patches, tokenized, and processed by an SSM-based vision backbone for feature extraction, with Partial LoRA yielding optimal performance. An LLM with a hybrid decoder generates the medical report, enabling end-to-end training and achieving strong results on benchmark datasets. Extensive experiments on three widely used benchmark datasets fully validated the effectiveness of our proposed strategies for the X-ray MRG. The source code of this paper will be released on this https URL.

130. 【2510.16772】Region in Context: Text-condition Image editing with Human-like semantic reasoning

链接：https://arxiv.org/abs/2510.16772

作者：Thuy Phuong Vu,Dinh-Cuong Hoang,Minhhuy Le,Phan Xuan Tan

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：made significant progress, Recent research, based on text, research has made, made significant

备注：

点击查看摘要

Abstract:Recent research has made significant progress in localizing and editing image regions based on text. However, most approaches treat these regions in isolation, relying solely on local cues without accounting for how each part contributes to the overall visual and semantic composition. This often results in inconsistent edits, unnatural transitions, or loss of coherence across the image. In this work, we propose Region in Context, a novel framework for text-conditioned image editing that performs multilevel semantic alignment between vision and language, inspired by the human ability to reason about edits in relation to the whole scene. Our method encourages each region to understand its role within the global image context, enabling precise and harmonized changes. At its core, the framework introduces a dual-level guidance mechanism: regions are represented with full-image context and aligned with detailed region-level descriptions, while the entire image is simultaneously matched to a comprehensive scene-level description generated by a large vision-language model. These descriptions serve as explicit verbal references of the intended content, guiding both local modifications and global structure. Experiments show that it produces more coherent and instruction-aligned results. Code is available at: this https URL

131. 【2510.16765】WaMaIR: Image Restoration via Multiscale Wavelet Convolutions and Mamba-based Channel Modeling with Texture Enhancement

链接：https://arxiv.org/abs/2510.16765

作者：Shengyu Zhu, Fan,Fuxuan Zhang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：frameworks demonstrate significant, significant computational efficiency, demonstrate significant computational, CNN-based frameworks demonstrate, computer vision

备注： Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Oral

点击查看摘要

Abstract:Image restoration is a fundamental and challenging task in computer vision, where CNN-based frameworks demonstrate significant computational efficiency. However, previous CNN-based methods often face challenges in adequately restoring fine texture details, which are limited by the small receptive field of CNN structures and the lack of channel feature modeling. In this paper, we propose WaMaIR, which is a novel framework with a large receptive field for image perception and improves the reconstruction of texture details in restored images. Specifically, we introduce the Global Multiscale Wavelet Transform Convolutions (GMWTConvs) for expandding the receptive field to extract image features, preserving and enriching texture features in model inputs. Meanwhile, we propose the Mamba-Based Channel-Aware Module (MCAM), explicitly designed to capture long-range dependencies within feature channels, which enhancing the model sensitivity to color, edges, and texture information. Additionally, we propose Multiscale Texture Enhancement Loss (MTELoss) for image restoration to guide the model in preserving detailed texture structures effectively. Extensive experiments confirm that WaMaIR outperforms state-of-the-art methods, achieving better image restoration and efficient computational performance of the model.

132. 【2510.16756】End-to-end Listen, Look, Speak and Act

链接：https://arxiv.org/abs/2510.16756

作者：Siyin Wang,Wenyi Yu,Xianzhao Chen,Xiaohai Tian,Jun Zhang,Lu Lu,Chao Zhang

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Audio and Speech Processing (eess.AS)

关键词：models simulating humans, fluidly adapt, Human interaction, speak while acting, inherently multimodal

备注： 22 pages, 8 figures

点击查看摘要

133. 【2510.16752】Prominence-Aware Artifact Detection and Dataset for Image Super-Resolution

链接：https://arxiv.org/abs/2510.16752

作者：Ivan Molodetskikh,Kirill Malyshev,Mark Mirgaleev,Nikita Zagainov,Evgeney Bogatyrev,Dmitriy Vatolin

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Generative image super-resolution, Generative image, rapidly advancing, advancing in visual, detail restoration

备注：

点击查看摘要

Abstract:Generative image super-resolution (SR) is rapidly advancing in visual quality and detail restoration. As the capacity of SR models expands, however, so does their tendency to produce artifacts: incorrect, visually disturbing details that reduce perceived quality. Crucially, their perceptual impact varies: some artifacts are barely noticeable while others strongly degrade the image. We argue that artifacts should be characterized by their prominence to human observers rather than treated as uniform binary defects. Motivated by this, we present a novel dataset of 1302 artifact examples from 11 contemporary image-SR methods, where each artifact is paired with a crowdsourced prominence score. Building on this dataset, we train a lightweight regressor that produces spatial prominence heatmaps and outperforms existing methods at detecting prominent artifacts. We release the dataset and code to facilitate prominence-aware evaluation and mitigation of SR artifacts.

134. 【2510.16751】Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

链接：https://arxiv.org/abs/2510.16751

作者：Erik Riise,Mehmet Onurcan Kaya,Dim P. Papadopoulos

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：revolutionized Large Language, Large Language Models, Large Language, revolutionized Large, Language Models

备注：

点击查看摘要

Abstract:While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.

135. 【2510.16732】A Comprehensive Survey on World Models for Embodied AI

链接：https://arxiv.org/abs/2510.16732

作者：Xinqing Li,Xin He,Le Zhang,Yun Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：actions reshape future, future world states, reshape future world, Spatial Latent Grid, agents that perceive

备注： [this https URL](https://github.com/Li-Zn-H/AwesomeWorldModels)

点击查看摘要

Abstract:Embodied AI requires agents that perceive, act, and anticipate how actions reshape future world states. World models serve as internal simulators that capture environment dynamics, enabling forward and counterfactual rollouts to support perception, prediction, and decision making. This survey presents a unified framework for world models in embodied AI. Specifically, we formalize the problem setting and learning objectives, and propose a three-axis taxonomy encompassing: (1) Functionality, Decision-Coupled vs. General-Purpose; (2) Temporal Modeling, Sequential Simulation and Inference vs. Global Difference Prediction; (3) Spatial Representation, Global Latent Vector, Token Feature Sequence, Spatial Latent Grid, and Decomposed Rendering Representation. We systematize data resources and metrics across robotics, autonomous driving, and general video settings, covering pixel prediction quality, state-level understanding, and task performance. Furthermore, we offer a quantitative comparison of state-of-the-art models and distill key open challenges, including the scarcity of unified datasets and the need for evaluation metrics that assess physical consistency over pixel fidelity, the trade-off between model performance and the computational efficiency required for real-time control, and the core modeling difficulty of achieving long-horizon temporal consistency while mitigating error accumulation. Finally, we maintain a curated bibliography at this https URL.

136. 【2510.16730】UKANFormer: Noise-Robust Semantic Segmentation for Coral Reef Mapping via a Kolmogorov-Arnold Network-Transformer Hybrid

链接：https://arxiv.org/abs/2510.16730

作者：Tianyang Dou,Ming Li,Jiangying Qin,Xuan Liao,Jiageng Zhong,Armin Gruen,Mengyi Deng

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Allen Coral Atlas, Allen Coral, Coral Atlas, coral reef distri-bution, require accurate large-scale

备注：

点击查看摘要

Abstract:Coral reefs are vital yet fragile ecosystems that require accurate large-scale mapping for effective conservation. Although global products such as the Allen Coral Atlas provide unprecedented coverage of global coral reef distri-bution, their predictions are frequently limited in spatial precision and semantic consistency, especially in regions requiring fine-grained boundary delineation. To address these challenges, we propose UKANFormer, a novel se-mantic segmentation model designed to achieve high-precision mapping under noisy supervision derived from Allen Coral Atlas. Building upon the UKAN architecture, UKANFormer incorporates a Global-Local Transformer (GL-Trans) block in the decoder, enabling the extraction of both global semantic structures and local boundary details. In experiments, UKANFormer achieved a coral-class IoU of 67.00% and pixel accuracy of 83.98%, outperforming conventional baselines under the same noisy labels setting. Remarkably, the model produces predictions that are visually and structurally more accurate than the noisy labels used for training. These results challenge the notion that data quality directly limits model performance, showing that architectural design can mitigate label noise and sup-port scalable mapping under imperfect supervision. UKANFormer provides a foundation for ecological monitoring where reliable labels are scarce.

137. 【2510.16729】Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models

链接：https://arxiv.org/abs/2510.16729

作者：Jianbiao Mei,Yu Yang,Xuemeng Yang,Licheng Wen,Jiajun Lv,Botian Shi,Yong Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：autonomous driving systems, driving systems increasingly, systems increasingly rely, vision-centric world models, Residual World Model

备注：

点击查看摘要

Abstract:End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common ineffectiveness in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird's-eye-view representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego-vehicle's actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.

138. 【2510.16714】Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

链接：https://arxiv.org/abs/2510.16714

作者：Xiongkun Linghu,Jiangyong Huang,Ziyu Zhu,Baoxiong Jia,Siyuan Huang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Large Language Models, Large Language, Language Models, Existing research, primarily due

备注： Project page: [this https URL](https://scenecot.github.io/)

点击查看摘要

Abstract:Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mech- anism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of- Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.

139. 【2510.16709】HumanCM: One Step Human Motion Prediction

链接：https://arxiv.org/abs/2510.16709

作者：Liu Haojie,Gao Suixiang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：prediction framework built, one-step human motion, human motion prediction, one-step human, built upon consistency

备注： 6 pages, 2 figures, 2 tables

点击查看摘要

Abstract:We present HumanCM, a one-step human motion prediction framework built upon consistency models. Instead of relying on multi-step denoising as in diffusion-based methods, HumanCM performs efficient single-step generation by learning a self-consistent mapping between noisy and clean motion states. The framework adopts a Transformer-based spatiotemporal architecture with temporal embeddings to model long-range dependencies and preserve motion coherence. Experiments on Human3.6M and HumanEva-I demonstrate that HumanCM achieves comparable or superior accuracy to state-of-the-art diffusion models while reducing inference steps by up to two orders of magnitude.

140. 【2510.16704】Connecting Domains and Contrasting Samples: A Ladder for Domain Generalization

链接：https://arxiv.org/abs/2510.16704

作者：Tianxin Wei,Yifan Chen,Xinrui He,Wenxuan Bao,Jingrui He

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Distribution shifts, testing samples frequently, samples frequently occur, impede model generalization, shifts between training

备注： Accepted by KDD 2025

点击查看摘要

Abstract:Distribution shifts between training and testing samples frequently occur in practice and impede model generalization performance. This crucial challenge thereby motivates studies on domain generalization (DG), which aim to predict the label on unseen target domain data by solely using data from source domains. It is intuitive to conceive the class-separated representations learned in contrastive learning (CL) are able to improve DG, while the reality is quite the opposite: users observe directly applying CL deteriorates the performance. We analyze the phenomenon with the insights from CL theory and discover lack of intra-class connectivity in the DG setting causes the deficiency. We thus propose a new paradigm, domain-connecting contrastive learning (DCCL), to enhance the conceptual connectivity across domains and obtain generalizable representations for DG. On the data side, more aggressive data augmentation and cross-domain positive samples are introduced to improve intra-class connectivity. On the model side, to better embed the unseen test domains, we propose model anchoring to exploit the intra-class connectivity in pre-trained representations and complement the anchoring with generative transformation loss. Extensive experiments on five standard DG benchmarks are performed. The results verify that DCCL outperforms state-of-the-art baselines even without domain supervision. The detailed model implementation and the code are provided through this https URL

141. 【2510.16702】SDPA++: A General Framework for Self-Supervised Denoising with Patch Aggregation

链接：https://arxiv.org/abs/2510.16702

作者：Huy Minh Nhat Nguyen,Triet Hoang Minh Dao,Chau Vinh Hoang Truong,Cuong Tuan Nguyen

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Optical Coherence Tomography, Optical Coherence, Coherence Tomography, detailed three-dimensional views, noisy OCT images

备注： 2025 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB)

点击查看摘要

Abstract:Optical Coherence Tomography (OCT) is a widely used non-invasive imaging technique that provides detailed three-dimensional views of the retina, which are essential for the early and accurate diagnosis of ocular diseases. Consequently, OCT image analysis and processing have emerged as key research areas in biomedical imaging. However, acquiring paired datasets of clean and real-world noisy OCT images for supervised denoising models remains a formidable challenge due to intrinsic speckle noise and practical constraints in clinical imaging environments. To address these issues, we propose SDPA++: A General Framework for Self-Supervised Denoising with Patch Aggregation. Our novel approach leverages only noisy OCT images by first generating pseudo-ground-truth images through self-fusion and self-supervised denoising. These refined images then serve as targets to train an ensemble of denoising models using a patch-based strategy that effectively enhances image clarity. Performance improvements are validated via metrics such as Contrast-to-Noise Ratio (CNR), Mean Square Ratio (MSR), Texture Preservation (TP), and Edge Preservation (EP) on the real-world dataset from the IEEE SPS Video and Image Processing Cup. Notably, the VIP Cup dataset contains only real-world noisy OCT images without clean references, highlighting our method's potential for improving image quality and diagnostic outcomes in clinical practice.

142. 【2510.16688】Pursuing Minimal Sufficiency in Spatial Reasoning

链接：https://arxiv.org/abs/2510.16688

作者：Yejie Guo,Yunzhong Hou,Wufei Ma,Meng Tang,Ming-Hsuan Yang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Minimal Sufficient Spatial, Minimal Sufficient Set, Sufficient Spatial Reasoner, remains a persistent, ability to ground

备注：

点击查看摘要

Abstract:Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from \textit{expert models}. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at this https URL.

143. 【2510.16684】Filtering of Small Components for Isosurface Generation

链接：https://arxiv.org/abs/2510.16684

作者：Devin Zhao,Rephael Wenger

类目：Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词：scalar field, mathbb, sigma, Abstract, rightarrow

备注： 8 pages, 6 figures, 5 tables

点击查看摘要

Abstract:Let $f: \mathbb{R}^3 \rightarrow \mathbb{R}$ be a scalar field. An isosurface is a piecewise linear approximation of a level set $f^{-1}(\sigma)$ for some $\sigma \in \mathbb{R}$ built from some regular grid sampling of $f$. Isosurfaces constructed from scanned data such as CT scans or MRIs often contain extremely small components that distract from the visualization and do not form part of any geometric model produced from the data. Simple prefiltering of the data can remove such small components while having no effect on the large components that form the body of the visualization. We present experimental results on such filtering.

144. 【2510.16664】HYDRA: HYbrid knowledge Distillation and spectral Reconstruction Algorithm for high channel hyperspectral camera applications

链接：https://arxiv.org/abs/2510.16664

作者：Christopher Thirgood,Oscar Mendez,Erin Ling,Jon Storey,Simon Hadfield

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Spectral Reconstruction, Toggle, spectral Reconstruction Architecture, Toggle Hugging Face, Code Toggle Papers

备注：

点击查看摘要

Abstract:Hyperspectral images (HSI) promise to support a range of new applications in computer vision. Recent research has explored the feasibility of generalizable Spectral Reconstruction (SR), the problem of recovering a HSI from a natural three-channel color image in unseen scenarios. However, previous Multi-Scale Attention (MSA) works have only demonstrated sufficient generalizable results for very sparse spectra, while modern HSI sensors contain hundreds of channels. This paper introduces a novel approach to spectral reconstruction via our HYbrid knowledge Distillation and spectral Reconstruction Architecture (HYDRA). Using a Teacher model that encapsulates latent hyperspectral image data and a Student model that learns mappings from natural images to the Teacher's encoded domain, alongside a novel training method, we achieve high-quality spectral reconstruction. This addresses key limitations of prior SR models, providing SOTA performance across all metrics, including an 18\% boost in accuracy, and faster inference times than current SOTA models at various channel depths.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2510.16664 [cs.CV]

(or
arXiv:2510.16664v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.16664

Focus to learn more

              arXiv-issued DOI via DataCite

Submission history From: Christopher Thirgood [view email] [v1]
Sat, 18 Oct 2025 23:29:30 UTC (3,044 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled HYDRA: HYbrid knowledge Distillation and spectral Reconstruction Algorithm for high channel hyperspectral camera applications, by Christopher Thirgood and 4 other authorsView PDFHTML (experimental)TeX Source

view license