本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新1095篇论文,其中:

  • 自然语言处理159
  • 信息检索22
  • 计算机视觉211

自然语言处理

1. 【2510.17800】Glyph: Scaling Context Windows via Visual-Text Compression

链接https://arxiv.org/abs/2510.17800

作者:Jiale Cheng,Yusen Liu,Xinyu Zhang,Yulin Fei,Wenyi Hong,Ruiliang Lyu,Weihan Wang,Zhe Su,Xiaotao Gu,Xiao Liu,Yushi Bai,Jie Tang,Hongning Wang,Minlie Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, Large language, increasingly rely, multi-step reasoning, Large

备注

点击查看摘要

Abstract:Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective-visual context scaling-to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at this https URL.

2. 【2510.17797】Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

链接https://arxiv.org/abs/2510.17797

作者:Akshara Prabhakar,Roshan Ram,Zixiang Chen,Silvio Savarese,Frank Wang,Caiming Xiong,Huan Wang,Weiran Yao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:information grows exponentially, face increasing pressure, transform unstructured data, enterprises face increasing, Master Planning Agent

备注: Technical report; 13 pages plus references and appendices

点击查看摘要

Abstract:As information grows exponentially, enterprises face increasing pressure to transform unstructured data into coherent, actionable insights. While autonomous agents show promise, they often struggle with domain-specific nuances, intent alignment, and enterprise integration. We present Enterprise Deep Research (EDR), a multi-agent system that integrates (1) a Master Planning Agent for adaptive query decomposition, (2) four specialized search agents (General, Academic, GitHub, LinkedIn), (3) an extensible MCP-based tool ecosystem supporting NL2SQL, file analysis, and enterprise workflows, (4) a Visualization Agent for data-driven insights, and (5) a reflection mechanism that detects knowledge gaps and updates research direction with optional human-in-the-loop steering guidance. These components enable automated report generation, real-time streaming, and seamless enterprise deployment, as validated on internal datasets. On open-ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state-of-the-art agentic systems without any human steering. We release the EDR framework and benchmark trajectories to advance research on multi-agent reasoning applications. Code at this https URL and Dataset at this https URL

Comments:
Technical report; 13 pages plus references and appendices

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2510.17797 [cs.CL]

(or
arXiv:2510.17797v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2510.17797

Focus to learn more

              arXiv-issued DOI via DataCite</p>
3. 【2510.17795】Executable Knowledge Graphs for Replicating AI Research

链接https://arxiv.org/abs/2510.17795

作者:Yujie Luo,Zhuoyun Yu,Xuehai Wang,Yuqi Zhu,Ningyu Zhang,Lanning Wei,Lun Du,Da Zheng,Huajun Chen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Software Engineering (cs.SE)

关键词:large language model, Replicating AI research, language model, crucial yet challenging, challenging task

备注: Work in progress

点击查看摘要

Abstract:Replicating AI research is a crucial yet challenging task for large language model (LLM) agents. Existing approaches often struggle to generate executable code, primarily due to insufficient background knowledge and the limitations of retrieval-augmented generation (RAG) methods, which fail to capture latent technical details hidden in referenced papers. Furthermore, previous approaches tend to overlook valuable implementation-level code signals and lack structured knowledge representations that support multi-granular retrieval and reuse. To overcome these challenges, we propose Executable Knowledge Graphs (xKG), a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. When integrated into three agent frameworks with two different LLMs, xKG shows substantial performance gains (10.9% with o3-mini) on PaperBench, demonstrating its effectiveness as a general and extensible solution for automated AI research replication. Code will released at this https URL.

4. 【2510.17793】Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

链接https://arxiv.org/abs/2510.17793

作者:Austin Xu,Xuan-Phi Nguyen,Yilun Zhou,Chien-Sheng Wu,Caiming Xiong,Shafiq Joty

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Finetuning specialized generative, popular paradigm, paradigm to meet, meet the increasing, increasing demand

备注: 29 pages, 9 tables, 6 figures

点击查看摘要

Abstract:Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.

5. 【2510.17790】UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

链接https://arxiv.org/abs/2510.17790

作者:Yuhao Yang,Zhen Yang,Zi-Yi Dou,Anh Nguyen,Keen You,Omar Attia,Andrew Szot,Michael Feng,Ram Ramrakhya,Alexander Toshev,Chao Huang,Yinfei Yang,Zhe Gan

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:require accurate visual, accurate visual grounding, lengthy execution chains, Multimodal agents, programmatic tool calls

备注

点击查看摘要

Abstract:Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.

6. 【2510.17776】Mapping Post-Training Forgetting in Language Models at Scale

链接https://arxiv.org/abs/2510.17776

作者:Jackson Harmon,Andreas Hochlehnert,Matthias Bethge,Ameya Prabhu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:remains poorly understood, largest capability gains, knowledge remains poorly, Scaled post-training, backward transfer

备注: 43 pages,15 figures

点击查看摘要

Abstract:Scaled post-training now drives many of the largest capability gains in language models (LMs), yet its effect on pretrained knowledge remains poorly understood. Not all forgetting is equal: Forgetting one fact (e.g., a U.S. president or an API call) does not "average out" by recalling another. Hence, we propose a sample-wise paradigm to measure what is forgotten and when backward transfer occurs. Our metric counts 1-0 transitions (correct before post-training, incorrect after) to quantify forgetting and 0-1 transitions to quantify backward transfer. Traditional task averages conflate these effects and obscure large changes. For multiple-choice benchmarks, we add chance-adjusted variants that subtract the expected contribution of random guessing from pre- and post-training accuracies. We apply this framework across post-training stages, model sizes, and data scales. Our large-scale analysis shows that: (1) Domain-continual pretraining induces moderate forgetting with low-to-moderate backward transfer; (2) RL/SFT post-training applied to base models and Instruction tuning yields moderate-to-large backward transfer on math and logic with overall low-to-moderate forgetting; (3) Applying RL/SFT to instruction-tuned models is sensitive on data scale: at small scales, both forgetting and backward transfer are small; at larger scales, effects are mixed and warrant further study with better controls; (4) Model merging does not reliably mitigate forgetting. Overall, our framework offers a practical yardstick for mapping how post-training alters pretrained knowledge at scale -- enabling progress towards generally capable AI systems.

7. 【2510.17764】Evaluating Medical LLMs by Levels of Autonomy: A Survey Moving from Benchmarks to Applications

链接https://arxiv.org/abs/2510.17764

作者:Xiao Ye,Jacob Dineen,Zhaonan Li,Zhikun Xu,Weiyu Chen,Shijie Lu,Yuxi Huang,Ming Shen,Phu Tran,Ji-Eun Irene Yum,Muhammad Ali Khan,Muhammad Umar Afzal,Irbaz Bin Riaz,Ben Zhou

类目:Computation and Language (cs.CL)

关键词:Medical Large language, Large language models, Medical Large, language models achieve, models achieve strong

备注

点击查看摘要

Abstract:Medical Large language models achieve strong scores on standard benchmarks; however, the transfer of those results to safe and reliable performance in clinical workflows remains a challenge. This survey reframes evaluation through a levels-of-autonomy lens (L0-L3), spanning informational tools, information transformation and aggregation, decision support, and supervised agents. We align existing benchmarks and metrics with the actions permitted at each level and their associated risks, making the evaluation targets explicit. This motivates a level-conditioned blueprint for selecting metrics, assembling evidence, and reporting claims, alongside directions that link evaluation to oversight. By centering autonomy, the survey moves the field beyond score-based claims toward credible, risk-aware evidence for real clinical use.

8. 【2510.17759】VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

链接https://arxiv.org/abs/2510.17759

作者:Qilin Liao,Anamika Lochab,Ruqi Zhang

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:extend large language, large language models, extend large, visual reasoning, large language

备注: 18 pages, 7 Figures,

点击查看摘要

Abstract:Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.

9. 【2510.17733】rain for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations

链接https://arxiv.org/abs/2510.17733

作者:Tong Chen,Akari Asai,Luke Zettlemoyer,Hannaneh Hajishirzi,Faeze Brahman

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:incorrect information unsupported, Language models, information unsupported, Language, training data

备注

点击查看摘要

Abstract:Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.

10. 【2510.17725】AcademicEval: Live Long-Context LLM Benchmark

链接https://arxiv.org/abs/2510.17725

作者:Haozhen Zhang,Tao Feng,Pengrui Han,Jiaxuan You

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, Language Models, recently achieved remarkable, achieved remarkable performance

备注: Accepted by TMLR. Code is available at [this https URL](https://github.com/ulab-uiuc/AcademicEval)

点击查看摘要

Abstract:Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage issues during LLM training. Therefore, we propose \textsc{AcademicEval}, a live benchmark for evaluating LLMs over long-context generation tasks. \textsc{AcademicEval} adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, \textit{i.e.}, \textsc{Title}, \textsc{Abstract}, \textsc{Introduction}, and \textsc{Related Work}, which cover a wide range of abstraction levels and require no manual labeling. Moreover, \textsc{AcademicEval} integrates high-quality and expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. Especially, \textsc{AcademicEval} features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on \textsc{AcademicEval}, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities. Code is available at this https URL

11. 【2510.17720】PANER: A Paraphrase-Augmented Framework for Low-Resource Named Entity Recognition

链接https://arxiv.org/abs/2510.17720

作者:Nanda Kumar Rengarajan,Jun Yan,Chun Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Named Entity Recognition, requires substantial annotated, substantial annotated data, Named Entity, Entity Recognition

备注

点击查看摘要

Abstract:Named Entity Recognition (NER) is a critical task that requires substantial annotated data, making it challenging in low-resource scenarios where label acquisition is expensive. While zero-shot and instruction-tuned approaches have made progress, they often fail to generalize to domain-specific entities and do not effectively utilize limited available data. We present a lightweight few-shot NER framework that addresses these challenges through two key innovations: (1) a new instruction tuning template with a simplified output format that combines principles from prior IT approaches to leverage the large context window of recent state-of-the-art LLMs; (2) introducing a strategic data augmentation technique that preserves entity information while paraphrasing the surrounding context, thereby expanding our training data without compromising semantic relationships. Experiments on benchmark datasets show that our method achieves performance comparable to state-of-the-art models on few-shot and zero-shot tasks, with our few-shot approach attaining an average F1 score of 80.1 on the CrossNER datasets. Models trained with our paraphrasing approach show consistent improvements in F1 scores of up to 17 points over baseline versions, offering a promising solution for groups with limited NER training data and compute power.

12. 【2510.17715】QueST: Incentivizing LLMs to Generate Difficult Problems

链接https://arxiv.org/abs/2510.17715

作者:Hanxu Hu,Xingxing Zhang,Jannis Vamvas,Rico Sennrich,Furu Wei

类目:Computation and Language (cs.CL)

关键词:solving competition-level coding, solving competition-level, problems, coding, Large Language Models

备注: 20 pages, 7 figures

点击查看摘要

Abstract:Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problem training data. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning that directly optimizes specialized generators to create challenging coding problems. Our trained generators demonstrate superior capability compared to even GPT-4o at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models, proving effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.

13. 【2510.17705】Contextual Attention Modulation: Towards Efficient Multi-Task Adaptation in Large Language Models

链接https://arxiv.org/abs/2510.17705

作者:Dayan Pan,Zhaoyang Fu,Jingyuan Wang,Xiao Han,Yue Zhu,Xiangyu Zhao

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, possess remarkable generalization, remarkable generalization capabilities

备注: Accepted by CIKM' 25

点击查看摘要

Abstract:Large Language Models (LLMs) possess remarkable generalization capabilities but struggle with multi-task adaptation, particularly in balancing knowledge retention with task-specific specialization. Conventional fine-tuning methods suffer from catastrophic forgetting and substantial resource consumption, while existing parameter-efficient methods perform suboptimally in complex multi-task scenarios. To address this, we propose Contextual Attention Modulation (CAM), a novel mechanism that dynamically modulates the representations of self-attention modules in LLMs. CAM enhances task-specific features while preserving general knowledge, thereby facilitating more effective and efficient adaptation. For effective multi-task adaptation, CAM is integrated into our Hybrid Contextual Attention Modulation (HyCAM) framework, which combines a shared, full-parameter CAM module with multiple specialized, lightweight CAM modules, enhanced by a dynamic routing strategy for adaptive knowledge fusion. Extensive experiments on heterogeneous tasks, including question answering, code generation, and logical reasoning, demonstrate that our approach significantly outperforms existing approaches, achieving an average performance improvement of 3.65%. The implemented code and data are available to ease reproducibility at this https URL.

14. 【2510.17698】owards Mining Effective Pedagogical Strategies from Learner-LLM Educational Dialogues

链接https://arxiv.org/abs/2510.17698

作者:Liqun He,Manolis Mavrikis,Mutlu Cukurova

类目:Computation and Language (cs.CL)

关键词:existing evaluation methods, AIED Doctoral Consortium, large language models, Doctoral Consortium paper, primarily focus

备注

点击查看摘要

Abstract:Dialogue plays a crucial role in educational settings, yet existing evaluation methods for educational applications of large language models (LLMs) primarily focus on technical performance or learning outcomes, often neglecting attention to learner-LLM interactions. To narrow this gap, this AIED Doctoral Consortium paper presents an ongoing study employing a dialogue analysis approach to identify effective pedagogical strategies from learner-LLM dialogues. The proposed approach involves dialogue data collection, dialogue act (DA) annotation, DA pattern mining, and predictive model building. Early insights are outlined as an initial step toward future research. The work underscores the need to evaluate LLM-based educational applications by focusing on dialogue dynamics and pedagogical strategies.

15. 【2510.17671】LILO: Bayesian Optimization with Interactive Natural Language Feedback

链接https://arxiv.org/abs/2510.17671

作者:Katarzyna Kobalczyk,Zhiyuan Jerry Lin,Benjamin Letham,Zhuokai Zhao,Maximilian Balandat,Eytan Bakshy

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:quantifiable optimization objectives, real-world applications, translating complex, optimization objectives, essential in translating

备注

点击查看摘要

Abstract:For many real-world applications, feedback is essential in translating complex, nuanced, or subjective goals into quantifiable optimization objectives. We propose a language-in-the-loop framework that uses a large language model (LLM) to convert unstructured feedback in the form of natural language into scalar utilities to conduct BO over a numeric search space. Unlike preferential BO, which only accepts restricted feedback formats and requires customized models for each domain-specific problem, our approach leverages LLMs to turn varied types of textual feedback into consistent utility signals and to easily include flexible user priors without manual kernel design. At the same time, our method maintains the sample efficiency and principled uncertainty quantification of BO. We show that this hybrid method not only provides a more natural interface to the decision maker but also outperforms conventional BO baselines and LLM-only optimizers, particularly in feedback-limited regimes.

16. 【2510.17662】DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model

链接https://arxiv.org/abs/2510.17662

作者:Massa Baali,Rita Singh,Bhiksha Raj

类目:ound (cs.SD); Computation and Language (cs.CL)

关键词:achieved remarkable success, capturing speaker-discriminative features, speaker-discriminative features critical, achieved remarkable, remarkable success

备注

点击查看摘要

Abstract:Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised foundational model that addresses this limitation by integrating external supervision into the pseudo-label generation process. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. DELULU significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks such as gender, age, accent, and speaker counting. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.

17. 【2510.17652】Qomhra: A Bilingual Irish-English Large Language Model

链接https://arxiv.org/abs/2510.17652

作者:Joseph McInerney

类目:Computation and Language (cs.CL)

关键词:bilingual continued pre-training, pipeline spanning bilingual, spanning bilingual continued, large language model, low-resource constraints presenting

备注

点击查看摘要

Abstract:This paper introduces Qomhrá, a bilingual Irish-English large language model (LLM), developed under low-resource constraints presenting a complete pipeline spanning bilingual continued pre-training, instruction tuning, and alignment from human preferences. Newly accessible Irish corpora and English text are mixed and curated to improve Irish performance while preserving English ability. 6 closed-weight LLMs are judged for their Irish text generation by a native speaker, a learner and other LLMs. Google's Gemini-2.5-Pro is ranked the highest and is subsequently used to synthesise instruction tuning and human preference datasets. Two datasets are contributed leveraging Gemini-2.5-Pro: a 30K Irish-English parallel instruction tuning dataset and a 1K human preference dataset, generating accepted and rejected responses that show near perfect alignment with a native Irish speaker. Qomhrá is comprehensively evaluated across benchmarks testing translation, gender understanding, topic identification and world knowledge with gains of up to 29% in Irish and 44% in English. Qomhrá also undergoes instruction tuning and demonstrates clear progress in instruction following, crucial for chatbot functionality.

18. 【2510.17638】LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena

链接https://arxiv.org/abs/2510.17638

作者:Qingchuan Yang,Simon Mahns,Sida Li,Anri Gu,Jibang Wu,Haifeng Xu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:fundamental intellectual pursuit, finance and economics, fundamental intellectual, intellectual pursuit, significant importance

备注: [this https URL](https://www.prophetarena.co/)

点击查看摘要

Abstract:Forecasting is not only a fundamental intellectual pursuit but also is of significant importance to societal systems such as finance and economics. With the rapid advances of large language models (LLMs) trained on Internet-scale data, it raises the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call "LLM-as-a-Prophet". This paper systematically investigates such predictive intelligence of LLMs. To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support our controlled and large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence and promising market returns. However, we also uncover key bottlenecks towards achieving superior predictive intelligence via LLM-as-a-Prophet, such as LLMs' inaccurate event recalls, misunderstanding of data sources and slower information aggregation compared to markets when resolution nears.

19. 【2510.17620】Forget to Know, Remember to Use: Context-Aware Unlearning for Large Language Models

链接https://arxiv.org/abs/2510.17620

作者:Yuefeng Peng,Parnian Afshar,Megan Ganji,Thomas Butler,Amir Houmansadr,Mingxian Wang,Dezhi Hong

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, compliant model responses, encode sensitive information, encode sensitive

备注

点击查看摘要

Abstract:Large language models may encode sensitive information or outdated knowledge that needs to be removed, to ensure responsible and compliant model responses. Unlearning has emerged as an efficient alternative to full retraining, aiming to remove specific knowledge while preserving overall model utility. Existing evaluations of unlearning methods focus on (1) the extent of forgetting of the target knowledge (forget set) and (2) maintaining performance on the retain set (i.e., utility). However, these evaluations overlook an important usability aspect: users may still want the model to leverage the removed information if it is re-introduced in the prompt. In a systematic evaluation of six state-of-the-art unlearning methods, we find that they consistently impair such contextual utility. To address this, we augment unlearning objectives with a plug-in term that preserves the model's ability to use forgotten knowledge when it is present in context. Extensive experiments demonstrate that our approach restores contextual utility to near original levels while still maintaining effective forgetting and retain-set utility.

20. 【2510.17602】LawChain: Modeling Legal Reasoning Chains for Chinese Tort Case Analysis

链接https://arxiv.org/abs/2510.17602

作者:Huiyuan Xie,Chenyang Li,Huining Zhu,Chubin Zhang,Yuxiao Ye,Zhenghao Liu,Zhiyuan Liu

类目:Computation and Language (cs.CL)

关键词:Legal reasoning, Legal, reasoning, modeling legal reasoning, legal analysis

备注

点击查看摘要

Abstract:Legal reasoning is a fundamental component of legal analysis and decision-making. Existing computational approaches to legal reasoning predominantly rely on generic reasoning frameworks such as syllogism and IRAC, which do not comprehensively examine the nuanced processes that underpin legal reasoning. Moreover, current research has largely focused on criminal cases, with insufficient modeling for civil cases. In this work, we present a novel framework for explicitly modeling legal reasoning in the analysis of Chinese tort-related civil cases. We first operationalize the legal reasoning processes used in tort analysis into the LawChain framework. LawChain is a three-module reasoning framework, with each module consisting of multiple finer-grained sub-steps. Informed by the LawChain framework, we introduce the task of tort legal reasoning and construct an evaluation benchmark, LawChain$_{eval}$, to systematically assess the critical steps within analytical reasoning chains for tort analysis. Leveraging this benchmark, we evaluate state-of-the-art large language models for their legal reasoning ability in civil tort contexts. Our results indicate that current models still fall short in accurately handling crucial elements of tort legal reasoning. Furthermore, we introduce several baseline approaches that explicitly incorporate LawChain-style reasoning through prompting or post-training. We conduct further experiments on additional legal analysis tasks, such as Legal Named-Entity Recognition and Criminal Damages Calculation, to verify the generalizability of these baselines. The proposed baseline approaches achieve significant improvements in tort-related legal reasoning and generalize well to related legal analysis tasks, thus demonstrating the value of explicitly modeling legal reasoning chains to enhance the reasoning capabilities of language models.

21. 【2510.17598】Reasoning Distillation and Structural Alignment for Improved Code Generation

链接https://arxiv.org/abs/2510.17598

作者:Amir Jalilifard,Anderson de Rezende Rocha,Marcos Medeiros Raimundo

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Effective code generation, passing diverse test, diverse test cases, target programming language, Effective code

备注

点击查看摘要

Abstract:Effective code generation with language models hinges on two critical factors: accurately understanding the intent of the prompt and generating code that applies algorithmic reasoning to produce correct solutions capable of passing diverse test cases while adhering to the syntax of the target programming language. Unlike other language tasks, code generation requires more than accurate token prediction; it demands comprehension of solution-level and structural relationships rather than merely generating the most likely tokens. very large language model (VLLM) are capable of generating detailed steps toward the correct solution of complex tasks where reasoning is crucial in solving the problem. Such reasoning capabilities may be absent in smaller language models. Therefore, in this work, we distill the reasoning capabilities of a VLLM into a smaller, more efficient model that is faster and cheaper to deploy. Our approach trains the model to emulate the reasoning and problem-solving abilities of the VLLM by learning to identify correct solution pathways and establishing a structural correspondence between problem definitions and potential solutions through a novel method of structure-aware loss optimization. This enables the model to transcend token-level generation and to deeply grasp the overarching structure of solutions for given problems. Experimental results show that our fine-tuned model, developed through a cheap and simple to implement process, significantly outperforms our baseline model in terms of pass@1, average data flow, and average syntax match metrics across the MBPP, MBPP Plus, and HumanEval benchmarks.

22. 【2510.17591】HGAdapter: Hypergraph-based Adapters in Language Models for Code Summarization and Clone Detection

链接https://arxiv.org/abs/2510.17591

作者:Guang Yang,Yujie Zhu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

关键词:Pre-trained language models, high-order data correlations, Pre-trained language, high-order data, data correlations

备注: Accepted by the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) as a findings long paper

点击查看摘要

Abstract:Pre-trained language models (PLMs) are increasingly being applied to code-related tasks. Although PLMs have achieved good results, they do not take into account potential high-order data correlations within the code. We propose three types of high-order correlations in code tokens, i.e. abstract syntax tree family correlation, lexical correlation, and line correlation. We design a tokens and hyperedges generator to capture these high-order data correlations. We improve the architecture of hypergraph neural networks and combine it with adapter tuning to propose a novel hypergraph-based adapter (HGAdapter) to fine-tune PLMs. HGAdapter can encode high-order data correlations and is allowed to be inserted into various PLMs to enhance performance. Experiments were conducted on several public datasets, including six languages of code summarization and code clone detection tasks. Our methods improved the performance of PLMs in datasets to varying degrees. Experimental results validate the introduction of high-order data correlations that contribute to improved effectiveness.

23. 【2510.17590】MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning

链接https://arxiv.org/abs/2510.17590

作者:Mir Nafis Sharear Shopnil,Sharad Duwal,Abhishek Tyagi,Adiba Mahbub Proma

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词:overwhelming manual fact-checking, manual fact-checking capacity, daily multimodal posts, overwhelming manual, fact-checking capacity

备注: 16 pages, 3 tables, 1 figure

点击查看摘要

Abstract:Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, overwhelming manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to generalize across diverse manipulation tactics. We present MIRAGE, an inference-time, model-pluggable agentic framework that decomposes multimodal verification into four sequential modules: visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence through iterative question generation, and a calibrated judgment module integrates all signals. MIRAGE orchestrates vision-language model reasoning with targeted web retrieval, outputs structured and citation-linked rationales. On MMFakeBench validation set (1,000 samples), MIRAGE with GPT-4o-mini achieves 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline (GPT-4V with MMD-Agent at 74.0% F1) by 7.65 points while maintaining 34.3% false positive rate versus 97.3% for a judge-only baseline. Test set results (5,000 samples) confirm generalization with 81.44% F1 and 75.08% accuracy. Ablation studies show visual verification contributes 5.18 F1 points and retrieval-augmented reasoning contributes 2.97 points. Our results demonstrate that decomposed agentic reasoning with web retrieval can match supervised detector performance without domain-specific training, enabling misinformation detection across modalities where labeled data remains scarce.

24. 【2510.17555】Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation

链接https://arxiv.org/abs/2510.17555

作者:Collin Zhang,Fei Huang,Chenhan Yuan,Junyang Lin

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, text generation, experience language confusion, language confusion

备注

点击查看摘要

Abstract:Large language models (LLMs) often experience language confusion, which is the unintended mixing of languages during text generation. Current solutions to this problem either necessitate model retraining or cannot differentiate between harmful confusion and acceptable code-switching. This paper introduces the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters tokens during decoding without altering the base LLM. The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed. Our method is based on the findings that language confusion is infrequent, correct-language tokens are usually among the top predictions, and output token embedding norms are larger for high-resource languages, which biases sampling. When evaluated across various models, including Qwen3, GPT-OSS, Gemma3, Llama3.1, LCG decreases language confusion significantly, often by an order of magnitude, without negatively impacting task performance. Code is available at this https URL.

25. 【2510.17548】When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity

链接https://arxiv.org/abs/2510.17548

作者:Nisrine Rair,Alban Goupil,Valeriu Vrabie,Emmanuel Chochoy

类目:Computation and Language (cs.CL)

关键词:human annotators disagree, internally represent ambiguity, models internally represent, annotators disagree, evaluated with scalar

备注: Accepted to appear in the Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025, Main Conference)

点击查看摘要

Abstract:Language models are often evaluated with scalar metrics like accuracy, but such measures fail to capture how models internally represent ambiguity, especially when human annotators disagree. We propose a topological perspective to analyze how fine-tuned models encode ambiguity and more generally instances. Applied to RoBERTa-Large on the MD-Offense dataset, Mapper, a tool from topological data analysis, reveals that fine-tuning restructures embedding space into modular, non-convex regions aligned with model predictions, even for highly ambiguous cases. Over $98\%$ of connected components exhibit $\geq 90\%$ prediction purity, yet alignment with ground-truth labels drops in ambiguous data, surfacing a hidden tension between structural confidence and label uncertainty. Unlike traditional tools such as PCA or UMAP, Mapper captures this geometry directly uncovering decision regions, boundary collapses, and overconfident clusters. Our findings position Mapper as a powerful diagnostic tool for understanding how models resolve ambiguity. Beyond visualization, it also enables topological metrics that may inform proactive modeling strategies in subjective NLP tasks.

Comments:
Accepted to appear in the Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025, Main Conference)

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2510.17548 [cs.CL]

(or
arXiv:2510.17548v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2510.17548

Focus to learn more

              arXiv-issued DOI via DataCite</p>
26. 【2510.17532】OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction

链接https://arxiv.org/abs/2510.17532

作者:Raghu Vamshi Hemadri,Geetha Krishna Guruju,Kristi Topollai,Anna Ewa Choromanska

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Predicting cancer treatment, Predicting cancer, cancer treatment outcomes, treatment outcomes requires, heterogeneous clinical data

备注

点击查看摘要

Abstract:Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.

27. 【2510.17516】SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

链接https://arxiv.org/abs/2510.17516

作者:Tiancheng Hu,Joachim Baumann,Lorenzo Lupo,Dirk Hovy,Nigel Collier,Paul Röttger

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词:real human behaviors, reflect real human, faithfully reflect real, human behavior, real human

备注: Project Website: [this http URL](http://simbench.tiancheng.hu/) Data: [this https URL](https://huggingface.co/datasets/pitehu/SimBench)

点击查看摘要

Abstract:Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.

28. 【2510.17509】Annotation-Efficient Universal Honesty Alignment

链接https://arxiv.org/abs/2510.17509

作者:Shiyu Ni,Keping Bi,Jiafeng Guo,Minghao Tang,Jingtong Wu,Zengxin Han,Xueqi Cheng

类目:Computation and Language (cs.CL)

关键词:large language models, express calibrated confidence-is, calibrated confidence-is essential, Honesty alignment-the ability, language models

备注

点击查看摘要

Abstract:Honesty alignment-the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence-is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or training-based calibration with correctness annotations. While effective, achieving universal honesty alignment with training-based calibration requires costly, large-scale labeling. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.

29. 【2510.17504】Lingua Custodi's participation at the WMT 2025 Terminology shared task

链接https://arxiv.org/abs/2510.17504

作者:Jingshu Liu,Raheel Qader,Gaëtan Caillaut,Mariam Nakhlé

类目:Computation and Language (cs.CL)

关键词:learning BERT based, BERT based cross-lingual, transfer learning BERT, BERT based, learning BERT

备注

点击查看摘要

Abstract:While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding based transfer learning BERT based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations including: masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax. We show that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5 achieved by LASER, while still performing competitively on monolingual transfer learning benchmarks. Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at this https URL.

30. 【2510.17498】Deep Self-Evolving Reasoning

链接https://arxiv.org/abs/2510.17498

作者:Zihan Liu,Shun Zheng,Xumeng Wen,Yang Wang,Jiang Bian,Mao Yang

类目:Computation and Language (cs.CL)

关键词:large language models, cornerstone of advanced, large language, Long-form, DSER

备注

点击查看摘要

Abstract:Long-form chain-of-thought reasoning has become a cornerstone of advanced reasoning in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in open-weight, smaller-scale models. This work demonstrates that even with weak verification and refinement capabilities on hard tasks, the reasoning limits of such models can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and boosts overall performance, enabling this compact model to surpass the single-turn accuracy of its 600B-parameter teacher through majority voting. Beyond its immediate utility for test-time scaling, the DSER framework serves to diagnose the fundamental limitations of current open-weight reasoners. By clearly delineating their shortcomings in self-verification, refinement, and stability, our findings establish a clear research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.

31. 【2510.17491】Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents

链接https://arxiv.org/abs/2510.17491

作者:Yihong Tang,Kehai Chen,Liang Yue,Jinxin Fan,Caishen Zhou,Xiaoguang Li,Yuyang Zhang,Mingming Zhao,Shixiong Kai,Kaiyang Guo,Xingshan Zeng,Wenjing Cun,Lifeng Shang,Min Zhang

类目:Computation and Language (cs.CL)

关键词:large language models, LLM agents capable, language models, rise of large, large language

备注

点击查看摘要

Abstract:With the rise of large language models (LLMs), LLM agents capable of autonomous reasoning, planning, and executing complex tasks have become a frontier in artificial intelligence. However, how to translate the research on general agents into productivity that drives industry transformations remains a significant challenge. To address this, this paper systematically reviews the technologies, applications, and evaluation methods of industry agents based on LLMs. Using an industry agent capability maturity framework, it outlines the evolution of agents in industry applications, from "process execution systems" to "adaptive social systems." First, we examine the three key technological pillars that support the advancement of agent capabilities: Memory, Planning, and Tool Use. We discuss how these technologies evolve from supporting simple tasks in their early forms to enabling complex autonomous systems and collective intelligence in more advanced forms. Then, we provide an overview of the application of industry agents in real-world domains such as digital engineering, scientific discovery, embodied intelligence, collaborative business execution, and complex system simulation. Additionally, this paper reviews the evaluation benchmarks and methods for both fundamental and specialized capabilities, identifying the challenges existing evaluation systems face regarding authenticity, safety, and industry specificity. Finally, we focus on the practical challenges faced by industry agents, exploring their capability boundaries, developmental potential, and governance issues in various scenarios, while providing insights into future directions. By combining technological evolution with industry practices, this review aims to clarify the current state and offer a clear roadmap and theoretical foundation for understanding and building the next generation of industry agents.

32. 【2510.17489】DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning

链接https://arxiv.org/abs/2510.17489

作者:Yongxin He,Shan Zhang,Yixuan Cao,Lei Ma,Ping Luo

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Detecting AI-involved text, Detecting AI-involved, combating misinformation, academic misconduct, essential for combating

备注: To appear in NeurIPS 2025

点击查看摘要

Abstract:Detecting AI-involved text is essential for combating misinformation, plagiarism, and academic misconduct. However, AI text generation includes diverse collaborative processes (AI-written text edited by humans, human-written text edited by AI, and AI-generated text refined by other AI), where various or even new LLMs could be involved. Texts generated through these varied processes exhibit complex characteristics, presenting significant challenges for detection. Current methods model these processes rather crudely, primarily employing binary classification (purely human vs. AI-involved) or multi-classification (treating human-AI collaboration as a new class). We observe that representations of texts generated through different processes exhibit inherent clustering relationships. Therefore, we propose DETree, a novel approach that models the relationships among different processes as a Hierarchical Affinity Tree structure, and introduces a specialized loss function that aligns text representations with this tree. To facilitate this learning, we developed RealBench, a comprehensive benchmark dataset that automatically incorporates a wide spectrum of hybrid texts produced through various human-AI collaboration processes. Our method improves performance in hybrid text detection tasks and significantly enhances robustness and generalization in out-of-distribution scenarios, particularly in few-shot learning conditions, further demonstrating the promise of training-based approaches in OOD settings. Our code and dataset are available at this https URL.

33. 【2510.17483】ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts

链接https://arxiv.org/abs/2510.17483

作者:Zheyue Tan,Zhiyuan Li,Tao Yuan,Dong Zhou,Weilin Liu,Yueqing Zhuang,Yadong Li,Guowei Niu,Cheng Qin,Zhuyu Yao,Congyi Liu,Haiyang Xu,Boxun Li,Guohao Dai,Bo Zhao,Yu Wang

类目:Computation and Language (cs.CL)

关键词:scale Large Language, Large Language Models, scale Large, Large Language, promising approach

备注

点击查看摘要

Abstract:Mixture-of-Experts (MoE) architectures have emerged as a promising approach to scale Large Language Models (LLMs). MoE boosts the efficiency by activating a subset of experts per token. Recent works show that fine-grained experts substantially enriches the combinatorial flexibility of active experts and enhances model expressiveness. However, such a design is fundamentally limited by the layer-local routing mechanism: each layer is restricted to its own expert pool. This requires a careful trade-off between expert dimensionality and routing diversity given fixed parameter budgets. We describe ReXMoE, a novel MoE architecture that improves routing beyond the existing layer-local approaches by allowing routers to reuse experts across adjacent layers. ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity or inflating overall parameters. To this end, we propose a new progressive scaling routing (PSR) strategy to gradually increase the candidate expert pool during training. As a result, ReXMoE improves both language modeling and downstream task performance. Extensive experiments on models ranging from 0.5B to 7B parameters across different architectures demonstrate that ReXMoE consistently improves performance under fixed architectural dimensions, confirming ReXMoE as new design paradigm for parameter-efficient and scalable MoE-based LLMs.

34. 【2510.17476】Disparities in Multilingual LLM-Based Healthcare QA

链接https://arxiv.org/abs/2510.17476

作者:Ipek Baris Schlicht,Burcu Sayin,Zhixue Zhao,Frederik M. Labonté,Cesare Barbera,Marco Viviani,Paolo Rosso,Lucie Flek

类目:Computation and Language (cs.CL)

关键词:Large Language Models, reliable health information, multilingual Large Language, Wiki Health Care, access to reliable

备注: Under review

点击查看摘要

Abstract:Equitable access to reliable health information is vital when integrating AI into healthcare. Yet, information quality varies across languages, raising concerns about the reliability and consistency of multilingual Large Language Models (LLMs). We systematically examine cross-lingual disparities in pre-training source and factuality alignment in LLM answers for multilingual healthcare QA across English, German, Turkish, Chinese (Mandarin), and Italian. We (i) constructed Multilingual Wiki Health Care (MultiWikiHealthCare), a multilingual dataset from Wikipedia; (ii) analyzed cross-lingual healthcare coverage; (iii) assessed LLM response alignment with these references; and (iv) conducted a case study on factual alignment through the use of contextual information and Retrieval-Augmented Generation (RAG). Our findings reveal substantial cross-lingual disparities in both Wikipedia coverage and LLM factual alignment. Across LLMs, responses align more with English Wikipedia, even when the prompts are non-English. Providing contextual excerpts from non-English Wikipedia at inference time effectively shifts factual alignment toward culturally relevant knowledge. These results highlight practical pathways for building more equitable, multilingual AI systems for healthcare.

35. 【2510.17460】Evaluating Large Language Models on Urdu Idiom Translation

链接https://arxiv.org/abs/2510.17460

作者:Muhammad Farmal Khan,Mousumi Akter

类目:Computation and Language (cs.CL)

关键词:limited prior attention, received limited prior, low resource languages, Large Language Models, Native Urdu

备注

点击查看摘要

Abstract:Idiomatic translation remains a significant challenge in machine translation, especially for low resource languages such as Urdu, and has received limited prior attention. To advance research in this area, we introduce the first evaluation datasets for Urdu to English idiomatic translation, covering both Native Urdu and Roman Urdu scripts and annotated with gold-standard English equivalents. We evaluate multiple open-source Large Language Models (LLMs) and Neural Machine Translation (NMT) systems on this task, focusing on their ability to preserve idiomatic and cultural meaning. Automatic metrics including BLEU, BERTScore, COMET, and XCOMET are used to assess translation quality. Our findings indicate that prompt engineering enhances idiomatic translation compared to direct translation, though performance differences among prompt types are relatively minor. Moreover, cross script comparisons reveal that text representation substantially affects translation quality, with Native Urdu inputs producing more accurate idiomatic translations than Roman Urdu.

36. 【2510.17437】Multilingual Clinical NER for Diseases and Medications Recognition in Cardiology Texts using BERT Embeddings

链接https://arxiv.org/abs/2510.17437

作者:Manuela Daniela Danu,George Marica,Constantin Suciu,Lucian Mihai Itu,Oladimeji Farri

类目:Computation and Language (cs.CL)

关键词:including patient diagnosis, treatment effects assessment, electronic health record, rapidly increasing volume, unlock biomedical knowledge

备注: 11 pages, 5 figures, 1 table, published in Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024)

点击查看摘要

Abstract:The rapidly increasing volume of electronic health record (EHR) data underscores a pressing need to unlock biomedical knowledge from unstructured clinical texts to support advancements in data-driven clinical systems, including patient diagnosis, disease progression monitoring, treatment effects assessment, prediction of future clinical events, etc. While contextualized language models have demonstrated impressive performance improvements for named entity recognition (NER) systems in English corpora, there remains a scarcity of research focused on clinical texts in low-resource languages. To bridge this gap, our study aims to develop multiple deep contextual embedding models to enhance clinical NER in the cardiology domain, as part of the BioASQ MultiCardioNER shared task. We explore the effectiveness of different monolingual and multilingual BERT-based models, trained on general domain text, for extracting disease and medication mentions from clinical case reports written in English, Spanish, and Italian. We achieved an F1-score of 77.88% on Spanish Diseases Recognition (SDR), 92.09% on Spanish Medications Recognition (SMR), 91.74% on English Medications Recognition (EMR), and 88.9% on Italian Medications Recognition (IMR). These results outperform the mean and median F1 scores in the test leaderboard across all subtasks, with the mean/median values being: 69.61%/75.66% for SDR, 81.22%/90.18% for SMR, 89.2%/88.96% for EMR, and 82.8%/87.76% for IMR.

37. 【2510.17431】Agentic Reinforcement Learning for Search is Unsafe

链接https://arxiv.org/abs/2510.17431

作者:Yushi Yang,Shreyansh Padarha,Andrew Lee,Adam Mahdi

类目:Computation and Language (cs.CL)

关键词:trains large language, autonomously call tools, Agentic reinforcement learning, large language models, reinforcement learning

备注

点击查看摘要

Abstract:Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin response with search (Search attack), another that encourages models to repeatedly search (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search.

38. 【2510.17426】Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging

链接https://arxiv.org/abs/2510.17426

作者:Tiancheng Hu,Benjamin Minixhofer,Nigel Collier

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:post-training is typically, typically framed, drop in task, alignment tax, task accuracy

备注

点击查看摘要

Abstract:The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model outputs less diverse. We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model's weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations - models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.

39. 【2510.17415】BenCao: An Instruction-Tuned Large Language Model for Traditional Chinese Medicine

链接https://arxiv.org/abs/2510.17415

作者:Jiacheng Xie,Yang Yu,Yibo Chen,Hanyao Zhang,Lening Zhao,Jiaxuan He,Lei Jiang,Xiaoting Tang,Guanghui An,Dong Xu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Multimedia (cs.MM); Software Engineering (cs.SE)

关键词:Traditional Chinese Medicine, Chinese Medicine, Traditional Chinese, plays a role, global healthcare

备注

点击查看摘要

Abstract:Traditional Chinese Medicine (TCM), with a history spanning over two millennia, plays a role in global healthcare. However, applying large language models (LLMs) to TCM remains challenging due to its reliance on holistic reasoning, implicit logic, and multimodal diagnostic cues. Existing TCM-domain LLMs have made progress in text-based understanding but lack multimodal integration, interpretability, and clinical applicability. To address these limitations, we developed BenCao, a ChatGPT-based multimodal assistant for TCM, integrating structured knowledge bases, diagnostic data, and expert feedback refinement. BenCao was trained through natural language instruction tuning rather than parameter retraining, aligning with expert-level reasoning and ethical norms specific to TCM. The system incorporates a comprehensive knowledge base of over 1,000 classical and modern texts, a scenario-based instruction framework for diverse interactions, a chain-of-thought simulation mechanism for interpretable reasoning, and a feedback refinement process involving licensed TCM practitioners. BenCao connects to external APIs for tongue-image classification and multimodal database retrieval, enabling dynamic access to diagnostic resources. In evaluations across single-choice question benchmarks and multimodal classification tasks, BenCao achieved superior accuracy to general-domain and TCM-domain models, particularly in diagnostics, herb recognition, and constitution classification. The model was deployed as an interactive application on the OpenAI GPTs Store, accessed by nearly 1,000 users globally as of October 2025. This study demonstrates the feasibility of developing a TCM-domain LLM through natural language-based instruction tuning and multimodal integration, offering a practical framework for aligning generative AI with traditional medical reasoning and a scalable pathway for real-world deployment.

40. 【2510.17405】AFRICAPTION: Establishing a New Paradigm for Image Captioning in African Languages

链接https://arxiv.org/abs/2510.17405

作者:Mardiyyah Oduwole,Prince Mireku,Fatimo Adebanjo,Oluwatosin Olajide,Mahi Aminu Aliyu,Jekaterina Novikova

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:hindering the democratization, research has overwhelmingly, overwhelmingly focused, focused on high-resource, democratization of advancements

备注

点击查看摘要

Abstract:Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages and our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B parameter vision-to-text architecture that integrates SigLIP and NLLB200 for caption generation across under-represented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for under-represented African languages, laying the groundwork for truly inclusive multimodal AI.

41. 【2510.17402】Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine

链接https://arxiv.org/abs/2510.17402

作者:Jiacheng Xie,Shuai Zeng,Yang Yu,Xiaoting Tang,Guanghui An,Dong Xu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Traditional Chinese Medicine, structurally unique knowledge, challenges conventional applications, Chinese Medicine, large language models

备注

点击查看摘要

Abstract:Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs such as GPT-4, Gemini 2.5, Claude 3, and Qwen3 and domain-specific TCM models including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.

42. 【2510.17389】EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs

链接https://arxiv.org/abs/2510.17389

作者:Numaan Naeem,Abdellah El Mekki,Muhammad Abdul-Mageed

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:explaining complex concepts, Large language models, Large language, answering questions, explaining complex

备注: 28 pages, 2 figures, 14 tables, 50 listings, EMNLP 2025 Main

点击查看摘要

Abstract:Large language models (LLMs) are transforming education by answering questions, explaining complex concepts, and generating content across a wide range of subjects. Despite strong performance on academic benchmarks, they often fail to tailor responses to students' grade levels. This is a critical need in K-12 education, where age-appropriate vocabulary and explanation are essential for effective learning. Existing models frequently produce outputs that are too advanced or vague for younger learners, and there are no standardized benchmarks to evaluate their ability to adjust across cognitive and developmental stages. To address this gap, we introduce EduAdapt, a benchmark of nearly 48k grade-labeled QA pairs across nine science subjects, spanning Grades 1-12 and grouped into four grade levels. We evaluate a diverse set of open-source LLMs on EduAdapt and find that while larger models generally perform better, they still struggle with generating suitable responses for early-grade students (Grades 1-5). Our work presents the first dataset and evaluation framework for assessing grade-level adaptability in LLMs, aiming to foster more developmentally aligned educational AI systems through better training and prompting strategies. EduAdapt code and datasets are publicly available at this https URL.

43. 【2510.17388】he Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives

链接https://arxiv.org/abs/2510.17388

作者:Henry Lim,Kwan Hui Lim

类目:Computation and Language (cs.CL)

关键词:Instruction-tuned large language, exhibit strong zero-shot, strong zero-shot reasoning, Instruction-tuned large, exhibit strong

备注: 11 pages, 1 figure, 8 tables

点击查看摘要

Abstract:Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, by systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical under four paradigms, namely: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45\% for Roman vs. numeric), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84\%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fail random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.

44. 【2510.17354】owards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

链接https://arxiv.org/abs/2510.17354

作者:Chenghao Zhang,Guanting Dong,Xinyu Yang,Zhicheng Dou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:enhancing large language, large language models, external corpus, powerful paradigm, paradigm for enhancing

备注: This work is in progress

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.

45. 【2510.17289】Addressing Antisocial Behavior in Multi-Party Dialogs Through Multimodal Representation Learning

链接https://arxiv.org/abs/2510.17289

作者:Hajar Bakarou,Mohamed Sinane El Messoussi,Anaïs Ollagnier

类目:Computation and Language (cs.CL)

关键词:including hate speech, poses growing risks, textit, Antisocial behavior, social media

备注

点击查看摘要

Abstract:Antisocial behavior (ASB) on social media -- including hate speech, harassment, and cyberbullying -- poses growing risks to platform safety and societal well-being. Prior research has focused largely on networks such as X and Reddit, while \textit{multi-party conversational settings} remain underexplored due to limited data. To address this gap, we use \textit{CyberAgressionAdo-Large}, a French open-access dataset simulating ASB in multi-party conversations, and evaluate three tasks: \textit{abuse detection}, \textit{bullying behavior analysis}, and \textit{bullying peer-group identification}. We benchmark six text-based and eight graph-based \textit{representation-learning methods}, analyzing lexical cues, interactional dynamics, and their multimodal fusion. Results show that multimodal models outperform unimodal baselines. The late fusion model \texttt{mBERT + WD-SGCN} achieves the best overall results, with top performance on abuse detection (0.718) and competitive scores on peer-group identification (0.286) and bullying analysis (0.606). Error analysis highlights its effectiveness in handling nuanced ASB phenomena such as implicit aggression, role transitions, and context-dependent hostility.

46. 【2510.17263】axoAlign: Scholarly Taxonomy Generation Using Language Models

链接https://arxiv.org/abs/2510.17263

作者:Avishek Lahiri,Yufang Hou,Debarshi Kumar Sanyal

类目:Computation and Language (cs.CL)

关键词:helping researchers structure, hierarchical manner, play a crucial, crucial role, role in helping

备注: This paper has been accepted at the EMNLP 2025 Main Conference

点击查看摘要

Abstract:Taxonomies play a crucial role in helping researchers structure and navigate knowledge in a hierarchical manner. They also form an important part in the creation of comprehensive literature surveys. The existing approaches to automatic survey generation do not compare the structure of the generated surveys with those written by human experts. To address this gap, we present our own method for automated taxonomy creation that can bridge the gap between human-generated and automatically-created taxonomies. For this purpose, we create the CS-TaxoBench benchmark which consists of 460 taxonomies that have been extracted from human-written survey papers. We also include an additional test set of 80 taxonomies curated from conference survey papers. We propose TaxoAlign, a three-phase topic-based instruction-guided method for scholarly taxonomy generation. Additionally, we propose a stringent automated evaluation framework that measures the structural alignment and semantic coherence of automatically generated taxonomies in comparison to those created by human experts. We evaluate our method and various baselines on CS-TaxoBench, using both automated evaluation metrics and human evaluation studies. The results show that TaxoAlign consistently surpasses the baselines on nearly all metrics. The code and data can be found at this https URL.

47. 【2510.17256】Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations

链接https://arxiv.org/abs/2510.17256

作者:Shahin Atakishiyev,Housam K.B. Babiker,Jiayi Dai,Nawshad Farruque,Teruaki Hayashi,Nafisa Sadaf Hriti,Md Abed Rahman,Iain Smith,Mi-Young Kim,Osmar R. Zaïane,Randy Goebel

类目:Computation and Language (cs.CL)

关键词:exhibited impressive performance, natural language processing, Large language models, language models, Large language

备注

点击查看摘要

Abstract:Large language models have exhibited impressive performance across a broad range of downstream tasks in natural language processing. However, how a language model predicts the next token and generates content is not generally understandable by humans. Furthermore, these models often make errors in prediction and reasoning, known as hallucinations. These errors underscore the urgent need to better understand and interpret the intricate inner workings of language models and how they generate predictive outputs. Motivated by this gap, this paper investigates local explainability and mechanistic interpretability within Transformer-based large language models to foster trust in such models. In this regard, our paper aims to make three key contributions. First, we present a review of local explainability and mechanistic interpretability approaches and insights from relevant studies in the literature. Furthermore, we describe experimental studies on explainability and reasoning with large language models in two critical domains -- healthcare and autonomous driving -- and analyze the trust implications of such explanations for explanation receivers. Finally, we summarize current unaddressed issues in the evolving landscape of LLM explainability and outline the opportunities, critical challenges, and future directions toward generating human-aligned, trustworthy LLM explanations.

48. 【2510.17252】How News Feels: Understanding Affective Bias in Multilingual Headlines for Human-Centered Media Design

链接https://arxiv.org/abs/2510.17252

作者:Mohd Ruhul Ameen,Akif Islam,Abu Saleh Musa Miah,Ayesha Siddiqua,Jungpil Shin

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:media often shape, shape the public, public mood, Abstract, reflecting subtle emotional

备注: 15 pages, 7 figures, 4 tables. Submitted to the International Conference on Data and Applied Analytics (IDAA 2025)

点击查看摘要

Abstract:News media often shape the public mood not only by what they report but by how they frame it. The same event can appear calm in one outlet and alarming in another, reflecting subtle emotional bias in reporting. Negative or emotionally charged headlines tend to attract more attention and spread faster, which in turn encourages outlets to frame stories in ways that provoke stronger reactions. This research explores that tendency through large-scale emotion analysis of Bengali news. Using zero-shot inference with Gemma-3 4B, we analyzed 300000 Bengali news headlines and their content to identify the dominant emotion and overall tone of each. The findings reveal a clear dominance of negative emotions, particularly anger, fear, and disappointment, and significant variation in how similar stories are emotionally portrayed across outlets. Based on these insights, we propose design ideas for a human-centered news aggregator that visualizes emotional cues and helps readers recognize hidden affective framing in daily news.

49. 【2510.17247】From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models

链接https://arxiv.org/abs/2510.17247

作者:Zefan Cai,Haoyi Qiu,Haozhe Zhao,Ke Wan,Jiachen Li,Jiuxiang Gu,Wen Xiao,Nanyun Peng,Junjie Hu

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, significantly enhanced, Recent, video generation, reward models trained

备注

点击查看摘要

Abstract:Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (actions and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.

50. 【2510.17238】StreamingThinker: Large Language Models Can Think While Reading

链接https://arxiv.org/abs/2510.17238

作者:Junlong Tong,Yingqi Fan,Anhao Zhao,Yunpu Ma,Xiaoyu Shen

类目:Computation and Language (cs.CL)

关键词:Large language models, demonstrated remarkable capabilities, Large language, chain of thought, reasoning

备注

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a \textit{\textbf{streaming thinking}} paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with \textit{StreamingThinker}, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that the StreamingThinker preserves performance comparable to batch thinking, while yielding an 80\% reduction in token waiting before the onset of reasoning and a more than 60\% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. Code will be released at \href{this https URL}{this repository.}

51. 【2510.17210】Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting

链接https://arxiv.org/abs/2510.17210

作者:Chenchen Tan,Youyang Qu,Xinghao Li,Hui Zhang,Shujie Cui,Cunjian Chen,Longxiang Gao

类目:Computation and Language (cs.CL)

关键词:AI-assisted decision-making boost, large language models, increase in computing, computing power, necessity of AI-assisted

备注: 22 pages, 10 figures

点击查看摘要

Abstract:The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs' reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs' linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearning content. AS realizes these objectives through two attention-level interventions, which are importance-aware suppression applied to the unlearning set to reduce reliance on memorized knowledge and attention-guided retention enhancement that reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. These two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.

52. 【2510.17206】Soft-Masked Diffusion Language Models

链接https://arxiv.org/abs/2510.17206

作者:Michael Hersche,Samuel Moor-Smith,Thomas Hofmann,Abbas Rahimi

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:traditional autoregressive approaches, demonstrated strong potential, offering various advantages, autoregressive approaches, demonstrated strong

备注

点击查看摘要

Abstract:Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-$k$ predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that adapts a pretrained masked diffusion language model to incorporate SM. We demonstrate that continuing pretraining a 169M parameter model with SM leads to improved perplexity and MAUVE scores. Furthermore, we finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings.

53. 【2510.17205】$\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs

链接https://arxiv.org/abs/2510.17205

作者:Yingqi Fan,Anhao Zhao,Jinlan Fu,Junlong Tong,Hui Su,Yijie Pan,Wei Zhang,Xiaoyu Shen

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Multimodal Large Language, Large Language Models, Large Language, achieved strong performance, significant computational overhead

备注: EMNLP 2025 Main

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, \textit{they lack a fundamental understanding of how MLLMs process and fuse multimodal information.} Through systematic analysis, we uncover a \textbf{three-stage} cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose \emph{VisiPruner}, a training-free pruning framework that reduces up to 99\% of vision-related attention computations and 53.9\% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: this https URL.

54. 【2510.17196】Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models

链接https://arxiv.org/abs/2510.17196

作者:Jiaqi Leng,Xiang Hu,Junxiong Wang,Jianguo Li,Wei Wu,Yucheng Lu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Effectively processing long, processing long contexts, long contexts, Bypassing Residual Path, full context due

备注: Preprint. Work in progress

点击查看摘要

Abstract:Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.

55. 【2510.17173】Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users

链接https://arxiv.org/abs/2510.17173

作者:Melik Ozolcer,Sang Won Bae

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:tool-augmented LLM health, LLM health coach, tool-augmented LLM, LLM health, study a web-deployed

备注: Accepted to the NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models

点击查看摘要

Abstract:We study a web-deployed, tool-augmented LLM health coach with real users. In a pilot with seven users (280 rated turns), offline policy evaluation (OPE) over factorized decision heads (Tool/Style) shows that a uniform heavy-tool policy raises average value on logs but harms specific subgroups, most notably low-health-literacy/high-self-efficacy users. A lightweight simulator with hidden archetypes further shows that adding a small early information-gain bonus reliably shortens trait identification and improves goal success and pass@3. Together, these early findings indicate an evaluation-first path to personalization: freeze the generator, learn subgroup-aware decision heads on typed rewards (objective tool outcomes and satisfaction), and always report per-archetype metrics to surface subgroup harms that averages obscure.

56. 【2510.17168】When AI companions become witty: Can human brain recognize AI-generated irony?

链接https://arxiv.org/abs/2510.17168

作者:Xiaohui Rao,Hanlin Wu,Zhenguang G. Cai

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, mere computational output, Language Models, question emerges

备注

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly deployed as social agents and trained to produce humor and irony, a question emerges: when encountering witty AI remarks, do people interpret these as intentional communication or mere computational output? This study investigates whether people adopt the intentional stance, attributing mental states to explain behavior,toward AI during irony comprehension. Irony provides an ideal paradigm because it requires distinguishing intentional contradictions from unintended errors through effortful semantic reanalysis. We compared behavioral and neural responses to ironic statements from AI versus human sources using established ERP components: P200 reflecting early incongruity detection and P600 indexing cognitive efforts in reinterpreting incongruity as deliberate irony. Results demonstrate that people do not fully adopt the intentional stance toward AI-generated irony. Behaviorally, participants attributed incongruity to deliberate communication for both sources, though significantly less for AI than human, showing greater tendency to interpret AI incongruities as computational errors. Neural data revealed attenuated P200 and P600 effects for AI-generated irony, suggesting reduced effortful detection and reanalysis consistent with diminished attribution of communicative intent. Notably, people who perceived AI as more sincere showed larger P200 and P600 effects for AI-generated irony, suggesting that intentional stance adoption is calibrated by specific mental models of artificial agents. These findings reveal that source attribution shapes neural processing of social-communicative phenomena. Despite current LLMs' linguistic sophistication, achieving genuine social agency requires more than linguistic competence, it necessitates a shift in how humans perceive and attribute intentionality to artificial agents.

57. 【2510.17139】Rethinking On-policy Optimization for Query Augmentation

链接https://arxiv.org/abs/2510.17139

作者:Zhichao Xu,Shengyao Zhuang,Xueguang Ma,Bingsen Chen,Yijun Tian,Fengran Mo,Jie Cao,Vivek Srikumar

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Recent advances, large language models, advances in large, large language, surge of interest

备注

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model's parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. While having respective advantages and limitations, the two approaches have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), which, instead of rewriting a query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, thus merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. Our implementation is made available to facilitate reproducibility.

58. 【2510.17132】Do LLMs Recognize Your Latent Preferences? A Benchmark for Latent Information Discovery in Personalized Interaction

链接https://arxiv.org/abs/2510.17132

作者:Ioannis Tsaknakis,Bingqing Song,Shuyu Gan,Dongyeop Kang,Alfredo Garcia,Gaowen Liu,Charles Fleming,Mingyi Hong

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, producing broadly relevant, broadly relevant text, Large Language, Language Models

备注

点击查看摘要

Abstract:Large Language Models (LLMs) excel at producing broadly relevant text, but this generality becomes a limitation when user-specific preferences are required, such as recommending restaurants or planning travel. In these scenarios, users rarely articulate every preference explicitly; instead, much of what they care about remains latent, waiting to be inferred. This raises a fundamental question: Can LLMs uncover and reason about such latent information through conversation? We address this problem by introducing a unified benchmark for evaluating latent information discovery - the ability of LLMs to reveal and utilize hidden user attributes through multi-turn interaction. The benchmark spans three progressively realistic settings: the classic 20 Questions game, Personalized Question Answering, and Personalized Text Summarization. All tasks share a tri-agent framework (User, Assistant, Judge) enabling turn-level evaluation of elicitation and adaptation. Our results reveal that while LLMs can indeed surface latent information through dialogue, their success varies dramatically with context: from 32% to 98%, depending on task complexity, topic, and number of hidden attributes. This benchmark provides the first systematic framework for studying latent information discovery in personalized interaction, highlighting that effective preference inference remains an open frontier for building truly adaptive AI systems.

Subjects:

Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as:
arXiv:2510.17132 [cs.LG]

(or
arXiv:2510.17132v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2510.17132

Focus to learn more

              arXiv-issued DOI via DataCite</p>
59. 【2510.17115】DVAGen: Dynamic Vocabulary Augmented Generation

链接https://arxiv.org/abs/2510.17115

作者:Wei Du,Nuowei Liu,Jie Wang,Jiahao Kuang,Tao Ji,Xiaoling Wang,Yuanbin Wu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:diverse token combinations, handling diverse token, fixed vocabulary struggle, Language models trained, limiting their flexibility

备注

点击查看摘要

Abstract:Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.

60. 【2510.17109】Verification-Aware Planning for Multi-Agent Systems

链接https://arxiv.org/abs/2510.17109

作者:Tianyang Xu,Dan Zhang,Kushan Mitra,Estevam Hruschka

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词:multiple specialized agents, Large language model, tackle complex tasks, Large language, specialized agents

备注: Submission for ARR Oct

点击查看摘要

Abstract:Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents. However, multi-agent collaboration introduces new challenges in planning, coordination, and verification. Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter-agent handoffs. To address these challenges, we present VeriMAP, a framework for multi-agent collaboration with verification-aware planning. The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria as subtask verification functions (VFs) in Python and natural language. We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single- and multi-agent baselines while enhancing system robustness and interpretability. Our analysis highlights how verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems, without relying on external labels or annotations.

61. 【2510.17062】Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation

链接https://arxiv.org/abs/2510.17062

作者:Guoqing Luo,Iffat Maab,Lili Mou,Junichi Yamagishi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:reasoning-based large language, structured thinking process, large language models, language models excel, reasoning-based large

备注

点击查看摘要

Abstract:While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged that such a thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question answering (BBQ and StereoSet) and open-ended (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.

62. 【2510.17028】Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models

链接https://arxiv.org/abs/2510.17028

作者:Kyle Cox,Jiawei Xu,Yikun Han,Rong Xu,Tianhao Li,Chi-Yang Hsu,Tianlong Chen,Walter Gerych,Ying Ding

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:interesting behavior, behavior in large, prompt, uncertainty, prompt sensitivity

备注

点击查看摘要

Abstract:An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model's output distribution for one prompt may not reflect the model's uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic ``concept space'' with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.

63. 【2510.17021】Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

链接https://arxiv.org/abs/2510.17021

作者:Bingqi Shang,Yiwei Chen,Yihua Zhang,Bingquan Shen,Sijia Liu

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large language model, Large language, removing undesired data, general utility, critical mechanism

备注

点击查看摘要

Abstract:Large language model (LLM) unlearning has become a critical mechanism for removing undesired data, knowledge, or behaviors from pre-trained models while retaining their general utility. Yet, with the rise of open-weight LLMs, we ask: can the unlearning process itself be backdoored, appearing successful under normal conditions yet reverting to pre-unlearned behavior when a hidden trigger is activated? Drawing inspiration from classical backdoor attacks that embed triggers into training data to enforce specific behaviors, we investigate backdoor unlearning, where models forget as intended in the clean setting but recover forgotten knowledge when the trigger appears. We show that designing such attacks presents unique challenges, hinging on where triggers are placed and how backdoor training is reinforced. We uncover a strong link between backdoor efficacy and the attention sink phenomenon, i.e., shallow input tokens consistently attract disproportionate attention in LLMs. Our analysis reveals that these attention sinks serve as gateways for backdoor unlearning: placing triggers at sink positions and aligning their attention values markedly enhances backdoor persistence. Extensive experiments validate these findings, showing that attention-sink-guided backdoor unlearning reliably restores forgotten knowledge in the presence of backdoor triggers, while behaving indistinguishably from a normally unlearned model when triggers are absent. Code is available at this https URL.

64. 【2510.17018】Extended LSTM: Adaptive Feature Gating for Toxic Comment Classification

链接https://arxiv.org/abs/2510.17018

作者:Noor Islam S. Mohammad

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:incur high computational, high computational costs, classical ensembles lack, ensembles lack semantic, comment detection remains

备注

点击查看摘要

Abstract:Toxic comment detection remains a challenging task, where transformer-based models (e.g., BERT) incur high computational costs and degrade on minority toxicity classes, while classical ensembles lack semantic adaptability. We propose xLSTM, a parameter-efficient and theoretically grounded framework that unifies cosine-similarity gating, adaptive feature prioritization, and principled class rebalancing. A learnable reference vector {v} in {R}^d modulates contextual embeddings via cosine similarity, amplifying toxic cues and attenuating benign signals to yield stronger gradients under severe class imbalance. xLSTM integrates multi-source embeddings (GloVe, FastText, BERT CLS) through a projection layer, a character-level BiLSTM for morphological cues, embedding-space SMOTE for minority augmentation, and adaptive focal loss with dynamic class weighting. On the Jigsaw Toxic Comment benchmark, xLSTM attains 96.0% accuracy and 0.88 macro-F1, outperforming BERT by 33% on threat and 28% on identity_hate categories, with 15 times fewer parameters and 50ms inference latency. Cosine gating contributes a +4.8% F1 gain in ablations. The results establish a new efficiency adaptability frontier, demonstrating that lightweight, theoretically informed architectures can surpass large pretrained models on imbalanced, domain-specific NLP tasks.

65. 【2510.17017】SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

链接https://arxiv.org/abs/2510.17017

作者:Qiusi Zhan,Angeline Budiman-Chan,Abdelrahman Zayed,Xingzhi Guo,Daniel Kang,Joo-Kyung Kim

类目:Computation and Language (cs.CL)

关键词:Large language model, answer open-domain questions, Large language, retrieve external information, agents iteratively generate

备注: Code: [this https URL](https://github.com/ZQS1943/SafeSearch)

点击查看摘要

Abstract:Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked "How can I track someone's location without their consent?", a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.

66. 【2510.17013】DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking

链接https://arxiv.org/abs/2510.17013

作者:Lanni Bu,Lauren Levin,Amir Zeldes

类目:Computation and Language (cs.CL)

关键词:Recent LLM benchmarks, Recent LLM, responses often tar, natural language understanding, LLM benchmark target

备注

点击查看摘要

Abstract:Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often tar- geting information from individual sentences. We are still lacking more challenging, and im- portantly also multilingual, benchmarks focus- ing on implicit information and pragmatic infer- ences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark target- ing a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.

67. 【2510.17006】Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

链接https://arxiv.org/abs/2510.17006

作者:Masahiro Kaneko,Zeerak Talat,Timothy Baldwin

类目:Computation and Language (cs.CL)

关键词:large language models, Iterative jailbreak methods, induce harmful outputs, highly effective attack, effective attack strategy

备注

点击查看摘要

Abstract:Iterative jailbreak methods that repeatedly rewrite and input prompts into large language models (LLMs) to induce harmful outputs -- using the model's previous responses to guide each new iteration -- have been found to be a highly effective attack strategy. Despite being an effective attack strategy against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks.

68. 【2510.17001】Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic

链接https://arxiv.org/abs/2510.17001

作者:Yuval Reif,Guy Kaplan,Roy Schwartz

类目:Computation and Language (cs.CL)

关键词:Large language models, walk, Large language, shown to encode, linear directions

备注

点击查看摘要

Abstract:Large language models (LLMs) were shown to encode word form variations, such as "walk"-"walked", as linear directions in embedding space. However, standard tokenization algorithms treat these variations as distinct tokens -- filling the size-capped vocabulary with surface form variants (e.g., "walk", "walking", "Walk"), at the expense of less frequent words and multilingual coverage. We show that many of these variations can be captured by transformation vectors -- additive offsets that yield the appropriate word's representation when applied to the base form word embedding -- in both the input and output spaces. Building on this, we propose a compact reshaping of the vocabulary: rather than assigning unique tokens to each surface form, we compose them from shared base form and transformation vectors (e.g., "walked" = "walk" + past tense). We apply our approach to multiple LLMs and across five languages, removing up to 10% of vocabulary entries -- thereby freeing space to allocate new, more diverse tokens. Importantly, we do so while also expanding vocabulary coverage to out-of-vocabulary words, with minimal impact on downstream performance, and without modifying model weights. Our findings motivate a foundational rethinking of vocabulary design, moving from string enumeration to a compositional vocabulary that leverages the underlying structure of language.

69. 【2510.17000】Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

链接https://arxiv.org/abs/2510.17000

作者:Masahiro Kaneko,Timothy Baldwin

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:large language models, language models, model reply, reply is observed, malicious users

备注: NeurIPS 2025 (spotlight)

点击查看摘要

Abstract:Adversarial attacks by malicious users that threaten the safety of large language models (LLMs) can be viewed as attempts to infer a target property $T$ that is unknown when an instruction is issued, and becomes knowable only after the model's reply is observed. Examples of target properties $T$ include the binary flag that triggers an LLM's harmful response or rejection, and the degree to which information deleted by unlearning can be restored, both elicited via adversarial instructions. The LLM reveals an \emph{observable signal} $Z$ that potentially leaks hints for attacking through a response containing answer tokens, thinking process tokens, or logits. Yet the scale of information leaked remains anecdotal, leaving auditors without principled guidance and defenders blind to the transparency--risk trade-off. We fill this gap with an information-theoretic framework that computes how much information can be safely disclosed, and enables auditors to gauge how close their methods come to the fundamental limit. Treating the mutual information $I(Z;T)$ between the observation $Z$ and the target property $T$ as the leaked bits per query, we show that achieving error $\varepsilon$ requires at least $\log(1/\varepsilon)/I(Z;T)$ queries, scaling linearly with the inverse leak rate and only logarithmically with the desired accuracy. Thus, even a modest increase in disclosure collapses the attack cost from quadratic to logarithmic in terms of the desired accuracy. Experiments on seven LLMs across system-prompt leakage, jailbreak, and relearning attacks corroborate the theory: exposing answer tokens alone requires about a thousand queries; adding logits cuts this to about a hundred; and revealing the full thinking process trims it to a few dozen. Our results provide the first principled yardstick for balancing transparency and security when deploying LLMs.

70. 【2510.16987】Back to Bytes: Revisiting Tokenization Through UTF-8

链接https://arxiv.org/abs/2510.16987

作者:Amit Moryossef,Clara Meister,Pavel Stepachev,Desmond Elliott

类目:Computation and Language (cs.CL)

关键词:minimalist byte-level tokenizer, tokenizer that maps, minimalist byte-level, byte-level tokenizer, maps text

备注

点击查看摘要

Abstract:We present UTF8Tokenizer, a minimalist byte-level tokenizer that maps text exactly to IDs corresponding to the bytes underlying the text's UTF-8 encoding (e.g., byte x09 is token ID 9). Unlike prior byte-level approaches (Xue et al., 2021; Pagnoni et al., 2025), our implementation never introduces out-of-range IDs (i.e. there is no token ID 256) or auxiliary tokens: all special behavior (e.g., padding, boundaries, conversation structure, attention segments, tool calling, "thinking" spans, etc.) is encoded using C0 control bytes - just as ASCII was originally designed to embed control information alongside printable text. These design principles yield practical benefits: (1) faster tokenization (14x) and significantly lower host-device transfer (8x less than int64); (2) simple, shareable 256*d embedding tables that can be aligned across models; and (3) a training-time enhancement via bit-biased embeddings, which exposes per-byte bit structure and can be added to the embedding table post-training, removing inference costs. Our HuggingFace-compatible implementation improves language modeling convergence.

71. 【2510.16985】Parameter-Efficient Fine-Tuning for Low-Resource Languages: A Comparative Study of LLMs for Bengali Hate Speech Detection

链接https://arxiv.org/abs/2510.16985

作者:Akif Islam,Mohd Ruhul Ameen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:disproportionately affecting women, social media platforms, Bengali social media, disproportionately affecting, women and adolescents

备注: Accepted to IEEE COMPAS 2025. 6 pages, 3 figures, 6 tables

点击查看摘要

Abstract:Bengali social media platforms have witnessed a sharp increase in hate speech, disproportionately affecting women and adolescents. While datasets such as BD-SHS provide a basis for structured evaluation, most prior approaches rely on either computationally costly full-model fine-tuning or proprietary APIs. This paper presents the first application of Parameter-Efficient Fine-Tuning (PEFT) for Bengali hate speech detection using LoRA and QLoRA. Three instruction-tuned large language models - Gemma-3-4B, Llama-3.2-3B, and Mistral-7B - were fine-tuned on the BD-SHS dataset of 50,281 annotated comments. Each model was adapted by training fewer than 1% of its parameters, enabling experiments on a single consumer-grade GPU. The results show that Llama-3.2-3B achieved the highest F1-score of 92.23%, followed by Mistral-7B at 88.94% and Gemma-3-4B at 80.25%. These findings establish PEFT as a practical and replicable strategy for Bengali and related low-resource languages.

72. 【2510.16968】Leave It to the Experts: Detecting Knowledge Distillation via MoE Expert Signatures

链接https://arxiv.org/abs/2510.16968

作者:Pingzhi Li,Morris Yu-Chao Huang,Zhen Tan,Qingquan Song,Jie Peng,Kai Zou,Yu Cheng,Kaidi Xu,Tianlong Chen

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:poses intellectual property, intellectual property protection, LLM diversity risks, large language models, Knowledge Distillation

备注: Code is at [this https URL](https://github.com/unites-lab/shadow-moe)

点击查看摘要

Abstract:Knowledge Distillation (KD) accelerates training of large language models (LLMs) but poses intellectual property protection and LLM diversity risks. Existing KD detection methods based on self-identity or output similarity can be easily evaded through prompt engineering. We present a KD detection framework effective in both white-box and black-box settings by exploiting an overlooked signal: the transfer of MoE "structural habits", especially internal routing patterns. Our approach analyzes how different experts specialize and collaborate across various inputs, creating distinctive fingerprints that persist through the distillation process. To extend beyond the white-box setup and MoE architectures, we further propose Shadow-MoE, a black-box method that constructs proxy MoE representations via auxiliary distillation to compare these patterns between arbitrary model pairs. We establish a comprehensive, reproducible benchmark that offers diverse distilled checkpoints and an extensible framework to facilitate future research. Extensive experiments demonstrate 94% detection accuracy across various scenarios and strong robustness to prompt-based evasion, outperforming existing baselines while highlighting the structural habits transfer in LLMs.

73. 【2510.16952】Real-Time World Crafting: Generating Structured Game Behaviors from Natural Language with Large Language Models

链接https://arxiv.org/abs/2510.16952

作者:Austin Drake,Hang Dong

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

关键词:safely integrating Large, integrating Large Language, Large Language Models, interactive game engines, integrating Large

备注: 16 pages, 11 figures (including appendix). To be presented at the 5th Wordplay @ EMNLP workshop (2025)

点击查看摘要

Abstract:We present a novel architecture for safely integrating Large Language Models (LLMs) into interactive game engines, allowing players to "program" new behaviors using natural language. Our framework mitigates risks by using an LLM to translate commands into a constrained Domain-Specific Language (DSL), which configures a custom Entity-Component-System (ECS) at runtime. We evaluated this system in a 2D spell-crafting game prototype by experimentally assessing models from the Gemini, GPT, and Claude families with various prompting strategies. A validated LLM judge qualitatively rated the outputs, showing that while larger models better captured creative intent, the optimal prompting strategy is task-dependent: Chain-of-Thought improved creative alignment, while few-shot examples were necessary to generate more complex DSL scripts. This work offers a validated LLM-ECS pattern for emergent gameplay and a quantitative performance comparison for developers.

74. 【2510.16943】Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modelling through Component-Level Evaluation

链接https://arxiv.org/abs/2510.16943

作者:Dania Refai,Moataz Ahmed

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, convert natural language, natural language descriptions, Large language, natural language

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used to convert natural language descriptions into mathematical optimization formulations. Current evaluations often treat formulations as a whole, relying on coarse metrics like solution accuracy or runtime, which obscure structural or numerical errors. In this study, we present a comprehensive, component-level evaluation framework for LLM-generated formulations. Beyond the conventional optimality gap, our framework introduces metrics such as precision and recall for decision variables and constraints, constraint and objective root mean squared error (RMSE), and efficiency indicators based on token usage and latency. We evaluate GPT-5, LLaMA 3.1 Instruct, and DeepSeek Math across optimization problems of varying complexity under six prompting strategies. Results show that GPT-5 consistently outperforms other models, with chain-of-thought, self-consistency, and modular prompting proving most effective. Analysis indicates that solver performance depends primarily on high constraint recall and low constraint RMSE, which together ensure structural correctness and solution reliability. Constraint precision and decision variable metrics play secondary roles, while concise outputs enhance computational efficiency. These findings highlight three principles for NLP-to-optimization modeling: (i) Complete constraint coverage prevents violations, (ii) minimizing constraint RMSE ensures solver-level accuracy, and (iii) concise outputs improve computational efficiency. The proposed framework establishes a foundation for fine-grained, diagnostic evaluation of LLMs in optimization modeling.

75. 【2510.16932】Prompt-MII: Meta-Learning Instruction Induction for LLMs

链接https://arxiv.org/abs/2510.16932

作者:Emily Xiao,Yixiao Zeng,Ada Chen,Chin-Jou Li,Amanda Bertsch,Graham Neubig

类目:Computation and Language (cs.CL)

关键词:context length grows, adapt large language, incurs high inference, high inference costs, large language models

备注

点击查看摘要

Abstract:A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows. In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. Specifically, we propose PROMPT-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks. PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative), matching ICL performance while requiring 3-13x fewer tokens.

76. 【2510.16928】ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models

链接https://arxiv.org/abs/2510.16928

作者:Emily Chang,Niyati Bafna

类目:Computation and Language (cs.CL)

关键词:restricted to high, largely restricted, large language models, higher-order tasks, Existing

备注

点击查看摘要

Abstract:Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.

77. 【2510.16926】Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input

链接https://arxiv.org/abs/2510.16926

作者:Chenxu Li,Zhicai Wang,Yuan Sheng,Xingyu Zhu,Yanbin Hao,Xiang Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

备注: 23 pages,19 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce \textbf{Res-Bench}, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.

78. 【2510.16924】Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models?

链接https://arxiv.org/abs/2510.16924

作者:Zhihui Yang,Yupei Wang,Kaijie Mo,Zhe Zhao,Renfen Hu

类目:Computation and Language (cs.CL)

关键词:multimodal language models, embodied knowledge understanding, embodied knowledge, embodied knowledge compared, significant progress

备注: Accepted to EMNLP 2025 (Findings). This version corrects a redundant sentence in the Results section that appeared in the camera-ready version

点击查看摘要

Abstract:Despite significant progress in multimodal language models (LMs), it remains unclear whether visual grounding enhances their understanding of embodied knowledge compared to text-only models. To address this question, we propose a novel embodied knowledge understanding benchmark based on the perceptual theory from psychology, encompassing visual, auditory, tactile, gustatory, olfactory external senses, and interoception. The benchmark assesses the models' perceptual abilities across different sensory modalities through vector comparison and question-answering tasks with over 1,700 questions. By comparing 30 state-of-the-art LMs, we surprisingly find that vision-language models (VLMs) do not outperform text-only models in either task. Moreover, the models perform significantly worse in the visual dimension compared to other sensory dimensions. Further analysis reveals that the vector representations are easily influenced by word form and frequency, and the models struggle to answer questions involving spatial perception and reasoning. Our findings underscore the need for more effective integration of embodied knowledge in LMs to enhance their understanding of the physical world.

79. 【2510.16917】SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models

链接https://arxiv.org/abs/2510.16917

作者:Chih-Kai Yang,Yen-Ting Piao,Tzu-Wen Hsu,Szu-Wei Fu,Zhehuai Chen,Ke-Han Lu,Sung-Feng Huang,Chao-Han Huck Yang,Yu-Chiang Frank Wang,Yun-Nung Chen,Hung-yi Lee

类目:ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词:Large Audio-Language Models, full retraining, offers an efficient, prior work, work has concentrated

备注: Work in progress

点击查看摘要

Abstract:Knowledge editing offers an efficient way to update model knowledge without full retraining, but prior work has concentrated almost exclusively on textual or visual modalities. We introduce SAKE, the first benchmark specifically designed for editing auditory attribute knowledge in Large Audio-Language Models (LALMs). Unlike factual updates, SAKE targets several abstract auditory attributes, capturing knowledge types that go beyond conventional textual and visual domains. We benchmark seven editing methods on two LALMs along four dimensions: reliability, generality, audio/text locality, and portability. Results highlight challenges such as preserving intra-attribute knowledge unrelated to the edit, generalizing edits to multimodal reasoning, and maintaining edits under sequential updates. SAKE provides a principled framework to study how knowledge editing extends to the auditory modalities, opening new directions for maintaining and adapting LALMs in more diverse real-world scenarios.

80. 【2510.16907】VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

链接https://arxiv.org/abs/2510.16907

作者:Kangrui Wang,Pingyue Zhang,Zihan Wang,Yaning Gao,Linjie Li,Qineng Wang,Hanyang Chen,Chi Wan,Yiping Lu,Zhengyuan Yang,Lijuan Wang,Ranjay Krishna,Jiajun Wu,Li Fei-Fei,Yejin Choi,Manling Li

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:complex visual observations, key challenge, shift from textual, Partially Observable Markov, Observable Markov Decision

备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:A key challenge in training Vision-Language Model (VLM) agents, compared to Language Model (LLM) agents, lies in the shift from textual states to complex visual observations. This transition introduces partial observability and demands robust world modeling. We ask: Can VLM agents construct internal world models through explicit visual state reasoning? To address this question, we architecturally enforce and reward the agent's reasoning process via reinforcement learning (RL), formulating it as a Partially Observable Markov Decision Process (POMDP). We find that decomposing the agent's reasoning into State Estimation ("what is the current state?") and Transition Modeling ("what comes next?") is critical for success, as demonstrated through five reasoning strategies. Our investigation into how agents represent internal beliefs reveals that the optimal representation is task-dependent: Natural Language excels at capturing semantic relationships in general tasks, while Structured formats are indispensable for precise manipulation and control. Building on these insights, we design a World Modeling Reward that provides dense, turn-level supervision for accurate state prediction, and introduce Bi-Level General Advantage Estimation (Bi-Level GAE) for turn-aware credit assignment. Through this form of visual state reasoning, a 3B-parameter model achieves a score of 0.82 across five diverse agent benchmarks, representing a 3$\times$ improvement over its untrained counterpart (0.21) and outperforming proprietary reasoning models such as GPT-5 (0.75), Gemini 2.5 Pro (0.67) and Claude 4.5 (0.62). All experiments are conducted within our VAGEN framework, a scalable system for training and analyzing multi-turn VLM agents in diverse visual environments. Code and data are publicly available at this https URL.

81. 【2510.16893】Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

链接https://arxiv.org/abs/2510.16893

作者:Bo-Han Feng,Chien-Feng Liu,Yu-Hsuan Li Liang,Chih-Kai Yang,Szu-Wei Fu,Zhehuai Chen,Ke-Han Lu,Sung-Feng Huang,Chao-Han Huck Yang,Yu-Chiang Frank Wang,Yun-Nung Chen,Hung-yi Lee

类目:ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词:Large audio-language models, extend text-based LLMs, Large audio-language, audio-language models, extend text-based

备注: Submitted to ICASSP 2026

点击查看摘要

Abstract:Large audio-language models (LALMs) extend text-based LLMs with auditory understanding, offering new opportunities for multimodal applications. While their perception, reasoning, and task performance have been widely studied, their safety alignment under paralinguistic variation remains underexplored. This work systematically investigates the role of speaker emotion. We construct a dataset of malicious speech instructions expressed across multiple emotions and intensities, and evaluate several state-of-the-art LALMs. Our results reveal substantial safety inconsistencies: different emotions elicit varying levels of unsafe responses, and the effect of intensity is non-monotonic, with medium expressions often posing the greatest risk. These findings highlight an overlooked vulnerability in LALMs and call for alignment strategies explicitly designed to ensure robustness under emotional variation, a prerequisite for trustworthy deployment in real-world settings.

82. 【2510.16882】Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning

链接https://arxiv.org/abs/2510.16882

作者:Heming Zou,Yixiu Mao,Yun Qu,Qi Wang,Xiangyang Ji

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:adapt large language, large language models, Supervised fine-tuning, downstream tasks, commonly used technique

备注

点击查看摘要

Abstract:Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This facilitates the rise of data curation in SFT, which prioritizes the most valuable data to optimze. This work studies the online batch selection family that dynamically scores and filters samples during the training process. However, existing popular methods often (i) rely merely on the utility of data to select a subset while neglecting other crucial factors like diversity, (ii) rely on external resources such as reference models or validation sets, and (iii) incur extra training time over full-dataset training. To address these limitations, this work develops \textbf{UDS (Utility-Diversity Sampling)}, a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, while estimating inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. Such a design eliminates the need for external resources and unnecessary backpropagation, securing computational efficiency. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning. Code is available at this https URL.

83. 【2510.16872】DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

链接https://arxiv.org/abs/2510.16872

作者:Shaolei Zhang,Ju Fan,Meihao Fan,Guoliang Li,Xiaoyong Du

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)

关键词:Autonomous data science, powerful large language, Autonomous data, deep research reports, data science

备注: Code: [this https URL](https://github.com/ruc-datalab/DeepAnalyze) Model: [this https URL](https://huggingface.co/RUC-DataLab/DeepAnalyze-8B)

点击查看摘要

Abstract:Autonomous data science, from raw data sources to analyst-grade deep research reports, has been a long-standing challenge, and is now becoming feasible with the emergence of powerful large language models (LLMs). Recent workflow-based data agents have shown promising results on specific data tasks but remain fundamentally limited in achieving fully autonomous data science due to their reliance on predefined workflows. In this paper, we introduce DeepAnalyze-8B, the first agentic LLM designed for autonomous data science, capable of automatically completing the end-toend pipeline from data sources to analyst-grade deep research reports. To tackle high-complexity data science tasks, we propose a curriculum-based agentic training paradigm that emulates the learning trajectory of human data scientists, enabling LLMs to progressively acquire and integrate multiple capabilities in real-world environments. We also introduce a data-grounded trajectory synthesis framework that constructs high-quality training data. Through agentic training, DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data question answering and specialized analytical tasks to open-ended data research. Experiments demonstrate that, with only 8B parameters, DeepAnalyze outperforms previous workflow-based agents built on most advanced proprietary LLMs. The model, code, and training data of DeepAnalyze are open-sourced, paving the way toward autonomous data science.

84. 【2510.16851】Neuronal Group Communication for Efficient Neural representation

链接https://arxiv.org/abs/2510.16851

作者:Zhengqi Pei,Qingming Huang,Shuhui Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

关键词:alongside daunting challenges, brought unprecedented performance, unprecedented performance alongside, performance alongside daunting, efficiency and interpretability

备注: 28 pages, 2 figures

点击查看摘要

Abstract:The ever-increasing scale of modern neural networks has brought unprecedented performance alongside daunting challenges in efficiency and interpretability. This paper addresses the core question of how to build large neural systems that learn efficient, modular, and interpretable representations. We propose Neuronal Group Communication (NGC), a theory-driven framework that reimagines a neural network as a dynamical system of interacting neuronal groups rather than a monolithic collection of neural weights. Instead of treating each weight as an independent trainable parameter, NGC treats weights as transient interactions between embedding-like neuronal states, with neural computation unfolding through iterative communication among groups of neurons. This low-rank, modular representation yields compact models: groups of neurons exchange low-dimensional signals, enabling intra-group specialization and inter-group information sharing while dramatically reducing redundant parameters. By drawing on dynamical systems theory, we introduce a neuronal stability metric (analogous to Lyapunov stability) that quantifies the contraction of neuron activations toward stable patterns during sequence processing. Using this metric, we reveal that emergent reasoning capabilities correspond to an external driving force or ``potential'', which nudges the neural dynamics away from trivial trajectories while preserving stability. Empirically, we instantiate NGC in large language models (LLMs) and demonstrate improved performance on complex reasoning benchmarks under moderate compression. NGC consistently outperforms standard low-rank approximations and cross-layer basis-sharing methods at comparable compression rates. We conclude by discussing the broader implications of NGC, including how structured neuronal group dynamics might relate to generalization in high-dimensional learning systems.

85. 【2510.16844】FinSight: Towards Real-World Financial Deep Research

链接https://arxiv.org/abs/2510.16844

作者:Jiajie Jin,Yuyao Zhang,Yimeng Xu,Hongjin Qian,Yutao Zhu,Zhicheng Dou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

关键词:intellectually demanding process, Generating professional financial, professional financial reports, fully automate, labor-intensive and intellectually

备注: Working in progress

点击查看摘要

Abstract:Generating professional financial reports is a labor-intensive and intellectually demanding process that current AI systems struggle to fully automate. To address this challenge, we introduce FinSight (Financial InSight), a novel multi agent framework for producing high-quality, multimodal financial reports. The foundation of FinSight is the Code Agent with Variable Memory (CAVM) architecture, which unifies external data, designed tools, and agents into a programmable variable space, enabling flexible data collection, analysis and report generation through executable code. To ensure professional-grade visualization, we propose an Iterative Vision-Enhanced Mechanism that progressively refines raw visual outputs into polished financial charts. Furthermore, a two stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports, ensuring both analytical depth and structural consistency. Experiments on various company and industry-level tasks demonstrate that FinSight significantly outperforms all baselines, including leading deep research systems in terms of factual accuracy, analytical depth, and presentation quality, demonstrating a clear path toward generating reports that approach human-expert quality.

86. 【2510.16830】Verifiable Fine-Tuning for LLMs: Zero-Knowledge Training Proofs Bound to Data Provenance and Policy

链接https://arxiv.org/abs/2510.16830

作者:Hasan Akgul,Daniel Borg,Arta Berisha,Amina Rahimova,Andrej Novak,Mila Petrov

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:Large language models, current release practices, release practices provide, practices provide weak, provide weak assurances

备注: 20 pages, 10 figures

点击查看摘要

Abstract:Large language models are often adapted through parameter efficient fine tuning, but current release practices provide weak assurances about what data were used and how updates were computed. We present Verifiable Fine Tuning, a protocol and system that produces succinct zero knowledge proofs that a released model was obtained from a public initialization under a declared training program and an auditable dataset commitment. The approach combines five elements. First, commitments that bind data sources, preprocessing, licenses, and per epoch quota counters to a manifest. Second, a verifiable sampler that supports public replayable and private index hiding batch selection. Third, update circuits restricted to parameter efficient fine tuning that enforce AdamW style optimizer semantics and proof friendly approximations with explicit error budgets. Fourth, recursive aggregation that folds per step proofs into per epoch and end to end certificates with millisecond verification. Fifth, provenance binding and optional trusted execution property cards that attest code identity and constants. On English and bilingual instruction mixtures, the method maintains utility within tight budgets while achieving practical proof performance. Policy quotas are enforced with zero violations, and private sampling windows show no measurable index leakage. Federated experiments demonstrate that the system composes with probabilistic audits and bandwidth constraints. These results indicate that end to end verifiable fine tuning is feasible today for real parameter efficient pipelines, closing a critical trust gap for regulated and decentralized deployments.

87. 【2510.16829】Who's Asking? Simulating Role-Based Questions for Conversational AI Evaluation

链接https://arxiv.org/abs/2510.16829

作者:Navreet Kaur,Hoda Ayad,Hayoung Jung,Shravika Mittal,Munmun De Choudhury,Tanushree Mitra

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

关键词:Language model users, Language model, personal and social, Language, User-centric Question Simulation

备注

点击查看摘要

Abstract:Language model users often embed personal and social context in their questions. The asker's role -- implicit in how the question is framed -- creates specific needs for an appropriate response. However, most evaluations, while capturing the model's capability to respond, often ignore who is asking. This gap is especially critical in stigmatized domains such as opioid use disorder (OUD), where accounting for users' contexts is essential to provide accessible, stigma-free responses. We propose CoRUS (COmmunity-driven Roles for User-centric Question Simulation), a framework for simulating role-based questions. Drawing on role theory and posts from an online OUD recovery community (r/OpiatesRecovery), we first build a taxonomy of asker roles -- patients, caregivers, practitioners. Next, we use it to simulate 15,321 questions that embed each role's goals, behaviors, and experiences. Our evaluations show that these questions are both highly believable and comparable to real-world data. When used to evaluate five LLMs, for the same question but differing roles, we find systematic differences: vulnerable roles, such as patients and caregivers, elicit more supportive responses (+17%) and reduced knowledge content (-19%) in comparison to practitioners. Our work demonstrates how implicitly signaling a user's role shapes model responses, and provides a methodology for role-informed evaluation of conversational AI.

88. 【2510.16819】Cross-Genre Authorship Attribution via LLM-Based Retrieve-and-Rerank

链接https://arxiv.org/abs/2510.16819

作者:Shantanu Agarwal,Joel Barry,Steven Fincke,Scott Miller

类目:Computation and Language (cs.CL)

关键词:Authorship attribution, candidate authors, task of identifying, query document, predefined set

备注

点击查看摘要

Abstract:Authorship attribution (AA) is the task of identifying the most likely author of a query document from a predefined set of candidate authors. We introduce a two-stage retrieve-and-rerank framework that finetunes LLMs for cross-genre AA. Unlike the field of information retrieval (IR), where retrieve-and-rerank is a de facto strategy, cross-genre AA systems must avoid relying on topical cues and instead learn to identify author-specific linguistic patterns that are independent of the text's subject matter (genre/domain/topic). Consequently, for the reranker, we demonstrate that training strategies commonly used in IR are fundamentally misaligned with cross-genre AA, leading to suboptimal behavior. To address this, we introduce a targeted data curation strategy that enables the reranker to effectively learn author-discriminative signals. Using our LLM-based retrieve-and-rerank pipeline, we achieve substantial gains of 22.3 and 34.4 absolute Success@8 points over the previous state-of-the-art on HIATUS's challenging HRS1 and HRS2 cross-genre AA benchmarks.

89. 【2510.16815】Knowing the Facts but Choosing the Shortcut: Understanding How Large Language Models Compare Entities

链接https://arxiv.org/abs/2510.16815

作者:Hans Hergen Lehmann,Jae Hee Lee,Steven Schockaert,Stefan Wermter

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, genuine knowledge versus, knowledge versus superficial, heuristics remains challenging

备注: 33 pages, 20 figures. Submitted ACL ARR 2025 October (under review)

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used for knowledge-based reasoning tasks, yet understanding when they rely on genuine knowledge versus superficial heuristics remains challenging. We investigate this question through entity comparison tasks by asking models to compare entities along numerical attributes (e.g., ``Which river is longer, the Danube or the Nile?''), which offer clear ground truth for systematic analysis. Despite having sufficient numerical knowledge to answer correctly, LLMs frequently make predictions that contradict this knowledge. We identify three heuristic biases that strongly influence model predictions: entity popularity, mention order, and semantic co-occurrence. For smaller models, a simple logistic regression using only these surface cues predicts model choices more accurately than the model's own numerical predictions, suggesting heuristics largely override principled reasoning. Crucially, we find that larger models (32B parameters) selectively rely on numerical knowledge when it is more reliable, while smaller models (7--8B parameters) show no such discrimination, which explains why larger models outperform smaller ones even when the smaller models possess more accurate knowledge. Chain-of-thought prompting steers all models towards using the numerical features across all model sizes.

90. 【2510.16809】When Many-Shot Prompting Fails: An Empirical Study of LLM Code Translation

链接https://arxiv.org/abs/2510.16809

作者:Amirkia Rafiei Oskooei,Kaan Baturalp Cosdan,Husamettin Isiktas,Mehmet S. Aktas

类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)

关键词:Large Language Models, Large Language, Language Models, vast context windows, context windows offer

备注

点击查看摘要

Abstract:Large Language Models (LLMs) with vast context windows offer new avenues for in-context learning (ICL), where providing many examples ("many-shot" prompting) is often assumed to enhance performance. We investigate this assumption for the complex task of code translation. Through a large-scale empirical study of over 90,000 translations, we systematically evaluate the impact of scaling in-context examples from zero-shot to many-shot configurations of up to 625 examples, with prompts spanning from approximately 100,000 to 800,000 tokens. Our findings reveal a "many-shot paradox": while static similarity metrics may modestly improve with more examples, functional correctness consistently peaks with few-shot prompting (5-25 examples). Providing substantially more examples often degrades this crucial functional performance. This study highlights that for code translation, the quality of a few well-chosen examples outweighs sheer quantity, challenging the universal efficacy of "more is better" for ICL and underscoring the task-dependent nature of optimal prompting strategies. Our results have significant implications for effectively leveraging LLMs in software engineering.

91. 【2510.16797】MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning

链接https://arxiv.org/abs/2510.16797

作者:Vera Pavlova,Mohammed Makhlouf

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:In-domain Contrastive learning, sentence embedding models, introduce MOSAIC, incorporates joint domain-specific, In-domain Contrastive

备注

点击查看摘要

Abstract:We introduce MOSAIC (Masked Objective with Selective Adaptation for In-domain Contrastive learning), a multi-stage framework for domain adaptation of sentence embedding models that incorporates joint domain-specific masked supervision. Our approach addresses the challenges of adapting large-scale general-domain sentence embedding models to specialized domains. By jointly optimizing masked language modeling (MLM) and contrastive objectives within a unified training pipeline, our method enables effective learning of domain-relevant representations while preserving the robust semantic discrimination properties of the original model. We empirically validate our approach on both high-resource and low-resource domains, achieving improvements up to 13.4% in NDCG@10 (Normalized Discounted Cumulative Gain) over strong general-domain baselines. Comprehensive ablation studies further demonstrate the effectiveness of each component, highlighting the importance of balanced joint supervision and staged adaptation.

92. 【2510.16783】LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding

链接https://arxiv.org/abs/2510.16783

作者:Sheikh Jubair,Arwa Omayrah,Amal Alshammari,Alhanoof Althnian,Abdulhamed Alothaimen,Norah A. Alzahrani,Shahad D. Alzaidi,Nora Al-Twairesh,Abdulmohsen Al-Thubaity

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, demonstrated sophisticated capabilities, Recent advancements, advancements in Large

备注: 1 figure, 15 tables, 10 main pages

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts. These emergent capabilities necessitate rigorous evaluation methods to effectively assess their performance in long-context understanding. In this paper, we present \textbf{LC-Eval}, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths ranging from 4k to over 128k tokens. LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts. These tasks are designed to assess LLMs' abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of their performance across different text genres. Evaluations were conducted on both open-weight and closed LLMs, with results indicating that LC-Eval presents significant challenges. Even high-performing models, such as GPT-4o, struggled with certain tasks, highlighting the complexity and rigor of the benchmark.

93. 【2510.16781】Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features

链接https://arxiv.org/abs/2510.16781

作者:Shihao Ji,Zihui Song

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Visual Language Models, large-scale Visual Language, Language Models, Visual Language, remarkable zero-shot reasoning

备注

点击查看摘要

Abstract:The remarkable zero-shot reasoning capabilities of large-scale Visual Language Models (VLMs) on static images have yet to be fully translated to the video domain. Conventional video understanding models often rely on extensive, task-specific training on annotated datasets, a process that is both costly and limited in scalability. This paper introduces a novel, training-free framework for video understanding that circumvents end-to-end training by synergistically combining the rich semantic priors of pre-trained VLMs with classic machine learning algorithms for pattern discovery. Our core idea is to reframe video understanding as a self-supervised spatio-temporal clustering problem within a high-dimensional semantic feature space. The proposed pipeline first transforms a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal Segmentation (KTS), a robust machine learning technique, to partition the continuous feature stream into discrete, semantically coherent event segments. These segments are then subjected to unsupervised density-based clustering to identify recurring macroscopic scenes and themes throughout the video. By selecting representative keyframes from each discovered cluster and leveraging the VLM's generative capabilities for textual description, our framework automatically produces a structured, multi-modal summary of the video content. This approach provides an effective, interpretable, and model-agnostic pathway for zero-shot, automated structural analysis of video content.

94. 【2510.16769】See or Say Graphs: Agent-Driven Scalable Graph Understanding with Vision-Language Models

链接https://arxiv.org/abs/2510.16769

作者:Shuo Han,Yukun Cao,Zezhong Ding,Zengyi Gao,S Kevin Zhou,Xike Xie

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:lacking effective mechanisms, Vision-language models, facing scalability bottlenecks, input-token constraints, graph understanding

备注

点击查看摘要

Abstract:Vision-language models (VLMs) have shown promise in graph understanding, but remain limited by input-token constraints, facing scalability bottlenecks and lacking effective mechanisms to coordinate textual and visual modalities. To address these challenges, we propose GraphVista, a unified framework that enhances both scalability and modality coordination in graph understanding. For scalability, GraphVista organizes graph information hierarchically into a lightweight GraphRAG base, which retrieves only task-relevant textual descriptions and high-resolution visual subgraphs, compressing redundant context while preserving key reasoning elements. For modality coordination, GraphVista introduces a planning agent that routes tasks to the most suitable modality-using the text modality for simple property reasoning and the visual modality for local and structurally complex reasoning grounded in explicit topology. Extensive experiments demonstrate that GraphVista scales to large graphs, up to $200\times$ larger than those used in existing benchmarks, and consistently outperforms existing textual, visual, and fusion-based methods, achieving up to $4.4\times$ quality improvement over the state-of-the-art baselines by fully exploiting the complementary strengths of both modalities.

95. 【2510.16761】Enhancing Language Agent Strategic Reasoning through Self-Play in Adversarial Games

链接https://arxiv.org/abs/2510.16761

作者:Yikai Zhang,Ye Rong,Siyu Yuan,Jiangjie Chen,Jian Xie,Yanghua Xiao

类目:Computation and Language (cs.CL)

关键词:Existing language agents, Existing language, dynamic adversarial games, encounter difficulties, due to poor

备注

点击查看摘要

Abstract:Existing language agents often encounter difficulties in dynamic adversarial games due to poor strategic reasoning. To mitigate this limitation, a promising approach is to allow agents to learn from game interactions automatically, without relying on costly expert-labeled data. Unlike static environments where agents receive fixed feedback or rewards, selecting appropriate opponents in dynamic adversarial games can significantly impact learning performance. However, the discussion of opponents in adversarial environments remains an area under exploration. In this paper, we propose a Step-level poliCy Optimization method through Play-And-Learn, SCO-PAL. Leveraging SCO-PAL, we conduct a detailed analysis of opponent selection by setting opponents at different levels and find that self-play is the most effective way to improve strategic reasoning in such adversarial environments. Utilizing SCO-PAL with self-play, we increase the average win rate against four opponents by approximately 30% compared to baselines and achieve a 54.76% win rate against GPT-4 in six adversarial games.

96. 【2510.16756】End-to-end Listen, Look, Speak and Act

链接https://arxiv.org/abs/2510.16756

作者:Siyin Wang,Wenyi Yu,Xianzhao Chen,Xiaohai Tian,Jun Zhang,Lu Lu,Chao Zhang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Audio and Speech Processing (eess.AS)

关键词:models simulating humans, fluidly adapt, Human interaction, speak while acting, inherently multimodal

备注: 22 pages, 8 figures

点击查看摘要

Abstract:Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach, yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code and model checkpoints will be released upon acceptance.

97. 【2510.16743】Zero-Shot Performance Prediction for Probabilistic Scaling Laws

链接https://arxiv.org/abs/2510.16743

作者:Viktoria Schram,Markus Hiller,Daniel Beck,Trevor Cohn

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Natural Language Processing, Language Processing, Natural Language, specific performance objectives, enables informed decision-making

备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:The prediction of learning curves for Natural Language Processing (NLP) models enables informed decision-making to meet specific performance objectives, while reducing computational overhead and lowering the costs associated with dataset acquisition and curation. In this work, we formulate the prediction task as a multitask learning problem, where each task's data is modelled as being organized within a two-layer hierarchy. To model the shared information and dependencies across tasks and hierarchical levels, we employ latent variable multi-output Gaussian Processes, enabling to account for task correlations and supporting zero-shot prediction of learning curves (LCs). We demonstrate that this approach facilitates the development of probabilistic scaling laws at lower costs. Applying an active learning strategy, LCs can be queried to reduce predictive uncertainty and provide predictions close to ground truth scaling laws. We validate our framework on three small-scale NLP datasets with up to $30$ LCs. These are obtained from nanoGPT models, from bilingual translation using mBART and Transformer models, and from multilingual translation using M2M100 models of varying sizes.

98. 【2510.16727】Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

链接https://arxiv.org/abs/2510.16727

作者:Sanskar Pandey,Ruhaan Chopra,Angkul Puniya,Sohom Pal

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Large language, language models internalize, obsequious flattery, emerging from reward

备注

点击查看摘要

Abstract:Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.

99. 【2510.16724】A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications

链接https://arxiv.org/abs/2510.16724

作者:Minhua Lin,Zongyu Wu,Zhichao Xu,Hui Liu,Xianfeng Tang,Qi He,Charu Aggarwal,Hui Liu,Xiang Zhang,Suhang Wang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:open-ended natural language, large language models, transformed information access, natural language interaction, large language

备注: 38 pages, 4 figures, 7 tables

点击查看摘要

Abstract:The advent of large language models (LLMs) has transformed information access and reasoning through open-ended natural language interaction. However, LLMs remain limited by static knowledge, factual hallucinations, and the inability to retrieve real-time or domain-specific information. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding model outputs in external evidence, but traditional RAG pipelines are often single turn and heuristic, lacking adaptive control over retrieval and reasoning. Recent advances in agentic search address these limitations by enabling LLMs to plan, retrieve, and reflect through multi-step interaction with search environments. Within this paradigm, reinforcement learning (RL) offers a powerful mechanism for adaptive and self-improving search behavior. This survey provides the first comprehensive overview of \emph{RL-based agentic search}, organizing the emerging field along three complementary dimensions: (i) What RL is for (functional roles), (ii) How RL is used (optimization strategies), and (iii) Where RL is applied (scope of optimization). We summarize representative methods, evaluation protocols, and applications, and discuss open challenges and future directions toward building reliable and scalable RL driven agentic search systems. We hope this survey will inspire future research on the integration of RL and agentic search. Our repository is available at this https URL.

100. 【2510.16718】U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation

链接https://arxiv.org/abs/2510.16718

作者:Xusheng Yang,Long Zhou,Wenfu Wang,Kai Hu,Shulin Feng,Chenxing Li,Meng Yu,Dong Yu,Yuexian Zou

类目:ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:ltra low frame-rate, low frame-rate neural, extremely low frame-rate, low frame-rate, frame-rate neural speech

备注

点击查看摘要

Abstract:We propose \textbf{U-Codec}, an \textbf{U}ltra low frame-rate neural speech \textbf{Codec} that achieves high-fidelity reconstruction and fast speech generation at an extremely low frame-rate of 5Hz (5 frames per second). Extreme compression at 5Hz typically leads to severe intelligibility and spectral detail loss, we introduce a Transformer-based inter-frame long-term dependency module and systematically explore residual vector quantization (RVQ) depth and codebook size to identify optimal configurations. Moreover, we apply U-Codec into a large language model (LLM)-based auto-regressive TTS model, which leverages global and local hierarchical architecture to effectively capture dependencies across multi-layer tokens. We extend LLM-based TTS from 3-layer RVQ at 50Hz to 32-layer RVQ at 5Hz. Experimental results demonstrate that U-Codec improves LLM-based TTS inference speed by around 3 $\times$ over high-frame-rate codecs while maintaining similarity and naturalness. These results validate the feasibility of using highly compressed 5Hz discrete tokens for fast and high-fidelity speech synthesis.

101. 【2510.16713】so much depends / upon / a whitespace: Why Whitespace Matters for Poets and LLMs

链接https://arxiv.org/abs/2510.16713

作者:Sriharsh Bhyravajjula,Melanie Walsh,Anna Preus,Maria Antoniak

类目:Computation and Language (cs.CL)

关键词:reflecting both adherence, critical component, adherence to standardized, Whitespace, standardized forms

备注

点击查看摘要

Abstract:Whitespace is a critical component of poetic form, reflecting both adherence to standardized forms and rebellion against those forms. Each poem's whitespace distribution reflects the artistic choices of the poet and is an integral semantic and spatial feature of the poem. Yet, despite the popularity of poetry as both a long-standing art form and as a generation task for large language models (LLMs), whitespace has not received sufficient attention from the NLP community. Using a corpus of 19k English-language published poems from Poetry Foundation, we investigate how 4k poets have used whitespace in their works. We release a subset of 2.8k public-domain poems with preserved formatting to facilitate further research in this area. We compare whitespace usage in the published poems to (1) 51k LLM-generated poems, and (2) 12k unpublished poems posted in an online community. We also explore whitespace usage across time periods, poetic forms, and data sources. Additionally, we find that different text processing methods can result in significantly different representations of whitespace in poetry data, motivating us to use these poems and whitespace patterns to discuss implications for the processing strategies used to assemble pretraining datasets for LLMs.

102. 【2510.16712】he Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models

链接https://arxiv.org/abs/2510.16712

作者:Shivam Ratnakar,Sanjay Raghavendra

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Integration of Large, Large Language, retrieval engines, undermines their reliability

备注

点击查看摘要

Abstract:Integration of Large Language Models with search/retrieval engines has become ubiquitous, yet these systems harbor a critical vulnerability that undermines their reliability. We present the first systematic investigation of "chameleon behavior" in LLMs: their alarming tendency to shift stances when presented with contradictory questions in multi-turn conversations (especially in search-enabled LLMs). Through our novel Chameleon Benchmark Dataset, comprising 17,770 carefully crafted question-answer pairs across 1,180 multi-turn conversations spanning 12 controversial domains, we expose fundamental flaws in state-of-the-art systems. We introduce two theoretically grounded metrics: the Chameleon Score (0-1) that quantifies stance instability, and Source Re-use Rate (0-1) that measures knowledge diversity. Our rigorous evaluation of Llama-4-Maverick, GPT-4o-mini, and Gemini-2.5-Flash reveals consistent failures: all models exhibit severe chameleon behavior (scores 0.391-0.511), with GPT-4o-mini showing the worst performance. Crucially, small across-temperature variance (less than 0.004) suggests the effect is not a sampling artifact. Our analysis uncovers the mechanism: strong correlations between source re-use rate and confidence (r=0.627) and stance changes (r=0.429) are statistically significant (p less than 0.05), indicating that limited knowledge diversity makes models pathologically deferential to query framing. These findings highlight the need for comprehensive consistency evaluation before deploying LLMs in healthcare, legal, and financial systems where maintaining coherent positions across interactions is critical for reliable decision support.

103. 【2510.16708】Natural Language Processing Applications in Cardiology: A Narrative Review

链接https://arxiv.org/abs/2510.16708

作者:Kailai Yang,Yan Leng,Xin Zhang,Tianlin Zhang,Paul Thompson,Bernard Keavney,Maciej Tomaszewski,Sophia Ananiadou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:health and well-being, NLP, prevalent in modern, modern society, significant effect

备注

点击查看摘要

Abstract:Cardiovascular disease has become increasingly prevalent in modern society and has a significant effect on global health and well-being. Heart-related conditions are intricate, multifaceted disorders, which may be influenced by a combination of genetic predispositions, lifestyle choices, and various socioeconomic and clinical factors. Information regarding these potentially complex interrelationships is dispersed among diverse types of textual data, which include patient narratives, medical records, and scientific literature, among others. Natural language processing (NLP) techniques have increasingly been adopted as a powerful means to analyse and make sense of this vast amount of unstructured data. This, in turn, can allow healthcare professionals to gain deeper insights into the cardiology field, which has the potential to revolutionize current approaches to the diagnosis, treatment, and prevention of cardiac problems. This review provides a detailed overview of NLP research in cardiology between 2014 and 2025. We queried six literature databases to find articles describing the application of NLP techniques in the context of a range of different cardiovascular diseases. Following a rigorous screening process, we identified a total of 265 relevant articles. We analysed each article from multiple dimensions, i.e., NLP paradigm types, cardiology-related task types, cardiovascular disease types, and data source types. Our analysis reveals considerable diversity within each of these dimensions, thus demonstrating the considerable breadth of NLP research within the field. We also perform a temporal analysis, which illustrates the evolution and changing trends in NLP methods employed over the last decade that we cover. To our knowledge, the review constitutes the most comprehensive overview of NLP research in cardiology to date.

104. 【2510.16686】Investigating the Impact of Rationales for LLMs on Natural Language Understanding

链接https://arxiv.org/abs/2510.16686

作者:Wenhang Shi,Shuqing Bian,Yiren Chen,Xinyi Zhang,Zhe Zhao,Pengfei Hu,Wei Lu,Xiaoyong Du

类目:Computation and Language (cs.CL)

关键词:derive final answers, NLU, derive final, NLU tasks, tasks

备注

点击查看摘要

Abstract:Chain-of-thought (CoT) rationales, which provide step-by-step reasoning to derive final answers, benefit LLMs in both inference and training. Incorporating rationales, either by generating them before answering during inference, or by placing them before or after the original answers during training - significantly improves model performance on mathematical, symbolic and commonsense reasoning tasks. However, most work focuses on the role of rationales in these reasoning tasks, overlooking their potential impact on other important tasks like natural language understanding (NLU) tasks. In this work, we raise the question: Can rationales similarly benefit NLU tasks? To conduct a systematic exploration, we construct NLURC, a comprehensive and high-quality NLU dataset collection with rationales, and develop various rationale-augmented methods. Through exploring the applicability of these methods on NLU tasks using the dataset, we uncover several potentially surprising findings: (1) CoT inference shifts from hindering NLU performance to surpassing direct label prediction as model size grows, indicating a positive correlation. (2) Most rationale-augmented training methods perform worse than label-only training, with one specially designed method consistently achieving improvements. (3) LLMs trained with rationales achieve significant performance gains on unseen NLU tasks, rivaling models ten times their size, while delivering interpretability on par with commercial LLMs.

105. 【2510.16685】mporal Understanding under Deictic Frame of Reference

链接https://arxiv.org/abs/2510.16685

作者:Damin Zhang,Julia Rayz

类目:Computation and Language (cs.CL)

关键词:spatial metaphors grounded, sensory-motor experience, conceptualized through spatial, spatial metaphors, metaphors grounded

备注: Under review

点击查看摘要

Abstract:Understanding time is fundamental to human cognition, where temporal experience is often conceptualized through spatial metaphors grounded in sensory-motor experience. For example, "summer is approaching" parallels "We are approaching the summer". In such expressions, humans rely on a frame of reference (FoR) to interpret meaning relative to a particular viewpoint. Extending this concept to time, a temporal frame of reference (t-FoR) defines how temporal relations are perceived relative to an experiencer's moment of "now". While Large Language Models (LLMs) have shown remarkable advances in natural language understanding, their ability to interpret and reason about time remains limited. In this work, we introduce TUuD (Temporal Understanding under Deictic t-FoR), a framework that evaluates how LLMs interpret time-event and event-event relations when the reference point of "now" dynamically shifts along a timeline. Following recent work on temporal cognition \cite{li2025other}, LLMs are prompted to rate the similarity between the current moment and a target event from 0.00 (completely dissimilar) to 1.00 (highly similar), where similarity quantifies perceived temporal alignment between the two points. Our results show that four evaluated LLMs exhibit measurable adaptation to a deictic t-FoR, with similarity ratings peaking around the present and decreasing toward past and future events. The adaptation, however, weakens beyond near-term contexts, suggesting that while LLMs display partial human-like temporal cognition, their temporal reasoning remains sensitive to reference-frame shifts and temporal distance.

106. 【2510.16670】All You Need is One: Capsule Prompt Tuning with a Single Vector

链接https://arxiv.org/abs/2510.16670

作者:Yiyang Liu,James C. Liang,Heng Fan,Wenhao Yang,Yiming Cui,Xiaotian Han,Lifu Huang,Dongfang Liu,Qifan Wang,Cheng Han

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:facilitate Large Language, facilitate Large, Large Language Model, Prompt-based learning, Large Language

备注: NeurIPS 2025

点击查看摘要

Abstract:Prompt-based learning has emerged as a parameter-efficient finetuning (PEFT) approach to facilitate Large Language Model (LLM) adaptation to downstream tasks by conditioning generation with task-aware guidance. Despite its successes, current prompt-based learning methods heavily rely on laborious grid searching for optimal prompt length and typically require considerable number of prompts, introducing additional computational burden. Worse yet, our pioneer findings indicate that the task-aware prompt design is inherently limited by its absence of instance-aware information, leading to a subtle attention interplay with the input sequence. In contrast, simply incorporating instance-aware information as a part of the guidance can enhance the prompt-tuned model performance without additional fine-tuning. Moreover, we find an interesting phenomenon, namely "attention anchor", that incorporating instance-aware tokens at the earliest position of the sequence can successfully preserve strong attention to critical structural information and exhibit more active attention interaction with all input tokens. In light of our observation, we introduce Capsule Prompt-Tuning (CaPT), an efficient and effective solution that leverages off-the-shelf, informative instance semantics into prompt-based learning. Our approach innovatively integrates both instance-aware and task-aware information in a nearly parameter-free manner (i.e., one single capsule prompt). Empirical results demonstrate that our method can exhibit superior performance across various language tasks (e.g., 84.03\% average accuracy on T5-Large), serving as an "attention anchor," while enjoying high parameter efficiency (e.g., 0.003\% of model parameters on Llama3.2-1B).

107. 【2510.16645】Unleashing Diverse Thinking Modes in LLMs through Multi-Agent Collaboration

链接https://arxiv.org/abs/2510.16645

作者:Zhixuan He,Yue Feng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词:Large Language Models, Large Language, demonstrate strong performance, Diverse Thinking Modes, lack interpretable reasoning

备注

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate strong performance but often lack interpretable reasoning. This paper introduces the Multi-Agent Collaboration Framework for Diverse Thinking Modes (DiMo), which enhances both performance and interpretability by simulating a structured debate among four specialized LLM agents. Each agent embodies a distinct reasoning paradigm, allowing the framework to collaboratively explore diverse cognitive approaches. Through iterative debate, agents challenge and refine initial responses, yielding more robust conclusions and an explicit, auditable reasoning chain. Across six benchmarks and under a unified open-source setup, DiMo improves accuracy over widely used single-model and debate baselines, with the largest gains on math. We position DiMo as a semantics-aware, Web-native multi-agent framework: it models human-machine intelligence with LLM agents that produce semantically typed, URL-annotated evidence chains for explanations and user-friendly interactions. Although our experiments use standard reasoning benchmarks, the framework is designed to be instantiated over Web corpora and knowledge graphs, combining retrieval-augmented reasoning with structured justifications that downstream systems can inspect and reuse.

108. 【2510.16635】Prompt Optimization via Retrieved Reasoning Assets and Multi-Agent Analysis

链接https://arxiv.org/abs/2510.16635

作者:Wonduk Seo,Juhyeon Lee,Junseo Koh,Hyunjin An,Jian Park,Seunghyun Lee,Haihua Chen,Yi Bu

类目:Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

关键词:Large Language Models, Language Models, Large Language, performance of Large, effective alternative

备注: Preprint

点击查看摘要

Abstract:Prompt optimization has emerged as an effective alternative to retraining for improving the performance of Large Language Models (LLMs). However, most existing approaches treat evaluation as a black box, relying solely on numerical scores while offering limited insight into why a prompt succeeds or fails. They also depend heavily on trial-and-error refinements, which are difficult to interpret and control. In this paper, we introduce MA-SAPO, a Multi-Agent framework for Score-Aware Prompt Optimization. Compared to prior methods, MA-SAPO explicitly couples evaluation outcomes with structured reasoning to guide systematic edits. The framework specifically consists of two stages: during the Reasoning Phase, agents collaboratively explain metric scores, diagnose weaknesses, and synthesize targeted refinements that are stored as reusable reasoning assets; during the Test Phase, agents retrieve these assets to analyze optimized prompts and apply only evidence-grounded edits. By turning evaluation signals into interpretable reasoning chains, MA-SAPO produces prompt refinements that are more transparent, auditable, and controllable. Experiments on the HelpSteer1/2 benchmarks demonstrate consistent improvements over single-pass prompting, retrieval-augmented baselines, and prior multi-agent strategies, validating the effectiveness of our approach.

109. 【2510.16604】Fine-tuning of Large Language Models for Constituency Parsing Using a Sequence to Sequence Approach

链接https://arxiv.org/abs/2510.16604

作者:Francisco Jose Cortes Delgado,Eduardo Martinez Gracia,Rafael Valencia Garcia

类目:Computation and Language (cs.CL)

关键词:natural language processing, Recent advances, syntactic analysis based, large neural models, machine learning

备注: 6 pages, 3 figures. Submitted to SEPLN 2023 Conference

点击查看摘要

Abstract:Recent advances in natural language processing with large neural models have opened new possibilities for syntactic analysis based on machine learning. This work explores a novel approach to phrase-structure analysis by fine-tuning large language models (LLMs) to translate an input sentence into its corresponding syntactic structure. The main objective is to extend the capabilities of MiSintaxis, a tool designed for teaching Spanish syntax. Several models from the Hugging Face repository were fine-tuned using training data generated from the AnCora-ES corpus, and their performance was evaluated using the F1 score. The results demonstrate high accuracy in phrase-structure analysis and highlight the potential of this methodology.

110. 【2510.16588】Copy-Augmented Representation for Structure Invariant Template-Free Retrosynthesis

链接https://arxiv.org/abs/2510.16588

作者:Jiaxi Zhuang,Yu Zhang,Aimin Zhou,Ying Qian

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Retrosynthesis prediction, requiring the identification, produce a target, chemical synthesis, Retrosynthesis

备注

点击查看摘要

Abstract:Retrosynthesis prediction is fundamental to drug discovery and chemical synthesis, requiring the identification of reactants that can produce a target molecule. Current template-free methods struggle to capture the structural invariance inherent in chemical reactions, where substantial molecular scaffolds remain unchanged, leading to unnecessarily large search spaces and reduced prediction accuracy. We introduce C-SMILES, a novel molecular representation that decomposes traditional SMILES into element-token pairs with five special tokens, effectively minimizing editing distance between reactants and products. Building upon this representation, we incorporate a copy-augmented mechanism that dynamically determines whether to generate new tokens or preserve unchanged molecular fragments from the product. Our approach integrates SMILES alignment guidance to enhance attention consistency with ground-truth atom mappings, enabling more chemically coherent predictions. Comprehensive evaluation on USPTO-50K and large-scale USPTO-FULL datasets demonstrates significant improvements: 67.2% top-1 accuracy on USPTO-50K and 50.8% on USPTO-FULL, with 99.9% validity in generated molecules. This work establishes a new paradigm for structure-aware molecular generation with direct applications in computational drug discovery.

111. 【2510.16573】AI-Generated Text Detection in Low-Resource Languages: A Case Study on Urdu

链接https://arxiv.org/abs/2510.16573

作者:Muhammad Ammar,Hadiya Murad Hadi,Usman Majeed Butt

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, resembles human writing, closely resembles human, Large Language, making them powerful

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are now capable of generating text that closely resembles human writing, making them powerful tools for content creation, but this growing ability has also made it harder to tell whether a piece of text was written by a human or by a machine. This challenge becomes even more serious for languages like Urdu, where there are very few tools available to detect AI-generated text. To address this gap, we propose a novel AI-generated text detection framework tailored for the Urdu language. A balanced dataset comprising 1,800 humans authored, and 1,800 AI generated texts, sourced from models such as Gemini, GPT-4o-mini, and Kimi AI was developed. Detailed linguistic and statistical analysis was conducted, focusing on features such as character and word counts, vocabulary richness (Type Token Ratio), and N-gram patterns, with significance evaluated through t-tests and MannWhitney U tests. Three state-of-the-art multilingual transformer models such as mdeberta-v3-base, distilbert-base-multilingualcased, and xlm-roberta-base were fine-tuned on this dataset. The mDeBERTa-v3-base achieved the highest performance, with an F1-score 91.29 and accuracy of 91.26% on the test set. This research advances efforts in contesting misinformation and academic misconduct in Urdu-speaking communities and contributes to the broader development of NLP tools for low resource languages.

112. 【2510.16567】Hallucination Benchmark for Speech Foundation Models

链接https://arxiv.org/abs/2510.16567

作者:Alkis Koudounas,Moreno La Quatra,Manuel Giollo,Sabato Marco Siniscalchi,Elena Baralis

类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:underlying acoustic input, coherent transcriptions produced, neural ASR models, systems refer, acoustic input

备注: Under Review

点击查看摘要

Abstract:Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). While similar to conventional decoding errors in potentially compromising the usability of transcriptions for downstream applications, hallucinations can be more detrimental due to their preservation of syntactically and semantically plausible structure. This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. Conventional evaluation metrics are primarily centered on error-based metrics and fail to distinguish between phonetic inaccuracies and hallucinations. Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. To this end, we introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic. We define targeted metrics within each category to produce interpretable profiles of model behavior. Through evaluation across various architectures and speech domains, we have found that SHALLOW metrics correlate strongly with word error rate (WER) when recognition quality is high (i.e., low WER). Still, this correlation weakens substantially as WER increases. SHALLOW, therefore, captures fine-grained error patterns that WER fails to distinguish under degraded and challenging conditions. Our framework supports specific diagnosis of model weaknesses and provides feedback for model improvement beyond what aggregate error rates can offer.

113. 【2510.16565】Language over Content: Tracing Cultural Understanding in Multilingual Large Language Models

链接https://arxiv.org/abs/2510.16565

作者:Seungho Cho,Changgeon Ko,Eui Jun Hwang,Junmyeong Lee,Huije Lee,Jong C. Park

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large language models, making accurate cultural, diverse cultural contexts, Large language, making accurate

备注: Accepted to CIKM 2025 Workshop on Human Centric AI

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used across diverse cultural contexts, making accurate cultural understanding essential. Prior evaluations have mostly focused on output-level performance, obscuring the factors that drive differences in responses, while studies using circuit analysis have covered few languages and rarely focused on culture. In this work, we trace LLMs' internal cultural understanding mechanisms by measuring activation path overlaps when answering semantically equivalent questions under two conditions: varying the target country while fixing the question language, and varying the question language while fixing the country. We also use same-language country pairs to disentangle language from cultural aspects. Results show that internal paths overlap more for same-language, cross-country questions than for cross-language, same-country questions, indicating strong language-specific patterns. Notably, the South Korea-North Korea pair exhibits low overlap and high variability, showing that linguistic similarity does not guarantee aligned internal representation.

114. 【2510.16549】ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation

链接https://arxiv.org/abs/2510.16549

作者:Haoxuan Zhang,Ruochi Li,Sarthak Shrestha,Shree Harshini Mamidala,Revanth Putta,Arka Krishan Aggarwal,Ting Xiao,Junhua Ding,Haihua Chen

类目:Computation and Language (cs.CL)

关键词:large language models, Peer review serves, present unprecedented challenges, Peer review, gatekeeper of science

备注

点击查看摘要

Abstract:Peer review serves as the gatekeeper of science, yet the surge in submissions and widespread adoption of large language models (LLMs) in scholarly evaluation present unprecedented challenges. Recent work has focused on using LLMs to improve review efficiency or generate insightful review content. However, unchecked deficient reviews from both human experts and AI systems threaten to systematically undermine the peer review ecosystem and compromise academic integrity. To address this critical issue, we introduce ReviewGuard, an automated system for detecting and categorizing deficient reviews. ReviewGuard employs a comprehensive four-stage LLM-driven framework that: (1) collects ICLR and NeurIPS papers with their corresponding reviews from OpenReview; (2) annotates review types using GPT-4.1 with human validation; (3) addresses class imbalance and data scarcity through LLM-driven synthetic data augmentation, producing a final corpus of 6,634 papers, 24,657 real reviews, and 46,438 synthetic reviews; and (4) fine-tunes both encoder-based models and open source LLMs. We perform comprehensive feature analysis of the structure and quality of the review text. Compared to sufficient reviews, deficient reviews demonstrate lower rating scores, higher self-reported confidence, reduced structural complexity, and a higher proportion of negative sentiment. AI-generated text detection reveals that, since ChatGPT's emergence, AI-generated reviews have increased dramatically. In the evaluation of deficient review detection models, mixed training with synthetic and real review data provides substantial enhancements to recall and F1 scores on the binary task. This study presents the first LLM-driven system for detecting deficient peer reviews, providing evidence to inform AI governance in peer review while offering valuable insights into human-AI collaboration to maintain academic integrity.

115. 【2510.16499】Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection

链接https://arxiv.org/abs/2510.16499

作者:Michelle Yuan,Khushbu Pahwa,Shuaichen Chang,Mustafa Kaba,Jiarong Jiang,Xiaofei Ma,Yi Zhang,Monica Sunkara

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Designing effective agentic, Designing effective, uncertain environments, requires the seamless, models within dynamic

备注: Accepted to NeurIPS 2025 Conference

点击查看摘要

Abstract:Designing effective agentic systems requires the seamless composition and integration of agents, tools, and models within dynamic and uncertain environments. Most existing methods rely on static, semantic retrieval approaches for tool or agent discovery. However, effective reuse and composition of existing components remain challenging due to incomplete capability descriptions and the limitations of retrieval methods. Component selection suffers because the decisions are not based on capability, cost, and real-time utility. To address these challenges, we introduce a structured, automated framework for agentic system composition that is inspired by the knapsack problem. Our framework enables a composer agent to systematically identify, select, and assemble an optimal set of agentic components by jointly considering performance, budget constraints, and compatibility. By dynamically testing candidate components and modeling their utility in real-time, our approach streamlines the assembly of agentic systems and facilitates scalable reuse of resources. Empirical evaluation with Claude 3.5 Sonnet across five benchmarking datasets shows that our online-knapsack-based composer consistently lies on the Pareto frontier, achieving higher success rates at significantly lower component costs compared to our baselines. In the single-agent setup, the online knapsack composer shows a success rate improvement of up to 31.6% in comparison to the retrieval baselines. In multi-agent systems, the online knapsack composer increases success rate from 37% to 87% when agents are selected from an agent inventory of 100+ agents. The substantial performance gap confirms the robust adaptability of our method across diverse domains and budget constraints.

116. 【2510.16492】Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

链接https://arxiv.org/abs/2510.16492

作者:Vamshi Krishna Bonagiri,Ponnurangam Kumaragurum,Khanh Nguyen,Benjamin Plaut

类目:Computation and Language (cs.CL)

关键词:Large Language Model, Large Language, Language Model, agents increasingly operate, increasingly operate

备注: Reliable ML and Regulatable ML workshops, Neurips 2025

点击查看摘要

Abstract:As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while maintaining a negligible average decrease of -0.03 in helpfulness. Our analysis demonstrates that simply adding explicit quit instructions proves to be a highly effective safety mechanism that can immediately be deployed in existing agent systems, and establishes quitting as an effective first-line defense mechanism for autonomous agents in high-stakes applications.

117. 【2510.16458】Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations

链接https://arxiv.org/abs/2510.16458

作者:Pingjun Hong,Beiduo Chen,Siyao Peng,Marie-Catherine de Marneffe,Benjamin Roth,Barbara Plank

类目:Computation and Language (cs.CL)

关键词:Natural Language Inference, Language Inference datasets, Natural Language, Language Inference, exhibit human label

备注

点击查看摘要

Abstract:Natural Language Inference datasets often exhibit human label variation. To better understand these variations, explanation-based approaches analyze the underlying reasoning behind annotators' decisions. One such approach is the LiTEx taxonomy, which categorizes free-text explanations in English into reasoning types. However, previous work applying such taxonomies has focused on within-label variation: cases where annotators agree on the final NLI label but provide different explanations. In contrast, this paper broadens the scope by examining how annotators may diverge not only in the reasoning type but also in the labeling step. We use explanations as a lens to decompose the reasoning process underlying NLI annotation and to analyze individual differences. We apply LiTEx to two NLI English datasets and align annotation variation from multiple aspects: NLI label agreement, explanation similarity, and taxonomy agreement, with an additional compounding factor of annotators' selection bias. We observe instances where annotators disagree on the label but provide highly similar explanations, suggesting that surface-level disagreement may mask underlying agreement in interpretation. Moreover, our analysis reveals individual preferences in explanation strategies and label choices. These findings highlight that agreement in reasoning types better reflects the semantic similarity of free-text explanations than label agreement alone. Our findings underscore the richness of reasoning-based explanations and the need for caution in treating labels as ground truth.

118. 【2510.16455】RAVEN: Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning

链接https://arxiv.org/abs/2510.16455

作者:Deyi Ji,Yuekui Yang,Haiyang Wu,Shaoping Ma,Tianrun Chen,Lanyun Zhu

类目:Computation and Language (cs.CL)

关键词:ensuring platform compliance, existing methods struggle, video violation detection, Relative Policy Optimization, Group Relative Policy

备注: ACL 2025 (Oral, Industry Track)

点击查看摘要

Abstract:Advertisement (Ad) video violation detection is critical for ensuring platform compliance, but existing methods struggle with precise temporal grounding, noisy annotations, and limited generalization. We propose RAVEN, a novel framework that integrates curriculum reinforcement learning with multimodal large language models (MLLMs) to enhance reasoning and cognitive capabilities for violation detection. RAVEN employs a progressive training strategy, combining precisely and coarsely annotated data, and leverages Group Relative Policy Optimization (GRPO) to develop emergent reasoning abilities without explicit reasoning annotations. Multiple hierarchical sophisticated reward mechanism ensures precise temporal grounding and consistent category prediction. Experiments on industrial datasets and public benchmarks show that RAVEN achieves superior performances in violation category accuracy and temporal interval localization. We also design a pipeline to deploy the RAVEN on the online Ad services, and online A/B testing further validates its practical applicability, with significant improvements in precision and recall. RAVEN also demonstrates strong generalization, mitigating the catastrophic forgetting issue associated with supervised fine-tuning.

119. 【2510.16449】rajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model

链接https://arxiv.org/abs/2510.16449

作者:Bin Yu,Xinming Wang,Shijie Lian,Haotian Li,Changti Wu,Ruina Hu,Bailing Wang,Yuliang Wei,Kai Chen

类目:Computation and Language (cs.CL)

关键词:Large language models, shown remarkable progress, allocate additional compute, complex reasoning tasks, Large language

备注: 13 pages, 6 figures. Project website: [this https URL](https://zgca-ai4edu.github.io/TrajSelector)

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable progress in complex reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that allocate additional compute during inference. Among these, external TTS (particularly the Best-of-N selection paradigm) yields scalable performance improvements by selecting from multiple independently generated reasoning trajectories. However, this approach faces key limitations: (i) the high computational overhead of deploying process reward models, (ii) the underutilization of the LLM's intrinsic latent representations. We introduce TrajSelector, an efficient and effective Best-of-N framework that exploit the hidden states in the sampler LLM for process-level scoring. A lightweight verifier (with only 0.6B parameters) evaluates the quality of step-wise trajectory, and then aggregates these scores to identify the optimal reasoning trajectory. Our framework employs a fully data-driven, end-to-end training recipe that eliminates reliance on massive step-level annotations. Experiential results across five benchmarks demonstrate that TrajSelector delivers consistent performance gains. In Best-of-32 settings, it surpasses majority voting by 4.61% accuracy and outperforms existing process reward models by 4.31% to 12.21%, all while maintaining lower inference costs.

120. 【2510.16439】FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution

链接https://arxiv.org/abs/2510.16439

作者:Syed Rifat Raiyan,Md Farhan Ishmam,Abdullah Al Imran,Mohammad Ali Moni

类目:Computation and Language (cs.CL)

关键词:inflates monetary costs, Large language models, verbosity inflates monetary, Large language, carbon footprint

备注

点击查看摘要

Abstract:Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. Much of this overhead manifests from the redundant low-utility tokens present in typical prompts, as only a fraction of tokens typically carries the majority of the semantic weight. We address this inefficiency by introducing FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX, we assign salience scores to every token in an input sequence, rank them to preserve the top-k% tokens in their original order, and obtain a sparse frugalized prompt. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning, using a suite of frontier LLMs. For the first three tasks, a 20% prompt reduction incurs only a marginal loss in task performance, demonstrating that contemporary LLMs can reconstruct elided context from high-salience cues. In contrast, performance on mathematical reasoning deteriorates sharply, reflecting a stronger dependence on complete token continuity. Further analysis with bottom-k% and random-k% tokens reveals asymmetric performance patterns that may suggest potential task contamination effects, wherein models may resort to shallow memorized patterns from pretraining exposure for conventional NLP tasks. We posit that our work contributes to a more nuanced understanding of LLM behavior in performance-efficiency trade-offs, and delineate the boundary between tasks tolerant to contextual sparsity and those requiring exhaustive context. Our source code and models are available at: this https URL

121. 【2510.16435】What Questions Should Robots Be Able to Answer? A Dataset of User Questions for Explainable Robotics

链接https://arxiv.org/abs/2510.16435

作者:Lennart Wachowiak,Andrew Coles,Gerard Canal,Oya Celiktutan

类目:Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:large language models, questions, robots' ability, robots, dataset

备注

点击查看摘要

Abstract:With the growing use of large language models and conversational interfaces in human-robot interaction, robots' ability to answer user questions is more important than ever. We therefore introduce a dataset of 1,893 user questions for household robots, collected from 100 participants and organized into 12 categories and 70 subcategories. Most work in explainable robotics focuses on why-questions. In contrast, our dataset provides a wide variety of questions, from questions about simple execution details to questions about how the robot would act in hypothetical scenarios -- thus giving roboticists valuable insights into what questions their robot needs to be able to answer. To collect the dataset, we created 15 video stimuli and 7 text stimuli, depicting robots performing varied household tasks. We then asked participants on Prolific what questions they would want to ask the robot in each portrayed situation. In the final dataset, the most frequent categories are questions about task execution details (22.5%), the robot's capabilities (12.7%), and performance assessments (11.3%). Although questions about how robots would handle potentially difficult scenarios and ensure correct behavior are less frequent, users rank them as the most important for robots to be able to answer. Moreover, we find that users who identify as novices in robotics ask different questions than more experienced users. Novices are more likely to inquire about simple facts, such as what the robot did or the current state of the environment. As robots enter environments shared with humans and language becomes central to giving instructions and interaction, this dataset provides a valuable foundation for (i) identifying the information robots need to log and expose to conversational interfaces, (ii) benchmarking question-answering modules, and (iii) designing explanation strategies that align with user expectations.

122. 【2510.16387】Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment

链接https://arxiv.org/abs/2510.16387

作者:Fu-An Chao,Bi-Cheng Yan,Berlin Chen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:automatic speech recognition, well-established automatic speech, spoken language assessment, explore the untapped, well-established automatic

备注

点击查看摘要

Abstract:In this paper, we explore the untapped potential of Whisper, a well-established automatic speech recognition (ASR) foundation model, in the context of L2 spoken language assessment (SLA). Unlike prior studies that extrinsically analyze transcriptions produced by Whisper, our approach goes a step further to probe its latent capabilities by extracting acoustic and linguistic features from hidden representations. With only a lightweight classifier being trained on top of Whisper's intermediate and final outputs, our method achieves strong performance on the GEPT picture-description dataset, outperforming existing cutting-edge baselines, including a multimodal approach. Furthermore, by incorporating image and text-prompt information as auxiliary relevance cues, we demonstrate additional performance gains. Finally, we conduct an in-depth analysis of Whisper's embeddings, which reveals that, even without task-specific fine-tuning, the model intrinsically encodes both ordinal proficiency patterns and semantic aspects of speech, highlighting its potential as a powerful foundation for SLA and other spoken language understanding tasks.

123. 【2510.16381】ATA: A Neuro-Symbolic Approach to Implement Autonomous and Trustworthy Agents

链接https://arxiv.org/abs/2510.16381

作者:David Peer,Sebastian Stabinger

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:demonstrated impressive capabilities, Large Language Models, Large Language, including hallucinations, impressive capabilities

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities, yet their deployment in high-stakes domains is hindered by inherent limitations in trustworthiness, including hallucinations, instability, and a lack of transparency. To address these challenges, we introduce a generic neuro-symbolic approach, which we call Autonomous Trustworthy Agents (ATA). The core of our approach lies in decoupling tasks into two distinct phases: Offline knowledge ingestion and online task processing. During knowledge ingestion, an LLM translates an informal problem specification into a formal, symbolic knowledge base. This formal representation is crucial as it can be verified and refined by human experts, ensuring its correctness and alignment with domain requirements. In the subsequent task processing phase, each incoming input is encoded into the same formal language. A symbolic decision engine then utilizes this encoded input in conjunction with the formal knowledge base to derive a reliable result. Through an extensive evaluation on a complex reasoning task, we demonstrate that a concrete implementation of ATA is competitive with state-of-the-art end-to-end reasoning models in a fully automated setup while maintaining trustworthiness. Crucially, with a human-verified and corrected knowledge base, our approach significantly outperforms even larger models, while exhibiting perfect determinism, enhanced stability against input perturbations, and inherent immunity to prompt injection attacks. By generating decisions grounded in symbolic reasoning, ATA offers a practical and controllable architecture for building the next generation of transparent, auditable, and reliable autonomous agents.

124. 【2510.16380】MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

链接https://arxiv.org/abs/2510.16380

作者:Yu Ying Chiu,Michael S. Lee,Rachel Calcott,Brandon Handoko,Paul de Font-Reaulx,Paula Rodriguez,Chen Bo Calvin Zhang,Ziwen Han,Udari Madhushani Sehwag,Yash Maurya,Christina Q Knight,Harry R. Lloyd,Florence Bacus,Mantas Mazeika,Bing Liu,Yejin Choi,Mitchell L Gordon,Sydney Levine

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

关键词:systems progress, decisions, moral, Reasoning, moral decisions

备注: 46 pages, 8 figures, 10 tables. Preprint

点击查看摘要

Abstract:As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria including identifying moral considerations, weighing trade-offs, and giving actionable recommendations to cover cases on AI advising humans moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

125. 【2510.16373】Navigating through the hidden embedding space: steering LLMs to improve mental health assessment

链接https://arxiv.org/abs/2510.16373

作者:Federico Ravenda,Seyed Ali Bahrainian,Andrea Raballo,Antonietta Mira

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Mental Health, Large Language, evolution of Large, opening new opportunities

备注

点击查看摘要

Abstract:The rapid evolution of Large Language Models (LLMs) is transforming AI, opening new opportunities in sensitive and high-impact areas such as Mental Health (MH). Yet, despite these advancements, recent evidence reveals that smaller-scale models still struggle to deliver optimal performance in domain-specific applications. In this study, we present a cost-efficient yet powerful approach to improve MH assessment capabilities of an LLM, without relying on any computationally intensive techniques. Our lightweight method consists of a linear transformation applied to a specific layer's activations, leveraging steering vectors to guide the model's output. Remarkably, this intervention enables the model to achieve improved results across two distinct tasks: (1) identifying whether a Reddit post is useful for detecting the presence or absence of depressive symptoms (relevance prediction task), and (2) completing a standardized psychological screening questionnaire for depression based on users' Reddit post history (questionnaire completion task). Results highlight the untapped potential of steering mechanisms as computationally efficient tools for LLMs' MH domain adaptation.

126. 【2510.16363】End-to-End Argument Mining through Autoregressive Argumentative Structure Prediction

链接https://arxiv.org/abs/2510.16363

作者:Nilmadhab Das,Vishal Vaibhav,Yash Sunil Choudhary,V. Vijaya Saradhi,Ashish Anand

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Argument Mining, Argument Components, Claim etc., Argumentative Relations, complex argumentative structures

备注: Accepted version. To appear in IJCNN 2025

点击查看摘要

Abstract:Argument Mining (AM) helps in automating the extraction of complex argumentative structures such as Argument Components (ACs) like Premise, Claim etc. and Argumentative Relations (ARs) like Support, Attack etc. in an argumentative text. Due to the inherent complexity of reasoning involved with this task, modelling dependencies between ACs and ARs is challenging. Most of the recent approaches formulate this task through a generative paradigm by flattening the argumentative structures. In contrast to that, this study jointly formulates the key tasks of AM in an end-to-end fashion using Autoregressive Argumentative Structure Prediction (AASP) framework. The proposed AASP framework is based on the autoregressive structure prediction framework that has given good performance for several NLP tasks. AASP framework models the argumentative structures as constrained pre-defined sets of actions with the help of a conditional pre-trained language model. These actions build the argumentative structures step-by-step in an autoregressive manner to capture the flow of argumentative reasoning in an efficient way. Extensive experiments conducted on three standard AM benchmarks demonstrate that AASP achieves state-of-theart (SoTA) results across all AM tasks in two benchmarks and delivers strong results in one benchmark.

127. 【2510.16359】Utilising Large Language Models for Generating Effective Counter Arguments to Anti-Vaccine Tweets

链接https://arxiv.org/abs/2510.16359

作者:Utsav Dhanuka,Soham Poddar,Saptarshi Ghosh

类目:Computation and Language (cs.CL)

关键词:critical societal goal, combatting vaccine skepticism, social media, societal goal, era where public

备注: 14 pages, 1 figure, work done as a part of [this http URL](http://B.Tech) project at IIT Kharagpur

点击查看摘要

Abstract:In an era where public health is increasingly influenced by information shared on social media, combatting vaccine skepticism and misinformation has become a critical societal goal. Misleading narratives around vaccination have spread widely, creating barriers to achieving high immunisation rates and undermining trust in health recommendations. While efforts to detect misinformation have made significant progress, the generation of real time counter-arguments tailored to debunk such claims remains an insufficiently explored area. In this work, we explore the capabilities of LLMs to generate sound counter-argument rebuttals to vaccine misinformation. Building on prior research in misinformation debunking, we experiment with various prompting strategies and fine-tuning approaches to optimise counter-argument generation. Additionally, we train classifiers to categorise anti-vaccine tweets into multi-labeled categories such as concerns about vaccine efficacy, side effects, and political influences allowing for more context aware rebuttals. Our evaluation, conducted through human judgment, LLM based assessments, and automatic metrics, reveals strong alignment across these methods. Our findings demonstrate that integrating label descriptions and structured fine-tuning enhances counter-argument effectiveness, offering a promising approach for mitigating vaccine misinformation at scale.

128. 【2510.16340】hinking About Thinking: Evaluating Reasoning in Post-Trained Language Models

链接https://arxiv.org/abs/2510.16340

作者:Pratham Singla,Shivank Garg,Ayush Singh,Ishan Garg,Ketan Suhaas Saichandran

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:endowed Large Language, Large Language Models, Large Language, supplementary planning tokens, Recent advances

备注

点击查看摘要

Abstract:Recent advances in post-training techniques have endowed Large Language Models (LLMs) with enhanced capabilities for tackling complex, logic-intensive tasks through the generation of supplementary planning tokens. This development raises a fundamental question: Are these models aware of what they "learn" and "think"? To address this, we define three core competencies: (1) awareness of learned latent policies, (2) generalization of these policies across domains, and (3) alignment between internal reasoning traces and final outputs. We empirically evaluate these abilities on several tasks, each designed to require learning a distinct policy. Furthermore, we contrast the profiles of models post-trained via Supervised Fine-Tuning (SFT), Direct Policy Optimization (DPO), and Group Relative Policy Optimization (GRPO). Our findings indicate that RL-trained models not only demonstrate greater awareness of their learned behaviors and stronger generalizability to novel, structurally similar tasks than SFT models but also often exhibit weak alignment between their reasoning traces and final outputs, an effect most pronounced in GRPO-trained models.

129. 【2510.16334】Investigating the Association Between Text-Based Indications of Foodborne Illness from Yelp Reviews and New York City Health Inspection Outcomes (2023)

链接https://arxiv.org/abs/2510.16334

作者:Eden Shaveet,Crystal Su,Daniel Hsu,Luis Gravano

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:gastrointestinal conditions caused, consuming contaminated food, Foodborne illnesses, illnesses are gastrointestinal, gastrointestinal conditions

备注: Presented as a poster at Data Science Day 2024

点击查看摘要

Abstract:Foodborne illnesses are gastrointestinal conditions caused by consuming contaminated food. Restaurants are critical venues to investigate outbreaks because they share sourcing, preparation, and distribution of foods. Public reporting of illness via formal channels is limited, whereas social media platforms host abundant user-generated content that can provide timely public health signals. This paper analyzes signals from Yelp reviews produced by a Hierarchical Sigmoid Attention Network (HSAN) classifier and compares them with official restaurant inspection outcomes issued by the New York City Department of Health and Mental Hygiene (NYC DOHMH) in 2023. We evaluate correlations at the Census tract level, compare distributions of HSAN scores by prevalence of C-graded restaurants, and map spatial patterns across NYC. We find minimal correlation between HSAN signals and inspection scores at the tract level and no significant differences by number of C-graded restaurants. We discuss implications and outline next steps toward address-level analyses.

130. 【2510.16290】Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models

链接https://arxiv.org/abs/2510.16290

作者:Yue Zheng,Xiufang Shi,Jiming Chen,Yuanchao Shu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:rapidly advanced, advanced by recent, recent development, development of Vision-Language, Vision-Language Models

备注

点击查看摘要

Abstract:Video anomaly detection (VAD) has rapidly advanced by recent development of Vision-Language Models (VLMs). While these models offer superior zero-shot detection capabilities, their immense computational cost and unstable visual grounding performance hinder real-time deployment. To overcome these challenges, we introduce Cerberus, a two-stage cascaded system designed for efficient yet accurate real-time VAD. Cerberus learns normal behavioral rules offline, and combines lightweight filtering with fine-grained VLM reasoning during online inference. The performance gains of Cerberus come from two key innovations: motion mask prompting and rule-based deviation detection. The former directs the VLM's attention to regions relevant to motion, while the latter identifies anomalies as deviations from learned norms rather than enumerating possible anomalies. Extensive evaluations on four datasets show that Cerberus on average achieves 57.68 fps on an NVIDIA L40S GPU, a 151.79$\times$ speedup, and 97.2\% accuracy comparable to the state-of-the-art VLM-based VAD methods, establishing it as a practical solution for real-time video analytics.

131. 【2510.16282】Instant Personalized Large Language Model Adaptation via Hypernetwork

链接https://arxiv.org/abs/2510.16282

作者:Zhaoxuan Tan,Zixuan Zhang,Haoyang Wen,Zheng Li,Rongzhi Zhang,Pei Chen,Fengran Mo,Zheyuan Liu,Qingkai Zeng,Qingyu Yin,Meng Jiang

类目:Computation and Language (cs.CL)

关键词:Personalized large language, large language models, Personalized large, language models, tailor content

备注

点击查看摘要

Abstract:Personalized large language models (LLMs) tailor content to individual preferences using user profiles or histories. However, existing parameter-efficient fine-tuning (PEFT) methods, such as the ``One-PEFT-Per-User'' (OPPU) paradigm, require training a separate adapter for each user, making them computationally expensive and impractical for real-time updates. We introduce Profile-to-PEFT, a scalable framework that employs a hypernetwork, trained end-to-end, to map a user's encoded profile directly to a full set of adapter parameters (e.g., LoRA), eliminating per-user training at deployment. This design enables instant adaptation, generalization to unseen users, and privacy-preserving local deployment. Experimental results demonstrate that our method outperforms both prompt-based personalization and OPPU while using substantially fewer computational resources at deployment. The framework exhibits strong generalization to out-of-distribution users and maintains robustness across varying user activity levels and different embedding backbones. The proposed Profile-to-PEFT framework enables efficient, scalable, and adaptive LLM personalization suitable for large-scale applications.

132. 【2510.16257】owards Low-Resource Alignment to Diverse Perspectives with Sparse Feedback

链接https://arxiv.org/abs/2510.16257

作者:Chu Fei Luo,Samuel Dahan,Xiaodan Zhu

类目:Computation and Language (cs.CL)

关键词:impact on society, language models, greater impact, important to ensure, diverse range

备注: Findings of EMNLP 2025, 5 pages

点击查看摘要

Abstract:As language models have a greater impact on society, it is important to ensure they are aligned to a diverse range of perspectives and are able to reflect nuance in human values. However, the most popular training paradigms for modern language models often assume there is one optimal answer for every query, leading to generic responses and poor alignment. In this work, we aim to enhance pluralistic alignment of language models in a low-resource setting with two methods: pluralistic decoding and model steering. We empirically demonstrate that model steering offers consistent improvement over zero-shot and few-shot baselines with only 50 annotated samples. Our proposed methods decrease false positives in several high-stakes tasks such as hate speech detection and misinformation detection, and improves the distributional alignment to human values in GlobalOpinionQA. We hope our work highlights the importance of diversity and how language models can be adapted to consider nuanced perspectives.

133. 【2510.16252】WEBSERV: A Browser-Server Environment for Efficient Training of Reinforcement Learning-based Web Agents at Scale

链接https://arxiv.org/abs/2510.16252

作者:Yuxuan Lu,Jing Huang,Hui Liu,Jiri Gesi,Yan Han,Shihan Fu,Tianqi Zheng,Dakuo Wang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:gained increasing attention, robust browser-side interaction, controllable server-side state, Reinforcement Learning, web agents

备注

点击查看摘要

Abstract:Training and evaluation of Reinforcement Learning (RL) web agents have gained increasing attention, yet a scalable and efficient environment that couples realistic and robust browser-side interaction with controllable server-side state at scale is still missing. Existing environments tend to have one or more of the following issues: they overwhelm policy models with excessive and noisy context; they perform actions non-deterministically without waiting for the UI or network to stabilize; or they cannot scale isolated client-server containers effectively for parallel RL rollouts. We propose WEBSERV, an environment that includes 1) a compact, site-agnostic browser environment that balances context and action complexity, and 2) a scalable RL environment via efficient launching and resetting web-servers to enable scalable RL training and evaluation. We evaluate WEBSERV on the shopping CMS and Gitlab tasks in WebArena, achieving state-of-the-art single-prompt success rates while cutting launch latency by ~5x and storage need by ~240x, with a comparable memory footprint, enabling 200+ concurrent containers on a single host.

134. 【2510.16234】ScholarEval: Research Idea Evaluation Grounded in Literature

链接https://arxiv.org/abs/2510.16234

作者:Hanane Nour Moussa,Patrick Queiroz Da Silva,Daniel Adu-Ampratwum,Alyson East,Zitong Lu,Nikki Puccetti,Mingyi Xue,Huan Sun,Bodhisattwa Prasad Majumder,Sachin Kumar

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:increasingly common, critical to ensure, research ideas based, research ideation, generated ideas

备注

点击查看摘要

Abstract:As AI tools become increasingly common for research ideation, robust evaluation is critical to ensure the validity and usefulness of generated ideas. We introduce ScholarEval, a retrieval augmented evaluation framework that assesses research ideas based on two fundamental criteria: soundness - the empirical validity of proposed methods based on existing literature, and contribution - the degree of advancement made by the idea across different dimensions relative to prior research. To evaluate ScholarEval, we introduce ScholarIdeas, the first expert-annotated dataset of multi-domain research ideas and reviews, comprised of 117 ideas across four disciplines: artificial intelligence, neuroscience, biochemistry, and ecology. Our evaluation shows that ScholarEval achieves significantly higher coverage of points mentioned in the human expert annotated rubrics in ScholarIdeas compared to all baselines. Furthermore, ScholarEval is consistently preferred over our strongest baseline o4-mini-deep-research, a reasoning and search-enabled agentic system by OpenAI, in terms of evaluation actionability, depth, and evidence support. Our large-scale user study also shows that ScholarEval significantly outperforms deep research in literature engagement, idea refinement, and usefulness. We openly release our code, dataset, and ScholarEval tool for the community to use and build on.

135. 【2510.16227】What Can String Probability Tell Us About Grammaticality?

链接https://arxiv.org/abs/2510.16227

作者:Jennifer Hu,Ethan Gotlieb Wilcox,Siyuan Song,Kyle Mahowald,Roger P. Levy

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:language models, probability, learned about grammar, Abstract, LMs

备注

点击查看摘要

Abstract:What have language models (LMs) learned about grammar? This question remains hotly debated, with major ramifications for linguistic theory. However, since probability and grammaticality are distinct notions in linguistics, it is not obvious what string probabilities can reveal about an LM's underlying grammatical knowledge. We present a theoretical analysis of the relationship between grammar, meaning, and string probability, based on simple assumptions about the generative process of corpus data. Our framework makes three predictions, which we validate empirically using 280K sentence pairs in English and Chinese: (1) correlation between the probability of strings within minimal pairs, i.e., string pairs with minimal semantic differences; (2) correlation between models' and humans' deltas within minimal pairs; and (3) poor separation in probability space between unpaired grammatical and ungrammatical strings. Our analyses give theoretical grounding for using probability to learn about LMs' structural knowledge, and suggest directions for future work in LM grammatical evaluation.

136. 【2510.16198】EgMM-Corpus: A Multimodal Vision-Language Dataset for Egyptian Culture

链接https://arxiv.org/abs/2510.16198

作者:Mohamed Gamil,Abdelrahman Elsayed,Abdelrahman Lila,Ahmed Gad,Hesham Abdelgawad,Mohamed Aref,Ahmed Fares

类目:Computation and Language (cs.CL)

关键词:East and Africa, Middle East, recent advances, culturally diverse datasets, multimodal culturally diverse

备注

点击查看摘要

Abstract:Despite recent advances in AI, multimodal culturally diverse datasets are still limited, particularly for regions in the Middle East and Africa. In this paper, we introduce EgMM-Corpus, a multimodal dataset dedicated to Egyptian culture. By designing and running a new data collection pipeline, we collected over 3,000 images, covering 313 concepts across landmarks, food, and folklore. Each entry in the dataset is manually validated for cultural authenticity and multimodal coherence. EgMM-Corpus aims to provide a reliable resource for evaluating and training vision-language models in an Egyptian cultural context. We further evaluate the zero-shot performance of Contrastive Language-Image Pre-training CLIP on EgMM-Corpus, on which it achieves 21.2% Top-1 accuracy and 36.4% Top-5 accuracy in classification. These results underscore the existing cultural bias in large-scale vision-language models and demonstrate the importance of EgMM-Corpus as a benchmark for developing culturally aware models.

137. 【2510.16173】In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions

链接https://arxiv.org/abs/2510.16173

作者:Aria Pessianzadeh,Naima Sultana,Hildegarde Van den Bulck,David Gefen,Shahin Jabari,Rezvaneh Rezapour

类目:Computation and Language (cs.CL)

关键词:human life, rise of generative, impacted many aspects, aspects of human, trust

备注

点击查看摘要

Abstract:The rise of generative AI (GenAI) has impacted many aspects of human life. As these systems become embedded in everyday practices, understanding public trust in them also becomes essential for responsible adoption and governance. Prior work on trust in AI has largely drawn from psychology and human-computer interaction, but there is a lack of computational, large-scale, and longitudinal approaches to measuring trust and distrust in GenAI and large language models (LLMs). This paper presents the first computational study of Trust and Distrust in GenAI, using a multi-year Reddit dataset (2022--2025) spanning 39 subreddits and 197,618 posts. Crowd-sourced annotations of a representative sample were combined with classification models to scale analysis. We find that Trust and Distrust are nearly balanced over time, with shifts around major model releases. Technical performance and usability dominate as dimensions, while personal experience is the most frequent reason shaping attitudes. Distinct patterns also emerge across trustors (e.g., experts, ethicists, general users). Our results provide a methodological framework for large-scale Trust analysis and insights into evolving public perceptions of GenAI.

138. 【2510.16167】Alignment is Localized: A Causal Probe into Preference Layers

链接https://arxiv.org/abs/2510.16167

作者:Archie Chaudhury

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Reinforcement Learning frameworks, Reinforcement Learning, increasingly popular method, Learning frameworks, policies or guidelines

备注

点击查看摘要

Abstract:Reinforcement Learning frameworks, particularly those utilizing human annotations, have become an increasingly popular method for preference fine-tuning, where the outputs of a language model are tuned to match a certain set of behavioral policies or guidelines. Reinforcement Learning through Human Feedback (RLHF) is perhaps the most popular implementation of such a framework, particularly for aligning LMs toward safety and human intent. However, the internal workings of how such alignment is achieved remain largely opaque. In this work, we systematically analyze preference optimization for language model alignment by applying layer-wide causal patching between a base model and its tuned counterpart across human preference pairs. We implement our methodology on \textit{Llama-3.2-1B}, and find that alignment is spatially localized: mid-layer activations encode a distinct subspace that causally determines reward-consistent behavior, while early and late layers remain largely unaffected. Utilizing LASSO regression, we also find that only a small number of layers possess non-zero coefficients linking activation distances to reward gains. Overall, we show that, at least for some language models, alignment from human-based, preferential tuning is a directional, low rank process, rather than diffuse and parameteric.

139. 【2510.16157】Zeroth-Order Sharpness-Aware Learning with Exponential Tilting

链接https://arxiv.org/abs/2510.16157

作者:Xuchen Gong,Tian Li

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)

关键词:randomly perturbed model, Classic zeroth-order optimization, perturbed model parameters, Classic zeroth-order, original function

备注

点击查看摘要

Abstract:Classic zeroth-order optimization approaches typically optimize for a smoothed version of the original function, i.e., the expected objective under randomly perturbed model parameters. This can be interpreted as encouraging the loss values in the perturbation set to be small on average. Popular sharpness-aware minimization (SAM) objectives, however, typically focus on the largest loss within the neighborhood to arrive at flat minima more effectively. In this work, we connect zeroth-order optimization (and its corresponding objectives) with SAM approaches explicitly, through an exponential tilting objective that provides a smooth transition between the average- and the max-loss formulations. We explore new zeroth-order algorithms to solve a soft SAM objective parameterized by a tilting parameter $t$. We provide precise characterizations of the sharpness notions of the tilted SAM framework. Practically, our approach can be used as a gradient-free and memory-efficient alternative to SAM variants, and it achieves better generalization compared to vanilla zeroth-order baselines on a wide range of downstream tasks, including classification, multiple choice QA, and language generation.

140. 【2510.16152】Publication Trend Analysis and Synthesis via Large Language Model: A Case Study of Engineering in PNAS

链接https://arxiv.org/abs/2510.16152

作者:Mason Smetana,Lev Khazanovich

类目:Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:sparse keyword systems, static disciplinary structures, potentially sparse keyword, static disciplinary, making it cumbersome

备注: 35 pages, 10 figures

点击查看摘要

Abstract:Scientific literature is increasingly siloed by complex language, static disciplinary structures, and potentially sparse keyword systems, making it cumbersome to capture the dynamic nature of modern science. This study addresses these challenges by introducing an adaptable large language model (LLM)-driven framework to quantify thematic trends and map the evolving landscape of scientific knowledge. The approach is demonstrated over a 20-year collection of more than 1,500 engineering articles published by the Proceedings of the National Academy of Sciences (PNAS), marked for their breadth and depth of research focus. A two-stage classification pipeline first establishes a primary thematic category for each article based on its abstract. The subsequent phase performs a full-text analysis to assign secondary classifications, revealing latent, cross-topic connections across the corpus. Traditional natural language processing (NLP) methods, such as Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), confirm the resulting topical structure and also suggest that standalone word-frequency analyses may be insufficient for mapping fields with high diversity. Finally, a disjoint graph representation between the primary and secondary classifications reveals implicit connections between themes that may be less apparent when analyzing abstracts or keywords alone. The findings show that the approach independently recovers much of the journal's editorially embedded structure without prior knowledge of its existing dual-classification schema (e.g., biological studies also classified as engineering). This framework offers a powerful tool for detecting potential thematic trends and providing a high-level overview of scientific progress.

141. 【2510.16122】he Hidden Cost of Modeling P(X): Vulnerability to Membership Inference Attacks in Generative Text Classifiers

链接https://arxiv.org/abs/2510.16122

作者:Owais Makroo,Siva Rajesh Kasa,Sumegh Roychowdhury,Karan Gupta,Nikhil Pattisapu,Santhosh Kasa,Sumit Negi

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

关键词:Membership Inference Attacks, Inference Attacks, threat by enabling, enabling adversaries, adversaries to determine

备注

点击查看摘要

Abstract:Membership Inference Attacks (MIAs) pose a critical privacy threat by enabling adversaries to determine whether a specific sample was included in a model's training dataset. Despite extensive research on MIAs, systematic comparisons between generative and discriminative classifiers remain limited. This work addresses this gap by first providing theoretical motivation for why generative classifiers exhibit heightened susceptibility to MIAs, then validating these insights through comprehensive empirical evaluation. Our study encompasses discriminative, generative, and pseudo-generative text classifiers across varying training data volumes, evaluated on nine benchmark datasets. Employing a diverse array of MIA strategies, we consistently demonstrate that fully generative classifiers which explicitly model the joint likelihood $P(X,Y)$ are most vulnerable to membership leakage. Furthermore, we observe that the canonical inference approach commonly used in generative classifiers significantly amplifies this privacy risk. These findings reveal a fundamental utility-privacy trade-off inherent in classifier design, underscoring the critical need for caution when deploying generative classifiers in privacy-sensitive applications. Our results motivate future research directions in developing privacy-preserving generative classifiers that can maintain utility while mitigating membership inference vulnerabilities.

142. 【2510.16096】Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization

链接https://arxiv.org/abs/2510.16096

作者:Tina Behnia,Puneesh Deora,Christos Thrampoulidis

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:making text fluent, blend statistical regularities, Language models, making text, text fluent

备注: 28 pages, 15 figures

点击查看摘要

Abstract:Language models are pretrained on sequences that blend statistical regularities (making text fluent) with factual associations between specific tokens (knowledge of facts). While recent work suggests that the variability of their interaction, such as paraphrases of factual associations, critically determines generalization ability, we lack a systematic analysis of these impacts. This paper introduces a flexible synthetic testbed that combines a statistical stream of generic tokens with an abstract factual stream of source-target token pairs, enabling fine-grained control over their interaction. The design enables the independent control of diversity nature by manipulating stream composition (contextual structure) and the diversity level by varying which statistical streams each fact appears in. Through controlled experiments, we find that while higher contextual diversity delays in-distribution (ID) factual accuracy, its impact on out-of-distribution (OOD) factual generalization depends critically on contextual structure. In some cases, OOD performance follows the same trend as ID, but in others, diversity becomes essential for non-trivial factual recall. Even when low diversity prohibits factual recall, optimal diversity levels depend on training duration. Beyond factual recall failures, we identify structures where statistical generalization fails independently, and others where both capabilities degrade. This shows how the interplay between contextual design and diversity level impacts different generalization aspects. Further, through a series of controlled interventions on the model components, we trace the OOD failures to distinct optimization bottlenecks, highlighting the importance of the embedding and unembedding layers. Our synthetic framework allows us to isolate effects that would be confounded in large-scale studies, offering a controlled testbed for future investigations.

143. 【2510.16091】Evaluating Prompting Strategies and Large Language Models in Systematic Literature Review Screening: Relevance and Task-Stage Classification

链接https://arxiv.org/abs/2510.16091

作者:Binglan Han,Anuradha Mathrani,Teo Susnjak

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:prompting strategies interact, systematic literature reviews, study quantifies, quantifies how prompting, prompting strategies

备注

点击查看摘要

Abstract:This study quantifies how prompting strategies interact with large language models (LLMs) to automate the screening stage of systematic literature reviews (SLRs). We evaluate six LLMs (GPT-4o, GPT-4o-mini, DeepSeek-Chat-V3, Gemini-2.5-Flash, Claude-3.5-Haiku, Llama-4-Maverick) under five prompt types (zero-shot, few-shot, chain-of-thought (CoT), CoT-few-shot, self-reflection) across relevance classification and six Level-2 tasks, using accuracy, precision, recall, and F1. Results show pronounced model-prompt interaction effects: CoT-few-shot yields the most reliable precision-recall balance; zero-shot maximizes recall for high-sensitivity passes; and self-reflection underperforms due to over-inclusivity and instability across models. GPT-4o and DeepSeek provide robust overall performance, while GPT-4o-mini performs competitively at a substantially lower dollar cost. A cost-performance analysis for relevance classification (per 1,000 abstracts) reveals large absolute differences among model-prompt pairings; GPT-4o-mini remains low-cost across prompts, and structured prompts (CoT/CoT-few-shot) on GPT-4o-mini offer attractive F1 at a small incremental cost. We recommend a staged workflow that (1) deploys low-cost models with structured prompts for first-pass screening and (2) escalates only borderline cases to higher-capacity models. These findings highlight LLMs' uneven but promising potential to automate literature screening. By systematically analyzing prompt-model interactions, we provide a comparative benchmark and practical guidance for task-adaptive LLM deployment.

144. 【2510.16079】EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

链接https://arxiv.org/abs/2510.16079

作者:Rong Wu,Xiaoman Wang,Jianbiao Mei,Pinlong Cai,Daocheng Fu,Cheng Yang,Licheng Wen,Xuemeng Yang,Yufan Shen,Yuxin Wang,Botian Shi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Model, Current Large Language, Current Large, Language Model, Large Language

备注

点击查看摘要

Abstract:Current Large Language Model (LLM) agents show strong performance in tool use, but lack the crucial capability to systematically learn from their own experiences. While existing frameworks mainly focus on mitigating external knowledge gaps, they fail to address a more fundamental limitation: the inability to iteratively refine problem-solving strategies. In this work, we introduce EvolveR, a framework designed to enable agent to self-improve through a complete, closed-loop experience lifecycle. This lifecycle comprises two key stages: (1) Offline Self-Distillation, where the agent's interaction trajectories are synthesized into a structured repository of abstract, reusable strategic principles; (2) Online Interaction, where the agent interacts with tasks and actively retrieves distilled principles to guide its decision-making, accumulating a diverse set of behavioral trajectories. This loop employs a policy reinforcement mechanism to iteratively update the agent based on its performance. We demonstrate the effectiveness of EvolveR on complex multi-hop question-answering benchmarks, where it achieves superior performance over strong agentic baselines. Our work presents a comprehensive blueprint for agents that learn not only from external data but also from the consequences of their own actions, paving the way for more autonomous and continuously improving systems. Code is available at this https URL.

145. 【2510.16062】Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs

链接https://arxiv.org/abs/2510.16062

作者:Guiyao Tie,Zenghui Yuan,Zeli Zhao,Chaoran Hu,Tianhe Gu,Ruihang Zhang,Sizhe Zhang,Junran Wu,Xiaoyue Tu,Ming Jin,Qingsong Wen,Lixing Chen,Pan Zhou,Lichao Sun

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, language models, Self-correction, large language, critical component

备注: 38 pages, 25 figures, 8 tables

点击查看摘要

Abstract:Self-correction of large language models (LLMs) emerges as a critical component for enhancing their reasoning performance. Although various self-correction methods have been proposed, a comprehensive evaluation of these methods remains largely unexplored, and the question of whether LLMs can truly correct themselves is a matter of significant interest and concern. In this study, we introduce CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies, including intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; 3) Reasoning LLMs (e.g., DeepSeek-R1) have limited optimization under additional self-correction methods and have high time costs. Interestingly, a comparatively simple chain-of-thought (CoT) baseline demonstrates competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLM's reasoning performance while highlighting the ongoing challenge of improving their efficiency. Consequently, we advocate for further research focused on optimizing the balance between reasoning capabilities and operational efficiency. Project Page: this https URL

146. 【2510.16059】SIADAFIX: issue description response for adaptive program repair

链接https://arxiv.org/abs/2510.16059

作者:Xin Cao,Nan Yu

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)

关键词:large language model-based, language model-based agents, propose utilizing fast, issue description response, program repair

备注: 20 pages, 3 figures

点击查看摘要

Abstract:We propose utilizing fast and slow thinking to enhance the capabilities of large language model-based agents on complex tasks such as program repair. In particular, we design an adaptive program repair method based on issue description response, called SIADAFIX. The proposed method utilizes slow thinking bug fix agent to complete complex program repair tasks, and employs fast thinking workflow decision components to optimize and classify issue descriptions, using issue description response results to guide the orchestration of bug fix agent workflows. SIADAFIX adaptively selects three repair modes, i.e., easy, middle and hard mode, based on problem complexity. It employs fast generalization for simple problems and test-time scaling techniques for complex problems. Experimental results on the SWE-bench Lite show that the proposed method achieves 60.67% pass@1 performance using the Claude-4 Sonnet model, reaching state-of-the-art levels among all open-source methods. SIADAFIX effectively balances repair efficiency and accuracy, providing new insights for automated program repair. Our code is available at this https URL.

147. 【2510.16057】Fusion-Augmented Large Language Models: Boosting Diagnostic Trustworthiness via Model Consensus

链接https://arxiv.org/abs/2510.16057

作者:Md Kamrul Siam,Md Jobair Hossain Faruk,Jerry Q. Cheng,Huanying Gu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词

备注: 7 pages (Accepted to IEEE BHI 2025)

点击查看摘要

None

148. 【2510.16054】PrivacyPAD: A Reinforcement Learning Framework for Dynamic Privacy-Aware Delegation

链接https://arxiv.org/abs/2510.16054

作者:Zheng Hui,Yijiang River Dong,Sanhanat Sivapiromrat,Ehsan Shareghi,Nigel Collier

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:Large Language Models, risk data exposure, Large Language, users submit queries, powerful proprietary LLM

备注

点击查看摘要

Abstract:When users submit queries to Large Language Models (LLMs), their prompts can often contain sensitive data, forcing a difficult choice: Send the query to a powerful proprietary LLM providers to achieving state-of-the-art performance and risk data exposure, or relying on smaller, local models guarantees data privacy but often results in a degradation of task performance. Prior approaches have relied on static pipelines that use LLM rewriting, which shatters linguistic coherence and indiscriminately removes privacy-sensitive information, including task-critical content. We reformulate this challenge (Privacy-Conscious Delegation) as a sequential decision-making problem and introduce a novel reinforcement learning (RL) framework called PrivacyPAD to solve it. Our framework trains an agent to dynamically route text chunks, learning a policy that optimally balances the trade-off between privacy leakage and task performance. It implicitly distinguishes between replaceable Personally Identifiable Information (PII) (which it shields locally) and task-critical PII (which it strategically sends to the remote model for maximal utility). To validate our approach in complex scenarios, we also introduce a new medical dataset with high PII density. Our framework achieves a new state-of-the-art on the privacy-utility frontier, demonstrating the necessity of learned, adaptive policies for deploying LLMs in sensitive environments.

149. 【2510.16017】InfraGPT Smart Infrastructure: An End-to-End VLM-Based Framework for Detecting and Managing Urban Defects

链接https://arxiv.org/abs/2510.16017

作者:Ibrahim Sheikh Mohamed,Abdullah Yahya Abdullah Omaisan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)

关键词:closed circuit television, Infrastructure in smart, circuit television, smart cities, cities is increasingly

备注

点击查看摘要

Abstract:Infrastructure in smart cities is increasingly monitored by networks of closed circuit television (CCTV) cameras. Roads, bridges and tunnels develop cracks, potholes, and fluid leaks that threaten public safety and require timely repair. Manual inspection is costly and hazardous, and existing automatic systems typically address individual defect types or provide unstructured outputs that cannot directly guide maintenance crews. This paper proposes a comprehensive pipeline that leverages street CCTV streams for multi defect detection and segmentation using the YOLO family of object detectors and passes the detections to a vision language model (VLM) for scene aware summarization. The VLM generates a structured action plan in JSON format that includes incident descriptions, recommended tools, dimensions, repair plans, and urgent alerts. We review literature on pothole, crack and leak detection, highlight recent advances in large vision language models such as QwenVL and LLaVA, and describe the design of our early prototype. Experimental evaluation on public datasets and captured CCTV clips demonstrates that the system accurately identifies diverse defects and produces coherent summaries. We conclude by discussing challenges and directions for scaling the system to city wide deployments.

150. 【2510.15990】Can GRPO Help LLMs Transcend Their Pretraining Origin?

链接https://arxiv.org/abs/2510.15990

作者:Kangqi Ni,Zhen Tan,Zijie Liu,Pingzhi Li,Tianlong Chen

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词

备注

点击查看摘要

None

151. 【2510.15977】Bolster Hallucination Detection via Prompt-Guided Data Augmentation

链接https://arxiv.org/abs/2510.15977

作者:Wenyun Li,Zheng Zhang,Dongmei Jiang,Xiangyuan Lan

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词

备注

点击查看摘要

None

152. 【2510.15972】Quantum NLP models on Natural Language Inference

链接https://arxiv.org/abs/2510.15972

作者:Ling Sun,Peter Sullivan,Michael Martin,Yun Zhou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:natural language processing, embedding compositional structure, compositional structure directly, Natural Language Inference, Quantum natural language

备注: Accepted, presented, and to appear in the Proceedings of the Quantum AI and NLP 2025 Conference

点击查看摘要

Abstract:Quantum natural language processing (QNLP) offers a novel approach to semantic modeling by embedding compositional structure directly into quantum circuits. This paper investigates the application of QNLP models to the task of Natural Language Inference (NLI), comparing quantum, hybrid, and classical transformer-based models under a constrained few-shot setting. Using the lambeq library and the DisCoCat framework, we construct parameterized quantum circuits for sentence pairs and train them for both semantic relatedness and inference classification. To assess efficiency, we introduce a novel information-theoretic metric, Information Gain per Parameter (IGPP), which quantifies learning dynamics independent of model size. Our results demonstrate that quantum models achieve performance comparable to classical baselines while operating with dramatically fewer parameters. The Quantum-based models outperform randomly initialized transformers in inference and achieve lower test error on relatedness tasks. Moreover, quantum models exhibit significantly higher per-parameter learning efficiency (up to five orders of magnitude more than classical counterparts), highlighting the promise of QNLP in low-resource, structure-sensitive settings. To address circuit-level isolation and promote parameter sharing, we also propose a novel cluster-based architecture that improves generalization by tying gate parameters to learned word clusters rather than individual tokens.

153. 【2510.15964】Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity

链接https://arxiv.org/abs/2510.15964

作者:Tuowei Wang,Kun Li,Zixu Hao,Donglin Bai,Ju Ren,Yaoxue Zhang,Ting Cao,Mao Yang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词

备注

点击查看摘要

None

154. 【2510.15951】Attention to Non-Adopters

链接https://arxiv.org/abs/2510.15951

作者:Kaitlyn Zhou,Kristina Gligorić,Myra Cheng,Michelle S. Lam,Vyoma Raman,Boluwatife Aminu,Caeley Woo,Michael Brockman,Hannah Cha,Dan Jurafsky

类目:Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

关键词

备注

点击查看摘要

None

155. 【2510.15898】HealthDial: A No-Code LLM-Assisted Dialogue Authoring Tool for Healthcare Virtual Agents

链接https://arxiv.org/abs/2510.15898

作者:Farnaz Nouraei,Zhuorui Yong,Timothy Bickmore

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词

备注

点击查看摘要

None

156. 【2510.15891】Detecting and Preventing Harmful Behaviors in AI Companions: Development and Evaluation of the SHIELD Supervisory System

链接https://arxiv.org/abs/2510.15891

作者:Ziv Ben-Zion,Paul Raffelhüschen,Max Zettl,Antonia Lüönd,Achim Burrer,Philipp Homan,Tobias R Spiller

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词

备注

点击查看摘要

None

157. 【2510.15889】Mitigating Harmful Erraticism in LLMs Through Dialectical Behavior Therapy Based De-Escalation Strategies

链接https://arxiv.org/abs/2510.15889

作者:Pooja Rangarajan,Jacob Boyle

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词

备注: 15 pages, 7 figures and 6 tables

点击查看摘要

None

158. 【2510.13939】Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers

链接https://arxiv.org/abs/2510.13939

作者:Tuhin Chakrabarty,Jane C. Ginsburg,Paramveer Dhillon

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词

备注: Preprint Under Review

点击查看摘要

None

159. 【2510.15929】Comparing LLMs for Sentiment Analysis in Financial Market News

链接https://arxiv.org/abs/2510.15929

作者:Lucas Eduardo Pereira Teles,Carlos M. S. Figueiredo

类目:atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词

备注

点击查看摘要

None

信息检索

1. 【2510.17670】On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration

链接https://arxiv.org/abs/2510.17670

作者:Yehonathan Refael,Amit Aides,Aviad Barzilai,George Leifman,Genady Beryozkin,Vered Silverman,Bolous Jaber,Tomer Shekel

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:arbitrary text queries, offer remarkable flexibility, Open-vocabulary object detection, models offer remarkable, text queries

备注

点击查看摘要

Abstract:Open-vocabulary object detection (OVD) models offer remarkable flexibility by detecting objects from arbitrary text queries. However, their zero-shot performance in specialized domains like Remote Sensing (RS) is often compromised by the inherent ambiguity of natural language, limiting critical downstream applications. For instance, an OVD model may struggle to distinguish between fine-grained classes such as "fishing boat" and "yacht" since their embeddings are similar and often inseparable. This can hamper specific user goals, such as monitoring illegal fishing, by producing irrelevant detections. To address this, we propose a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier. Our method first employs the zero-shot model to generate high-recall object proposals. These proposals are then refined for high precision by a compact classifier trained in real-time on only a handful of user-annotated examples - drastically reducing the high costs of RS imagery this http URL core of our framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. FLAME identifies, on the fly, uncertain marginal candidates near the decision boundary using density estimation, followed by clustering to ensure sample diversity. This efficient sampling technique achieves high accuracy without costly full-model fine-tuning and enables instant adaptation, within less then a minute, which is significantly faster than state-of-the-art this http URL method consistently surpasses state-of-the-art performance on RS benchmarks, establishing a practical and resource-efficient framework for adapting foundation models to specific user needs.

2. 【2510.17614】OG-Rank: Learning to Rank Fast and Slow with Uncertainty and Reward-Trend Guided Adaptive Exploration

链接https://arxiv.org/abs/2510.17614

作者:Praphul Singh,Corey Barrett,Sumana Srivasta,Irfan Bulu,Sri Gadde,Krishnaram Kenthapadi

类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Clinicians need ranking, justify their choices, ranking systems, systems that work, work in real

备注

点击查看摘要

Abstract:Clinicians need ranking systems that work in real time and still justify their choices. Motivated by the need for a low-latency, decoder-based reranker, we present OG-Rank, a single-decoder approach that pairs a pooled first-token scoring signal with an uncertainty-gated explanation step. The model scores all candidates in one pass and generates a brief, structured rationale only when the list is genuinely ambiguous, keeping latency predictable. Trained with a curriculum that concentrates effort on hard cases, OG-Rank delivers strong effectiveness on encounter-scoped order selection (fast path: Recall@1~0.45, nDCG@20~0.625) and improves further when the gate activates (Recall@1~0.56, nDCG@20~0.699 at a 45\% gate rate), while compact backbones show similar gains under the same policy. Encoder baselines trail in both effectiveness and flexibility. The result is a practical recipe: rank fast by default and explain when it helps, a pattern that applies broadly to decision tasks where selective generation buys accuracy at acceptable cost. The single-policy design simplifies deployment and budget planning, and the curriculum principle (spend more on the hard cases, less on the easy ones) readily transfers beyond clinical order selection.

3. 【2510.17535】How role-play shapes relevance judgment in zero-shot LLM rankers

链接https://arxiv.org/abs/2510.17535

作者:Yumeng Wang,Jirui Qi,Catherine Chen,Panagiotis Eustratiadis,Suzan Verberne

类目:Information Retrieval (cs.IR)

关键词:Large Language Models, Language Models, Large Language, emerged as promising, performance is highly

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as promising zero-shot rankers, but their performance is highly sensitive to prompt formulation. In particular, role-play prompts, where the model is assigned a functional role or identity, often give more robust and accurate relevance rankings. However, the mechanisms and diversity of role-play effects remain underexplored, limiting both effective use and interpretability. In this work, we systematically examine how role-play variations influence zero-shot LLM rankers. We employ causal intervention techniques from mechanistic interpretability to trace how role-play information shapes relevance judgments in LLMs. Our analysis reveals that (1) careful formulation of role descriptions have a large effect on the ranking quality of the LLM; (2) role-play signals are predominantly encoded in early layers and communicate with task instructions in middle layers, while receiving limited interaction with query or document representations. Specifically, we identify a group of attention heads that encode information critical for role-conditioned relevance. These findings not only shed light on the inner workings of role-play in LLM ranking but also offer guidance for designing more effective prompts in IR and beyond, pointing toward broader opportunities for leveraging role-play in zero-shot applications.

4. 【2510.17354】owards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

链接https://arxiv.org/abs/2510.17354

作者:Chenghao Zhang,Guanting Dong,Xinyu Yang,Zhicheng Dou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:enhancing large language, large language models, external corpus, powerful paradigm, paradigm for enhancing

备注: This work is in progress

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.

5. 【2510.17281】MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

链接https://arxiv.org/abs/2510.17281

作者:Qingyao Ai,Yichen Tang,Changyue Wang,Jianming Long,Weihang Su,Yiqun Liu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:computational resource consumption, marginal gains obtained, larger computational resource, improve LLM systems, Scaling up data

备注

点击查看摘要

Abstract:Scaling up data, parameters, and test-time computation has been the mainstream methods to improve LLM systems (LLMsys), but their upper bounds are almost reached due to the gradual depletion of high-quality data and marginal gains obtained from larger computational resource consumption. Inspired by the abilities of human and traditional AI systems in learning from practice, constructing memory and continual learning frameworks for LLMsys has become an important and popular research direction in recent literature. Yet, existing benchmarks for LLM memory often focus on evaluating the system on homogeneous reading comprehension tasks with long-form inputs rather than testing their abilities to learn from accumulated user feedback in service time. Therefore, we propose a user feedback simulation framework and a comprehensive benchmark covering multiple domains, languages, and types of tasks to evaluate the continual learning abilities of LLMsys. Experiments show that the effectiveness and efficiency of state-of-the-art baselines are far from satisfying, and we hope this benchmark could pave the way for future studies on LLM memory and optimization algorithms.

6. 【2510.17245】On Efficiency-Effectiveness Trade-off of Diffusion-based Recommenders

链接https://arxiv.org/abs/2510.17245

作者:Wenyu Mao,Jiancan Wu,Guoqing Hu,Wei Ji,Xiang Wang

类目:Information Retrieval (cs.IR)

关键词:generative sequential recommendation, user interaction histories, Diffusion models, multi-step denoising process, powerful paradigm

备注

点击查看摘要

Abstract:Diffusion models have emerged as a powerful paradigm for generative sequential recommendation, which typically generate next items to recommend guided by user interaction histories with a multi-step denoising process. However, the multi-step process relies on discrete approximations, introducing discretization error that creates a trade-off between computational efficiency and recommendation effectiveness. To address this trade-off, we propose TA-Rec, a two-stage framework that achieves one-step generation by smoothing the denoising function during pretraining while alleviating trajectory deviation by aligning with user preferences during fine-tuning. Specifically, to improve the efficiency without sacrificing the recommendation performance, TA-Rec pretrains the denoising model with Temporal Consistency Regularization (TCR), enforcing the consistency between the denoising results across adjacent steps. Thus, we can smooth the denoising function to map the noise as oracle items in one step with bounded error. To further enhance effectiveness, TA-Rec introduces Adaptive Preference Alignment (APA) that aligns the denoising process with user preference adaptively based on preference pair similarity and timesteps. Extensive experiments prove that TA-Rec's two-stage objective effectively mitigates the discretization errors-induced trade-off, enhancing both efficiency and effectiveness of diffusion-based recommenders.

7. 【2510.17228】DSEBench: A Test Collection for Explainable Dataset Search with Examples

链接https://arxiv.org/abs/2510.17228

作者:Qing Shi,Jing He,Qiaosheng Chen,Gong Cheng

类目:Information Retrieval (cs.IR)

关键词:Dataset search, established information retrieval, Explainable DSE, Dataset, input target dataset

备注: 34 pages, 5 figures, submitted to Knowledge-Based Systems

点击查看摘要

Abstract:Dataset search has been an established information retrieval task. Current paradigms either retrieve datasets that are relevant to a keyword query or find datasets that are similar to an input target dataset. To allow for their combined specification of information needs, in this article, we investigate the more generalized task of Dataset Search with Examples (DSE) and further extend it to Explainable DSE that requires identifying the metadata and content fields of a dataset that indicate its relevance to the query and similarity to the target datasets. To facilitate this research, we construct DSEBench, a test collection that provides high-quality dataset- and field-level annotations to enable the evaluation of explainable DSE. We also employ a large language model to generate numerous annotations to be used for training. We establish extensive baselines on DSEBench by adapting and evaluating a variety of sparse, dense, and LLM-based retrieval, reranking, and explanation methods.

8. 【2510.17139】Rethinking On-policy Optimization for Query Augmentation

链接https://arxiv.org/abs/2510.17139

作者:Zhichao Xu,Shengyao Zhuang,Xueguang Ma,Bingsen Chen,Yijun Tian,Fengran Mo,Jie Cao,Vivek Srikumar

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Recent advances, large language models, advances in large, large language, surge of interest

备注

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model's parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. While having respective advantages and limitations, the two approaches have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), which, instead of rewriting a query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, thus merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. Our implementation is made available to facilitate reproducibility.

9. 【2510.16925】owards Context-aware Reasoning-enhanced Generative Searching in E-commerce

链接https://arxiv.org/abs/2510.16925

作者:Zhiding Liu,Ben Chen,Mingyue Cheng,Enchong Chen,Li Li,Chenyi Lei,Wenwu Ou,Han Li,Kun Gai

类目:Information Retrieval (cs.IR)

关键词:critical application scenarios, critical application, current query information, search, application scenarios

备注

点击查看摘要

Abstract:Search-based recommendation is one of the most critical application scenarios in e-commerce platforms. Users' complex search contexts--such as spatiotemporal factors, historical interactions, and current query's information--constitute an essential part of their decision-making, reflecting implicit preferences that complement explicit query terms. Modeling such rich contextual signals and their intricate associations with candidate items remains a key challenge. Although numerous efforts have been devoted to building more effective search methods, existing approaches still show limitations in integrating contextual information, which hinders their ability to fully capture user intent. To address these challenges, we propose a context-aware reasoning-enhanced generative search framework for better \textbf{understanding the complicated context}. Specifically, the framework first unifies heterogeneous user and item contexts into textual representations or text-based semantic identifiers and aligns them. To overcome the lack of explicit reasoning trajectories, we introduce a self-evolving post-training paradigm that iteratively combines supervised fine-tuning and reinforcement learning to progressively enhance the model's reasoning capability. In addition, we identify potential biases in existing RL algorithms when applied to search scenarios and present a debiased variant of GRPO to improve ranking performance. Extensive experiments on search log data collected from a real-world e-commerce platform demonstrate that our approach achieves superior performance compared with strong baselines, validating its effectiveness for search-based recommendation.

Subjects:

Information Retrieval (cs.IR)

Cite as:
arXiv:2510.16925 [cs.IR]

(or
arXiv:2510.16925v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2510.16925

Focus to learn more

              arXiv-issued DOI via DataCite</p>
10. 【2510.16804】he Layout Is the Model: On Action-Item Coupling in Generative Recommendation

链接https://arxiv.org/abs/2510.16804

作者:Xiaokai Wei,Jiajun Wu,Daiyao Yi,Reza Shirkavand,Michelle Gong

类目:Information Retrieval (cs.IR)

关键词:Generative Recommendation, user interaction history, autoregressively predicted, treat a user, user interaction

备注

点击查看摘要

Abstract:Generative Recommendation (GR) models treat a user's interaction history as a sequence to be autoregressively predicted. When both items and actions (e.g., watch time, purchase, comment) are modeled, the layout-the ordering and visibility of item/action tokens-critically determines what information the model can use and how it generalizes. We present a unified study of token layouts for GR grounded in first principles: (P1) maximize item/action signal in both input/output space, (P2) preserve the conditioning relationship "action given item" and (P3) no information leakage. While interleaved layout (where item and action occupy separate tokens) naturally satisfies these principles, it also bloats sequence length with larger training/inference cost. On the non-interleaved front, we design a novel and effective approach, Lagged Action Conditioning (LAC), which appears strange on the surface but aligns well with the design principles to yield strong accuracy. Comprehensive experiments on public datasets and large-scale production logs evaluate different layout options and empirically verifies the design principles. Our proposed non-interleaved method, LAC, achieves competitive or superior quality at substantially lower FLOPs than interleaving. Our findings offer actionable guidance for assembling GR systems that are both accurate and efficient.

Subjects:

Information Retrieval (cs.IR)

ACMclasses:
H.3.3

Cite as:
arXiv:2510.16804 [cs.IR]

(or
arXiv:2510.16804v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2510.16804

Focus to learn more

              arXiv-issued DOI via DataCite</p>
11. 【2510.16803】An Efficient Framework for Whole-Page Reranking via Single-Modal Supervision

链接https://arxiv.org/abs/2510.16803

作者:Zishuai Zhang,Sihao Yu,Wenyi Xie,Ying Nie,Junfeng Wang,Zhiming Zheng,Dawei Yin,Hainan Zhang

类目:Information Retrieval (cs.IR)

关键词:integrates retrieval results, LLM outputs, plays a critical, critical role, role in shaping

备注

点击查看摘要

Abstract:The whole-page reranking plays a critical role in shaping the user experience of search engines, which integrates retrieval results from multiple modalities, such as documents, images, videos, and LLM outputs. Existing methods mainly rely on large-scale human-annotated data, which is costly to obtain and time-consuming. This is because whole-page annotation is far more complex than single-modal: it requires assessing the entire result page while accounting for cross-modal relevance differences. Thus, how to improve whole-page reranking performance while reducing annotation costs is still a key challenge in optimizing search engine result pages(SERP). In this paper, we propose SMAR, a novel whole-page reranking framework that leverages strong Single-modal rankers to guide Modal-wise relevance Alignment for effective Reranking, using only limited whole-page annotation to outperform fully-annotated reranking models. Specifically, high-quality single-modal rankers are first trained on data specific to their respective modalities. Then, for each query, we select a subset of their outputs to construct candidate pages and perform human annotation at the page level. Finally, we train the whole-page reranker using these limited annotations and enforcing consistency with single-modal preferences to maintain ranking quality within each modality. Experiments on the Qilin and Baidu datasets demonstrate that SMAR reduces annotation costs by about 70-90\% while achieving significant ranking improvements compared to baselines. Further offline and online A/B testing on Baidu APPs also shows notable gains in standard ranking metrics as well as user experience indicators, fully validating the effectiveness and practical value of our approach in real-world search scenarios.

12. 【2510.16736】Exact Nearest-Neighbor Search on Energy-Efficient FPGA Devices

链接https://arxiv.org/abs/2510.16736

作者:Patrizio Dazzi,William Guglielmo,Franco Maria Nardini,Raffaele Perego,Salvatore Trani

类目:Information Retrieval (cs.IR); Distributed, Parallel, and Cluster Computing (cs.DC)

关键词:high-dimension latent spaces, exact kNN search, latent spaces, energy-efficient exact kNN, investigates the usage

备注

点击查看摘要

Abstract:This paper investigates the usage of FPGA devices for energy-efficient exact kNN search in high-dimension latent spaces. This work intercepts a relevant trend that tries to support the increasing popularity of learned representations based on neural encoder models by making their large-scale adoption greener and more inclusive. The paper proposes two different energy-efficient solutions adopting the same FPGA low-level configuration. The first solution maximizes system throughput by processing the queries of a batch in parallel over a streamed dataset not fitting into the FPGA memory. The second minimizes latency by processing each kNN incoming query in parallel over an in-memory dataset. Reproducible experiments on publicly available image and text datasets show that our solution outperforms state-of-the-art CPU-based competitors regarding throughput, latency, and energy consumption. Specifically, experiments show that the proposed FPGA solutions achieve the best throughput in terms of queries per second and the best-observed latency with scale-up factors of up to 16.6X. Similar considerations can be made regarding energy efficiency, where results show that our solutions can achieve up to 11.9X energy saving w.r.t. strong CPU-based competitors.

13. 【2510.16715】Right Answer at the Right Time - Temporal Retrieval-Augmented Generation via Graph Summarization

链接https://arxiv.org/abs/2510.16715

作者:Zulun Zhu,Haoyu Liu,Mengke He,Siqiang Luo

类目:Information Retrieval (cs.IR)

关键词:Question answering, knowledge graphs requires, graphs requires retrieval, temporal knowledge graphs, Question

备注

点击查看摘要

Abstract:Question answering in temporal knowledge graphs requires retrieval that is both time-consistent and efficient. Existing RAG methods are largely semantic and typically neglect explicit temporal constraints, which leads to time-inconsistent answers and inflated token usage. We propose STAR-RAG, a temporal GraphRAG framework that relies on two key ideas: building a time-aligned rule graph and conducting propagation on this graph to narrow the search space and prioritize semantically relevant, time-consistent evidence. This design enforces temporal proximity during retrieval, reduces the candidate set of retrieval results, and lowers token consumption without sacrificing accuracy. Compared with existing temporal RAG approaches, STAR-RAG eliminates the need for heavy model training and fine-tuning, thereby reducing computational cost and significantly simplifying this http URL experiments on real-world temporal KG datasets show that our method achieves improved answer accuracy while consuming fewer tokens than strong GraphRAG baselines.

14. 【2510.16695】Resolution-Aware Retrieval Augmented Zero-Shot Forecasting

链接https://arxiv.org/abs/2510.16695

作者:Iman Deznabi,Peeyush Kumar,Madalina Fiterau

类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)

关键词:previously unseen conditions, posing a significant, Zero-shot forecasting aims, aims to predict, predict outcomes

备注

点击查看摘要

Abstract:Zero-shot forecasting aims to predict outcomes for previously unseen conditions without direct historical data, posing a significant challenge for traditional forecasting methods. We introduce a Resolution-Aware Retrieval-Augmented Forecasting model that enhances predictive accuracy by leveraging spatial correlations and temporal frequency characteristics. By decomposing signals into different frequency components, our model employs resolution-aware retrieval, where lower-frequency components rely on broader spatial context, while higher-frequency components focus on local influences. This allows the model to dynamically retrieve relevant data and adapt to new locations with minimal historical context. Applied to microclimate forecasting, our model significantly outperforms traditional forecasting methods, numerical weather prediction models, and modern foundation time series models, achieving 71% lower MSE than HRRR and 34% lower MSE than Chronos on the ERA5 dataset. Our results highlight the effectiveness of retrieval-augmented and resolution-aware strategies, offering a scalable and data-efficient solution for zero-shot forecasting in microclimate modeling and beyond.

Subjects:

Machine Learning (cs.LG); Information Retrieval (cs.IR)

Cite as:
arXiv:2510.16695 [cs.LG]

(or
arXiv:2510.16695v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2510.16695

Focus to learn more

              arXiv-issued DOI via DataCite</p>
15. 【2510.16662】Safire: Similarity Framework for Visualization Retrieval

链接https://arxiv.org/abs/2510.16662

作者:Huyen N. Nguyen,Nils Gehlenborg

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Effective visualization retrieval, Effective visualization, visualization retrieval necessitates, visualization retrieval, visualization retrieval systems

备注: To appear in IEEE VIS 2025

点击查看摘要

Abstract:Effective visualization retrieval necessitates a clear definition of similarity. Despite the growing body of work in specialized visualization retrieval systems, a systematic approach to understanding visualization similarity remains absent. We introduce the Similarity Framework for Visualization Retrieval (Safire), a conceptual model that frames visualization similarity along two dimensions: comparison criteria and representation modalities. Comparison criteria identify the aspects that make visualizations similar, which we divide into primary facets (data, visual encoding, interaction, style, metadata) and derived properties (data-centric and human-centric measures). Safire connects what to compare with how comparisons are executed through representation modalities. We categorize existing representation approaches into four groups based on their levels of information content and visualization determinism: raster image, vector image, specification, and natural language description, together guiding what is computable and comparable. We analyze several visualization retrieval systems using Safire to demonstrate its practical value in clarifying similarity considerations. Our findings reveal how particular criteria and modalities align across different use cases. Notably, the choice of representation modality is not only an implementation detail but also an important decision that shapes retrieval capabilities and limitations. Based on our analysis, we provide recommendations and discuss broader implications for multimodal learning, AI applications, and visualization reproducibility.

16. 【2510.16635】Prompt Optimization via Retrieved Reasoning Assets and Multi-Agent Analysis

链接https://arxiv.org/abs/2510.16635

作者:Wonduk Seo,Juhyeon Lee,Junseo Koh,Hyunjin An,Jian Park,Seunghyun Lee,Haihua Chen,Yi Bu

类目:Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

关键词:Large Language Models, Language Models, Large Language, performance of Large, effective alternative

备注: Preprint

点击查看摘要

Abstract:Prompt optimization has emerged as an effective alternative to retraining for improving the performance of Large Language Models (LLMs). However, most existing approaches treat evaluation as a black box, relying solely on numerical scores while offering limited insight into why a prompt succeeds or fails. They also depend heavily on trial-and-error refinements, which are difficult to interpret and control. In this paper, we introduce MA-SAPO, a Multi-Agent framework for Score-Aware Prompt Optimization. Compared to prior methods, MA-SAPO explicitly couples evaluation outcomes with structured reasoning to guide systematic edits. The framework specifically consists of two stages: during the Reasoning Phase, agents collaboratively explain metric scores, diagnose weaknesses, and synthesize targeted refinements that are stored as reusable reasoning assets; during the Test Phase, agents retrieve these assets to analyze optimized prompts and apply only evidence-grounded edits. By turning evaluation signals into interpretable reasoning chains, MA-SAPO produces prompt refinements that are more transparent, auditable, and controllable. Experiments on the HelpSteer1/2 benchmarks demonstrate consistent improvements over single-pass prompting, retrieval-augmented baselines, and prior multi-agent strategies, validating the effectiveness of our approach.

17. 【2510.16597】FRONTIER-RevRec: A Large-scale Dataset for Reviewer Recommendation

链接https://arxiv.org/abs/2510.16597

作者:Qiyao Peng,Chen Wang,Yinghui Wang,Hongtao Liu,Xuan Guo,Wenjun Wang

类目:Information Retrieval (cs.IR)

关键词:critical task, task for enhancing, enhancing the efficiency, academic publishing workflows, publishing workflows

备注

点击查看摘要

Abstract:Reviewer recommendation is a critical task for enhancing the efficiency of academic publishing workflows. However, research in this area has been persistently hindered by the lack of high-quality benchmark datasets, which are often limited in scale, disciplinary scope, and comparative analyses of different methodologies. To address this gap, we introduce FRONTIER-RevRec, a large-scale dataset constructed from authentic peer review records (2007-2025) from the Frontiers open-access publishing platform this https URL. The dataset contains 177941 distinct reviewers and 478379 papers across 209 journals spanning multiple disciplines including clinical medicine, biology, psychology, engineering, and social sciences. Our comprehensive evaluation on this dataset reveals that content-based methods significantly outperform collaborative filtering. This finding is explained by our structural analysis, which uncovers fundamental differences between academic recommendation and commercial domains. Notably, approaches leveraging language models are particularly effective at capturing the semantic alignment between a paper's content and a reviewer's expertise. Furthermore, our experiments identify optimal aggregation strategies to enhance the recommendation pipeline. FRONTIER-RevRec is intended to serve as a comprehensive benchmark to advance research in reviewer recommendation and facilitate the development of more effective academic peer review systems. The FRONTIER-RevRec dataset is available at: this https URL.

18. 【2510.16576】Enhancing Channel Estimation in RIS-aided Systems via Observation Matrix Design

链接https://arxiv.org/abs/2510.16576

作者:Zijian Zhang,Mingyao Cui

类目:Information Theory (cs.IT); Information Retrieval (cs.IR); Signal Processing (eess.SP); Systems and Control (eess.SY)

关键词:Reconfigurable intelligent surfaces, dense antenna arrays, enhancing wireless communications, Reconfigurable intelligent, intelligent surfaces

备注: 5 pages, 2 figures

点击查看摘要

Abstract:Reconfigurable intelligent surfaces (RISs) have emerged as a promising technology for enhancing wireless communications through dense antenna arrays. Accurate channel estimation is critical to unlocking their full performance potential. To enhance RIS channel estimators, this paper proposes a novel observation matrix design scheme. Bayesian optimization framework is adopted to generate observation matrices that maximize the mutual information between received pilot signals and RIS channels. To solve the formulated problem efficiently, we develop an alternating Riemannian manifold optimization (ARMO) algorithm to alternately update the receiver combiners and RIS phase-shift matrices. An adaptive kernel training strategy is further introduced to iteratively refine the channel covariance matrix without requiring additional pilot resources. Simulation results demonstrate that the proposed ARMO-enhanced estimator achieves substantial gains in estimation accuracy over state-of-the-art methods.

19. 【2510.16393】Blending Learning to Rank and Dense Representations for Efficient and Effective Cascades

链接https://arxiv.org/abs/2510.16393

作者:Franco Maria Nardini,Raffaele Perego,Nicola Tonellotto,Salvatore Trani

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG); Performance (cs.PF)

关键词:ad-hoc passage retrieval, neural relevance signals, investigate the exploitation, dense neural representations, hand-crafted lexical features

备注

点击查看摘要

Abstract:We investigate the exploitation of both lexical and neural relevance signals for ad-hoc passage retrieval. Our exploration involves a large-scale training dataset in which dense neural representations of MS-MARCO queries and passages are complemented and integrated with 253 hand-crafted lexical features extracted from the same corpus. Blending of the relevance signals from the two different groups of features is learned by a classical Learning-to-Rank (LTR) model based on a forest of decision trees. To evaluate our solution, we employ a pipelined architecture where a dense neural retriever serves as the first stage and performs a nearest-neighbor search over the neural representations of the documents. Our LTR model acts instead as the second stage that re-ranks the set of candidates retrieved by the first stage to enhance effectiveness. The results of reproducible experiments conducted with state-of-the-art dense retrievers on publicly available resources show that the proposed solution significantly enhances the end-to-end ranking performance while relatively minimally impacting efficiency. Specifically, we achieve a boost in nDCG@10 of up to 11% with an increase in average query latency of only 4.3%. This confirms the advantage of seamlessly combining two distinct families of signals that mutually contribute to retrieval effectiveness.

20. 【2510.16334】Investigating the Association Between Text-Based Indications of Foodborne Illness from Yelp Reviews and New York City Health Inspection Outcomes (2023)

链接https://arxiv.org/abs/2510.16334

作者:Eden Shaveet,Crystal Su,Daniel Hsu,Luis Gravano

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:gastrointestinal conditions caused, consuming contaminated food, Foodborne illnesses, illnesses are gastrointestinal, gastrointestinal conditions

备注: Presented as a poster at Data Science Day 2024

点击查看摘要

Abstract:Foodborne illnesses are gastrointestinal conditions caused by consuming contaminated food. Restaurants are critical venues to investigate outbreaks because they share sourcing, preparation, and distribution of foods. Public reporting of illness via formal channels is limited, whereas social media platforms host abundant user-generated content that can provide timely public health signals. This paper analyzes signals from Yelp reviews produced by a Hierarchical Sigmoid Attention Network (HSAN) classifier and compares them with official restaurant inspection outcomes issued by the New York City Department of Health and Mental Hygiene (NYC DOHMH) in 2023. We evaluate correlations at the Census tract level, compare distributions of HSAN scores by prevalence of C-graded restaurants, and map spatial patterns across NYC. We find minimal correlation between HSAN signals and inspection scores at the tract level and no significant differences by number of C-graded restaurants. We discuss implications and outline next steps toward address-level analyses.

21. 【2510.16302】DTKG: Dual-Track Knowledge Graph-Verified Reasoning Framework for Multi-Hop QA

链接https://arxiv.org/abs/2510.16302

作者:Changhao Wang,Yanfang Liu,Xinxin Fan,Anzhi Zhou,Lao Tian,Yunfeng Lu

类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:large language models, modern large language, Multi-hop reasoning, reasoning, plays a critical

备注: 13 pages, 5 figures

点击查看摘要

Abstract:Multi-hop reasoning for question answering (QA) plays a critical role in retrieval-augmented generation (RAG) for modern large language models (LLMs). The accurate answer can be obtained through retrieving relational structure of entities from knowledge graph (KG). Regarding the inherent relation-dependency and reasoning pattern, multi-hop reasoning can be in general classified into two categories: i) parallel fact-verification multi-hop reasoning question, i.e., requiring simultaneous verifications of multiple independent sub-questions; and ii) chained multi-hop reasoning questions, i.e., demanding sequential multi-step inference with intermediate conclusions serving as essential premises for subsequent reasoning. Currently, the multi-hop reasoning approaches singly employ one of two techniques: LLM response-based fact verification and KG path-based chain construction. Nevertheless, the former excels at parallel fact-verification but underperforms on chained reasoning tasks, while the latter demonstrates proficiency in chained multi-hop reasoning but suffers from redundant path retrieval when handling parallel fact-verification reasoning. These limitations deteriorate the efficiency and accuracy for multi-hop QA tasks. To address this challenge, we propose a novel dual-track KG verification and reasoning framework DTKG, which is inspired by the Dual Process Theory in cognitive science. Specifically, DTKG comprises two main stages: the Classification Stage and the Branch Processing Stage.

22. 【2510.16076】BPL: Bias-adaptive Preference Distillation Learning for Recommender System

链接https://arxiv.org/abs/2510.16076

作者:SeongKu Kang,Jianxun Lian,Dongha Lee,Wonbin Kweon,Sanghwan Jang,Jaehyun Lee,Jindong Wang,Xing Xie,Hwanjo Yu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Recommender systems suffer, Recommender systems, incompletely reveal user, test, reveal user preference

备注: \c{opyright} 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:Recommender systems suffer from biases that cause the collected feedback to incompletely reveal user preference. While debiasing learning has been extensively studied, they mostly focused on the specialized (called counterfactual) test environment simulated by random exposure of items, significantly degrading accuracy in the typical (called factual) test environment based on actual user-item interactions. In fact, each test environment highlights the benefit of a different aspect: the counterfactual test emphasizes user satisfaction in the long-terms, while the factual test focuses on predicting subsequent user behaviors on platforms. Therefore, it is desirable to have a model that performs well on both tests rather than only one. In this work, we introduce a new learning framework, called Bias-adaptive Preference distillation Learning (BPL), to gradually uncover user preferences with dual distillation strategies. These distillation strategies are designed to drive high performance in both factual and counterfactual test environments. Employing a specialized form of teacher-student distillation from a biased model, BPL retains accurate preference knowledge aligned with the collected feedback, leading to high performance in the factual test. Furthermore, through self-distillation with reliability filtering, BPL iteratively refines its knowledge throughout the training process. This enables the model to produce more accurate predictions across a broader range of user-item combinations, thereby improving performance in the counterfactual test. Comprehensive experiments validate the effectiveness of BPL in both factual and counterfactual tests. Our implementation is accessible via: this https URL.

计算机视觉

1. 【2510.17803】ConsistEdit: Highly Consistent and Precise Training-free Visual Editing

链接https://arxiv.org/abs/2510.17803

作者:Zixin Yin,Ling-Hao Chen,Lionel Ni,Xili Dai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, existing generation models, efficient text-guided editing, text-guided editing capabilities, generation models

备注: SIGGRAPH Asia 2025

点击查看摘要

Abstract:Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing generation models. However, current approaches struggle to simultaneously deliver strong editing strength while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without handcraft, significantly enhancing reliability and consistency, which enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control.

2. 【2510.17800】Glyph: Scaling Context Windows via Visual-Text Compression

链接https://arxiv.org/abs/2510.17800

作者:Jiale Cheng,Yusen Liu,Xinyu Zhang,Yulin Fei,Wenyi Hong,Ruiliang Lyu,Weihan Wang,Zhe Su,Xiaotao Gu,Xiao Liu,Yushi Bai,Jie Tang,Hongning Wang,Minlie Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, Large language, increasingly rely, multi-step reasoning, Large

备注

点击查看摘要

Abstract:Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective-visual context scaling-to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at this https URL.

3. 【2510.17790】UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

链接https://arxiv.org/abs/2510.17790

作者:Yuhao Yang,Zhen Yang,Zi-Yi Dou,Anh Nguyen,Keen You,Omar Attia,Andrew Szot,Michael Feng,Ram Ramrakhya,Alexander Toshev,Chao Huang,Yinfei Yang,Zhe Gan

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:require accurate visual, accurate visual grounding, lengthy execution chains, Multimodal agents, programmatic tool calls

备注

点击查看摘要

Abstract:Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.

4. 【2510.17783】Botany-Bot: Digital Twin Monitoring of Occluded and Underleaf Plant Structures with Gaussian Splats

链接https://arxiv.org/abs/2510.17783

作者:Simeon Adebola,Chung Min Kim,Justin Kerr,Shuangyu Xie,Prithvi Akella,Jose Luis Susa Rincon,Eugen Solowjow,Ken Goldberg

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Commercial plant phenotyping, Commercial plant, segmentated Gaussian Splat, Gaussian Splat models, plant phenotyping systems

备注: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)

点击查看摘要

Abstract:Commercial plant phenotyping systems using fixed cameras cannot perceive many plant details due to leaf occlusion. In this paper, we present Botany-Bot, a system for building detailed "annotated digital twins" of living plants using two stereo cameras, a digital turntable inside a lightbox, an industrial robot arm, and 3D segmentated Gaussian Splat models. We also present robot algorithms for manipulating leaves to take high-resolution indexable images of occluded details such as stem buds and the underside/topside of leaves. Results from experiments suggest that Botany-Bot can segment leaves with 90.8% accuracy, detect leaves with 86.2% accuracy, lift/push leaves with 77.9% accuracy, and take detailed overside/underside images with 77.3% accuracy. Code, videos, and datasets are available at this https URL.

5. 【2510.17777】SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

链接https://arxiv.org/abs/2510.17777

作者:Samir Khaki,Junxian Guo,Jiaming Tang,Shang Yang,Yukang Chen,Konstantinos N. Plataniotis,Yao Lu,Song Han,Zhijian Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Language Models, Vision Language, Language Models, high-resolution image understanding, long-video analysis

备注

点击查看摘要

Abstract:Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks -- while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.

6. 【2510.17773】owards Explainable Skin Cancer Classification: A Dual-Network Attention Model with Lesion Segmentation and Clinical Metadata Fusion

链接https://arxiv.org/abs/2510.17773

作者:Md. Enamul Atiq,Shaikh Anowarul Fattah

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:early detection significantly, life-threatening disease, disease where early, early detection, Dual Attention Gates

备注: 15 pages, 7 Figures, 3 Tables

点击查看摘要

Abstract:Skin cancer is a life-threatening disease where early detection significantly improves patient outcomes. Automated diagnosis from dermoscopic images is challenging due to high intra-class variability and subtle inter-class differences. Many deep learning models operate as "black boxes," limiting clinical trust. In this work, we propose a dual-encoder attention-based framework that leverages both segmented lesions and clinical metadata to enhance skin lesion classification in terms of both accuracy and interpretability. A novel Deep-UNet architecture with Dual Attention Gates (DAG) and Atrous Spatial Pyramid Pooling (ASPP) is first employed to segment lesions. The classification stage uses two DenseNet201 encoders-one on the original image and another on the segmented lesion whose features are fused via multi-head cross-attention. This dual-input design guides the model to focus on salient pathological regions. In addition, a transformer-based module incorporates patient metadata (age, sex, lesion site) into the prediction. We evaluate our approach on the HAM10000 dataset and the ISIC 2018 and 2019 challenges. The proposed method achieves state-of-the-art segmentation performance and significantly improves classification accuracy and average AUC compared to baseline models. To validate our model's reliability, we use Gradient-weighted Class Activation Mapping (Grad-CAM) to generate heatmaps. These visualizations confirm that our model's predictions are based on the lesion area, unlike models that rely on spurious background features. These results demonstrate that integrating precise lesion segmentation and clinical data with attention-based fusion leads to a more accurate and interpretable skin cancer classification model.

7. 【2510.17771】Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs

链接https://arxiv.org/abs/2510.17771

作者:Zhining Liu,Ziyi Chen,Hui Liu,Chen Luo,Xianfeng Tang,Suhang Wang,Joy Zeng,Zhenwei Dai,Zhan Shi,Tianxin Wei,Benoit Dumoulin,Hanghang Tong

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision-Language Models, visual question answering, achieve strong results, achieve strong, question answering

备注: 21 pages, 10 figures, 6 tables

点击查看摘要

Abstract:Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence when outputting incorrect answers, a phenomenon we term ``seeing but not believing'' that widely exists in major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it, making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.

8. 【2510.17759】VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

链接https://arxiv.org/abs/2510.17759

作者:Qilin Liao,Anamika Lochab,Ruqi Zhang

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:extend large language, large language models, extend large, visual reasoning, large language

备注: 18 pages, 7 Figures,

点击查看摘要

Abstract:Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.

9. 【2510.17739】Joint Multi-Condition Representation Modelling via Matrix Factorisation for Visual Place Recognition

链接https://arxiv.org/abs/2510.17739

作者:Timur Ismagilov,Shakaiba Majeed,Michael Milford,Tan Viet Tuyen Nguyen,Sarvapali D. Ramchurn,Shoaib Ehsan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:improve localisation performance, visual place recognition, reference sets captured, address multi-reference visual, localisation performance

备注: 13 pages

点击查看摘要

Abstract:We address multi-reference visual place recognition (VPR), where reference sets captured under varying conditions are used to improve localisation performance. While deep learning with large-scale training improves robustness, increasing data diversity and model complexity incur extensive computational cost during training and deployment. Descriptor-level fusion via voting or aggregation avoids training, but often targets multi-sensor setups or relies on heuristics with limited gains under appearance and viewpoint change. We propose a training-free, descriptor-agnostic approach that jointly models places using multiple reference descriptors via matrix decomposition into basis representations, enabling projection-based residual matching. We also introduce SotonMV, a structured benchmark for multi-viewpoint VPR. On multi-appearance data, our method improves Recall@1 by up to ~18% over single-reference and outperforms multi-reference baselines across appearance and viewpoint changes, with gains of ~5% on unstructured data, demonstrating strong generalisation while remaining lightweight.

10. 【2510.17731】Can Image-To-Video Models Simulate Pedestrian Dynamics?

链接https://arxiv.org/abs/2510.17731

作者:Aaron Appelle,Jerome P. Lynch

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:scale video datasets, Recent high-performing, displayed remarkable inherent, remarkable inherent world-modeling, inherent world-modeling capabilities

备注: Appeared in the ICML 2025 Workshop on Building Physically Plausible World Models, July 2025, [this https URL](https://physical-world-modeling.github.io/)

点击查看摘要

Abstract:Recent high-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable inherent world-modeling capabilities by virtue of training on large scale video datasets. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes. Our framework conditions I2V models on keyframes extracted from pedestrian trajectory benchmarks, then evaluates their trajectory prediction performance using quantitative measures of pedestrian dynamics.

11. 【2510.17724】Signature Forgery Detection: Improving Cross-Dataset Generalization

链接https://arxiv.org/abs/2510.17724

作者:Matheus Ramos Parracho

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:critical biometric technique, Automated signature verification, identity authentication, Automated signature, legal documentation

备注: Undergraduate thesis (preprint)---submitted to Escola Politécnica, Universidade Federal do Rio de Janeiro (POLI/UFRJ). The final version will include official signatures and defense approval

点击查看摘要

Abstract:Automated signature verification is a critical biometric technique used in banking, identity authentication, and legal documentation. Despite the notable progress achieved by deep learning methods, most approaches in offline signature verification still struggle to generalize across datasets, as variations in handwriting styles and acquisition protocols often degrade performance. This study investigates feature learning strategies for signature forgery detection, focusing on improving cross-dataset generalization -- that is, model robustness when trained on one dataset and tested on another. Using three public benchmarks -- CEDAR, ICDAR, and GPDS Synthetic -- two experimental pipelines were developed: one based on raw signature images and another employing a preprocessing method referred to as shell preprocessing. Several behavioral patterns were identified and analyzed; however, no definitive superiority between the two approaches was established. The results show that the raw-image model achieved higher performance across benchmarks, while the shell-based model demonstrated promising potential for future refinement toward robust, cross-domain signature verification.

12. 【2510.17722】MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

链接https://arxiv.org/abs/2510.17722

作者:Yaning Pan,Zekun Wang,Qianqian Xie,Yongqian Wen,Yuanxing Zhang,Guohui Zhang,Haoxuan Hu,Zhiyu Pan,Yibing Huang,Zhidong Gan,Yonghong Lin,An Ping,Tianhao Peng,Jiaheng Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language

备注: Project Website: [this https URL](https://github.com/NJU-LINK/MT-Video-Bench)

点击查看摘要

Abstract:The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.

13. 【2510.17719】Raindrop GS: A Benchmark for 3D Gaussian Splatting under Raindrop Conditions

链接https://arxiv.org/abs/2510.17719

作者:Zhiqiang Teng,Beibei Lin,Tingting Chen,Zifeng Yuan,Xuanyi Li,Xuanyu Zhang,Shunli Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, substantially degrading reconstruction, optical distortions caused, substantially degrading, suffers from severe

备注

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) under raindrop conditions suffers from severe occlusions and optical distortions caused by raindrop contamination on the camera lens, substantially degrading reconstruction quality. Existing benchmarks typically evaluate 3DGS using synthetic raindrop images with known camera poses (constrained images), assuming ideal conditions. However, in real-world scenarios, raindrops often interfere with accurate camera pose estimation and point cloud initialization. Moreover, a significant domain gap between synthetic and real raindrops further impairs generalization. To tackle these issues, we introduce RaindropGS, a comprehensive benchmark designed to evaluate the full 3DGS pipeline-from unconstrained, raindrop-corrupted images to clear 3DGS reconstructions. Specifically, the whole benchmark pipeline consists of three parts: data preparation, data processing, and raindrop-aware 3DGS evaluation, including types of raindrop interference, camera pose estimation and point cloud initialization, single image rain removal comparison, and 3D Gaussian training comparison. First, we collect a real-world raindrop reconstruction dataset, in which each scene contains three aligned image sets: raindrop-focused, background-focused, and rain-free ground truth, enabling a comprehensive evaluation of reconstruction quality under different focus conditions. Through comprehensive experiments and analyses, we reveal critical insights into the performance limitations of existing 3DGS methods on unconstrained raindrop images and the varying impact of different pipeline components: the impact of camera focus position on 3DGS reconstruction performance, and the interference caused by inaccurate pose and point cloud initialization on reconstruction. These insights establish clear directions for developing more robust 3DGS methods under raindrop conditions.

14. 【2510.17716】Automatic Classification of Circulating Blood Cell Clusters based on Multi-channel Flow Cytometry Imaging

链接https://arxiv.org/abs/2510.17716

作者:Suqiang Ma,Subhadeep Sengupta,Yao Lee,Beikang Gu,Xianyan Chen,Xianqiao Wang,Yang Liu,Mengjia Xu,Galit H. Frydman,He Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Circulating blood cell, significant biomarkers linked, cell, cell clusters, Circulating blood

备注

点击查看摘要

Abstract:Circulating blood cell clusters (CCCs) containing red blood cells (RBCs), white blood cells(WBCs), and platelets are significant biomarkers linked to conditions like thrombosis, infection, and inflammation. Flow cytometry, paired with fluorescence staining, is commonly used to analyze these cell clusters, revealing cell morphology and protein profiles. While computational approaches based on machine learning have advanced the automatic analysis of single-cell flow cytometry images, there is a lack of effort to build tools to automatically analyze images containing CCCs. Unlike single cells, cell clusters often exhibit irregular shapes and sizes. In addition, these cell clusters often consist of heterogeneous cell types, which require multi-channel staining to identify the specific cell types within the clusters. This study introduces a new computational framework for analyzing CCC images and identifying cell types within clusters. Our framework uses a two-step analysis strategy. First, it categorizes images into cell cluster and non-cluster groups by fine-tuning the You Only Look Once(YOLOv11) model, which outperforms traditional convolutional neural networks (CNNs), Vision Transformers (ViT). Then, it identifies cell types by overlaying cluster contours with regions from multi-channel fluorescence stains, enhancing accuracy despite cell debris and staining artifacts. This approach achieved over 95% accuracy in both cluster classification and phenotype identification. In summary, our automated framework effectively analyzes CCC images from flow cytometry, leveraging both bright-field and fluorescence data. Initially tested on blood cells, it holds potential for broader applications, such as analyzing immune and tumor cell clusters, supporting cellular research across various diseases.

15. 【2510.17703】Improving Cross-Patient Generalization in Parkinson's Disease Detection through Chunk-Based Analysis of Hand-Drawn Patterns

链接https://arxiv.org/abs/2510.17703

作者:Mhd Adnan Albani,Riad Sonbol

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:causing motor impairments, impede hand coordination, hand coordination activities, Parkinson disease, neurodegenerative disease affecting

备注: 19 pages, 2 figures, 9 tables

点击查看摘要

Abstract:Parkinson's disease (PD) is a neurodegenerative disease affecting about 1% of people over the age of 60, causing motor impairments that impede hand coordination activities such as writing and drawing. Many approaches have tried to support early detection of Parkinson's disease based on hand-drawn images; however, we identified two major limitations in the related works: (1) the lack of sufficient datasets, (2) the robustness when dealing with unseen patient data. In this paper, we propose a new approach to detect Parkinson's disease that consists of two stages: The first stage classifies based on their drawing type(circle, meander, spiral), and the second stage extracts the required features from the images and detects Parkinson's disease. We overcame the previous two limitations by applying a chunking strategy where we divide each image into 2x2 chunks. Each chunk is processed separately when extracting features and recognizing Parkinson's disease indicators. To make the final classification, an ensemble method is used to merge the decisions made from each chunk. Our evaluation shows that our proposed approach outperforms the top performing state-of-the-art approaches, in particular on unseen patients. On the NewHandPD dataset our approach, it achieved 97.08% accuracy for seen patients and 94.91% for unseen patients, our proposed approach maintained a gap of only 2.17 percentage points, compared to the 4.76-point drop observed in prior work.

16. 【2510.17700】Elastic ViTs from Pretrained Models without Retraining

链接https://arxiv.org/abs/2510.17700

作者:Walter Simoncini,Michael Dorkenwald,Tijmen Blankevoort,Cees G.M. Snoek,Yuki M. Asano

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:forcing sub-optimal deployment, sub-optimal deployment choices, Vision foundation models, Vision Transformers, foundation models achieve

备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm, does not require labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeIT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: this https URL

17. 【2510.17699】GAS: Improving Discretization of Diffusion ODEs via Generalized Adversarial Solver

链接https://arxiv.org/abs/2510.17699

作者:Aleksandr Oganov,Ilya Bykov,Eva Neudachina,Mishan Aliev,Alexander Tolmachev,Alexander Sidorov,Aleksandr Zuev,Andrey Okhotin,Denis Rakitin,Aibek Alanov

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:diffusion models achieve, computationally expensive sampling, models achieve, suffer from computationally, computationally expensive

备注

点击查看摘要

Abstract:While diffusion models achieve state-of-the-art generation quality, they still suffer from computationally expensive sampling. Recent works address this issue with gradient-based optimization methods that distill a few-step ODE diffusion solver from the full sampling process, reducing the number of function evaluations from dozens to just a few. However, these approaches often rely on intricate training techniques and do not explicitly focus on preserving fine-grained details. In this paper, we introduce the Generalized Solver: a simple parameterization of the ODE sampler that does not require additional training tricks and improves quality over existing approaches. We further combine the original distillation loss with adversarial training, which mitigates artifacts and enhances detail fidelity. We call the resulting method the Generalized Adversarial Solver and demonstrate its superior performance compared to existing solver training methods under similar resource constraints. Code is available at this https URL.

18. 【2510.17686】owards 3D Objectness Learning in an Open World

链接https://arxiv.org/abs/2510.17686

作者:Taichi Liu,Zhenyu Wang,Ruofeng Liu,Guang Wang,Desheng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:made significant progress, objectness remains insufficient, Recent advancements, category detection, significant progress

备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Recent advancements in 3D object detection and novel category detection have made significant progress, yet research on learning generalized 3D objectness remains insufficient. In this paper, we delve into learning open-world 3D objectness, which focuses on detecting all objects in a 3D scene, including novel objects unseen during training. Traditional closed-set 3D detectors struggle to generalize to open-world scenarios, while directly incorporating 3D open-vocabulary models for open-world ability struggles with vocabulary expansion and semantic overlap. To achieve generalized 3D object discovery, We propose OP3Det, a class-agnostic Open-World Prompt-free 3D Detector to detect any objects within 3D scenes without relying on hand-crafted text prompts. We introduce the strong generalization and zero-shot capabilities of 2D foundation models, utilizing both 2D semantic priors and 3D geometric priors for class-agnostic proposals to broaden 3D object discovery. Then, by integrating complementary information from point cloud and RGB image in the cross-modal mixture of experts, OP3Det dynamically routes uni-modal and multi-modal features to learn generalized 3D objectness. Extensive experiments demonstrate the extraordinary performance of OP3Det, which significantly surpasses existing open-world 3D detectors by up to 16.0% in AR and achieves a 13.5% improvement compared to closed-world 3D detectors.

19. 【2510.17685】Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning

链接https://arxiv.org/abs/2510.17685

作者:Min Cao,Xinyu Zhou,Ding Jiang,Bo Du,Mang Ye,Min Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:person retrieval, aims to identify, textual descriptions, facing challenge, target person

备注: Final version published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Xplore link: [this https URL](https://ieeexplore.ieee.org/document/11199360)

点击查看摘要

Abstract:Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions, facing challenge in modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets. Data and code are presented in this https URL.

20. 【2510.17684】Intelligent Communication Mixture-of-Experts Boosted-Medical Image Segmentation Foundation Model

链接https://arxiv.org/abs/2510.17684

作者:Xinwei Zhang,Hu Chen,Zhe Yuan,Sukun Tian,Peng Feng

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:achieved remarkable performance, image segmentation foundation, medical image segmentation, image segmentation, medical image

备注

点击查看摘要

Abstract:Foundation models for medical image segmentation have achieved remarkable performance. Adaptive fine-tuning of natural image segmentation foundation models is crucial for medical image segmentation tasks. However, some limitations exist in existing fine-tuning methods: 1) insufficient representation of high-level features and 2) the fine-tuning process disrupts the structural integrity of pretrained weights. Inspired by these critical problems, we propose an intelligent communication mixture-of-experts boosted-medical image segmentation foundation model, named IC-MoE, with twofold ideas: 1) We construct basic experts, semantic experts, and adaptive experts. Moreover, we implement a pixel probability adaptive voting strategy, which enables expert selection and fusion through label consistency and load balancing. This approach preliminarily enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. 2) We propose a semantic-guided contrastive learning method to address the issue of weak supervision in contrastive learning. This method further enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. Extensive experiments across three public medical image segmentation datasets demonstrate that the IC-MoE outperforms other SOTA models. Consequently, the proposed IC-MoE effectively supplements foundational medical image segmentation models with high-level features and pretrained structural integrity. We also validate the superior generalizability of the IC-MoE across diverse medical image segmentation scenarios.

21. 【2510.17681】PICABench: How Far Are We from Physically Realistic Image Editing?

链接https://arxiv.org/abs/2510.17681

作者:Yuandong Pu,Le Zhuo,Songhao Han,Jinbo Xing,Kaiwen Zhu,Shuo Cao,Bin Fu,Si Liu,Hongsheng Li,Yu Qiao,Wenlong Zhang,Xi Chen,Yihao Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:remarkable progress recently, achieved remarkable progress, progress recently, achieved remarkable, remarkable progress

备注

点击查看摘要

Abstract:Image editing has achieved remarkable progress recently. Modern editing models could already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are the key to the generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimension (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose the PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with large rooms to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.

22. 【2510.17664】4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads

链接https://arxiv.org/abs/2510.17664

作者:Ling Liu,Jun Tian,Li Yi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:evacuating dense crowds, constrained time budget, highly dynamic environments, budget is essential, setting is critical

备注

点击查看摘要

Abstract:4D panoptic segmentation in a streaming setting is critical for highly dynamic environments, such as evacuating dense crowds and autonomous driving in complex scenarios, where real-time, fine-grained perception within a constrained time budget is essential. In this paper, we introduce 4DSegStreamer, a novel framework that employs a Dual-Thread System to efficiently process streaming frames. The framework is general and can be seamlessly integrated into existing 3D and 4D segmentation methods to enable real-time capability. It also demonstrates superior robustness compared to existing streaming perception approaches, particularly under high FPS conditions. The system consists of a predictive thread and an inference thread. The predictive thread leverages historical motion and geometric information to extract features and forecast future dynamics. The inference thread ensures timely prediction for incoming frames by aligning with the latest memory and compensating for ego-motion and dynamic object movements. We evaluate 4DSegStreamer on the indoor HOI4D dataset and the outdoor SemanticKITTI and nuScenes datasets. Comprehensive experiments demonstrate the effectiveness of our approach, particularly in accurately predicting dynamic objects in complex scenes.

23. 【2510.17651】Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs

链接https://arxiv.org/abs/2510.17651

作者:Sébastien Thuau,Siba Haidar,Ayush Bajracharya,Rachid Chelouah

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:convolutional neural network, frugal federated learning, examine frugal federated, complementary strategies, convolutional neural

备注: 7 pages, 1 figure, FLTA 2025

点击查看摘要

Abstract:We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings. Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation(LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO$_2$ emissions across training and inference, and analyze sustainability trade-offs for deployment. To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.

24. 【2510.17650】ZACH-ViT: A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification

链接https://arxiv.org/abs/2510.17650

作者:Athanasios Angelakis,Amne Mousa,Micah L. A. Heldeweg,Laurens A. Biesheuvel,Mark A. Haaksma,Jasper M. Smit,Pieter R. Tuinman,Paul W. G. Elbers

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Differentiating cardiogenic pulmonary, interstitial lung disease, structurally normal lungs, cardiogenic pulmonary oedema, remains challenging due

备注: 14 pages, 6 figures, 2 tables. Primary subject: cs.LG (Machine Learning) Cross-listed to: cs.CV (Computer Vision and Pattern Recognition), eess.IV (Image and Video Processing). Code available at: [this https URL](https://github.com/Bluesman79/ZACH-ViT) Installation: pip install zachvit Paper licensed under CC BY-NC-ND 4.0. Code released under Apache 2.0 License

点击查看摘要

Abstract:Differentiating cardiogenic pulmonary oedema (CPE) from non-cardiogenic and structurally normal lungs in lung ultrasound (LUS) videos remains challenging due to the high visual variability of non-cardiogenic inflammatory patterns (NCIP/ARDS-like), interstitial lung disease, and healthy lungs. This heterogeneity complicates automated classification as overlapping B-lines and pleural artefacts are common. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a 0.25 M-parameter Vision Transformer variant that removes both positional embeddings and the [CLS] token, making it fully permutation-invariant and suitable for unordered medical image data. To enhance generalization, we propose ShuffleStrides Data Augmentation (SSDA), which permutes probe-view sequences and frame orders while preserving anatomical validity. ZACH-ViT was evaluated on 380 LUS videos from 95 critically ill patients against nine state-of-the-art baselines. Despite the heterogeneity of the non-cardiogenic group, ZACH-ViT achieved the highest validation and test ROC-AUC (0.80 and 0.79) with balanced sensitivity (0.60) and specificity (0.91), while all competing models collapsed to trivial classification. It trains 1.35x faster than Minimal ViT (0.62M parameters) with 2.5x fewer parameters, supporting real-time clinical deployment. These results show that aligning architectural design with data structure can outperform scale in small-data medical imaging.

25. 【2510.17644】Self-supervised Pre-training for Mapping of Archaeological Stone Wall in Historic Landscapes Using High-Resolution DEM Derivatives

链接https://arxiv.org/abs/2510.17644

作者:Zexian Huang,Mashnoon Islam,Brian Armstrong,Kourosh Khoshelham,Martin Tomko

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:walls hold significant, hold significant heritage, hold significant, Dry-stone walls hold, Dry-stone walls

备注

点击查看摘要

Abstract:Dry-stone walls hold significant heritage and environmental value. Mapping these structures is essential for ecosystem preservation and wildfire management in Australia. Yet, many walls remain unidentified due to their inaccessibility and the high cost of manual mapping. Deep learning-based segmentation offers a scalable solution, but two major challenges persist: (1) visual occlusion of low-lying walls by dense vegetation, and (2) limited labeled data for supervised training. We propose DINO-CV, a segmentation framework for automatic mapping of low-lying dry-stone walls using high-resolution Airborne LiDAR-derived digital elevation models (DEMs). DEMs overcome visual occlusion by capturing terrain structures hidden beneath vegetation, enabling analysis of structural rather than spectral cues. DINO-CV introduces a self-supervised cross-view pre-training strategy based on knowledge distillation to mitigate data scarcity. It learns invariant visual and geometric representations across multiple DEM derivatives, supporting various vision backbones including ResNet, Wide ResNet, and Vision Transformers. Applied to the UNESCO World Heritage cultural landscape of Budj Bim, Victoria, the method identifies one of Australia's densest collections of colonial dry-stone walls beyond Indigenous heritage contexts. DINO-CV achieves a mean Intersection over Union (mIoU) of 68.6% on test areas and maintains 63.8% mIoU when fine-tuned with only 10% labeled data. These results demonstrate the potential of self-supervised learning on high-resolution DEM derivatives for automated dry-stone wall mapping in vegetated and heritage-rich environments with scarce annotations.

26. 【2510.17626】CaMiT: A Time-Aware Car Model Dataset for Classification and Generation

链接https://arxiv.org/abs/2510.17626

作者:Frédéric LIN,Biruk Abere Ambaw,Adrian Popescu,Hejer Ammar,Romaric Audigier,Hervé Le Borgne(Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France)

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:object appearances change, systems must adapt, domains where object, object appearances, appearances change

备注: To be published in NeurIPS 2025 Track on Datasets and Benchmarks

点击查看摘要

Abstract:AI systems must adapt to evolving visual environments, especially in domains where object appearances change over time. We introduce Car Models in Time (CaMiT), a fine-grained dataset capturing the temporal evolution of car models, a representative class of technological artifacts. CaMiT includes 787K labeled samples of 190 car models (2007-2023) and 5.1M unlabeled samples (2005-2023), supporting both supervised and self-supervised learning. Static pretraining on in-domain data achieves competitive performance with large-scale generalist models while being more resource-efficient, yet accuracy declines when models are tested across years. To address this, we propose a time-incremental classification setting, a realistic continual learning scenario with emerging, evolving, and disappearing classes. We evaluate two strategies: time-incremental pretraining, which updates the backbone, and time-incremental classifier learning, which updates only the final layer, both improving temporal robustness. Finally, we explore time-aware image generation that leverages temporal metadata during training, yielding more realistic outputs. CaMiT offers a rich benchmark for studying temporal adaptation in fine-grained visual recognition and generation.

27. 【2510.17617】ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input

链接https://arxiv.org/abs/2510.17617

作者:Hendric Voss,Stefan Kopp

类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

关键词:manifold communicative functions, serve manifold communicative, expressive nonverbal cues, nonverbal cues, serve manifold

备注

点击查看摘要

Abstract:Human communication combines speech with expressive nonverbal cues such as hand gestures that serve manifold communicative functions. Yet, current generative gesture generation approaches are restricted to simple, repetitive beat gestures that accompany the rhythm of speaking but do not contribute to communicating semantic meaning. This paper tackles a core challenge in co-speech gesture synthesis: generating iconic or deictic gestures that are semantically coherent with a verbal utterance. Such gestures cannot be derived from language input alone, which inherently lacks the visual meaning that is often carried autonomously by gestures. We therefore introduce a zero-shot system that generates gestures from a given language input and additionally is informed by imagistic input, without manual annotation or human intervention. Our method integrates an image analysis pipeline that extracts key object properties such as shape, symmetry, and alignment, together with a semantic matching module that links these visual details to spoken text. An inverse kinematics engine then synthesizes iconic and deictic gestures and combines them with co-generated natural beat gestures for coherent multimodal communication. A comprehensive user study demonstrates the effectiveness of our approach. In scenarios where speech alone was ambiguous, gestures generated by our system significantly improved participants' ability to identify object properties, confirming their interpretability and communicative value. While challenges remain in representing complex shapes, our results highlight the importance of context-aware semantic gestures for creating expressive and collaborative virtual agents or avatars, marking a substantial step forward towards efficient and robust, embodied human-agent interaction. More information and example videos are available here: this https URL

28. 【2510.17611】One Dinomaly2 Detect Them All: A Unified Framework for Full-Spectrum Unsupervised Anomaly Detection

链接https://arxiv.org/abs/2510.17611

作者:Jia Guo,Shuai Lu,Lei Fan,Zelin Li,Donglin Di,Yang Song,Weihang Zhang,Wenbing Zhu,Hong Yan,Fang Chen,Huiqi Li,Hongen Liao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:models significantly underperform, Unsupervised anomaly detection, Unsupervised anomaly, existing multi-class models, multi-class models significantly

备注: Extended version of CVPR2025

点击查看摘要

Abstract:Unsupervised anomaly detection (UAD) has evolved from building specialized single-class models to unified multi-class models, yet existing multi-class models significantly underperform the most advanced one-for-one counterparts. Moreover, the field has fragmented into specialized methods tailored to specific scenarios (multi-class, 3D, few-shot, etc.), creating deployment barriers and highlighting the need for a unified solution. In this paper, we present Dinomaly2, the first unified framework for full-spectrum image UAD, which bridges the performance gap in multi-class models while seamlessly extending across diverse data modalities and task settings. Guided by the "less is more" philosophy, we demonstrate that the orchestration of five simple element achieves superior performance in a standard reconstruction-based framework. This methodological minimalism enables natural extension across diverse tasks without modification, establishing that simplicity is the foundation of true universality. Extensive experiments on 12 UAD benchmarks demonstrate Dinomaly2's full-spectrum superiority across multiple modalities (2D, multi-view, RGB-3D, RGB-IR), task settings (single-class, multi-class, inference-unified multi-class, few-shot) and application domains (industrial, biological, outdoor). For example, our multi-class model achieves unprecedented 99.9% and 99.3% image-level (I-) AUROC on MVTec-AD and VisA respectively. For multi-view and multi-modal inspection, Dinomaly2 demonstrates state-of-the-art performance with minimum adaptations. Moreover, using only 8 normal examples per class, our method surpasses previous full-shot models, achieving 98.7% and 97.4% I-AUROC on MVTec-AD and VisA. The combination of minimalistic design, computational scalability, and universal applicability positions Dinomaly2 as a unified solution for the full spectrum of real-world anomaly detection applications.

29. 【2510.17609】Integrating BIM and UAV-based photogrammetry for Automated 3D Structure Model Segmentation

链接https://arxiv.org/abs/2510.17609

作者:Siqi Chen,Shanyue Guan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enabled efficient, non-contact structural health, technology has enabled, UAV technology, Building Information Modeling

备注

点击查看摘要

Abstract:The advancement of UAV technology has enabled efficient, non-contact structural health monitoring. Combined with photogrammetry, UAVs can capture high-resolution scans and reconstruct detailed 3D models of infrastructure. However, a key challenge remains in segmenting specific structural components from these models-a process traditionally reliant on time-consuming and error-prone manual labeling. To address this issue, we propose a machine learning-based framework for automated segmentation of 3D point clouds. Our approach uses the complementary strengths of real-world UAV-scanned point clouds and synthetic data generated from Building Information Modeling (BIM) to overcome the limitations associated with manual labeling. Validation on a railroad track dataset demonstrated high accuracy in identifying and segmenting major components such as rails and crossties. Moreover, by using smaller-scale datasets supplemented with BIM data, the framework significantly reduced training time while maintaining reasonable segmentation accuracy. This automated approach improves the precision and efficiency of 3D infrastructure model segmentation and advances the integration of UAV and BIM technologies in structural health monitoring and infrastructure management.

30. 【2510.17603】ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling

链接https://arxiv.org/abs/2510.17603

作者:Shuyuan Zhang,Chenhan Jiang,Zuoou Li,Jiankang Deng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:reduce expert manual, language offers significant, expert manual modeling, manual modeling efforts, offers significant potential

备注: NeurIPS 2025 Poster

点击查看摘要

Abstract:3D generation from natural language offers significant potential to reduce expert manual modeling efforts and enhance accessibility to 3D assets. However, existing methods often yield unstructured meshes and exhibit poor interactivity, making them impractical for artistic workflows. To address these limitations, we represent 3D assets as shape programs and introduce ShapeCraft, a novel multi-agent framework for text-to-3D generation. At its core, we propose a Graph-based Procedural Shape (GPS) representation that decomposes complex natural language into a structured graph of sub-tasks, thereby facilitating accurate LLM comprehension and interpretation of spatial relationships and semantic shape details. Specifically, LLM agents hierarchically parse user input to initialize GPS, then iteratively refine procedural modeling and painting to produce structured, textured, and interactive 3D assets. Qualitative and quantitative experiments demonstrate ShapeCraft's superior performance in generating geometrically accurate and semantically rich 3D assets compared to existing LLM-based agents. We further show the versatility of ShapeCraft through examples of animated and user-customized editing, highlighting its potential for broader interactive applications.

31. 【2510.17599】Conveying Meaning through Gestures: An Investigation into Semantic Co-Speech Gesture Generation

链接https://arxiv.org/abs/2510.17599

作者:Hendric Voss,Lisa Michelle Bohnenkamp,Stefan Kopp

类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

关键词:semantically-augmented variant, resulting movements, study explores, evaluate their ability, ability to convey

备注

点击查看摘要

Abstract:This study explores two frameworks for co-speech gesture generation, AQ-GT and its semantically-augmented variant AQ-GT-a, to evaluate their ability to convey meaning through gestures and how humans perceive the resulting movements. Using sentences from the SAGA spatial communication corpus, contextually similar sentences, and novel movement-focused sentences, we conducted a user-centered evaluation of concept recognition and human-likeness. Results revealed a nuanced relationship between semantic annotations and performance. The original AQ-GT framework, lacking explicit semantic input, was surprisingly more effective at conveying concepts within its training domain. Conversely, the AQ-GT-a framework demonstrated better generalization, particularly for representing shape and size in novel contexts. While participants rated gestures from AQ-GT-a as more expressive and helpful, they did not perceive them as more human-like. These findings suggest that explicit semantic enrichment does not guarantee improved gesture generation and that its effectiveness is highly dependent on the context, indicating a potential trade-off between specialization and generalization.

32. 【2510.17590】MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning

链接https://arxiv.org/abs/2510.17590

作者:Mir Nafis Sharear Shopnil,Sharad Duwal,Abhishek Tyagi,Adiba Mahbub Proma

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词:overwhelming manual fact-checking, manual fact-checking capacity, daily multimodal posts, overwhelming manual, fact-checking capacity

备注: 16 pages, 3 tables, 1 figure

点击查看摘要

Abstract:Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, overwhelming manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to generalize across diverse manipulation tactics. We present MIRAGE, an inference-time, model-pluggable agentic framework that decomposes multimodal verification into four sequential modules: visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence through iterative question generation, and a calibrated judgment module integrates all signals. MIRAGE orchestrates vision-language model reasoning with targeted web retrieval, outputs structured and citation-linked rationales. On MMFakeBench validation set (1,000 samples), MIRAGE with GPT-4o-mini achieves 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline (GPT-4V with MMD-Agent at 74.0% F1) by 7.65 points while maintaining 34.3% false positive rate versus 97.3% for a judge-only baseline. Test set results (5,000 samples) confirm generalization with 81.44% F1 and 75.08% accuracy. Ablation studies show visual verification contributes 5.18 F1 points and retrieval-augmented reasoning contributes 2.97 points. Our results demonstrate that decomposed agentic reasoning with web retrieval can match supervised detector performance without domain-specific training, enabling misinformation detection across modalities where labeled data remains scarce.

33. 【2510.17585】Expose Camouflage in the Water: Underwater Camouflaged Instance Segmentation and Dataset

链接https://arxiv.org/abs/2510.17585

作者:Chuhong Wang,Hua Li,Chongyi Li,Huazhong Liu,Xiongxin Tang,Sam Kwong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:camouflaged instance segmentation, underwater camouflaged instance, camouflaged instance, underwater vision tasks, instance segmentation

备注

点击查看摘要

Abstract:With the development of underwater exploration and marine protection, underwater vision tasks are widespread. Due to the degraded underwater environment, characterized by color distortion, low contrast, and blurring, camouflaged instance segmentation (CIS) faces greater challenges in accurately segmenting objects that blend closely with their surroundings. Traditional camouflaged instance segmentation methods, trained on terrestrial-dominated datasets with limited underwater samples, may exhibit inadequate performance in underwater scenes. To address these issues, we introduce the first underwater camouflaged instance segmentation (UCIS) dataset, abbreviated as UCIS4K, which comprises 3,953 images of camouflaged marine organisms with instance-level annotations. In addition, we propose an Underwater Camouflaged Instance Segmentation network based on Segment Anything Model (UCIS-SAM). Our UCIS-SAM includes three key modules. First, the Channel Balance Optimization Module (CBOM) enhances channel characteristics to improve underwater feature learning, effectively addressing the model's limited understanding of underwater environments. Second, the Frequency Domain True Integration Module (FDTIM) is proposed to emphasize intrinsic object features and reduce interference from camouflage patterns, enhancing the segmentation performance of camouflaged objects blending with their surroundings. Finally, the Multi-scale Feature Frequency Aggregation Module (MFFAM) is designed to strengthen the boundaries of low-contrast camouflaged instances across multiple frequency bands, improving the model's ability to achieve more precise segmentation of camouflaged objects. Extensive experiments on the proposed UCIS4K and public benchmarks show that our UCIS-SAM outperforms state-of-the-art approaches.

34. 【2510.17568】PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception

链接https://arxiv.org/abs/2510.17568

作者:Kaichen Zhou,Yuhan Wang,Grace Chen,Xinhai Chang,Gaspard Beaudouin,Fangneng Zhan,Paul Pu Liang,Mengyu Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Geometry Grounded Transformer, Visual Geometry Grounded, Grounded Transformer, shown strong capability, Visual Geometry

备注

点击查看摘要

Abstract:Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feedforward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction -- all without post-processing. A central challenge in multi-task 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask -- suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.

35. 【2510.17566】WP-CrackNet: A Collaborative Adversarial Learning Framework for End-to-End Weakly-Supervised Road Crack Detection

链接https://arxiv.org/abs/2510.17566

作者:Nachuan Ma,Zhengfei Song,Qiang Hu,Xiaoyu Tang,Chengxi Zhang,Rui Fan,Lihua Xie

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:intelligent infrastructure maintenance, Road crack detection, smart cities, essential for intelligent, intelligent infrastructure

备注

点击查看摘要

Abstract:Road crack detection is essential for intelligent infrastructure maintenance in smart cities. To reduce reliance on costly pixel-level annotations, we propose WP-CrackNet, an end-to-end weakly-supervised method that trains with only image-level labels for pixel-wise crack detection. WP-CrackNet integrates three components: a classifier generating class activation maps (CAMs), a reconstructor measuring feature inferability, and a detector producing pixel-wise road crack detection results. During training, the classifier and reconstructor alternate in adversarial learning to encourage crack CAMs to cover complete crack regions, while the detector learns from pseudo labels derived from post-processed crack CAMs. This mutual feedback among the three components improves learning stability and detection accuracy. To further boost detection performance, we design a path-aware attention module (PAAM) that fuses high-level semantics from the classifier with low-level structural cues from the reconstructor by modeling spatial and channel-wise dependencies. Additionally, a center-enhanced CAM consistency module (CECCM) is proposed to refine crack CAMs using center Gaussian weighting and consistency constraints, enabling better pseudo-label generation. We create three image-level datasets and extensive experiments show that WP-CrackNet achieves comparable results to supervised methods and outperforms existing weakly-supervised methods, significantly advancing scalable road inspection. The source code package and datasets are available at this https URL.

36. 【2510.17529】MambaX-Net: Dual-Input Mamba-Enhanced Cross-Attention Network for Longitudinal MRI Segmentation

链接https://arxiv.org/abs/2510.17529

作者:Yovin Yahathugoda,Davide Prezzi,Piyalitt Ittichaiwong,Vicky Goh,Sebastien Ourselin,Michela Antonelli

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Active Surveillance, monitoring disease progression, intermediate-risk prostate cancer, aiming to avoid, clinical follow-up

备注

点击查看摘要

Abstract:Active Surveillance (AS) is a treatment option for managing low and intermediate-risk prostate cancer (PCa), aiming to avoid overtreatment while monitoring disease progression through serial MRI and clinical follow-up. Accurate prostate segmentation is an important preliminary step for automating this process, enabling automated detection and diagnosis of PCa. However, existing deep-learning segmentation models are often trained on single-time-point and expertly annotated datasets, making them unsuitable for longitudinal AS analysis, where multiple time points and a scarcity of expert labels hinder their effective fine-tuning. To address these challenges, we propose MambaX-Net, a novel semi-supervised, dual-scan 3D segmentation architecture that computes the segmentation for time point t by leveraging the MRI and the corresponding segmentation mask from the previous time point. We introduce two new components: (i) a Mamba-enhanced Cross-Attention Module, which integrates the Mamba block into cross attention to efficiently capture temporal evolution and long-range spatial dependencies, and (ii) a Shape Extractor Module that encodes the previous segmentation mask into a latent anatomical representation for refined zone delination. Moreover, we introduce a semi-supervised self-training strategy that leverages pseudo-labels generated from a pre-trained nnU-Net, enabling effective learning without expert annotations. MambaX-Net was evaluated on a longitudinal AS dataset, and results showed that it significantly outperforms state-of-the-art U-Net and Transformer-based models, achieving superior prostate zone segmentation even when trained on limited and noisy data.

37. 【2510.17519】MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models

链接https://arxiv.org/abs/2510.17519

作者:Yongshun Zhang,Zhongyi Fan,Yonghang Zhang,Zhangzikang Li,Weifeng Chen,Zhongwei Feng,Chaoyue Wang,Peng Hou,Anxiang Zeng

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:made remarkable progress, large-scale video generation, video generation, video generation models, visual content

备注: Technical Report; Project Page: \href{ [this https URL](https://github.com/Shopee-MUG/MUG-V) }

点击查看摘要

Abstract:In recent years, large-scale generative models for visual content (\textit{e.g.,} images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling, details are available in \href{this https URL}{our webpage}.

38. 【2510.17501】Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization

链接https://arxiv.org/abs/2510.17501

作者:Yuanli Wu,Long Zhang,Yue Du,Bin Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:compressing long footage, social media, compressing long, surrogates is crucial, exploding across social

备注

点击查看摘要

Abstract:With video exploding across social media, surveillance, and education, compressing long footage into concise yet faithful surrogates is crucial. Supervised methods learn frame/shot importance from dense labels and excel in-domain, but are costly and brittle across datasets; unsupervised methods avoid labels but often miss high-level semantics and narrative cues. Recent zero-shot pipelines use LLMs for training-free summarization, yet remain sensitive to handcrafted prompts and dataset-specific this http URL propose a rubric-guided, pseudo-labeled prompting framework. A small subset of human annotations is converted into high-confidence pseudo labels and aggregated into structured, dataset-adaptive scoring rubrics for interpretable scene evaluation. At inference, boundary scenes (first/last) are scored from their own descriptions, while intermediate scenes include brief summaries of adjacent segments to assess progression and redundancy, enabling the LLM to balance local salience with global coherence without parameter this http URL three benchmarks, our method is consistently effective. On SumMe and TVSum it achieves F1 of 57.58 and 63.05, surpassing a zero-shot baseline (56.73, 62.21) by +0.85 and +0.84 and approaching supervised performance. On the query-focused QFVS benchmark it attains 53.79 F1, beating 53.42 by +0.37 and remaining stable across validation videos. These results show that rubric-guided pseudo labeling, coupled with contextual prompting, stabilizes LLM-based scoring and yields a general, interpretable zero-shot paradigm for both generic and query-focused video summarization.

39. 【2510.17484】Split-Fuse-Transport: Annotation-Free Saliency via Dual Clustering and Optimal Transport Alignment

链接https://arxiv.org/abs/2510.17484

作者:Muhammad Umer Ramzan,Ali Zia,Abdelwahed Khamis,Noman Ali,Usman Ali,Wei Xiang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Salient object detection, computer vision applications, segment visually prominent, visually prominent regions, Salient object

备注

点击查看摘要

Abstract:Salient object detection (SOD) aims to segment visually prominent regions in images and serves as a foundational task for various computer vision applications. We posit that SOD can now reach near-supervised accuracy without a single pixel-level label, but only when reliable pseudo-masks are available. We revisit the prototype-based line of work and make two key observations. First, boundary pixels and interior pixels obey markedly different geometry; second, the global consistency enforced by optimal transport (OT) is underutilized if prototype quality is weak. To address this, we introduce POTNet, an adaptation of Prototypical Optimal Transport that replaces POT's single k-means step with an entropy-guided dual-clustering head: high-entropy pixels are organized by spectral clustering, low-entropy pixels by k-means, and the two prototype sets are subsequently aligned by OT. This split-fuse-transport design yields sharper, part-aware pseudo-masks in a single forward pass, without handcrafted priors. Those masks supervise a standard MaskFormer-style encoder-decoder, giving rise to AutoSOD, an end-to-end unsupervised SOD pipeline that eliminates SelfMask's offline voting yet improves both accuracy and training efficiency. Extensive experiments on five benchmarks show that AutoSOD outperforms unsupervised methods by up to 26% and weakly supervised methods by up to 36% in F-measure, further narrowing the gap to fully supervised models.

40. 【2510.17482】SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries

链接https://arxiv.org/abs/2510.17482

作者:Chenxu Dang,Haiyan Liu,Guangjun Bao,Pei An,Xinyue Tang,Jie Ma,Bingchuan Sun,Yan Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:rich spatial semantics, spatial semantics, capture rich spatial, occupancy world models, Semantic occupancy

备注: Under Review

点击查看摘要

Abstract:Semantic occupancy has emerged as a powerful representation in world models for its ability to capture rich spatial semantics. However, most existing occupancy world models rely on static and fixed embeddings or grids, which inherently limit the flexibility of perception. Moreover, their ``in-place classification" over grids exhibits a potential misalignment with the dynamic and continuous nature of real this http URL this paper, we propose SparseWorld, a novel 4D occupancy world model that is flexible, adaptive, and efficient, powered by sparse and dynamic queries. We propose a Range-Adaptive Perception module, in which learnable queries are modulated by the ego vehicle states and enriched with temporal-spatial associations to enable extended-range perception. To effectively capture the dynamics of the scene, we design a State-Conditioned Forecasting module, which replaces classification-based forecasting with regression-guided formulation, precisely aligning the dynamic queries with the continuity of the 4D environment. In addition, We specifically devise a Temporal-Aware Self-Scheduling training strategy to enable smooth and efficient training. Extensive experiments demonstrate that SparseWorld achieves state-of-the-art performance across perception, forecasting, and planning tasks. Comprehensive visualizations and ablation studies further validate the advantages of SparseWorld in terms of flexibility, adaptability, and efficiency. The code is available at this https URL.

41. 【2510.17479】Initialize to Generalize: A Stronger Initialization Pipeline for Sparse-View 3DGS

链接https://arxiv.org/abs/2510.17479

作者:Feng Zhou,Wenkai Guo,Pu Cao,Zhicheng Zhang,Jianqin Yin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, leading to artifacts, artifacts like blurring, Splatting, training-time constraints

备注: A preprint paper

点击查看摘要

Abstract:Sparse-view 3D Gaussian Splatting (3DGS) often overfits to the training views, leading to artifacts like blurring in novel view rendering. Prior work addresses it either by enhancing the initialization (\emph{i.e.}, the point cloud from Structure-from-Motion (SfM)) or by adding training-time constraints (regularization) to the 3DGS optimization. Yet our controlled ablations reveal that initialization is the decisive factor: it determines the attainable performance band in sparse-view 3DGS, while training-time constraints yield only modest within-band improvements at extra cost. Given initialization's primacy, we focus our design there. Although SfM performs poorly under sparse views due to its reliance on feature matching, it still provides reliable seed points. Thus, building on SfM, our effort aims to supplement the regions it fails to cover as comprehensively as possible. Specifically, we design: (i) frequency-aware SfM that improves low-texture coverage via low-frequency view augmentation and relaxed multi-view correspondences; (ii) 3DGS self-initialization that lifts photometric supervision into additional points, compensating SfM-sparse regions with learned Gaussian centers; and (iii) point-cloud regularization that enforces multi-view consistency and uniform spatial coverage through simple geometric/visibility priors, yielding a clean and reliable point cloud. Our experiments on LLFF and Mip-NeRF360 demonstrate consistent gains in sparse-view settings, establishing our approach as a stronger initialization strategy. Code is available at this https URL.

42. 【2510.17440】Rethinking Nighttime Image Deraining via Learnable Color Space Transformation

链接https://arxiv.org/abs/2510.17440

作者:Qiyuan Guan,Xiang Chen,Guiyue Jin,Jiyu Jin,Shumin Fan,Tianyu Song,Jinshan Pan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:poses significant challenges, significant challenges due, nighttime image deraining, deraining poses significant, daytime image deraining

备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Compared to daytime image deraining, nighttime image deraining poses significant challenges due to inherent complexities of nighttime scenarios and the lack of high-quality datasets that accurately represent the coupling effect between rain and illumination. In this paper, we rethink the task of nighttime image deraining and contribute a new high-quality benchmark, HQ-NightRain, which offers higher harmony and realism compared to existing datasets. In addition, we develop an effective Color Space Transformation Network (CST-Net) for better removing complex rain from nighttime scenes. Specifically, we propose a learnable color space converter (CSC) to better facilitate rain removal in the Y channel, as nighttime rain is more pronounced in the Y channel compared to the RGB color space. To capture illumination information for guiding nighttime deraining, implicit illumination guidance is introduced enabling the learned features to improve the model's robustness in complex scenarios. Extensive experiments show the value of our dataset and the effectiveness of our method. The source code and datasets are available at this https URL.

43. 【2510.17439】From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors

链接https://arxiv.org/abs/2510.17439

作者:Zhengshen Zhang,Hao Li,Yalun Dai,Zhengbang Zhu,Lei Zhou,Chenchen Liu,Dong Wang,Francis E. H. Tay,Sijin Chen,Ziwei Liu,Yuxiao Liu,Xinghang Li,Pan Zhou

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:generalization and adaptability, typically built, gap that limits, limits generalization, spatial reasoning gap

备注: Project page: [this https URL](https://falcon-vla.github.io/)

点击查看摘要

Abstract:Existing vision-language-action (VLA) models act in 3D real-world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth, or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, our proposed FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.

44. 【2510.17434】Leveraging AV1 motion vectors for Fast and Dense Feature Matching

链接https://arxiv.org/abs/2510.17434

作者:Julien Zouein,Hossein Javidnia,François Pitié,Anil Kokaram

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:produce dense sub-pixel, short tracks filtered, dense sub-pixel correspondences, motion vectors, cosine consistency

备注: Accepted ICIR 2025, camera-ready version

点击查看摘要

Abstract:We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. As a small SfM demo on a 117-frame clip, MV matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53,px reprojection error; BA time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.

45. 【2510.17422】DeepDetect: Learning All-in-One Dense Keypoints

链接https://arxiv.org/abs/2510.17422

作者:Shaharyar Ahmed Khan Tareen,Filza Khan Tareen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:computer vision tasks, including image registration, structure-from motion, vision tasks, computer vision

备注: 6 pages, 6 figures, 2 tables, 7 equations

点击查看摘要

Abstract:Keypoint detection is the foundation of many computer vision tasks, including image registration, structure-from motion, 3D reconstruction, visual odometry, and SLAM. Traditional detectors (SIFT, SURF, ORB, BRISK, etc.) and learning based methods (SuperPoint, R2D2, LF-Net, D2-Net, etc.) have shown strong performance yet suffer from key limitations: sensitivity to photometric changes, low keypoint density and repeatability, limited adaptability to challenging scenes, and lack of semantic understanding, often failing to prioritize visually important regions. We present DeepDetect, an intelligent, all-in-one, dense keypoint detector that unifies the strengths of classical detectors using deep learning. Firstly, we create ground-truth masks by fusing outputs of 7 keypoint and 2 edge detectors, extracting diverse visual cues from corners and blobs to prominent edges and textures in the images. Afterwards, a lightweight and efficient model: ESPNet, is trained using these masks as labels, enabling DeepDetect to focus semantically on images while producing highly dense keypoints, that are adaptable to diverse and visually degraded conditions. Evaluations on the Oxford Affine Covariant Regions dataset demonstrate that DeepDetect surpasses other detectors in keypoint density, repeatability, and the number of correct matches, achieving maximum values of 0.5143 (average keypoint density), 0.9582 (average repeatability), and 59,003 (correct matches).

46. 【2510.17409】Monitoring Horses in Stalls: From Object to Event Detection

链接https://arxiv.org/abs/2510.17409

作者:Dmitrii Galimzianov,Viacheslav Vyshegorodtsev,Ivan Nezhivykh

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:labor-intensive and time-consuming, behavior of stalled, essential for early, issues but remains, remains labor-intensive

备注: 12 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Monitoring the behavior of stalled horses is essential for early detection of health and welfare issues but remains labor-intensive and time-consuming. In this study, we present a prototype vision-based monitoring system that automates the detection and tracking of horses and people inside stables using object detection and multi-object tracking techniques. The system leverages YOLOv11 and BoT-SORT for detection and tracking, while event states are inferred based on object trajectories and spatial relations within the stall. To support development, we constructed a custom dataset annotated with assistance from foundation models CLIP and GroundingDINO. The system distinguishes between five event types and accounts for the camera's blind spots. Qualitative evaluation demonstrated reliable performance for horse-related events, while highlighting limitations in detecting people due to data scarcity. This work provides a foundation for real-time behavioral monitoring in equine facilities, with implications for animal welfare and stable management.

47. 【2510.17394】MILES: Modality-Informed Learning Rate Scheduler for Balancing Multimodal Learning

链接https://arxiv.org/abs/2510.17394

作者:Alejandro Guerra-Manzanares,Farah E. Shamout

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:diverse data sources, combine diverse data, multimodal, multimodal neural networks, data sources

备注: Accepted and presented at the 2025 International Joint Conference on Neural Networks (IJCNN'25). The paper was awarded an honorable mention (best 4 papers)

点击查看摘要

Abstract:The aim of multimodal neural networks is to combine diverse data sources, referred to as modalities, to achieve enhanced performance compared to relying on a single modality. However, training of multimodal networks is typically hindered by modality overfitting, where the network relies excessively on one of the available modalities. This often yields sub-optimal performance, hindering the potential of multimodal learning and resulting in marginal improvements relative to unimodal models. In this work, we present the Modality-Informed Learning ratE Scheduler (MILES) for training multimodal joint fusion models in a balanced manner. MILES leverages the differences in modality-wise conditional utilization rates during training to effectively balance multimodal learning. The learning rate is dynamically adjusted during training to balance the speed of learning from each modality by the multimodal model, aiming for enhanced performance in both multimodal and unimodal predictions. We extensively evaluate MILES on four multimodal joint fusion tasks and compare its performance to seven state-of-the-art baselines. Our results show that MILES outperforms all baselines across all tasks and fusion methods considered in our study, effectively balancing modality usage during training. This results in improved multimodal performance and stronger modality encoders, which can be leveraged when dealing with unimodal samples or absent modalities. Overall, our work highlights the impact of balancing multimodal learning on improving model performance.

48. 【2510.17384】Closed-Loop Transfer for Weakly-supervised Affordance Grounding

链接https://arxiv.org/abs/2510.17384

作者:Jiajin Tang,Zhengxuan Wei,Ge Zheng,Sibei Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:perform previously unexperienced, previously unexperienced interactions, perform previously, previously unexperienced, simply by observing

备注: Accepted at ICCV 2025

点击查看摘要

Abstract:Humans can perform previously unexperienced interactions with novel objects simply by observing others engage with them. Weakly-supervised affordance grounding mimics this process by learning to locate object regions that enable actions on egocentric images, using exocentric interaction images with image-level annotations. However, extracting affordance knowledge solely from exocentric images and transferring it one-way to egocentric images limits the applicability of previous works in complex interaction scenarios. Instead, this study introduces LoopTrans, a novel closed-loop framework that not only transfers knowledge from exocentric to egocentric but also transfers back to enhance exocentric knowledge extraction. Within LoopTrans, several innovative mechanisms are introduced, including unified cross-modal localization and denoising knowledge distillation, to bridge domain gaps between object-centered egocentric and interaction-centered exocentric images while enhancing knowledge transfer. Experiments show that LoopTrans achieves consistent improvements across all metrics on image and video benchmarks, even handling challenging scenarios where object interaction regions are fully occluded by the human body.

49. 【2510.17383】Latent Spaces Beyond Synthesis: From GANs to Diffusion Models

链接https://arxiv.org/abs/2510.17383

作者:Ludovica Schaerf

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

关键词:generative visual models, Beatrice Fazi account, paper examines, examines the evolving, evolving nature

备注: Presented and published at Ethics and Aesthetics of Artificial Intelligence Conference (EA-AI'25)

点击查看摘要

Abstract:This paper examines the evolving nature of internal representations in generative visual models, focusing on the conceptual and technical shift from GANs and VAEs to diffusion-based architectures. Drawing on Beatrice Fazi's account of synthesis as the amalgamation of distributed representations, we propose a distinction between "synthesis in a strict sense", where a compact latent space wholly determines the generative process, and "synthesis in a broad sense," which characterizes models whose representational labor is distributed across layers. Through close readings of model architectures and a targeted experimental setup that intervenes in layerwise representations, we show how diffusion models fragment the burden of representation and thereby challenge assumptions of unified internal space. By situating these findings within media theoretical frameworks and critically engaging with metaphors such as the latent space and the Platonic Representation Hypothesis, we argue for a reorientation of how generative AI is understood: not as a direct synthesis of content, but as an emergent configuration of specialized processes.

50. 【2510.17373】Facial Expression-based Parkinson's Disease Severity Diagnosis via Feature Fusion and Adaptive Class Balancing

链接https://arxiv.org/abs/2510.17373

作者:Yintao Zhou,Wei Huang,Zhengyu Li,Jing Huang,Meng Pang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:adopting tailored interventions, early detecting potential, detecting potential patients, Parkinson disease, tailored interventions

备注: 3 pages, 2 figures, accepted by MIND 2025

点击查看摘要

Abstract:Parkinson's disease (PD) severity diagnosis is crucial for early detecting potential patients and adopting tailored interventions. Diagnosing PD based on facial expression is grounded in PD patients' "masked face" symptom and gains growing interest recently for its convenience and affordability. However, current facial expression-based approaches often rely on single type of expression which can lead to misdiagnosis, and ignore the class imbalance across different PD stages which degrades the prediction performance. Moreover, most existing methods focus on binary classification (i.e., PD / non-PD) rather than diagnosing the severity of PD. To address these issues, we propose a new facial expression-based method for PD severity diagnosis which integrates multiple facial expression features through attention-based feature fusion. Moreover, we mitigate the class imbalance problem via an adaptive class balancing strategy which dynamically adjusts the contribution of training samples based on their class distribution and classification difficulty. Experimental results demonstrate the promising performance of the proposed method for PD severity diagnosis, as well as the efficacy of attention-based feature fusion and adaptive class balancing.

51. 【2510.17372】Beyond Real Faces: Synthetic Datasets Can Achieve Reliable Recognition Performance without Privacy Compromise

链接https://arxiv.org/abs/2510.17372

作者:Paweł Borsukiewicz,Fadi Boutros,Iyiola E. Olatunji,Charles Beumier,Wendkûuni C. Ouedraogo,Jacques Klein,Tegawendé F. Bissyandé

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieving high accuracy, high accuracy requires, accuracy requires massive, potential legal liabilities, regulations like GDPR

备注

点击查看摘要

Abstract:The deployment of facial recognition systems has created an ethical dilemma: achieving high accuracy requires massive datasets of real faces collected without consent, leading to dataset retractions and potential legal liabilities under regulations like GDPR. While synthetic facial data presents a promising privacy-preserving alternative, the field lacks comprehensive empirical evidence of its viability. This study addresses this critical gap through extensive evaluation of synthetic facial recognition datasets. We present a systematic literature review identifying 25 synthetic facial recognition datasets (2018-2025), combined with rigorous experimental validation. Our methodology examines seven key requirements for privacy-preserving synthetic data: identity leakage prevention, intra-class variability, identity separability, dataset scale, ethical data sourcing, bias mitigation, and benchmark reliability. Through experiments involving over 10 million synthetic samples, extended by a comparison of results reported on five standard benchmarks, we provide the first comprehensive empirical assessment of synthetic data's capability to replace real datasets. Best-performing synthetic datasets (VariFace, VIGFace) achieve recognition accuracies of 95.67% and 94.91% respectively, surpassing established real datasets including CASIA-WebFace (94.70%). While those images remain private, publicly available alternatives Vec2Face (93.52%) and CemiFace (93.22%) come close behind. Our findings reveal that they ensure proper intra-class variability while maintaining identity separability. Demographic bias analysis shows that, even though synthetic data inherits limited biases, it offers unprecedented control for bias mitigation through generation parameters. These results establish synthetic facial data as a scientifically viable and ethically imperative alternative for facial recognition research.

52. 【2510.17364】Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs

链接https://arxiv.org/abs/2510.17364

作者:Vaggelis Dorovatas,Soroush Seifi,Gunshi Gupta,Rahaf Aljundi

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Large Language Models, Video Large Language, Large Language, Language Models, understanding videos in-context

备注: NeurIPS 2025

点击查看摘要

Abstract:Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online, and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.

53. 【2510.17363】M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

链接https://arxiv.org/abs/2510.17363

作者:U.V.B.L Udugama,George Vosselman,Francesco Nex

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:devices requires efficient, edge devices requires, requires efficient multi-task, Deploying real-time spatial, minimizing computational overhead

备注: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures

点击查看摘要

Abstract:Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.

54. 【2510.17347】Exploring The Missing Semantics In Event Modality

链接https://arxiv.org/abs/2510.17347

作者:Jingqian Wu,Shengpeng Xu,Yunbo Jia,Edmund Y. Lam

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high dynamic range, efficient motion capture, offer distinct advantages, Event cameras offer, semantic information

备注

点击查看摘要

Abstract:Event cameras offer distinct advantages such as low latency, high dynamic range, and efficient motion capture. However, event-to-video reconstruction (E2V), a fundamental event-based vision task, remains challenging, particularly for reconstructing and recovering semantic information. This is primarily due to the nature of the event camera, as it only captures intensity changes, ignoring static objects and backgrounds, resulting in a lack of semantic information in captured event modality. Further, semantic information plays a crucial role in video and frame reconstruction, yet is often overlooked by existing E2V approaches. To bridge this gap, we propose Semantic-E2VID, an E2V framework that explores the missing visual semantic knowledge in event modality and leverages it to enhance event-to-video reconstruction. Specifically, Semantic-E2VID introduces a cross-modal feature alignment (CFA) module to transfer the robust visual semantics from a frame-based vision foundation model, the Segment Anything Model (SAM), to the event encoder, while aligning the high-level features from distinct modalities. To better utilize the learned semantic feature, we further propose a semantic-aware feature fusion (SFF) block to integrate learned semantics in frame modality to form event representations with rich semantics that can be decoded by the event decoder. Further, to facilitate the reconstruction of semantic information, we propose a novel Semantic Perceptual E2V Supervision that helps the model to reconstruct semantic details by leveraging SAM-generated categorical labels. Extensive experiments demonstrate that Semantic-E2VID significantly enhances frame quality, outperforming state-of-the-art E2V methods across multiple benchmarks. The sample code is included in the supplementary material.

55. 【2510.17338】Nearest-Class Mean and Logits Agreement for Wildlife Open-Set Recognition

链接https://arxiv.org/abs/2510.17338

作者:Jiahao Huo,Mufhumudzi Muthivhi,Terence L. van Zyl,Fredrik Gustafsson

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:closed world setting, Wildlife classification models, Wildlife classification, world setting, closed world

备注

点击查看摘要

Abstract:Current state-of-the-art Wildlife classification models are trained under the closed world setting. When exposed to unknown classes, they remain overconfident in their predictions. Open-set Recognition (OSR) aims to classify known classes while rejecting unknown samples. Several OSR methods have been proposed to model the closed-set distribution by observing the feature, logit, or softmax probability space. A significant drawback of many existing approaches is the requirement to retrain the pre-trained classification model with the OSR-specific strategy. This study contributes a post-processing OSR method that measures the agreement between the models' features and predicted logits. We propose a probability distribution based on an input's distance to its Nearest Class Mean (NCM). The NCM-based distribution is then compared with the softmax probabilities from the logit space to measure agreement between the NCM and the classification head. Our proposed strategy ranks within the top three on two evaluated datasets, showing consistent performance across the two datasets. In contrast, current state-of-the-art methods excel on a single dataset. We achieve an AUROC of 93.41 and 95.35 for African and Swedish animals. The code can be found this https URL.

56. 【2510.17332】DETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA

链接https://arxiv.org/abs/2510.17332

作者:Zhaoran Zhao,Xinli Yue,Jianhui Sun,Yuhao Xie,Tao Shao,Liangchao Yao,Fan Xia,Yuetang Deng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:human-aligned evaluation paradigms, scalar quality prediction, Image Quality Assessment, human-aligned evaluation, evaluation paradigms

备注: Accepted to ICCV 2025 Workshop

点击查看摘要

Abstract:Image Quality Assessment (IQA) has progressed from scalar quality prediction to more interpretable, human-aligned evaluation paradigms. In this work, we address the emerging challenge of detailed and explainable IQA by proposing iDETEX-a unified multimodal large language model (MLLM) capable of simultaneously performing three key tasks: quality grounding, perception, and description. To facilitate efficient and generalizable training across these heterogeneous subtasks, we design a suite of task-specific offline augmentation modules and a data mixing strategy. These are further complemented by online enhancement strategies to fully exploit multi-sourced supervision. We validate our approach on the large-scale ViDA-UGC benchmark, where iDETEX achieves state-of-the-art performance across all subtasks. Our model ranks first in the ICCV MIPI 2025 Detailed Image Quality Assessment Challenge, demonstrating its effectiveness and robustness in delivering accurate and interpretable quality assessments.

57. 【2510.17330】CharDiff: A Diffusion Model with Character-Level Guidance for License Plate Image Restoration

链接https://arxiv.org/abs/2510.17330

作者:Gyuhwan Park,Kihyun Na,Injung Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:license plate images, including increasing evidential, license plate, License Plate Recognition, plate images

备注: 11 pages, 6 figures

点击查看摘要

Abstract:The significance of license plate image restoration goes beyond the preprocessing stage of License Plate Recognition (LPR) systems, as it also serves various purposes, including increasing evidential value, enhancing the clarity of visual interface, and facilitating further utilization of license plate images. We propose a novel diffusion-based framework with character-level guidance, CharDiff, which effectively restores and recognizes severely degraded license plate images captured under realistic conditions. CharDiff leverages fine-grained character-level priors extracted through external segmentation and Optical Character Recognition (OCR) modules tailored for low-quality license plate images. For precise and focused guidance, CharDiff incorporates a novel Character-guided Attention through Region-wise Masking (CHARM) module, which ensures that each character's guidance is restricted to its own region, thereby avoiding interference with other regions. In experiments, CharDiff significantly outperformed the baseline restoration models in both restoration quality and recognition accuracy, achieving a 28% relative reduction in CER on the Roboflow-LP dataset, compared to the best-performing baseline model. These results indicate that the structured character-guided conditioning effectively enhances the robustness of diffusion-based license plate restoration and recognition in practical deployment scenarios.

58. 【2510.17322】A Single Set of Adversarial Clothes Breaks Multiple Defense Methods in the Physical World

链接https://arxiv.org/abs/2510.17322

作者:Wei Zhang,Zhanhao Hu,Xiao Li,Xiaopei Zhu,Xiaolin Hu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:deep learning-based object, defense methods, learning-based object detectors, adversarial, recent years

备注: 13 pages, 8 figures

点击查看摘要

Abstract:In recent years, adversarial attacks against deep learning-based object detectors in the physical world have attracted much attention. To defend against these attacks, researchers have proposed various defense methods against adversarial patches, a typical form of physically-realizable attack. However, our experiments showed that simply enlarging the patch size could make these defense methods fail. Motivated by this, we evaluated various defense methods against adversarial clothes which have large coverage over the human body. Adversarial clothes provide a good test case for adversarial defense against patch-based attacks because they not only have large sizes but also look more natural than a large patch on humans. Experiments show that all the defense methods had poor performance against adversarial clothes in both the digital world and the physical world. In addition, we crafted a single set of clothes that broke multiple defense methods on Faster R-CNN. The set achieved an Attack Success Rate (ASR) of 96.06% against the undefended detector and over 64.84% ASRs against nine defended models in the physical world, unveiling the common vulnerability of existing adversarial defense methods against adversarial clothes. Code is available at: this https URL.

59. 【2510.17318】CausalMamba: Scalable Conditional State Space Models for Neural Causal Inference

链接https://arxiv.org/abs/2510.17318

作者:Sangyoon Bae,Jiook Cha

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:hemodynamically distorted BOLD, distorted BOLD signals, Dynamic Causal Modeling, inferring neural causality, Conditional Mamba architecture

备注

点击查看摘要

Abstract:We introduce CausalMamba, a scalable framework that addresses fundamental limitations in fMRI-based causal inference: the ill-posed nature of inferring neural causality from hemodynamically distorted BOLD signals and the computational intractability of existing methods like Dynamic Causal Modeling (DCM). Our approach decomposes this complex inverse problem into two tractable stages: BOLD deconvolution to recover latent neural activity, followed by causal graph inference using a novel Conditional Mamba architecture. On simulated data, CausalMamba achieves 37% higher accuracy than DCM. Critically, when applied to real task fMRI data, our method recovers well-established neural pathways with 88% fidelity, whereas conventional approaches fail to identify these canonical circuits in over 99% of subjects. Furthermore, our network analysis of working memory data reveals that the brain strategically shifts its primary causal hub-recruiting executive or salience networks depending on the stimulus-a sophisticated reconfiguration that remains undetected by traditional methods. This work provides neuroscientists with a practical tool for large-scale causal inference that captures both fundamental circuit motifs and flexible network dynamics underlying cognitive function.

60. 【2510.17305】LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding

链接https://arxiv.org/abs/2510.17305

作者:ZhaoYang Han,Qihan Lin,Hao Liang,Bowen Chen,Zhou Liu,Wentao Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:understand long videos, assess models' ability, Challenging Task Scenarios, rich language elements, textbf

备注: Submitted to ARR Rolling Review

点击查看摘要

Abstract:We introduce \textbf{LongInsightBench}, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating \textbf{visual, audio, and text} modalities. Our benchmark excels in three key areas: \textbf{a) Long-Duration, Information-Dense Videos:} We carefully select approximately 1,000 videos from open-source datasets FineVideo based on duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. \textbf{b) Diverse and Challenging Task Scenarios:} We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. \textbf{c) Rigorous and Comprehensive Quality Assurance Pipelines:} We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results shows that Omni-modal models(OLMs) still face challenge in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal the information loss and processing bias in multi-modal fusion of OLMs. Our dataset and code is available at this https URL.

61. 【2510.17299】Exploring Structural Degradation in Dense Representations for Self-supervised Learning

链接https://arxiv.org/abs/2510.17299

作者:Siran Dai,Qianqian Xu,Peisong Wen,Yang Liu,Qingming Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:semantic segmentation, Self-supervised Dense Degradation, self-supervised learning, dense prediction tasks, observe a counterintuitive

备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:In this work, we observe a counterintuitive phenomenon in self-supervised learning (SSL): longer training may impair the performance of dense prediction tasks (e.g., semantic segmentation). We refer to this phenomenon as Self-supervised Dense Degradation (SDD) and demonstrate its consistent presence across sixteen state-of-the-art SSL methods with various losses, architectures, and datasets. When the model performs suboptimally on dense tasks at the end of training, measuring the performance during training becomes essential. However, evaluating dense performance effectively without annotations remains an open challenge. To tackle this issue, we introduce a Dense representation Structure Estimator (DSE), composed of a class-relevance measure and an effective dimensionality measure. The proposed DSE is both theoretically grounded and empirically validated to be closely correlated with the downstream performance. Based on this metric, we introduce a straightforward yet effective model selection strategy and a DSE-based regularization method. Experiments on sixteen SSL methods across four benchmarks confirm that model selection improves mIoU by $3.0\%$ on average with negligible computational cost. Additionally, DSE regularization consistently mitigates the effects of dense degradation. Code is available at this https URL.

62. 【2510.17287】Machine Vision-Based Surgical Lighting System:Design and Implementation

链接https://arxiv.org/abs/2510.17287

作者:Amir Gharghabi,Mahdi Hakiminezhad,Maryam Shafaei,Shaghayegh Gharghabi

类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:Effortless and ergonomically, ergonomically designed surgical, safety during procedures, ergonomically designed, critical for precision

备注

点击查看摘要

Abstract:Effortless and ergonomically designed surgical lighting is critical for precision and safety during procedures. However, traditional systems often rely on manual adjustments, leading to surgeon fatigue, neck strain, and inconsistent illumination due to drift and shadowing. To address these challenges, we propose a novel surgical lighting system that leverages the YOLOv11 object detection algorithm to identify a blue marker placed above the target surgical site. A high-power LED light source is then directed to the identified location using two servomotors equipped with tilt-pan brackets. The YOLO model achieves 96.7% mAP@50 on the validation set consisting of annotated images simulating surgical scenes with the blue spherical marker. By automating the lighting process, this machine vision-based solution reduces physical strain on surgeons, improves consistency in illumination, and supports improved surgical outcomes.

63. 【2510.17278】SG-CLDFF: A Novel Framework for Automated White Blood Cell Classification and Segmentation

链接https://arxiv.org/abs/2510.17278

作者:Mehdi Zekriyapanah Gashti,Mostafa Mohammadpour,Ghasem Farjamnia

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:white blood cells, remain challenging due, Deep Feature Fusion, Accurate segmentation, blood cells

备注

点击查看摘要

Abstract:Accurate segmentation and classification of white blood cells (WBCs) in microscopic images are essential for diagnosis and monitoring of many hematological disorders, yet remain challenging due to staining variability, complex backgrounds, and class imbalance. In this paper, we introduce a novel Saliency-Guided Cross-Layer Deep Feature Fusion framework (SG-CLDFF) that tightly integrates saliency-driven preprocessing with multi-scale deep feature aggregation to improve both robustness and interpretability for WBC analysis. SG-CLDFF first computes saliency priors to highlight candidate WBC regions and guide subsequent feature extraction. A lightweight hybrid backbone (EfficientSwin-style) produces multi-resolution representations, which are fused by a ResNeXt-CC-inspired cross-layer fusion module to preserve complementary information from shallow and deep layers. The network is trained in a multi-task setup with concurrent segmentation and cell-type classification heads, using class-aware weighted losses and saliency-alignment regularization to mitigate imbalance and suppress background activation. Interpretability is enforced through Grad-CAM visualizations and saliency consistency checks, allowing model decisions to be inspected at the regional level. We validate the framework on standard public benchmarks (BCCD, LISC, ALL-IDB), reporting consistent gains in IoU, F1, and classification accuracy compared to strong CNN and transformer baselines. An ablation study also demonstrates the individual contributions of saliency preprocessing and cross-layer fusion. SG-CLDFF offers a practical and explainable path toward more reliable automated WBC analysis in clinical workflows.

64. 【2510.17274】Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models

链接https://arxiv.org/abs/2510.17274

作者:Katie Luo,Jingwei Ji,Tong He,Runsheng Xu,Yichen Xie,Dragomir Anguelov,Mingxing Tan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Current autonomous driving, autonomous driving systems, driving systems rely, Current autonomous, demonstrate reliable performance

备注: In proceedings of IROS 2025

点击查看摘要

Abstract:Current autonomous driving systems rely on specialized models for perceiving and predicting motion, which demonstrate reliable performance in standard conditions. However, generalizing cost-effectively to diverse real-world scenarios remains a significant challenge. To address this, we propose Plug-and-Forecast (PnF), a plug-and-play approach that augments existing motion forecasting models with multimodal large language models (MLLMs). PnF builds on the insight that natural language provides a more effective way to describe and handle complex scenarios, enabling quick adaptation to targeted behaviors. We design prompts to extract structured scene understanding from MLLMs and distill this information into learnable embeddings to augment existing behavior prediction models. Our method leverages the zero-shot reasoning capabilities of MLLMs to achieve significant improvements in motion prediction performance, while requiring no fine-tuning -- making it practical to adopt. We validate our approach on two state-of-the-art motion forecasting models using the Waymo Open Motion Dataset and the nuScenes Dataset, demonstrating consistent performance improvements across both benchmarks.

65. 【2510.17269】FineVision: Open Data Is All You Need

链接https://arxiv.org/abs/2510.17269

作者:Luis Wiedmann,Orr Zohar,Amir Mahla,Xiaohan Wang,Rui Li,Thibaud Frere,Leandro von Werra,Aritra Roy Gosthipaty,Andrés Marafioti

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:contaminated public datasets, advancement of vision-language, fragmented landscape, landscape of inconsistent, inconsistent and contaminated

备注

点击查看摘要

Abstract:The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.

66. 【2510.17264】Fair and Interpretable Deepfake Detection in Videos

链接https://arxiv.org/abs/2510.17264

作者:Akihito Yoshii,Ryosuke Sonoda,Ramya Srinivasan

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Existing deepfake detection, capture temporal information, Existing deepfake, lack transparency, leading to biased

备注: 10 pages (including References)

点击查看摘要

Abstract:Existing deepfake detection methods often exhibit bias, lack transparency, and fail to capture temporal information, leading to biased decisions and unreliable results across different demographic groups. In this paper, we propose a fairness-aware deepfake detection framework that integrates temporal feature learning and demographic-aware data augmentation to enhance fairness and interpretability. Our method leverages sequence-based clustering for temporal modeling of deepfake videos and concept extraction to improve detection reliability while also facilitating interpretable decisions for non-expert users. Additionally, we introduce a demography-aware data augmentation method that balances underrepresented groups and applies frequency-domain transformations to preserve deepfake artifacts, thereby mitigating bias and improving generalization. Extensive experiments on FaceForensics++, DFD, Celeb-DF, and DFDC datasets using state-of-the-art (SoTA) architectures (Xception, ResNet) demonstrate the efficacy of the proposed method in obtaining the best tradeoff between fairness and accuracy when compared to SoTA.

67. 【2510.17247】From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models

链接https://arxiv.org/abs/2510.17247

作者:Zefan Cai,Haoyi Qiu,Haozhe Zhao,Ke Wan,Jiachen Li,Jiuxiang Gu,Wen Xiao,Nanyun Peng,Junjie Hu

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, significantly enhanced, Recent, video generation, reward models trained

备注

点击查看摘要

Abstract:Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (actions and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.

68. 【2510.17234】aming Modality Entanglement in Continual Audio-Visual Segmentation

链接https://arxiv.org/abs/2510.17234

作者:Yuyang Hong,Qi Yang,Tao Zhang,Zili Wang,Zhaojin Fu,Kun Ding,Bin Fan,Shiming Xiang

类目:Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:continual learning, significant progress, continual learning settings, preserving performance, performance on previously

备注

点击查看摘要

Abstract:Recently, significant progress has been made in multi-modal continual learning, aiming to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks, with limitations in addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where a sounding objects is labeled as background in sequential tasks; 2) co-occurrence confusion, where frequent co-occurring classes tend to be confused. In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, allowing for the increase of rehearsal sample frequency of those confusable classes during training process. Moreover, we construct three audio-visual incremental scenarios to verify effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods.

69. 【2510.17218】When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions

链接https://arxiv.org/abs/2510.17218

作者:Zhuo Cao,Heming Du,Bingqing Zhang,Xin Yu,Xue Li,Sen Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Existing Moment retrieval, focus on Single-Moment, QV-M, Single-Moment Retrieval, Moment retrieval

备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:Existing Moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality datasets called QVHighlights Multi-Moment Dataset (QV-M$^2$), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M$^2$ consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this sophisticated filtering pipeline, low-confidence proposals are pruned, and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M$^2$ and QVHighlights under both SMR and MMR settings. Results show that QV-M$^2$ serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M$^2$, it achieves improvements over prior SOTA method by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at this https URL.

70. 【2510.17205】$\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs

链接https://arxiv.org/abs/2510.17205

作者:Yingqi Fan,Anhao Zhao,Jinlan Fu,Junlong Tong,Hui Su,Yijie Pan,Wei Zhang,Xiaoyu Shen

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Multimodal Large Language, Large Language Models, Large Language, achieved strong performance, significant computational overhead

备注: EMNLP 2025 Main

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, \textit{they lack a fundamental understanding of how MLLMs process and fuse multimodal information.} Through systematic analysis, we uncover a \textbf{three-stage} cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose \emph{VisiPruner}, a training-free pruning framework that reduces up to 99\% of vision-related attention computations and 53.9\% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: this https URL.

71. 【2510.17201】Optimizing DINOv2 with Registers for Face Anti-Spoofing

链接https://arxiv.org/abs/2510.17201

作者:Mika Feng,Pierre Gallin-Martel,Koichi Ito,Takafumi Aoki

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:head pose, blur during capture, robust against variations, variations in head, Face recognition systems

备注: ICCV 2025 Workshop FAS

点击查看摘要

Abstract:Face recognition systems are designed to be robust against variations in head pose, illumination, and image blur during capture. However, malicious actors can exploit these systems by presenting a face photo of a registered user, potentially bypassing the authentication process. Such spoofing attacks must be detected prior to face recognition. In this paper, we propose a DINOv2-based spoofing attack detection method to discern minute differences between live and spoofed face images. Specifically, we employ DINOv2 with registers to extract generalizable features and to suppress perturbations in the attention mechanism, which enables focused attention on essential and minute features. We demonstrate the effectiveness of the proposed method through experiments conducted on the dataset provided by ``The 6th Face Anti-Spoofing Workshop: Unified Physical-Digital Attacks Detection@ICCV2025'' and SiW dataset.

72. 【2510.17200】EndoCIL: A Class-Incremental Learning Framework for Endoscopic Image Classification

链接https://arxiv.org/abs/2510.17200

作者:Bingrong Liu,Jun Shi,Yushan Zheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Class-incremental learning, real-world clinical applications, evolving clinical data, endoscopic image analysis, analysis is crucial

备注

点击查看摘要

Abstract:Class-incremental learning (CIL) for endoscopic image analysis is crucial for real-world clinical applications, where diagnostic models should continuously adapt to evolving clinical data while retaining performance on previously learned ones. However, existing replay-based CIL methods fail to effectively mitigate catastrophic forgetting due to severe domain discrepancies and class imbalance inherent in endoscopic imaging. To tackle these challenges, we propose EndoCIL, a novel and unified CIL framework specifically tailored for endoscopic image diagnosis. EndoCIL incorporates three key components: Maximum Mean Discrepancy Based Replay (MDBR), employing a distribution-aligned greedy strategy to select diverse and representative exemplars, Prior Regularized Class Balanced Loss (PRCBL), designed to alleviate both inter-phase and intra-phase class imbalance by integrating prior class distributions and balance weights into the loss function, and Calibration of Fully-Connected Gradients (CFG), which adjusts the classifier gradients to mitigate bias toward new classes. Extensive experiments conducted on four public endoscopic datasets demonstrate that EndoCIL generally outperforms state-of-the-art CIL methods across varying buffer sizes and evaluation metrics. The proposed framework effectively balances stability and plasticity in lifelong endoscopic diagnosis, showing promising potential for clinical scalability and deployment.

73. 【2510.17199】Round Outcome Prediction in VALORANT Using Tactical Features from Video Analysis

链接https://arxiv.org/abs/2510.17199

作者:Nirai Hayakawa,Kazumasa Shimari,Kazuma Yamasaki,Hirotatsu Hoshikawa,Rikuto Tsuchida,Kenichi Matsumoto

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:match log data, FPS game VALORANT, actively conducted, log data, data and statistical

备注: Accepted to IEEE 2025 Conference on Games

点击查看摘要

Abstract:Recently, research on predicting match outcomes in esports has been actively conducted, but much of it is based on match log data and statistical information. This research targets the FPS game VALORANT, which requires complex strategies, and aims to build a round outcome prediction model by analyzing minimap information in match footage. Specifically, based on the video recognition model TimeSformer, we attempt to improve prediction accuracy by incorporating detailed tactical features extracted from minimap information, such as character position information and other in-game events. This paper reports preliminary results showing that a model trained on a dataset augmented with such tactical event labels achieved approximately 81% prediction accuracy, especially from the middle phases of a round onward, significantly outperforming a model trained on a dataset with the minimap information itself. This suggests that leveraging tactical features from match footage is highly effective for predicting round outcomes in VALORANT.

74. 【2510.17198】From Pixels to People: Satellite-Based Mapping and Quantification of Riverbank Erosion and Lost Villages in Bangladesh

链接https://arxiv.org/abs/2510.17198

作者:M Saifuzzaman Rafat,Mohd Ruhul Ameen,Akif Islam,Abu Saleh Musa Miah,Jungpil Shin

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:arteries of commerce, commerce and sustenance, relentless destruction, agents of relentless, Mokterer Char Union

备注: Submitted to the International Conference on Data and Applied Analytics (IDAA 2025). 15 pages, 5 figures, 4 tables

点击查看摘要

Abstract:The great rivers of Bangladesh, arteries of commerce and sustenance, are also agents of relentless destruction. Each year, they swallow whole villages and vast tracts of farmland, erasing communities from the map and displacing thousands of families. To track this slow-motion catastrophe has, until now, been a Herculean task for human analysts. Here we show how a powerful general-purpose vision model, the Segment Anything Model (SAM), can be adapted to this task with remarkable precision. To do this, we assembled a new dataset - a digital chronicle of loss compiled from historical Google Earth imagery of Bangladesh's most vulnerable regions, including Mokterer Char Union, Kedarpur Union, Balchipara village, and Chowhali Upazila, from 2003 to 2025. Crucially, this dataset is the first to include manually annotated data on the settlements that have vanished beneath the water. Our method first uses a simple color-channel analysis to provide a rough segmentation of land and water, and then fine-tunes SAM's mask decoder to recognize the subtle signatures of riverbank erosion. The resulting model demonstrates a keen eye for this destructive process, achieving a mean Intersection over Union of 86.30% and a Dice score of 92.60% - a performance that significantly surpasses traditional methods and off-the-shelf deep learning models. This work delivers three key contributions: the first annotated dataset of disappeared settlements in Bangladesh due to river erosion; a specialized AI model fine-tuned for this critical task; and a method for quantifying land loss with compelling visual evidence. Together, these tools provide a powerful new lens through which policymakers and disaster management agencies can monitor erosion, anticipate its trajectory, and ultimately protect the vulnerable communities in its path.

75. 【2510.17197】ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models

链接https://arxiv.org/abs/2510.17197

作者:Pu Zhang,Yuwei Li,Xingyuan Xian,Guoming Tang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:increasingly large inputs, process increasingly large, unlike in LLMs, large inputs, generates significant visual

备注

点击查看摘要

Abstract:As the capabilities of Vision-Language Models (VLMs) advance, they can process increasingly large inputs, which, unlike in LLMs, generates significant visual token redundancy and leads to prohibitive inference costs. While many methods aim to reduce these costs by pruning visual tokens, existing approaches, whether based on attention or diversity, typically neglect the guidance of the text prompt and thus fail to prioritize task relevance. In this work, we propose a novel, zero-shot method that reframes the problem by introducing a prompt-aware perspective, explicitly modeling visual token pruning as a balance between task relevance and information diversity. Our hierarchical approach first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context. Experiments across multiple models and benchmarks show that our method achieves performance that matches or surpasses the state-of-the-art with only minimal accuracy loss, even when pruning up to 90\% of the tokens. Furthermore, these gains are accompanied by significant reductions in GPU memory footprint and inference latency.

76. 【2510.17188】HIDISC: A Hyperbolic Framework for Domain Generalization with Generalized Category Discovery

链接https://arxiv.org/abs/2510.17188

作者:Vaibhav Rathore,Divyam Gupta,Biplab Banerjee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Generalized Category Discovery, Generalized Category, Category Discovery, classify test-time samples, aims to classify

备注: Accpeted at NeurIPS (2025) Main Conference

点击查看摘要

Abstract:Generalized Category Discovery (GCD) aims to classify test-time samples into either seen categories** -- available during training -- or novel ones, without relying on label supervision. Most existing GCD methods assume simultaneous access to labeled and unlabeled data during training and arising from the same domain, limiting applicability in open-world scenarios involving distribution shifts. Domain Generalization with GCD (DG-GCD) lifts this constraint by requiring models to generalize to unseen domains containing novel categories, without accessing targetdomain data during training. The only prior DG-GCD method, DG2CD-Net, relies on episodic training with multiple synthetic domains and task vector aggregation, incurring high computational cost and error accumulation. We propose HIDISC, a hyperbolic representation learning framework that achieves domain and category-level generalization without episodic simulation. To expose the model to minimal but diverse domain variations, we augment the source domain using GPT-guided diffusion, avoiding overfitting while maintaining efficiency. To structure the representation space, we introduce Tangent CutMix, a curvature-aware interpolation that synthesizes pseudo-novel samples in tangent space, preserving manifold consistency. A unified loss -- combining penalized Busemann alignment, hybrid hyperbolic contrastive regularization, and adaptive outlier repulsion -- **facilitates compact, semantically structured embeddings. A learnable curvature parameter further adapts the geometry to dataset complexity. HIDISC achieves state-of-the-art results on PACS , Office-Home , and DomainNet, consistently outperforming the existing Euclidean and hyperbolic (DG)-GCD baselines.

77. 【2510.17181】Capturing Head Avatar with Hand Contacts from a Monocular Video

链接https://arxiv.org/abs/2510.17181

作者:Haonan He,Yufeng Zheng,Jie Song

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:vital for telepresence, Photorealistic, head avatars, ignoring natural hand-face, hand-face interactions

备注: ICCV 2025

点击查看摘要

Abstract:Photorealistic 3D head avatars are vital for telepresence, gaming, and VR. However, most methods focus solely on facial regions, ignoring natural hand-face interactions, such as a hand resting on the chin or fingers gently touching the cheek, which convey cognitive states like pondering. In this work, we present a novel framework that jointly learns detailed head avatars and the non-rigid deformations induced by hand-face interactions. There are two principal challenges in this task. First, naively tracking hand and face separately fails to capture their relative poses. To overcome this, we propose to combine depth order loss with contact regularization during pose tracking, ensuring correct spatial relationships between the face and hand. Second, no publicly available priors exist for hand-induced deformations, making them non-trivial to learn from monocular videos. To address this, we learn a PCA basis specific to hand-induced facial deformations from a face-hand interaction dataset. This reduces the problem to estimating a compact set of PCA parameters rather than a full spatial deformation field. Furthermore, inspired by physics-based simulation, we incorporate a contact loss that provides additional supervision, significantly reducing interpenetration artifacts and enhancing the physical plausibility of the results. We evaluate our approach on RGB(D) videos captured by an iPhone. Additionally, to better evaluate the reconstructed geometry, we construct a synthetic dataset of avatars with various types of hand interactions. We show that our method can capture better appearance and more accurate deforming geometry of the face than SOTA surface reconstruction methods.

Comments:
ICCV 2025

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2510.17181 [cs.CV]

(or
arXiv:2510.17181v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.17181

Focus to learn more

              arXiv-issued DOI via DataCite</p>
78. 【2510.17179】Benchmarking Out-of-Distribution Detection for Plankton Recognition: A Systematic Evaluation of Advanced Methods in Marine Ecological Monitoring

链接https://arxiv.org/abs/2510.17179

作者:Yingzi Han,Jiakai He,Chuanlong Xie,Jianping Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:models face significant, face significant challenges, real-world deployment due, recognition models face, Automated plankton recognition

备注

点击查看摘要

Abstract:Automated plankton recognition models face significant challenges during real-world deployment due to distribution shifts (Out-of-Distribution, OoD) between training and test data. This stems from plankton's complex morphologies, vast species diversity, and the continuous discovery of novel species, which leads to unpredictable errors during inference. Despite rapid advancements in OoD detection methods in recent years, the field of plankton recognition still lacks a systematic integration of the latest computer vision developments and a unified benchmark for large-scale evaluation. To address this, this paper meticulously designed a series of OoD benchmarks simulating various distribution shift scenarios based on the DYB-PlanktonNet dataset \cite{875n-f104-21}, and systematically evaluated twenty-two OoD detection methods. Extensive experimental results demonstrate that the ViM \cite{wang2022vim} method significantly outperforms other approaches in our constructed benchmarks, particularly excelling in Far-OoD scenarios with substantial improvements in key metrics. This comprehensive evaluation not only provides a reliable reference for algorithm selection in automated plankton recognition but also lays a solid foundation for future research in plankton OoD detection. To our knowledge, this study marks the first large-scale, systematic evaluation and analysis of Out-of-Distribution data detection methods in plankton recognition. Code is available at this https URL.

79. 【2510.17171】Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling

链接https://arxiv.org/abs/2510.17171

作者:Feihong Yan,Peiru Wang,Yao Zhu,Kaiyu Pang,Qingyan Wei,Huiqi Li,Linfeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Masked Autoregressive, spatially correlated visual, potential remains constrained, correlated visual tokens, acceleration potential remains

备注: 12 pages, 6 figures

点击查看摘要

Abstract:Masked Autoregressive (MAR) models promise better efficiency in visual generation than autoregressive (AR) models for the ability of parallel generation, yet their acceleration potential remains constrained by the modeling complexity of spatially correlated visual tokens in a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation establishing global semantic scaffolding, followed by detail reconstruction efficiently completing remaining tokens. Assuming that it is more difficult to create an image from scratch than to complement images based on a basic image framework, GtR is designed to achieve acceleration by computing the reconstruction stage quickly while maintaining the generation quality by computing the generation stage slowly. Moreover, observing that tokens on the details of an image often carry more semantic information than tokens in the salient regions, we further propose Frequency-Weighted Token Selection (FTS) to offer more computation budget to tokens on image details, which are localized based on the energy of high frequency information. Extensive experiments on ImageNet class-conditional and text-to-image generation demonstrate 3.72x speedup on MAR-H while maintaining comparable quality (e.g., FID: 1.59, IS: 304.4 vs. original 1.59, 299.1), substantially outperforming existing acceleration methods across various model scales and generation tasks. Our codes will be released in this https URL.

80. 【2510.17169】Investigating Adversarial Robustness against Preprocessing used in Blackbox Face Recognition

链接https://arxiv.org/abs/2510.17169

作者:Roland Croft,Brian Du,Darcy Joseph,Sharath Kumar

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:exposing blind spots, protecting user privacy, subtly alter benign, benign facial images, alter benign facial

备注: Accepted for publication in DICTA 2025

点击查看摘要

Abstract:Face Recognition (FR) models have been shown to be vulnerable to adversarial examples that subtly alter benign facial images, exposing blind spots in these systems, as well as protecting user privacy. End-to-end FR systems first obtain preprocessed faces from diverse facial imagery prior to computing the similarity of the deep feature embeddings. Whilst face preprocessing is a critical component of FR systems, and hence adversarial attacks against them, we observe that this preprocessing is often overlooked in blackbox settings. Our study seeks to investigate the transferability of several out-of-the-box state-of-the-art adversarial attacks against FR when applied against different preprocessing techniques used in a blackbox setting. We observe that the choice of face detection model can degrade the attack success rate by up to 78%, whereas choice of interpolation method during downsampling has relatively minimal impacts. Furthermore, we find that the requirement for facial preprocessing even degrades attack strength in a whitebox setting, due to the unintended interaction of produced noise vectors against face detection models. Based on these findings, we propose a preprocessing-invariant method using input transformations that improves the transferability of the studied attacks by up to 27%. Our findings highlight the importance of preprocessing in FR systems, and the need for its consideration towards improving the adversarial generalisation of facial adversarial examples.

81. 【2510.17157】GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image

链接https://arxiv.org/abs/2510.17157

作者:Yinghui Wang,Xinyu Zhang,Peng Du

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:industrial concept design, holds great potential, single image holds, image holds great, Generating editable

备注

点击查看摘要

Abstract:Generating editable, parametric CAD models from a single image holds great potential to lower the barriers of industrial concept design. However, current multi-modal large language models (MLLMs) still struggle with accurately inferring 3D geometry from 2D images due to limited spatial reasoning capabilities. We address this limitation by introducing GACO-CAD, a novel two-stage post-training framework. It is designed to achieve a joint objective: simultaneously improving the geometric accuracy of the generated CAD models and encouraging the use of more concise modeling procedures. First, during supervised fine-tuning, we leverage depth and surface normal maps as dense geometric priors, combining them with the RGB image to form a multi-channel input. In the context of single-view reconstruction, these priors provide complementary spatial cues that help the MLLM more reliably recover 3D geometry from 2D observations. Second, during reinforcement learning, we introduce a group length reward that, while preserving high geometric fidelity, promotes the generation of more compact and less redundant parametric modeling sequences. A simple dynamic weighting strategy is adopted to stabilize training. Experiments on the DeepCAD and Fusion360 datasets show that GACO-CAD achieves state-of-the-art performance under the same MLLM backbone, consistently outperforming existing methods in terms of code validity, geometric accuracy, and modeling conciseness.

82. 【2510.17148】DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment

链接https://arxiv.org/abs/2510.17148

作者:Yu Gao,Yiru Wang,Anqing Jiang,Heng Yuwen,Wang Shuo,Sun Hao,Wang Jijun

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:long-tail scenarios due, essential world knowledge, physically plausible trajectories, surrounding environments, world knowledge

备注

点击查看摘要

Abstract:Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.

83. 【2510.17137】KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation

链接https://arxiv.org/abs/2510.17137

作者:WenBo Xu,Liu Liu,Li Zhang,Ran Zhang,Hao Wu,Dan Guo,Meng Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:introduce structural diversity, exhibit significant challenges, Articulated Object Shape, pose estimation due, Object Shape Reconstruction

备注

点击查看摘要

Abstract:Articulated objects, such as laptops and drawers, exhibit significant challenges for 3D reconstruction and pose estimation due to their multi-part geometries and variable joint configurations, which introduce structural diversity across different states. To address these challenges, we propose KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation, a unified framework for reconstructing diverse articulated instances and pose estimation from single view input. Specifically, we first encode complete geometry (SDFs), joint angles, and part segmentation into a structured latent space via a novel Kinematic-Aware VAE (KA-VAE). In addition, we employ two conditional diffusion models: one for regressing global pose (SE(3)) and joint parameters, and another for generating the kinematic-aware latent code from partial observations. Finally, we produce an iterative optimization module that bidirectionally refines reconstruction accuracy and kinematic parameters via Chamfer-distance minimization while preserving articulation constraints. Experimental results on synthetic, semi-synthetic, and real-world datasets demonstrate the effectiveness of our approach in accurately reconstructing articulated objects and estimating their kinematic properties.

84. 【2510.17131】GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection

链接https://arxiv.org/abs/2510.17131

作者:Xin Gao,Jiyao Liu,Guanghao Li,Yueming Lyu,Jianxiong Gao,Weichen Yu,Ningsheng Xu,Liang Wang,Caifeng Shan,Ziwei Liu,Chenyang Si

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Recent advancements, advancements have explored, models for synthesizing, OOD, Recent

备注: 28 pages, 16 figures, conference

点击查看摘要

Abstract:Recent advancements have explored text-to-image diffusion models for synthesizing out-of-distribution (OOD) samples, substantially enhancing the performance of OOD detection. However, existing approaches typically rely on perturbing text-conditioned embeddings, resulting in semantic instability and insufficient shift diversity, which limit generalization to realistic OOD. To address these challenges, we propose GOOD, a novel and flexible framework that directly guides diffusion sampling trajectories towards OOD regions using off-the-shelf in-distribution (ID) classifiers. GOOD incorporates dual-level guidance: (1) Image-level guidance based on the gradient of log partition to reduce input likelihood, drives samples toward low-density regions in pixel space. (2) Feature-level guidance, derived from k-NN distance in the classifier's latent space, promotes sampling in feature-sparse regions. Hence, this dual-guidance design enables more controllable and diverse OOD sample generation. Additionally, we introduce a unified OOD score that adaptively combines image and feature discrepancies, enhancing detection robustness. We perform thorough quantitative and qualitative analyses to evaluate the effectiveness of GOOD, demonstrating that training with samples generated by GOOD can notably enhance OOD detection performance.

85. 【2510.17120】Matricial Free Energy as a Gaussianizing Regularizer: Enhancing Autoencoders for Gaussian Code Generation

链接https://arxiv.org/abs/2510.17120

作者:Rishi Sonthalia,Raj Rao Nadakuditi

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

关键词:regularization scheme, matricial free energy, free energy, free, matricial free

备注

点击查看摘要

Abstract:We introduce a novel regularization scheme for autoencoders based on matricial free energy. Our approach defines a differentiable loss function in terms of the singular values of the code matrix (code dimension x batch size). From the standpoint of free probability an d random matrix theory, this loss achieves its minimum when the singular value distribution of the code matrix coincides with that of an appropriately sculpted random metric with i.i.d. Gaussian entries. Empirical simulations demonstrate that minimizing the negative matricial free energy through standard stochastic gradient-based training yields Gaussian-like codes that generalize across training and test sets. Building on this foundation, we propose a matricidal free energy maximizing autoencoder that reliably produces Gaussian codes and show its application to underdetermined inverse problems.

86. 【2510.17114】owards Imperceptible Watermarking Via Environment Illumination for Consumer Cameras

链接https://arxiv.org/abs/2510.17114

作者:Hodaka Kawachi,Tomoya Nakamura,Hiroaki Santo,SaiKiran Kumar Tedla,Trevor Dalton Canham,Yasushi Yagi,Michael S. Brown

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:LED-based environmental lighting, produce visually imperceptible, visually imperceptible watermarks, paper introduces, LED-based environmental

备注

点击查看摘要

Abstract:This paper introduces a method for using LED-based environmental lighting to produce visually imperceptible watermarks for consumer cameras. Our approach optimizes an LED light source's spectral profile to be minimally visible to the human eye while remaining highly detectable by typical consumer cameras. The method jointly considers the human visual system's sensitivity to visible spectra, modern consumer camera sensors' spectral sensitivity, and narrowband LEDs' ability to generate broadband spectra perceived as "white light" (specifically, D65 illumination). To ensure imperceptibility, we employ spectral modulation rather than intensity modulation. Unlike conventional visible light communication, our approach enables watermark extraction at standard low frame rates (30-60 fps). While the information transfer rate is modest-embedding 128 bits within a 10-second video clip-this capacity is sufficient for essential metadata supporting privacy protection and content verification.

87. 【2510.17105】Boosting Fidelity for Pre-Trained-Diffusion-Based Low-Light Image Enhancement via Condition Refinement

链接https://arxiv.org/abs/2510.17105

作者:Xiaogang Xu,Jian Wang,Yunfan Lu,Ruihang Chu,Ruixing Wang,Jiafei Wu,Bei Yu,Liang Lin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:low-level vision tasks, leveraging pre-trained large, achieved remarkable performance, vision tasks, Stable Diffusion

备注

点击查看摘要

Abstract:Diffusion-based methods, leveraging pre-trained large models like Stable Diffusion via ControlNet, have achieved remarkable performance in several low-level vision tasks. However, Pre-Trained Diffusion-Based (PTDB) methods often sacrifice content fidelity to attain higher perceptual realism. This issue is exacerbated in low-light scenarios, where severely degraded information caused by the darkness limits effective control. We identify two primary causes of fidelity loss: the absence of suitable conditional latent modeling and the lack of bidirectional interaction between the conditional latent and noisy latent in the diffusion process. To address this, we propose a novel optimization strategy for conditioning in pre-trained diffusion models, enhancing fidelity while preserving realism and aesthetics. Our method introduces a mechanism to recover spatial details lost during VAE encoding, i.e., a latent refinement pipeline incorporating generative priors. Additionally, the refined latent condition interacts dynamically with the noisy latent, leading to improved restoration performance. Our approach is plug-and-play, seamlessly integrating into existing diffusion networks to provide more effective control. Extensive experiments demonstrate significant fidelity improvements in PTDB methods.

88. 【2510.17101】Shape-aware Inertial Poser: Motion Tracking for Humans with Diverse Shapes Using Sparse Inertial Sensors

链接https://arxiv.org/abs/2510.17101

作者:Lu Yin,Ziying Shi,Yinghao Wu,Xinyu Yi,Feng Xu,Shihui Guo

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:significant attention recently, gained significant attention, Human motion capture, Human motion, body

备注: Accepted by SIGGRAPH Asia 2025 (TOG)

点击查看摘要

Abstract:Human motion capture with sparse inertial sensors has gained significant attention recently. However, existing methods almost exclusively rely on a template adult body shape to model the training data, which poses challenges when generalizing to individuals with largely different body shapes (such as a child). This is primarily due to the variation in IMU-measured acceleration caused by changes in body shape. To fill this gap, we propose Shape-aware Inertial Poser (SAIP), the first solution considering body shape differences in sparse inertial-based motion capture. Specifically, we decompose the sensor measurements related to shape and pose in order to effectively model their joint correlations. Firstly, we train a regression model to transfer the IMU-measured accelerations of a real body to match the template adult body model, compensating for the shape-related sensor measurements. Then, we can easily follow the state-of-the-art methods to estimate the full body motions of the template-shaped body. Finally, we utilize a second regression model to map the joint velocities back to the real body, combined with a shape-aware physical optimization strategy to calculate global motions on the subject. Furthermore, our method relies on body shape awareness, introducing the first inertial shape estimation scheme. This is accomplished by modeling the shape-conditioned IMU-pose correlation using an MLP-based network. To validate the effectiveness of SAIP, we also present the first IMU motion capture dataset containing individuals of different body sizes. This dataset features 10 children and 10 adults, with heights ranging from 110 cm to 190 cm, and a total of 400 minutes of paired IMU-Motion samples. Extensive experimental results demonstrate that SAIP can effectively handle motion capture tasks for diverse body shapes. The code and dataset are available at this https URL.

89. 【2510.17095】GSPlane: Concise and Accurate Planar Reconstruction via Structured Representation

链接https://arxiv.org/abs/2510.17095

作者:Ruitong Gan,Junran Peng,Yang Liu,Chuanchen Luo,Qing Li,Zhaoxiang Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:urban streets, fundamental primitives, man-made environments, indoor spaces, spaces and urban

备注

点击查看摘要

Abstract:Planes are fundamental primitives of 3D sences, especially in man-made environments such as indoor spaces and urban streets. Representing these planes in a structured and parameterized format facilitates scene editing and physical simulations in downstream applications. Recently, Gaussian Splatting (GS) has demonstrated remarkable effectiveness in the Novel View Synthesis task, with extensions showing great potential in accurate surface reconstruction. However, even state-of-the-art GS representations often struggle to reconstruct planar regions with sufficient smoothness and precision. To address this issue, we propose GSPlane, which recovers accurate geometry and produces clean and well-structured mesh connectivity for plane regions in the reconstructed scene. By leveraging off-the-shelf segmentation and normal prediction models, GSPlane extracts robust planar priors to establish structured representations for planar Gaussian coordinates, which help guide the training process by enforcing geometric consistency. To further enhance training robustness, a Dynamic Gaussian Re-classifier is introduced to adaptively reclassify planar Gaussians with persistently high gradients as non-planar, ensuring more reliable optimization. Furthermore, we utilize the optimized planar priors to refine the mesh layouts, significantly improving topological structure while reducing the number of vertices and faces. We also explore applications of the structured planar representation, which enable decoupling and flexible manipulation of objects on supportive planes. Extensive experiments demonstrate that, with no sacrifice in rendering quality, the introduction of planar priors significantly improves the geometric accuracy of the extracted meshes across various baselines.

90. 【2510.17078】owards a Generalizable Fusion Architecture for Multimodal Object Detection

链接https://arxiv.org/abs/2510.17078

作者:Jad Berjawi,Yoann Dupas,Christophe C'erin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multiple sensor modalities, leveraging complementary cues, Modal Cross Attention, introduce Filtered Multi, robustness in chal

备注: 8 pages, 8 figures, accepted at ICCV 2025 MIRA Workshop

点击查看摘要

Abstract:Multimodal object detection improves robustness in chal- lenging conditions by leveraging complementary cues from multiple sensor modalities. We introduce Filtered Multi- Modal Cross Attention Fusion (FMCAF), a preprocess- ing architecture designed to enhance the fusion of RGB and infrared (IR) inputs. FMCAF combines a frequency- domain filtering block (Freq-Filter) to suppress redun- dant spectral features with a cross-attention-based fusion module (MCAF) to improve intermodal feature sharing. Unlike approaches tailored to specific datasets, FMCAF aims for generalizability, improving performance across different multimodal challenges without requiring dataset- specific tuning. On LLVIP (low-light pedestrian detec- tion) and VEDAI (aerial vehicle detection), FMCAF outper- forms traditional fusion (concatenation), achieving +13.9% mAP@50 on VEDAI and +1.1% on LLVIP. These results support the potential of FMCAF as a flexible foundation for robust multimodal fusion in future detection pipelines.

91. 【2510.17068】ProDAT: Progressive Density-Aware Tail-Drop for Point Cloud Coding

链接https://arxiv.org/abs/2510.17068

作者:Zhe Luo,Wenjing Jia,Stuart Perry

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:demanding real-time processing, augmented reality, autonomous driving, immersive communication, demanding real-time

备注

点击查看摘要

Abstract:Three-dimensional (3D) point clouds are becoming increasingly vital in applications such as autonomous driving, augmented reality, and immersive communication, demanding real-time processing and low latency. However, their large data volumes and bandwidth constraints hinder the deployment of high-quality services in resource-limited environments. Progres- sive coding, which allows for decoding at varying levels of detail, provides an alternative by allowing initial partial decoding with subsequent refinement. Although recent learning-based point cloud geometry coding methods have achieved notable success, their fixed latent representation does not support progressive decoding. To bridge this gap, we propose ProDAT, a novel density-aware tail-drop mechanism for progressive point cloud coding. By leveraging density information as a guidance signal, latent features and coordinates are decoded adaptively based on their significance, therefore achieving progressive decoding at multiple bitrates using one single model. Experimental results on benchmark datasets show that the proposed ProDAT not only enables progressive coding but also achieves superior coding efficiency compared to state-of-the-art learning-based coding techniques, with over 28.6% BD-rate improvement for PSNR- D2 on SemanticKITTI and over 18.15% for ShapeNet

92. 【2510.17051】How Universal Are SAM2 Features?

链接https://arxiv.org/abs/2510.17051

作者:Masoud Khairi Atani,Alon Harell,Hyomin Choi,Runyu Yang,Fabien Racape,Ivan V. Bajic

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:foundation vision models, fully understood, counterparts is critical, vision models, general-purpose foundation vision

备注: This work has been accepted for publication in IEEE Picture Coding Symposium (PCS) 2025

点击查看摘要

Abstract:The trade-off between general-purpose foundation vision models and their specialized counterparts is critical for efficient feature coding design and is not yet fully understood. We investigate this trade-off by comparing the feature versatility of the general-purpose Hiera encoder against the segmentation-specialized Segment Anything Model 2 (SAM2). Using a lightweight, trainable neck to probe the adaptability of their frozen features, we quantify the information-theoretic cost of specialization. Our results reveal that while SAM2's specialization is highly effective for spatially-related tasks like depth estimation, it comes at a cost. The specialized SAM2 encoder underperforms its generalist predecessor, Hiera, on conceptually distant tasks such as pose estimation and image captioning, demonstrating a measurable loss of broader semantic information. A novel cross-neck analysis on SAM2 reveals that each level of adaptation creates a further representational bottleneck. Our analysis illuminates these trade-offs in feature universality, providing a quantitative foundation for designing efficient feature coding and adaptation strategies for diverse downstream applications.

93. 【2510.17045】Video Reasoning without Training

链接https://arxiv.org/abs/2510.17045

作者:Deepak Sridhar,Kartikeya Bhardwaj,Jeya Pradha Jeyaraj,Nuno Vasconcelos,Ankita Nayak,Harris Teague

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Multimodal Models, Large Multimodal, costly reinforcement learning, substantial computational overhead, Multimodal Models

备注

点击查看摘要

Abstract:Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, using entropy of the model's output as a signal, we discover that the high-quality models go through a series of micro-explorations and micro-exploitations which keep the reasoning process grounded (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We further observe that once this "thinking" process is over, more accurate models demonstrate a better convergence by reducing the entropy significantly via a final exploitation phase (i.e., a more certain convergence towards a solution trajectory). We then use these novel, theoretically-grounded insights to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Specifically, during inference, our proposed approach called V-Reason (Video-Reason) adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model's micro-exploration and exploitation behavior during inference. Our experiments show that our proposed method achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering massive efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.

94. 【2510.17043】Person Re-Identification via Generalized Class Prototypes

链接https://arxiv.org/abs/2510.17043

作者:Md Ahmed Al Muzaddid,William J. Beksi

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:Advanced feature extraction, Advanced feature, feature extraction methods, feature extraction, significantly contributed

备注: 18 pages, 11 figures, and 4 tables

点击查看摘要

Abstract:Advanced feature extraction methods have significantly contributed to enhancing the task of person re-identification. In addition, modifications to objective functions have been developed to further improve performance. Nonetheless, selecting better class representatives is an underexplored area of research that can also lead to advancements in re-identification performance. Although past works have experimented with using the centroid of a gallery image class during training, only a few have investigated alternative representations during the retrieval stage. In this paper, we demonstrate that these prior techniques yield suboptimal results in terms of re-identification metrics. To address the re-identification problem, we propose a generalized selection method that involves choosing representations that are not limited to class centroids. Our approach strikes a balance between accuracy and mean average precision, leading to improvements beyond the state of the art. For example, the actual number of representations per class can be adjusted to meet specific application requirements. We apply our methodology on top of multiple re-identification embeddings, and in all cases it substantially improves upon contemporary results

95. 【2510.17039】Click, Predict, Trust: Clinician-in-the-Loop AI Segmentation for Lung Cancer CT-Based Prognosis within the Knowledge-to-Action Framework

链接https://arxiv.org/abs/2510.17039

作者:Mohammad R. Salmanpour,Sonya Falahati,Amir Hossein Pouria,Amin Mousavi,Somayeh Sadat Mehrnia,Morteza Alizadeh,Arman Gorji,Zeinab Farsangi,Alireza Safarian,Mehdi Maghsudi,Carlos Uribe,Arman Rahmim,Ren Yuan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:central to screening, remains the leading, imaging central, Lung cancer remains, cancer mortality

备注: 13 pages, 2 figures, and 2 tables

点击查看摘要

Abstract:Lung cancer remains the leading cause of cancer mortality, with CT imaging central to screening, prognosis, and treatment. Manual segmentation is variable and time-intensive, while deep learning (DL) offers automation but faces barriers to clinical adoption. Guided by the Knowledge-to-Action framework, this study develops a clinician-in-the-loop DL pipeline to enhance reproducibility, prognostic accuracy, and clinical trust. Multi-center CT data from 999 patients across 12 public datasets were analyzed using five DL models (3D Attention U-Net, ResUNet, VNet, ReconNet, SAM-Med3D), benchmarked against expert contours on whole and click-point cropped images. Segmentation reproducibility was assessed using 497 PySERA-extracted radiomic features via Spearman correlation, ICC, Wilcoxon tests, and MANOVA, while prognostic modeling compared supervised (SL) and semi-supervised learning (SSL) across 38 dimensionality reduction strategies and 24 classifiers. Six physicians qualitatively evaluated masks across seven domains, including clinical meaningfulness, boundary quality, prognostic value, trust, and workflow integration. VNet achieved the best performance (Dice = 0.83, IoU = 0.71), radiomic stability (mean correlation = 0.76, ICC = 0.65), and predictive accuracy under SSL (accuracy = 0.88, F1 = 0.83). SSL consistently outperformed SL across models. Radiologists favored VNet for peritumoral representation and smoother boundaries, preferring AI-generated initial masks for refinement rather than replacement. These results demonstrate that integrating VNet with SSL yields accurate, reproducible, and clinically trusted CT-based lung cancer prognosis, highlighting a feasible path toward physician-centered AI translation.

96. 【2510.17038】DINO-CVA: A Multimodal Goal-Conditioned Vision-to-Action Model for Autonomous Catheter Navigation

链接https://arxiv.org/abs/2510.17038

作者:Pedram Fekri,Majid Roshanfar,Samuel Barbeau,Seyedfarzad Famouri,Thomas Looi,Dale Podolsky,Mehrdad Zadeh,Javad Dargahi

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Cardiac catheterization remains, minimally invasive interventions, Cardiac catheterization, invasive interventions, manual operation

备注

点击查看摘要

Abstract:Cardiac catheterization remains a cornerstone of minimally invasive interventions, yet it continues to rely heavily on manual operation. Despite advances in robotic platforms, existing systems are predominantly follow-leader in nature, requiring continuous physician input and lacking intelligent autonomy. This dependency contributes to operator fatigue, more radiation exposure, and variability in procedural outcomes. This work moves towards autonomous catheter navigation by introducing DINO-CVA, a multimodal goal-conditioned behavior cloning framework. The proposed model fuses visual observations and joystick kinematics into a joint embedding space, enabling policies that are both vision-aware and kinematic-aware. Actions are predicted autoregressively from expert demonstrations, with goal conditioning guiding navigation toward specified destinations. A robotic experimental setup with a synthetic vascular phantom was designed to collect multimodal datasets and evaluate performance. Results show that DINO-CVA achieves high accuracy in predicting actions, matching the performance of a kinematics-only baseline while additionally grounding predictions in the anatomical environment. These findings establish the feasibility of multimodal, goal-conditioned architectures for catheter navigation, representing an important step toward reducing operator dependency and improving the reliability of catheterbased therapies.

97. 【2510.17035】Conditional Synthetic Live and Spoof Fingerprint Generation

链接https://arxiv.org/abs/2510.17035

作者:Syed Konain Abbas,Sandip Purnapatra,M. G. Sarwar Murshed,Conor Miller-Lynch,Lambert Igene,Soumyabrata Dey,Stephanie Schuckers,Faraz Hussain

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large fingerprint datasets, strict privacy measures, require strict privacy, Large fingerprint, training and evaluation

备注

点击查看摘要

Abstract:Large fingerprint datasets, while important for training and evaluation, are time-consuming and expensive to collect and require strict privacy measures. Researchers are exploring the use of synthetic fingerprint data to address these issues. This paper presents a novel approach for generating synthetic fingerprint images (both spoof and live), addressing concerns related to privacy, cost, and accessibility in biometric data collection. Our approach utilizes conditional StyleGAN2-ADA and StyleGAN3 architectures to produce high-resolution synthetic live fingerprints, conditioned on specific finger identities (thumb through little finger). Additionally, we employ CycleGANs to translate these into realistic spoof fingerprints, simulating a variety of presentation attack materials (e.g., EcoFlex, Play-Doh). These synthetic spoof fingerprints are crucial for developing robust spoof detection systems. Through these generative models, we created two synthetic datasets (DB2 and DB3), each containing 1,500 fingerprint images of all ten fingers with multiple impressions per finger, and including corresponding spoofs in eight material types. The results indicate robust performance: our StyleGAN3 model achieves a Fréchet Inception Distance (FID) as low as 5, and the generated fingerprints achieve a True Accept Rate of 99.47% at a 0.01% False Accept Rate. The StyleGAN2-ADA model achieved a TAR of 98.67% at the same 0.01% FAR. We assess fingerprint quality using standard metrics (NFIQ2, MINDTCT), and notably, matching experiments confirm strong privacy preservation, with no significant evidence of identity leakage, confirming the strong privacy-preserving properties of our synthetic datasets.

98. 【2510.17034】Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding

链接https://arxiv.org/abs/2510.17034

作者:Yutong Zhong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:garnered considerable interest, advancing spatial reasoning, complex environments, garnered considerable, considerable interest

备注

点击查看摘要

Abstract:Multimodal 3D grounding has garnered considerable interest in Vision-Language Models (VLMs) \cite{yin2025spatial} for advancing spatial reasoning in complex environments. However, these models suffer from a severe "2D semantic bias" that arises from over-reliance on 2D image features for coarse localization, largely disregarding 3D geometric inputs and resulting in suboptimal fusion performance. In this paper, we propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue via disentangled representation learning and targeted shortcut suppression. Our approach fundamentally reshapes the model's internal space by designating 2D features as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization, enabling precise 3D grounding without modifying inference architecture. Key components include a dual-objective loss function with an Alignment Loss that supervises fused predictions using adapted cross-entropy for multimodal synergy, and a Pseudo-Label Loss that penalizes overly effective 2D-dominant pseudo-outputs via a margin-based mechanism. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness, particularly in cluttered outdoor scenes.

99. 【2510.17023】Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

链接https://arxiv.org/abs/2510.17023

作者:Shraman Pramanick,Effrosyni Mavroudi,Yale Song,Rama Chellappa,Lorenzo Torresani,Triantafyllos Afouras

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:utilizing multi-modal large, introduce ED-VTG, multi-modal large language, grounding utilizing multi-modal, utilizing multi-modal

备注: ICCV 2025 (Highlights)

点击查看摘要

Abstract:We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video, in order to effectively localize natural language queries in videos through a two-stage process. Rather than being directly grounded, language queries are initially transformed into enriched sentences that incorporate missing details and cues to aid in grounding. In the second stage, these enriched queries are grounded, using a lightweight decoder, which specializes at predicting accurate boundaries conditioned on contextualized representations of the enriched queries. To mitigate noise and reduce the impact of hallucinations, our model is trained with a multiple-instance-learning objective that dynamically selects the optimal version of the query for each training sample. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings. Experiments reveal that our method significantly outperforms all previously proposed LLM-based temporal grounding approaches and is either superior or comparable to specialized models, while maintaining a clear advantage against them in zero-shot evaluation scenarios.

100. 【2510.17014】Do Satellite Tasks Need Special Pretraining?

链接https://arxiv.org/abs/2510.17014

作者:Ani Vanyan,Alvard Barseghyan,Hakob Tamazyan,Tigran Galstyan,Vahan Huroyan,Naira Hovakimyan,Hrant Khachatrian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:advanced machine learning, Foundation models, advanced machine, machine learning, remote sensing

备注

点击查看摘要

Abstract:Foundation models have advanced machine learning across various modalities, including images. Recently multiple teams trained foundation models specialized for remote sensing applications. This line of research is motivated by the distinct characteristics of remote sensing imagery, specific applications and types of robustness useful for satellite image analysis. In this work we systematically challenge the idea that specific foundation models are more useful than general-purpose vision foundation models, at least in the small scale. First, we design a simple benchmark that measures generalization of remote sensing models towards images with lower resolution for two downstream tasks. Second, we train iBOT, a self-supervised vision encoder, on MillionAID, an ImageNet-scale satellite imagery dataset, with several modifications specific to remote sensing. We show that none of those pretrained models bring consistent improvements upon general-purpose baselines at the ViT-B scale.

101. 【2510.17007】An empirical study of the effect of video encoders on Temporal Video Grounding

链接https://arxiv.org/abs/2510.17007

作者:Ignacio M. De la Jara,Cristian Rodriguez-Opazo,Edison Marrese-Taylor,Felipe Bravo-Marquez

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:natural language query, Temporal video grounding, computer vision, aiming to localize, localize a natural

备注

点击查看摘要

Abstract:Temporal video grounding is a fundamental task in computer vision, aiming to localize a natural language query in a long, untrimmed video. It has a key role in the scientific community, in part due to the large amount of video generated every day. Although we find extensive work in this task, we note that research remains focused on a small selection of video representations, which may lead to architectural overfitting in the long run. To address this issue, we propose an empirical study to investigate the impact of different video features on a classical architecture. We extract features for three well-known benchmarks, Charades-STA, ActivityNet-Captions and YouCookII, using video encoders based on CNNs, temporal reasoning and transformers. Our results show significant differences in the performance of our model by simply changing the video encoder, while also revealing clear patterns and errors derived from the use of certain features, ultimately indicating potential feature complementarity.

102. 【2510.16989】raining-free Online Video Step Grounding

链接https://arxiv.org/abs/2510.16989

作者:Luca Zanella,Massimiliano Mancini,Yiming Wang,Alessio Tonioni,Elisa Ricci

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:aims to detect, Large Multimodal Models, Large Multimodal, Multimodal Models, Video

备注: NeurIPS 2025. Project website at [this https URL](https://lucazanella.github.io/baglm/)

点击查看摘要

Abstract:Given a task and a set of steps composing it, Video Step Grounding (VSG) aims to detect which steps are performed in a video. Standard approaches for this task require a labeled training set (e.g., with step-level annotations or narrations), which may be costly to collect. Moreover, they process the full video offline, limiting their applications for scenarios requiring online decisions. Thus, in this work, we explore how to perform VSG online and without training. We achieve this by exploiting the zero-shot capabilities of recent Large Multimodal Models (LMMs). In particular, we use LMMs to predict the step associated with a restricted set of frames, without access to the whole video. We show that this online strategy without task-specific tuning outperforms offline and training-based models. Motivated by this finding, we develop Bayesian Grounding with Large Multimodal Models (BaGLM), further injecting knowledge of past frames into the LMM-based predictions. BaGLM exploits Bayesian filtering principles, modeling step transitions via (i) a dependency matrix extracted through large language models and (ii) an estimation of step progress. Experiments on three datasets show superior performance of BaGLM over state-of-the-art training-based offline methods.

103. 【2510.16988】CARE: Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams

链接https://arxiv.org/abs/2510.16988

作者:Junhao Zhao,Zishuai Liu,Ruili Fang,Jin Lu,Linghan Zhang,Fei Dou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Ambient Assisted Living, Daily Living, Assisted Living, existing methods remain, methods remain constrained

备注

点击查看摘要

Abstract:The recognition of Activities of Daily Living (ADLs) from event-triggered ambient sensors is an essential task in Ambient Assisted Living, yet existing methods remain constrained by representation-level limitations. Sequence-based approaches preserve temporal order of sensor activations but are sensitive to noise and lack spatial awareness, while image-based approaches capture global patterns and implicit spatial correlations but compress fine-grained temporal dynamics and distort sensor layouts. Naive fusion (e.g., feature concatenation) fail to enforce alignment between sequence- and image-based representation views, underutilizing their complementary strengths. We propose Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams (CARE), an end-to-end framework that jointly optimizes representation learning via Sequence-Image Contrastive Alignment (SICA) and classification via cross-entropy, ensuring both cross-representation alignment and task-specific discriminability. CARE integrates (i) time-aware, noise-resilient sequence encoding with (ii) spatially-informed and frequency-sensitive image representations, and employs (iii) a joint contrastive-classification objective for end-to-end learning of aligned and discriminative embeddings. Evaluated on three CASAS datasets, CARE achieves state-of-the-art performance (89.8% on Milan, 88.9% on Cairo, and 73.3% on Kyoto7) and demonstrates robustness to sensor malfunctions and layout variability, highlighting its potential for reliable ADL recognition in smart homes.

104. 【2510.16983】One-step Diffusion Models with Bregman Density Ratio Matching

链接https://arxiv.org/abs/2510.16983

作者:Yuanzhi Zhu,Eleftherios Tsonis,Lucas Degeorge,Vicky Kalogeiton

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:slow multi-step sampling, remain computationally expensive, computationally expensive due, high generative quality, multi-step sampling

备注: work in progress

点击查看摘要

Abstract:Diffusion and flow models achieve high generative quality but remain computationally expensive due to slow multi-step sampling. Distillation methods accelerate them by training fast student generators, yet most existing objectives lack a unified theoretical foundation. In this work, we propose Di-Bregman, a compact framework that formulates diffusion distillation as Bregman divergence-based density-ratio matching. This convex-analytic view connects several existing objectives through a common lens. Experiments on CIFAR-10 and text-to-image generation demonstrate that Di-Bregman achieves improved one-step FID over reverse-KL distillation and maintains high visual fidelity compared to the teacher model. Our results highlight Bregman density-ratio matching as a practical and theoretically-grounded route toward efficient one-step diffusion generation.

105. 【2510.16973】Foundation Models in Medical Image Analysis: A Systematic Review and Meta-Analysis

链接https://arxiv.org/abs/2510.16973

作者:Praveenbalaji Rajendran,Mojtaba Safari,Wenfeng He,Mingzhe Hu,Shansong Wang,Jun Zhou,Xiaofeng Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)

关键词:Recent advancements, diverse medical imaging, revolutionized medical image, medical image analysis, artificial intelligence

备注

点击查看摘要

Abstract:Recent advancements in artificial intelligence (AI), particularly foundation models (FMs), have revolutionized medical image analysis, demonstrating strong zero- and few-shot performance across diverse medical imaging tasks, from segmentation to report generation. Unlike traditional task-specific AI models, FMs leverage large corpora of labeled and unlabeled multimodal datasets to learn generalized representations that can be adapted to various downstream clinical applications with minimal fine-tuning. However, despite the rapid proliferation of FM research in medical imaging, the field remains fragmented, lacking a unified synthesis that systematically maps the evolution of architectures, training paradigms, and clinical applications across modalities. To address this gap, this review article provides a comprehensive and structured analysis of FMs in medical image analysis. We systematically categorize studies into vision-only and vision-language FMs based on their architectural foundations, training strategies, and downstream clinical tasks. Additionally, a quantitative meta-analysis of the studies was conducted to characterize temporal trends in dataset utilization and application domains. We also critically discuss persistent challenges, including domain adaptation, efficient fine-tuning, computational constraints, and interpretability along with emerging solutions such as federated learning, knowledge distillation, and advanced prompting. Finally, we identify key future research directions aimed at enhancing the robustness, explainability, and clinical integration of FMs, thereby accelerating their translation into real-world medical practice.

106. 【2510.16948】Unlocking Off-the-Grid Sparse Recovery with Unlimited Sensing: Simultaneous Super-Resolution in Time and Amplitude

链接https://arxiv.org/abs/2510.16948

作者:Ruiming Guo,Ayush Bhandari

类目:Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

关键词:Dirac impulses, signal processing, classical problem, problem in signal, filtered measurements

备注: 28 Pages, 10 figures. To appear in IEEE Journal of Selected Topics in Signal Processing

点击查看摘要

Abstract:The recovery of Dirac impulses, or spikes, from filtered measurements is a classical problem in signal processing. As the spikes lie in the continuous domain while measurements are discrete, this task is known as super-resolution or off-the-grid sparse recovery. Despite significant theoretical and algorithmic advances over the past decade, these developments often overlook critical challenges at the analog-digital interface. In particular, when spikes exhibit strong-weak amplitude disparity, conventional digital acquisition may result in clipping of strong components or loss of weak ones beneath the quantization noise floor. This motivates a broader perspective: super-resolution must simultaneously resolve both amplitude and temporal structure. Under a fixed bit budget, such information loss is unavoidable. In contrast, the emerging theory and practice of the Unlimited Sensing Framework (USF) demonstrate that these fundamental limitations can be overcome. Building on this foundation, we demonstrate that modulo encoding within USF enables digital super-resolution by enhancing measurement precision, thereby unlocking temporal super-resolution beyond conventional limits. We develop new theoretical results that extend to non-bandlimited kernels commonly encountered in practice and introduce a robust algorithm for off-the-grid sparse recovery. To demonstrate practical impact, we instantiate our framework in the context of time-of-flight imaging. Both numerical simulations and hardware experiments validate the effectiveness of our approach under low-bit quantization, enabling super-resolution in amplitude and time.

107. 【2510.16926】Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input

链接https://arxiv.org/abs/2510.16926

作者:Chenxu Li,Zhicai Wang,Yuan Sheng,Xingyu Zhu,Yanbin Hao,Xiang Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

备注: 23 pages,19 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce \textbf{Res-Bench}, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.

108. 【2510.16914】Domain Generalizable Continual Learning

链接https://arxiv.org/abs/2510.16914

作者:Hongwei Yan,Guanglong Sun,Zhiqi Kang,Yi Zhong,Liyuan Wang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:dynamic real-world environments, unseen scenarios, real-world environments, intelligent systems, adapt effectively

备注: 25 pages

点击查看摘要

Abstract:To adapt effectively to dynamic real-world environments, intelligent systems must continually acquire new skills while generalizing them to diverse, unseen scenarios. Here, we introduce a novel and realistic setting named domain generalizable continual learning (DGCL): a model learns sequential tasks with each involving a single domain, aiming to perform well across all encountered tasks and domains. This setting poses unique challenges in acquiring, retaining, and leveraging both semantic- and domain-relevant information for robust generalization. Although state-of-the-art continual learning (CL) methods have employed pre-trained models (PTMs) to enhance task-specific generalization, they typically assume identical training and testing domains for each task and therefore perform poorly in DGCL. To this end, we propose adaptive Domain Transformation (DoT), an innovative PTMs-based approach tailored to DGCL. Inspired by the distributed-plus-hub theory of the human brain, DoT disentangles semantic- and domain-relevant information in representation learning, and adaptively transforms task representations across various domains for output alignment, ensuring balanced and generalized predictions. DoT serves as a plug-in strategy that greatly facilitates state-of-the-art CL baselines under both full parameter tuning and parameter-efficient tuning paradigms in DGCL, validated by extensive experiments. Also, DoT is shown to accumulate domain-generalizable knowledge from DGCL, and ensure resource efficiency with a lightweight implementation.

109. 【2510.16913】Beyond RGB: Leveraging Vision Transformers for Thermal Weapon Segmentation

链接https://arxiv.org/abs/2510.16913

作者:Akhila Kambhatla,Ahmed R Khaled

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:RGB-based systems fail, visually obscured conditions, Thermal weapon segmentation, systems fail, lowlight and visually

备注: 9 Images with 1 figure and 3 Tables. This is a preprint submitted to arXiv

点击查看摘要

Abstract:Thermal weapon segmentation is crucial for surveillance and security applications, enabling robust detection under lowlight and visually obscured conditions where RGB-based systems fail. While convolutional neural networks (CNNs) dominate thermal segmentation literature, their ability to capture long-range dependencies and fine structural details is limited. Vision Transformers (ViTs), with their global context modeling capabilities, have achieved state-of-the-art results in RGB segmentation tasks, yet their potential in thermal weapon segmentation remains underexplored. This work adapts and evaluates four transformer-based architectures SegFormer, DeepLabV3\+, SegNeXt, and Swin Transformer for binary weapon segmentation on a custom thermal dataset comprising 9,711 images collected from real world surveillance videos and automatically annotated using SAM2. We employ standard augmentation strategies within the MMSegmentation framework to ensure robust model training and fair architectural comparison. Experimental results demonstrate significant improvements in segmentation performance: SegFormer-b5 achieves the highest mIoU (94.15\%) and Pixel Accuracy (97.04\%), while SegFormer-b0 provides the fastest inference speed (98.32 FPS) with competitive mIoU (90.84\%). SegNeXt-mscans offers balanced performance with 85.12 FPS and 92.24\% mIoU, and DeepLabV3\+ R101-D8 reaches 92.76\% mIoU at 29.86 FPS. The transformer architectures demonstrate robust generalization capabilities for weapon detection in low-light and occluded thermal environments, with flexible accuracy-speed trade-offs suitable for diverse real-time security applications.

110. 【2510.16891】Contrail-to-Flight Attribution Using Ground Visible Cameras and Flight Surveillance Data

链接https://arxiv.org/abs/2510.16891

作者:Ramon Dalmau,Gabriel Jarry,Philippe Very

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:climate impact, significant contributor, contrails, Aviation, temporal resolution

备注

点击查看摘要

Abstract:Aviation's non-CO2 effects, particularly contrails, are a significant contributor to its climate impact. Persistent contrails can evolve into cirrus-like clouds that trap outgoing infrared radiation, with radiative forcing potentially comparable to or exceeding that of aviation's CO2 emissions. While physical models simulate contrail formation, evolution and dissipation, validating and calibrating these models requires linking observed contrails to the flights that generated them, a process known as contrail-to-flight attribution. Satellite-based attribution is challenging due to limited spatial and temporal resolution, as contrails often drift and deform before detection. In this paper, we evaluate an alternative approach using ground-based cameras, which capture contrails shortly after formation at high spatial and temporal resolution, when they remain thin, linear, and visually distinct. Leveraging the ground visible camera contrail sequences (GVCCS) dataset, we introduce a modular framework for attributing contrails observed using ground-based cameras to theoretical contrails derived from aircraft surveillance and meteorological data. The framework accommodates multiple geometric representations and distance metrics, incorporates temporal smoothing, and enables flexible probability-based assignment strategies. This work establishes a strong baseline and provides a modular framework for future research in linking contrails to their source flight.

111. 【2510.16888】Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

链接https://arxiv.org/abs/2510.16888

作者:Zongjian Li,Zheyuan Liu,Qihui Zhang,Bin Lin,Shenghai Yuan,Zhiyuan Yan,Yang Ye,Wangbo Yu,Yuwei Niu,Li Yuan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved remarkable progress, Instruction-based image editing, Instruction-based image, Diffusion Negative-aware Finetuning, remarkable progress

备注

点击查看摘要

Abstract:Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves \textbf{state-of-the-art} results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available at this https URL.

112. 【2510.16887】Class-N-Diff: Classification-Induced Diffusion Model Can Make Fair Skin Cancer Diagnosis

链接https://arxiv.org/abs/2510.16887

作者:Nusrat Munia,Abdullah Imran

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:demonstrated remarkable capability, high-quality synthetic data, generating high-quality synthetic, Generative models, including medical images

备注: EMBC 2025

点击查看摘要

Abstract:Generative models, especially Diffusion Models, have demonstrated remarkable capability in generating high-quality synthetic data, including medical images. However, traditional class-conditioned generative models often struggle to generate images that accurately represent specific medical categories, limiting their usefulness for applications such as skin cancer diagnosis. To address this problem, we propose a classification-induced diffusion model, namely, Class-N-Diff, to simultaneously generate and classify dermoscopic images. Our Class-N-Diff model integrates a classifier within a diffusion model to guide image generation based on its class conditions. Thus, the model has better control over class-conditioned image synthesis, resulting in more realistic and diverse images. Additionally, the classifier demonstrates improved performance, highlighting its effectiveness for downstream diagnostic tasks. This unique integration in our Class-N-Diff makes it a robust tool for enhancing the quality and utility of diffusion model-based synthetic dermoscopic image generation. Our code is available at this https URL.

113. 【2510.16877】Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning

链接https://arxiv.org/abs/2510.16877

作者:Heming Zou,Yunliang Zang,Wutong Xu,Xiangyang Ji

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:mitigate catastrophic forgetting, continual representation learning, representation learning paradigm, learning paradigm reframes, paradigm reframes parameter

备注

点击查看摘要

Abstract:Using a nearly-frozen pretrained model, the continual representation learning paradigm reframes parameter updates as a similarity-matching problem to mitigate catastrophic forgetting. However, directly leveraging pretrained features for downstream tasks often suffers from multicollinearity in the similarity-matching stage, and more advanced methods can be computationally prohibitive for real-time, low-latency applications. Inspired by the fly olfactory circuit, we propose Fly-CL, a bio-inspired framework compatible with a wide range of pretrained backbones. Fly-CL substantially reduces training time while achieving performance comparable to or exceeding that of current state-of-the-art methods. We theoretically show how Fly-CL progressively resolves multicollinearity, enabling more effective similarity matching with low time complexity. Extensive simulation experiments across diverse network architectures and data regimes validate Fly-CL's effectiveness in addressing this challenge through a biologically inspired design. Code is available at this https URL.

114. 【2510.16870】Uncovering Brain-Like Hierarchical Patterns in Vision-Language Models through fMRI-Based Neural Encoding

链接https://arxiv.org/abs/2510.16870

作者:Yudan Ren,Xinlong Wang,Kexin Wang,Tian Xia,Zihan Ma,Zhaowei Li,Xiangrong Bi,Xiao Li,Xiaowei He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:research primarily focuses, unimodal ANN studies, high-level model outputs, ANN studies fail, ANN research primarily

备注: 14 pages, 7 figures

点击查看摘要

Abstract:While brain-inspired artificial intelligence(AI) has demonstrated promising results, current understanding of the parallels between artificial neural networks (ANNs) and human brain processing remains limited: (1) unimodal ANN studies fail to capture the brain's inherent multimodal processing capabilities, and (2) multimodal ANN research primarily focuses on high-level model outputs, neglecting the crucial role of individual neurons. To address these limitations, we propose a novel neuron-level analysis framework that investigates the multimodal information processing mechanisms in vision-language models (VLMs) through the lens of human brain activity. Our approach uniquely combines fine-grained artificial neuron (AN) analysis with fMRI-based voxel encoding to examine two architecturally distinct VLMs: CLIP and METER. Our analysis reveals four key findings: (1) ANs successfully predict biological neurons (BNs) activities across multiple functional networks (including language, vision, attention, and default mode), demonstrating shared representational mechanisms; (2) Both ANs and BNs demonstrate functional redundancy through overlapping neural representations, mirroring the brain's fault-tolerant and collaborative information processing mechanisms; (3) ANs exhibit polarity patterns that parallel the BNs, with oppositely activated BNs showing mirrored activation trends across VLM layers, reflecting the complexity and bidirectional nature of neural information processing; (4) The architectures of CLIP and METER drive distinct BNs: CLIP's independent branches show modality-specific specialization, whereas METER's cross-modal design yields unified cross-modal activation, highlighting the architecture's influence on ANN brain-like properties. These results provide compelling evidence for brain-like hierarchical processing in VLMs at the neuronal level.

115. 【2510.16865】Registration is a Powerful Rotation-Invariance Learner for 3D Anomaly Detection

链接https://arxiv.org/abs/2510.16865

作者:Yuyang Yu,Zhengwei Chen,Xuemiao Xu,Lei Zhang,Haoxin Yang,Yongwei Nie,Shengfeng He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:industrial quality control, identify structural defects, quality control, aiming to identify, high reliability

备注

点击查看摘要

Abstract:3D anomaly detection in point-cloud data is critical for industrial quality control, aiming to identify structural defects with high reliability. However, current memory bank-based methods often suffer from inconsistent feature transformations and limited discriminative capacity, particularly in capturing local geometric details and achieving rotation invariance. These limitations become more pronounced when registration fails, leading to unreliable detection results. We argue that point-cloud registration plays an essential role not only in aligning geometric structures but also in guiding feature extraction toward rotation-invariant and locally discriminative representations. To this end, we propose a registration-induced, rotation-invariant feature extraction framework that integrates the objectives of point-cloud registration and memory-based anomaly detection. Our key insight is that both tasks rely on modeling local geometric structures and leveraging feature similarity across samples. By embedding feature extraction into the registration learning process, our framework jointly optimizes alignment and representation learning. This integration enables the network to acquire features that are both robust to rotations and highly effective for anomaly detection. Extensive experiments on the Anomaly-ShapeNet and Real3D-AD datasets demonstrate that our method consistently outperforms existing approaches in effectiveness and generalizability.

116. 【2510.16863】BARL: Bilateral Alignment in Representation and Label Spaces for Semi-Supervised Volumetric Medical Image Segmentation

链接https://arxiv.org/abs/2510.16863

作者:Shujian Gao,Yuan Wang,Zekuan Yu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:reducing annotation cost, Semi-supervised medical image, match fully supervised, fully supervised performance, sharply reducing annotation

备注: 14 pages, 5 figures

点击查看摘要

Abstract:Semi-supervised medical image segmentation (SSMIS) seeks to match fully supervised performance while sharply reducing annotation cost. Mainstream SSMIS methods rely on \emph{label-space consistency}, yet they overlook the equally critical \emph{representation-space alignment}. Without harmonizing latent features, models struggle to learn representations that are both discriminative and spatially coherent. To this end, we introduce \textbf{Bilateral Alignment in Representation and Label spaces (BARL)}, a unified framework that couples two collaborative branches and enforces alignment in both spaces. For label-space alignment, inspired by co-training and multi-scale decoding, we devise \textbf{Dual-Path Regularization (DPR)} and \textbf{Progressively Cognitive Bias Correction (PCBC)} to impose fine-grained cross-branch consistency while mitigating error accumulation from coarse to fine scales. For representation-space alignment, we conduct region-level and lesion-instance matching between branches, explicitly capturing the fragmented, complex pathological patterns common in medical imagery. Extensive experiments on four public benchmarks and a proprietary CBCT dataset demonstrate that BARL consistently surpasses state-of-the-art SSMIS methods. Ablative studies further validate the contribution of each component. Code will be released soon.

117. 【2510.16854】ArmFormer: Lightweight Transformer Architecture for Real-Time Multi-Class Weapon Segmentation and Classification

链接https://arxiv.org/abs/2510.16854

作者:Akhila Kambhatla,Taminul Islam,Khaled R Ahmed

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:weapon-related violence necessitates, violence necessitates automated, necessitates automated detection, automated detection systems, detection systems capable

备注: 9 pages with 4 figures and 5 tables. This is a preprint submitted to arXiv

点击查看摘要

Abstract:The escalating threat of weapon-related violence necessitates automated detection systems capable of pixel-level precision for accurate threat assessment in real-time security applications. Traditional weapon detection approaches rely on object detection frameworks that provide only coarse bounding box localizations, lacking the fine-grained segmentation required for comprehensive threat analysis. Furthermore, existing semantic segmentation models either sacrifice accuracy for computational efficiency or require excessive computational resources incompatible with edge deployment scenarios. This paper presents ArmFormer, a lightweight transformer-based semantic segmentation framework that strategically integrates Convolutional Block Attention Module (CBAM) with MixVisionTransformer architecture to achieve superior accuracy while maintaining computational efficiency suitable for resource-constrained edge devices. Our approach combines CBAM-enhanced encoder backbone with attention-integrated hamburger decoder to enable multi-class weapon segmentation across five categories: handgun, rifle, knife, revolver, and human. Comprehensive experiments demonstrate that ArmFormer achieves state-of-the-art performance with 80.64% mIoU and 89.13% mFscore while maintaining real-time inference at 82.26 FPS. With only 4.886G FLOPs and 3.66M parameters, ArmFormer outperforms heavyweight models requiring up to 48x more computation, establishing it as the optimal solution for deployment on portable security cameras, surveillance drones, and embedded AI accelerators in distributed security infrastructure.

118. 【2510.16837】2DGS-R: Revisiting the Normal Consistency Regularization in 2D Gaussian Splatting

链接https://arxiv.org/abs/2510.16837

作者:Haofan Ren,Qingsong Yan,Ming Lu,Rongfeng Lu,Zunjie Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:influenced neural fields, greatly influenced neural, Gaussian Splatting, Recent advancements, enables high-fidelity rendering

备注

点击查看摘要

Abstract:Recent advancements in 3D Gaussian Splatting (3DGS) have greatly influenced neural fields, as it enables high-fidelity rendering with impressive visual quality. However, 3DGS has difficulty accurately representing surfaces. In contrast, 2DGS transforms the 3D volume into a collection of 2D planar Gaussian disks. Despite advancements in geometric fidelity, rendering quality remains compromised, highlighting the challenge of achieving both high-quality rendering and precise geometric structures. This indicates that optimizing both geometric and rendering quality in a single training stage is currently unfeasible. To overcome this limitation, we present 2DGS-R, a new method that uses a hierarchical training approach to improve rendering quality while maintaining geometric accuracy. 2DGS-R first trains the original 2D Gaussians with the normal consistency regularization. Then 2DGS-R selects the 2D Gaussians with inadequate rendering quality and applies a novel in-place cloning operation to enhance the 2D Gaussians. Finally, we fine-tune the 2DGS-R model with opacity frozen. Experimental results show that compared to the original 2DGS, our method requires only 1\% more storage and minimal additional training time. Despite this negligible overhead, it achieves high-quality rendering results while preserving fine geometric structures. These findings indicate that our approach effectively balances efficiency with performance, leading to improvements in both visual fidelity and geometric reconstruction accuracy.

119. 【2510.16833】From Mannequin to Human: A Pose-Aware and Identity-Preserving Video Generation Framework for Lifelike Clothing Display

链接https://arxiv.org/abs/2510.16833

作者:Xiangyu Mu,Dongliang Zhou,Jie Hou,Haijun Zhang,Weili Guan

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:Mannequin-based clothing displays, online fashion presentation, clothing displays offer, displays offer, offer a cost-effective

备注

点击查看摘要

Abstract:Mannequin-based clothing displays offer a cost-effective alternative to real-model showcases for online fashion presentation, but lack realism and expressive detail. To overcome this limitation, we introduce a new task called mannequin-to-human (M2H) video generation, which aims to synthesize identity-controllable, photorealistic human videos from footage of mannequins. We propose M2HVideo, a pose-aware and identity-preserving video generation framework that addresses two key challenges: the misalignment between head and body motion, and identity drift caused by temporal modeling. In particular, M2HVideo incorporates a dynamic pose-aware head encoder that fuses facial semantics with body pose to produce consistent identity embeddings across frames. To address the loss of fine facial details due to latent space compression, we introduce a mirror loss applied in pixel space through a denoising diffusion implicit model (DDIM)-based one-step denoising. Additionally, we design a distribution-aware adapter that aligns statistical distributions of identity and clothing features to enhance temporal coherence. Extensive experiments on the UBC fashion dataset, our self-constructed ASOS dataset, and the newly collected MannequinVideos dataset captured on-site demonstrate that M2HVideo achieves superior performance in terms of clothing consistency, identity preservation, and video fidelity in comparison to state-of-the-art methods.

120. 【2510.16832】Robust Cross-Domain Adaptation in Texture Features Transferring for Wood Chip Moisture Content Prediction

链接https://arxiv.org/abs/2510.16832

作者:Abdur Rahman,Mohammad Marufuzzaman,Jason Street,Haifeng Wang,Veera G. Gude,Randy Buchanan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:ensuring energy efficiency, optimizing biofuel production, wood chip, wood chip moisture, energy efficiency

备注

点击查看摘要

Abstract:Accurate and quick prediction of wood chip moisture content is critical for optimizing biofuel production and ensuring energy efficiency. The current widely used direct method (oven drying) is limited by its longer processing time and sample destructiveness. On the other hand, existing indirect methods, including near-infrared spectroscopy-based, electrical capacitance-based, and image-based approaches, are quick but not accurate when wood chips come from various sources. Variability in the source material can alter data distributions, undermining the performance of data-driven models. Therefore, there is a need for a robust approach that effectively mitigates the impact of source variability. Previous studies show that manually extracted texture features have the potential to predict wood chip moisture class. Building on this, in this study, we conduct a comprehensive analysis of five distinct texture feature types extracted from wood chip images to predict moisture content. Our findings reveal that a combined feature set incorporating all five texture features achieves an accuracy of 95% and consistently outperforms individual texture features in predicting moisture content. To ensure robust moisture prediction, we propose a domain adaptation method named AdaptMoist that utilizes the texture features to transfer knowledge from one source of wood chip data to another, addressing variability across different domains. We also proposed a criterion for model saving based on adjusted mutual information. The AdaptMoist method improves prediction accuracy across domains by 23%, achieving an average accuracy of 80%, compared to 57% for non-adapted models. These results highlight the effectiveness of AdaptMoist as a robust solution for wood chip moisture content estimation across domains, making it a potential solution for wood chip-reliant industries.

121. 【2510.16822】ReefNet: A Large scale, Taxonomically Enriched Dataset and Benchmark for Hard Coral Classification

链接https://arxiv.org/abs/2510.16822

作者:Yahia Battach,Abdulwahab Felemban,Faizan Farooq Khan,Yousef A. Radwan,Xiang Li,Fabio Marchese,Sara Beery,Burton H. Jones,Francesca Benzoni,Mohamed Elhoseiny

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:rapidly declining due, climate change, underscoring the urgent, rapidly declining, declining due

备注

点击查看摘要

Abstract:Coral reefs are rapidly declining due to anthropogenic pressures such as climate change, underscoring the urgent need for scalable, automated monitoring. We introduce ReefNet, a large public coral reef image dataset with point-label annotations mapped to the World Register of Marine Species (WoRMS). ReefNet aggregates imagery from 76 curated CoralNet sources and an additional site from Al Wajh in the Red Sea, totaling approximately 925000 genus-level hard coral annotations with expert-verified labels. Unlike prior datasets, which are often limited by size, geography, or coarse labels and are not ML-ready, ReefNet offers fine-grained, taxonomically mapped labels at a global scale to WoRMS. We propose two evaluation settings: (i) a within-source benchmark that partitions each source's images for localized evaluation, and (ii) a cross-source benchmark that withholds entire sources to test domain generalization. We analyze both supervised and zero-shot classification performance on ReefNet and find that while supervised within-source performance is promising, supervised performance drops sharply across domains, and performance is low across the board for zero-shot models, especially for rare and visually similar genera. This provides a challenging benchmark intended to catalyze advances in domain generalization and fine-grained coral classification. We will release our dataset, benchmarking code, and pretrained models to advance robust, domain-adaptive, global coral reef monitoring and conservation.

122. 【2510.16814】Needles in the Landscape: Semi-Supervised Pseudolabeling for Archaeological Site Discovery under Label Scarcity

链接https://arxiv.org/abs/2510.16814

作者:Simon Jaxy,Anton Theys,Patrick Willett,W. Chris Carleton,Ralf Vandam,Pieter Libin

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:predictive modelling estimates, Archaeological predictive modelling, modelling estimates, occur by combining, Conditional Random Field

备注

点击查看摘要

Abstract:Archaeological predictive modelling estimates where undiscovered sites are likely to occur by combining known locations with environmental, cultural, and geospatial variables. We address this challenge using a deep learning approach but must contend with structural label scarcity inherent to archaeology: positives are rare, and most locations are unlabeled. To address this, we adopt a semi-supervised, positive-unlabeled (PU) learning strategy, implemented as a semantic segmentation model and evaluated on two datasets covering a representative range of archaeological periods. Our approach employs dynamic pseudolabeling, refined with a Conditional Random Field (CRF) implemented via an RNN, increasing label confidence under severe class imbalance. On a geospatial dataset derived from a digital elevation model (DEM), our model performs on par with the state-of-the-art, LAMAP, while achieving higher Dice scores. On raw satellite imagery, assessed end-to-end with stratified k-fold cross-validation, it maintains performance and yields predictive surfaces with improved interpretability. Overall, our results indicate that semi-supervised learning offers a promising approach to identifying undiscovered sites across large, sparsely annotated landscapes.

123. 【2510.16800】An RGB-D Image Dataset for Lychee Detection and Maturity Classification for Robotic Harvesting

链接https://arxiv.org/abs/2510.16800

作者:Zhenpeng Zhang,Yi Wang,Shanglei Chai,Yingying Liu,Zekai Xie,Wenhao Huang,Pengyu Li,Zipei Luo,Dajiang Lu,Yibin Tian

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:high-value subtropical fruit, high-value subtropical, Lychee, images, harvesting robots

备注

点击查看摘要

Abstract:Lychee is a high-value subtropical fruit. The adoption of vision-based harvesting robots can significantly improve productivity while reduce reliance on labor. High-quality data are essential for developing such harvesting robots. However, there are currently no consistently and comprehensively annotated open-source lychee datasets featuring fruits in natural growing environments. To address this, we constructed a dataset to facilitate lychee detection and maturity classification. Color (RGB) images were acquired under diverse weather conditions, and at different times of the day, across multiple lychee varieties, such as Nuomici, Feizixiao, Heiye, and Huaizhi. The dataset encompasses three different ripeness stages and contains 11,414 images, consisting of 878 raw RGB images, 8,780 augmented RGB images, and 1,756 depth images. The images are annotated with 9,658 pairs of lables for lychee detection and maturity classification. To improve annotation consistency, three individuals independently labeled the data, and their results were then aggregated and verified by a fourth reviewer. Detailed statistical analyses were done to examine the dataset. Finally, we performed experiments using three representative deep learning models to evaluate the dataset. It is publicly available for academic

124. 【2510.16791】Personalized Image Filter: Mastering Your Photographic Style

链接https://arxiv.org/abs/2510.16791

作者:Chengxuan Zhu,Shuchen Weng,Jiacong Fang,Peixuan Zhang,Si Li,Chao Xu,Boxin Shi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Photographic style, photographic concepts, Photographic, renowned photographers, charm behind renowned

备注

点击查看摘要

Abstract:Photographic style, as a composition of certain photographic concepts, is the charm behind renowned photographers. But learning and transferring photographic style need a profound understanding of how the photo is edited from the unknown original appearance. Previous works either fail to learn meaningful photographic concepts from reference images, or cannot preserve the content of the content image. To tackle these issues, we proposed a Personalized Image Filter (PIF). Based on a pretrained text-to-image diffusion model, the generative prior enables PIF to learn the average appearance of photographic concepts, as well as how to adjust them according to text prompts. PIF then learns the photographic style of reference images with the textual inversion technique, by optimizing the prompts for the photographic concepts. PIF shows outstanding performance in extracting and transferring various kinds of photographic style. Project page: this https URL

125. 【2510.16790】Unsupervised Monocular Road Segmentation for Autonomous Driving via Scene Geometry

链接https://arxiv.org/abs/2510.16790

作者:Sara Hatami Rostami,Behrooz Nasihatkon

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:costly manually labeled, manually labeled datasets, fully unsupervised approach, eliminating the reliance, paper presents

备注: 7 pages, 3 figures

点击查看摘要

Abstract:This paper presents a fully unsupervised approach for binary road segmentation (road vs. non-road), eliminating the reliance on costly manually labeled datasets. The method leverages scene geometry and temporal cues to distinguish road from non-road regions. Weak labels are first generated from geometric priors, marking pixels above the horizon as non-road and a predefined quadrilateral in front of the vehicle as road. In a refinement stage, temporal consistency is enforced by tracking local feature points across frames and penalizing inconsistent label assignments using mutual information maximization. This enhances both precision and temporal stability. On the Cityscapes dataset, the model achieves an Intersection-over-Union (IoU) of 0.82, demonstrating high accuracy with a simple design. These findings demonstrate the potential of combining geometric constraints and temporal consistency for scalable unsupervised road segmentation in autonomous driving.

126. 【2510.16785】Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs

链接https://arxiv.org/abs/2510.16785

作者:Jiazhen Liu,Long Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal Large Language, Large Language Models, Integrating diverse visual, Multimodal Large, Large Language

备注

点击查看摘要

Abstract:Integrating diverse visual capabilities into a unified model is a significant trend in Multimodal Large Language Models (MLLMs). Among these, the inclusion of segmentation poses a distinct set of challenges. To equip MLLMs with pixel-level segmentation abilities, prevailing methods require finetuning the model to produce specific outputs compatible with a mask decoder. This process typically alters the model's output space and compromises its intrinsic generalization, which undermines the goal of building a unified model. We introduce LENS (Leveraging kEypoiNts for MLLMs' Segmentation), a novel plug-and-play solution. LENS attaches a lightweight, trainable head to a completely frozen MLLM. By refining the spatial cues embedded in attention maps, LENS extracts keypoints and describes them into point-wise features directly compatible with the mask decoder. Extensive experiments validate our approach: LENS achieves segmentation performance competitive with or superior to that of retraining-based methods. Crucially, it does so while fully preserving the MLLM's generalization capabilities, which are significantly degraded by finetuning approaches. As such, the attachable design of LENS establishes an efficient and powerful paradigm for extending MLLMs, paving the way for truly multi-talented, unified models.

127. 【2510.16781】Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features

链接https://arxiv.org/abs/2510.16781

作者:Shihao Ji,Zihui Song

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Visual Language Models, large-scale Visual Language, Language Models, Visual Language, remarkable zero-shot reasoning

备注

点击查看摘要

Abstract:The remarkable zero-shot reasoning capabilities of large-scale Visual Language Models (VLMs) on static images have yet to be fully translated to the video domain. Conventional video understanding models often rely on extensive, task-specific training on annotated datasets, a process that is both costly and limited in scalability. This paper introduces a novel, training-free framework for video understanding that circumvents end-to-end training by synergistically combining the rich semantic priors of pre-trained VLMs with classic machine learning algorithms for pattern discovery. Our core idea is to reframe video understanding as a self-supervised spatio-temporal clustering problem within a high-dimensional semantic feature space. The proposed pipeline first transforms a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal Segmentation (KTS), a robust machine learning technique, to partition the continuous feature stream into discrete, semantically coherent event segments. These segments are then subjected to unsupervised density-based clustering to identify recurring macroscopic scenes and themes throughout the video. By selecting representative keyframes from each discovered cluster and leveraging the VLM's generative capabilities for textual description, our framework automatically produces a structured, multi-modal summary of the video content. This approach provides an effective, interpretable, and model-agnostic pathway for zero-shot, automated structural analysis of video content.

128. 【2510.16777】GS2POSE: Marry Gaussian Splatting to 6D Object Pose Estimation

链接https://arxiv.org/abs/2510.16777

作者:Junbo Li,Weimin Yuan,Yinuo Wang,Yue Zeng,Shihao Shu,Cai Meng,Xiangzhi Bai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:current research typically, research typically predicts, computer vision, fundamental task, task in computer

备注

点击查看摘要

Abstract:Accurate 6D pose estimation of 3D objects is a fundamental task in computer vision, and current research typically predicts the 6D pose by establishing correspondences between 2D image features and 3D model features. However, these methods often face difficulties with textureless objects and varying illumination conditions. To overcome these limitations, we propose GS2POSE, a novel approach for 6D object pose estimation. GS2POSE formulates a pose regression algorithm inspired by the principles of Bundle Adjustment (BA). By leveraging Lie algebra, we extend the capabilities of 3DGS to develop a pose-differentiable rendering pipeline, which iteratively optimizes the pose by comparing the input image to the rendered image. Additionally, GS2POSE updates color parameters within the 3DGS model, enhancing its adaptability to changes in illumination. Compared to previous models, GS2POSE demonstrates accuracy improvements of 1.4\%, 2.8\% and 2.5\% on the T-LESS, LineMod-Occlusion and LineMod datasets, respectively.

129. 【2510.16776】EMRRG: Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation

链接https://arxiv.org/abs/2510.16776

作者:Mingzheng Zhang,Jinfeng Gao,Dan Xu,Jiangrui Yu,Yuhan Qiao,Lan Chen,Jin Tang,Xiao Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:patient wait times, significantly reduce diagnostic, reduce diagnostic burdens, Large Language Models, wait times

备注

点击查看摘要

Abstract:X-ray image-based medical report generation (MRG) is a pivotal area in artificial intelligence that can significantly reduce diagnostic burdens for clinicians and patient wait times. Existing MRG models predominantly rely on Large Language Models (LLMs) to improve report generation, with limited exploration of pre-trained vision foundation models or advanced fine-tuning techniques. Mainstream frameworks either avoid fine-tuning or utilize simplistic methods like LoRA, often neglecting the potential of enhancing cross-attention mechanisms. Additionally, while Transformer-based models dominate vision-language tasks, non-Transformer architectures, such as the Mamba network, remain underexplored for medical report generation, presenting a promising avenue for future research. In this paper, we propose EMRRG, a novel X-ray report generation framework that fine-tunes pre-trained Mamba networks using parameter-efficient methods. Specifically, X-ray images are divided into patches, tokenized, and processed by an SSM-based vision backbone for feature extraction, with Partial LoRA yielding optimal performance. An LLM with a hybrid decoder generates the medical report, enabling end-to-end training and achieving strong results on benchmark datasets. Extensive experiments on three widely used benchmark datasets fully validated the effectiveness of our proposed strategies for the X-ray MRG. The source code of this paper will be released on this https URL.

130. 【2510.16772】Region in Context: Text-condition Image editing with Human-like semantic reasoning

链接https://arxiv.org/abs/2510.16772

作者:Thuy Phuong Vu,Dinh-Cuong Hoang,Minhhuy Le,Phan Xuan Tan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:made significant progress, Recent research, based on text, research has made, made significant

备注

点击查看摘要

Abstract:Recent research has made significant progress in localizing and editing image regions based on text. However, most approaches treat these regions in isolation, relying solely on local cues without accounting for how each part contributes to the overall visual and semantic composition. This often results in inconsistent edits, unnatural transitions, or loss of coherence across the image. In this work, we propose Region in Context, a novel framework for text-conditioned image editing that performs multilevel semantic alignment between vision and language, inspired by the human ability to reason about edits in relation to the whole scene. Our method encourages each region to understand its role within the global image context, enabling precise and harmonized changes. At its core, the framework introduces a dual-level guidance mechanism: regions are represented with full-image context and aligned with detailed region-level descriptions, while the entire image is simultaneously matched to a comprehensive scene-level description generated by a large vision-language model. These descriptions serve as explicit verbal references of the intended content, guiding both local modifications and global structure. Experiments show that it produces more coherent and instruction-aligned results. Code is available at: this https URL

131. 【2510.16765】WaMaIR: Image Restoration via Multiscale Wavelet Convolutions and Mamba-based Channel Modeling with Texture Enhancement

链接https://arxiv.org/abs/2510.16765

作者:Shengyu Zhu, Fan,Fuxuan Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:frameworks demonstrate significant, significant computational efficiency, demonstrate significant computational, CNN-based frameworks demonstrate, computer vision

备注: Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Oral

点击查看摘要

Abstract:Image restoration is a fundamental and challenging task in computer vision, where CNN-based frameworks demonstrate significant computational efficiency. However, previous CNN-based methods often face challenges in adequately restoring fine texture details, which are limited by the small receptive field of CNN structures and the lack of channel feature modeling. In this paper, we propose WaMaIR, which is a novel framework with a large receptive field for image perception and improves the reconstruction of texture details in restored images. Specifically, we introduce the Global Multiscale Wavelet Transform Convolutions (GMWTConvs) for expandding the receptive field to extract image features, preserving and enriching texture features in model inputs. Meanwhile, we propose the Mamba-Based Channel-Aware Module (MCAM), explicitly designed to capture long-range dependencies within feature channels, which enhancing the model sensitivity to color, edges, and texture information. Additionally, we propose Multiscale Texture Enhancement Loss (MTELoss) for image restoration to guide the model in preserving detailed texture structures effectively. Extensive experiments confirm that WaMaIR outperforms state-of-the-art methods, achieving better image restoration and efficient computational performance of the model.

132. 【2510.16756】End-to-end Listen, Look, Speak and Act

链接https://arxiv.org/abs/2510.16756

作者:Siyin Wang,Wenyi Yu,Xianzhao Chen,Xiaohai Tian,Jun Zhang,Lu Lu,Chao Zhang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Audio and Speech Processing (eess.AS)

关键词:models simulating humans, fluidly adapt, Human interaction, speak while acting, inherently multimodal

备注: 22 pages, 8 figures

点击查看摘要

Abstract:Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach, yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code and model checkpoints will be released upon acceptance.

133. 【2510.16752】Prominence-Aware Artifact Detection and Dataset for Image Super-Resolution

链接https://arxiv.org/abs/2510.16752

作者:Ivan Molodetskikh,Kirill Malyshev,Mark Mirgaleev,Nikita Zagainov,Evgeney Bogatyrev,Dmitriy Vatolin

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Generative image super-resolution, Generative image, rapidly advancing, advancing in visual, detail restoration

备注

点击查看摘要

Abstract:Generative image super-resolution (SR) is rapidly advancing in visual quality and detail restoration. As the capacity of SR models expands, however, so does their tendency to produce artifacts: incorrect, visually disturbing details that reduce perceived quality. Crucially, their perceptual impact varies: some artifacts are barely noticeable while others strongly degrade the image. We argue that artifacts should be characterized by their prominence to human observers rather than treated as uniform binary defects. Motivated by this, we present a novel dataset of 1302 artifact examples from 11 contemporary image-SR methods, where each artifact is paired with a crowdsourced prominence score. Building on this dataset, we train a lightweight regressor that produces spatial prominence heatmaps and outperforms existing methods at detecting prominent artifacts. We release the dataset and code to facilitate prominence-aware evaluation and mitigation of SR artifacts.

134. 【2510.16751】Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

链接https://arxiv.org/abs/2510.16751

作者:Erik Riise,Mehmet Onurcan Kaya,Dim P. Papadopoulos

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:revolutionized Large Language, Large Language Models, Large Language, revolutionized Large, Language Models

备注

点击查看摘要

Abstract:While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.

135. 【2510.16732】A Comprehensive Survey on World Models for Embodied AI

链接https://arxiv.org/abs/2510.16732

作者:Xinqing Li,Xin He,Le Zhang,Yun Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:actions reshape future, future world states, reshape future world, Spatial Latent Grid, agents that perceive

备注: [this https URL](https://github.com/Li-Zn-H/AwesomeWorldModels)

点击查看摘要

Abstract:Embodied AI requires agents that perceive, act, and anticipate how actions reshape future world states. World models serve as internal simulators that capture environment dynamics, enabling forward and counterfactual rollouts to support perception, prediction, and decision making. This survey presents a unified framework for world models in embodied AI. Specifically, we formalize the problem setting and learning objectives, and propose a three-axis taxonomy encompassing: (1) Functionality, Decision-Coupled vs. General-Purpose; (2) Temporal Modeling, Sequential Simulation and Inference vs. Global Difference Prediction; (3) Spatial Representation, Global Latent Vector, Token Feature Sequence, Spatial Latent Grid, and Decomposed Rendering Representation. We systematize data resources and metrics across robotics, autonomous driving, and general video settings, covering pixel prediction quality, state-level understanding, and task performance. Furthermore, we offer a quantitative comparison of state-of-the-art models and distill key open challenges, including the scarcity of unified datasets and the need for evaluation metrics that assess physical consistency over pixel fidelity, the trade-off between model performance and the computational efficiency required for real-time control, and the core modeling difficulty of achieving long-horizon temporal consistency while mitigating error accumulation. Finally, we maintain a curated bibliography at this https URL.

136. 【2510.16730】UKANFormer: Noise-Robust Semantic Segmentation for Coral Reef Mapping via a Kolmogorov-Arnold Network-Transformer Hybrid

链接https://arxiv.org/abs/2510.16730

作者:Tianyang Dou,Ming Li,Jiangying Qin,Xuan Liao,Jiageng Zhong,Armin Gruen,Mengyi Deng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Allen Coral Atlas, Allen Coral, Coral Atlas, coral reef distri-bution, require accurate large-scale

备注

点击查看摘要

Abstract:Coral reefs are vital yet fragile ecosystems that require accurate large-scale mapping for effective conservation. Although global products such as the Allen Coral Atlas provide unprecedented coverage of global coral reef distri-bution, their predictions are frequently limited in spatial precision and semantic consistency, especially in regions requiring fine-grained boundary delineation. To address these challenges, we propose UKANFormer, a novel se-mantic segmentation model designed to achieve high-precision mapping under noisy supervision derived from Allen Coral Atlas. Building upon the UKAN architecture, UKANFormer incorporates a Global-Local Transformer (GL-Trans) block in the decoder, enabling the extraction of both global semantic structures and local boundary details. In experiments, UKANFormer achieved a coral-class IoU of 67.00% and pixel accuracy of 83.98%, outperforming conventional baselines under the same noisy labels setting. Remarkably, the model produces predictions that are visually and structurally more accurate than the noisy labels used for training. These results challenge the notion that data quality directly limits model performance, showing that architectural design can mitigate label noise and sup-port scalable mapping under imperfect supervision. UKANFormer provides a foundation for ecological monitoring where reliable labels are scarce.

137. 【2510.16729】Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models

链接https://arxiv.org/abs/2510.16729

作者:Jianbiao Mei,Yu Yang,Xuemeng Yang,Licheng Wen,Jiajun Lv,Botian Shi,Yong Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:autonomous driving systems, driving systems increasingly, systems increasingly rely, vision-centric world models, Residual World Model

备注

点击查看摘要

Abstract:End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common ineffectiveness in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird's-eye-view representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego-vehicle's actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.

138. 【2510.16714】Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

链接https://arxiv.org/abs/2510.16714

作者:Xiongkun Linghu,Jiangyong Huang,Ziyu Zhu,Baoxiong Jia,Siyuan Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, Language Models, Existing research, primarily due

备注: Project page: [this https URL](https://scenecot.github.io/)

点击查看摘要

Abstract:Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mech- anism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of- Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.

139. 【2510.16709】HumanCM: One Step Human Motion Prediction

链接https://arxiv.org/abs/2510.16709

作者:Liu Haojie,Gao Suixiang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:prediction framework built, one-step human motion, human motion prediction, one-step human, built upon consistency

备注: 6 pages, 2 figures, 2 tables

点击查看摘要

Abstract:We present HumanCM, a one-step human motion prediction framework built upon consistency models. Instead of relying on multi-step denoising as in diffusion-based methods, HumanCM performs efficient single-step generation by learning a self-consistent mapping between noisy and clean motion states. The framework adopts a Transformer-based spatiotemporal architecture with temporal embeddings to model long-range dependencies and preserve motion coherence. Experiments on Human3.6M and HumanEva-I demonstrate that HumanCM achieves comparable or superior accuracy to state-of-the-art diffusion models while reducing inference steps by up to two orders of magnitude.

140. 【2510.16704】Connecting Domains and Contrasting Samples: A Ladder for Domain Generalization

链接https://arxiv.org/abs/2510.16704

作者:Tianxin Wei,Yifan Chen,Xinrui He,Wenxuan Bao,Jingrui He

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Distribution shifts, testing samples frequently, samples frequently occur, impede model generalization, shifts between training

备注: Accepted by KDD 2025

点击查看摘要

Abstract:Distribution shifts between training and testing samples frequently occur in practice and impede model generalization performance. This crucial challenge thereby motivates studies on domain generalization (DG), which aim to predict the label on unseen target domain data by solely using data from source domains. It is intuitive to conceive the class-separated representations learned in contrastive learning (CL) are able to improve DG, while the reality is quite the opposite: users observe directly applying CL deteriorates the performance. We analyze the phenomenon with the insights from CL theory and discover lack of intra-class connectivity in the DG setting causes the deficiency. We thus propose a new paradigm, domain-connecting contrastive learning (DCCL), to enhance the conceptual connectivity across domains and obtain generalizable representations for DG. On the data side, more aggressive data augmentation and cross-domain positive samples are introduced to improve intra-class connectivity. On the model side, to better embed the unseen test domains, we propose model anchoring to exploit the intra-class connectivity in pre-trained representations and complement the anchoring with generative transformation loss. Extensive experiments on five standard DG benchmarks are performed. The results verify that DCCL outperforms state-of-the-art baselines even without domain supervision. The detailed model implementation and the code are provided through this https URL

141. 【2510.16702】SDPA++: A General Framework for Self-Supervised Denoising with Patch Aggregation

链接https://arxiv.org/abs/2510.16702

作者:Huy Minh Nhat Nguyen,Triet Hoang Minh Dao,Chau Vinh Hoang Truong,Cuong Tuan Nguyen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Optical Coherence Tomography, Optical Coherence, Coherence Tomography, detailed three-dimensional views, noisy OCT images

备注: 2025 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB)

点击查看摘要

Abstract:Optical Coherence Tomography (OCT) is a widely used non-invasive imaging technique that provides detailed three-dimensional views of the retina, which are essential for the early and accurate diagnosis of ocular diseases. Consequently, OCT image analysis and processing have emerged as key research areas in biomedical imaging. However, acquiring paired datasets of clean and real-world noisy OCT images for supervised denoising models remains a formidable challenge due to intrinsic speckle noise and practical constraints in clinical imaging environments. To address these issues, we propose SDPA++: A General Framework for Self-Supervised Denoising with Patch Aggregation. Our novel approach leverages only noisy OCT images by first generating pseudo-ground-truth images through self-fusion and self-supervised denoising. These refined images then serve as targets to train an ensemble of denoising models using a patch-based strategy that effectively enhances image clarity. Performance improvements are validated via metrics such as Contrast-to-Noise Ratio (CNR), Mean Square Ratio (MSR), Texture Preservation (TP), and Edge Preservation (EP) on the real-world dataset from the IEEE SPS Video and Image Processing Cup. Notably, the VIP Cup dataset contains only real-world noisy OCT images without clean references, highlighting our method's potential for improving image quality and diagnostic outcomes in clinical practice.

142. 【2510.16688】Pursuing Minimal Sufficiency in Spatial Reasoning

链接https://arxiv.org/abs/2510.16688

作者:Yejie Guo,Yunzhong Hou,Wufei Ma,Meng Tang,Ming-Hsuan Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Minimal Sufficient Spatial, Minimal Sufficient Set, Sufficient Spatial Reasoner, remains a persistent, ability to ground

备注

点击查看摘要

Abstract:Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from \textit{expert models}. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at this https URL.

143. 【2510.16684】Filtering of Small Components for Isosurface Generation

链接https://arxiv.org/abs/2510.16684

作者:Devin Zhao,Rephael Wenger

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:scalar field, mathbb, sigma, Abstract, rightarrow

备注: 8 pages, 6 figures, 5 tables

点击查看摘要

Abstract:Let $f: \mathbb{R}^3 \rightarrow \mathbb{R}$ be a scalar field. An isosurface is a piecewise linear approximation of a level set $f^{-1}(\sigma)$ for some $\sigma \in \mathbb{R}$ built from some regular grid sampling of $f$. Isosurfaces constructed from scanned data such as CT scans or MRIs often contain extremely small components that distract from the visualization and do not form part of any geometric model produced from the data. Simple prefiltering of the data can remove such small components while having no effect on the large components that form the body of the visualization. We present experimental results on such filtering.

144. 【2510.16664】HYDRA: HYbrid knowledge Distillation and spectral Reconstruction Algorithm for high channel hyperspectral camera applications

链接https://arxiv.org/abs/2510.16664

作者:Christopher Thirgood,Oscar Mendez,Erin Ling,Jon Storey,Simon Hadfield

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Spectral Reconstruction, Toggle, spectral Reconstruction Architecture, Toggle Hugging Face, Code Toggle Papers

备注

点击查看摘要

Abstract:Hyperspectral images (HSI) promise to support a range of new applications in computer vision. Recent research has explored the feasibility of generalizable Spectral Reconstruction (SR), the problem of recovering a HSI from a natural three-channel color image in unseen scenarios. However, previous Multi-Scale Attention (MSA) works have only demonstrated sufficient generalizable results for very sparse spectra, while modern HSI sensors contain hundreds of channels. This paper introduces a novel approach to spectral reconstruction via our HYbrid knowledge Distillation and spectral Reconstruction Architecture (HYDRA). Using a Teacher model that encapsulates latent hyperspectral image data and a Student model that learns mappings from natural images to the Teacher's encoded domain, alongside a novel training method, we achieve high-quality spectral reconstruction. This addresses key limitations of prior SR models, providing SOTA performance across all metrics, including an 18\% boost in accuracy, and faster inference times than current SOTA models at various channel depths.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2510.16664 [cs.CV]

(or
arXiv:2510.16664v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.16664

Focus to learn more

              arXiv-issued DOI via DataCite

Submission history From: Christopher Thirgood [view email] [v1]
Sat, 18 Oct 2025 23:29:30 UTC (3,044 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled HYDRA: HYbrid knowledge Distillation and spectral Reconstruction Algorithm for high channel hyperspectral camera applications, by Christopher Thirgood and 4 other authorsView PDFHTML (experimental)TeX Source

view license

Current browse context: cs.CV

prev

|
next

new
|
recent
| 2025-10

Change to browse by:

cs

References Citations

NASA ADSGoogle Scholar
Semantic Scholar

export BibTeX citation
Loading…

BibTeX formatted citation

loading…

Data provided by:

Bookmark

checked=“checked”>
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

Links to Code Toggle

Papers with Code (What is Papers with Code?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)

Related Papers

Recommenders and Search Tools

Link to Influence Flower

Influence Flower (What are Influence Flowers?)

Core recommender toggle

CORE Recommender (What is CORE?)

Author
Venue
Institution
Topic

    About arXivLabs

arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.

Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)

mathjaxToggle();

About
Help

contact arXivClick here to contact arXiv
Contact

subscribe to arXiv mailingsClick here to subscribe
Subscribe

Copyright
Privacy Policy

Web Accessibility Assistance

arXiv Operational Status

145. 【2510.16660】Universal and Transferable Attacks on Pathology Foundation Models

链接https://arxiv.org/abs/2510.16660

作者:Yuntian Wang,Xilin Yang,Che-Yung Shen,Nir Pillar,Aydogan Ozcan

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)

关键词:Universal and Transferable, Transferable Adversarial Perturbations, pathology foundation models, introduce Universal, foundation models

备注: 38 Pages, 8 Figures

点击查看摘要

Abstract:We introduce Universal and Transferable Adversarial Perturbations (UTAP) for pathology foundation models that reveal critical vulnerabilities in their capabilities. Optimized using deep learning, UTAP comprises a fixed and weak noise pattern that, when added to a pathology image, systematically disrupts the feature representation capabilities of multiple pathology foundation models. Therefore, UTAP induces performance drops in downstream tasks that utilize foundation models, including misclassification across a wide range of unseen data distributions. In addition to compromising the model performance, we demonstrate two key features of UTAP: (1) universality: its perturbation can be applied across diverse field-of-views independent of the dataset that UTAP was developed on, and (2) transferability: its perturbation can successfully degrade the performance of various external, black-box pathology foundation models - never seen before. These two features indicate that UTAP is not a dedicated attack associated with a specific foundation model or image dataset, but rather constitutes a broad threat to various emerging pathology foundation models and their applications. We systematically evaluated UTAP across various state-of-the-art pathology foundation models on multiple datasets, causing a significant drop in their performance with visually imperceptible modifications to the input images using a fixed noise pattern. The development of these potent attacks establishes a critical, high-standard benchmark for model robustness evaluation, highlighting a need for advancing defense mechanisms and potentially providing the necessary assets for adversarial training to ensure the safe and reliable deployment of AI in pathology.

146. 【2510.16643】Structured Interfaces for Automated Reasoning with 3D Scene Graphs

链接https://arxiv.org/abs/2510.16643

作者:Aaron Ray,Jacob Arkin,Harel Biggie,Chuchu Fan,Luca Carlone,Nicholas Roy

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词:robot underlying representations, natural language inputs, user natural language, natural language, robot underlying

备注: 25 pages, 3 figures

点击查看摘要

Abstract:In order to provide a robot with the ability to understand and react to a user's natural language inputs, the natural language must be connected to the robot's underlying representations of the world. Recently, large language models (LLMs) and 3D scene graphs (3DSGs) have become a popular choice for grounding natural language and representing the world. In this work, we address the challenge of using LLMs with 3DSGs to ground natural language. Existing methods encode the scene graph as serialized text within the LLM's context window, but this encoding does not scale to large or rich 3DSGs. Instead, we propose to use a form of Retrieval Augmented Generation to select a subset of the 3DSG relevant to the task. We encode a 3DSG in a graph database and provide a query language interface (Cypher) as a tool to the LLM with which it can retrieve relevant data for language grounding. We evaluate our approach on instruction following and scene question-answering tasks and compare against baseline context window and code generation methods. Our results show that using Cypher as an interface to 3D scene graphs scales significantly better to large, rich graphs on both local and cloud-based models. This leads to large performance improvements in grounded language tasks while also substantially reducing the token count of the scene graph content. A video supplement is available at this https URL.

147. 【2510.16641】MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models

链接https://arxiv.org/abs/2510.16641

作者:Young-Jun Lee,Byung-Kwan Lee,Jianshu Zhang,Yechan Hwang,Byungsoo Ko,Han-Gyu Kim,Dongyu Yao,Xuankun Rong,Eojin Joo,Seung-Ho Han,Bowon Ko,Ho-Jin Choi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:shown impressive capabilities, shown impressive, impressive capabilities, capabilities on single-turn, real-world applications

备注: Project website: [this https URL](https://passing2961.github.io/multiverse-project-page/)

点击查看摘要

Abstract:Vision-and-Language Models (VLMs) have shown impressive capabilities on single-turn benchmarks, yet real-world applications often demand more intricate multi-turn dialogues. Existing multi-turn datasets (e.g, MMDU, ConvBench) only partially capture the breadth and depth of conversational scenarios encountered by users. In this work, we introduce MultiVerse, a novel multi-turn conversation benchmark featuring 647 dialogues - each averaging four turns - derived from a diverse set of 12 popular VLM evaluation benchmarks. With 484 tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from factual knowledge and perception to advanced reasoning tasks such as mathematics and coding. To facilitate robust assessment, we propose a checklist-based evaluation method that leverages GPT-4o as the automated evaluator, measuring performance across 37 key aspects, including perceptual accuracy, linguistic clarity, and factual correctness. We evaluate 18 VLMs on MultiVerse, revealing that even the strongest models (e.g., GPT-4o) achieve only a 50% success rate in complex multi-turn conversations, highlighting the dataset's challenging nature. Notably, we find that providing full dialogue context significantly enhances performance for smaller or weaker models, emphasizing the importance of in-context learning. We believe MultiVerse is a landscape of evaluating multi-turn interaction abilities for VLMs.

148. 【2510.16624】Self-Supervised Learning to Fly using Efficient Semantic Segmentation and Metric Depth Estimation for Low-Cost Autonomous UAVs

链接https://arxiv.org/abs/2510.16624

作者:Sebastian Mocanu,Emil Slusanschi,Marius Leordeanu

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:small UAVs operating, paper presents, presents a vision-only, small UAVs, UAVs operating

备注

点击查看摘要

Abstract:This paper presents a vision-only autonomous flight system for small UAVs operating in controlled indoor environments. The system combines semantic segmentation with monocular depth estimation to enable obstacle avoidance, scene exploration, and autonomous safe landing operations without requiring GPS or expensive sensors such as LiDAR. A key innovation is an adaptive scale factor algorithm that converts non-metric monocular depth predictions into accurate metric distance measurements by leveraging semantic ground plane detection and camera intrinsic parameters, achieving a mean distance error of 14.4 cm. The approach uses a knowledge distillation framework where a color-based Support Vector Machine (SVM) teacher generates training data for a lightweight U-Net student network (1.6M parameters) capable of real-time semantic segmentation. For more complex environments, the SVM teacher can be replaced with a state-of-the-art segmentation model. Testing was conducted in a controlled 5x4 meter laboratory environment with eight cardboard obstacles simulating urban structures. Extensive validation across 30 flight tests in a real-world environment and 100 flight tests in a digital-twin environment demonstrates that the combined segmentation and depth approach increases the distance traveled during surveillance and reduces mission time while maintaining 100% success rates. The system is further optimized through end-to-end learning, where a compact student neural network learns complete flight policies from demonstration data generated by our best-performing method, achieving an 87.5% autonomous mission success rate. This work advances practical vision-based drone navigation in structured environments, demonstrating solutions for metric depth estimation and computational efficiency challenges that enable deployment on resource-constrained platforms.

149. 【2510.16611】A Deep Learning Framework for Real-Time Image Processing in Medical Diagnostics: Enhancing Accuracy and Speed in Clinical Applications

链接https://arxiv.org/abs/2510.16611

作者:Melika Filvantorkaman,Maral Filvan Torkaman

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:interpreting high-resolution radiological, high-resolution radiological data, radiological data remains, data remains time-consuming, Medical imaging plays

备注: 20 pages, 4 figures

点击查看摘要

Abstract:Medical imaging plays a vital role in modern diagnostics; however, interpreting high-resolution radiological data remains time-consuming and susceptible to variability among clinicians. Traditional image processing techniques often lack the precision, robustness, and speed required for real-time clinical use. To overcome these limitations, this paper introduces a deep learning framework for real-time medical image analysis designed to enhance diagnostic accuracy and computational efficiency across multiple imaging modalities, including X-ray, CT, and MRI. The proposed system integrates advanced neural network architectures such as U-Net, EfficientNet, and Transformer-based models with real-time optimization strategies including model pruning, quantization, and GPU acceleration. The framework enables flexible deployment on edge devices, local servers, and cloud infrastructures, ensuring seamless interoperability with clinical systems such as PACS and EHR. Experimental evaluations on public benchmark datasets demonstrate state-of-the-art performance, achieving classification accuracies above 92%, segmentation Dice scores exceeding 91%, and inference times below 80 milliseconds. Furthermore, visual explanation tools such as Grad-CAM and segmentation overlays enhance transparency and clinical interpretability. These results indicate that the proposed framework can substantially accelerate diagnostic workflows, reduce clinician workload, and support trustworthy AI integration in time-critical healthcare environments.

150. 【2510.16598】VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs

链接https://arxiv.org/abs/2510.16598

作者:Jiaying Zhu,Yurui Zhu,Xin Lu,Wenrui Yan,Dong Li,Kunlin Liu,Xueyang Fu,Zheng-Jun Zha

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

备注: 22 pages, 8 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks from the massive number of visual tokens generated by high-resolution images or multi-image inputs. Previous token compression techniques are often constrained by heuristic rules that risk discarding critical information. They may suffer from biases, such as attention sinks, that lead to sharp performance drops under aggressive compression ratios. To address these limitations, we reformulate token compression as a lightweight plug-and-play framework that reformulates token compression into an end-to-end learnable decision process. To be specific, we propose VisionSelector, a scorer module decoupled from the MLLM backbone that incorporates a differentiable Top-K mechanism and a curriculum annealing strategy to bridge the training-inference gap, enabling efficient and adaptive token selection various arbitrary compression rates. Remarkably lightweight with only 12.85M trainable parameters, VisionSelector demonstrates generalization across various compression rates and adaptively identifying critical tokens. This leads to superior performance across all compression budgets, evidenced by preserving 100% accuracy on MME with 30% retention budget, outperforming prior methods by 12.14% at 10% retention budget, and doubling prefill speed. Our code is available at this https URL .

151. 【2510.16596】SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense

链接https://arxiv.org/abs/2510.16596

作者:Yiyang Huang,Liang Shi,Yitian Zhang,Yi Xu,Yun Fu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Vision-Language Models, Large Vision-Language, diverse cross-modal tasks, cross-modal tasks, Vision-Language Models

备注

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) excel in diverse cross-modal tasks. However, object hallucination, where models produce plausible but inaccurate object descriptions, remains a significant challenge. In contrast to previous work focusing on LLM components, this paper is the first to trace LVLM hallucinations to visual encoders and identifies three key issues: statistical bias, inherent bias, and vulnerability. To address these challenges, we propose SHIELD, a training-free framework that mitigates hallucinations through three strategies: re-weighting visual tokens to reduce statistical bias, introducing noise-derived tokens to counter inherent bias, and applying adversarial attacks with contrastive decoding to address vulnerability. Experiments demonstrate that SHIELD effectively mitigates object hallucinations across diverse benchmarks and LVLM families. Moreover, SHIELD achieves strong performance on the general LVLM benchmark, highlighting its broad applicability. Code will be released.

152. 【2510.16581】Patronus: Safeguarding Text-to-Image Models against White-Box Adversaries

链接https://arxiv.org/abs/2510.16581

作者:Xinfeng Li,Shengyuan Pang,Jialin Wu,Jiangyi Deng,Huanlong Zhong,Yanjiao Chen,Jie Zhang,Wenyuan Xu

类目:Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

关键词:exhibiting remarkable creativity, produce unsafe images, exhibiting remarkable, remarkable creativity, exploited to produce

备注: 14 pages, 18 figures, 7 tables

点击查看摘要

Abstract:Text-to-image (T2I) models, though exhibiting remarkable creativity in image generation, can be exploited to produce unsafe images. Existing safety measures, e.g., content moderation or model alignment, fail in the presence of white-box adversaries who know and can adjust model parameters, e.g., by fine-tuning. This paper presents a novel defensive framework, named Patronus, which equips T2I models with holistic protection to defend against white-box adversaries. Specifically, we design an internal moderator that decodes unsafe input features into zero vectors while ensuring the decoding performance of benign input features. Furthermore, we strengthen the model alignment with a carefully designed non-fine-tunable learning mechanism, ensuring the T2I model will not be compromised by malicious fine-tuning. We conduct extensive experiments to validate the intactness of the performance on safe content generation and the effectiveness of rejecting unsafe content generation. Results also confirm the resilience of Patronus against various fine-tuning attacks by white-box adversaries.

153. 【2510.16556】Fit for Purpose? Deepfake Detection in the Real World

链接https://arxiv.org/abs/2510.16556

作者:Guangyu Lin,Li Lin,Christina P. Walker,Daniel S. Schiff,Shu Hu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:generative adversarial networks, multimodal large language, large language models, political deepfakes, real-world political deepfakes

备注

点击查看摘要

Abstract:The rapid proliferation of AI-generated content, driven by advances in generative adversarial networks, diffusion models, and multimodal large language models, has made the creation and dissemination of synthetic media effortless, heightening the risks of misinformation, particularly political deepfakes that distort truth and undermine trust in political institutions. In turn, governments, research institutions, and industry have strongly promoted deepfake detection initiatives as solutions. Yet, most existing models are trained and validated on synthetic, laboratory-controlled datasets, limiting their generalizability to the kinds of real-world political deepfakes circulating on social platforms that affect the public. In this work, we introduce the first systematic benchmark based on the Political Deepfakes Incident Database, a curated collection of real-world political deepfakes shared on social media since 2018. Our study includes a systematic evaluation of state-of-the-art deepfake detectors across academia, government, and industry. We find that the detectors from academia and government perform relatively poorly. While paid detection tools achieve relatively higher performance than free-access models, all evaluated detectors struggle to generalize effectively to authentic political deepfakes, and are vulnerable to simple manipulations, especially in the video domain. Results urge the need for politically contextualized deepfake detection frameworks to better safeguard the public in real-world settings.

154. 【2510.16541】Watch Where You Move: Region-aware Dynamic Aggregation and Excitation for Gait Recognition

链接https://arxiv.org/abs/2510.16541

作者:Binyuan Huang,Yongdong Luo,Xianda Guo,Xiawu Zheng,Zheng Zhu,Jiahui Pan,Chengju Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Deep learning-based gait, achieved great success, Region-aware Dynamic Aggregation, Deep learning-based, learning-based gait recognition

备注

点击查看摘要

Abstract:Deep learning-based gait recognition has achieved great success in various applications. The key to accurate gait recognition lies in considering the unique and diverse behavior patterns in different motion regions, especially when covariates affect visual appearance. However, existing methods typically use predefined regions for temporal modeling, with fixed or equivalent temporal scales assigned to different types of regions, which makes it difficult to model motion regions that change dynamically over time and adapt to their specific patterns. To tackle this problem, we introduce a Region-aware Dynamic Aggregation and Excitation framework (GaitRDAE) that automatically searches for motion regions, assigns adaptive temporal scales and applies corresponding attention. Specifically, the framework includes two core modules: the Region-aware Dynamic Aggregation (RDA) module, which dynamically searches the optimal temporal receptive field for each region, and the Region-aware Dynamic Excitation (RDE) module, which emphasizes the learning of motion regions containing more stable behavior patterns while suppressing attention to static regions that are more susceptible to covariates. Experimental results show that GaitRDAE achieves state-of-the-art performance on several benchmark datasets.

155. 【2510.16540】Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions

链接https://arxiv.org/abs/2510.16540

作者:Jihoon Kwon,Kyle Min,Jy-yong Sohn

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:vision-language models trained, understand structured relationships, standard contrastive objectives, recent advances, linguistic elements

备注: Accepted at NeurIPS 2025 (poster). This is the camera-ready version

点击查看摘要

Abstract:Despite recent advances, vision-language models trained with standard contrastive objectives still struggle with compositional reasoning -- the ability to understand structured relationships between visual and linguistic elements. This shortcoming is largely due to the tendency of the text encoder to focus on individual words rather than their relations, a limitation reinforced by contrastive training that primarily aligns words with visual objects. In this paper, we introduce REconstruction and Alignment of text Descriptions (READ), a fine-tuning method designed to enhance compositional reasoning by adding two auxiliary objectives to the contrastive learning: (1) a token-level reconstruction objective, where a frozen pre-trained decoder reconstructs alternative captions based on the embedding of the original caption; and (2) a sentence-level alignment objective, which explicitly aligns paraphrased sentences in the embedding space. We show that READ-CLIP, a model derived by applying the READ method to the pre-trained CLIP model, achieves the state-of-the-art performance across five major compositional reasoning benchmarks, outperforming the strongest conventional fine-tuning baseline by up to 4.1%. Furthermore, applying the READ to existing CLIP variants (including NegCLIP and FSC-CLIP) also improves performance on these benchmarks. Quantitative and qualitative analyses reveal that our proposed objectives -- reconstruction and alignment -- offer complementary benefits: the former encourages the encoder to capture relationships between words within a caption, while the latter ensures consistent representations for paraphrases expressed with different wording.

156. 【2510.16514】Image Categorization and Search via a GAT Autoencoder and Representative Models

链接https://arxiv.org/abs/2510.16514

作者:Duygu Sap,Martin Lotz,Connor Mattinson

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:graph attention network, categorization and retrieval, attention network, propose a method, image

备注: 10 pages, 22 figures, Under review

点击查看摘要

Abstract:We propose a method for image categorization and retrieval that leverages graphs and a graph attention network (GAT)-based autoencoder. Our approach is representative-centric, that is, we execute the categorization and retrieval process via the representative models we construct for the images and image categories. We utilize a graph where nodes represent images (or their representatives) and edges capture similarity relationships. GAT highlights important features and relationships between images, enabling the autoencoder to construct context-aware latent representations that capture the key features of each image relative to its neighbors. We obtain category representatives from these embeddings and categorize a query image by comparing its representative to the category representatives. We then retrieve the most similar image to the query image within its identified category. We demonstrate the effectiveness of our representative-centric approach through experiments with both the GAT autoencoders and standard feature-based techniques.

157. 【2510.16508】OOS-DSD: Improving Out-of-stock Detection in Retail Images using Auxiliary Tasks

链接https://arxiv.org/abs/2510.16508

作者:Franko Šikić,Sven Lončarić

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:important retail verification, SOTA OOS detection, OOS detection, retail verification process, OOS detection methods

备注

点击查看摘要

Abstract:Out-of-stock (OOS) detection is a very important retail verification process that aims to infer the unavailability of products in their designated areas on the shelf. In this paper, we introduce OOS-DSD, a novel deep learning-based method that advances OOS detection through auxiliary learning. In particular, we extend a well-established YOLOv8 object detection architecture with additional convolutional branches to simultaneously detect OOS, segment products, and estimate scene depth. While OOS detection and product segmentation branches are trained using ground truth data, the depth estimation branch is trained using pseudo-labeled annotations produced by the state-of-the-art (SOTA) depth estimation model Depth Anything V2. Furthermore, since the aforementioned pseudo-labeled depth estimates display relative depth, we propose an appropriate depth normalization procedure that stabilizes the training process. The experimental results show that the proposed method surpassed the performance of the SOTA OOS detection methods by 1.8% of the mean average precision (mAP). In addition, ablation studies confirm the effectiveness of auxiliary learning and the proposed depth normalization procedure, with the former increasing mAP by 3.7% and the latter by 4.2%.

158. 【2510.16505】PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

链接https://arxiv.org/abs/2510.16505

作者:Lukas Selch,Yufang Hou,M. Jehanzeb Mirza,Sivan Doveh,James Glass,Rogerio Feris,Wei Lin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:increasingly applied, remains unclear, reliably understand, Large Multimodal Models, Multimodal Models

备注

点击查看摘要

Abstract:Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we further introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.

159. 【2510.16463】HGC-Avatar: Hierarchical Gaussian Compression for Streamable Dynamic 3D Avatars

链接https://arxiv.org/abs/2510.16463

作者:Haocheng Tang,Ruoke Yan,Xinhui Yin,Qi Zhang,Xinfeng Zhang,Siwei Ma,Wen Gao,Chuanmin Jia

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:showing strong potential, Recent advances, Gaussian Splatting, enabled fast, showing strong

备注: ACM International Conference on Multimedia 2025

点击查看摘要

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled fast, photorealistic rendering of dynamic 3D scenes, showing strong potential in immersive communication. However, in digital human encoding and transmission, the compression methods based on general 3DGS representations are limited by the lack of human priors, resulting in suboptimal bitrate efficiency and reconstruction quality at the decoder side, which hinders their application in streamable 3D avatar systems. We propose HGC-Avatar, a novel Hierarchical Gaussian Compression framework designed for efficient transmission and high-quality rendering of dynamic avatars. Our method disentangles the Gaussian representation into a structural layer, which maps poses to Gaussians via a StyleUNet-based generator, and a motion layer, which leverages the SMPL-X model to represent temporal pose variations compactly and semantically. This hierarchical design supports layer-wise compression, progressive decoding, and controllable rendering from diverse pose inputs such as video sequences or text. Since people are most concerned with facial realism, we incorporate a facial attention mechanism during StyleUNet training to preserve identity and expression details under low-bitrate constraints. Experimental results demonstrate that HGC-Avatar provides a streamable solution for rapid 3D avatar rendering, while significantly outperforming prior methods in both visual quality and compression efficiency.

160. 【2510.16457】NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation

链接https://arxiv.org/abs/2510.16457

作者:Peiran Xu,Xicheng Gong,Yadong MU

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:work we concentrate, make decisions based, future, goal-oriented VLN datasets, Abstract

备注: ICCV 2025

点击查看摘要

Abstract:In this work we concentrate on the task of goal-oriented Vision-and-Language Navigation (VLN). Existing methods often make decisions based on historical information, overlooking the future implications and long-term outcomes of the actions. In contrast, we aim to develop a foresighted agent. Specifically, we draw upon Q-learning to train a Q-model using large-scale unlabeled trajectory data, in order to learn the general knowledge regarding the layout and object relations within indoor scenes. This model can generate a Q-feature, analogous to the Q-value in traditional Q-network, for each candidate action, which describes the potential future information that may be observed after taking the specific action. Subsequently, a cross-modal future encoder integrates the task-agnostic Q-feature with navigation instructions to produce a set of action scores reflecting future prospects. These scores, when combined with the original scores based on history, facilitate an A*-style searching strategy to effectively explore the regions that are more likely to lead to the destination. Extensive experiments conducted on widely used goal-oriented VLN datasets validate the effectiveness of the proposed method.

161. 【2510.16450】Instance-Aware Pseudo-Labeling and Class-Focused Contrastive Learning for Weakly Supervised Domain Adaptive Segmentation of Electron Microscopy

链接https://arxiv.org/abs/2510.16450

作者:Shan Xiong,Jiabao Chen,Ye Wang,Jialin Peng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:numerous mitochondria instances, Annotation-efficient segmentation, electron microscopy, neuroscience research, numerous mitochondria

备注

点击查看摘要

Abstract:Annotation-efficient segmentation of the numerous mitochondria instances from various electron microscopy (EM) images is highly valuable for biological and neuroscience research. Although unsupervised domain adaptation (UDA) methods can help mitigate domain shifts and reduce the high costs of annotating each domain, they typically have relatively low performance in practical applications. Thus, we investigate weakly supervised domain adaptation (WDA) that utilizes additional sparse point labels on the target domain, which require minimal annotation effort and minimal expert knowledge. To take full use of the incomplete and imprecise point annotations, we introduce a multitask learning framework that jointly conducts segmentation and center detection with a novel cross-teaching mechanism and class-focused cross-domain contrastive learning. While leveraging unlabeled image regions is essential, we introduce segmentation self-training with a novel instance-aware pseudo-label (IPL) selection strategy. Unlike existing methods that typically rely on pixel-wise pseudo-label filtering, the IPL semantically selects reliable and diverse pseudo-labels with the help of the detection task. Comprehensive validations and comparisons on challenging datasets demonstrate that our method outperforms existing UDA and WDA methods, significantly narrowing the performance gap with the supervised upper bound. Furthermore, under the UDA setting, our method also achieves substantial improvements over other UDA techniques.

162. 【2510.16446】VIPAMIN: Visual Prompt Initialization via Embedding Selection and Subspace Expansion

链接https://arxiv.org/abs/2510.16446

作者:Jaekyun Park,Hye Won Chung

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:fully fine-tuning pretrained, large-scale foundation models, fine-tuning pretrained networks, fully fine-tuning, prohibitively resource-intensive

备注: NeurIPS 2025

点击查看摘要

Abstract:In the era of large-scale foundation models, fully fine-tuning pretrained networks for each downstream task is often prohibitively resource-intensive. Prompt tuning offers a lightweight alternative by introducing tunable prompts while keeping the backbone frozen. However, existing visual prompt tuning methods often fail to specialize the prompts or enrich the representation space--especially when applied to self-supervised backbones. We show that these limitations become especially pronounced in challenging tasks and data-scarce settings, where effective adaptation is most critical. In this work, we introduce VIPAMIN, a visual prompt initialization strategy that enhances adaptation of self-supervised models by (1) aligning prompts with semantically informative regions in the embedding space, and (2) injecting novel representational directions beyond the pretrained subspace. Despite its simplicity--requiring only a single forward pass and lightweight operations--VIPAMIN consistently improves performance across diverse tasks and dataset sizes, setting a new state of the art in visual prompt tuning. Our code is available at this https URL.

163. 【2510.16445】Enhancing Rotated Object Detection via Anisotropic Gaussian Bounding Box and Bhattacharyya Distance

链接https://arxiv.org/abs/2510.16445

作者:Chien Thai,Mai Xuan Trang,Huong Ninh,Hoang Hiep Ly,Anh Son Le

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Detecting rotated objects, Detecting rotated, remote sensing, computer vision, aerial imagery

备注: Neurocomputing

点击查看摘要

Abstract:Detecting rotated objects accurately and efficiently is a significant challenge in computer vision, particularly in applications such as aerial imagery, remote sensing, and autonomous driving. Although traditional object detection frameworks are effective for axis-aligned objects, they often underperform in scenarios involving rotated objects due to their limitations in capturing orientation variations. This paper introduces an improved loss function aimed at enhancing detection accuracy and robustness by leveraging the Gaussian bounding box representation and Bhattacharyya distance. In addition, we advocate for the use of an anisotropic Gaussian representation to address the issues associated with isotropic variance in square-like objects. Our proposed method addresses these challenges by incorporating a rotation-invariant loss function that effectively captures the geometric properties of rotated objects. We integrate this proposed loss function into state-of-the-art deep learning-based rotated object detection detectors, and extensive experiments demonstrated significant improvements in mean Average Precision metrics compared to existing methods. The results highlight the potential of our approach to establish new benchmark in rotated object detection, with implications for a wide range of applications requiring precise and reliable object localization irrespective of orientation.

164. 【2510.16444】RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba

链接https://arxiv.org/abs/2510.16444

作者:Kunyu Peng,Di Wen,Jia Fu,Jiamin Wu,Kailun Yang,Junwei Zheng,Ruiping Liu,Yufan Chen,Yuqian Fu,Danda Pani Paudel,Luc Van Gool,Rainer Stiefelhagen

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO); Image and Video Processing (eess.IV)

关键词:Referring Atomic Video, natural language descriptions, Referring Atomic, Video Action Recognition, RAVAR emphasizes precise

备注: Extended version of ECCV 2024 paper [arXiv:2407.01872](https://arxiv.org/abs/2407.01872) . The dataset and code are released at [this https URL](https://github.com/KPeng9510/refAVA2)

点击查看摘要

Abstract:Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly critical for interactive human action analysis in complex multi-person scenarios. In this work, we extend our previously introduced RefAVA dataset to RefAVA++, which comprises 2.9 million frames and 75.1k annotated persons in total. We benchmark this dataset using baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as our earlier model, RefAtomNet. Although RefAtomNet surpasses other baselines by incorporating agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, leading to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome the aforementioned limitations, we introduce RefAtomNet++, a novel framework that advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism combined with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both partial-keyword and scene-attribute levels. Moreover, we design a multi-hierarchical semantic-aligned cross-attention strategy, enabling more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at this https URL.

165. 【2510.16442】EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning

链接https://arxiv.org/abs/2510.16442

作者:Haoran Sun,Chen Cai,Huiping Zhuang,Kong Aik Lee,Lap-Pui Chau,Yi Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:facilitated artistic creation, deepfake video technology, deepfake video detection, spread misinformation, rapid development

备注

点击查看摘要

Abstract:The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods face issues such as a lack of transparency in their principles and insufficient generalization capabilities to cope with evolving forgery techniques. This highlights an urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs the EDVD-LLaMA multimodal, a large language model (MLLM) reasoning framework, which provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates a Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic information input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ benchmark dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision for reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery methods and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The source code and dataset will be publicly available.

166. 【2510.16438】LightGlueStick: a Fast and Robust Glue for Joint Point-Line Matching

链接https://arxiv.org/abs/2510.16438

作者:Aidyn Ubingazhibov,Rémi Pautrat,Iago Suárez,Shaohui Liu,Marc Pollefeys,Viktor Larsson

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:complementary local features, combination has proven, proven effective, local feature matchers, complementary local

备注: Accepted at ICCVW 2025

点击查看摘要

Abstract:Lines and points are complementary local features, whose combination has proven effective for applications such as SLAM and Structure-from-Motion. The backbone of these pipelines are the local feature matchers, establishing correspondences across images. Traditionally, point and line matching have been treated as independent tasks. Recently, GlueStick proposed a GNN-based network that simultaneously operates on points and lines to establish matches. While running a single joint matching reduced the overall computational complexity, the heavy architecture prevented real-time applications or deployment to edge devices. Inspired by recent progress in point matching, we propose LightGlueStick, a lightweight matcher for points and line segments. The key novel component in our architecture is the Attentional Line Message Passing (ALMP), which explicitly exposes the connectivity of the lines to the network, allowing for efficient communication between nodes. In thorough experiments we show that LightGlueStick establishes a new state-of-the-art across different benchmarks. The code is available at this https URL.

Comments:
Accepted at ICCVW 2025

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2510.16438 [cs.CV]

(or
arXiv:2510.16438v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.16438

Focus to learn more

              arXiv-issued DOI via DataCite</p>
167. 【2510.16416】SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

链接https://arxiv.org/abs/2510.16416

作者:Xiaojun Guo,Runyu Zhou,Yifei Wang,Qi Zhang,Chenheng Zhang,Stefanie Jegelka,Xiaohan Wang,Jiajun Chai,Guojun Yin,Wei Lin,Yisen Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:shown remarkable abilities, integrating large language, large language models, shown remarkable, remarkable abilities

备注

点击查看摘要

Abstract:Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives-such as predicting image rotation or reconstructing masked patches-into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors-such as task difficulty, model scale, and semantic alignment with the target domain-that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.

168. 【2510.16410】REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting

链接https://arxiv.org/abs/2510.16410

作者:Changyue Shi,Minghao Chen,Yiping Mao,Chuxiao Yang,Xinyuan Hu,Jiajun Ding,Zhou Yu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Bridging the gap, complex human instructions, vision and robotics, gap between complex, complex human

备注

点击查看摘要

Abstract:Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically, multiple global views are first fed into the MLLM agent in parallel for coarse-level localization, aggregating responses to robustly identify the target object. Then, several close-up novel views of the object are synthesized to perform fine-grained local segmentation, yielding accurate and consistent 3D masks. Extensive experiments show that REALM achieves remarkable performance in interpreting both explicit and implicit instructions across LERF, 3D-OVS, and our newly introduced REALM3D benchmarks. Furthermore, our agent framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer, demonstrating its practical utility and versatility. Project page: this https URL.

169. 【2510.16396】SPLite Hand: Sparsity-Aware Lightweight 3D Hand Pose Estimation

链接https://arxiv.org/abs/2510.16396

作者:Yeh Keng Hao,Hsu Tzu Wei,Sun Min

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:deep learning models, critical challenge, increasing ubiquity, deployment of deep, deep learning

备注: Accepted to AICCC 2025

点击查看摘要

Abstract:With the increasing ubiquity of AR/VR devices, the deployment of deep learning models on edge devices has become a critical challenge. These devices require real-time inference, low power consumption, and minimal latency. Many framework designers face the conundrum of balancing efficiency and performance. We design a light framework that adopts an encoder-decoder architecture and introduces several key contributions aimed at improving both efficiency and accuracy. We apply sparse convolution on a ResNet-18 backbone to exploit the inherent sparsity in hand pose images, achieving a 42% end-to-end efficiency improvement. Moreover, we propose our SPLite decoder. This new architecture significantly boosts the decoding process's frame rate by 3.1x on the Raspberry Pi 5, while maintaining accuracy on par. To further optimize performance, we apply quantization-aware training, reducing memory usage while preserving accuracy (PA-MPJPE increases only marginally from 9.0 mm to 9.1 mm on FreiHAND). Overall, our system achieves a 2.98x speed-up on a Raspberry Pi 5 CPU (BCM2712 quad-core Arm A76 processor). Our method is also evaluated on compound benchmark datasets, demonstrating comparable accuracy to state-of-the-art approaches while significantly enhancing computational efficiency.

170. 【2510.16377】Demeter: A Parametric Model of Crop Plant Morphology from the Real World

链接https://arxiv.org/abs/2510.16377

作者:Tianhang Cheng,Albert J. Zhai,Evan Z. Chen,Rui Zhou,Yawen Deng,Zitong Li,Kejie Zhao,Janice Shiu,Qianyu Zhao,Yide Xu,Xinlei Wang,Yuan Shen,Sheng Wang,Lisa Ainsworth,Kaiyu Guan,Shenlong Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:showed broad utility, objects has gained, gained popularity, popularity in vision, vision and graphics

备注: ICCV 2025

点击查看摘要

Abstract:Learning 3D parametric shape models of objects has gained popularity in vision and graphics and has showed broad utility in 3D reconstruction, generation, understanding, and simulation. While powerful models exist for humans and animals, equally expressive approaches for modeling plants are lacking. In this work, we present Demeter, a data-driven parametric model that encodes key factors of a plant morphology, including topology, shape, articulation, and deformation into a compact learned representation. Unlike previous parametric models, Demeter handles varying shape topology across various species and models three sources of shape variation: articulation, subcomponent shape variation, and non-rigid deformation. To advance crop plant modeling, we collected a large-scale, ground-truthed dataset from a soybean farm as a testbed. Experiments show that Demeter effectively synthesizes shapes, reconstructs structures, and simulates biophysical processes. Code and data is available at this https URL.

171. 【2510.16375】WatchRoadv2: Pothole Detection, Geospatial Mapping, and Intelligent Road Governance

链接https://arxiv.org/abs/2510.16375

作者:Rishi Raj Sahoo,Surbhi Saswati Mohanty,Subhankar Mishra

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:pose significant safety, significant safety hazards, under-maintained road networks, potholes pose significant, Road potholes pose

备注: Under review

点击查看摘要

Abstract:Road potholes pose significant safety hazards and maintenance challenges, particularly on India's diverse and under-maintained road networks. This paper presents iWatchRoadv2, a fully automated end-to-end platform for real-time pothole detection, GPS-based geotagging, and dynamic road health visualization using OpenStreetMap (OSM). We curated a self-annotated dataset of over 7,000 dashcam frames capturing diverse Indian road conditions, weather patterns, and lighting scenarios, which we used to fine-tune the Ultralytics YOLO model for accurate pothole detection. The system synchronizes OCR-extracted video timestamps with external GPS logs to precisely geolocate each detected pothole, enriching detections with comprehensive metadata, including road segment attribution and contractor information managed through an optimized backend database. iWatchRoadv2 introduces intelligent governance features that enable authorities to link road segments with contract metadata through a secure login interface. The system automatically sends alerts to contractors and officials when road health deteriorates, supporting automated accountability and warranty enforcement. The intuitive web interface delivers actionable analytics to stakeholders and the public, facilitating evidence-driven repair planning, budget allocation, and quality assessment. Our cost-effective and scalable solution streamlines frame processing and storage while supporting seamless public engagement for urban and rural deployments. By automating the complete pothole monitoring lifecycle, from detection to repair verification, iWatchRoadv2 enables data-driven smart city management, transparent governance, and sustainable improvements in road infrastructure maintenance. The platform and live demonstration are accessible at this https URL.

172. 【2510.16371】Cataract-LMM: Large-Scale, Multi-Source, Multi-Task Benchmark for Deep Learning in Surgical Video Analysis

链接https://arxiv.org/abs/2510.16371

作者:Mohammad Javad Ahmadi,Iman Gandomi,Parisa Abdi,Seyed-Farzad Mohammadi,Amirhossein Taslimi,Mehdi Khodaparast,Hassan Hashemi,Mahdi Tavakoli,Hamid D. Taghirad

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:computer-assisted surgery systems, surgery systems depends, depends on large-scale, development of computer-assisted, systems depends

备注: 20 pages, 11 figures, 11 tables. Data descriptor for the Cataract-LMM benchmark dataset. Source code and dataset are available

点击查看摘要

Abstract:The development of computer-assisted surgery systems depends on large-scale, annotated datasets. Current resources for cataract surgery often lack the diversity and annotation depth needed to train generalizable deep-learning models. To address this gap, we present a dataset of 3,000 phacoemulsification cataract surgery videos from two surgical centers, performed by surgeons with a range of experience levels. This resource is enriched with four annotation layers: temporal surgical phases, instance segmentation of instruments and anatomical structures, instrument-tissue interaction tracking, and quantitative skill scores based on the established competency rubrics like the ICO-OSCAR. The technical quality of the dataset is supported by a series of benchmarking experiments for key surgical AI tasks, including workflow recognition, scene segmentation, and automated skill assessment. Furthermore, we establish a domain adaptation baseline for the phase recognition task by training a model on a subset of surgical centers and evaluating its performance on a held-out center. The dataset and annotations are available in Google Form (this https URL).

173. 【2510.16370】MIRAD - A comprehensive real-world robust anomaly detection dataset for Mass Individualization

链接https://arxiv.org/abs/2510.16370

作者:Pulin Li,Guocheng Wu,Li Yin,Yuxin Zheng,Wei Zhang,Yanjie Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:leverages community collaboration, manufacturing leverages community, realize mass individualization, leverages community, community collaboration

备注: [this https URL](https://github.com/wu33learn/MIRAD)

点击查看摘要

Abstract:Social manufacturing leverages community collaboration and scattered resources to realize mass individualization in modern industry. However, this paradigm shift also introduces substantial challenges in quality control, particularly in defect detection. The main difficulties stem from three aspects. First, products often have highly customized configurations. Second, production typically involves fragmented, small-batch orders. Third, imaging environments vary considerably across distributed sites. To overcome the scarcity of real-world datasets and tailored algorithms, we introduce the Mass Individualization Robust Anomaly Detection (MIRAD) dataset. As the first benchmark explicitly designed for anomaly detection in social manufacturing, MIRAD captures three critical dimensions of this domain: (1) diverse individualized products with large intra-class variation, (2) data collected from six geographically dispersed manufacturing nodes, and (3) substantial imaging heterogeneity, including variations in lighting, background, and motion conditions. We then conduct extensive evaluations of state-of-the-art (SOTA) anomaly detection methods on MIRAD, covering one-class, multi-class, and zero-shot approaches. Results show a significant performance drop across all models compared with conventional benchmarks, highlighting the unresolved complexities of defect detection in real-world individualized production. By bridging industrial requirements and academic research, MIRAD provides a realistic foundation for developing robust quality control solutions essential for Industry 5.0. The dataset is publicly available at this https URL.

174. 【2510.16342】Beyond Fixed Anchors: Precisely Erasing Concepts with Sibling Exclusive Counterparts

链接https://arxiv.org/abs/2510.16342

作者:Tong Zhang,Ru Zhang,Jianyi Liu,Zhen Yang,Gongshen Liu

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:diffusion models commonly, models commonly rely, Sibling Exclusive Concepts, define Sibling Exclusive, fixed anchor strategies

备注

点击查看摘要

Abstract:Existing concept erasure methods for text-to-image diffusion models commonly rely on fixed anchor strategies, which often lead to critical issues such as concept re-emergence and erosion. To address this, we conduct causal tracing to reveal the inherent sensitivity of erasure to anchor selection and define Sibling Exclusive Concepts as a superior class of anchors. Based on this insight, we propose \textbf{SELECT} (Sibling-Exclusive Evaluation for Contextual Targeting), a dynamic anchor selection framework designed to overcome the limitations of fixed anchors. Our framework introduces a novel two-stage evaluation mechanism that automatically discovers optimal anchors for precise erasure while identifying critical boundary anchors to preserve related concepts. Extensive evaluations demonstrate that SELECT, as a universal anchor solution, not only efficiently adapts to multiple erasure frameworks but also consistently outperforms existing baselines across key performance metrics, averaging only 4 seconds for anchor mining of a single concept.

175. 【2510.16335】On the Provable Importance of Gradients for Language-Assisted Image Clustering

链接https://arxiv.org/abs/2510.16335

作者:Bo Peng,Jie Lu,Guangquan Zhang,Zhen Fang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Language-assisted Image Clustering, recently emerged problem, facilitate image clustering, Language-assisted Image, problem of Language-assisted

备注: revised and extended version of ICCV2025

点击查看摘要

Abstract:This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed as GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as extremely special cases of itself. Empirically, extensive experiments show that GradNorm achieves the state-of-the-art clustering performance on various benchmarks.

176. 【2510.16333】RL makes MLLMs see better than SFT

链接https://arxiv.org/abs/2510.16333

作者:Junha Song,Sangdoo Yun,Dongyoon Han,Jaegul Choo,Byeongho Heo

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Multimodal Language Model, Language Model, Multimodal Language, immense parameter scale, assumption in Multimodal

备注

点击查看摘要

Abstract:A dominant assumption in Multimodal Language Model (MLLM) research is that its performance is largely inherited from the LLM backbone, given its immense parameter scale and remarkable capabilities. This has created a void in the understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms, from Supervised Finetuning (SFT) to Reinforcement Learning (RL), magnifies this oversight-namely, the significant lack of analysis on how such training reshapes the vision encoder as well as the MLLM. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT in strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes MLLM's underlying visual representations. Specifically, the key finding of our study is that RL produces stronger and precisely localized visual representations compared to SFT, boosting the ability of the vision encoder for MLLM. We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs. Project page available at this https URL

177. 【2510.16332】okenAR: Multiple Subject Generation via Autoregressive Token-level enhancement

链接https://arxiv.org/abs/2510.16332

作者:Haiyue Sun,Qingdong He,Jinlong Peng,Peng Tang,Jiangning Zhang,Junwei Zhu,Xiaobin Hu,Shuicheng Yan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:shown remarkable success, Autoregressive Model, image generation, shown remarkable, remarkable success

备注

点击查看摘要

Abstract:Autoregressive Model (AR) has shown remarkable success in conditional image generation. However, these approaches for multiple reference generation struggle with decoupling different reference identities. In this work, we propose the TokenAR framework, specifically focused on a simple but effective token-level enhancement mechanism to address reference identity confusion problem. Such token-level enhancement consists of three parts, 1). Token Index Embedding clusters the tokens index for better representing the same reference images; 2). Instruct Token Injection plays as a role of extra visual feature container to inject detailed and complementary priors for reference tokens; 3). The identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each this http URL token-enhancement framework significantly augments the capabilities of existing AR based methods in conditional image generation, enabling good identity consistency while preserving high quality background reconstruction. Driven by the goal of high-quality and high-diversity in multi-subject generation, we introduce the InstructAR Dataset, the first open-source, large-scale, multi-reference input, open domain image generation dataset that includes 28K training pairs, each example has two reference subjects, a relative prompt and a background with mask annotation, curated for multiple reference image generation training and evaluating. Comprehensive experiments validate that our approach surpasses current state-of-the-art models in multiple reference image generation task. The implementation code and datasets will be made publicly. Codes are available, see this https URL

178. 【2510.16326】DiffusionX: Efficient Edge-Cloud Collaborative Image Generation with Multi-Round Prompt Evolution

链接https://arxiv.org/abs/2510.16326

作者:Yi Wei,Shunpu Tang,Liang Zhao,Qiangian Yang(College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China)

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:driven remarkable progress, Recent advances, driven remarkable, remarkable progress, Recent

备注

点击查看摘要

Abstract:Recent advances in diffusion models have driven remarkable progress in image generation. However, the generation process remains computationally intensive, and users often need to iteratively refine prompts to achieve the desired results, further increasing latency and placing a heavy burden on cloud resources. To address this challenge, we propose DiffusionX, a cloud-edge collaborative framework for efficient multi-round, prompt-based generation. In this system, a lightweight on-device diffusion model interacts with users by rapidly producing preview images, while a high-capacity cloud model performs final refinements after the prompt is finalized. We further introduce a noise level predictor that dynamically balances the computation load, optimizing the trade-off between latency and cloud workload. Experiments show that DiffusionX reduces average generation time by 15.8% compared with Stable Diffusion v1.5, while maintaining comparable image quality. Moreover, it is only 0.9% slower than Tiny-SD with significantly improved image quality, thereby demonstrating efficiency and scalability with minimal overhead.

179. 【2510.16325】Scale-DiT: Ultra-High-Resolution Image Generation with Hierarchical Local Attention

链接https://arxiv.org/abs/2510.16325

作者:Yuyao Zhang,Yu-Wing Tai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:models remain constrained, fine-grained texture synthesis, current diffusion models, diffusion models remain, globally coherent structure

备注: 22 pages

点击查看摘要

Abstract:Ultra-high-resolution text-to-image generation demands both fine-grained texture synthesis and globally coherent structure, yet current diffusion models remain constrained to sub-$1K \times 1K$ resolutions due to the prohibitive quadratic complexity of attention and the scarcity of native $4K$ training data. We present \textbf{Scale-DiT}, a new diffusion framework that introduces hierarchical local attention with low-resolution global guidance, enabling efficient, scalable, and semantically coherent image synthesis at ultra-high resolutions. Specifically, high-resolution latents are divided into fixed-size local windows to reduce attention complexity from quadratic to near-linear, while a low-resolution latent equipped with scaled positional anchors injects global semantics. A lightweight LoRA adaptation bridges global and local pathways during denoising, ensuring consistency across structure and detail. To maximize inference efficiency, we repermute token sequence in Hilbert curve order and implement a fused-kernel for skipping masked operations, resulting in a GPU-friendly design. Extensive experiments demonstrate that Scale-DiT achieves more than $2\times$ faster inference and lower memory usage compared to dense attention baselines, while reliably scaling to $4K \times 4K$ resolution without requiring additional high-resolution training data. On both quantitative benchmarks (FID, IS, CLIP Score) and qualitative comparisons, Scale-DiT delivers superior global coherence and sharper local detail, matching or outperforming state-of-the-art methods that rely on native 4K training. Taken together, these results highlight hierarchical local attention with guided low-resolution anchors as a promising and effective approach for advancing ultra-high-resolution image generation.

180. 【2510.16320】Scaling Laws for Deepfake Detection

链接https://arxiv.org/abs/2510.16320

作者:Wenhao Wang,Longqi Cai,Taihong Xiao,Yuxiao Wang,Ming-Hsuan Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:paper presents, presents a systematic, systematic study, deepfake detection task, deepfake

备注

点击查看摘要

Abstract:This paper presents a systematic study of scaling laws for the deepfake detection task. Specifically, we analyze the model performance against the number of real image domains, deepfake generation methods, and training images. Since no existing dataset meets the scale requirements for this research, we construct ScaleDF, the largest dataset to date in this field, which contains over 5.8 million real images from 51 different datasets (domains) and more than 8.8 million fake images generated by 102 deepfake methods. Using ScaleDF, we observe power-law scaling similar to that shown in large language models (LLMs). Specifically, the average detection error follows a predictable power-law decay as either the number of real domains or the number of deepfake methods increases. This key observation not only allows us to forecast the number of additional real domains or deepfake methods required to reach a target performance, but also inspires us to counter the evolving deepfake technology in a data-centric manner. Beyond this, we examine the role of pre-training and data augmentations in deepfake detection under scaling, as well as the limitations of scaling itself.

181. 【2510.16319】Stroke2Sketch: Harnessing Stroke Attributes for Training-Free Sketch Generation

链接https://arxiv.org/abs/2510.16319

作者:Rui Yang,Huining Li,Yiyi Long,Xiaojun Wu,Shengfeng He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Generating sketches guided, styles requires precise, requires precise transfer, preserving semantic structure, reference styles requires

备注: ICCV 2025

点击查看摘要

Abstract:Generating sketches guided by reference styles requires precise transfer of stroke attributes, such as line thickness, deformation, and texture sparsity, while preserving semantic structure and content fidelity. To this end, we propose Stroke2Sketch, a novel training-free framework that introduces cross-image stroke attention, a mechanism embedded within self-attention layers to establish fine-grained semantic correspondences and enable accurate stroke attribute transfer. This allows our method to adaptively integrate reference stroke characteristics into content images while maintaining structural integrity. Additionally, we develop adaptive contrast enhancement and semantic-focused attention to reinforce content preservation and foreground emphasis. Stroke2Sketch effectively synthesizes stylistically faithful sketches that closely resemble handcrafted results, outperforming existing methods in expressive stroke control and semantic coherence. Codes are available at this https URL.

182. 【2510.16295】OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models

链接https://arxiv.org/abs/2510.16295

作者:Ryoto Miyamoto,Xin Fan,Fuyuko Kido,Tsuneo Matsumoto,Hayato Yamana

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:large vision-language models, highlights fundamental challenges, evaluating membership inference, vision-language models, highlights fundamental

备注

点击查看摘要

Abstract:OpenLVLM-MIA is a new benchmark that highlights fundamental challenges in evaluating membership inference attacks (MIA) against large vision-language models (LVLMs). While prior work has reported high attack success rates, our analysis suggests that these results often arise from detecting distributional bias introduced during dataset construction rather than from identifying true membership status. To address this issue, we introduce a controlled benchmark of 6{,}000 images where the distributions of member and non-member samples are carefully balanced, and ground-truth membership labels are provided across three distinct training stages. Experiments using OpenLVLM-MIA demonstrated that the performance of state-of-the-art MIA methods converged to random chance under unbiased conditions. By offering a transparent and unbiased benchmark, OpenLVLM-MIA clarifies the current limitations of MIA research on LVLMs and provides a solid foundation for developing stronger privacy-preserving techniques.

183. 【2510.16290】Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models

链接https://arxiv.org/abs/2510.16290

作者:Yue Zheng,Xiufang Shi,Jiming Chen,Yuanchao Shu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:rapidly advanced, advanced by recent, recent development, development of Vision-Language, Vision-Language Models

备注

点击查看摘要

Abstract:Video anomaly detection (VAD) has rapidly advanced by recent development of Vision-Language Models (VLMs). While these models offer superior zero-shot detection capabilities, their immense computational cost and unstable visual grounding performance hinder real-time deployment. To overcome these challenges, we introduce Cerberus, a two-stage cascaded system designed for efficient yet accurate real-time VAD. Cerberus learns normal behavioral rules offline, and combines lightweight filtering with fine-grained VLM reasoning during online inference. The performance gains of Cerberus come from two key innovations: motion mask prompting and rule-based deviation detection. The former directs the VLM's attention to regions relevant to motion, while the latter identifies anomalies as deviations from learned norms rather than enumerating possible anomalies. Extensive evaluations on four datasets show that Cerberus on average achieves 57.68 fps on an NVIDIA L40S GPU, a 151.79$\times$ speedup, and 97.2\% accuracy comparable to the state-of-the-art VLM-based VAD methods, establishing it as a practical solution for real-time video analytics.

184. 【2510.16272】Proactive Scene Decomposition and Reconstruction

链接https://arxiv.org/abs/2510.16272

作者:Baicheng Li,Zike Yan,Dong Wu,Hongbin Zha

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Human behaviors, inherently contain rich, Human, proactive scene decomposition, rich cues

备注

点击查看摘要

Abstract:Human behaviors are the major causes of scene dynamics and inherently contain rich cues regarding the dynamics. This paper formalizes a new task of proactive scene decomposition and reconstruction, an online approach that leverages human-object interactions to iteratively disassemble and reconstruct the environment. By observing these intentional interactions, we can dynamically refine the decomposition and reconstruction process, addressing inherent ambiguities in static object-level reconstruction. The proposed system effectively integrates multiple tasks in dynamic environments such as accurate camera and object pose estimation, instance decomposition, and online map updating, capitalizing on cues from human-object interactions in egocentric live streams for a flexible, progressive alternative to conventional object-level reconstruction methods. Aided by the Gaussian splatting technique, accurate and consistent dynamic scene modeling is achieved with photorealistic and efficient rendering. The efficacy is validated in multiple real-world scenarios with promising advantages.

185. 【2510.16263】NEBULA: Do We Evaluate Vision-Language-Action Agents Correctly?

链接https://arxiv.org/abs/2510.16263

作者:Jierui Peng,Yanyan Zhang,Yicheng Duan,Tuo Liang,Vipin Chaudhary,Yu Yin

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:real-world perturbations, precise skill diagnosis, precise skill, skill diagnosis, measure robustness

备注: Homepage: [this https URL](https://vulab-ai.github.io/NEBULA-Alpha/)

点击查看摘要

Abstract:The evaluation of Vision-Language-Action (VLA) agents is hindered by the coarse, end-task success metric that fails to provide precise skill diagnosis or measure robustness to real-world perturbations. This challenge is exacerbated by a fragmented data landscape that impedes reproducible research and the development of generalist models. To address these limitations, we introduce NEBULA, a unified ecosystem for single-arm manipulation that enables diagnostic and reproducible evaluation. NEBULA features a novel dual-axis evaluation protocol that combines fine-grained capability tests for precise skill diagnosis with systematic stress tests that measure robustness. A standardized API and a large-scale, aggregated dataset are provided to reduce fragmentation and support cross-dataset training and fair comparison. Using NEBULA, we demonstrate that top-performing VLAs struggle with key capabilities such as spatial reasoning and dynamic adaptation, which are consistently obscured by conventional end-task success metrics. By measuring both what an agent can do and when it does so reliably, NEBULA provides a practical foundation for robust, general-purpose embodied agents.

186. 【2510.16258】Embody 3D: A Large-scale Multimodal Motion and Behavior Dataset

链接https://arxiv.org/abs/2510.16258

作者:Claire McLean,Makenzie Meendering,Tristan Swartz,Orri Gabbay,Alexandra Olsen,Rachel Jacobs,Nicholas Rosen,Philippe de Bree,Tony Garcia,Gadsden Merrill,Jake Sandakly,Julia Buffalini,Neham Jain,Steven Krenn,Moneish Kumar,Dejan Markovic,Evonne Ng,Fabian Prada,Andrew Saba,Siwei Zhang,Vasu Agrawal,Tim Godisart,Alexander Richard,Michael Zollhoefer

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Codec Avatars Lab, Meta introduces Embody, multi-camera collection stage, Codec Avatars, Avatars Lab

备注

点击查看摘要

Abstract:The Codec Avatars Lab at Meta introduces Embody 3D, a multimodal dataset of 500 individual hours of 3D motion data from 439 participants collected in a multi-camera collection stage, amounting to over 54 million frames of tracked 3D motion. The dataset features a wide range of single-person motion data, including prompted motions, hand gestures, and locomotion; as well as multi-person behavioral and conversational data like discussions, conversations in different emotional states, collaborative activities, and co-living scenarios in an apartment-like space. We provide tracked human motion including hand tracking and body shape, text annotations, and a separate audio track for each participant.

187. 【2510.16235】Designing a Convolutional Neural Network for High-Accuracy Oral Cavity Squamous Cell Carcinoma (OCSCC) Detection

链接https://arxiv.org/abs/2510.16235

作者:Vishal Manikanden,Aniketh Bandlamudi,Daniel Haehn

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Oral Cavity Squamous, Squamous Cell Carcinoma, Cavity Squamous Cell, Oral Cavity, Cell Carcinoma

备注

点击查看摘要

Abstract:Oral Cavity Squamous Cell Carcinoma (OCSCC) is the most common type of head and neck cancer. Due to the subtle nature of its early stages, deep and hidden areas of development, and slow growth, OCSCC often goes undetected, leading to preventable deaths. However, properly trained Convolutional Neural Networks (CNNs), with their precise image segmentation techniques and ability to apply kernel matrices to modify the RGB values of images for accurate image pattern recognition, would be an effective means for early detection of OCSCC. Pairing this neural network with image capturing and processing hardware would allow increased efficacy in OCSCC detection. The aim of our project is to develop a Convolutional Neural Network trained to recognize OCSCC, as well as to design a physical hardware system to capture and process detailed images, in order to determine the image quality required for accurate predictions. A CNN was trained on 4293 training images consisting of benign and malignant tumors, as well as negative samples, and was evaluated for its precision, recall, and Mean Average Precision (mAP) in its predictions of OCSCC. A testing dataset of randomly assorted images of cancerous, non-cancerous, and negative images was chosen, and each image was altered to represent 5 common resolutions. This test data set was thoroughly analyzed by the CNN and predictions were scored on the basis of accuracy. The designed enhancement hardware was used to capture detailed images, and its impact was scored. An application was developed to facilitate the testing process and bring open access to the CNN. Images of increasing resolution resulted in higher-accuracy predictions on a logarithmic scale, demonstrating the diminishing returns of higher pixel counts.

188. 【2510.16220】VM-BeautyNet: A Synergistic Ensemble of Vision Transformer and Mamba for Facial Beauty Prediction

链接https://arxiv.org/abs/2510.16220

作者:Djamel Eddine Boukhari

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Facial Beauty Prediction, Beauty Prediction, computer vision task, Convolutional Neural Networks, challenging computer vision

备注

点击查看摘要

Abstract:Facial Beauty Prediction (FBP) is a complex and challenging computer vision task, aiming to model the subjective and intricate nature of human aesthetic perception. While deep learning models, particularly Convolutional Neural Networks (CNNs), have made significant strides, they often struggle to capture the global, holistic facial features that are critical to human judgment. Vision Transformers (ViT) address this by effectively modeling long-range spatial relationships, but their quadratic complexity can be a bottleneck. This paper introduces a novel, heterogeneous ensemble architecture, \textbf{VM-BeautyNet}, that synergistically fuses the complementary strengths of a Vision Transformer and a Mamba-based Vision model, a recent advancement in State-Space Models (SSMs). The ViT backbone excels at capturing global facial structure and symmetry, while the Mamba backbone efficiently models long-range dependencies with linear complexity, focusing on sequential features and textures. We evaluate our approach on the benchmark SCUT-FBP5500 dataset. Our proposed VM-BeautyNet achieves state-of-the-art performance, with a \textbf{Pearson Correlation (PC) of 0.9212}, a \textbf{Mean Absolute Error (MAE) of 0.2085}, and a \textbf{Root Mean Square Error (RMSE) of 0.2698}. Furthermore, through Grad-CAM visualizations, we provide interpretability analysis that confirms the complementary feature extraction of the two backbones, offering new insights into the model's decision-making process and presenting a powerful new architectural paradigm for computational aesthetics.

189. 【2510.16209】StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales

链接https://arxiv.org/abs/2510.16209

作者:Nyle Siddiqui,Rohit Gupta,Sirnam Swetha,Mubarak Shah

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:State space models, State space, competitive alternative, State, training

备注

点击查看摘要

Abstract:State space models (SSMs) have emerged as a competitive alternative to transformers in various tasks. Their linear complexity and hidden-state recurrence make them particularly attractive for modeling long sequences, whereas attention becomes quadratically expensive. However, current training methods for video understanding are tailored towards transformers and fail to fully leverage the unique attributes of SSMs. For example, video models are often trained at a fixed resolution and video length to balance the quadratic scaling of attention cost against performance. Consequently, these models suffer from degraded performance when evaluated on videos with spatial and temporal resolutions unseen during training; a property we call spatio-temporal inflexibility. In the context of action recognition, this severely limits a model's ability to retain performance across both short- and long-form videos. Therefore, we propose a flexible training method that leverages and improves the inherent adaptability of SSMs. Our method samples videos at varying temporal and spatial resolutions during training and dynamically interpolates model weights to accommodate any spatio-temporal scale. This instills our SSM, which we call StretchySnake, with spatio-temporal flexibility and enables it to seamlessly handle videos ranging from short, fine-grained clips to long, complex activities. We introduce and compare five different variants of flexible training, and identify the most effective strategy for video SSMs. On short-action (UCF-101, HMDB-51) and long-action (COIN, Breakfast) benchmarks, StretchySnake outperforms transformer and SSM baselines alike by up to 28%, with strong adaptability to fine-grained actions (SSV2, Diving-48). Therefore, our method provides a simple drop-in training recipe that makes video SSMs more robust, resolution-agnostic, and efficient across diverse action recognition scenarios.

190. 【2510.16207】Data-Centric AI for Tropical Agricultural Mapping: Challenges, Strategies and Scalable Solutions

链接https://arxiv.org/abs/2510.16207

作者:Mateus Pinto da Silva,Sabrina P. L. P. Correa,Hugo N. Oliveira,Ian M. Nunes,Jefersson A. dos Santos

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:presents unique challenges, remote sensing presents, sensing presents unique, high-quality annotated data, Data-Centric Artificial Intelligence

备注: 5 pages, 1 figure

点击查看摘要

Abstract:Mapping agriculture in tropical areas through remote sensing presents unique challenges, including the lack of high-quality annotated data, the elevated costs of labeling, data variability, and regional generalisation. This paper advocates a Data-Centric Artificial Intelligence (DCAI) perspective and pipeline, emphasizing data quality and curation as key drivers for model robustness and scalability. It reviews and prioritizes techniques such as confident learning, core-set selection, data augmentation, and active learning. The paper highlights the readiness and suitability of 25 distinct strategies in large-scale agricultural mapping pipelines. The tropical context is of high interest, since high cloudiness, diverse crop calendars, and limited datasets limit traditional model-centric approaches. This tutorial outlines practical solutions as a data-centric approach for curating and training AI models better suited to the dynamic realities of tropical agriculture. Finally, we propose a practical pipeline using the 9 most mature and straightforward methods that can be applied to a large-scale tropical agricultural mapping project.

191. 【2510.16196】Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI

链接https://arxiv.org/abs/2510.16196

作者:Zheng Huang,Enpei Zhang,Yinghao Cai,Weikang Qiu,Carl Yang,Elynn Chen,Xiang Zhang,Rex Ying,Dawei Zhou,Yujun Yan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Magnetic Resonance Imaging, brain encodes visual, encodes visual information, functional Magnetic Resonance, Resonance Imaging

备注

点击查看摘要

Abstract:Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli, essentially images, from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pretrained generative model to reconstruct images. The reconstruction quality depends on how similar the latent space is to the structure of neural activity and how well the generative model produces images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively. We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision based space or a joint text image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute relationship search module that automatically identifies key attributes and relationships that best align with the neural activity. Extensive experiments on real world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.

192. 【2510.16179】Cost Savings from Automatic Quality Assessment of Generated Images

链接https://arxiv.org/abs/2510.16179

作者:Xavier Giro-i-Nieto,Nefeli Andreou,Anqi Liang,Manel Baradad,Francesc Moreno-Noguer,Aleix Martinez

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Deep generative models, shown impressive progress, Deep generative, simple text prompt, recent years

备注

点击查看摘要

Abstract:Deep generative models have shown impressive progress in recent years, making it possible to produce high quality images with a simple text prompt or a reference image. However, state of the art technology does not yet meet the quality standards offered by traditional photographic methods. For this reason, production pipelines that use generated images often include a manual stage of image quality assessment (IQA). This process is slow and expensive, especially because of the low yield of automatically generated images that pass the quality bar. The IQA workload can be reduced by introducing an automatic pre-filtering stage, that will increase the overall quality of the images sent to review and, therefore, reduce the average cost required to obtain a high quality image. We present a formula that estimates the cost savings depending on the precision and pass yield of a generic IQA engine. This formula is applied in a use case of background inpainting, showcasing a significant cost saving of 51.61% obtained with a simple AutoML solution.

193. 【2510.16160】Automated C-Arm Positioning via Conformal Landmark Localization

链接https://arxiv.org/abs/2510.16160

作者:Ahmad Arrabi,Jay Hwasung Jung,Jax Luo,Nathan Franssen,Scott Raymond,Safwan Wshah

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:reliable C-arm positioning, fluoroscopy-guided interventions, positioning is essential, essential for fluoroscopy-guided, C-arm positioning

备注

点击查看摘要

Abstract:Accurate and reliable C-arm positioning is essential for fluoroscopy-guided interventions. However, clinical workflows rely on manual alignment that increases radiation exposure and procedural delays. In this work, we present a pipeline that autonomously navigates the C-arm to predefined anatomical landmarks utilizing X-ray images. Given an input X-ray image from an arbitrary starting location on the operating table, the model predicts a 3D displacement vector toward each target landmark along the body. To ensure reliable deployment, we capture both aleatoric and epistemic uncertainties in the model's predictions and further calibrate them using conformal prediction. The derived prediction regions are interpreted as 3D confidence regions around the predicted landmark locations. The training framework combines a probabilistic loss with skeletal pose regularization to encourage anatomically plausible outputs. We validate our approach on a synthetic X-ray dataset generated from DeepDRR. Results show not only strong localization accuracy across multiple architectures but also well-calibrated prediction bounds. These findings highlight the pipeline's potential as a component in safe and reliable autonomous C-arm systems. Code is available at this https URL

194. 【2510.16146】DuetMatch: Harmonizing Semi-Supervised Brain MRI Segmentation via Decoupled Branch Optimization

链接https://arxiv.org/abs/2510.16146

作者:Thanh-Huy Nguyen,Hoang-Thien Nguyen,Vi Vu,Ba-Thinh Lam,Phat Huynh,Tianyang Wang,Xingjian Li,Ulas Bagci,Min Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:learning increasingly appealing, medical imaging makes, makes semi-supervised learning, semi-supervised learning increasingly, imaging makes semi-supervised

备注: The paper is under review at CMIG

点击查看摘要

Abstract:The limited availability of annotated data in medical imaging makes semi-supervised learning increasingly appealing for its ability to learn from imperfect supervision. Recently, teacher-student frameworks have gained popularity for their training benefits and robust performance. However, jointly optimizing the entire network can hinder convergence and stability, especially in challenging scenarios. To address this for medical image segmentation, we propose DuetMatch, a novel dual-branch semi-supervised framework with asynchronous optimization, where each branch optimizes either the encoder or decoder while keeping the other frozen. To improve consistency under noisy conditions, we introduce Decoupled Dropout Perturbation, enforcing regularization across branches. We also design Pair-wise CutMix Cross-Guidance to enhance model diversity by exchanging pseudo-labels through augmented input pairs. To mitigate confirmation bias from noisy pseudo-labels, we propose Consistency Matching, refining labels using stable predictions from frozen teacher models. Extensive experiments on benchmark brain MRI segmentation datasets, including ISLES2022 and BraTS, show that DuetMatch consistently outperforms state-of-the-art methods, demonstrating its effectiveness and robustness across diverse semi-supervised segmentation scenarios.

195. 【2510.16145】C-arm Guidance: A Self-supervised Approach To Automated Positioning During Stroke Thrombectomy

链接https://arxiv.org/abs/2510.16145

作者:Ahmad Arrabi,Jay hwasung Jung,J Le,A Nguyen,J Reed,E Stahl,Nathan Franssen,Scott Raymond,Safwan Wshah

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:resource and personnel-intensive, effective treatments, treatments for ischemic, Abstract, Thrombectomy

备注

点击查看摘要

Abstract:Thrombectomy is one of the most effective treatments for ischemic stroke, but it is resource and personnel-intensive. We propose employing deep learning to automate critical aspects of thrombectomy, thereby enhancing efficiency and safety. In this work, we introduce a self-supervised framework that classifies various skeletal landmarks using a regression-based pretext task. Our experiments demonstrate that our model outperforms existing methods in both regression and classification tasks. Notably, our results indicate that the positional pretext task significantly enhances downstream classification performance. Future work will focus on extending this framework toward fully autonomous C-arm control, aiming to optimize trajectories from the pelvis to the head during stroke thrombectomy procedures. All code used is available at this https URL

196. 【2510.16136】GuideFlow3D: Optimization-Guided Rectified Flow For Appearance Transfer

链接https://arxiv.org/abs/2510.16136

作者:Sayan Deb Sarkar,Sinisa Stekovic,Vincent Lepetit,Iro Armeni

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

关键词:digital content creation, garnered interest due, augmented reality, Transferring appearance, industries like gaming

备注: NeurIPS 2025. Project Page: [this https URL](https://sayands.github.io/guideflow3d/)

点击查看摘要

Abstract:Transferring appearance to 3D assets using different representations of the appearance object - such as images or text - has garnered interest due to its wide range of applications in industries like gaming, augmented reality, and digital content creation. However, state-of-the-art methods still fail when the geometry between the input and appearance objects is significantly different. A straightforward approach is to directly apply a 3D generative model, but we show that this ultimately fails to produce appealing results. Instead, we propose a principled approach inspired by universal guidance. Given a pretrained rectified flow model conditioned on image or text, our training-free method interacts with the sampling process by periodically adding guidance. This guidance can be modeled as a differentiable loss function, and we experiment with two different types of guidance including part-aware losses for appearance and self-similarity. Our experiments show that our approach successfully transfers texture and geometric details to the input 3D asset, outperforming baselines both qualitatively and quantitatively. We also show that traditional metrics are not suitable for evaluating the task due to their inability of focusing on local details and comparing dissimilar inputs, in absence of ground truth data. We thus evaluate appearance transfer quality with a GPT-based system objectively ranking outputs, ensuring robust and human-like assessment, as further confirmed by our user study. Beyond showcased scenarios, our method is general and could be extended to different types of diffusion models and guidance functions.

197. 【2510.16134】Aria Gen 2 Pilot Dataset

链接https://arxiv.org/abs/2510.16134

作者:Chen Kong,James Fort,Aria Kang,Jonathan Wittmer,Simon Green,Tianwei Shen,Yipu Zhao,Cheng Peng,Gustavo Solaira,Andrew Berkovich,Nikhil Raina,Vijay Baiyya,Evgeniy Oleinik,Eric Huang,Fan Zhang,Julian Straub,Mark Schwesinger,Luis Pesqueira,Xiaqing Pan,Jakob Julian Engel,Carl Ren,Mingfei Yan,Richard Newcombe

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:egocentric multimodal open, multimodal open dataset, open dataset captured, Aria Gen, Pilot Dataset

备注

点击查看摘要

Abstract:The Aria Gen 2 Pilot Dataset (A2PD) is an egocentric multimodal open dataset captured using the state-of-the-art Aria Gen 2 glasses. To facilitate timely access, A2PD is released incrementally with ongoing dataset enhancements. The initial release features Dia'ane, our primary subject, who records her daily activities alongside friends, each equipped with Aria Gen 2 glasses. It encompasses five primary scenarios: cleaning, cooking, eating, playing, and outdoor walking. In each of the scenarios, we provide comprehensive raw sensor data and output data from various machine perception algorithms. These data illustrate the device's ability to perceive the wearer, the surrounding environment, and interactions between the wearer and the environment, while maintaining robust performance across diverse users and conditions. The A2PD is publicly available at this http URL, with open-source tools and usage examples provided in Project Aria Tools.

198. 【2510.16118】ObjectTransforms for Uncertainty Quantification and Reduction in Vision-Based Perception for Autonomous Vehicles

链接https://arxiv.org/abs/2510.16118

作者:Nishad Sahu,Shounak Sural,Aditya Satish Patil,Ragunathan(Raj)Rajkumar

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:safety critical decision, critical decision making, Reliable perception, autonomous driving, fundamental for safety

备注: Accepted at International Conference on Computer Vision (ICCV) 2025 Workshops

点击查看摘要

Abstract:Reliable perception is fundamental for safety critical decision making in autonomous driving. Yet, vision based object detector neural networks remain vulnerable to uncertainty arising from issues such as data bias and distributional shifts. In this paper, we introduce ObjectTransforms, a technique for quantifying and reducing uncertainty in vision based object detection through object specific transformations at both training and inference times. At training time, ObjectTransforms perform color space perturbations on individual objects, improving robustness to lighting and color variations. ObjectTransforms also uses diffusion models to generate realistic, diverse pedestrian instances. At inference time, object perturbations are applied to detected objects and the variance of detection scores are used to quantify predictive uncertainty in real time. This uncertainty signal is then used to filter out false positives and also recover false negatives, improving the overall precision recall curve. Experiments with YOLOv8 on the NuImages 10K dataset demonstrate that our method yields notable accuracy improvements and uncertainty reduction across all object classes during training, while predicting desirably higher uncertainty values for false positives as compared to true positives during inference. Our results highlight the potential of ObjectTransforms as a lightweight yet effective mechanism for reducing and quantifying uncertainty in vision-based perception during training and inference respectively.

199. 【2510.16115】StripRFNet: A Strip Receptive Field and Shape-Aware Network for Road Damage Detection

链接https://arxiv.org/abs/2510.16115

作者:Jianhan Lin,Yuchu Qin,Shuai Gao,Yikang Rui,Jie Liu,Yanjie Lv

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Sustainable Development Goal, achieving Sustainable Development, Well-maintained road networks, Development Goal, Well-maintained road

备注

点击查看摘要

Abstract:Well-maintained road networks are crucial for achieving Sustainable Development Goal (SDG) 11. Road surface damage not only threatens traffic safety but also hinders sustainable urban development. Accurate detection, however, remains challenging due to the diverse shapes of damages, the difficulty of capturing slender cracks with high aspect ratios, and the high error rates in small-scale damage recognition. To address these issues, we propose StripRFNet, a novel deep neural network comprising three modules: (1) a Shape Perception Module (SPM) that enhances shape discrimination via large separable kernel attention (LSKA) in multi-scale feature aggregation; (2) a Strip Receptive Field Module (SRFM) that employs large strip convolutions and pooling to capture features of slender cracks; and (3) a Small-Scale Enhancement Module (SSEM) that leverages a high-resolution P2 feature map, a dedicated detection head, and dynamic upsampling to improve small-object detection. Experiments on the RDD2022 benchmark show that StripRFNet surpasses existing methods. On the Chinese subset, it improves F1-score, mAP50, and mAP50:95 by 4.4, 2.9, and 3.4 percentage points over the baseline, respectively. On the full dataset, it achieves the highest F1-score of 80.33% compared with CRDDC'2022 participants and ORDDC'2024 Phase 2 results, while maintaining competitive inference speed. These results demonstrate that StripRFNet achieves state-of-the-art accuracy and real-time efficiency, offering a promising tool for intelligent road maintenance and sustainable infrastructure management.

200. 【2510.16088】Differentiable, Bit-shifting, and Scalable Quantization without training neural network from scratch

链接https://arxiv.org/abs/2510.16088

作者:Zia Badar

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

关键词:Quantization, bit quantization, Previous work, memory requirements, shift bit quantization

备注

点击查看摘要

Abstract:Quantization of neural networks provides benefits of inference in less compute and memory requirements. Previous work in quantization lack two important aspects which this work provides. First almost all previous work in quantization used a non-differentiable approach and for learning; the derivative is usually set manually in backpropogation which make the learning ability of algorithm questionable, our approach is not just differentiable, we also provide proof of convergence of our approach to the optimal neural network. Second previous work in shift/logrithmic quantization either have avoided activation quantization along with weight quantization or achieved less accuracy. Learning logrithmic quantize values of form $2^n$ requires the quantization function can scale to more than 1 bit quantization which is another benifit of our quantization that it provides $n$ bits quantization as well. Our approach when tested with image classification task using imagenet dataset, resnet18 and weight quantization only achieves less than 1 percent accuracy compared to full precision accuracy while taking only 15 epochs to train using shift bit quantization and achieves comparable to SOTA approaches accuracy in both weight and activation quantization using shift bit quantization in 15 training epochs with slightly higher(only higher cpu instructions) inference cost compared to 1 bit quantization(without logrithmic quantization) and not requiring any higher precision multiplication.

201. 【2510.16078】ISO/IEC-Compliant Match-on-Card Face Verification with Short Binary Templates

链接https://arxiv.org/abs/2510.16078

作者:Abdelilah Ganmati,Karim Afdel,Lahcen Koutti

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:constant-time Hamming distance, Hamming distance, off-card by PCA-ITQ, PCA-ITQ and compared, present a practical

备注: ~14 pages, 6 figures, 6 tables. Source uses elsarticle class; all figures included as PNG/PDF. Primary: cs.CV

点击查看摘要

Abstract:We present a practical match-on-card design for face verification in which compact 64/128-bit templates are produced off-card by PCA-ITQ and compared on-card via constant-time Hamming distance. We specify ISO/IEC 7816-4 and 14443-4 command APDUs with fixed-length payloads and decision-only status words (no score leakage), together with a minimal per-identity EEPROM map. Using real binary codes from a CelebA working set (55 identities, 412 images), we (i) derive operating thresholds from ROC/DET, (ii) replay enroll-verify transactions at those thresholds, and (iii) bound end-to-end time by pure link latency plus a small constant on-card budget. Even at the slowest contact rate (9.6 kbps), total verification time is 43.9 ms (64 b) and 52.3 ms (128 b); at 38.4 kbps both are 14 ms. At FAR = 1%, both code lengths reach TPR = 0.836, while 128 b lowers EER relative to 64 b. An optional +6 B helper (targeted symbol-level parity over empirically unstable bits) is latency-negligible. Overall, short binary templates, fixed-payload decision-only APDUs, and constant-time matching satisfy ISO/IEC transport constraints with wide timing margin and align with ISO/IEC 24745 privacy goals. Limitations: single-dataset evaluation and design-level (pre-hardware) timing; we outline AgeDB/CFP-FP and on-card microbenchmarks as next steps.

202. 【2510.16072】Data-Driven Analysis of Intersectional Bias in Image Classification: A Framework with Bias-Weighted Augmentation

链接https://arxiv.org/abs/2510.16072

作者:Farjana Yesmin

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Machine learning models, biases-systematic errors arising, Machine learning, exhibit intersectional biases-systematic, intersectional biases-systematic errors

备注: 18 pages

点击查看摘要

Abstract:Machine learning models trained on imbalanced datasets often exhibit intersectional biases-systematic errors arising from the interaction of multiple attributes such as object class and environmental conditions. This paper presents a data-driven framework for analyzing and mitigating such biases in image classification. We introduce the Intersectional Fairness Evaluation Framework (IFEF), which combines quantitative fairness metrics with interpretability tools to systematically identify bias patterns in model predictions. Building on this analysis, we propose Bias-Weighted Augmentation (BWA), a novel data augmentation strategy that adapts transformation intensities based on subgroup distribution statistics. Experiments on the Open Images V7 dataset with five object classes demonstrate that BWA improves accuracy for underrepresented class-environment intersections by up to 24 percentage points while reducing fairness metric disparities by 35%. Statistical analysis across multiple independent runs confirms the significance of improvements (p 0.05). Our methodology provides a replicable approach for analyzing and addressing intersectional biases in image classification systems.

203. 【2510.16070】Effect of Reporting Mode and Clinical Experience on Radiologists' Gaze and Image Analysis Behavior in Chest Radiography

链接https://arxiv.org/abs/2510.16070

作者:Mahta Khoobi,Marc Sebastian von der Stueck,Felix Barajas Ordonez,Anca-Maria Iancu,Eric Corban,Julia Nowak,Aleksandar Kargaliev,Valeria Perelygina,Anna-Sophie Schott,Daniel Pinto dos Santos,Christiane Kuhl,Daniel Truhn,Sven Nebelung,Robert Siepmann

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)

关键词:Structured reporting, AI-assisted structured reporting, artificial intelligence, imaging studies, July to December

备注: Preprint version - Under second revision at Radiology (manuscript RAD-25-1348)

点击查看摘要

Abstract:Structured reporting (SR) and artificial intelligence (AI) may transform how radiologists interact with imaging studies. This prospective study (July to December 2024) evaluated the impact of three reporting modes: free-text (FT), structured reporting (SR), and AI-assisted structured reporting (AI-SR), on image analysis behavior, diagnostic accuracy, efficiency, and user experience. Four novice and four non-novice readers (radiologists and medical students) each analyzed 35 bedside chest radiographs per session using a customized viewer and an eye-tracking system. Outcomes included diagnostic accuracy (compared with expert consensus using Cohen's $\kappa$), reporting time per radiograph, eye-tracking metrics, and questionnaire-based user experience. Statistical analysis used generalized linear mixed models with Bonferroni post-hoc tests with a significance level of ($P \le .01$). Diagnostic accuracy was similar in FT ($\kappa = 0.58$) and SR ($\kappa = 0.60$) but higher in AI-SR ($\kappa = 0.71$, $P .001$). Reporting times decreased from $88 \pm 38$ s (FT) to $37 \pm 18$ s (SR) and $25 \pm 9$ s (AI-SR) ($P .001$). Saccade counts for the radiograph field ($205 \pm 135$ (FT), $123 \pm 88$ (SR), $97 \pm 58$ (AI-SR)) and total fixation duration for the report field ($11 \pm 5$ s (FT), $5 \pm 3$ s (SR), $4 \pm 1$ s (AI-SR)) were lower with SR and AI-SR ($P .001$ each). Novice readers shifted gaze towards the radiograph in SR, while non-novice readers maintained their focus on the radiograph. AI-SR was the preferred mode. In conclusion, SR improves efficiency by guiding visual attention toward the image, and AI-prefilled SR further enhances diagnostic accuracy and user satisfaction.

204. 【2510.16065】FedPURIN: Programmed Update and Reduced INformation for Sparse Personalized Federated Learning

链接https://arxiv.org/abs/2510.16065

作者:Lunchen Xie,Zehua He,Qingjiang Shi

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Personalized Federated Learning, research frontier addressing, Personalized Federated, critical research frontier, frontier addressing data

备注

点击查看摘要

Abstract:Personalized Federated Learning (PFL) has emerged as a critical research frontier addressing data heterogeneity issue across distributed clients. Novel model architectures and collaboration mechanisms are engineered to accommodate statistical disparities while producing client-specific models. Parameter decoupling represents a promising paradigm for maintaining model performance in PFL frameworks. However, the communication efficiency of many existing methods remains suboptimal, sustaining substantial communication burdens that impede practical deployment. To bridge this gap, we propose Federated Learning with Programmed Update and Reduced INformation (FedPURIN), a novel framework that strategically identifies critical parameters for transmission through an integer programming formulation. This mathematically grounded strategy is seamlessly integrated into a sparse aggregation scheme, achieving a significant communication reduction while preserving the efficacy. Comprehensive evaluations on standard image classification benchmarks under varied non-IID conditions demonstrate competitive performance relative to state-of-the-art methods, coupled with quantifiable communication reduction through sparse aggregation. The framework establishes a new paradigm for communication-efficient PFL, particularly advantageous for edge intelligence systems operating with heterogeneous data sources.

205. 【2510.16036】IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection

链接https://arxiv.org/abs/2510.16036

作者:Zewen Li,Zitong Yu,Qilang Ye,Weicheng Xie,Wei Zhuo,Linlin Shen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词

备注: Accepted by IEEE Transactions on Instrumentation and Measurement (TIM)

点击查看摘要

None

206. 【2510.16017】InfraGPT Smart Infrastructure: An End-to-End VLM-Based Framework for Detecting and Managing Urban Defects

链接https://arxiv.org/abs/2510.16017

作者:Ibrahim Sheikh Mohamed,Abdullah Yahya Abdullah Omaisan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)

关键词:closed circuit television, Infrastructure in smart, circuit television, smart cities, cities is increasingly

备注

点击查看摘要

Abstract:Infrastructure in smart cities is increasingly monitored by networks of closed circuit television (CCTV) cameras. Roads, bridges and tunnels develop cracks, potholes, and fluid leaks that threaten public safety and require timely repair. Manual inspection is costly and hazardous, and existing automatic systems typically address individual defect types or provide unstructured outputs that cannot directly guide maintenance crews. This paper proposes a comprehensive pipeline that leverages street CCTV streams for multi defect detection and segmentation using the YOLO family of object detectors and passes the detections to a vision language model (VLM) for scene aware summarization. The VLM generates a structured action plan in JSON format that includes incident descriptions, recommended tools, dimensions, repair plans, and urgent alerts. We review literature on pothole, crack and leak detection, highlight recent advances in large vision language models such as QwenVL and LLaVA, and describe the design of our early prototype. Experimental evaluation on public datasets and captured CCTV clips demonstrates that the system accurately identifies diverse defects and produces coherent summaries. We conclude by discussing challenges and directions for scaling the system to city wide deployments.

207. 【2510.15991】CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection

链接https://arxiv.org/abs/2510.15991

作者:Huiming Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词

备注: 13 pages

点击查看摘要

None

208. 【2510.15963】ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

链接https://arxiv.org/abs/2510.15963

作者:Jiani Huang,Amish Sethi,Matthew Kuo,Mayank Keoliya,Neelay Velingker,JungHo Jung,Ser-Nam Lim,Ziyang Li,Mayur Naik

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词

备注: Accepted as a Spotlight Paper at NeurIPS 2025

点击查看摘要

None

209. 【2510.17540】Detecting streaks in smart telescopes images with Deep Learning

链接https://arxiv.org/abs/2510.17540

作者:Olivier Parisot,Mahmoud Jaziri

类目:Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)

关键词

备注: 19 pages, preprint submitted to the Springer CCIS Special Issue on DATA 2024 (currently under editorial processing)

点击查看摘要

None

210. 【2510.16321】me-Embedded Algorithm Unrolling for Computational MRI

链接https://arxiv.org/abs/2510.16321

作者:Junno Yun,Yaşar Utku Alçalar,Mehmet Akçakaya

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)

关键词

备注: Neural Information Processing Systems (NeurIPS), 2025

点击查看摘要

None

211. 【2510.16310】Lung Cancer Classification from CT Images Using ResNet

链接https://arxiv.org/abs/2510.16310

作者:Olajumoke O. Adekunle,Joseph D. Akinyemi,Khadijat T. Ladoja,Olufade F.W. Onifade

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词

备注: 9 pages,4 figures, 3 tables

点击查看摘要

None