本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新446篇论文,其中:

  • 自然语言处理66
  • 信息检索14
  • 计算机视觉92

自然语言处理

1. 【2510.27688】Continuous Autoregressive Language Models

链接https://arxiv.org/abs/2510.27688

作者:Chenze Shao,Darren Li,Fandong Meng,Jie Zhou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:large language models, generation process, efficiency of large, fundamentally limited, Autoregressive Language Models

备注

点击查看摘要

Abstract:The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9\% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. Code: this https URL. Project: this https URL.

2. 【2510.27672】Culture Cartography: Mapping the Landscape of Cultural Knowledge

链接https://arxiv.org/abs/2510.27672

作者:Caleb Ziems,William Held,Jane Yu,Amir Goldberg,David Grusky,Diyi Yang

类目:Computation and Language (cs.CL)

关键词:serve global users, global users safely, safely and productively, learned during pre-training, serve global

备注: EMNLP 2025

点击查看摘要

Abstract:To serve global users safely and productively, LLMs need culture-specific knowledge that might not be learned during pre-training. How do we find such knowledge that is (1) salient to in-group users, but (2) unknown to LLMs? The most common solutions are single-initiative: either researchers define challenging questions that users passively answer (traditional annotation), or users actively produce data that researchers structure as benchmarks (knowledge extraction). The process would benefit from mixed-initiative collaboration, where users guide the process to meaningfully reflect their cultures, and LLMs steer the process towards more challenging questions that meet the researcher's goals. We propose a mixed-initiative methodology called CultureCartography. Here, an LLM initializes annotation with questions for which it has low-confidence answers, making explicit both its prior knowledge and the gaps therein. This allows a human respondent to fill these gaps and steer the model towards salient topics through direct edits. We implement this methodology as a tool called CultureExplorer. Compared to a baseline where humans answer LLM-proposed questions, we find that CultureExplorer more effectively produces knowledge that leading models like DeepSeek R1 and GPT-4o are missing, even with web search. Fine-tuning on this data boosts the accuracy of Llama-3.1-8B by up to 19.2% on related culture benchmarks.

3. 【2510.27641】SpecAttn: Speculating Sparse Attention

链接https://arxiv.org/abs/2510.27641

作者:Harsh Shah

类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Systems and Control (eess.SY)

关键词:Large Language Models, Large Language, context lengths increase, Language Models, self-attention mechanisms

备注: Accepted to NeurIPS 2025 Workshop on Structured Probabilistic Inference Generative Modeling

点击查看摘要

Abstract:Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By leveraging the computational work already performed in standard speculative decoding pipelines, SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods. Our approach demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation.

4. 【2510.27623】Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning

链接https://arxiv.org/abs/2510.27623

作者:Qiusi Zhan,Hyeonjeong Ha,Rui Yang,Sirui Xu,Hanyang Chen,Liang-Yan Gui,Yu-Xiong Wang,Huan Zhang,Heng Ji,Daniel Kang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal large language, large language models, enabling direct perception, planning task-oriented actions, Multimodal large

备注

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into MLLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation. Across various embodied agent benchmarks and MLLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in MLLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.

5. 【2510.27571】owards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

链接https://arxiv.org/abs/2510.27571

作者:Zhuoning Guo,Mingxin Li,Yanzhao Zhang,Dingkun Long,Pengjun Xie,Xiaowen Chu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:narrow benchmarks incentivize, benchmarks incentivize correspondingly, incentivize correspondingly limited, prevailing video retrieval, video retrieval paradigm

备注

点击查看摘要

Abstract:The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.

6. 【2510.27569】MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval

链接https://arxiv.org/abs/2510.27569

作者:Qi Luo,Xiaonan Li,Yuxin Wang,Tingshuo Fan,Yuan Li,Xinchi Chen,Xipeng Qiu

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, static pretraining data, Language Models, pretraining data

备注

点击查看摘要

Abstract:Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data, resulting in factual inaccuracies and weak adaptability to new information. Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge; However, the effectiveness of RAG critically depends on whether the model can adequately access relevant information. Existing RAG systems rely on a single retriever with fixed top-k selection, restricting access to a narrow and static subset of the corpus. As a result, this single-retriever paradigm has become the primary bottleneck for comprehensive external information acquisition, especially in tasks requiring corpus-level reasoning. To overcome this limitation, we propose MARAG-R1, a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms for broader and more precise information access. MARAG-R1 equips the model with four retrieval tools -- semantic search, keyword search, filtering, and aggregation -- and learns both how and when to use them through a two-stage training process: supervised fine-tuning followed by reinforcement learning. This design allows the model to interleave reasoning and retrieval, progressively gathering sufficient evidence for corpus-level synthesis. Experiments on GlobalQA, HotpotQA, and 2WikiMultiHopQA demonstrate that MARAG-R1 substantially outperforms strong baselines and achieves new state-of-the-art results in corpus-level reasoning tasks.

7. 【2510.27568】SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning

链接https://arxiv.org/abs/2510.27568

作者:Ali Asgarov,Umid Suleymanov,Aadyant Khatri

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Solving mathematical reasoning, Solving mathematical, reasoning problems requires, mathematical reasoning problems, multi-step thinking

备注: Short Paper - Under Review

点击查看摘要

Abstract:Solving mathematical reasoning problems requires not only accurate access to relevant knowledge but also careful, multi-step thinking. However, current retrieval-augmented models often rely on a single perspective, follow inflexible search strategies, and struggle to effectively combine information from multiple sources. We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Each agent generates hypothetical passages to optimize retrieval for its analytic perspective, ensuring knowledge integration is both context-sensitive and computation-efficient. When evaluated on challenging benchmarks such as MATH500, AIME, and PhD-level science QA GPQA, SIGMA consistently outperforms both open- and closed-source systems, achieving an absolute performance improvement of 7.4%. Our results demonstrate that multi-agent, on-demand knowledge integration significantly enhances both reasoning accuracy and efficiency, offering a scalable approach for complex, knowledge-intensive problem-solving. We will release the code upon publication.

8. 【2510.27556】Data-Efficient Domain Adaptation for LLM-based MT using Contrastive Preference Optimization

链接https://arxiv.org/abs/2510.27556

作者:Inacio Vieira,Antonio Castaldo,James O'Doherty,Sheila Castilho

类目:Computation and Language (cs.CL)

关键词:LLMs often require, expensive when relying, relying solely, require adaptation, domain-specific requirements

备注

点击查看摘要

Abstract:LLMs often require adaptation to domain-specific requirements, a process that can be expensive when relying solely on SFT. We present an empirical study on applying CPO to simulate a post-editing workflow for data-efficient domain adaptation. Our approach synthesizes preference pairs by treating the base model's own raw output as the 'rejected' translation and the human-approved TM entry as the 'chosen' one. This method provides direct feedback on the model's current knowledge, guiding it to align with domain-specific standards. Experiments in English-Brazilian Portuguese and English-Korean show that, by using just 14.7k preference pairs, the model achieves performance close to that of a model trained on 160k+ samples with SFT, demonstrating significant data efficiency. Although we showcase its effectiveness in MT, this application of CPO naturally generalizes to other generative tasks where a model's initial drafts can serve as a contrastive signal against a golden reference.

9. 【2510.27552】Multilingual BERT language model for medical tasks: Evaluation on domain-specific adaptation and cross-linguality

链接https://arxiv.org/abs/2510.27552

作者:Yinghao Luo(1 and 2),Lang Zhou(1 and 2),Amrish Jhingoer(1 and 2),Klaske Vliegenthart Jongbloed(3 and 4),Carlijn Jordans(4),Ben Werkhoven(5),Tom Seinen(6),Erik van Mulligen(6),Casper Rokx(3 and 4),Yunlei Li(1) ((1) Department of Pathology amp; Clinical Bioinformatics, Erasmus University Medical Center Rotterdam, (2) Department of Computer Science, Vrije Universiteit Amsterdam, (3) Department of Internal Medicine, Erasmus University Medical Center Rotterdam, (4) Department of Medical Microbiology and Infectious Diseases, Erasmus University Medical Center Rotterdam, (5) Department of Data and Analytics, Erasmus University Medical Center Rotterdam, (6) Department of Medical Informatics, Erasmus University Medical Center Rotterdam)

类目:Computation and Language (cs.CL)

关键词:multilingual healthcare applications, natural language processing, tools is limited, Romanian and Spanish, healthcare applications

备注

点击查看摘要

Abstract:In multilingual healthcare applications, the availability of domain-specific natural language processing(NLP) tools is limited, especially for low-resource languages. Although multilingual bidirectional encoder representations from transformers (BERT) offers a promising motivation to mitigate the language gap, the medical NLP tasks in low-resource languages are still underexplored. Therefore, this study investigates how further pre-training on domain-specific corpora affects model performance on medical tasks, focusing on three languages: Dutch, Romanian and Spanish. In terms of further pre-training, we conducted four experiments to create medical domain models. Then, these models were fine-tuned on three downstream tasks: Automated patient screening in Dutch clinical notes, named entity recognition in Romanian and Spanish clinical notes. Results show that domain adaptation significantly enhanced task performance. Furthermore, further differentiation of domains, e.g. clinical and general biomedical domains, resulted in diverse performances. The clinical domain-adapted model outperformed the more general biomedical domain-adapted model. Moreover, we observed evidence of cross-lingual transferability. Moreover, we also conducted further investigations to explore potential reasons contributing to these performance differences. These findings highlight the feasibility of domain adaptation and cross-lingual ability in medical NLP. Within the low-resource language settings, these findings can provide meaningful guidance for developing multilingual medical NLP systems to mitigate the lack of training data and thereby improve the model performance.

10. 【2510.27543】DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models

链接https://arxiv.org/abs/2510.27543

作者:Malik H. Altakrori,Nizar Habash,Abdelhakim Freihat,Younes Samih,Kirill Chirkunov,Muhammed AbuOdeh,Radu Florian,Teresa Lynn,Preslav Nakov,Alham Fikri Aji

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Modern Standard Arabic, large language models, large language, Arabic, Modern Standard

备注: 9 pages, 9 tables

点击查看摘要

Abstract:We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. While recently developed Arabic and multilingual benchmarks have advanced LLM evaluation for Modern Standard Arabic (MSA), dialectal varieties remain underrepresented despite their prevalence in everyday communication. DialectalArabicMMLU extends the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of 15K QA pairs across 32 academic and professional domains (22K QA pairs when also including English and MSA). The benchmark enables systematic assessment of LLM reasoning and comprehension beyond MSA, supporting both task-based and linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) and report substantial performance variation across dialects, revealing persistent gaps in dialectal generalization. DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, thus promoting more inclusive evaluation and future model development.

11. 【2510.27535】Patient-Centered Summarization Framework for AI Clinical Summarization: A Mixed-Methods Design

链接https://arxiv.org/abs/2510.27535

作者:Maria Lizarazo Jimenez,Ana Gabriela Claros,Kieran Green,David Toro-Tobon,Felipe Larios,Sheena Asthana,Camila Wenczenovicz,Kerly Guevara Maldonado,Luis Vilatuna-Andrango,Cristina Proano-Velez,Satya Sai Sri Bandi,Shubhangi Bagewadi,Megan E. Branda,Misk Al Zahidy,Saturnino Luz,Mirella Lapata,Juan P. Brito,Oscar J. Ponce-Ponte

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, generating clinical summaries, Language Models, reach human-level performance

备注: The first two listed authors contributed equally Pages: 21; Figures:2; Tables:3

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly demonstrating the potential to reach human-level performance in generating clinical summaries from patient-clinician conversations. However, these summaries often focus on patients' biology rather than their preferences, values, wishes, and concerns. To achieve patient-centered care, we propose a new standard for Artificial Intelligence (AI) clinical summarization tasks: Patient-Centered Summaries (PCS). Our objective was to develop a framework to generate PCS that capture patient values and ensure clinical utility and to assess whether current open-source LLMs can achieve human-level performance in this task. We used a mixed-methods process. Two Patient and Public Involvement groups (10 patients and 8 clinicians) in the United Kingdom participated in semi-structured interviews exploring what personal and contextual information should be included in clinical summaries and how it should be structured for clinical use. Findings informed annotation guidelines used by eight clinicians to create gold-standard PCS from 88 atrial fibrillation consultations. Sixteen consultations were used to refine a prompt aligned with the guidelines. Five open-source LLMs (Llama-3.2-3B, Llama-3.1-8B, Mistral-8B, Gemma-3-4B, and Qwen3-8B) generated summaries for 72 consultations using zero-shot and few-shot prompting, evaluated with ROUGE-L, BERTScore, and qualitative metrics. Patients emphasized lifestyle routines, social support, recent stressors, and care values. Clinicians sought concise functional, psychosocial, and emotional context. The best zero-shot performance was achieved by Mistral-8B (ROUGE-L 0.189) and Llama-3.1-8B (BERTScore 0.673); the best few-shot by Llama-3.1-8B (ROUGE-L 0.206, BERTScore 0.683). Completeness and fluency were similar between experts and models, while correctness and patient-centeredness favored human PCS.

12. 【2510.27532】SQLSpace: A Representation Space for Text-to-SQL to Discover and Mitigate Robustness Gaps

链接https://arxiv.org/abs/2510.27532

作者:Neha Srikanth,Victor Bursztyn,Puneet Mathur,Ani Nenkova

类目:Computation and Language (cs.CL)

关键词:minimal human intervention, human intervention, derived with minimal, minimal human, compact representation

备注: Accepted to EMNLP Findings

点击查看摘要

Abstract:We introduce SQLSpace, a human-interpretable, generalizable, compact representation for text-to-SQL examples derived with minimal human intervention. We demonstrate the utility of these representations in evaluation with three use cases: (i) closely comparing and contrasting the composition of popular text-to-SQL benchmarks to identify unique dimensions of examples they evaluate, (ii) understanding model performance at a granular level beyond overall accuracy scores, and (iii) improving model performance through targeted query rewriting based on learned correctness estimation. We show that SQLSpace enables analysis that would be difficult with raw examples alone: it reveals compositional differences between benchmarks, exposes performance patterns obscured by accuracy alone, and supports modeling of query success.

13. 【2510.27516】BiSparse-AAS: Bilinear Sparse Attention and Adaptive Spans Framework for Scalable and Efficient Text Summarization

链接https://arxiv.org/abs/2510.27516

作者:Desta Haileselassie Hagos,Legand L. Burge,Anietie Andy,Anis Yazidi,Vladimir Vlassov

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:quadratic complexity limits, Transformer-based architectures, complexity limits scalability, Adaptive Spans, long documents

备注: Accepted at the IEEE International Conference on Data Mining (ICDM) 2025, Washington, DC, USA

点击查看摘要

Abstract:Transformer-based architectures have advanced text summarization, yet their quadratic complexity limits scalability on long documents. This paper introduces BiSparse-AAS (Bilinear Sparse Attention with Adaptive Spans), a novel framework that combines sparse attention, adaptive spans, and bilinear attention to address these limitations. Sparse attention reduces computational costs by focusing on the most relevant parts of the input, while adaptive spans dynamically adjust the attention ranges. Bilinear attention complements both by modeling complex token interactions within this refined context. BiSparse-AAS consistently outperforms state-of-the-art baselines in both extractive and abstractive summarization tasks, achieving average ROUGE improvements of about 68.1% on CNN/DailyMail and 52.6% on XSum, while maintaining strong performance on OpenWebText and Gigaword datasets. By addressing efficiency, scalability, and long-sequence modeling, BiSparse-AAS provides a unified, practical solution for real-world text summarization applications.

14. 【2510.27512】Effect of Domain Generalization Techniques in Low Resource Systems

链接https://arxiv.org/abs/2510.27512

作者:Mahi Aminu,Chisom Chibuike,Fatimo Adebanjo,Omokolade Awosanya,Samuel Oyeneye

类目:Computation and Language (cs.CL)

关键词:real-world scenarios due, Machine learning models, models typically assume, test data follow, Machine learning

备注

点击查看摘要

Abstract:Machine learning models typically assume that training and test data follow the same distribution, an assumption that often fails in real-world scenarios due to distribution shifts. This issue is especially pronounced in low-resource settings, where data scarcity and limited domain diversity hinder robust generalization. Domain generalization (DG) approaches address this challenge by learning features that remain invariant across domains, often using causal mechanisms to improve model robustness. In this study, we examine two distinct causal DG techniques in low-resource natural language tasks. First, we investigate a causal data augmentation (CDA) approach that automatically generates counterfactual examples to improve robustness to spurious correlations. We apply this method to sentiment classification on the NaijaSenti Twitter corpus, expanding the training data with semantically equivalent paraphrases to simulate controlled distribution shifts. Second, we explore an invariant causal representation learning (ICRL) approach using the DINER framework, originally proposed for debiasing aspect-based sentiment analysis. We adapt DINER to a multilingual setting. Our findings demonstrate that both approaches enhance robustness to unseen domains: counterfactual data augmentation yields consistent cross-domain accuracy gains in sentiment classification, while causal representation learning with DINER improves out-of-distribution performance in multilingual sentiment analysis, albeit with varying gains across languages.

15. 【2510.27484】hought Branches: Interpreting LLM Reasoning Requires Resampling

链接https://arxiv.org/abs/2510.27484

作者:Uzay Macar,Paul C. Bogdan,Senthooran Rajamanoharan,Neel Nanda

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:work interpreting reasoning, work interpreting, models define distributions, model, define distributions

备注: Uzay Macar and Paul C. Bogdan contributed equally to this work, and their listed order was determined by coinflip

点击查看摘要

Abstract:Most work interpreting reasoning models studies only a single chain-of-thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, it can be understood by sampling. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In "agentic misalignment" scenarios, we resample specific sentences to measure their downstream effects. Self-preservation sentences have small causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to CoT sufficient for steering reasoning? These are common in literature, yet take the model off-policy. Resampling and selecting a completion with the desired property is a principled on-policy alternative. We find off-policy interventions yield small and unstable effects compared to resampling in decision-making tasks. Third, how do we understand the effect of removing a reasoning step when the model may repeat it post-edit? We introduce a resilience metric that repeatedly resamples to prevent similar content from reappearing downstream. Critical planning statements resist removal but have large effects when eliminated. Fourth, since CoT is sometimes "unfaithful", can our methods teach us anything in these settings? Adapting causal mediation analysis, we find that hints that have a causal effect on the output without being explicitly mentioned exert a subtle and cumulative influence on the CoT that persists even if the hint is removed. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions.

16. 【2510.27477】he aftermath of compounds: Investigating Compounds and their Semantic Representations

链接https://arxiv.org/abs/2510.27477

作者:Swarang Joshi

类目:Computation and Language (cs.CL)

关键词:English compound words, Edinburgh Associative Thesaurus, English compound, computational embeddings align, study investigates

备注: IJCNLP-AACL SRW 2025

点击查看摘要

Abstract:This study investigates how well computational embeddings align with human semantic judgments in the processing of English compound words. We compare static word vectors (GloVe) and contextualized embeddings (BERT) against human ratings of lexeme meaning dominance (LMD) and semantic transparency (ST) drawn from a psycholinguistic dataset. Using measures of association strength (Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC), we compute embedding-derived LMD and ST metrics and assess their relationships with human judgments via Spearmans correlation and regression analyses. Our results show that BERT embeddings better capture compositional semantics than GloVe, and that predictability ratings are strong predictors of semantic transparency in both human and model data. These findings advance computational psycholinguistics by clarifying the factors that drive compound word processing and offering insights into embedding-based semantic modeling.

17. 【2510.27469】Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning

链接https://arxiv.org/abs/2510.27469

作者:Chenyang Shao,Sijian Ren,Fengli Xu,Yong Li

类目:Computation and Language (cs.CL)

关键词:witnessed remarkable advancements, law consistently enhancing, scaling law consistently, large language models, recent years

备注

点击查看摘要

Abstract:In recent years, large language models (LLMs) have witnessed remarkable advancements, with the test-time scaling law consistently enhancing the reasoning capabilities. Through systematic evaluation and exploration of a diverse spectrum of intermediate thoughts, LLMs demonstrate the potential to generate deliberate reasoning steps, thereby substantially enhancing reasoning accuracy. However, LLMs' autoregressive generation paradigm results in reasoning performance scaling sub-optimally with test-time computation, often requiring excessive computational overhead to propose thoughts while yielding only marginal performance gains. In contrast, diffusion language models (DLMs) can efficiently produce diverse samples through parallel denoising in a single forward pass, inspiring us to leverage them for proposing intermediate thoughts, thereby alleviating the computational burden associated with autoregressive generation while maintaining quality. In this work, we propose an efficient collaborative reasoning framework, leveraging DLMs to generate candidate thoughts and LLMs to evaluate their quality. Experiments across diverse benchmarks demonstrate that our framework achieves strong performance in complex reasoning tasks, offering a promising direction for future research. Our code is open-source at this https URL.

18. 【2510.27462】VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision

链接https://arxiv.org/abs/2510.27462

作者:Xuan Gong,Senmiao Wang,Hanbo Huang,Ruoyu Sun,Shiyu Liang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Supervised fine-tuning, large language models, trajectories has emerged, crucial technique, technique for enhancing

备注: Under Review

点击查看摘要

Abstract:Supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has emerged as a crucial technique for enhancing the reasoning abilities of large language models (LLMs). However, the standard cross-entropy loss treats all tokens equally, ignoring their heterogeneous contributions across a reasoning trajectory. This uniform treatment leads to misallocated supervision and weak generalization, especially in complex, long-form reasoning tasks. To address this, we introduce \textbf{V}ariance-\textbf{C}ontrolled \textbf{O}ptimization-based \textbf{RE}weighting (VCORE), a principled framework that reformulates CoT supervision as a constrained optimization problem. By adopting an optimization-theoretic perspective, VCORE enables a principled and adaptive allocation of supervision across tokens, thereby aligning the training objective more closely with the goal of robust reasoning generalization. Empirical evaluations demonstrate that VCORE consistently outperforms existing token reweighting methods. Across both in-domain and out-of-domain settings, VCORE achieves substantial performance gains on mathematical and coding benchmarks, using models from the Qwen3 series (4B, 8B, 32B) and LLaMA-3.1-8B-Instruct. Moreover, we show that VCORE serves as a more effective initialization for subsequent reinforcement learning, establishing a stronger foundation for advancing the reasoning capabilities of LLMs. The Code will be released at this https URL.

19. 【2510.27419】DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains

链接https://arxiv.org/abs/2510.27419

作者:Tian Liang,Wenxiang Jiao,Zhiwei He,Jiahao Xu,Haitao Mi,Dong Yu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:demonstrated impressive capabilities, Large Reasoning Models, Large Reasoning, underthinking complex, demonstrated impressive

备注: Work in progress

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have demonstrated impressive capabilities but suffer from cognitive inefficiencies like ``overthinking'' simple problems and ``underthinking'' complex ones. While existing methods that use supervised fine-tuning~(SFT) or reinforcement learning~(RL) with token-length rewards can improve efficiency, they often do so at the cost of accuracy. This paper introduces \textbf{DeepCompress}, a novel framework that simultaneously enhances both the accuracy and efficiency of LRMs. We challenge the prevailing approach of consistently favoring shorter reasoning paths, showing that longer responses can contain a broader range of correct solutions for difficult problems. DeepCompress employs an adaptive length reward mechanism that dynamically classifies problems as ``Simple'' or ``Hard'' in real-time based on the model's evolving capability. It encourages shorter, more efficient reasoning for ``Simple'' problems while promoting longer, more exploratory thought chains for ``Hard'' problems. This dual-reward strategy enables the model to autonomously adjust its Chain-of-Thought (CoT) length, compressing reasoning for well-mastered problems and extending it for those it finds challenging. Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency.

20. 【2510.27418】Dynamic Affective Memory Management for Personalized LLM Agents

链接https://arxiv.org/abs/2510.27418

作者:Junfeng Lu,Yueyan Li

类目:Computation and Language (cs.CL)

关键词:large language models, Advances in large, research focus, large language, language models

备注: 12 pasges, 8 figures

点击查看摘要

Abstract:Advances in large language models are making personalized AI agents a new research focus. While current agent systems primarily rely on personalized external memory databases to deliver customized experiences, they face challenges such as memory redundancy, memory staleness, and poor memory-context integration, largely due to the lack of effective memory updates during interaction. To tackle these issues, we propose a new memory management system designed for affective scenarios. Our approach employs a Bayesian-inspired memory update algorithm with the concept of memory entropy, enabling the agent to autonomously maintain a dynamically updated memory vector database by minimizing global entropy to provide more personalized services. To better evaluate the system's effectiveness in this context, we propose DABench, a benchmark focusing on emotional expression and emotional change toward objects. Experimental results demonstrate that, our system achieves superior performance in personalization, logical coherence, and accuracy. Ablation studies further validate the effectiveness of the Bayesian-inspired update mechanism in alleviating memory bloat. Our work offers new insights into the design of long-term memory systems.

21. 【2510.27413】Atlas-Alignment: Making Interpretability Transferable Across Language Models

链接https://arxiv.org/abs/2510.27413

作者:Bruno Puri,Jim Berend,Sebastian Lapuschkin,Wojciech Samek

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:pipelines remain costly, existing interpretability pipelines, interpretability pipelines remain, building safe, difficult to scale

备注

点击查看摘要

Abstract:Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires costly training of model-specific sparse autoencoders, manual or semi-automated labeling of SAE components, and their subsequent validation. We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas - a labeled, human-interpretable latent space - using only shared inputs and lightweight representational alignment techniques. Once aligned, this enables two key capabilities in previously opaque models: (1) semantic feature search and retrieval, and (2) steering generation along human-interpretable atlas concepts. Through quantitative and qualitative evaluations, we show that simple representational alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept data. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in one high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.

22. 【2510.27407】Awal -- Community-Powered Language Technology for Tamazight

链接https://arxiv.org/abs/2510.27407

作者:Alp Öktem,Farida Boudichat

类目:Computation and Language (cs.CL)

关键词:paper presents Awal, presents Awal, developing language technology, language technology resources, paper presents

备注: Accepted to the International Conference on Information and Communication Technologies for Amazigh (TICAM 25)

点击查看摘要

Abstract:This paper presents Awal, a community-powered initiative for developing language technology resources for Tamazight. We provide a comprehensive review of the NLP landscape for Tamazight, examining recent progress in computational resources, and the emergence of community-driven approaches to address persistent data scarcity. Launched in 2024, this http URL platform addresses the underrepresentation of Tamazight in digital spaces through a collaborative platform enabling speakers to contribute translation and voice data. We analyze 18 months of community engagement, revealing significant barriers to participation including limited confidence in written Tamazight and ongoing standardization challenges. Despite widespread positive reception, actual data contribution remained concentrated among linguists and activists. The modest scale of community contributions -- 6,421 translation pairs and 3 hours of speech data -- highlights the limitations of applying standard crowdsourcing approaches to languages with complex sociolinguistic contexts. We are working on improved open-source MT models using the collected data.

23. 【2510.27400】Balancing Knowledge Updates: Toward Unified Modular Editing in LLMs

链接https://arxiv.org/abs/2510.27400

作者:Jiahao Liu,Zijian Wang,Kuo Zhao,Dong Hu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, language models, large language, Knowledge, updating factual knowledge

备注

点击查看摘要

Abstract:Knowledge editing has emerged as an efficient approach for updating factual knowledge in large language models (LLMs). It typically locates knowledge storage modules and then modifies their parameters. However, most existing methods focus on the weights of multilayer perceptron (MLP) modules, which are often identified as the main repositories of factual information. Other components, such as attention (Attn) modules, are often ignored during editing. This imbalance can leave residual outdated knowledge and limit editing effectiveness. We perform comprehensive knowledge localization experiments on advanced LLMs and find that Attn modules play a substantial role in factual knowledge storage and retrieval, especially in earlier layers. Based on these insights, we propose IntAttn-Edit, a method that extends the associative memory paradigm to jointly update both MLP and Attn modules. Our approach uses a knowledge balancing strategy that allocates update magnitudes in proportion to each module's measured contribution to knowledge storage. Experiments on standard benchmarks show that IntAttn-Edit achieves higher edit success, better generalization, and stronger knowledge preservation than prior methods. Further analysis shows that the balancing strategy keeps editing performance within an optimal range across diverse settings.

24. 【2510.27378】Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity

链接https://arxiv.org/abs/2510.27378

作者:Austin Meek,Eitan Sprejer,Iván Arcuschin,Austin J. Brockmeier,Steven Basart

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:CoT, model, reasoning, serial reasoning process, monitorability

备注: Project page at [this https URL](https://ajmeek.github.io/cot_monitorability_website/)

点击查看摘要

Abstract:Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model's external `working memory', a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.

25. 【2510.27369】From the Rock Floor to the Cloud: A Systematic Survey of State-of-the-Art NLP in Battery Life Cycle

链接https://arxiv.org/abs/2510.27369

作者:Tosin Adewumi,Martin Karlsson,Marcus Liwicki,Mikael Sjödahl,Lama Alkhaled,Rihab Gargouri,Nudrat Habib,Franz Hennie

类目:Computation and Language (cs.CL)

关键词:natural language processing, technical language processing, comprehensive systematic survey, language processing, Electronics Engineers Xplore

备注: 15 pages, 3 images

点击查看摘要

Abstract:We present a comprehensive systematic survey of the application of natural language processing (NLP) along the entire battery life cycle, instead of one stage or method, and introduce a novel technical language processing (TLP) framework for the EU's proposed digital battery passport (DBP) and other general battery predictions. We follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) method and employ three reputable databases or search engines, including Google Scholar, Institute of Electrical and Electronics Engineers Xplore (IEEE Xplore), and Scopus. Consequently, we assessed 274 scientific papers before the critical review of the final 66 relevant papers. We publicly provide artifacts of the review for validation and reproducibility. The findings show that new NLP tasks are emerging in the battery domain, which facilitate materials discovery and other stages of the life cycle. Notwithstanding, challenges remain, such as the lack of standard benchmarks. Our proposed TLP framework, which incorporates agentic AI and optimized prompts, will be apt for tackling some of the challenges.

26. 【2510.27355】houghtProbe: Classifier-Guided LLM Thought Space Exploration via Probing Representations

链接https://arxiv.org/abs/2510.27355

作者:Zijian Wang,Chang Xu

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, paper introduces ThoughtProbe, Large Language, features of Large

备注: EMNLP2025 main conference

点击查看摘要

Abstract:This paper introduces ThoughtProbe, a novel inference time framework that leverages the hidden reasoning features of Large Language Models (LLMs) to improve their reasoning performance. Unlike previous works that manipulate the hidden representations to steer LLM generation, we harness them as discriminative signals to guide the tree structured response space exploration. In each node expansion, a classifier serves as a scoring and ranking mechanism that efficiently allocates computational resources by prioritizing higher score candidates for continuation. After completing the tree expansion, we collect answers from all branches to form a candidate answer pool. We then propose a branch aggregation method that marginalizes over all supporting branches by aggregating their CoT scores, thereby identifying the optimal answer from the pool. Experimental results show that our framework's comprehensive exploration not only covers valid reasoning chains but also effectively identifies them, achieving significant improvements across multiple arithmetic reasoning benchmarks.

27. 【2510.27337】ransAlign: Machine Translation Encoders are Strong Word Aligners, Too

链接https://arxiv.org/abs/2510.27337

作者:Benedikt Ebing,Christian Goldschmied,Goran Glavaš

类目:Computation and Language (cs.CL)

关键词:language data translated, sizable training data, source language data, target language data, data translated

备注

点击查看摘要

Abstract:In the absence of sizable training data for most world languages and NLP tasks, translation-based strategies such as translate-test -- evaluating on noisy source language data translated from the target language -- and translate-train -- training on noisy target language data translated from the source language -- have been established as competitive approaches for cross-lingual transfer (XLT). For token classification tasks, these strategies require label projection: mapping the labels from each token in the original sentence to its counterpart(s) in the translation. To this end, it is common to leverage multilingual word aligners (WAs) derived from encoder language models such as mBERT or LaBSE. Despite obvious associations between machine translation (MT) and WA, research on extracting alignments with MT models is largely limited to exploiting cross-attention in encoder-decoder architectures, yielding poor WA results. In this work, in contrast, we propose TransAlign, a novel word aligner that utilizes the encoder of a massively multilingual MT model. We show that TransAlign not only achieves strong WA performance but substantially outperforms popular WA and state-of-the-art non-WA-based label projection methods in MT-based XLT for token classification.

28. 【2510.27328】A Unified Representation Underlying the Judgment of Large Language Models

链接https://arxiv.org/abs/2510.27328

作者:Yi-Long Lu,Jiajun Song,Wei Wang

类目:Computation and Language (cs.CL)

关键词:central architectural question, Large Language Models, domain-general resource, central architectural, biological and artificial

备注

点击查看摘要

Abstract:A central architectural question for both biological and artificial intelligence is whether judgment relies on specialized modules or a unified, domain-general resource. While the discovery of decodable neural representations for distinct concepts in Large Language Models (LLMs) has suggested a modular architecture, whether these representations are truly independent systems remains an open question. Here we provide evidence for a convergent architecture. Across a range of LLMs, we find that diverse evaluative judgments are computed along a dominant dimension, which we term the Valence-Assent Axis (VAA). This axis jointly encodes subjective valence ("what is good") and the model's assent to factual claims ("what is true"). Through direct interventions, we show this unified representation creates a critical dependency: the VAA functions as a control signal that steers the generative process to construct a rationale consistent with its evaluative state, even at the cost of factual accuracy. This mechanism, which we term the subordination of reasoning, shifts the process of reasoning from impartial inference toward goal-directed justification. Our discovery offers a mechanistic account for systemic bias and hallucination, revealing how an architecture that promotes coherent judgment can systematically undermine faithful reasoning.

29. 【2510.27313】Un-Attributability: Computing Novelty From Retrieval Semantic Similarity

链接https://arxiv.org/abs/2510.27313

作者:Philipp Davydov,Ameya Prabhu,Matthias Bethge,Elisa Nguyen,Seong Joon Oh

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Understanding how language-model, language-model outputs relate, studying model behavior, central to studying, Understanding

备注

点击查看摘要

Abstract:Understanding how language-model outputs relate to the pretraining corpus is central to studying model behavior. Most training data attribution (TDA) methods ask which training examples causally influence a given output, often using leave-one-out tests. We invert the question: which outputs cannot be attributed to any pretraining example? We introduce un-attributability as an operational measure of semantic novelty: an output is novel if the pretraining corpus contains no semantically similar context. We approximate this with a simple two-stage retrieval pipeline: index the corpus with lightweight GIST embeddings, retrieve the top-n candidates, then rerank with ColBERTv2. If the nearest corpus item is less attributable than a human-generated text reference, we consider the output of the model as novel. We evaluate on SmolLM and SmolLM2 and report three findings: (1) models draw on pretraining data across much longer spans than previously reported; (2) some domains systematically promote or suppress novelty; and (3) instruction tuning not only alters style but also increases novelty. Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale. We release ~20 TB of corpus chunks and index artifacts to support replication and large-scale extension of our analysis at this https URL

30. 【2510.27269】Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?

链接https://arxiv.org/abs/2510.27269

作者:Deokhyung Kang,Seonjeong Hwang,Daehui Kim,Hyounghun Kim,Gary Geunbae Lee

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:multilingual reasoning gap, complex reasoning tasks, multilingual reasoning, reasoning gap, Reasoning

备注

点击查看摘要

Abstract:Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still suffer from a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have reduced this gap, its underlying causes remain largely unexplored. In this paper, we address this by showing that the multilingual reasoning gap largely stems from failures in language understanding-the model's inability to represent the multilingual input meaning into the dominant language (i.e., English) within its reasoning trace. This motivates us to examine whether understanding failures can be detected, as this ability could help mitigate the multilingual reasoning gap. To this end, we evaluate a range of detection methods and find that understanding failures can indeed be identified, with supervised approaches performing best. Building on this, we propose Selective Translation, a simple yet effective strategy that translates the multilingual input into English only when an understanding failure is detected. Experimental results show that Selective Translation bridges the multilingual reasoning gap, achieving near full-translation performance while using translation for only about 20% of inputs. Together, our work demonstrates that understanding failures are the primary cause of the multilingual reasoning gap and can be detected and selectively mitigated, providing key insight into its origin and a promising path toward more equitable multilingual reasoning. Our code and data are publicly available at this https URL.

31. 【2510.27267】MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models

链接https://arxiv.org/abs/2510.27267

作者:Kangkun Mao,Jinru Ding,Jiayuan Chen,Mouxiao Bian,Ruiyao Chen,Xinwei Peng,Sijie Ren,Linyang Li,Jie Xu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:overlooking quantitative reasoning, quantitative reasoning critical, Glasgow Coma Scale, overlooking quantitative, question answering

备注

点击查看摘要

Abstract:As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at this https URL.

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2510.27267 [cs.CL]

(or
arXiv:2510.27267v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2510.27267

Focus to learn more

              arXiv-issued DOI via DataCite

Submission history From: Kangkun Mao [view email] [v1]
Fri, 31 Oct 2025 08:07:16 UTC (7,791 KB)

32. 【2510.27258】Higher-order Linear Attention

链接https://arxiv.org/abs/2510.27258

作者:Yifan Zhang,Zhen Qin,Quanquan Gu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:scaling autoregressive language, autoregressive language models, scaled dot-product attention, State Space Models, long contexts

备注: Project Page: [this https URL](https://github.com/yifanzhang-pro/HLA)

点击查看摘要

Abstract:The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: this https URL.

33. 【2510.27254】Languages are Modalities: Cross-Lingual Alignment via Encoder Injection

链接https://arxiv.org/abs/2510.27254

作者:Rajan Agarwal,Aarush Gupta

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Instruction-tuned Large Language, non-Latin scripts due, Large Language Models, Instruction-tuned Large, weak cross-lingual coupling

备注: 14 pages, 3 Figures

点击查看摘要

Abstract:Instruction-tuned Large Language Models (LLMs) underperform on low resource, non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling. We present LLINK (Latent Language Injection for Non-English Knowledge), a compute efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into K soft slots and trained with minimal adapters so the frozen decoder consumes the signal. LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged QA evaluations. We further find that improvements can be attributed to reduced tokenization inflation and a stronger cross lingual alignment, despite the model having residual weaknesses in numeric fidelity. Treating low resource languages as a modality offers a practical path to stronger cross-lingual alignment in lightweight LLMs.

34. 【2510.27246】Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

链接https://arxiv.org/abs/2510.27246

作者:Mohammad Tavakoli,Alireza Salemi,Carrie Ye,Mohamed Abdalla,Hamed Zamani,J Ross Mitchell

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:cover narrow domains, lack narrative coherence, simple recall-oriented tasks, test simple recall-oriented, large language models

备注

点击查看摘要

Abstract:Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT-a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval-augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5%-12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.

35. 【2510.27241】Identifying the Periodicity of Information in Natural Language

链接https://arxiv.org/abs/2510.27241

作者:Yulin Ou,Yu Wang,Yang Xu,Hendrik Buschmeier

类目:Computation and Language (cs.CL)

关键词:Recent theoretical advancement, Recent theoretical, natural language exhibit, natural language, theoretical advancement

备注

点击查看摘要

Abstract:Recent theoretical advancement of information density in natural language has brought the following question on desk: To what degree does natural language exhibit periodicity pattern in its encoded information? We address this question by introducing a new method called AutoPeriod of Surprisal (APS). APS adopts a canonical periodicity detection algorithm and is able to identify any significant periods that exist in the surprisal sequence of a single document. By applying the algorithm to a set of corpora, we have obtained the following interesting results: Firstly, a considerable proportion of human language demonstrates a strong pattern of periodicity in information; Secondly, new periods that are outside the distributions of typical structural units in text (e.g., sentence boundaries, elementary discourse units, etc.) are found and further confirmed via harmonic regression modeling. We conclude that the periodicity of information in language is a joint outcome from both structured factors and other driving factors that take effect at longer distances. The advantages of our periodicity detection method and its potentials in LLM-generation detection are further discussed.

36. 【2510.27238】DRAMA: Unifying Data Retrieval and Analysis for Open-Domain Analytic Queries

链接https://arxiv.org/abs/2510.27238

作者:Chuxuan Hu,Maxwell Yang,James Weiland,Yeji Lim,Suhas Palawala,Daniel Kang

类目:Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Manually conducting real-world, Manually conducting, data, labor-intensive and inefficient, analyses is labor-intensive

备注: Accepted to SIGMOD 2026

点击查看摘要

Abstract:Manually conducting real-world data analyses is labor-intensive and inefficient. Despite numerous attempts to automate data science workflows, none of the existing paradigms or systems fully demonstrate all three key capabilities required to support them effectively: (1) open-domain data collection, (2) structured data transformation, and (3) analytic reasoning. To overcome these limitations, we propose DRAMA, an end-to-end paradigm that answers users' analytic queries in natural language on large-scale open-domain data. DRAMA unifies data collection, transformation, and analysis as a single pipeline. To quantitatively evaluate system performance on tasks representative of DRAMA, we construct a benchmark, DRAMA-Bench, consisting of two categories of tasks: claim verification and question answering, each comprising 100 instances. These tasks are derived from real-world applications that have gained significant public attention and require the retrieval and analysis of open-domain data. We develop DRAMA-Bot, a multi-agent system designed following DRAMA. It comprises a data retriever that collects and transforms data by coordinating the execution of sub-agents, and a data analyzer that performs structured reasoning over the retrieved data. We evaluate DRAMA-Bot on DRAMA-Bench together with five state-of-the-art baseline agents. DRAMA-Bot achieves 86.5% task accuracy at a cost of $0.05, outperforming all baselines with up to 6.9 times the accuracy and less than 1/6 of the cost. DRAMA is publicly available at this https URL.

Comments:
Accepted to SIGMOD 2026

Subjects:

Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Cite as:
arXiv:2510.27238 [cs.DB]

(or
arXiv:2510.27238v1 [cs.DB] for this version)

https://doi.org/10.48550/arXiv.2510.27238

Focus to learn more

              arXiv-issued DOI via DataCite

Related DOI:

https://doi.org/10.1145/3769781

Focus to learn more

            DOI(s) linking to related resources</p>
37. 【2510.27196】MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models

链接https://arxiv.org/abs/2510.27196

作者:Zixin Chen,Hongzhan Lin,Kaixin Li,Ziyang Luo,Yayue Deng,Jing Ma

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, multimodal Large Language, Language Models, Large Language, social media necessitates

备注: EMNLP 2025

点击查看摘要

Abstract:The proliferation of memes on social media necessitates the capabilities of multimodal Large Language Models (mLLMs) to effectively understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs' detection accuracy for binary classification tasks, which often fail to reflect the in-depth interpretive nuance of harmfulness across diverse contexts. In this paper, we propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment for mLLMs' understanding of multimodal harmfulness. Specifically, MemeArena simulates diverse interpretive contexts to formulate evaluation tasks that elicit perspective-specific analyses from mLLMs. By integrating varied viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparisons of mLLMs' abilities to interpret multimodal harmfulness. Extensive experiments demonstrate that our framework effectively reduces the evaluation biases of judge agents, with judgment results closely aligning with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluations in multimodal harmfulness understanding. Our code and data are publicly available at this https URL.

38. 【2510.27195】Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions

链接https://arxiv.org/abs/2510.27195

作者:Caixin Kang,Yifei Huang,Liangyang Ouyang,Mingfang Zhang,Yoichi Sato

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Social and Information Networks (cs.SI)

关键词:critical frontier, robust social intelligence, increasingly integrated, human lives, Multimodal Large Language

备注

点击查看摘要

Abstract:As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Consequently, their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video, text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.

39. 【2510.27183】Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+

链接https://arxiv.org/abs/2510.27183

作者:Mason Shipton,York Hay Ng,Aditya Khan,Phuong Hanh Hoang,Xiang Lu,A. Seza Doğruöz,En-Shiun Annie Lee

类目:Computation and Language (cs.CL)

关键词:linguistic knowledge base, knowledge base supports, base supports multilingual, linguistic knowledge, URIEL

备注

点击查看摘要

Abstract:The URIEL+ linguistic knowledge base supports multilingual research by encoding languages through geographic, genetic, and typological vectors. However, data sparsity remains prevalent, in the form of missing feature types, incomplete language entries, and limited genealogical coverage. This limits the usefulness of URIEL+ in cross-lingual transfer, particularly for supporting low-resource languages. To address this sparsity, this paper extends URIEL+ with three contributions: introducing script vectors to represent writing system properties for 7,488 languages, integrating Glottolog to add 18,710 additional languages, and expanding lineage imputation for 26,449 languages by propagating typological and script features across genealogies. These additions reduce feature sparsity by 14% for script vectors, increase language coverage by up to 19,015 languages (1,007%), and improve imputation quality metrics by up to 33%. Our benchmark on cross-lingual transfer tasks (oriented around low-resource languages) shows occasionally divergent performance compared to URIEL+, with performance gains up to 6% in certain setups. Our advances make URIEL+ more complete and inclusive for multilingual research.

40. 【2510.27176】Glia: A Human-Inspired AI for Automated Systems Design and Optimization

链接https://arxiv.org/abs/2510.27176

作者:Pouya Hamadanian,Pantea Karimi,Arash Nasr-Esfahany,Kimia Noorbakhsh,Joseph Chandler,Ali ParandehGheibi,Mohammad Alizadeh,Hari Balakrishnan

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)

关键词:autonomously design mechanisms, human experts, mechanisms for computer, computer systems, reasoning

备注

点击查看摘要

Abstract:Can an AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts? We present Glia, an AI architecture for networked systems design that uses large language models (LLMs) in a human-inspired, multi-agent workflow. Each agent specializes in reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback. Unlike prior ML-for-systems methods that optimize black-box policies, Glia generates interpretable designs and exposes its reasoning process. When applied to a distributed GPU cluster for LLM inference, it produces new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior. Our results suggest that by combining reasoning LLMs with structured experimentation, an AI can produce creative and understandable designs for complex systems problems.

41. 【2510.27118】Probability Distributions Computed by Hard-Attention Transformers

链接https://arxiv.org/abs/2510.27118

作者:Andy Yang,Anej Svete,Jiaoda Li,Anthony Widjaja Lin,Jonathan Rawski,Ryan Cotterell,David Chiang

类目:Computation and Language (cs.CL)

关键词:generate strings autoregressively, autoregressively and probabilistically, language models, accept or reject, reject strings

备注: 18 pages

点击查看摘要

Abstract:Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.

42. 【2510.27106】Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks

链接https://arxiv.org/abs/2510.27106

作者:Rajarshi Haldar,Julia Hockenmaier

类目:Computation and Language (cs.CL)

关键词:Natural Language Generation, Natural Language, widely adopted, properly assessing, Language Generation

备注: Accepted at EMNLP 2025

点击查看摘要

Abstract:As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and see if judicious use of LLM judges can still be useful following proper guidelines.

43. 【2510.27087】Characterizing Selective Refusal Bias in Large Language Models

链接https://arxiv.org/abs/2510.27087

作者:Adel Khorramrouz,Sharon Levy

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:large language models, prevent malicious users, generating toxic content, large scale, language models

备注: 21 pages, 12 figures, 14 tables

点击查看摘要

Abstract:Safety guardrails in large language models(LLMs) are developed to prevent malicious users from generating toxic content at a large scale. However, these measures can inadvertently introduce or reflect new biases, as LLMs may refuse to generate harmful content targeting some demographic groups and not others. We explore this selective refusal bias in LLM guardrails through the lens of refusal rates of targeted individual and intersectional demographic groups, types of LLM responses, and length of generated refusals. Our results show evidence of selective refusal bias across gender, sexual orientation, nationality, and religion attributes. This leads us to investigate additional safety implications via an indirect attack, where we target previously refused groups. Our findings emphasize the need for more equitable and robust performance in safety guardrails across demographic groups.

44. 【2510.27077】Contrastive Knowledge Transfer and Robust Optimization for Secure Alignment of Large Language Models

链接https://arxiv.org/abs/2510.27077

作者:Jiasen Zheng,Huajun Zhang,Xu Yan,Ran Hao,Chong Peng

类目:Computation and Language (cs.CL)

关键词:large-scale language models, combines contrastive distillation, paper addresses, addresses the limitations, limitations of large-scale

备注

点击查看摘要

Abstract:This paper addresses the limitations of large-scale language models in safety alignment and robustness by proposing a fine-tuning method that combines contrastive distillation with noise-robust training. The method freezes the backbone model and transfers the knowledge boundaries of the teacher model to the student model through distillation, thereby improving semantic consistency and alignment accuracy. At the same time, noise perturbations and robust optimization constraints are introduced during training to ensure that the model maintains stable predictive outputs under noisy and uncertain inputs. The overall framework consists of distillation loss, robustness loss, and a regularization term, forming a unified optimization objective that balances alignment ability with resistance to interference. To systematically validate its effectiveness, the study designs experiments from multiple perspectives, including distillation weight sensitivity, stability analysis under computation budgets and mixed-precision environments, and the impact of data noise and distribution shifts on model performance. Results show that the method significantly outperforms existing baselines in knowledge transfer, robustness, and overall safety, achieving the best performance across several key metrics. This work not only enriches the theoretical system of parameter-efficient fine-tuning but also provides a new solution for building safer and more trustworthy alignment mechanisms.

45. 【2510.27063】owards a Measure of Algorithm Similarity

链接https://arxiv.org/abs/2510.27063

作者:Shairoz Sohail,Taher Ali

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Software Engineering (cs.SE)

关键词:EMOC, similarity, Abstract, EMOC features support, algorithm

备注: 11 pages, many figures and images

点击查看摘要

Abstract:Given two algorithms for the same problem, can we determine whether they are meaningfully different? In full generality, the question is uncomputable, and empirically it is muddied by competing notions of similarity. Yet, in many applications (such as clone detection or program synthesis) a pragmatic and consistent similarity metric is necessary. We review existing equivalence and similarity notions and introduce EMOC: An Evaluation-Memory-Operations-Complexity framework that embeds algorithm implementations into a feature space suitable for downstream tasks. We compile PACD, a curated dataset of verified Python implementations across three problems, and show that EMOC features support clustering and classification of algorithm types, detection of near-duplicates, and quantification of diversity in LLM-generated programs. Code, data, and utilities for computing EMOC embeddings are released to facilitate reproducibility and future work on algorithm similarity.

46. 【2510.27055】Detecting Data Contamination in LLMs via In-Context Learning

链接https://arxiv.org/abs/2510.27055

作者:Michał Zawalski,Meriem Boubdir,Klaudia Bałazy,Besmira Nushi,Pablo Ribalta

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Detection via Context, present Contamination Detection, large language models, Contamination Detection, quantify training data

备注

点击查看摘要

Abstract:We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.

47. 【2510.27054】LLM-Centric RAG with Multi-Granular Indexing and Confidence Constraints

链接https://arxiv.org/abs/2510.27054

作者:Xiaofan Guo,Yaxuan Luan,Yue Kang,Xiangchen Song,Jinxu Guo

类目:Computation and Language (cs.CL)

关键词:complex knowledge environments, integrates multi-granularity memory, paper addresses, addresses the issues, issues of insufficient

备注

点击查看摘要

Abstract:This paper addresses the issues of insufficient coverage, unstable results, and limited reliability in retrieval-augmented generation under complex knowledge environments, and proposes a confidence control method that integrates multi-granularity memory indexing with uncertainty estimation. The method builds a hierarchical memory structure that divides knowledge representations into different levels of granularity, enabling dynamic indexing and retrieval from local details to global context, and thus establishing closer semantic connections between retrieval and generation. On this basis, an uncertainty estimation mechanism is introduced to explicitly constrain and filter low-confidence paths during the generation process, allowing the model to maintain information coverage while effectively suppressing noise and false content. The overall optimization objective consists of generation loss, entropy constraints, and variance regularization, forming a unified confidence control framework. In the experiments, comprehensive sensitivity tests and comparative analyses were designed, covering hyperparameters, environmental conditions, and data structures, to verify the stability and robustness of the proposed method across different scenarios. The results show that the method achieves superior performance over existing models in QA accuracy, retrieval recall, ranking quality, and factual consistency, demonstrating the effectiveness of combining multi-granularity indexing with confidence control. This study not only provides a new technical pathway for retrieval-augmented generation but also offers practical evidence for improving the reliability and controllability of large models in complex contexts.

48. 【2510.27052】VISTA Score: Verification In Sequential Turn-based Assessment

链接https://arxiv.org/abs/2510.27052

作者:Ashley Lewis,Andrew Perrault,Eric Fosler-Lussier,Michael White

类目:Computation and Language (cs.CL)

关键词:demand factual reliability, generating statements unsupported, remains a major, Sequential Turn-based Assessment, major obstacle

备注

点击查看摘要

Abstract:Hallucination--defined here as generating statements unsupported or contradicted by available evidence or conversational context--remains a major obstacle to deploying conversational AI systems in settings that demand factual reliability. Existing metrics either evaluate isolated responses or treat unverifiable content as errors, limiting their use for multi-turn dialogue. We introduce VISTA (Verification In Sequential Turn-based Assessment), a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking. VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements (subjective, contradicted, lacking evidence, or abstaining). Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms that VISTA's decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks. By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.

49. 【2510.27049】Recursive numeral systems are highly regular and easy to process

链接https://arxiv.org/abs/2510.27049

作者:Ponrawee Prasertsom,Andrea Silvi,Jennifer Culbertson,Moa Johansson,Devdatt Dubhashi,Kenny Smith

类目:Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)

关键词:Denić and Szymanik, average morphosyntatic complexity, recursive numeral systems, numeral systems optimise, lexicon size

备注

点击查看摘要

Abstract:Previous work has argued that recursive numeral systems optimise the trade-off between lexicon size and average morphosyntatic complexity (Denić and Szymanik, 2024). However, showing that only natural-language-like systems optimise this tradeoff has proven elusive, and the existing solution has relied on ad-hoc constraints to rule out unnatural systems (Yang and Regier, 2025). Here, we argue that this issue arises because the proposed trade-off has neglected regularity, a crucial aspect of complexity central to human grammars in general. Drawing on the Minimum Description Length (MDL) approach, we propose that recursive numeral systems are better viewed as efficient with regard to their regularity and processing complexity. We show that our MDL-based measures of regularity and processing complexity better capture the key differences between attested, natural systems and unattested but possible ones, including "optimal" recursive numeral systems from previous work, and that the ad-hoc constraints from previous literature naturally follow from regularity. Our approach highlights the need to incorporate regularity across sets of forms in studies that attempt to measure and explain optimality in language.

50. 【2510.27045】Quantitative Intertextuality from the Digital Humanities Perspective: A Survey

链接https://arxiv.org/abs/2510.27045

作者:Siyu Duan

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:important theoretical basis, literary theory, connection between texts, texts is referred, important theoretical

备注

点击查看摘要

Abstract:The connection between texts is referred to as intertextuality in literary theory, which served as an important theoretical basis in many digital humanities studies. Over the past decade, advancements in natural language processing have ushered intertextuality studies into the quantitative age. Large-scale intertextuality research based on cutting-edge methods has continuously emerged. This paper provides a roadmap for quantitative intertextuality studies, summarizing their data, methods, and applications. Drawing on data from multiple languages and topics, this survey reviews methods from statistics to deep learning. It also summarizes their applications in humanities and social sciences research and the associated platform tools. Driven by advances in computer technology, more precise, diverse, and large-scale intertext studies can be anticipated. Intertextuality holds promise for broader application in interdisciplinary research bridging AI and the humanities.

51. 【2510.27038】Dataset Creation and Baseline Models for Sexism Detection in Hausa

链接https://arxiv.org/abs/2510.27038

作者:Fatima Adam Muhammad,Shamsuddeen Muhammad Hassan,Isa Inuwa-Dutse

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:reinforces gender inequality, Sexism reinforces gender, perpetuating stereotypes, discriminatory norms, reinforces gender

备注: 9 pages, 1 figure, 4 tables

点击查看摘要

Abstract:Sexism reinforces gender inequality and social exclusion by perpetuating stereotypes, bias, and discriminatory norms. Noting how online platforms enable various forms of sexism to thrive, there is a growing need for effective sexism detection and mitigation strategies. While computational approaches to sexism detection are widespread in high-resource languages, progress remains limited in low-resource languages where limited linguistic resources and cultural differences affect how sexism is expressed and perceived. This study introduces the first Hausa sexism detection dataset, developed through community engagement, qualitative coding, and data augmentation. For cultural nuances and linguistic representation, we conducted a two-stage user study (n=66) involving native speakers to explore how sexism is defined and articulated in everyday discourse. We further experiment with both traditional machine learning classifiers and pre-trained multilingual language models and evaluating the effectiveness few-shot learning in detecting sexism in Hausa. Our findings highlight challenges in capturing cultural nuance, particularly with clarification-seeking and idiomatic expressions, and reveal a tendency for many false positives in such cases.

52. 【2510.27037】Elastic Architecture Search for Efficient Language Models

链接https://arxiv.org/abs/2510.27037

作者:Shang Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

关键词:raised significant economic, natural language understanding, large pre-trained language, pre-trained language models, Elastic Language Model

备注: ICME 2025

点击查看摘要

Abstract:As large pre-trained language models become increasingly critical to natural language understanding (NLU) tasks, their substantial computational and memory requirements have raised significant economic and environmental concerns. Addressing these challenges, this paper introduces the Elastic Language Model (ELM), a novel neural architecture search (NAS) method optimized for compact language models. ELM extends existing NAS approaches by introducing a flexible search space with efficient transformer blocks and dynamic modules for dimension and head number adjustment. These innovations enhance the efficiency and flexibility of the search process, which facilitates more thorough and effective exploration of model architectures. We also introduce novel knowledge distillation losses that preserve the unique characteristics of each block, in order to improve the discrimination between architectural choices during the search process. Experiments on masked language modeling and causal language modeling tasks demonstrate that models discovered by ELM significantly outperform existing methods.

53. 【2510.27017】Kad: A Framework for Proxy-based Test-time Alignment with Knapsack Approximation Deferral

链接https://arxiv.org/abs/2510.27017

作者:Ayoub Hammal,Pierre Zweigenbaum,Caio Corro

类目:Computation and Language (cs.CL)

关键词:large language models, previous works concluded, largest part, part of generation, generation capabilities

备注

点击查看摘要

Abstract:Several previous works concluded that the largest part of generation capabilities of large language models (LLM) are learned (early) during pre-training. However, LLMs still require further alignment to adhere to downstream task requirements and stylistic preferences, among other desired properties. As LLMs continue to scale in terms of size, the computational cost of alignment procedures increase prohibitively. In this work, we propose a novel approach to circumvent these costs via proxy-based test-time alignment, i.e. using guidance from a small aligned model. Our approach can be described as token-specific cascading method, where the token-specific deferral rule is reduced to 0-1 knapsack problem. In this setting, we derive primal and dual approximations of the optimal deferral decision. We experimentally show the benefits of our method both in task performance and speculative decoding speed.

54. 【2510.27016】Semantically-Aware LLM Agent to Enhance Privacy in Conversational AI Services

链接https://arxiv.org/abs/2510.27016

作者:Jayden Serenari,Stephen Lee

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Personally Identifiable Information, interactions with Large, share sensitive personal

备注: Accepted to IEEE Big Data 2025

点击查看摘要

Abstract:With the increasing use of conversational AI systems, there is growing concern over privacy leaks, especially when users share sensitive personal data in interactions with Large Language Models (LLMs). Conversations shared with these models may contain Personally Identifiable Information (PII), which, if exposed, could lead to security breaches or identity theft. To address this challenge, we present the Local Optimizations for Pseudonymization with Semantic Integrity Directed Entity Detection (LOPSIDED) framework, a semantically-aware privacy agent designed to safeguard sensitive PII data when using remote LLMs. Unlike prior work that often degrade response quality, our approach dynamically replaces sensitive PII entities in user prompts with semantically consistent pseudonyms, preserving the contextual integrity of conversations. Once the model generates its response, the pseudonyms are automatically depseudonymized, ensuring the user receives an accurate, privacy-preserving output. We evaluate our approach using real-world conversations sourced from ShareGPT, which we further augment and annotate to assess whether named entities are contextually relevant to the model's response. Our results show that LOPSIDED reduces semantic utility errors by a factor of 5 compared to baseline techniques, all while enhancing privacy.

55. 【2510.26978】Semantic Frame Aggregation-based Transformer for Live Video Comment Generation

链接https://arxiv.org/abs/2510.26978

作者:Anam Fatima,Yi Yu,Janak Kapuriya,Julien Lalanne,Jainendra Shukla

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:enhancing viewer engagement, video, surged in popularity, popularity on platforms, engagement through dynamic

备注

点击查看摘要

Abstract:Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook an important aspect of prioritizing video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP's visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to ongoing viewer conversation. It employs an efficient weighted sum of frames technique to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder with a cross-attention mechanism that attends to each modality ensures that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.

56. 【2510.26974】Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations

链接https://arxiv.org/abs/2510.26974

作者:Jean-Philippe Corbeil,Asma Ben Abacha,Jerome Tremblay,Phillip Swazinna,Akila Jeeson Daniel,Miguel Del-Agua,Francois Beaulieu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Electronic Health Records, Health Records remains, Records remains unexplored, Clinical documentation increasingly, automatic speech recognition

备注

点击查看摘要

Abstract:Clinical documentation increasingly uses automatic speech recognition and summarization, yet converting conversations into actionable medical orders for Electronic Health Records remains unexplored. A solution to this problem can significantly reduce the documentation burden of clinicians and directly impact downstream patient care. We introduce the MEDIQA-OE 2025 shared task, the first challenge on extracting medical orders from doctor-patient conversations. Six teams participated in the shared task and experimented with a broad range of approaches, and both closed- and open-weight large language models (LLMs). In this paper, we describe the MEDIQA-OE task, dataset, final leaderboard ranking, and participants' solutions.

57. 【2510.26969】Frame Semantic Patterns for Identifying Underreporting of Notifiable Events in Healthcare: The Case of Gender-Based Violence

链接https://arxiv.org/abs/2510.26969

作者:Lívia Dutra,Arthur Lorenzi,Laís Berno,Franciany Campos,Karoline Biscardi,Kenneth Brown,Marcelo Viridiano,Frederico Belcavello,Ely Matos,Olívia Guaranha,Erik Santos,Sofia Reinach,Tiago Timponi Torrent

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:domain of healthcare, identification of notifiable, notifiable events, e-medical records, methodology harnesses semantic

备注

点击查看摘要

Abstract:We introduce a methodology for the identification of notifiable events in the domain of healthcare. The methodology harnesses semantic frames to define fine-grained patterns and search them in unstructured data, namely, open-text fields in e-medical records. We apply the methodology to the problem of underreporting of gender-based violence (GBV) in e-medical records produced during patients' visits to primary care units. A total of eight patterns are defined and searched on a corpus of 21 million sentences in Brazilian Portuguese extracted from e-SUS APS. The results are manually evaluated by linguists and the precision of each pattern measured. Our findings reveal that the methodology effectively identifies reports of violence with a precision of 0.726, confirming its robustness. Designed as a transparent, efficient, low-carbon, and language-agnostic pipeline, the approach can be easily adapted to other health surveillance contexts, contributing to the broader, ethical, and explainable use of NLP in public health systems.

58. 【2510.26935】RepV: Safety-Separable Latent Spaces for Scalable Neurosymbolic Plan Verification

链接https://arxiv.org/abs/2510.26935

作者:Yunhao Yang,Neel P. Bhatt,Pranay Samineni,Rohan Siva,Zhanyang Wang,Ufuk Topcu

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)

关键词:well-defined rules remains, remains a challenge, systems migrate, migrate to safety-critical, actions comply

备注: Code and data are available at: [this https URL](https://repv-project.github.io/)

点击查看摘要

Abstract:As AI systems migrate to safety-critical domains, verifying that their actions comply with well-defined rules remains a challenge. Formal methods provide provable guarantees but demand hand-crafted temporal-logic specifications, offering limited expressiveness and accessibility. Deep learning approaches enable evaluation of plans against natural-language constraints, yet their opaque decision process invites misclassifications with potentially severe consequences. We introduce RepV, a neurosymbolic verifier that unifies both views by learning a latent space where safe and unsafe plans are linearly separable. Starting from a modest seed set of plans labeled by an off-the-shelf model checker, RepV trains a lightweight projector that embeds each plan, together with a language model-generated rationale, into a low-dimensional space; a frozen linear boundary then verifies compliance for unseen natural-language rules in a single forward pass. Beyond binary classification, RepV provides a probabilistic guarantee on the likelihood of correct verification based on its position in the latent space. This guarantee enables a guarantee-driven refinement of the planner, improving rule compliance without human annotations. Empirical evaluations show that RepV improves compliance prediction accuracy by up to 15% compared to baseline methods while adding fewer than 0.2M parameters. Furthermore, our refinement framework outperforms ordinary fine-tuning baselines across various planning domains. These results show that safety-separable latent spaces offer a scalable, plug-and-play primitive for reliable neurosymbolic plan verification. Code and data are available at: this https URL.

Comments:
Code and data are available at: this https URL

Subjects:

Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)

Cite as:
arXiv:2510.26935 [cs.RO]

(or
arXiv:2510.26935v1 [cs.RO] for this version)

https://doi.org/10.48550/arXiv.2510.26935

Focus to learn more

              arXiv-issued DOI via DataCite</p>
59. 【2510.26912】Understanding and Enhancing Mamba-Transformer Hybrids for Memory Recall and Language Modeling

链接https://arxiv.org/abs/2510.26912

作者:Hyunji Lee,Wenhao Yu,Hongming Zhang,Kaixin Ma,Jiyeon Kim,Dong Yu,Minjoon Seo

类目:Computation and Language (cs.CL)

关键词:combine state space, shown strong performance, state space models, high recall ability, combine state

备注

点击查看摘要

Abstract:Hybrid models that combine state space models (SSMs) with attention mechanisms have shown strong performance by leveraging the efficiency of SSMs and the high recall ability of attention. However, the architectural design choices behind these hybrid models remain insufficiently understood. In this work, we analyze hybrid architectures through the lens of memory utilization and overall performance, and propose a complementary method to further enhance their effectiveness. We first examine the distinction between sequential and parallel integration of SSM and attention layers. Our analysis reveals several interesting findings, including that sequential hybrids perform better on shorter contexts, whereas parallel hybrids are more effective for longer contexts. We also introduce a data-centric approach of continually training on datasets augmented with paraphrases, which further enhances recall while preserving other capabilities. It generalizes well across different base models and outperforms architectural modifications aimed at enhancing recall. Our findings provide a deeper understanding of hybrid SSM-attention models and offer practical guidance for designing architectures tailored to various use cases. Our findings provide a deeper understanding of hybrid SSM-attention models and offer practical guidance for designing architectures tailored to various use cases.

60. 【2510.26887】he Denario project: Deep knowledge AI agents for scientific discovery

链接https://arxiv.org/abs/2510.26887

作者:Francisco Villaescusa-Navarro,Boris Bolliet,Pablo Villanueva-Domingo,Adrian E. Bayer,Aidan Acquah,Chetana Amancharla,Almog Barzilay-Siegal,Pablo Bermejo,Camille Bilodeau,Pablo Cárdenas Ramírez,Miles Cranmer,Urbano L. França,ChangHoon Hahn,Yan-Fei Jiang,Raul Jimenez,Jun-Young Lee,Antonio Lerario,Osman Mamun,Thomas Meier,Anupam A. Ojha,Pavlos Protopapas,Shimanto Roy,David N. Spergel,Pedro Tarancón-Álvarez,Ujjwal Tiwari,Matteo Viel,Digvijay Wadekar,Chi Wang,Bonny Y. Wang,Licong Xu,Yossi Yovel,Shuwen Yue,Wen-Han Zhou,Qiyao Zhu,Jiajun Zou,Íñigo Zubeldia

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词:multi-agent system designed, scientific research assistant, designed to serve, research assistant, present Denario

备注: 272 pages. Examples of 11 AI-generated paper drafts from different scientific disciplines. Code publicly available at [this https URL](https://github.com/AstroPilot-AI/Denario)

点击查看摘要

Abstract:We present Denario, an AI multi-agent system designed to serve as a scientific research assistant. Denario can perform many different tasks, such as generating ideas, checking the literature, developing research plans, writing and executing code, making plots, and drafting and reviewing a scientific paper. The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end-to-end scientific analysis using Cmbagent as a deep-research backend. In this work, we describe in detail Denario and its modules, and illustrate its capabilities by presenting multiple AI-generated papers generated by it in many different scientific disciplines such as astrophysics, biology, biophysics, biomedical informatics, chemistry, material science, mathematical physics, medicine, neuroscience and planetary science. Denario also excels at combining ideas from different disciplines, and we illustrate this by showing a paper that applies methods from quantum physics and machine learning to astrophysical data. We report the evaluations performed on these papers by domain experts, who provided both numerical scores and review-like feedback. We then highlight the strengths, weaknesses, and limitations of the current system. Finally, we discuss the ethical implications of AI-driven research and reflect on how such technology relates to the philosophy of science. We publicly release the code at this https URL. A Denario demo can also be run directly on the web at this https URL, and the full app will be deployed on the cloud.

61. 【2510.26861】Evaluating Perspectival Biases in Cross-Modal Retrieval

链接https://arxiv.org/abs/2510.26861

作者:Teerapol Saengsukhiran,Peerawat Chomphooyod,Narabodee Rodjananant,Chompakorn Chaksangchaichot,Patawee Prakrankamanant,Witthawin Sripheanpol,Pak Lovichit,SarChaksaana Nutanong,Ekapol Chuangsuwanich

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:semantic space, expected to operate, cultural origin, bias, prevalence

备注

点击查看摘要

Abstract:Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias refers to the tendency to favor entries from prevalent languages over semantically faithful entries in image-to-text retrieval. Second, association bias refers to the tendency to favor images culturally associated with the query over semantically correct ones in text-to-image retrieval. Results show that explicit alignment is a more effective strategy for mitigating prevalence bias. However, association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling and that bias arising from cultural association may be treated as a more challenging problem than one arising from linguistic prevalence.

62. 【2510.26852】CATArena: Evaluation of LLM Agents through Iterative Tournament Competitions

链接https://arxiv.org/abs/2510.26852

作者:Lingyue Fu,Xin Ding,Yaoming Zhu,Shao Zhang,Lin Qiu,Weiwen Liu,Weinan Zhang,Xuezhi Cao,Xunliang Cai,Jiaxin Ding,Yong Yu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Model, Large Language, Language Model, basic text generation, autonomously completing complex

备注

点击查看摘要

Abstract:Large Language Model (LLM) agents have evolved from basic text generation to autonomously completing complex tasks through interaction with external tools. However, current benchmarks mainly assess end-to-end performance in fixed scenarios, restricting evaluation to specific skills and suffering from score saturation and growing dependence on expert annotation as agent capabilities improve. In this work, we emphasize the importance of learning ability, including both self-improvement and peer-learning, as a core driver for agent evolution toward human-level intelligence. We propose an iterative, competitive peer-learning framework, which allows agents to refine and optimize their strategies through repeated interactions and feedback, thereby systematically evaluating their learning capabilities. To address the score saturation issue in current benchmarks, we introduce CATArena, a tournament-style evaluation platform featuring four diverse board and card games with open-ended scoring. By providing tasks without explicit upper score limits, CATArena enables continuous and dynamic evaluation of rapidly advancing agent capabilities. Experimental results and analyses involving both minimal and commercial code agents demonstrate that CATArena provides reliable, stable, and scalable benchmarking for core agent abilities, particularly learning ability and strategy coding.

63. 【2510.26847】Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token

链接https://arxiv.org/abs/2510.26847

作者:Shaked Zychlinski,Yuval Kainan

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)

关键词:bypass safety guardrails, susceptible to jailbreak, bypass safety, Large Language Models, safety guardrails

备注: 16 pages, 9 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are susceptible to jailbreak attacks where malicious prompts are disguised using ciphers and character-level encodings to bypass safety guardrails. While these guardrails often fail to interpret the encoded content, the underlying models can still process the harmful instructions. We introduce CPT-Filtering, a novel, model-agnostic with negligible-costs and near-perfect accuracy guardrail technique that aims to mitigate these attacks by leveraging the intrinsic behavior of Byte-Pair Encoding (BPE) tokenizers. Our method is based on the principle that tokenizers, trained on natural language, represent out-of-distribution text, such as ciphers, using a significantly higher number of shorter tokens. Our technique uses a simple yet powerful artifact of using language models: the average number of Characters Per Token (CPT) in the text. This approach is motivated by the high compute cost of modern methods - relying on added modules such as dedicated LLMs or perplexity models. We validate our approach across a large dataset of over 100,000 prompts, testing numerous encoding schemes with several popular tokenizers. Our experiments demonstrate that a simple CPT threshold robustly identifies encoded text with high accuracy, even for very short inputs. CPT-Filtering provides a practical defense layer that can be immediately deployed for real-time text filtering and offline data curation.

64. 【2505.13487】Detecting Prefix Bias in LLM-based Reward Models

链接https://arxiv.org/abs/2505.13487

作者:Ashwin Kumar,Yuzi He,Aram H. Markosyan,Bobbie Chern,Imanol Arrieta-Ibarra

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Reinforcement Learning, Human Feedback, Learning with Human, human preference data, human preference

备注

点击查看摘要

Abstract:Reinforcement Learning with Human Feedback (RLHF) has emerged as a key paradigm for task-specific fine-tuning of language models using human preference data. While numerous publicly available preference datasets provide pairwise comparisons of responses, the potential for biases in the resulting reward models remains underexplored. In this work, we introduce novel methods to detect and evaluate prefix bias -- a systematic shift in model preferences triggered by minor variations in query prefixes -- in LLM-based reward models trained on such datasets. We leverage these metrics to reveal significant biases in preference models across racial and gender dimensions. Our comprehensive evaluation spans diverse open-source preference datasets and reward model architectures, demonstrating susceptibility to this kind of bias regardless of the underlying model architecture. Furthermore, we propose a data augmentation strategy to mitigate these biases, showing its effectiveness in reducing the impact of prefix bias. Our findings highlight the critical need for bias-aware dataset design and evaluation in developing fair and reliable reward models, contributing to the broader discourse on fairness in AI.

65. 【2505.01314】A Transformer-based Neural Architecture Search Method

链接https://arxiv.org/abs/2505.01314

作者:Shang Wang,Huanrong Tang,Jianquan Ouyang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

关键词:searching cross multihead, cross multihead attention, multihead attention computation, architecture search method, search method based

备注: GECCO 2023

点击查看摘要

Abstract:This paper presents a neural architecture search method based on Transformer architecture, searching cross multihead attention computation ways for different number of encoder and decoder combinations. In order to search for neural network structures with better translation results, we considered perplexity as an auxiliary evaluation metric for the algorithm in addition to BLEU scores and iteratively improved each individual neural network within the population by a multi-objective genetic algorithm. Experimental results show that the neural network structures searched by the algorithm outperform all the baseline models, and that the introduction of the auxiliary evaluation metric can find better models than considering only the BLEU score as an evaluation metric.

66. 【2504.15983】W-PCA Based Gradient-Free Proxy for Efficient Search of Lightweight Language Models

链接https://arxiv.org/abs/2504.15983

作者:Shang Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

关键词:natural language processing, efficient natural language, zero-shot NAS methods, zero-shot NAS, NAS methods

备注: ICLR 2025

点击查看摘要

Abstract:The demand for efficient natural language processing (NLP) systems has led to the development of lightweight language models. Previous work in this area has primarily focused on manual design or training-based neural architecture search (NAS) methods. Recently, zero-shot NAS methods have been proposed for evaluating language models without the need for training. However, prevailing approaches to zero-shot NAS often face challenges such as biased evaluation metrics and computational inefficiencies. In this paper, we introduce weight-weighted PCA (W-PCA), a novel zero-shot NAS method specifically tailored for lightweight language models. Our approach utilizes two evaluation proxies: the parameter count and the number of principal components with cumulative contribution exceeding $\eta$ in the feed-forward neural (FFN) layer. Additionally, by eliminating the need for gradient computations, we optimize the evaluation time, thus enhancing the efficiency of designing and evaluating lightweight language models. We conduct a comparative analysis on the GLUE and SQuAD datasets to evaluate our approach. The results demonstrate that our method significantly reduces training time compared to one-shot NAS methods and achieves higher scores in the testing phase compared to previous state-of-the-art training-based methods. Furthermore, we perform ranking evaluations on a dataset sampled from the FlexiBERT search space. Our approach exhibits superior ranking correlation and further reduces solving time compared to other zero-shot NAS methods that require gradient computation.

信息检索

1. 【2510.27584】Image Hashing via Cross-View Code Alignment in the Age of Foundation Models

链接https://arxiv.org/abs/2510.27584

作者:Ilyass Moummad,Kawtar Zaher,Hervé Goëau,Alexis Joly

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:large-scale retrieval requires, retrieval requires representations, Efficient large-scale retrieval, compact and discriminative, large-scale retrieval

备注

点击查看摘要

Abstract:Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.

2. 【2510.27571】owards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

链接https://arxiv.org/abs/2510.27571

作者:Zhuoning Guo,Mingxin Li,Yanzhao Zhang,Dingkun Long,Pengjun Xie,Xiaowen Chu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:narrow benchmarks incentivize, benchmarks incentivize correspondingly, incentivize correspondingly limited, prevailing video retrieval, video retrieval paradigm

备注

点击查看摘要

Abstract:The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.

3. 【2510.27566】Interact-RAG: Reason and Interact with the Corpus, Beyond Black-Box Retrieval

链接https://arxiv.org/abs/2510.27566

作者:Yulong Hui,Chao Chen,Zhihang Fu,Yihao Liu,Jieping Ye,Huanchen Zhang

类目:Information Retrieval (cs.IR)

关键词:Retrieval-Augmented Generation, incorporating external information, significantly enhanced LLMs, incorporating external, prevailing agentic RAG

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has significantly enhanced LLMs by incorporating external information. However, prevailing agentic RAG approaches are constrained by a critical limitation: they treat the retrieval process as a black-box querying operation. This confines agents' actions to query issuing, hindering its ability to tackle complex information-seeking tasks. To address this, we introduce Interact-RAG, a new paradigm that elevates the LLM agent from a passive query issuer into an active manipulator of the retrieval process. We dismantle the black-box with a Corpus Interaction Engine, equipping the agent with a set of action primitives for fine-grained control over information retrieval. To further empower the agent on the entire RAG pipeline, we first develop a reasoning-enhanced workflow, which enables both zero-shot execution and the synthesis of interaction trajectories. We then leverage this synthetic data to train a fully autonomous end-to-end agent via Supervised Fine-Tuning (SFT), followed by refinement with Reinforcement Learning (RL). Extensive experiments across six benchmarks demonstrate that Interact-RAG significantly outperforms other advanced methods, validating the efficacy of our reasoning-interaction strategy.

4. 【2510.27342】Pairwise and Attribute-Aware Decision Tree-Based Preference Elicitation for Cold-Start Recommendation

链接https://arxiv.org/abs/2510.27342

作者:Alireza Gharahighehi,Felipe Kenji Nakano,Xuehua Yang,Wenhan Cu,Celine Vens

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Recommender systems, intelligent filtering methods, intelligent filtering, rating elicitation, interaction history

备注

点击查看摘要

Abstract:Recommender systems (RSs) are intelligent filtering methods that suggest items to users based on their inferred preferences, derived from their interaction history on the platform. Collaborative filtering-based RSs rely on users past interactions to generate recommendations. However, when a user is new to the platform, referred to as a cold-start user, there is no historical data available, making it difficult to provide personalized recommendations. To address this, rating elicitation techniques can be used to gather initial ratings or preferences on selected items, helping to build an early understanding of the user's tastes. Rating elicitation approaches are generally categorized into two types: non-personalized and personalized. Decision tree-based rating elicitation is a personalized method that queries users about their preferences at each node of the tree until sufficient information is gathered. In this paper, we propose an extension to the decision tree approach for rating elicitation in the context of music recommendation. Our method: (i) elicits not only item ratings but also preferences on attributes such as genres to better cluster users, and (ii) uses item pairs instead of single items at each node to more effectively learn user preferences. Experimental results demonstrate that both proposed enhancements lead to improved performance, particularly with a reduced number of queries.

5. 【2510.27274】raceable Drug Recommendation over Medical Knowledge Graphs

链接https://arxiv.org/abs/2510.27274

作者:Yu Lin,Zhen Jia,Philipp Christmann,Xu Zhang,Shengdong Du,Tianrui Li

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:support healthcare professionals, patients' medical conditions, aim to support, support healthcare, healthcare professionals

备注: Accepted to MediKS@CIKM2025

点击查看摘要

Abstract:Drug recommendation (DR) systems aim to support healthcare professionals in selecting appropriate medications based on patients' medical conditions. State-of-the-art approaches utilize deep learning techniques for improving DR, but fall short in providing any insights on the derivation process of recommendations -- a critical limitation in such high-stake applications. We propose TraceDR, a novel DR system operating over a medical knowledge graph (MKG), which ensures access to large-scale and high-quality information. TraceDR simultaneously predicts drug recommendations and related evidence within a multi-task learning framework, enabling traceability of medication recommendations. For covering a more diverse set of diseases and drugs than existing works, we devise a framework for automatically constructing patient health records and release DrugRec, a new large-scale testbed for DR.

6. 【2510.27259】Research Output of Webology Journal (2013-2017): A Scientometric Analysis

链接https://arxiv.org/abs/2510.27259

作者:Muneer Ahmad,M. Sadik Batcha,Basharat Ahmad Wani,Mohammad Idrees Khan,S. Roselin Jahina

类目:Digital Libraries (cs.DL); Information Retrieval (cs.IR)

关键词:World Wide Web, World Wide, Wide Web, international peer-reviewed journal, English devoted

备注: 13 pages, 3 figures, Research Paper

点击查看摘要

Abstract:Webology is an international peer-reviewed journal in English devoted to the field of the World Wide Web and serves as a forum for discussion and experimentation. It serves as a forum for new research in information dissemination and communication processes in general, and in the context of the World Wide Web in particular. This paper presents a Scientometric analysis of the Webology Journal. The paper analyses the pattern of growth of the research output published in the journal, pattern of authorship, author productivity, and subjects covered to the papers over the period (2013-2017). It is found that 62 papers were published during the period of study (2013-2017). The maximum numbers of articles were collaborative in nature. The subject concentration of the journal noted was Social Networking/Web 2.0/Library 2.0 and Scientometrics or Bibliometrics. Iranian researchers contributed the maximum number of articles (37.10%). The study applied standard formula and statistical tools to bring out the factual result.

7. 【2510.27246】Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

链接https://arxiv.org/abs/2510.27246

作者:Mohammad Tavakoli,Alireza Salemi,Carrie Ye,Mohamed Abdalla,Hamed Zamani,J Ross Mitchell

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:cover narrow domains, lack narrative coherence, simple recall-oriented tasks, test simple recall-oriented, large language models

备注

点击查看摘要

Abstract:Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT-a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval-augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5%-12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.

8. 【2510.27238】DRAMA: Unifying Data Retrieval and Analysis for Open-Domain Analytic Queries

链接https://arxiv.org/abs/2510.27238

作者:Chuxuan Hu,Maxwell Yang,James Weiland,Yeji Lim,Suhas Palawala,Daniel Kang

类目:Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Manually conducting real-world, Manually conducting, data, labor-intensive and inefficient, analyses is labor-intensive

备注: Accepted to SIGMOD 2026

点击查看摘要

Abstract:Manually conducting real-world data analyses is labor-intensive and inefficient. Despite numerous attempts to automate data science workflows, none of the existing paradigms or systems fully demonstrate all three key capabilities required to support them effectively: (1) open-domain data collection, (2) structured data transformation, and (3) analytic reasoning. To overcome these limitations, we propose DRAMA, an end-to-end paradigm that answers users' analytic queries in natural language on large-scale open-domain data. DRAMA unifies data collection, transformation, and analysis as a single pipeline. To quantitatively evaluate system performance on tasks representative of DRAMA, we construct a benchmark, DRAMA-Bench, consisting of two categories of tasks: claim verification and question answering, each comprising 100 instances. These tasks are derived from real-world applications that have gained significant public attention and require the retrieval and analysis of open-domain data. We develop DRAMA-Bot, a multi-agent system designed following DRAMA. It comprises a data retriever that collects and transforms data by coordinating the execution of sub-agents, and a data analyzer that performs structured reasoning over the retrieved data. We evaluate DRAMA-Bot on DRAMA-Bench together with five state-of-the-art baseline agents. DRAMA-Bot achieves 86.5% task accuracy at a cost of $0.05, outperforming all baselines with up to 6.9 times the accuracy and less than 1/6 of the cost. DRAMA is publicly available at this https URL.

Comments:
Accepted to SIGMOD 2026

Subjects:

Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Cite as:
arXiv:2510.27238 [cs.DB]

(or
arXiv:2510.27238v1 [cs.DB] for this version)

https://doi.org/10.48550/arXiv.2510.27238

Focus to learn more

              arXiv-issued DOI via DataCite

Related DOI:

https://doi.org/10.1145/3769781

Focus to learn more

            DOI(s) linking to related resources</p>
9. 【2510.27232】A Survey on Deep Text Hashing: Efficient Semantic Text Retrieval with Binary Representation

链接https://arxiv.org/abs/2510.27232

作者:Liyang He,Zhenya Huang,Cheng Yang,Rui Li,Zheng Zhang,Kai Zhang,Zhi Li,Qi Liu,Enhong Chen

类目:Information Retrieval (cs.IR)

关键词:garnered increasing attention, efficient large-scale semantic, efficient large-scale, academia and industry, rapid growth

备注

点击查看摘要

Abstract:With the rapid growth of textual content on the Internet, efficient large-scale semantic text retrieval has garnered increasing attention from both academia and industry. Text hashing, which projects original texts into compact binary hash codes, is a crucial method for this task. By using binary codes, the semantic similarity computation for text pairs is significantly accelerated via fast Hamming distance calculations, and storage costs are greatly reduced. With the advancement of deep learning, deep text hashing has demonstrated significant advantages over traditional, data-independent hashing techniques. By leveraging deep neural networks, these methods can learn compact and semantically rich binary representations directly from data, overcoming the performance limitations of earlier approaches. This survey investigates current deep text hashing methods by categorizing them based on their core components: semantic extraction, hash code quality preservation, and other key technologies. We then present a detailed evaluation schema with results on several popular datasets, followed by a discussion of practical applications and open-source tools for implementation. Finally, we conclude by discussing key challenges and future research directions, including the integration of deep text hashing with large language models to further advance the field. The project for this survey can be accessed at this https URL.

10. 【2510.27157】A Survey on Generative Recommendation: Data, Model, and Tasks

链接https://arxiv.org/abs/2510.27157

作者:Min Hou,Le Wu,Yuxin Liao,Yonghui Yang,Zhen Zhang,Changlong Zheng,Han Wu,Richang Hong

类目:Information Retrieval (cs.IR)

关键词:modern information ecosystems, Recommender systems serve, helping users navigate, users navigate digital, discover items aligned

备注

点击查看摘要

Abstract:Recommender systems serve as foundational infrastructure in modern information ecosystems, helping users navigate digital content and discover items aligned with their preferences. At their core, recommender systems address a fundamental problem: matching users with items. Over the past decades, the field has experienced successive paradigm shifts, from collaborative filtering and matrix factorization in the machine learning era to neural architectures in the deep learning era. Recently, the emergence of generative models, especially large language models (LLMs) and diffusion models, have sparked a new paradigm: generative recommendation, which reconceptualizes recommendation as a generation task rather than discriminative scoring. This survey provides a comprehensive examination through a unified tripartite framework spanning data, model, and task dimensions. Rather than simply categorizing works, we systematically decompose approaches into operational stages-data augmentation and unification, model alignment and training, task formulation and execution. At the data level, generative models enable knowledge-infused augmentation and agent-based simulation while unifying heterogeneous signals. At the model level, we taxonomize LLM-based methods, large recommendation models, and diffusion approaches, analyzing their alignment mechanisms and innovations. At the task level, we illuminate new capabilities including conversational interaction, explainable reasoning, and personalized content generation. We identify five key advantages: world knowledge integration, natural language understanding, reasoning capabilities, scaling laws, and creative generation. We critically examine challenges in benchmark design, model robustness, and deployment efficiency, while charting a roadmap toward intelligent recommendation assistants that fundamentally reshape human-information interaction.

11. 【2510.27141】Compass: General Filtered Search across Vector and Structured Data

链接https://arxiv.org/abs/2510.27141

作者:Chunxiao Ye,Xiao Yan,Eric Lo

类目:Databases (cs.DB); Information Retrieval (cs.IR)

关键词:data necessitates efficient, combine high-dimensional vector, complex relational filtering, necessitates efficient, general filtered search

备注

点击查看摘要

Abstract:The increasing prevalence of hybrid vector and relational data necessitates efficient, general support for queries that combine high-dimensional vector search with complex relational filtering. However, existing filtered search solutions are fundamentally limited by specialized indices, which restrict arbitrary filtering and hinder integration with general-purpose DBMSs. This work introduces \textsc{Compass}, a unified framework that enables general filtered search across vector and structured data without relying on new index designs. Compass leverages established index structures -- such as HNSW and IVF for vector attributes, and B+-trees for relational attributes -- implementing a principled cooperative query execution strategy that coordinates candidate generation and predicate evaluation across modalities. Uniquely, Compass maintains generality by allowing arbitrary conjunctions, disjunctions, and range predicates, while ensuring robustness even with highly-selective or multi-attribute filters. Comprehensive empirical evaluations demonstrate that Compass consistently outperforms NaviX, the only existing performant general framework, across diverse hybrid query workloads. It also matches the query throughput of specialized single-attribute indices in their favorite settings with only a single attribute involved, all while maintaining full generality and DBMS compatibility. Overall, Compass offers a practical and robust solution for achieving truly general filtered search in vector database systems.

12. 【2510.26861】Evaluating Perspectival Biases in Cross-Modal Retrieval

链接https://arxiv.org/abs/2510.26861

作者:Teerapol Saengsukhiran,Peerawat Chomphooyod,Narabodee Rodjananant,Chompakorn Chaksangchaichot,Patawee Prakrankamanant,Witthawin Sripheanpol,Pak Lovichit,SarChaksaana Nutanong,Ekapol Chuangsuwanich

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:semantic space, expected to operate, cultural origin, bias, prevalence

备注

点击查看摘要

Abstract:Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias refers to the tendency to favor entries from prevalent languages over semantically faithful entries in image-to-text retrieval. Second, association bias refers to the tendency to favor images culturally associated with the query over semantically correct ones in text-to-image retrieval. Results show that explicit alignment is a more effective strategy for mitigating prevalence bias. However, association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling and that bias arising from cultural association may be treated as a more challenging problem than one arising from linguistic prevalence.

13. 【2510.26824】LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature

链接https://arxiv.org/abs/2510.26824

作者:Magdalena Lederbauer,Siddharth Betala,Xiyao Li,Ayush Jain,Amine Sehaba,Georgia Channing,Grégoire Germain,Anamaria Leonescu,Faris Flaifil,Alfonso Amayuelas,Alexandre Nozadze,Stefan P. Schmid,Mohd Zaki,Sudheesh Kumar Ethirajan,Elton Pan,Mathilde Franckel,Alexandre Duval,N. M. Anoop Krishnan,Samuel P. Gleason

类目:Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:procedural knowledge scattered, synthesis procedures, synthesis procedures remains, systematic analysis, remains a fundamental

备注: 29 pages, 13 figures, 6 tables

点击查看摘要

Abstract:The development of synthesis procedures remains a fundamental challenge in materials discovery, with procedural knowledge scattered across decades of scientific literature in unstructured formats that are challenging for systematic analysis. In this paper, we propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data from materials science publications, covering text and figures. We curated 81k open-access papers, yielding LeMat-Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes, structured according to an ontology specific to materials science. The extraction quality is rigorously evaluated on a subset of 2.5k synthesis procedures through a combination of expert annotations and a scalable LLM-as-a-judge framework. Beyond the dataset, we release a modular, open-source software library designed to support community-driven extension to new corpora and synthesis domains. Altogether, this work provides an extensible infrastructure to transform unstructured literature into machine-readable information. This lays the groundwork for predictive modeling of synthesis procedures as well as modeling synthesis--structure--property relationships.

14. 【2510.25401】DGAI: Decoupled On-Disk Graph-Based ANN Index for Efficient Updates and Queries

链接https://arxiv.org/abs/2510.25401

作者:Jiahao Lou,Quan Yu,Shufeng Gong,Song Yu,Yanfeng Zhang,Ge Yu

类目:Databases (cs.DB); Information Retrieval (cs.IR)

关键词:On-disk graph-based indexes, On-disk graph-based, approximate nearest neighbor, systems for large-scale, graph-based indexes

备注: 12 pages

点击查看摘要

Abstract:On-disk graph-based indexes are widely used in approximate nearest neighbor (ANN) search systems for large-scale, high-dimensional vectors. However, traditional coupled storage methods, which store vectors within the index, are inefficient for index updates. Coupled storage incurs excessive redundant vector reads and writes when updating the graph topology, leading to significant invalid I/O. To address this issue, we propose a decoupled storage architecture. While a decoupled architecture reduces query performance. To overcome this limitation, we design two tailored strategies: (i) a three-stage query mechanism that leverages multiple PQ compressed vectors to filter invalid I/O and computations, and (ii) an incremental page-level topological reordering strategy that incrementally inserts new nodes into pages containing their most similar neighbors to mitigate read amplification. Together, these techniques substantially reduce both I/O and computational overhead during ANN search. Experimental results show that the decoupled architecture improves update speed by 10.05x for insertions and 6.89x for deletions, while the three-stage query and incremental reordering enhance query efficiency by 2.66x compared to the traditional coupled architecture.

计算机视觉

1. 【2510.27692】LifWavNet: Lifting Wavelet-based Network for Non-contact ECG Reconstruction from Radar

链接https://arxiv.org/abs/2510.27692

作者:Soumitra Kundu,Gargi Panda,Saumik Bhattacharya,Aurobinda Routray,Rajlakshmi Guha

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:unobtrusive cardiac monitoring, radar signals offers, cardiac monitoring, offers a promising, promising approach

备注

点击查看摘要

Abstract:Non-contact electrocardiogram (ECG) reconstruction from radar signals offers a promising approach for unobtrusive cardiac monitoring. We present LifWavNet, a lifting wavelet network based on a multi-resolution analysis and synthesis (MRAS) model for radar-to-ECG reconstruction. Unlike prior models that use fixed wavelet approaches, LifWavNet employs learnable lifting wavelets with lifting and inverse lifting units to adaptively capture radar signal features and synthesize physiologically meaningful ECG waveforms. To improve reconstruction fidelity, we introduce a multi-resolution short-time Fourier transform (STFT) loss, that enforces consistency with the ground-truth ECG in both temporal and spectral domains. Evaluations on two public datasets demonstrate that LifWavNet outperforms state-of-the-art methods in ECG reconstruction and downstream vital sign estimation (heart rate and heart rate variability). Furthermore, intermediate feature visualization highlights the interpretability of multi-resolution decomposition and synthesis in radar-to-ECG reconstruction. These results establish LifWavNet as a robust framework for radar-based non-contact ECG measurement.

2. 【2510.27684】Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals

链接https://arxiv.org/abs/2510.27684

作者:Xiangyu Fan,Zesong Qiu,Zhuguanyu Wu,Fanzhou Wang,Zhiqian Lin,Tianxiang Ren,Dahua Lin,Ruihao Gong,Lei Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:distills score-based generative, efficient one-step generators, Phased DMD, DMD, distills score-based

备注

点击查看摘要

Abstract:Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. However, limited model capacity causes one-step distilled models underperform on complex generative tasks, e.g., synthesizing intricate object motions in text-to-video generation. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generation diversity of multi-step distilled models, bringing it down to the level of their one-step counterparts. To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges the idea of phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity. Phased DMD is built upon two key ideas: progressive distribution matching and score matching within subintervals. First, our model divides the SNR range into subintervals, progressively refining the model to higher SNR levels, to better capture complex distributions. Next, to ensure the training objective within each subinterval is accurate, we have conducted rigorous mathematical derivations. We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image (20B parameters) and Wan2.2 (28B parameters). Experimental results demonstrate that Phased DMD preserves output diversity better than DMD while retaining key generative capabilities. We will release our code and models.

3. 【2510.27680】PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting

链接https://arxiv.org/abs/2510.27680

作者:Danyal Maqbool,Changhee Lee,Zachary Huemann,Samuel D. Church,Matthew E. Larson,Scott B. Perlman,Tomas A. Romero,Joshua D. Warner,Meghan Lubner,Xin Tie,Jameson Merkow,Junjie Hu,Steve Y. Cho,Tyler J. Bradshaw

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:enabled impressive multimodal, applications remain limited, Recent advances, impressive multimodal reasoning, medical applications remain

备注

点击查看摘要

Abstract:Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.

4. 【2510.27677】Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes

链接https://arxiv.org/abs/2510.27677

作者:Bo Li,Duyuan Zheng,Xinyang Liu,Qingwen Li,Hong Li,Hongyan Cui,Ge Gao,Chen Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:poor image quality, Shuffling Vision Transformer, viewpoint distortion, Person re-identification, occluded person ReID

备注: 12 pages,conference

点击查看摘要

Abstract:Person re-identification (ReID) in surveillance is challenged by occlusion, viewpoint distortion, and poor image quality. Most existing methods rely on complex modules or perform well only on clear frontal images. We propose Sh-ViT (Shuffling Vision Transformer), a lightweight and robust model for occluded person ReID. Built on ViT-Base, Sh-ViT introduces three components: First, a Shuffle module in the final Transformer layer to break spatial correlations and enhance robustness to occlusion and blur; Second, scenario-adapted augmentation (geometric transforms, erasing, blur, and color adjustment) to simulate surveillance conditions; Third, DeiT-based knowledge distillation to improve learning with limited this http URL support real-world evaluation, we construct the MyTT dataset, containing over 10,000 pedestrians and 30,000+ images from base station inspections, with frequent equipment occlusion and camera variations. Experiments show that Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art this http URL summary, Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.

5. 【2510.27667】Deep learning denoising unlocks quantitative insights in operando materials microscopy

链接https://arxiv.org/abs/2510.27667

作者:Samuel Degnan-Morgenstern,Alexander E. Cohen,Rajeev Gopal,Megan Gober,George J. Nelson,Peng Bai,Martin Z. Bazant

类目:Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)

关键词:govern functional materials, measurement noise limits, undermines quantitative analysis, functional materials, direct insight

备注

点击查看摘要

Abstract:Operando microscopy provides direct insight into the dynamic chemical and physical processes that govern functional materials, yet measurement noise limits the effective resolution and undermines quantitative analysis. Here, we present a general framework for integrating unsupervised deep learning-based denoising into quantitative microscopy workflows across modalities and length scales. Using simulated data, we demonstrate that deep denoising preserves physical fidelity, introduces minimal bias, and reduces uncertainty in model learning with partial differential equation (PDE)-constrained optimization. Applied to experiments, denoising reveals nanoscale chemical and structural heterogeneity in scanning transmission X-ray microscopy (STXM) of lithium iron phosphate (LFP), enables automated particle segmentation and phase classification in optical microscopy of graphite electrodes, and reduces noise-induced variability by nearly 80% in neutron radiography to resolve heterogeneous lithium transport. Collectively, these results establish deep denoising as a powerful, modality-agnostic enhancement that advances quantitative operando imaging and extends the reach of previously noise-limited techniques.

6. 【2510.27650】Imbalanced Classification through the Lens of Spurious Correlations

链接https://arxiv.org/abs/2510.27650

作者:Jakob Hackstein,Sidney Bender

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Class imbalance poses, amplifies Clever Hans, machine learning, frequently leading, Class imbalance

备注

点击查看摘要

Abstract:Class imbalance poses a fundamental challenge in machine learning, frequently leading to unreliable classification performance. While prior methods focus on data- or loss-reweighting schemes, we view imbalance as a data condition that amplifies Clever Hans (CH) effects by underspecification of minority classes. In a counterfactual explanations-based approach, we propose to leverage Explainable AI to jointly identify and eliminate CH effects emerging under imbalance. Our method achieves competitive classification performance on three datasets and demonstrates how CH effects emerge under imbalance, a perspective largely overlooked by existing approaches.

7. 【2510.27649】Gaussian Combined Distance: A Generic Metric for Object Detection

链接https://arxiv.org/abs/2510.27649

作者:Ziqian Guan,Xieyi Fu,Pengjun Huang,Hengyuan Zhang,Hubin Du,Yongtao Liu,Yinglin Wang,Qang Ma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Wasserstein Distance, well-defined similarity metric, similarity metric, Wasserstein Distance lacks, Gaussian Combined Distance

备注: This paper is accepted by the GRSL in 2025

点击查看摘要

Abstract:In object detection, a well-defined similarity metric can significantly enhance model performance. Currently, the IoU-based similarity metric is the most commonly preferred choice for detectors. However, detectors using IoU as a similarity metric often perform poorly when detecting small objects because of their sensitivity to minor positional deviations. To address this issue, recent studies have proposed the Wasserstein Distance as an alternative to IoU for measuring the similarity of Gaussian-distributed bounding boxes. However, we have observed that the Wasserstein Distance lacks scale invariance, which negatively impacts the model's generalization capability. Additionally, when used as a loss function, its independent optimization of the center attributes leads to slow model convergence and unsatisfactory detection precision. To address these challenges, we introduce the Gaussian Combined Distance (GCD). Through analytical examination of GCD and its gradient, we demonstrate that GCD not only possesses scale invariance but also facilitates joint optimization, which enhances model localization performance. Extensive experiments on the AI-TOD-v2 dataset for tiny object detection show that GCD, as a bounding box regression loss function and label assignment metric, achieves state-of-the-art performance across various detectors. We further validated the generalizability of GCD on the MS-COCO-2017 and Visdrone-2019 datasets, where it outperforms the Wasserstein Distance across diverse scales of datasets. Code is available at this https URL.

8. 【2510.27647】NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception

链接https://arxiv.org/abs/2510.27647

作者:Congzhang Shao,Quan Yuan,Guiyang Luo,Yue Hu,Danni Wang,Yilin Liu,Rui Pan,Bo Chen,Jinglin Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:common representation, perception improves task, Collaborative perception improves, improves task performance, representation

备注: 19 pages, Accepted by NeurIPS 2025

点击查看摘要

Abstract:Collaborative perception improves task performance by expanding the perception range through information sharing among agents. . Immutable heterogeneity poses a significant challenge in collaborative perception, as participating agents may employ different and fixed perception models. This leads to domain gaps in the intermediate features shared among agents, consequently degrading collaborative performance. Aligning the features of all agents to a common representation can eliminate domain gaps with low training cost. However, in existing methods, the common representation is designated as the representation of a specific agent, making it difficult for agents with significant domain discrepancies from this specific agent to achieve proper alignment. This paper proposes NegoCollab, a heterogeneous collaboration method based on the negotiated common representation. It introduces a negotiator during training to derive the common representation from the local representations of each modality's agent, effectively reducing the inherent domain gap with the various local representations. In NegoCollab, the mutual transformation of features between the local representation space and the common representation space is achieved by a pair of sender and receiver. To better align local representations to the common representation containing multimodal information, we introduce structural alignment loss and pragmatic alignment loss in addition to the distribution alignment loss to supervise the training. This enables the knowledge in the common representation to be fully distilled into the sender.

9. 【2510.27646】VessShape: Few-shot 2D blood vessel segmentation by leveraging shape priors from synthetic images

链接https://arxiv.org/abs/2510.27646

作者:Cesar H. Comin,Wesley N. Galvão

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Convolutional Neural Networks, medical image analysis, large annotated datasets, imaging modalities, Semantic segmentation

备注

点击查看摘要

Abstract:Semantic segmentation of blood vessels is an important task in medical image analysis, but its progress is often hindered by the scarcity of large annotated datasets and the poor generalization of models across different imaging modalities. A key aspect is the tendency of Convolutional Neural Networks (CNNs) to learn texture-based features, which limits their performance when applied to new domains with different visual characteristics. We hypothesize that leveraging geometric priors of vessel shapes, such as their tubular and branching nature, can lead to more robust and data-efficient models. To investigate this, we introduce VessShape, a methodology for generating large-scale 2D synthetic datasets designed to instill a shape bias in segmentation models. VessShape images contain procedurally generated tubular geometries combined with a wide variety of foreground and background textures, encouraging models to learn shape cues rather than textures. We demonstrate that a model pre-trained on VessShape images achieves strong few-shot segmentation performance on two real-world datasets from different domains, requiring only four to ten samples for fine-tuning. Furthermore, the model exhibits notable zero-shot capabilities, effectively segmenting vessels in unseen domains without any target-specific training. Our results indicate that pre-training with a strong shape bias can be an effective strategy to overcome data scarcity and improve model generalization in blood vessel segmentation.

10. 【2510.27632】Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation

链接https://arxiv.org/abs/2510.27632

作者:Riccardo Brioschi,Aleksandr Alekseev,Emanuele Nevali,Berkay Döner,Omar El Malki,Blagoj Mitrevski,Leandro Kieliger,Mark Collier,Andrii Maksai,Jesse Berent,Claudiu Musat,Efi Kokiopoulou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Graphic layout generation, generating aesthetically pleasing, aesthetically pleasing layouts, pleasing layouts ranging, growing research area

备注: 15 pages, 18 figures, GitHub link: [this https URL](https://github.com/google-deepmind/sketch_to_layout) , accept at ICCV 2025 Workshop (HiGen)

点击查看摘要

Abstract:Graphic layout generation is a growing research area focusing on generating aesthetically pleasing layouts ranging from poster designs to documents. While recent research has explored ways to incorporate user constraints to guide the layout generation, these constraints often require complex specifications which reduce usability. We introduce an innovative approach exploiting user-provided sketches as intuitive constraints and we demonstrate empirically the effectiveness of this new guidance method, establishing the sketch-to-layout problem as a promising research direction, which is currently under-explored. To tackle the sketch-to-layout problem, we propose a multimodal transformer-based solution using the sketch and the content assets as inputs to produce high quality layouts. Since collecting sketch training data from human annotators to train our model is very costly, we introduce a novel and efficient method to synthetically generate training sketches at scale. We train and evaluate our model on three publicly available datasets: PubLayNet, DocLayNet and SlidesVQA, demonstrating that it outperforms state-of-the-art constraint-based methods, while offering a more intuitive design experience. In order to facilitate future sketch-to-layout research, we release O(200k) synthetically-generated sketches for the public datasets above. The datasets are available at this https URL.

11. 【2510.27623】Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning

链接https://arxiv.org/abs/2510.27623

作者:Qiusi Zhan,Hyeonjeong Ha,Rui Yang,Sirui Xu,Hanyang Chen,Liang-Yan Gui,Yu-Xiong Wang,Huan Zhang,Heng Ji,Daniel Kang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal large language, large language models, enabling direct perception, planning task-oriented actions, Multimodal large

备注

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into MLLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation. Across various embodied agent benchmarks and MLLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in MLLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.

12. 【2510.27607】Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

链接https://arxiv.org/abs/2510.27607

作者:John Won,Kyungmin Lee,Huiwon Jang,Dongyoung Kim,Jinwoo Shin

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:robotic policy learning, improving robotic policy, policy learning, world modeling, modeling has shown

备注: 20 pages, 10 figures

点击查看摘要

Abstract:Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.

13. 【2510.27606】Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

链接https://arxiv.org/abs/2510.27606

作者:Yuhong Liu,Beichen Zhang,Yuhang Zang,Yuhang Cao,Long Xing,Xiaoyi Dong,Haodong Duan,Dahua Lin,Jiaqi Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Vision-Language Models, Vision-Language Models, weakness of Large, Large Vision-Language, remains a weakness

备注: preprint

点击查看摘要

Abstract:Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.

14. 【2510.27602】Who Made This? Fake Detection and Source Attribution with Diffusion Features

链接https://arxiv.org/abs/2510.27602

作者:Simone Bonechi,Paolo Andreini,Barbara Toniella Corradini

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:raising concerns, concerns about authenticity, rapid progress, progress of generative, enabled the creation

备注

点击查看摘要

Abstract:The rapid progress of generative diffusion models has enabled the creation of synthetic images that are increasingly difficult to distinguish from real ones, raising concerns about authenticity, copyright, and misinformation. Existing supervised detectors often struggle to generalize across unseen generators, requiring extensive labeled data and frequent retraining. We introduce FRIDA (Fake-image Recognition and source Identification via Diffusion-features Analysis), a lightweight framework that leverages internal activations from a pre-trained diffusion model for deepfake detection and source generator attribution. A k-nearest-neighbor classifier applied to diffusion features achieves state-of-the-art cross-generator performance without fine-tuning, while a compact neural model enables accurate source attribution. These results show that diffusion representations inherently encode generator-specific patterns, providing a simple and interpretable foundation for synthetic image forensics.

15. 【2510.27599】ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning

链接https://arxiv.org/abs/2510.27599

作者:Samarup Bhattacharya,Anubhab Bhattacharya,Abir Chakraborty

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Neural networks, interpret the world, networks have changed, machines interpret, model

备注: 11 pages, 1 figure

点击查看摘要

Abstract:Neural networks have changed the way machines interpret the world. At their core, they learn by following gradients, adjusting their parameters step by step until they identify the most discriminant patterns in the data. This process gives them their strength, yet it also opens the door to a hidden flaw. The very gradients that help a model learn can also be used to produce small, imperceptible tweaks that cause the model to completely alter its decision. Such tweaks are called adversarial attacks. These attacks exploit this vulnerability by adding tiny, imperceptible changes to images that, while leaving them identical to the human eye, cause the model to make wrong predictions. In this work, we propose Adversarially-trained Contrastive Hard-mining for Optimized Robustness (ANCHOR), a framework that leverages the power of supervised contrastive learning with explicit hard positive mining to enable the model to learn representations for images such that the embeddings for the images, their augmentations, and their perturbed versions cluster together in the embedding space along with those for other images of the same class while being separated from images of other classes. This alignment helps the model focus on stable, meaningful patterns rather than fragile gradient cues. On CIFAR-10, our approach achieves impressive results for both clean and robust accuracy under PGD-20 (epsilon = 0.031), outperforming standard adversarial training methods. Our results indicate that combining adversarial guidance with hard-mined contrastive supervision helps models learn more structured and robust representations, narrowing the gap between accuracy and robustness.

16. 【2510.27584】Image Hashing via Cross-View Code Alignment in the Age of Foundation Models

链接https://arxiv.org/abs/2510.27584

作者:Ilyass Moummad,Kawtar Zaher,Hervé Goëau,Alexis Joly

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:large-scale retrieval requires, retrieval requires representations, Efficient large-scale retrieval, compact and discriminative, large-scale retrieval

备注

点击查看摘要

Abstract:Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.

17. 【2510.27571】owards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

链接https://arxiv.org/abs/2510.27571

作者:Zhuoning Guo,Mingxin Li,Yanzhao Zhang,Dingkun Long,Pengjun Xie,Xiaowen Chu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:narrow benchmarks incentivize, benchmarks incentivize correspondingly, incentivize correspondingly limited, prevailing video retrieval, video retrieval paradigm

备注

点击查看摘要

Abstract:The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.

18. 【2510.27547】MapSAM2: Adapting SAM2 for Automatic Segmentation of Historical Map Images and Time Series

链接https://arxiv.org/abs/2510.27547

作者:Xue Xia,Randall Balestriero,Tao Zhang,Yixin Zhou,Andrew Ding,Dev Saini,Lorenz Hurni

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:historical map images, historical map, time series, document geographic features, historical map time

备注

点击查看摘要

Abstract:Historical maps are unique and valuable archives that document geographic features across different time periods. However, automated analysis of historical map images remains a significant challenge due to their wide stylistic variability and the scarcity of annotated training data. Constructing linked spatio-temporal datasets from historical map time series is even more time-consuming and labor-intensive, as it requires synthesizing information from multiple maps. Such datasets are essential for applications such as dating buildings, analyzing the development of road networks and settlements, studying environmental changes etc. We present MapSAM2, a unified framework for automatically segmenting both historical map images and time series. Built on a visual foundation model, MapSAM2 adapts to diverse segmentation tasks with few-shot fine-tuning. Our key innovation is to treat both historical map images and time series as videos. For images, we process a set of tiles as a video, enabling the memory attention mechanism to incorporate contextual cues from similar tiles, leading to improved geometric accuracy, particularly for areal features. For time series, we introduce the annotated Siegfried Building Time Series Dataset and, to reduce annotation costs, propose generating pseudo time series from single-year maps by simulating common temporal transformations. Experimental results show that MapSAM2 learns temporal associations effectively and can accurately segment and link buildings in time series under limited supervision or using pseudo videos. We will release both our dataset and code to support future research.

19. 【2510.27533】Deep Neural Watermarking for Robust Copyright Protection in 3D Point Clouds

链接https://arxiv.org/abs/2510.27533

作者:Khandoker Ashik Uz Zaman,Mohammad Zahangir Alam,Mohammed N. M. Ali,Mahdi H. Miraz

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:digital media, intellectual property, critical due, rapid growth, growth of three-dimensional

备注

点击查看摘要

Abstract:The protection of intellectual property has become critical due to the rapid growth of three-dimensional content in digital media. Unlike traditional images or videos, 3D point clouds present unique challenges for copyright enforcement, as they are especially vulnerable to a range of geometric and non-geometric attacks that can easily degrade or remove conventional watermark signals. In this paper, we address these challenges by proposing a robust deep neural watermarking framework for 3D point cloud copyright protection and ownership verification. Our approach embeds binary watermarks into the singular values of 3D point cloud blocks using spectral decomposition, i.e. Singular Value Decomposition (SVD), and leverages the extraction capabilities of Deep Learning using PointNet++ neural network architecture. The network is trained to reliably extract watermarks even after the data undergoes various attacks such as rotation, scaling, noise, cropping and signal distortions. We validated our method using the publicly available ModelNet40 dataset, demonstrating that deep learning-based extraction significantly outperforms traditional SVD-based techniques under challenging conditions. Our experimental evaluation demonstrates that the deep learning-based extraction approach significantly outperforms existing SVD-based methods with deep learning achieving bitwise accuracy up to 0.83 and Intersection over Union (IoU) of 0.80, compared to SVD achieving a bitwise accuracy of 0.58 and IoU of 0.26 for the Crop (70%) attack, which is the most severe geometric distortion in our experiment. This demonstrates our method's ability to achieve superior watermark recovery and maintain high fidelity even under severe distortions.

20. 【2510.27508】Context-Gated Cross-Modal Perception with Visual Mamba for PET-CT Lung Tumor Segmentation

链接https://arxiv.org/abs/2510.27508

作者:Elena Mulero Ayllón,Linlin Shen,Pierangelo Veltri,Fabrizia Gelardi,Arturo Chiti,Paolo Soda,Matteo Tortora

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:effectively combining anatomical, Accurate lung tumor, Cross-Modal Perception Module, treatment planning, major challenge

备注

点击查看摘要

Abstract:Accurate lung tumor segmentation is vital for improving diagnosis and treatment planning, and effectively combining anatomical and functional information from PET and CT remains a major challenge. In this study, we propose vMambaX, a lightweight multimodal framework integrating PET and CT scan images through a Context-Gated Cross-Modal Perception Module (CGM). Built on the Visual Mamba architecture, vMambaX adaptively enhances inter-modality feature interaction, emphasizing informative regions while suppressing noise. Evaluated on the PCLT20K dataset, the model outperforms baseline models while maintaining lower computational complexity. These results highlight the effectiveness of adaptive cross-modal gating for multimodal tumor segmentation and demonstrate the potential of vMambaX as an efficient and scalable framework for advanced lung cancer analysis. The code is available at this https URL.

21. 【2510.27492】hinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

链接https://arxiv.org/abs/2510.27492

作者:Jiawei Gu,Yunzhuo Hao,Huichen Will Wang,Linjie Li,Michael Qizhe Shieh,Yejin Choi,Ranjay Krishna,Yu Cheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:requires iterative coordination, meaningful interleaved chain, reasoning requires iterative, language and vision, requires iterative

备注: project page: [this https URL](https://thinkmorph.github.io/)

点击查看摘要

Abstract:Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal this http URL findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.

22. 【2510.27481】NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding

链接https://arxiv.org/abs/2510.27481

作者:Wei Xu,Cheng Wang,Dingkang Liang,Zongchuang Zhao,Xingyu Jiang,Peng Zhang,Xiang Bai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:underwater scene understanding, offers critical insights, attracts increasing attention, exploration offers critical, scene understanding

备注: Accepted to NeurIPS 2025. Data and models are available at [this https URL](https://github.com/H-EmbodVis/NAUTILUS)

点击查看摘要

Abstract:Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study the underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perceptions from multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45 M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of the underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into renowned baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on the NautData and public underwater datasets demonstrate the effectiveness of the VFE module, consistently improving the performance of both baselines on the majority of supported tasks, thus ensuring the superiority of NAUTILUS in the underwater scene understanding area. Data and models are available at this https URL.

23. 【2510.27475】Referee: Reference-aware Audiovisual Deepfake Detection

链接https://arxiv.org/abs/2510.27475

作者:Hyemin Boo,Eunsang Lee,Jiyoung Lee

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:advanced generative models, detection approaches struggle, existing audiovisual deepfake, deepfake detection approaches, audiovisual deepfake detection

备注: In Progress

点击查看摘要

Abstract:Since deepfakes generated by advanced generative models have rapidly posed serious threats, existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose a novel reference-aware audiovisual deepfake detection method, called Referee. Speaker-specific cues from only one-shot examples are leveraged to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. Experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at this https URL.

24. 【2510.27460】A Multi-tiered Human-in-the-loop Approach for Interactive School Mapping Using Earth Observation and Machine Learning

链接https://arxiv.org/abs/2510.27460

作者:Casper Fibaek,Abi Riley,Kelsey Doerksen,Do-Hyung Kim,Rochelle Schneider

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:educational facility records, facility records, infrequently updated, paper presents, accuracy and completeness

备注

点击查看摘要

Abstract:This paper presents a multi-tiered human-in-the-loop framework for interactive school mapping designed to improve the accuracy and completeness of educational facility records, particularly in developing regions where such data may be scarce and infrequently updated. The first tier involves a machine learning based analysis of population density, land cover, and existing infrastructure compared with known school locations. The first tier identifies potential gaps and "mislabelled" schools. In subsequent tiers, medium-resolution satellite imagery (Sentinel-2) is investigated to pinpoint regions with a high likelihood of school presence, followed by the application of very high-resolution (VHR) imagery and deep learning models to generate detailed candidate locations for schools within these prioritised areas. The medium-resolution approach was later removed due to insignificant improvements. The medium and VHR resolution models build upon global pre-trained steps to improve generalisation. A key component of the proposed approach is an interactive interface to allow human operators to iteratively review, validate, and refine the mapping results. Preliminary evaluations indicate that the multi-tiered strategy provides a scalable and cost-effective solution for educational infrastructure mapping to support planning and resource allocation.

25. 【2510.27452】From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration

链接https://arxiv.org/abs/2510.27452

作者:Jianwen Sun,Fanrui Zhang,Yukang Feng,Chuanhao Li,Zizhen Li,Jiaxin Ai,Yifan Chang,Yu Dai,Kaipeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high information density, density and post-editability, Scientific illustrations demand, demand both high, high information

备注

点击查看摘要

Abstract:Scientific illustrations demand both high information density and post-editability. However, current generative models have two major limitations: Frist, image generation models output rasterized images lacking semantic structure, making it impossible to access, edit, or rearrange independent visual components in the images. Second, code-based generation methods (TikZ or SVG), although providing element-level control, force users into the cumbersome cycle of "writing-compiling-reviewing" and lack the intuitiveness of manipulation. Neither of these two approaches can well meet the needs for efficiency, intuitiveness, and iterative modification in scientific creation. To bridge this gap, we introduce VisPainter, a multi-agent framework for scientific illustration built upon the model context protocol. VisPainter orchestrates three specialized modules-a Manager, a Designer, and a Toolbox-to collaboratively produce diagrams compatible with standard vector graphics software. This modular, role-based design allows each element to be explicitly represented and manipulated, enabling true element-level control and any element can be added and modified later. To systematically evaluate the quality of scientific illustrations, we introduce VisBench, a benchmark with seven-dimensional evaluation metrics. It assesses high-information-density scientific illustrations from four aspects: content, layout, visual perception, and interaction cost. To this end, we conducted extensive ablation experiments to verify the rationality of our architecture and the reliability of our evaluation methods. Finally, we evaluated various vision-language models, presenting fair and credible model rankings along with detailed comparisons of their respective capabilities. Additionally, we isolated and quantified the impacts of role division, step control,and description on the quality of illustrations.

26. 【2510.27442】CoMViT: An Efficient Vision Backbone for Supervised Classification in Medical Imaging

链接https://arxiv.org/abs/2510.27442

作者:Aon Safdar,Mohamed Saadeldin

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:real-world clinical scenarios, high computational demands, generalizable Vision Transformer, Vision Transformer architecture, small datasets limit

备注: Preprint (submitted manuscript). Accepted at the MICCAI 2025 MIRASOL Workshop; to appear in the Springer proceedings volume. This is the pre-review version (not the Version of Record). DOI will be added after publication. [Optional: 8 pages, 4 figures, 4 tables.]

点击查看摘要

Abstract:Vision Transformers (ViTs) have demonstrated strong potential in medical imaging; however, their high computational demands and tendency to overfit on small datasets limit their applicability in real-world clinical scenarios. In this paper, we present CoMViT, a compact and generalizable Vision Transformer architecture optimized for resource-constrained medical image analysis. CoMViT integrates a convolutional tokenizer, diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation to improve performance and generalization. Through systematic architectural optimization, CoMViT achieves robust performance across twelve MedMNIST datasets while maintaining a lightweight design with only ~4.5M parameters. It matches or outperforms deeper CNN and ViT variants, offering up to 5-20x parameter reduction without sacrificing accuracy. Qualitative Grad-CAM analyses show that CoMViT consistently attends to clinically relevant regions despite its compact size. These results highlight the potential of principled ViT redesign for developing efficient and interpretable models in low-resource medical imaging settings.

27. 【2510.27439】DeblurSDI: Blind Image Deblurring Using Self-diffusion

链接https://arxiv.org/abs/2510.27439

作者:Yanlong Yang,Guanxiong Luo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:ill-posed inverse problem, challenging ill-posed inverse, inverse problem, challenging ill-posed, ill-posed inverse

备注

点击查看摘要

Abstract:Blind image deconvolution is a challenging ill-posed inverse problem, where both the latent sharp image and the blur kernel are unknown. Traditional methods often rely on handcrafted priors, while modern deep learning approaches typically require extensive pre-training on large external datasets, limiting their adaptability to real-world scenarios. In this work, we propose DeblurSDI, a zero-shot, self-supervised framework based on self-diffusion (SDI) that requires no prior training. DeblurSDI formulates blind deconvolution as an iterative reverse self-diffusion process that starts from pure noise and progressively refines the solution. At each step, two randomly-initialized neural networks are optimized continuously to refine the sharp image and the blur kernel. The optimization is guided by an objective function combining data consistency with a sparsity-promoting L1-norm for the kernel. A key innovation is our noise scheduling mechanism, which stabilizes the optimization and provides remarkable robustness to variations in blur kernel size. These allow DeblurSDI to dynamically learn an instance-specific prior tailored to the input image. Extensive experiments demonstrate that DeblurSDI consistently achieves superior performance, recovering sharp images and accurate kernels even in highly degraded scenarios.

28. 【2510.27432】Mitigating Semantic Collapse in Partially Relevant Video Retrieval

链接https://arxiv.org/abs/2510.27432

作者:WonJun Moon,MinSeok Jung,Gilhan Park,Tae-Young Kim,Cheol-Ho Cho,Woojin Jun,Jae-Pil Heo

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Partially Relevant Video, Partially Relevant, Relevant Video Retrieval, Relevant Video, content matches

备注: Accpeted to NeurIPS 2025. Code is available at [this https URL](https://github.com/admins97/MSC_PRVR)

点击查看摘要

Abstract:Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of both queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses the aforementioned problems, termed as semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.

29. 【2510.27421】Who Does Your Algorithm Fail? Investigating Age and Ethnic Bias in the MAMA-MIA Dataset

链接https://arxiv.org/abs/2510.27421

作者:Aditya Parikh,Sneha Das,Aasa Feragen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Deep learning models, improve diagnostic workflows, evaluation remains underexplored, Deep learning, learning models aim

备注: Medical Imaging Meets EurIPS (NeurIPS-endorsed workshop) - MedEurIPS

点击查看摘要

Abstract:Deep learning models aim to improve diagnostic workflows, but fairness evaluation remains underexplored beyond classification, e.g., in image segmentation. Unaddressed segmentation bias can lead to disparities in the quality of care for certain populations, potentially compounded across clinical decision points and amplified through iterative model development. Here, we audit the fairness of the automated segmentation labels provided in the breast cancer tumor segmentation dataset MAMA-MIA. We evaluate automated segmentation quality across age, ethnicity, and data source. Our analysis reveals an intrinsic age-related bias against younger patients that continues to persist even after controlling for confounding factors, such as data source. We hypothesize that this bias may be linked to physiological factors, a known challenge for both radiologists and automated systems. Finally, we show how aggregating data from multiple data sources influences site-specific ethnic biases, underscoring the necessity of investigating data at a granular level.

30. 【2510.27392】A Hybrid Deep Learning and Forensic Approach for Robust Deepfake Detection

链接https://arxiv.org/abs/2510.27392

作者:Sales Aribe Jr

类目:Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

关键词:media increasingly realistic, raising societal concerns, made synthetic media, synthetic media increasingly, generative adversarial networks

备注: 11 pages, 13 figures, 9 tables, Published with International Journal of Advanced Computer Science and Applications (IJACSA)

点击查看摘要

Abstract:The rapid evolution of generative adversarial networks (GANs) and diffusion models has made synthetic media increasingly realistic, raising societal concerns around misinformation, identity fraud, and digital trust. Existing deepfake detection methods either rely on deep learning, which suffers from poor generalization and vulnerability to distortions, or forensic analysis, which is interpretable but limited against new manipulation techniques. This study proposes a hybrid framework that fuses forensic features, including noise residuals, JPEG compression traces, and frequency-domain descriptors, with deep learning representations from convolutional neural networks (CNNs) and vision transformers (ViTs). Evaluated on benchmark datasets (FaceForensics++, Celeb-DF v2, DFDC), the proposed model consistently outperformed single-method baselines and demonstrated superior performance compared to existing state-of-the-art hybrid approaches, achieving F1-scores of 0.96, 0.82, and 0.77, respectively. Robustness tests demonstrated stable performance under compression (F1 = 0.87 at QF = 50), adversarial perturbations (AUC = 0.84), and unseen manipulations (F1 = 0.79). Importantly, explainability analysis showed that Grad-CAM and forensic heatmaps overlapped with ground-truth manipulated regions in 82 percent of cases, enhancing transparency and user trust. These findings confirm that hybrid approaches provide a balanced solution, combining the adaptability of deep models with the interpretability of forensic cues, to develop resilient and trustworthy deepfake detection systems.

31. 【2510.27391】Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

链接https://arxiv.org/abs/2510.27391

作者:Wu Wei,Xiaomeng Fan,Yuwei Wu,Zhi Gao,Pengxiang Li,Yunde Jia,Mehrtash Harandi

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Modality alignment, effectively integrate information, critical for vision-language, integrate information, Modality

备注

点击查看摘要

Abstract:Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.

32. 【2510.27364】Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V

链接https://arxiv.org/abs/2510.27364

作者:Meftun Akarsu,Kerem Catay,Sedat Bin Vedat,Enes Kutay Yarkan,Ilke Senturk,Arda Sar,Dafne Eksioglu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:fine-tuning open-source video, open-source video diffusion, video diffusion transformers, synthesize cinematic scenes, present a practical

备注: video generation, image-to-video, dif- fusion transformer, LoRA, fine-tuning, cinematic scene synthesis, multi-GPU inference, fully sharded data parallelism, computational efficiency

点击查看摘要

Abstract:We present a practical pipeline for fine-tuning open-source video diffusion transformers to synthesize cinematic scenes for television and film production from small datasets. The proposed two-stage process decouples visual style learning from motion generation. In the first stage, Low-Rank Adaptation (LoRA) modules are integrated into the cross-attention layers of the Wan2.1 I2V-14B model to adapt its visual representations using a compact dataset of short clips from Ay Yapim's historical television film El Turco. This enables efficient domain transfer within hours on a single GPU. In the second stage, the fine-tuned model produces stylistically consistent keyframes that preserve costume, lighting, and color grading, which are then temporally expanded into coherent 720p sequences through the model's video decoder. We further apply lightweight parallelization and sequence partitioning strategies to accelerate inference without quality degradation. Quantitative and qualitative evaluations using FVD, CLIP-SIM, and LPIPS metrics, supported by a small expert user study, demonstrate measurable improvements in cinematic fidelity and temporal stability over the base model. The complete training and inference pipeline is released to support reproducibility and adaptation across cinematic domains.

33. 【2510.27359】FPS: Feedforward-based Parameter Selection For Efficient Fine-Tuning

链接https://arxiv.org/abs/2510.27359

作者:Kenneth Yang,Wen-Li Wei,Jen-Chun Lin

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:face notable limitations, existing approaches face, approaches face notable, Gradient-based Parameter Selection, notable limitations

备注

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key strategy for adapting large-scale pre-trained models to downstream tasks, but existing approaches face notable limitations. Addition-based methods, such as Adapters [1], introduce inference latency and engineering complexity, while selection-based methods like Gradient-based Parameter Selection (GPS) [2] require a full backward pass, which results in the same peak memory usage as full fine-tuning. To address this dilemma, we propose Feedforward-based Parameter Selection (FPS), a gradient-free method that identifies an optimal parameter subset in a single forward pass. FPS ranks parameters by the product of their magnitudes and corresponding input activations, leveraging both pre-trained knowledge and downstream data. Evaluated on $24$ visual tasks from FGVC and VTAB-1k, FPS achieves performance comparable to state-of-the-art methods while reducing peak memory usage by nearly $9 \times$ and accelerating parameter selection by about $2 \times$, offering a genuinely memory-efficient and practical solution for fine-tuning large-scale pre-trained models.

34. 【2510.27350】RzenEmbed: Towards Comprehensive Multimodal Retrieval

链接https://arxiv.org/abs/2510.27350

作者:Weijian Jian,Yajun Zhang,Dawei Liang,Chunyu Xie,Yixiao He,Dawei Leng,Yuhui Yin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal Large Language, Large Language Models, Large Language, extended CLIP-based frameworks, Multimodal Large

备注

点击查看摘要

Abstract:The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering limited support for other crucial visual modalities such as videos and visual documents. To bridge this gap, we introduce RzenEmbed, a unified framework to learn embeddings across a diverse set of modalities, including text, images, videos, and visual documents. We employ a novel two-stage training strategy to learn discriminative representations. The first stage focuses on foundational text and multimodal retrieval. In the second stage, we introduce an improved InfoNCE loss, incorporating two key enhancements. Firstly, a hardness-weighted mechanism guides the model to prioritize challenging samples by assigning them higher weights within each batch. Secondly, we implement an approach to mitigate the impact of false negatives and alleviate data noise. This strategy not only enhances the model's discriminative power but also improves its instruction-following capabilities. We further boost performance with learnable temperature parameter and model souping. RzenEmbed sets a new state-of-the-art on the MMEB benchmark. It not only achieves the best overall score but also outperforms all prior work on the challenging video and visual document retrieval tasks. Our models are available in this https URL.

35. 【2510.27335】Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing

链接https://arxiv.org/abs/2510.27335

作者:Yijia Wang,Yiqing Shen,Weiming Chen,Zhihai He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Existing image editing, Existing image, textbf, large language models, handle simple editing

备注

点击查看摘要

Abstract:Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which involves very high computational complexity and training cost. To address this issue, we propose a new method, called \textbf{C}omplex \textbf{I}mage \textbf{E}diting via \textbf{L}LM \textbf{R}easoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions, eliminating the need for jointly fine-tuning the large language models and diffusion models. Specifically, we first construct a structured semantic representation of the input image using foundation models. Then, we introduce an iterative update mechanism that can progressively refine this representation, obtaining a fine-grained visual representation of the image scene. This allows us to perform complex and flexible image editing tasks. Extensive experiments on the SmartEdit Reasoning Scenario Set show that our method surpasses the previous state-of-the-art by 9.955 dB in PSNR, indicating its superior preservation of regions that should remain consistent. Due to the limited number of samples of public datasets of complex image editing with reasoning, we construct a benchmark named CIEBench, containing 86 image samples, together with a metric specifically for reasoning-based image editing. CIELR also outperforms previous methods on this benchmark. The code and dataset are available at \href{this https URL}{this https URL}.

36. 【2510.27326】MeisenMeister: A Simple Two Stage Pipeline for Breast Cancer Classification on MRI

链接https://arxiv.org/abs/2510.27326

作者:Benjamin Hamm,Yannick Kirchhoff,Maximilian Rokuss,Klaus Maier-Hein

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:ODELIA Breast MRI, breast MRI scans, Breast MRI Challenge, Breast MRI, breast cancer detection

备注: Winning Solution of the MICCAI 2025 ODELIA Breast MRI Classification Challenge

点击查看摘要

Abstract:The ODELIA Breast MRI Challenge 2025 addresses a critical issue in breast cancer screening: improving early detection through more efficient and accurate interpretation of breast MRI scans. Even though methods for general-purpose whole-body lesion segmentation as well as multi-time-point analysis exist, breast cancer detection remains highly challenging, largely due to the limited availability of high-quality segmentation labels. Therefore, developing robust classification-based approaches is crucial for the future of early breast cancer detection, particularly in applications such as large-scale screening. In this write-up, we provide a comprehensive overview of our approach to the challenge. We begin by detailing the underlying concept and foundational assumptions that guided our work. We then describe the iterative development process, highlighting the key stages of experimentation, evaluation, and refinement that shaped the evolution of our solution. Finally, we present the reasoning and evidence that informed the design choices behind our final submission, with a focus on performance, robustness, and clinical relevance. We release our full implementation publicly at this https URL

37. 【2510.27324】Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis

链接https://arxiv.org/abs/2510.27324

作者:Weiming Chen,Yijia Wang,Zhihan Zhu,Zhihai He

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:deep space exploration, remote vision analysis, vision analysis, bit rate visual, bit rate

备注

点击查看摘要

Abstract:We consider the problem of ultra-low bit rate visual communication for remote vision analysis, human interactions and control in challenging scenarios with very low communication bandwidth, such as deep space exploration, battlefield intelligence, and robot navigation in complex environments. In this paper, we ask the following important question: can we accurately reconstruct the visual scene using only a very small portion of the bit rate in existing coding methods while not sacrificing the accuracy of vision analysis and performance of human interactions? Existing text-to-image generation models offer a new approach for ultra-low bitrate image description. However, they can only achieve a semantic-level approximation of the visual scene, which is far insufficient for the purpose of visual communication and remote vision analysis and human interactions. To address this important issue, we propose to seamlessly integrate image generation with deep image compression, using joint text and coding latent to guide the rectified flow models for precise generation of the visual scene. The semantic text description and coding latent are both encoded and transmitted to the decoder at a very small bit rate. Experimental results demonstrate that our method can achieve the same image reconstruction quality and vision analysis accuracy as existing methods while using much less bandwidth. The code will be released upon paper acceptance.

38. 【2510.27318】SAGS: Self-Adaptive Alias-Free Gaussian Splatting for Dynamic Surgical Endoscopic Reconstruction

链接https://arxiv.org/abs/2510.27318

作者:Wenfeng Huang,Xiangyun Liao,Yinling Qian,Hao Liu,Yongming Yang,Wenjing Jia,Qiong Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Neural Radiance Fields, Surgical reconstruction, robot-assisted surgery, crucial technology, technology in robot-assisted

备注

点击查看摘要

Abstract:Surgical reconstruction of dynamic tissues from endoscopic videos is a crucial technology in robot-assisted surgery. The development of Neural Radiance Fields (NeRFs) has greatly advanced deformable tissue reconstruction, achieving high-quality results from video and image sequences. However, reconstructing deformable endoscopic scenes remains challenging due to aliasing and artifacts caused by tissue movement, which can significantly degrade visualization quality. The introduction of 3D Gaussian Splatting (3DGS) has improved reconstruction efficiency by enabling a faster rendering pipeline. Nevertheless, existing 3DGS methods often prioritize rendering speed while neglecting these critical issues. To address these challenges, we propose SAGS, a self-adaptive alias-free Gaussian splatting framework. We introduce an attention-driven, dynamically weighted 4D deformation decoder, leveraging 3D smoothing filters and 2D Mip filters to mitigate artifacts in deformable tissue reconstruction and better capture the fine details of tissue movement. Experimental results on two public benchmarks, EndoNeRF and SCARED, demonstrate that our method achieves superior performance in all metrics of PSNR, SSIM, and LPIPS compared to the state of the art while also delivering better visualization quality.

39. 【2510.27316】Overcoming Prompts Pool Confusion via Parameterized Prompt for Incremental Object Detection

链接https://arxiv.org/abs/2510.27316

作者:Zijia An,Boyu Diao,Ruiqi Liu,Libo Huang,Chuanguang Yang,Fei Wang,Zhulin An,Yongjun Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:pretrained models enables, models enables effective, Recent studies, effective incremental learning, enables effective incremental

备注

点击查看摘要

Abstract:Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Existing prompts pool based approaches assume disjoint class sets across incremental tasks, which are unsuitable for IOD as they overlook the inherent co-occurrence phenomenon in detection images. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current task images, leading to confusion in prompts pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P$^2$IOD). Leveraging neural networks global evolution properties, P$^2$IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain prompts structure updates, P$^2$IOD further engages a parameterized prompts fusion strategy. Extensive experiments on PASCAL VOC2007 and MS COCO datasets demonstrate that P$^2$IOD's effectiveness in IOD and achieves the state-of-the-art performance among existing baselines.

40. 【2510.27315】CASR-Net: An Image Processing-focused Deep Learning-based Coronary Artery Segmentation and Refinement Network for X-ray Coronary Angiogram

链接https://arxiv.org/abs/2510.27315

作者:Alvee Hassan,Rusab Sarmun,Muhammad E. H. Chowdhury,M. Murugappan,Md. Sakib Abrar Hossain,Sakib Mahmud,Abdulrahman Alqahtani,Sohaib Bassam Zoghoul,Amith Khandakar,Susu M. Zughaier,Somaya Al-Maadeed,Anwarul Hasan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:coronary artery disease, Early detection, Coronary Artery Segmentation, improving patient treatment, coronary artery

备注

点击查看摘要

Abstract:Early detection of coronary artery disease (CAD) is critical for reducing mortality and improving patient treatment planning. While angiographic image analysis from X-rays is a common and cost-effective method for identifying cardiac abnormalities, including stenotic coronary arteries, poor image quality can significantly impede clinical diagnosis. We present the Coronary Artery Segmentation and Refinement Network (CASR-Net), a three-stage pipeline comprising image preprocessing, segmentation, and refinement. A novel multichannel preprocessing strategy combining CLAHE and an improved Ben Graham method provides incremental gains, increasing Dice Score Coefficient (DSC) by 0.31-0.89% and Intersection over Union (IoU) by 0.40-1.16% compared with using the techniques individually. The core innovation is a segmentation network built on a UNet with a DenseNet121 encoder and a Self-organized Operational Neural Network (Self-ONN) based decoder, which preserves the continuity of narrow and stenotic vessel branches. A final contour refinement module further suppresses false positives. Evaluated with 5-fold cross-validation on a combination of two public datasets that contain both healthy and stenotic arteries, CASR-Net outperformed several state-of-the-art models, achieving an IoU of 61.43%, a DSC of 76.10%, and clDice of 79.36%. These results highlight a robust approach to automated coronary artery segmentation, offering a valuable tool to support clinicians in diagnosis and treatment planning.

41. 【2510.27296】Versatile and Efficient Medical Image Super-Resolution Via Frequency-Gated Mamba

链接https://arxiv.org/abs/2510.27296

作者:Wenfeng Huang,Xiangyun Liao,Wei Cao,Wenjing Jia,Weixin Si

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enhancing diagnostic accuracy, reducing acquisition cost, scanning time, Medical image super-resolution, essential for enhancing

备注

点击查看摘要

Abstract:Medical image super-resolution (SR) is essential for enhancing diagnostic accuracy while reducing acquisition cost and scanning time. However, modeling both long-range anatomical structures and fine-grained frequency details with low computational overhead remains challenging. We propose FGMamba, a novel frequency-aware gated state-space model that unifies global dependency modeling and fine-detail enhancement into a lightweight architecture. Our method introduces two key innovations: a Gated Attention-enhanced State-Space Module (GASM) that integrates efficient state-space modeling with dual-branch spatial and channel attention, and a Pyramid Frequency Fusion Module (PFFM) that captures high-frequency details across multiple resolutions via FFT-guided fusion. Extensive evaluations across five medical imaging modalities (Ultrasound, OCT, MRI, CT, and Endoscopic) demonstrate that FGMamba achieves superior PSNR/SSIM while maintaining a compact parameter footprint ($$0.75M), outperforming CNN-based and Transformer-based SOTAs. Our results validate the effectiveness of frequency-aware state-space modeling for scalable and accurate medical image enhancement.

42. 【2510.27285】Rethinking Robust Adversarial Concept Erasure in Diffusion Models

链接https://arxiv.org/abs/2510.27285

作者:Qinghong Yin,Yu Tian,Yue Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

关键词:unlearning undesirable content, Concept, Concept erasure aims, selectively unlearning undesirable, sensitive content generation

备注

点击查看摘要

Abstract:Concept erasure aims to selectively unlearning undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which grace leveraging semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at this https URL.

43. 【2510.27280】FOCUS: Efficient Keyframe Selection for Long Video Understanding

链接https://arxiv.org/abs/2510.27280

作者:Zirui Zhu,Hailun Xu,Yang Luo,Yong Liu,Kanchan Sarkar,Zhenheng Yang,Yang You

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Multimodal large language, Multimodal large, large language models, keyframe selection, large language

备注

点击查看摘要

Abstract:Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as:
arXiv:2510.27280 [cs.CV]

(or
arXiv:2510.27280v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.27280

Focus to learn more

              arXiv-issued DOI via DataCite</p>
44. 【2510.27266】HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration

链接https://arxiv.org/abs/2510.27266

作者:Shaojie Zhang,Pei Fu,Ruoceng Zhang,Jiahui Yang,Anan Du,Xiuwen Xi,Shaokang Wang,Ying Huang,Bin Qin,Zhenbo Luo,Jian Luan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Autonomous Graphical User, Graphical User Interface, execute user commands, Autonomous Graphical, User Interface

备注

点击查看摘要

Abstract:Autonomous Graphical User Interface (GUI) agents rely on accurate GUI grounding, which maps language instructions to on-screen coordinates, to execute user commands. However, current models, whether trained via supervised fine-tuning (SFT) or reinforcement fine-tuning (RFT), lack self-awareness of their capability boundaries, leading to overconfidence and unreliable predictions. We first systematically evaluate probabilistic and verbalized confidence in general and GUI-specific models, revealing a misalignment between confidence and actual accuracy, which is particularly critical in dynamic GUI automation tasks, where single errors can cause task failure. To address this, we propose HyperClick, a novel framework that enhances reliable GUI grounding through uncertainty calibration. HyperClick introduces a dual reward mechanism, combining a binary reward for correct actions with a truncated Gaussian-based spatial confidence modeling, calibrated using the Brier score. This approach jointly optimizes grounding accuracy and confidence reliability, fostering introspective self-criticism. Extensive experiments on seven challenge benchmarks show that HyperClick achieves state-of-the-art performance while providing well-calibrated confidence. By enabling explicit confidence calibration and introspective self-criticism, HyperClick reduces overconfidence and supports more reliable GUI automation.

45. 【2510.27265】3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis

链接https://arxiv.org/abs/2510.27265

作者:Raza Imam,Hu Wang,Dwarikanath Mahapatra,Mohammad Yaqub

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:pretrained networks offer, networks offer broad, achieve high in-distribution, vision-language models face, fine-tuned expert models

备注: Main: 11 pages, Supplementary: 9 pages 10 tables, 10 figures

点击查看摘要

Abstract:In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T^3), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models' output distributions. T^3 dynamically preserves local precision when models agree and defers to generalist robustness under drift. To overcome the inference costs of sample-wise merging, we further propose a batch-wise extension, T^3_B, that computes a merging coefficient across a batch of samples, dramatically reducing computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruptions across four modalities. Empirically, T^3 sets new state-of-the-art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at this https URL.

46. 【2510.27261】RegionRAG: Region-level Retrieval-Augumented Generation for Visually-Rich Documents

链接https://arxiv.org/abs/2510.27261

作者:Yinglu Li,Zhiying Lu,Zhihang Liu,Chuanbin Liu,Hongtao Xie

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multi-modal Retrieval-Augmented Generation, Multi-modal Retrieval-Augmented, Retrieval-Augmented Generation, leveraging candidate visual, candidate visual documents

备注

点击查看摘要

Abstract:Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods consider the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) Relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) Retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model's attention and further degrade the performance. To address this challenge, we propose \modelname, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy from both labeled data and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that intelligently groups salient patches into complete semantic regions. By delegating the task of identifying relevant regions to the retriever, \modelname enables the generator to focus solely on concise visual content relevant to queries, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance. Improves retrieval accuracy by 10.02\% in R@1 on average and increases question answering accuracy by 3.56\% while using only 71.42\% visual tokens compared to prior methods. The code will be available at this https URL.

47. 【2510.27255】Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes

链接https://arxiv.org/abs/2510.27255

作者:Yehna Kim andYoung-Eun Kim,Seong-Whan Lee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:associate video embeddings, demonstrated impressive capabilities, class embeddings, recognition by learning, learning to associate

备注

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model's adaptability and effectiveness across various downstream tasks.

48. 【2510.27249】C-LEAD: Contrastive Learning for Enhanced Adversarial Defense

链接https://arxiv.org/abs/2510.27249

作者:Suklav Ghosh,Sonal Kumar,Arijit Sur

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved remarkable success, computer vision tasks, object detection, achieved remarkable, remarkable success

备注

点击查看摘要

Abstract:Deep neural networks (DNNs) have achieved remarkable success in computer vision tasks such as image classification, segmentation, and object detection. However, they are vulnerable to adversarial attacks, which can cause incorrect predictions with small perturbations in input images. Addressing this issue is crucial for deploying robust deep-learning systems. This paper presents a novel approach that utilizes contrastive learning for adversarial defense, a previously unexplored area. Our method leverages the contrastive loss function to enhance the robustness of classification models by training them with both clean and adversarially perturbed images. By optimizing the model's parameters alongside the perturbations, our approach enables the network to learn robust representations that are less susceptible to adversarial attacks. Experimental results show significant improvements in the model's robustness against various types of adversarial perturbations. This suggests that contrastive loss helps extract more informative and resilient features, contributing to the field of adversarial robustness in deep learning.

49. 【2510.27245】rans-defense: Transformer-based Denoiser for Adversarial Defense with Spatial-Frequency Domain Representation

链接https://arxiv.org/abs/2510.27245

作者:Alik Pramanick,Mayank Bansal,Utkarsh Srivastava,Suklav Ghosh,Arijit Sur

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recent times, deep neural networks, successfully adopted, adversarial attacks, denoising network

备注

点击查看摘要

Abstract:In recent times, deep neural networks (DNNs) have been successfully adopted for various applications. Despite their notable achievements, it has become evident that DNNs are vulnerable to sophisticated adversarial attacks, restricting their applications in security-critical systems. In this paper, we present two-phase training methods to tackle the attack: first, training the denoising network, and second, the deep classifier model. We propose a novel denoising strategy that integrates both spatial and frequency domain approaches to defend against adversarial attacks on images. Our analysis reveals that high-frequency components of attacked images are more severely corrupted compared to their lower-frequency counterparts. To address this, we leverage Discrete Wavelet Transform (DWT) for frequency analysis and develop a denoising network that combines spatial image features with wavelets through a transformer layer. Next, we retrain the classifier using the denoised images, which enhances the classifier's robustness against adversarial attacks. Experimental results across the MNIST, CIFAR-10, and Fashion-MNIST datasets reveal that the proposed method remarkably elevates classification accuracy, substantially exceeding the performance by utilizing a denoising network and adversarial training approaches. The code is available at this https URL.

50. 【2510.27237】Fusion of Heterogeneous Pathology Foundation Models for Whole Slide Image Analysis

链接https://arxiv.org/abs/2510.27237

作者:Zhidong Yang,Xiuhui Shi,Wei Ba,Zhigang Song,Haijing Luan,Taiyuan Hu,Senlin Lin,Jiguang Wang,Shaohua Kevin Zhou,Rui Yan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:increasingly essential technique, slide image, analysis has emerged, computational pathology, increasingly essential

备注: 22 pages, 9 figures

点击查看摘要

Abstract:Whole slide image (WSI) analysis has emerged as an increasingly essential technique in computational pathology. Recent advances in the pathological foundation models (FMs) have demonstrated significant advantages in deriving meaningful patch-level or slide-level feature representations from WSIs. However, current pathological FMs have exhibited substantial heterogeneity caused by diverse private training datasets and different network architectures. This heterogeneity introduces performance variability when we utilize the extracted features from different FMs in the downstream tasks. To fully explore the advantage of multiple FMs effectively, in this work, we propose a novel framework for the fusion of heterogeneous pathological FMs, called FuseCPath, yielding a model with a superior ensemble performance. The main contributions of our framework can be summarized as follows: (i) To guarantee the representativeness of the training patches, we propose a multi-view clustering-based method to filter out the discriminative patches via multiple FMs' embeddings. (ii) To effectively fuse the heterogeneous patch-level FMs, we devise a cluster-level re-embedding strategy to online capture patch-level local features. (iii) To effectively fuse the heterogeneous slide-level FMs, we devise a collaborative distillation strategy to explore the connections between slide-level FMs. Extensive experiments conducted on lung cancer, bladder cancer, and colorectal cancer datasets from The Cancer Genome Atlas (TCGA) have demonstrated that the proposed FuseCPath achieves state-of-the-art performance across multiple tasks on these public datasets.

51. 【2510.27236】Object-IR: Leveraging Object Consistency and Mesh Deformation for Self-Supervised Image Retargeting

链接https://arxiv.org/abs/2510.27236

作者:Tianli Liao,Ran Wang,Siqing Zhang,Lei Li,Guangen Liu,Chenyang Zhao,Heling Cao,Peng Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Eliminating geometric distortion, semantically important regions, important regions remains, input image, Eliminating geometric

备注: Publish in Pattern Recognition

点击查看摘要

Abstract:Eliminating geometric distortion in semantically important regions remains an intractable challenge in image retargeting. This paper presents Object-IR, a self-supervised architecture that reformulates image retargeting as a learning-based mesh warping optimization problem, where the mesh deformation is guided by object appearance consistency and geometric-preserving constraints. Given an input image and a target aspect ratio, we initialize a uniform rigid mesh at the output resolution and use a convolutional neural network to predict the motion of each mesh grid and obtain the deformed mesh. The retargeted result is generated by warping the input image according to the rigid mesh in the input image and the deformed mesh in the output resolution. To mitigate geometric distortion, we design a comprehensive objective function incorporating a) object-consistent loss to ensure that the important semantic objects retain their appearance, b) geometric-preserving loss to constrain simple scale transform of the important meshes, and c) boundary loss to enforce a clean rectangular output. Notably, our self-supervised paradigm eliminates the need for manually annotated retargeting datasets by deriving supervision directly from the input's geometric and semantic properties. Extensive evaluations on the RetargetMe benchmark demonstrate that our Object-IR achieves state-of-the-art performance, outperforming existing methods in quantitative metrics and subjective visual quality assessments. The framework efficiently processes arbitrary input resolutions (average inference time: 0.009s for 1024x683 resolution) while maintaining real-time performance on consumer-grade GPUs. The source code will soon be available at this https URL.

52. 【2510.27234】MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts

链接https://arxiv.org/abs/2510.27234

作者:Jingnan Gao,Zhe Wang,Xianze Fang,Xingyu Ren,Zhuo Chen,Shengqi Liu,Yuhao Cheng,Jiangjing Lyu,Xiaokang Yang,Yichao Yan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, model capacity consistently, capacity consistently improves, advances in language, language and vision

备注: Project Page: [this https URL](https://g-1nonly.github.io/MoRE_Website/) , Code: [this https URL](https://github.com/alibaba/Taobao3D)

点击查看摘要

Abstract:Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks. In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations. However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability. Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. In addition, it integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction. MoRE is further optimized with tailored loss functions to ensure robust learning across diverse inputs and multiple geometric tasks. Extensive experiments demonstrate that MoRE achieves state-of-the-art performance across multiple benchmarks and supports effective downstream applications without extra computation.

53. 【2510.27224】Mask-to-Height: A YOLOv11-Based Architecture for Joint Building Instance Segmentation and Height Classification from Satellite Imagery

链接https://arxiv.org/abs/2510.27224

作者:Mahmoud El Hussieni,Bahadır K. Güntürk,Hasan F. Ateş,Oğuz Hanoğlu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Accurate building instance, Accurate building, infrastructure monitoring, Accurate, height classification

备注

点击查看摘要

Abstract:Accurate building instance segmentation and height classification are critical for urban planning, 3D city modeling, and infrastructure monitoring. This paper presents a detailed analysis of YOLOv11, the recent advancement in the YOLO series of deep learning models, focusing on its application to joint building extraction and discrete height classification from satellite imagery. YOLOv11 builds on the strengths of earlier YOLO models by introducing a more efficient architecture that better combines features at different scales, improves object localization accuracy, and enhances performance in complex urban scenes. Using the DFC2023 Track 2 dataset -- which includes over 125,000 annotated buildings across 12 cities -- we evaluate YOLOv11's performance using metrics such as precision, recall, F1 score, and mean average precision (mAP). Our findings demonstrate that YOLOv11 achieves strong instance segmentation performance with 60.4\% mAP@50 and 38.3\% mAP@50--95 while maintaining robust classification accuracy across five predefined height tiers. The model excels in handling occlusions, complex building shapes, and class imbalance, particularly for rare high-rise structures. Comparative analysis confirms that YOLOv11 outperforms earlier multitask frameworks in both detection accuracy and inference speed, making it well-suited for real-time, large-scale urban mapping. This research highlights YOLOv11's potential to advance semantic urban reconstruction through streamlined categorical height modeling, offering actionable insights for future developments in remote sensing and geospatial intelligence.

54. 【2510.27222】Soft Task-Aware Routing of Experts for Equivariant Representation Learning

链接https://arxiv.org/abs/2510.27222

作者:Jaebyeong Jeon,Hyeonseo Jang,Jy-yong Sohn,Kibok Lee

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

关键词:capture variations induced, learning encodes semantic, encodes semantic information, input transformations, representation learning aims

备注: NeurIPS 2025

点击查看摘要

Abstract:Equivariant representation learning aims to capture variations induced by input transformations in the representation space, whereas invariant representation learning encodes semantic information by disregarding such transformations. Recent studies have shown that jointly learning both types of representations is often beneficial for downstream tasks, typically by employing separate projection heads. However, this design overlooks information shared between invariant and equivariant learning, which leads to redundant feature learning and inefficient use of model capacity. To address this, we introduce Soft Task-Aware Routing (STAR), a routing strategy for projection heads that models them as experts. STAR induces the experts to specialize in capturing either shared or task-specific information, thereby reducing redundant feature learning. We validate this effect by observing lower canonical correlations between invariant and equivariant embeddings. Experimental results show consistent improvements across diverse transfer learning tasks. The code is available at this https URL.

55. 【2510.27219】SpecAware: A Spectral-Content Aware Foundation Model for Unifying Multi-Sensor Learning in Hyperspectral Remote Sensing Mapping

链接https://arxiv.org/abs/2510.27219

作者:Renjie Ji,Xue Wang,Chao Niu,Wen Zhang,Yong Mei,Kun Tan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:HSI, vital tool, tool for fine-grained, fine-grained land-use, HSI foundation models

备注

点击查看摘要

Abstract:Hyperspectral imaging (HSI) is a vital tool for fine-grained land-use and land-cover (LULC) mapping. However, the inherent heterogeneity of HSI data has long posed a major barrier to developing generalized models via joint training. Although HSI foundation models have shown promise for different downstream tasks, the existing approaches typically overlook the critical guiding role of sensor meta-attributes, and struggle with multi-sensor training, limiting their transferability. To address these challenges, we propose SpecAware, which is a novel hyperspectral spectral-content aware foundation model for unifying multi-sensor learning for HSI mapping. We also constructed the Hyper-400K dataset to facilitate this research, which is a new large-scale, high-quality benchmark dataset with over 400k image patches from diverse airborne AVIRIS sensors. The core of SpecAware is a two-step hypernetwork-driven encoding process for HSI data. Firstly, we designed a meta-content aware module to generate a unique conditional input for each HSI patch, tailored to each spectral band of every sample by fusing the sensor meta-attributes and its own image content. Secondly, we designed the HyperEmbedding module, where a sample-conditioned hypernetwork dynamically generates a pair of matrix factors for channel-wise encoding, consisting of adaptive spatial pattern extraction and latent semantic feature re-projection. Thus, SpecAware gains the ability to perceive and interpret spatial-spectral features across diverse scenes and sensors. This, in turn, allows SpecAware to adaptively process a variable number of spectral channels, establishing a unified framework for joint pre-training. Extensive experiments on six datasets demonstrate that SpecAware can learn superior feature representations, excelling in land-cover semantic segmentation classification, change detection, and scene classification.

56. 【2510.27213】Privacy-Aware Continual Self-Supervised Learning on Multi-Window Chest Computed Tomography for Domain-Shift Robustness

链接https://arxiv.org/abs/2510.27213

作者:Ren Tasai,Guang Li,Ren Togo,Takahiro Ogawa,Kenji Hirata,Minghui Tang,Takaaki Yoshimura,Hiroyuki Sugimori,Noriko Nishioka,Yukie Shimizu,Kohsuke Kudo,Miki Haseyama

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:simultaneously learning diverse, chest computed tomography, continual self-supervised learning, learning diverse features, computed tomography

备注

点击查看摘要

Abstract:We propose a novel continual self-supervised learning (CSSL) framework for simultaneously learning diverse features from multi-window-obtained chest computed tomography (CT) images and ensuring data privacy. Achieving a robust and highly generalizable model in medical image diagnosis is challenging, mainly because of issues, such as the scarcity of large-scale, accurately annotated datasets and domain shifts inherent to dynamic healthcare environments. Specifically, in chest CT, these domain shifts often arise from differences in window settings, which are optimized for distinct clinical purposes. Previous CSSL frameworks often mitigated domain shift by reusing past data, a typically impractical approach owing to privacy constraints. Our approach addresses these challenges by effectively capturing the relationship between previously learned knowledge and new information across different training stages through continual pretraining on unlabeled images. Specifically, by incorporating a latent replay-based mechanism into CSSL, our method mitigates catastrophic forgetting due to domain shifts during continual pretraining while ensuring data privacy. Additionally, we introduce a feature distillation technique that integrates Wasserstein distance-based knowledge distillation (WKD) and batch-knowledge ensemble (BKE), enhancing the ability of the model to learn meaningful, domain-shift-robust representations. Finally, we validate our approach using chest CT images obtained across two different window settings, demonstrating superior performance compared with other approaches.

57. 【2510.27210】GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation

链接https://arxiv.org/abs/2510.27210

作者:Tao Liu,Chongyu Wang,Rongjie Li,Yingchen Yu,Xuming He,Bai Song

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language

备注: Published in NeurIPS 2025

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, \textbf{GUI-Rise}, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). This framework employs specialized rewards, including a history-aware objective, directly linking summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks. Code is available at this https URL.

58. 【2510.27208】Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks

链接https://arxiv.org/abs/2510.27208

作者:Jiaxin Zhang,Zehong Zhu,Junye Deng,Yunqin Li,and Bowen Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Villages areas hold, areas hold significant, hold significant importance, villages spatial morphology, human-land relationships

备注

点击查看摘要

Abstract:Villages areas hold significant importance in the study of human-land relationships. However, with the advancement of urbanization, the gradual disappearance of spatial characteristics and the homogenization of landscapes have emerged as prominent issues. Existing studies primarily adopt a single-disciplinary perspective to analyze villages spatial morphology and its influencing factors, relying heavily on qualitative analysis methods. These efforts are often constrained by the lack of digital infrastructure and insufficient data. To address the current research limitations, this paper proposes a Hierarchical Graph Neural Network (HGNN) model that integrates multi-source data to conduct an in-depth analysis of villages spatial morphology. The framework includes two types of nodes-input nodes and communication nodes-and two types of edges-static input edges and dynamic communication edges. By combining Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), the proposed model efficiently integrates multimodal features under a two-stage feature update mechanism. Additionally, based on existing principles for classifying villages spatial morphology, the paper introduces a relational pooling mechanism and implements a joint training strategy across 17 subtypes. Experimental results demonstrate that this method achieves significant performance improvements over existing approaches in multimodal fusion and classification tasks. Additionally, the proposed joint optimization of all sub-types lifts mean accuracy/F1 from 0.71/0.83 (independent models) to 0.82/0.90, driven by a 6% gain for parcel tasks. Our method provides scientific evidence for exploring villages spatial patterns and generative logic.

59. 【2510.27195】Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions

链接https://arxiv.org/abs/2510.27195

作者:Caixin Kang,Yifei Huang,Liangyang Ouyang,Mingfang Zhang,Yoichi Sato

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Social and Information Networks (cs.SI)

关键词:critical frontier, robust social intelligence, increasingly integrated, human lives, Multimodal Large Language

备注

点击查看摘要

Abstract:As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Consequently, their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video, text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.

60. 【2510.27186】Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications

链接https://arxiv.org/abs/2510.27186

作者:Zixuan Hu,Yongxian Wei,Li Shen,Zhenyi Wang,Lei Li,Chun Yuan,Dacheng Tao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:original training data, training data, pre-trained discriminative models, original training, large-scale Vision Transformers

备注

点击查看摘要

Abstract:Model inversion, which aims to reconstruct the original training data from pre-trained discriminative models, is especially useful when the original training data is unavailable due to privacy, usage rights, or size constraints. However, existing dense inversion methods attempt to reconstruct the entire image area, making them extremely inefficient when inverting high-resolution images from large-scale Vision Transformers (ViTs). We further identify two underlying causes of this inefficiency: the redundant inversion of noisy backgrounds and the unintended inversion of spurious correlations--a phenomenon we term "hallucination" in model inversion. To address these limitations, we propose a novel sparse model inversion strategy, as a plug-and-play extension to speed up existing dense inversion methods with no need for modifying their original loss functions. Specifically, we selectively invert semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations. Through both theoretical and empirical studies, we validate the efficacy of our approach in achieving significant inversion acceleration (up to 3.79 faster) while maintaining comparable or even enhanced downstream performance in data-free model quantization and data-free knowledge transfer. Code is available at this https URL.

61. 【2510.27181】Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization

链接https://arxiv.org/abs/2510.27181

作者:Guozheng Zheng,Jian Guan,Mingjie Xie,Xuanjia Zhao,Congyi Fan,Shiheng Zhang,Pengming Feng

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:satellite imagery remains, imagery remains challenging, remains challenging due, severe viewpoint gaps, geographically mismatched samples

备注: 5 pages, 3 figures

点击查看摘要

Abstract:Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.

62. 【2510.27179】SilhouetteTell: Practical Video Identification Leveraging Blurred Recordings of Video Subtitles

链接https://arxiv.org/abs/2510.27179

作者:Guanchong Huang,Song Fang

类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

关键词:significant privacy threat, religious beliefs, political leanings, sexual orientation, victims watch

备注: 16 pages, 29 figures. Accepted at 26th Privacy Enhancing Technologies Symposium (PETS 2026)

点击查看摘要

Abstract:Video identification attacks pose a significant privacy threat that can reveal videos that victims watch, which may disclose their hobbies, religious beliefs, political leanings, sexual orientation, and health status. Also, video watching history can be used for user profiling or advertising and may result in cyberbullying, discrimination, or blackmail. Existing extensive video inference techniques usually depend on analyzing network traffic generated by streaming online videos. In this work, we observe that the content of a subtitle determines its silhouette displayed on the screen, and identifying each subtitle silhouette also derives the temporal difference between two consecutive subtitles. We then propose SilhouetteTell, a novel video identification attack that combines the spatial and time domain information into a spatiotemporal feature of subtitle silhouettes. SilhouetteTell explores the spatiotemporal correlation between recorded subtitle silhouettes of a video and its subtitle file. It can infer both online and offline videos. Comprehensive experiments on off-the-shelf smartphones confirm the high efficacy of SilhouetteTell for inferring video titles and clips under various settings, including from a distance of up to 40 meters.

63. 【2510.27171】H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models

链接https://arxiv.org/abs/2510.27171

作者:Mingyu Sung,Il-Min Kim,Sangseok Yun,Jae-Mo Kang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:iterative denoising process, significant computational cost, deployment is hindered, computational cost, Diffusion models

备注

点击查看摘要

Abstract:Diffusion models have emerged as state-of-the-art in image generation, but their practical deployment is hindered by the significant computational cost of their iterative denoising process. While existing caching techniques can accelerate inference, they often create a challenging trade-off between speed and fidelity, suffering from quality degradation and high computational overhead. To address these limitations, we introduce H2-Cache, a novel hierarchical caching mechanism designed for modern generative diffusion model architectures. Our method is founded on the key insight that the denoising process can be functionally separated into a structure-defining stage and a detail-refining stage. H2-cache leverages this by employing a dual-threshold system, using independent thresholds to selectively cache each stage. To ensure the efficiency of our dual-check approach, we introduce pooled feature summarization (PFS), a lightweight technique for robust and fast similarity estimation. Extensive experiments on the Flux architecture demonstrate that H2-cache achieves significant acceleration (up to 5.08x) while maintaining image quality nearly identical to the baseline, quantitatively and qualitatively outperforming existing caching methods. Our work presents a robust and practical solution that effectively resolves the speed-quality dilemma, significantly lowering the barrier for the real-world application of high-fidelity diffusion models. Source code is available at this https URL.

64. 【2510.27169】DANCER: Dance ANimation via Condition Enhancement and Rendering with diffusion model

链接https://arxiv.org/abs/2510.27169

作者:Yucheng Xing,Jinxing Yin,Xiaodong Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:visual generation tasks, video generation, video generation tasks, video diffusion model, shown their impressive

备注

点击查看摘要

Abstract:Recently, diffusion models have shown their impressive ability in visual generation tasks. Besides static images, more and more research attentions have been drawn to the generation of realistic videos. The video generation not only has a higher requirement for the quality, but also brings a challenge in ensuring the video continuity. Among all the video generation tasks, human-involved contents, such as human dancing, are even more difficult to generate due to the high degrees of freedom associated with human motions. In this paper, we propose a novel framework, named as DANCER (Dance ANimation via Condition Enhancement and Rendering with Diffusion Model), for realistic single-person dance synthesis based on the most recent stable video diffusion model. As the video generation is generally guided by a reference image and a video sequence, we introduce two important modules into our framework to fully benefit from the two inputs. More specifically, we design an Appearance Enhancement Module (AEM) to focus more on the details of the reference image during the generation, and extend the motion guidance through a Pose Rendering Module (PRM) to capture pose conditions from extra domains. To further improve the generation capability of our model, we also collect a large amount of video data from Internet, and generate a novel datasetTikTok-3K to enhance the model training. The effectiveness of the proposed model has been evaluated through extensive experiments on real-world datasets, where the performance of our model is superior to that of the state-of-the-art methods. All the data and codes will be released upon acceptance.

65. 【2510.27166】M^3Detection: Multi-Frame Multi-Level Feature Fusion for Multi-Modal 3D Object Detection with Camera and 4D Imaging Radar

链接https://arxiv.org/abs/2510.27166

作者:Xiaozhi Li,Huijun Di,Jian Li,Feng Liu,Wei Liang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:sensors provide dense, provide dense semantic, Recent advances, camera sensors provide, dense semantic information

备注: 16 pages, 9 figures

点击查看摘要

Abstract:Recent advances in 4D imaging radar have enabled robust perception in adverse weather, while camera sensors provide dense semantic information. Fusing the these complementary modalities has great potential for cost-effective 3D perception. However, most existing camera-radar fusion methods are limited to single-frame inputs, capturing only a partial view of the scene. The incomplete scene information, compounded by image degradation and 4D radar sparsity, hinders overall detection performance. In contrast, multi-frame fusion offers richer spatiotemporal information but faces two challenges: achieving robust and effective object feature fusion across frames and modalities, and mitigating the computational cost of redundant feature extraction. Consequently, we propose M^3Detection, a unified multi-frame 3D object detection framework that performs multi-level feature fusion on multi-modal data from camera and 4D imaging radar. Our framework leverages intermediate features from the baseline detector and employs the tracker to produce reference trajectories, improving computational efficiency and providing richer information for second-stage. In the second stage, we design a global-level inter-object feature aggregation module guided by radar information to align global features across candidate proposals and a local-level inter-grid feature aggregation module that expands local features along the reference trajectories to enhance fine-grained object representation. The aggregated features are then processed by a trajectory-level multi-frame spatiotemporal reasoning module to encode cross-frame interactions and enhance temporal representation. Extensive experiments on the VoD and TJ4DRadSet datasets demonstrate that M^3Detection achieves state-of-the-art 3D detection performance, validating its effectiveness in multi-frame detection with camera-4D imaging radar fusion.

66. 【2510.27164】Generating Accurate and Detailed Captions for High-Resolution Images

链接https://arxiv.org/abs/2510.27164

作者:Hankyeol Lee,Gawon Seo,Kyounggyu Lee,Dogun Kim,Kyungwoo Song,Jiyoung Jung

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Vision-language models, low-resolution inputs, struggle to generate, generate accurate, typically pre-trained

备注: Work conducted in 2024; released for archival purposes

点击查看摘要

Abstract:Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution images to these dimensions may result in the loss of visual details and the omission of important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. Our proposed pipeline refines captions through a novel, multi-stage process. Given a high-resolution image, an initial caption is first generated using a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects not mentioned in the initial caption undergo focused, region-specific captioning to ensure they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, along with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images demonstrate that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.

67. 【2510.27158】How Close Are We? Limitations and Progress of AI Models in Banff Lesion Scoring

链接https://arxiv.org/abs/2510.27158

作者:Yanfan Zhu,Juming Xiong,Ruining Deng,Yu Wang,Yaohong Wang,Shilin Zhao,Mengmeng Yin,Yuqing Liu,Haichun Yang,Yuankai Huo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:inter-observer variability present, variability present significant, present significant challenges, renal transplant biopsies, evaluating renal transplant

备注

点击查看摘要

Abstract:The Banff Classification provides the global standard for evaluating renal transplant biopsies, yet its semi-quantitative nature, complex criteria, and inter-observer variability present significant challenges for computational replication. In this study, we explore the feasibility of approximating Banff lesion scores using existing deep learning models through a modular, rule-based framework. We decompose each Banff indicator - such as glomerulitis (g), peritubular capillaritis (ptc), and intimal arteritis (v) - into its constituent structural and inflammatory components, and assess whether current segmentation and detection tools can support their computation. Model outputs are mapped to Banff scores using heuristic rules aligned with expert guidelines, and evaluated against expert-annotated ground truths. Our findings highlight both partial successes and critical failure modes, including structural omission, hallucination, and detection ambiguity. Even when final scores match expert annotations, inconsistencies in intermediate representations often undermine interpretability. These results reveal the limitations of current AI pipelines in replicating computational expert-level grading, and emphasize the importance of modular evaluation and computational Banff grading standard in guiding future model development for transplant pathology.

68. 【2510.27155】AFM-Net: Advanced Fusing Hierarchical CNN Visual Priors with Global Sequence Modeling for Remote Sensing Image Scene Classification

链接https://arxiv.org/abs/2510.27155

作者:Yuanhao Tang,Xuechao Zou,Zhengpei Hu,Junliang Xing,Chengkun Zhang,Jianqiang Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Remote sensing image, complex spatial structures, Remote sensing, sensing image scene, image scene classification

备注

点击查看摘要

Abstract:Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. Existing approaches see CNNs excel at modeling local textures, while Transformers excel at capturing global context. However, efficiently integrating them remains a bottleneck due to the high computational cost of Transformers. To tackle this, we propose AFM-Net, a novel Advanced Hierarchical Fusing framework that achieves effective local and global co-representation through two pathways: a CNN branch for extracting hierarchical visual priors, and a Mamba branch for efficient global sequence modeling. The core innovation of AFM-Net lies in its Hierarchical Fusion Mechanism, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. These fused features are then adaptively routed through a Mixture-of-Experts classifier module, which dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that AFM-Net obtains 93.72, 95.54, and 96.92 percent accuracy, surpassing state-of-the-art methods with balanced performance and efficiency. Code is available at this https URL.

69. 【2510.27148】HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition

链接https://arxiv.org/abs/2510.27148

作者:Jiacheng Hong,Kunzhen Wu,Mingrui Yu,Yichao Gu,Shengze Xue,Shuangjiu Xiao,Deli Dong

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:holds significant potential, Three-dimensional scene generation, generation holds significant, Three-dimensional scene, potential in gaming

备注

点击查看摘要

Abstract:Three-dimensional scene generation holds significant potential in gaming, film, and virtual reality. However, most existing methods adopt a single-step generation process, making it difficult to balance scene complexity with minimal user input. Inspired by the human cognitive process in scene modeling, which progresses from global to local, focuses on key elements, and completes the scene through semantic association, we propose HiGS, a hierarchical generative framework for multi-step associative semantic spatial composition. HiGS enables users to iteratively expand scenes by selecting key semantic objects, offering fine-grained control over regions of interest while the model completes peripheral areas automatically. To support structured and coherent generation, we introduce the Progressive Hierarchical Spatial-Semantic Graph (PHiSSG), which dynamically organizes spatial relationships and semantic dependencies across the evolving scene structure. PHiSSG ensures spatial and geometric consistency throughout the generation process by maintaining a one-to-one mapping between graph nodes and generated objects and supporting recursive layout optimization. Experiments demonstrate that HiGS outperforms single-stage methods in layout plausibility, style consistency, and user preference, offering a controllable and extensible paradigm for efficient 3D scene construction.

70. 【2510.27139】Improving Cross-view Object Geo-localization: A Dual Attention Approach with Cross-view Interaction and Multi-Scale Spatial Features

链接https://arxiv.org/abs/2510.27139

作者:Xingtao Ling Yingying Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recently gained attention, gained attention due, spatial relationship feature, relationship feature maps, Spatial Attention Module

备注

点击查看摘要

Abstract:Cross-view object geo-localization has recently gained attention due to potential applications. Existing methods aim to capture spatial dependencies of query objects between different views through attention mechanisms to obtain spatial relationship feature maps, which are then used to predict object locations. Although promising, these approaches fail to effectively transfer information between views and do not further refine the spatial relationship feature maps. This results in the model erroneously focusing on irrelevant edge noise, thereby affecting localization performance. To address these limitations, we introduce a Cross-view and Cross-attention Module (CVCAM), which performs multiple iterations of interaction between the two views, enabling continuous exchange and learning of contextual information about the query object from both perspectives. This facilitates a deeper understanding of cross-view relationships while suppressing the edge noise unrelated to the query object. Furthermore, we integrate a Multi-head Spatial Attention Module (MHSAM), which employs convolutional kernels of various sizes to extract multi-scale spatial features from the feature maps containing implicit correspondences, further enhancing the feature representation of the query object. Additionally, given the scarcity of datasets for cross-view object geo-localization, we created a new dataset called G2D for the "Ground-to-Drone" localization task, enriching existing datasets and filling the gap in "Ground-to-Drone" localization task. Extensive experiments on the CVOGL and G2D datasets demonstrate that our proposed method achieves high localization accuracy, surpassing the current state-of-the-art.

71. 【2510.27135】E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources

链接https://arxiv.org/abs/2510.27135

作者:Tong Shen,Jingai Yu,Dong Zhou,Dong Li,Emad Barsoum

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:generating high-quality images, shown strong capabilities, multimodal diffusion model, Multimodal Diffusion, Efficient Multimodal Diffusion

备注

点击查看摘要

Abstract:Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or suffer from heavy structure with high latency. To this end, we propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis requiring low training resources. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained with only 25M public data in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and easily reaches to 0.72 with some post-training techniques such as GRPO. Our design philosophy centers on token reduction as the computational cost scales significantly with the token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further compression of tokens. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. Our code is available at this https URL and we hope E-MMDiT serves as a strong and practical baseline for future research and contributes to democratization of generative AI models.

72. 【2510.27133】WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond

链接https://arxiv.org/abs/2510.27133

作者:Zhicong Sun,Jacqueline Lo,Jinxing Hu

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Gaussian splatting, subsequent variants, variants have led, led to remarkable, remarkable progress

备注: This paper has been accepted by MMM 2026

点击查看摘要

Abstract:3D Gaussian splatting (3DGS) and its subsequent variants have led to remarkable progress in simultaneous localization and mapping (SLAM). While most recent 3DGS-based SLAM works focus on small-scale indoor scenes, developing 3DGS-based SLAM methods for large-scale forest scenes holds great potential for many real-world applications, especially for wildfire emergency response and forest management. However, this line of research is impeded by the absence of a comprehensive and high-quality dataset, and collecting such a dataset over real-world scenes is costly and technically infeasible. To this end, we have built a large-scale, comprehensive, and high-quality synthetic dataset for SLAM in wildfire and forest environments. Leveraging the Unreal Engine 5 Electric Dreams Environment Sample Project, we developed a pipeline to easily collect aerial and ground views, including ground-truth camera poses and a range of additional data modalities from unmanned aerial vehicle. Our pipeline also provides flexible controls on environmental factors such as light, weather, and types and conditions of wildfire, supporting the need for various tasks covering forest mapping, wildfire emergency response, and beyond. The resulting pilot dataset, WildfireX-SLAM, contains 5.5k low-altitude RGB-D aerial images from a large-scale forest map with a total size of 16 km2. On top of WildfireX-SLAM, a thorough benchmark is also conducted, which not only reveals the unique challenges of 3DGS-based SLAM in the forest but also highlights potential improvements for future works. The dataset and code will be publicly available. Project page: this https URL.

73. 【2510.27128】ZEBRA: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding

链接https://arxiv.org/abs/2510.27128

作者:Haonan Wang,Jingyu Lu,Hongrui Li,Xiaomeng Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Recent advances, computer vision, promising bridge, bridge between neuroscience, neuroscience and computer

备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Recent advances in neural decoding have enabled the reconstruction of visual experiences from brain activity, positioning fMRI-to-image reconstruction as a promising bridge between neuroscience and computer vision. However, current methods predominantly rely on subject-specific models or require subject-specific fine-tuning, limiting their scalability and real-world applicability. In this work, we introduce ZEBRA, the first zero-shot brain visual decoding framework that eliminates the need for subject-specific adaptation. ZEBRA is built on the key insight that fMRI representations can be decomposed into subject-related and semantic-related components. By leveraging adversarial training, our method explicitly disentangles these components to isolate subject-invariant, semantic-specific representations. This disentanglement allows ZEBRA to generalize to unseen subjects without any additional fMRI data or retraining. Extensive experiments show that ZEBRA significantly outperforms zero-shot baselines and achieves performance comparable to fully finetuned models on several metrics. Our work represents a scalable and practical step toward universal neural decoding. Code and model weights are available at: this https URL.

74. 【2510.27088】Hierarchical Transformers for Unsupervised 3D Shape Abstraction

链接https://arxiv.org/abs/2510.27088

作者:Aditya Vora,Lily Goli,Andrea Tagliasacchi,Hao Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:neural field representation, hierarchical neural field, neural field, field representation, introduce HiT

备注

点击查看摘要

Abstract:We introduce HiT, a novel hierarchical neural field representation for 3D shapes that learns general hierarchies in a coarse-to-fine manner across different shape categories in an unsupervised setting. Our key contribution is a hierarchical transformer (HiT), where each level learns parent-child relationships of the tree hierarchy using a compressed codebook. This codebook enables the network to automatically identify common substructures across potentially diverse shape categories. Unlike previous works that constrain the task to a fixed hierarchical structure (e.g., binary), we impose no such restriction, except for limiting the total number of nodes at each tree level. This flexibility allows our method to infer the hierarchical structure directly from data, over multiple shape categories, and representing more general and complex hierarchies than prior approaches. When trained at scale with a reconstruction loss, our model captures meaningful containment relationships between parent and child nodes. We demonstrate its effectiveness through an unsupervised shape segmentation task over all 55 ShapeNet categories, where our method successfully segments shapes into multiple levels of granularity.

75. 【2510.27047】AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception

链接https://arxiv.org/abs/2510.27047

作者:Mario Camarena,Het Patel,Fatemeh Nazari,Evangelos Papalexakis,Mohamadhossein Noruzoliaee,Jia Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Autonomous Driving Segment, Autonomous Driving, Driving Segment, fine-tuned vision foundation, presents the Autonomous

备注: Submitted to IEEE Transactions on Intelligent Transportation Systems (IEEE T-ITS)

点击查看摘要

Abstract:This paper presents the Autonomous Driving Segment Anything Model (AD-SAM), a fine-tuned vision foundation model for semantic segmentation in autonomous driving (AD). AD-SAM extends the Segment Anything Model (SAM) with a dual-encoder and deformable decoder tailored to spatial and geometric complexity of road scenes. The dual-encoder produces multi-scale fused representations by combining global semantic context from SAM's pretrained Vision Transformer (ViT-H) with local spatial detail from a trainable convolutional deep learning backbone (i.e., ResNet-50). A deformable fusion module aligns heterogeneous features across scales and object geometries. The decoder performs progressive multi-stage refinement using deformable attention. Training is guided by a hybrid loss that integrates Focal, Dice, Lovasz-Softmax, and Surface losses, improving semantic class balance, boundary precision, and optimization stability. Experiments on the Cityscapes and Berkeley DeepDrive 100K (BDD100K) benchmarks show that AD-SAM surpasses SAM, Generalized SAM (G-SAM), and a deep learning baseline (DeepLabV3) in segmentation accuracy. It achieves 68.1 mean Intersection over Union (mIoU) on Cityscapes and 59.5 mIoU on BDD100K, outperforming SAM, G-SAM, and DeepLabV3 by margins of up to +22.9 and +19.2 mIoU in structured and diverse road scenes, respectively. AD-SAM demonstrates strong cross-domain generalization with a 0.87 retention score (vs. 0.76 for SAM), and faster, more stable learning dynamics, converging within 30-40 epochs, enjoying double the learning speed of benchmark models. It maintains 0.607 mIoU with only 1000 samples, suggesting data efficiency critical for reducing annotation costs. These results confirm that targeted architectural and optimization enhancements to foundation models enable reliable and scalable AD perception.

76. 【2510.27033】A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics

链接https://arxiv.org/abs/2510.27033

作者:Simindokht Jahangard,Mehrzad Mohammadi,Abhinav Dhall,Hamid Rezatofighi

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:requires understanding object, challenging cognitive task, understanding object relationships, Visual reasoning, challenging cognitive

备注

点击查看摘要

Abstract:Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in robotics domain. Existing vision_language models (VLMs) excel at perception tasks but struggle with fine-grained spatial reasoning due to their implicit, correlation-driven reasoning and reliance solely on images. We propose a novel neuro_symbolic framework that integrates both panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships. Our framework consists of a perception module for detecting entities and extracting attributes, and a reasoning module that constructs a structured scene graph to support precise, interpretable queries. Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior performance and reliability in crowded, human_built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.

77. 【2510.27028】VitalLens 2.0: High-Fidelity rPPG for Heart Rate Variability Estimation from Face Video

链接https://arxiv.org/abs/2510.27028

作者:Philipp V. Rouast

类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:estimating physiological signals, deep learning model, Heart Rate Variability, report introduces VitalLens, face video

备注: Technical Report. 8 pages, 5 figures. Introduces the VitalLens 2.0 model for rPPG and Heart Rate Variability (HRV) estimation. Project website: [this https URL](https://rouast.com/api)

点击查看摘要

Abstract:This report introduces VitalLens 2.0, a new deep learning model for estimating physiological signals from face video. This new model demonstrates a significant leap in accuracy for remote photoplethysmography (rPPG), enabling the robust estimation of not only heart rate (HR) and respiratory rate (RR) but also Heart Rate Variability (HRV) metrics. This advance is achieved through a combination of a new model architecture and a substantial increase in the size and diversity of our training data, now totaling 1,413 unique individuals. We evaluate VitalLens 2.0 on a new, combined test set of 422 unique individuals from four public and private datasets. When averaging results by individual, VitalLens 2.0 achieves a Mean Absolute Error (MAE) of 1.57 bpm for HR, 1.08 bpm for RR, 10.18 ms for HRV-SDNN, and 16.45 ms for HRV-RMSSD. These results represent a new state-of-the-art, significantly outperforming previous methods. This model is now available to developers via the VitalLens API at this https URL.

78. 【2510.27020】Incremental Human-Object Interaction Detection with Invariant Relation Representation Learning

链接https://arxiv.org/abs/2510.27020

作者:Yana Wei,Zeen Chi,Chongyu Wang,Yu Wu,Shipeng Yan,Yongfei Liu,Xuming He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:challenging conventional closed-world, conventional closed-world HOI, HOI detection models, closed-world HOI detection, evolve continuously

备注

点击查看摘要

Abstract:In open-world environments, human-object interactions (HOIs) evolve continuously, challenging conventional closed-world HOI detection models. Inspired by humans' ability to progressively acquire knowledge, we explore incremental HOI detection (IHOID) to develop agents capable of discerning human-object relations in such dynamic environments. This setup confronts not only the common issue of catastrophic forgetting in incremental learning but also distinct challenges posed by interaction drift and detecting zero-shot HOI combinations with sequentially arriving data. Therefore, we propose a novel exemplar-free incremental relation distillation (IRD) framework. IRD decouples the learning of objects and relations, and introduces two unique distillation losses for learning invariant relation features across different HOI combinations that share the same relation. Extensive experiments on HICO-DET and V-COCO datasets demonstrate the superiority of our method over state-of-the-art baselines in mitigating forgetting, strengthening robustness against interaction drift, and generalization on zero-shot HOIs. Code is available at \href{this https URL}{this HTTP URL}

79. 【2510.26996】MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation

链接https://arxiv.org/abs/2510.26996

作者:Arghavan Rezvani,Xiangyi Yan,Anthony T. Wu,Kun Han,Pooya Khosravi,Xiaohui Xie

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Language Medical Experts, Large Language Models, Visual Language Medical, Large Language, Visual Language

备注

点击查看摘要

Abstract:In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), for medical vision-language tasks. The architecture enables dynamic expert selection by effectively utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings. This work explores a novel integration of vision-language models for this domain. Utilizing an assembly of 10 datasets, encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, benefiting from the established efficacy of MoE in boosting model performance by incorporating textual information. Demonstrating competitive precision across multiple datasets, MoME explores a novel architecture for achieving robust results in medical image analysis.

80. 【2510.26978】Semantic Frame Aggregation-based Transformer for Live Video Comment Generation

链接https://arxiv.org/abs/2510.26978

作者:Anam Fatima,Yi Yu,Janak Kapuriya,Julien Lalanne,Jainendra Shukla

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:enhancing viewer engagement, video, surged in popularity, popularity on platforms, engagement through dynamic

备注

点击查看摘要

Abstract:Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook an important aspect of prioritizing video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP's visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to ongoing viewer conversation. It employs an efficient weighted sum of frames technique to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder with a cross-attention mechanism that attends to each modality ensures that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.

81. 【2510.26967】Using Salient Object Detection to Identify Manipulative Cookie Banners that Circumvent GDPR

链接https://arxiv.org/abs/2510.26967

作者:Riley Grossman,Michael Smith,Cristian Borcea,Yi Chen

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:General Data Protection, personal data sharing, permits personal data, draw users' attention, General Data

备注: Accepted to International AAAI Conference on Web and Social Media 2026 (ICWSM'26)

点击查看摘要

Abstract:The main goal of this paper is to study how often cookie banners that comply with the General Data Protection Regulation (GDPR) contain aesthetic manipulation, a design tactic to draw users' attention to the button that permits personal data sharing. As a byproduct of this goal, we also evaluate how frequently the banners comply with GDPR and the recommendations of national data protection authorities regarding banner designs. We visited 2,579 websites and identified the type of cookie banner implemented. Although 45% of the relevant websites have fully compliant banners, we found aesthetic manipulation on 38% of the compliant banners. Unlike prior studies of aesthetic manipulation, we use a computer vision model for salient object detection to measure how salient (i.e., attention-drawing) each banner element is. This enables the discovery of new types of aesthetic manipulation (e.g., button placement), and leads us to conclude that aesthetic manipulation is more common than previously reported (38% vs 27% of banners). To study the effects of user and/or website location on cookie banner design, we include websites within the European Union (EU), where privacy regulation enforcement is more stringent, and websites outside the EU. We visited websites from IP addresses in the EU and from IP addresses in the United States (US). We find that 13.9% of EU websites change their banner design when the user is from the US, and EU websites are roughly 48.3% more likely to use aesthetic manipulation than non-EU websites, highlighting their innovative responses to privacy regulation.

82. 【2510.26961】SYNAPSE-Net: A Unified Framework with Lesion-Aware Hierarchical Gating for Robust Segmentation of Heterogeneous Brain Lesions

链接https://arxiv.org/abs/2510.26961

作者:Md. Mehedi Hassan,Shafqat Alam,Shahriar Ahmed Seam,Maruf Ahmed

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:multi-modal MRI remains, multi-modal MRI, MRI remains, heterogeneous brain lesions, lesions from multi-modal

备注: 17 pages, 10 figures, 8 tables, submitted to "Medical Image Analysis" journal

点击查看摘要

Abstract:Automated segmentation of heterogeneous brain lesions from multi-modal MRI remains a critical challenge in clinical neuroimaging. Current deep learning models are typically specialized `point solutions' that lack generalization and high performance variance, limiting their clinical reliability. To address these gaps, we propose the Unified Multi-Stream SYNAPSE-Net, an adaptive framework designed for both generalization and robustness. The framework is built on a novel hybrid architecture integrating multi-stream CNN encoders, a Swin Transformer bottleneck for global context, a dynamic cross-modal attention fusion (CMAF) mechanism, and a hierarchical gated decoder for high-fidelity mask reconstruction. The architecture is trained with a variance reduction strategy that combines pathology specific data augmentation and difficulty-aware sampling method. The model was evaluated on three different challenging public datasets: the MICCAI 2017 WMH Challenge, the ISLES 2022 Challenge, and the BraTS 2020 Challenge. Our framework attained a state-of-the-art DSC value of 0.831 with the HD95 value of 3.03 in the WMH dataset. For ISLES 2022, it achieved the best boundary accuracy with a statistically significant difference (HD95 value of 9.69). For BraTS 2020, it reached the highest DSC value for the tumor core region (0.8651). These experimental findings suggest that our unified adaptive framework achieves state-of-the-art performance across multiple brain pathologies, providing a robust and clinically feasible solution for automated segmentation. The source code and the pre-trained models are available at this https URL.

83. 【2510.26923】Scale-Aware Curriculum Learning for Ddata-Efficient Lung Nodule Detection with YOLOv11

链接https://arxiv.org/abs/2510.26923

作者:Yi Luo,Yike Guo,Hamed Hooshangnejad,Kai Ding

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:approaches face challenges, existing deep learning, deep learning approaches, learning approaches face, lung cancer diagnosis

备注: 5 pages, 2 figures

点击查看摘要

Abstract:Lung nodule detection in chest CT is crucial for early lung cancer diagnosis, yet existing deep learning approaches face challenges when deployed in clinical settings with limited annotated data. While curriculum learning has shown promise in improving model training, traditional static curriculum strategies fail in data-scarce scenarios. We propose Scale Adaptive Curriculum Learning (SACL), a novel training strategy that dynamically adjusts curriculum design based on available data scale. SACL introduces three key mechanisms:(1) adaptive epoch scheduling, (2) hard sample injection, and (3) scale-aware optimization. We evaluate SACL on the LUNA25 dataset using YOLOv11 as the base detector. Experimental results demonstrate that while SACL achieves comparable performance to static curriculum learning on the full dataset in mAP50, it shows significant advantages under data-limited conditions with 4.6%, 3.5%, and 2.0% improvements over baseline at 10%, 20%, and 50% of training data respectively. By enabling robust training across varying data scales without architectural modifications, SACL provides a practical solution for healthcare institutions to develop effective lung nodule detection systems despite limited annotation resources.

84. 【2510.26921】DC4GS: Directional Consistency-Driven Adaptive Density Control for 3D Gaussian Splatting

链接https://arxiv.org/abs/2510.26921

作者:Moonsoo Jeong,Dongbeen Kim,Minseong Kim,Sungkil Lee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Adaptive Density Control, driven Adaptive Density, Gaussian Splatting, Directional Consistency, Density Control

备注: Accepted to NeurIPS 2025 / Project page: [this https URL](https://github.com/cgskku/dc4gs)

点击查看摘要

Abstract:We present a Directional Consistency (DC)-driven Adaptive Density Control (ADC) for 3D Gaussian Splatting (DC4GS). Whereas the conventional ADC bases its primitive splitting on the magnitudes of positional gradients, we further incorporate the DC of the gradients into ADC, and realize it through the angular coherence of the gradients. Our DC better captures local structural complexities in ADC, avoiding redundant splitting. When splitting is required, we again utilize the DC to define optimal split positions so that sub-primitives best align with the local structures than the conventional random placement. As a consequence, our DC4GS greatly reduces the number of primitives (up to 30% in our experiments) than the existing ADC, and also enhances reconstruction fidelity greatly.

85. 【2510.26903】PF-DAformer: Proximal Femur Segmentation via Domain Adaptive Transformer for Dual-Center QCT

链接https://arxiv.org/abs/2510.26903

作者:Rochak Dhakal,Chen Zhao,Zixin Shi,Joyce H. Keyak,Tadashi S. Kaneko,Kuan-Jui Su,Hui Shen,Hong-Wen Deng,Weihua Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

关键词:assessing bone strength, bone density distribution, Quantitative computed tomography, assessing bone, bone strength

备注: 22 Pages, 5 Tables, 10 Figures. The combination of GRL and MMD achieved the most balanced performance, reducing contour deviations and enhancing surface smoothness

点击查看摘要

Abstract:Quantitative computed tomography (QCT) plays a crucial role in assessing bone strength and fracture risk by enabling volumetric analysis of bone density distribution in the proximal femur. However, deploying automated segmentation models in practice remains difficult because deep networks trained on one dataset often fail when applied to another. This failure stems from domain shift, where scanners, reconstruction settings, and patient demographics vary across institutions, leading to unstable predictions and unreliable quantitative metrics. Overcoming this barrier is essential for multi-center osteoporosis research and for ensuring that radiomics and structural finite element analysis results remain reproducible across sites. In this work, we developed a domain-adaptive transformer segmentation framework tailored for multi-institutional QCT. Our model is trained and validated on one of the largest hip fracture related research cohorts to date, comprising 1,024 QCT images scans from Tulane University and 384 scans from Rochester, Minnesota for proximal femur segmentation. To address domain shift, we integrate two complementary strategies within a 3D TransUNet backbone: adversarial alignment via Gradient Reversal Layer (GRL), which discourages the network from encoding site-specific cues, and statistical alignment via Maximum Mean Discrepancy (MMD), which explicitly reduces distributional mismatches between institutions. This dual mechanism balances invariance and fine-grained alignment, enabling scanner-agnostic feature learning while preserving anatomical detail.

86. 【2510.26865】Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

链接https://arxiv.org/abs/2510.26865

作者:Fenfen Lin,Yesheng Liu,Haiyu Xu,Chen Yue,Zheqi He,Mingxuan Zhao,Miguel Hu Chen,Jiakang Liu,JG Yao,Xi Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:remains surprisingly challenging, Reading measurement instruments, domain expertise, instruments is effortless, effortless for humans

备注: Project page: [this https URL](https://flageval-baai.github.io/MeasureBenchPage/)

点击查看摘要

Abstract:Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs) as we find in preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle measurement reading in general. A consistent failure mode is indicator localization: models can read digits or labels but misidentify the key positions of pointers or alignments, leading to big numeric errors despite plausible textual reasoning. We have also conducted preliminary experiments with reinforcement learning over synthetic data, and find encouraging results on in-domain synthetic subset but less promising for real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource can help future advances on visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.

87. 【2510.26825】Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling

链接https://arxiv.org/abs/2510.26825

作者:Jiarong Du,Zhan Jin,Peijun Yang,Juan Liu,Zhuo Li,Xin Liu,Ming Li

类目:ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

关键词:visual auxiliary information, target speaker speech, Audio-visual speech enhancement, mixed audio, complex acoustic environments

备注

点击查看摘要

Abstract:Audio-visual speech enhancement (AVSE) is a task that uses visual auxiliary information to extract a target speaker's speech from mixed audio. In real-world scenarios, there often exist complex acoustic environments, accompanied by various interfering sounds and reverberation. Most previous methods struggle to cope with such complex conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated the performance of our system in AVSEC-4: we achieved excellent results in the three objective metrics on the competition leaderboard, and ultimately secured first place in the human subjective listening test.

88. 【2505.01313】A Neural Architecture Search Method using Auxiliary Evaluation Metric based on ResNet Architecture

链接https://arxiv.org/abs/2505.01313

作者:Shang Wang,Huanrong Tang,Jianquan Ouyang

类目:Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:fully connected layers, objectives including parameters, search objectives including, neural architecture search, parameters for convolution

备注: GECCO 2023

点击查看摘要

Abstract:This paper proposes a neural architecture search space using ResNet as a framework, with search objectives including parameters for convolution, pooling, fully connected layers, and connectivity of the residual network. In addition to recognition accuracy, this paper uses the loss value on the validation set as a secondary objective for optimization. The experimental results demonstrate that the search space of this paper together with the optimisation approach can find competitive network architectures on the MNIST, Fashion-MNIST and CIFAR100 datasets.

89. 【2510.27679】Dark-Field X-Ray Imaging Significantly Improves Deep-Learning based Detection of Synthetic Early-Stage Lung Tumors in Preclinical Models

链接https://arxiv.org/abs/2510.27679

作者:Joyoni Dey,Hunter C. Meyer,Murtuza S. Taqi

类目:Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Optics (physics.optics)

关键词:accessibility remain limited, Low-dose computed tomography, Lung Screening Trial, National Lung Screening, lung cancer screening

备注

点击查看摘要

Abstract:Low-dose computed tomography (LDCT) is the current standard for lung cancer screening, yet its adoption and accessibility remain limited. Many regions lack LDCT infrastructure, and even among those screened, early-stage cancer detection often yield false positives, as shown in the National Lung Screening Trial (NLST) with a sensitivity of 93.8 percent and a false-positive rate of 26.6 percent. We aim to investigate whether X-ray dark-field imaging (DFI) radiograph, a technique sensitive to small-angle scatter from alveolar microstructure and less susceptible to organ shadowing, can significantly improve early-stage lung tumor detection when coupled with deep-learning segmentation. Using paired attenuation (ATTN) and DFI radiograph images of euthanized mouse lungs, we generated realistic synthetic tumors with irregular boundaries and intensity profiles consistent with physical lung contrast. A U-Net segmentation network was trained on small patches using either ATTN, DFI, or a combination of ATTN and DFI channels. Results show that the DFI-only model achieved a true-positive detection rate of 83.7 percent, compared with 51 percent for ATTN-only, while maintaining comparable specificity (90.5 versus 92.9 percent). The combined ATTN and DFI input achieved 79.6 percent sensitivity and 97.6 percent specificity. In conclusion, DFI substantially improves early-tumor detectability in comparison to standard attenuation radiography and shows potential as an accessible, low-cost, low-dose alternative for pre-clinical or limited-resource screening where LDCT is unavailable.

90. 【2510.27307】A fragile zero-watermarking method based on dual quaternion matrix decomposition

链接https://arxiv.org/abs/2510.27307

作者:Mingcui Zhang,Zhigang Jia

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)

关键词:remote consultation, Medical images play, assisting diagnosis, academic research, Medical images

备注: 18 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Medical images play a crucial role in assisting diagnosis, remote consultation, and academic research. However, during the transmission and sharing process, they face serious risks of copyright ownership and content tampering. Therefore, protecting medical images is of great importance. As an effective means of image copyright protection, zero-watermarking technology focuses on constructing watermarks without modifying the original carrier by extracting its stable features, which provides an ideal approach for protecting medical images. This paper aims to propose a fragile zero-watermarking model based on dual quaternion matrix decomposition, which utilizes the operational relationship between the standard part and the dual part of dual quaternions to correlate the original carrier image with the watermark image, and generates zero-watermarking information based on the characteristics of dual quaternion matrix decomposition, ultimately achieving copyright protection and content tampering detection for medical images.

91. 【2510.26907】Generative diffusion modeling protocols for improving the Kikuchi pattern indexing in electron back-scatter diffraction

链接https://arxiv.org/abs/2510.26907

作者:Meghraj Prajapat,Alankar Alankar

类目:Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)

关键词:Electron back-scatter diffraction, Electron back-scatter, Hough transform, interpret diffraction patterns, extract crystallographic orientation

备注

点击查看摘要

Abstract:Electron back-scatter diffraction (EBSD) has traditionally relied upon methods such as the Hough transform and dictionary Indexing to interpret diffraction patterns and extract crystallographic orientation. However, these methods encounter significant limitations, particularly when operating at high scanning speeds, where the exposure time per pattern is decreased beyond the operating sensitivity of CCD camera. Hence the signal to noise ratio decreases for the observed pattern which makes the pattern noisy, leading to reduced indexing accuracy. This research work aims to develop generative machine learning models for the post-processing or on-the-fly processing of Kikuchi patterns which are capable of restoring noisy EBSD patterns obtained at high scan speeds. These restored patterns can be used for the determination of crystal orientations to provide reliable indexing results. We compare the performance of such generative models in enhancing the quality of patterns captured at short exposure times (high scan speeds). An interesting observation is that the methodology is not data-hungry as typical machine learning methods.

92. 【2510.26819】See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement

链接https://arxiv.org/abs/2510.26819

作者:Jinting Wang,Jun Wang,Hei Victor Cheng,Li Liu

类目:Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

关键词:addressing key challenges, directly extracts information, Unlike existing methods, addressing key, Unlike existing

备注: 16 pages,15 figures, accepted by TASLP

点击查看摘要

Abstract:Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking face. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.