本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新552篇论文,其中:

  • 自然语言处理89
  • 信息检索14
  • 计算机视觉102

自然语言处理

1. 【2510.12784】SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

链接https://arxiv.org/abs/2510.12784

作者:Weiyang Jin,Yuwei Niu,Jiaqi Liao,Chengqi Duan,Aoxue Li,Shenghua Gao,Xihui Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Unified Multimodal Models, Unified Multimodal, made in Unified, integrate vision-language generation, Multimodal Models

备注: 20 pages, 8 figures, webpage can be seen in [this https URL](https://waynejin0918.github.io/srum_web/)

点击查看摘要

Abstract:Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a significant gap exists where a model's strong visual understanding often fails to transfer to its visual generation. A model might correctly understand an image based on user instructions, yet be unable to generate a faithful image from text prompts. This phenomenon directly raises a compelling question: Can a model achieve self-improvement by using its understanding module to reward its generation module? To bridge this gap and achieve self-improvement, we introduce SRUM, a self-rewarding post-training framework that can be directly applied to existing UMMs of various designs. SRUM creates a feedback loop where the model's own understanding module acts as an internal ``evaluator'', providing corrective signals to improve its generation module, without requiring additional human-labeled data. To ensure this feedback is comprehensive, we designed a global-local dual reward system. To tackle the inherent structural complexity of images, this system offers multi-scale guidance: a \textbf{global reward} ensures the correctness of the overall visual semantics and layout, while a \textbf{local reward} refines fine-grained, object-level fidelity. SRUM leads to powerful capabilities and shows strong generalization, boosting performance on T2I-CompBench from 82.18 to \textbf{88.37} and on T2I-ReasonBench from 43.82 to \textbf{46.75}. Overall, our work establishes a powerful new paradigm for enabling a UMMs' understanding module to guide and enhance its own generation via self-rewarding.

2. 【2510.12781】Cost Analysis of Human-corrected Transcription for Predominately Oral Languages

链接https://arxiv.org/abs/2510.12781

作者:Yacouba Diarra,Nouhoum Souleymane Coulibaly,Michael Leventhal

类目:Computation and Language (cs.CL)

关键词:Creating speech datasets, poorly understood challenge, Predominately Oral Languages, Creating speech, literacy Predominately Oral

备注: 6 pages, 1 figure

点击查看摘要

Abstract:Creating speech datasets for low-resource languages is a critical yet poorly understood challenge, particularly regarding the actual cost in human labor. This paper investigates the time and complexity required to produce high-quality annotated speech data for a subset of low-resource languages, low literacy Predominately Oral Languages, focusing on Bambara, a Manding language of Mali. Through a one-month field study involving ten transcribers with native proficiency, we analyze the correction of ASR-generated transcriptions of 53 hours of Bambara voice data. We report that it takes, on average, 30 hours of human labor to accurately transcribe one hour of speech data under laboratory conditions and 36 hours under field conditions. The study provides a baseline and practical insights for a large class of languages with comparable profiles undertaking the creation of NLP resources.

3. 【2510.12780】Content Anonymization for Privacy in Long-form Audio

链接https://arxiv.org/abs/2510.12780

作者:Cristina Aggazzotti,Ashi Garg,Zexin Cai,Nicholas Andrews

类目:ound (cs.SD); Computation and Language (cs.CL)

关键词:speaker acoustic identity, VoicePrivacy Challenge, Voice anonymization techniques, identity in short, found to successfully

备注

点击查看摘要

Abstract:Voice anonymization techniques have been found to successfully obscure a speaker's acoustic identity in short, isolated utterances in benchmarks such as the VoicePrivacy Challenge. In practice, however, utterances seldom occur in isolation: long-form audio is commonplace in domains such as interviews, phone calls, and meetings. In these cases, many utterances from the same speaker are available, which pose a significantly greater privacy risk: given multiple utterances from the same speaker, an attacker could exploit an individual's vocabulary, syntax, and turns of phrase to re-identify them, even when their voice is completely disguised. To address this risk, we propose new content anonymization approaches. Our approach performs a contextual rewriting of the transcripts in an ASR-TTS pipeline to eliminate speaker-specific style while preserving meaning. We present results in a long-form telephone conversation setting demonstrating the effectiveness of a content-based attack on voice-anonymized speech. Then we show how the proposed content-based anonymization methods can mitigate this risk while preserving speech utility. Overall, we find that paraphrasing is an effective defense against content-based attacks and recommend that stakeholders adopt this step to ensure anonymity in long-form audio.

4. 【2510.12773】Dr.LLM: Dynamic Layer Routing in LLMs

链接https://arxiv.org/abs/2510.12773

作者:Ahmed Heakl,Martin Gubri,Salman Khan,Sangdoo Yun,Seong Joon Oh

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, causing wasted computation, Language Models, process every token

备注: 17 pages, Under submission

点击查看摘要

Abstract:Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce this http URL, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design, windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers, ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), this http URL improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, this http URL shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.

5. 【2510.12766】Language Models Model Language

链接https://arxiv.org/abs/2510.12766

作者:Łukasz Borchmann

类目:Computation and Language (cs.CL)

关键词:Saussure and Chomsky, heavily influenced, speculative and unproductive, Linguistic commentary, Chomsky

备注

点击查看摘要

Abstract:Linguistic commentary on LLMs, heavily influenced by the theoretical frameworks of de Saussure and Chomsky, is often speculative and unproductive. Critics challenge whether LLMs can legitimately model language, citing the need for "deep structure" or "grounding" to achieve an idealized linguistic "competence." We argue for a radical shift in perspective towards the empiricist principles of Witold Mańczak, a prominent general and historical linguist. He defines language not as a "system of signs" or a "computational system of the brain" but as the totality of all that is said and written. Above all, he identifies frequency of use of particular language elements as language's primary governing principle. Using his framework, we challenge prior critiques of LLMs and provide a constructive guide for designing, evaluating, and interpreting language models.

6. 【2510.12740】Hey, wait a minute: on at-issue sensitivity in Language Models

链接https://arxiv.org/abs/2510.12740

作者:Sanghee J. Kim,Kanishka Misra

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:metrics remain limited, scalable quantitative metrics, quantitative metrics remain, naturalness' vary, remain limited

备注: 10 pages, 5 figures, 3 tables. See [this https URL](https://github.com/sangheek16/hey-wait-a-minute) for code and data

点击查看摘要

Abstract:Evaluating the naturalness of dialogue in language models (LMs) is not trivial: notions of 'naturalness' vary, and scalable quantitative metrics remain limited. This study leverages the linguistic notion of 'at-issueness' to assess dialogue naturalness and introduces a new method: Divide, Generate, Recombine, and Compare (DGRC). DGRC (i) divides a dialogue as a prompt, (ii) generates continuations for subparts using LMs, (iii) recombines the dialogue and continuations, and (iv) compares the likelihoods of the recombined sequences. This approach mitigates bias in linguistic analyses of LMs and enables systematic testing of discourse-sensitive behavior. Applying DGRC, we find that LMs prefer to continue dialogue on at-issue content, with this effect enhanced in instruct-tuned models. They also reduce their at-issue preference when relevant cues (e.g., "Hey, wait a minute") are present. Although instruct-tuning does not further amplify this modulation, the pattern reflects a hallmark of successful dialogue dynamics.

7. 【2510.12722】Which Word Orders Facilitate Length Generalization in LMs? An Investigation with GCG-Based Artificial Languages

链接https://arxiv.org/abs/2510.12722

作者:Nadine El-Naggar,Tatsuki Kuribayashi,Ted Briscoe

类目:Computation and Language (cs.CL)

关键词:White and Cotterell, frequent grammatical properties, favor typologically frequent, typologically frequent grammatical, properties over rare

备注: EMNLP 2025 Main Conference

点击查看摘要

Abstract:Whether language models (LMs) have inductive biases that favor typologically frequent grammatical properties over rare, implausible ones has been investigated, typically using artificial languages (ALs) (White and Cotterell, 2021; Kuribayashi et al., 2024). In this paper, we extend these works from two perspectives. First, we extend their context-free AL formalization by adopting Generalized Categorial Grammar (GCG) (Wood, 2014), which allows ALs to cover attested but previously overlooked constructions, such as unbounded dependency and mildly context-sensitive structures. Second, our evaluation focuses more on the generalization ability of LMs to process unseen longer test sentences. Thus, our ALs better capture features of natural languages and our experimental paradigm leads to clearer conclusions -- typologically plausible word orders tend to be easier for LMs to productively generalize.

8. 【2510.12720】Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

链接https://arxiv.org/abs/2510.12720

作者:Ziyang Ma,Ruiyang Xu,Zhenghao Xing,Yunfei Chu,Yuxuan Wang,Jinzheng He,Jin Xu,Pheng-Ann Heng,Kai Yu,Junyang Lin,Eng Siong Chng,Xie Chen

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)

关键词:advancing human-AI interaction, Omni Language Models, human-AI interaction, information is critical, critical for advancing

备注: [this https URL](https://github.com/ddlBoJack/Omni-Captioner)

点击查看摘要

Abstract:Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains limited explored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 testset. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.

9. 【2510.12699】Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations

链接https://arxiv.org/abs/2510.12699

作者:Sunny Yu,Ahmad Jabbar,Robert Hawkins,Dan Jurafsky,Myra Cheng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:require different degrees, open-ended generation tasks, generation tasks require, open-ended generation, GSS

备注

点击查看摘要

Abstract:Different open-ended generation tasks require different degrees of output diversity. However, current LLMs are often miscalibrated. They collapse to overly homogeneous outputs for creative tasks and hallucinate diverse but incorrect responses for factual tasks. We argue that these two failure modes are unified by, and can both be addressed by, the notion of effective generation space size (GSS) -- the set of semantically distinct outputs a model considers for a prompt. We present GSSBench, a task suite of prompt pairs with ground-truth GSS relationships to assess different metrics and understand where models diverge from desired behavior. We find that hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty quantification metrics, while using only model internals, providing interpretable insights into a model's internal task representations. We demonstrate three applications of GSS: (1) detecting prompt ambiguity and predicting clarification questions for better grounding, (2) interpreting overthinking and underthinking in reasoning models, and (3) steering models to expand their generation space to yield high-quality and diverse outputs.

10. 【2510.12692】Who is a Better Matchmaker? Human vs. Algorithmic Judge Assignment in a High-Stakes Startup Competition

链接https://arxiv.org/abs/2510.12692

作者:Sarina Xi,Orelia Pi,Miaomiao Zhang,Becca Xiong,Jacqueline Ng Lane,Nihar B. Shah

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词:applying artificial intelligence, complex decision-making tasks, artificial intelligence, growing interest, interest in applying

备注: 17 Pages, 2 figures

点击查看摘要

Abstract:There is growing interest in applying artificial intelligence (AI) to automate and support complex decision-making tasks. However, it remains unclear how algorithms compare to human judgment in contexts requiring semantic understanding and domain expertise. We examine this in the context of the judge assignment problem, matching submissions to suitably qualified judges. Specifically, we tackled this problem at the Harvard President's Innovation Challenge, the university's premier venture competition awarding over \$500,000 to student and alumni startups. This represents a real-world environment where high-quality judge assignment is essential. We developed an AI-based judge-assignment algorithm, Hybrid Lexical-Semantic Similarity Ensemble (HLSE), and deployed it at the competition. We then evaluated its performance against human expert assignments using blinded match-quality scores from judges on $309$ judge-venture pairs. Using a Mann-Whitney U statistic based test, we found no statistically significant difference in assignment quality between the two approaches ($AUC=0.48, p=0.40$); on average, algorithmic matches are rated $3.90$ and manual matches $3.94$ on a 5-point scale, where 5 indicates an excellent match. Furthermore, manual assignments that previously required a full week could be automated in several hours by the algorithm during deployment. These results demonstrate that HLSE achieves human-expert-level matching quality while offering greater scalability and efficiency, underscoring the potential of AI-driven solutions to support and enhance human decision-making for judge assignment in high-stakes settings.

11. 【2510.12680】Demystifying Hybrid Thinking: Can LLMs Truly Switch Between Think and No-Think?

链接https://arxiv.org/abs/2510.12680

作者:Shouren Wang,Wang Yang,Xianxuan Long,Qifan Wang,Vipin Chaudhary,Xiaotian Han

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Hybrid thinking enables, thinking enables LLMs, current hybrid thinking, hybrid thinking LLMs, Hybrid thinking

备注: 10 pages, 6 figures

点击查看摘要

Abstract:Hybrid thinking enables LLMs to switch between reasoning and direct answering, offering a balance between efficiency and reasoning capability. Yet our experiments reveal that current hybrid thinking LLMs only achieve partial mode separation: reasoning behaviors often leak into the no-think mode. To understand and mitigate this, we analyze the factors influencing controllability and identify four that matter most: (1) larger data scale, (2) using think and no-think answers from different questions rather than the same question, (3) a moderate increase in no-think data number, and (4) a two-phase strategy that first trains reasoning ability and then applies hybrid think training. Building on these findings, we propose a practical recipe that, compared to standard training, can maintain accuracy in both modes while significantly reducing no-think output length (from $1085$ to $585$ on MATH500) and occurrences of reasoning-supportive tokens such as ``\texttt{wait}'' (from $5917$ to $522$ on MATH500). Our findings highlight the limitations of current hybrid thinking and offer directions for strengthening its controllability.

12. 【2510.12668】he Role of Parametric Injection-A Systematic Study of Parametric Retrieval-Augmented Generation

链接https://arxiv.org/abs/2510.12668

作者:Minghao Tang,Shiyu Ni,Jingtong Wu,Zengxin Han,Keping Bi

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:Retrieval-augmented generation, parametric retrieval-augmented generation, large language models, enhances large language, retrieving external documents

备注

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving external documents. As an emerging form of RAG, parametric retrieval-augmented generation (PRAG) encodes documents as model parameters (i.e., LoRA modules) and injects these representations into the model during inference, enabling interaction between the LLM and documents at parametric level. Compared with directly placing documents in the input context, PRAG is more efficient and has the potential to offer deeper model-document interaction. Despite its growing attention, the mechanism underlying parametric injection remains poorly understood. In this work, we present a systematic study of PRAG to clarify the role of parametric injection, showing that parameterized documents capture only partial semantic information of documents, and relying on them alone yields inferior performance compared to interaction at text level. However, these parametric representations encode high-level document information that can enhance the model's understanding of documents within the input context. When combined parameterized documents with textual documents, the model can leverage relevant information more effectively and become more robust to noisy inputs, achieving better performance than either source alone. We recommend jointly using parameterized and textual documents and advocate for increasing the information content of parametric representations to advance PRAG.

13. 【2510.12643】Reasoning Pattern Matters: Learning to Reason without Human Rationales

链接https://arxiv.org/abs/2510.12643

作者:Chaoxu Pang,Yixuan Cao,Ping Luo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:applies Reinforcement Learning, Large Language Models, performs Supervised Fine-Tuning, Large Language, Reinforcement Learning

备注: Submitted to Frontiers of Computer Science

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities under the widely adopted SFT+RLVR paradigm, which first performs Supervised Fine-Tuning (SFT) on human-annotated reasoning trajectories (rationales) to establish initial reasoning behaviors, then applies Reinforcement Learning with Verifiable Rewards (RLVR) to optimize the model using verifiable signals without golden rationales. However, annotating high-quality rationales for the SFT stage remains prohibitively expensive. This paper investigates when and how rationale annotation costs can be substantially reduced without compromising reasoning performance. We identify a broad class of problems, termed patterned reasoning tasks, where reasoning follows a fixed, procedural strategy consistent across instances. Although instances vary in content such as domain knowledge, factual information, or numeric values, the solution derives from applying a shared reasoning pattern. We argue that the success of SFT+RLVR on such tasks primarily stems from its ability to enable models to internalize these reasoning patterns. Using numerical semantic matching as a representative task, we provide both causal and behavioral evidence showing that reasoning patterns rather than the quantity or quality of rationales are the key determinant of performance. Building on these insights, we propose Pattern-Aware LLMs as Rationale AnnOtators (PARO), a simple yet effective framework that enables LLMs to generate rationales aligned with task-specific reasoning patterns without requiring human rationale annotations. Experiments show that PARO-generated rationales achieve comparable SFT+RLVR performance to human rationales that are 10 times larger. These results suggest that large-scale human rationale annotations can be replaced with LLM-based automatic annotations requiring only limited human supervision over reasoning patterns.

14. 【2510.12637】COSTAR-A: A prompting framework for enhancing Large Language Model performance on Point-of-View questions

链接https://arxiv.org/abs/2510.12637

作者:Nzubechukwu C. Ohalete(1),Kevin B. Gittner(1),Lauren M. Matheny(1) ((1) School of Data Science and Analytics, Kennesaw State University, GA, USA)

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, highly sensitive, techniques is crucial

备注: 20 pages, 2 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are highly sensitive to prompt design, and making optimized prompting techniques is crucial for generating consistent, high-quality outputs. In this study, we introduce COSTAR-A, a novel prompt engineering framework that enhances the existing COSTAR method, which stands for Context, Objective, Style, Tone, Audience, and Response, by adding the 'Answer' component at the end. We demonstrate that while the original COSTAR framework improves prompt clarity and aligns outputs for larger LLMs, its performance is less consistent with smaller, locally optimized models, particularly in tasks that require more directive or constrained outputs. Through a series of controlled prompt-output assessments with smaller (at most 8 billion parameters), fine-tuned models, we found that COSTAR-A can enhance the output structure and decisiveness of localized LLMs for certain tasks, although its effectiveness varies across models and use cases. Notably, the Llama 3.1-8B model exhibited performance improvements when prompted with COSTAR-A compared to COSTAR alone. These findings emphasize the adaptability and scalability of COSTAR-A as a prompting framework, particularly in computationally efficient AI deployments on resource-constrained hardware.

15. 【2510.12621】ACADATA: Parallel Dataset of Academic Data for Machine Translation

链接https://arxiv.org/abs/2510.12621

作者:Iñaki Lacunza,Javier Garcia Gilabert,Francesca De Luca Fornaciari,Javier Aula-Blasco,Aitor Gonzalez-Agirre,Maite Melero,Marta Villegas

类目:Computation and Language (cs.CL)

关键词:million author-generated paragraph, high-quality parallel dataset, author-generated paragraph pairs, curated evaluation set, present ACADATA

备注

点击查看摘要

Abstract:We present ACADATA, a high-quality parallel dataset for academic translation, that consists of two subsets: ACAD-TRAIN, which contains approximately 1.5 million author-generated paragraph pairs across 96 language directions and ACAD-BENCH, a curated evaluation set of almost 6,000 translations covering 12 directions. To validate its utility, we fine-tune two Large Language Models (LLMs) on ACAD-TRAIN and benchmark them on ACAD-BENCH against specialized machine-translation systems, general-purpose, open-weight LLMs, and several large-scale proprietary models. Experimental results demonstrate that fine-tuning on ACAD-TRAIN leads to improvements in academic translation quality by +6.1 and +12.4 d-BLEU points on average for 7B and 2B models respectively, while also improving long-context translation in a general domain by up to 24.9% when translating out of English. The fine-tuned top-performing model surpasses the best propietary and open-weight models on academic translation domain. By releasing ACAD-TRAIN, ACAD-BENCH and the fine-tuned models, we provide the community with a valuable resource to advance research in academic domain and long-context translation.

16. 【2510.12608】StyleDecipher: Robust and Explainable Detection of LLM-Generated Texts with Stylistic Analysis

链接https://arxiv.org/abs/2510.12608

作者:Siyuan Li,Aodu Wulianghai,Xi Lin,Guangyan Li,Xiang Chen,Jun Wu,Jianhua Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, ensuring content authenticity, open-domain writing, authenticity and trust, increasing integration

备注

点击查看摘要

Abstract:With the increasing integration of large language models (LLMs) into open-domain writing, detecting machine-generated text has become a critical task for ensuring content authenticity and trust. Existing approaches rely on statistical discrepancies or model-specific heuristics to distinguish between LLM-generated and human-written text. However, these methods struggle in real-world scenarios due to limited generalization, vulnerability to paraphrasing, and lack of explainability, particularly when facing stylistic diversity or hybrid human-AI authorship. In this work, we propose StyleDecipher, a robust and explainable detection framework that revisits LLM-generated text detection using combined feature extractors to quantify stylistic differences. By jointly modeling discrete stylistic indicators and continuous stylistic representations derived from semantic embeddings, StyleDecipher captures distinctive style-level divergences between human and LLM outputs within a unified representation space. This framework enables accurate, explainable, and domain-agnostic detection without requiring access to model internals or labeled segments. Extensive experiments across five diverse domains, including news, code, essays, reviews, and academic abstracts, demonstrate that StyleDecipher consistently achieves state-of-the-art in-domain accuracy. Moreover, in cross-domain evaluations, it surpasses existing baselines by up to 36.30%, while maintaining robustness against adversarial perturbations and mixed human-AI content. Further qualitative and quantitative analysis confirms that stylistic signals provide explainable evidence for distinguishing machine-generated text. Our source code can be accessed at this https URL.

17. 【2510.12603】Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space

链接https://arxiv.org/abs/2510.12603

作者:Chao Chen,Zhixin Ma,Yongqi Li,Yupeng Hu,Yinwei Wei,Wenjie Li,Liqiang Nie

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:incorporating intermediate reasoning, reasoning, Multimodal reasoning aims, final answer, multimodal latent reasoning

备注

点击查看摘要

Abstract:Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning with the advantages of multimodal representation, reduced annotation, and inference efficiency. To facilicate it, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information in the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps. Experiments on M3CoT and ScienceQA demonstrate that our IVT-LR method achieves an average performance increase of 5.45% in accuracy, while simultaneously achieving a speed increase of over 5 times compared to existing approaches. Code available at this https URL.

18. 【2510.12587】aching Language Models to Faithfully Express their Uncertainty

链接https://arxiv.org/abs/2510.12587

作者:Bryan Eikema,Evgenia Ilia,José G. C. de Souza,Chrysoula Zerva,Wilker Aziz

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, produce divergent answers, repeated queries, reflect this variability

备注

点击查看摘要

Abstract:Large language models (LLMs) often miscommunicate their uncertainty: repeated queries can produce divergent answers, yet generated responses are typically unhedged or hedged in ways that do not reflect this variability. This conveys unfaithful information about the uncertain state of the LLMs' knowledge, creating a faithfulness gap that affects even strong LLMs. We introduce Faithful Uncertainty Tuning (FUT): a fine-tuning approach that teaches instruction-tuned LLMs to express uncertainty faithfully without altering their underlying answer distribution. We construct training data by augmenting model samples with uncertainty hedges (i.e. verbal cues such as 'possibly' or 'likely') aligned with sample consistency, requiring no supervision beyond the model and a set of prompts. We evaluate FUT on open-domain question answering (QA) across multiple models and datasets. Our results show that FUT substantially reduces the faithfulness gap, while preserving QA accuracy and introducing minimal semantic distribution shift. Further analyses demonstrate robustness across decoding strategies, choice of hedgers, and other forms of uncertainty expression (i.e. numerical). These findings establish FUT as a simple and effective way to teach LLMs to communicate uncertainty faithfully.

19. 【2510.12548】VISaGE: Understanding Visual Generics and Exceptions

链接https://arxiv.org/abs/2510.12548

作者:Stella Frank,Emily Allaway

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Language Models, Vision Language, Language Models, learn conceptual representations, generalized knowledge

备注: EMNLP 2025

点击查看摘要

Abstract:While Vision Language Models (VLMs) learn conceptual representations, in the form of generalized knowledge, during training, they are typically used to analyze individual instances. When evaluation instances are atypical, this paradigm results in tension between two priors in the model. The first is a pragmatic prior that the textual and visual input are both relevant, arising from VLM finetuning on congruent inputs; the second is a semantic prior that the conceptual representation is generally true for instances of the category. In order to understand how VLMs trade off these priors, we introduce a new evaluation dataset, VISaGE, consisting of both typical and exceptional images. In carefully balanced experiments, we show that conceptual understanding degrades when the assumption of congruency underlying the pragmatic prior is violated with incongruent images. This effect is stronger than the effect of the semantic prior when querying about individual instances.

20. 【2510.12516】BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)

链接https://arxiv.org/abs/2510.12516

作者:Tomas Ruiz,Siyao Peng,Barbara Plank,Carsten Schwemmer

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:performing extra computation, Test-time scaling, improve LLM outputs, extra computation, family of techniques

备注

点击查看摘要

Abstract:Test-time scaling is a family of techniques to improve LLM outputs at inference time by performing extra computation. To the best of our knowledge, test-time scaling has been limited to domains with verifiably correct answers, like mathematics and coding. We transfer test-time scaling to the LeWiDi-2025 tasks to evaluate annotation disagreements. We experiment with three test-time scaling methods: two benchmark algorithms (Model Averaging and Majority Voting), and a Best-of-N sampling method. The two benchmark methods improve LLM performance consistently on the LeWiDi tasks, but the Best-of-N method does not. Our experiments suggest that the Best-of-N method does not currently transfer from mathematics to LeWiDi tasks, and we analyze potential reasons for this gap.

21. 【2510.12476】When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

链接https://arxiv.org/abs/2510.12476

作者:Lang Gao,Xuhui Li,Chenxi Wang,Mingzhe Li,Wei Liu,Zirui Song,Jinghui Zhang,Rui Yan,Preslav Nakov,Xiuying Chen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:imitating personal style, producing fluent text, language generation, producing fluent, personal style

备注

点击查看摘要

Abstract:Large language models (LLMs) have grown more powerful in language generation, producing fluent text and even imitating personal style. Yet, this ability also heightens the risk of identity impersonation. To the best of our knowledge, no prior work has examined personalized machine-generated text (MGT) detection. In this paper, we introduce \dataset, the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations. Our experimental results demonstrate large performance gaps across detectors in personalized settings: some state-of-the-art models suffer significant drops. We attribute this limitation to the \textit{feature-inversion trap}, where features that are discriminative in general domains become inverted and misleading when applied to personalized text. Based on this finding, we propose \method, a simple and reliable way to predict detector performance changes in personalized settings. \method identifies latent directions corresponding to inverted features and constructs probe datasets that differ primarily along these features to evaluate detector dependence. Our experiments show that \method can accurately predict both the direction and the magnitude of post-transfer changes, showing 85\% correlation with the actual performance gaps. We hope that this work will encourage further research on personalized text detection.

22. 【2510.12474】SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression

链接https://arxiv.org/abs/2510.12474

作者:Biao Zhang,Lixin Chen,Tong Liu,Bo Zheng

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, capture rich semantic, Large language, generate high-dimensional embeddings, Sequential Matryoshka Representation

备注: Accepted by EMNLP2025

点击查看摘要

Abstract:Large language models (LLMs) generate high-dimensional embeddings that capture rich semantic and syntactic information. However, high-dimensional embeddings exacerbate computational complexity and storage requirements, thereby hindering practical deployment. To address these challenges, we propose a novel training framework named Sequential Matryoshka Embedding Compression (SMEC). This framework introduces the Sequential Matryoshka Representation Learning(SMRL) method to mitigate gradient variance during training, the Adaptive Dimension Selection (ADS) module to reduce information degradation during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module to enhance unsupervised learning between high- and low-dimensional embeddings. Experiments on image, text, and multimodal datasets demonstrate that SMEC achieves significant dimensionality reduction while maintaining performance. For instance, on the BEIR dataset, our approach improves the performance of compressed LLM2Vec embeddings (256 dimensions) by 1.1 points and 2.7 points compared to the Matryoshka-Adaptor and Search-Adaptor models, respectively.

23. 【2510.12463】Resource-sensitive but language-blind: Community size and not grammatical complexity better predicts the accuracy of Large Language Models in a novel Wug Test

链接https://arxiv.org/abs/2510.12463

作者:Nikoleta Pantelidou,Evelina Leivada,Paolo Morosi

类目:Computation and Language (cs.CL)

关键词:Large Language Models, abilities of Large, Large Language, ongoing debate, matter of ongoing

备注

点击查看摘要

Abstract:The linguistic abilities of Large Language Models are a matter of ongoing debate. This study contributes to this discussion by investigating model performance in a morphological generalization task that involves novel words. Using a multilingual adaptation of the Wug Test, six models were tested across four partially unrelated languages (Catalan, English, Greek, and Spanish) and compared with human speakers. The aim is to determine whether model accuracy approximates human competence and whether it is shaped primarily by linguistic complexity or by the quantity of available training data. Consistent with previous research, the results show that the models are able to generalize morphological processes to unseen words with human-like accuracy. However, accuracy patterns align more closely with community size and data availability than with structural complexity, refining earlier claims in the literature. In particular, languages with larger speaker communities and stronger digital representation, such as Spanish and English, revealed higher accuracy than less-resourced ones like Catalan and Greek. Overall, our findings suggest that model behavior is mainly driven by the richness of linguistic resources rather than by sensitivity to grammatical complexity, reflecting a form of performance that resembles human linguistic competence only superficially.

24. 【2510.12460】Probing Latent Knowledge Conflict for Faithful Retrieval-Augmented Generation

链接https://arxiv.org/abs/2510.12460

作者:Linfeng Gao,Baolong Bi,Zheng Yuan,Le Wang,Zerui Chen,Zhimin Wei,Shenghua Liu,Qinggang Zhang,Jinsong Su

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Retrieval-Augmented Generation, factuality of Large, Language Models

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm to enhance the factuality of Large Language Models (LLMs). However, existing RAG systems often suffer from an unfaithfulness issue, where the model's response contradicts evidence from the retrieved context. Existing approaches to improving contextual faithfulness largely rely on external interventions, such as prompt engineering, decoding constraints, or reward-based fine-tuning. These works treat the LLM as a black box and overlook a crucial question: how does the LLM internally integrate retrieved evidence with its parametric memory, particularly under knowledge conflicts? To address this gap, we conduct a probing-based analysis of hidden-state representations in LLMs and observe three findings: knowledge integration occurs hierarchically, conflicts manifest as latent signals at the sentence level, and irrelevant context is often amplified when aligned with parametric knowledge. Building on these findings, we propose CLEAR (Conflict-Localized and Enhanced Attention for RAG), a framework that (i) decomposes context into fine-grained sentence-level knowledge, (ii) employs hidden-state probing to localize conflicting knowledge, and (iii) introduces conflict-aware fine-tuning to guide the model to accurately integrate retrieved evidence. Extensive experiments across three benchmarks demonstrate that CLEAR substantially improves both accuracy and contextual faithfulness, consistently outperforming strong baselines under diverse conflict conditions. The related resources are available at this https URL.

25. 【2510.12434】PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation

链接https://arxiv.org/abs/2510.12434

作者:Xiangjun Zai,Xingyu Tan,Xiaoyang Wang,Qing Liu,Xiwei Xu,Wenjie Zhang

类目:Computation and Language (cs.CL)

关键词:model multi-entity relations, Knowledge Hypergraphs, Knowledge Hypergraphs framework, offering a paradigm, existing KH-based RAG

备注

点击查看摘要

Abstract:Knowledge Hypergraphs (KHs) have recently emerged as a knowledge representation for retrieval-augmented generation (RAG), offering a paradigm to model multi-entity relations into a structured form. However, existing KH-based RAG methods suffer from three major limitations: static retrieval planning, non-adaptive retrieval execution, and superficial use of KH structure and semantics, which constrain their ability to perform effective multi-hop question answering. To overcome these limitations, we propose PRoH, a dynamic Planning and Reasoning over Knowledge Hypergraphs framework. PRoH incorporates three core innovations: (i) a context-aware planning module that sketches the local KH neighborhood to guide structurally grounded reasoning plan generation; (ii) a structured question decomposition process that organizes subquestions as a dynamically evolving Directed Acyclic Graph (DAG) to enable adaptive, multi-trajectory exploration; and (iii) an Entity-Weighted Overlap (EWO)-guided reasoning path retrieval algorithm that prioritizes semantically coherent hyperedge traversals. Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining strong robustness in long-range multi-hop reasoning tasks.

26. 【2510.12389】okenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

链接https://arxiv.org/abs/2510.12389

作者:Hailay Kidu Teklehaymanot,Wolfgang Nejdl

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:linguistically diverse populations, achieving equitable access, diverse populations, pose a significant, significant barrier

备注: 6 pages 4 figures

点击查看摘要

Abstract:Tokenization disparities pose a significant barrier to achieving equitable access to artificial intelligence across linguistically diverse populations. This study conducts a large-scale cross-linguistic evaluation of tokenization efficiency in over 200 languages to systematically quantify computational inequities in large language models (LLMs). Using a standardized experimental framework, we applied consistent preprocessing and normalization protocols, followed by uniform tokenization through the tiktoken library across all language samples. Comprehensive tokenization statistics were collected using established evaluation metrics, including Tokens Per Sentence (TPS) and Relative Tokenization Cost (RTC), benchmarked against English baselines. Our cross-linguistic analysis reveals substantial and systematic disparities: Latin-script languages consistently exhibit higher tokenization efficiency, while non-Latin and morphologically complex languages incur significantly greater token inflation, often 3-5 times higher RTC ratios. These inefficiencies translate into increased computational costs and reduced effective context utilization for underrepresented languages. Overall, the findings highlight structural inequities in current AI systems, where speakers of low-resource and non-Latin languages face disproportionate computational disadvantages. Future research should prioritize the development of linguistically informed tokenization strategies and adaptive vocabulary construction methods that incorporate typological diversity, ensuring more inclusive and computationally equitable multilingual AI systems.

27. 【2510.12367】LLM-REVal: Can We Trust LLM Reviewers Yet?

链接https://arxiv.org/abs/2510.12367

作者:Rui Li,Jia-Chen Gu,Po-Nien Kung,Heming Xia,Junfeng liu,Xiangwen Kong,Zhifang Sui,Nanyun Peng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, academic workflow, language models, potentially reshaping, practiced and reviewed

备注

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has inspired researchers to integrate them extensively into the academic workflow, potentially reshaping how research is practiced and reviewed. While previous studies highlight the potential of LLMs in supporting research and peer review, their dual roles in the academic workflow and the complex interplay between research and review bring new risks that remain largely underexplored. In this study, we focus on how the deep integration of LLMs into both peer-review and research processes may influence scholarly fairness, examining the potential risks of using LLMs as reviewers by simulation. This simulation incorporates a research agent, which generates papers and revises, alongside a review agent, which assesses the submissions. Based on the simulation results, we conduct human annotations and identify pronounced misalignment between LLM-based reviews and human judgments: (1) LLM reviewers systematically inflate scores for LLM-authored papers, assigning them markedly higher scores than human-authored ones; (2) LLM reviewers persistently underrate human-authored papers with critical statements (e.g., risk, fairness), even after multiple revisions. Our analysis reveals that these stem from two primary biases in LLM reviewers: a linguistic feature bias favoring LLM-generated writing styles, and an aversion toward critical statements. These results highlight the risks and equity concerns posed to human authors and academic research if LLMs are deployed in the peer review cycle without adequate caution. On the other hand, revisions guided by LLM reviews yield quality gains in both LLM-based and human evaluations, illustrating the potential of the LLMs-as-reviewers for early-stage researchers and enhancing low-quality papers.

28. 【2510.12357】MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts

链接https://arxiv.org/abs/2510.12357

作者:Yushu Zhao,Yubin Qin,Yang Wang,Xiaolong Yang,Huiming Han,Shaojun Wei,Yang Hu,Shouyi Yin

类目:Computation and Language (cs.CL)

关键词:recently demonstrated exceptional, demonstrated exceptional performance, range of applications, recently demonstrated, demonstrated exceptional

备注: Accepted to ASP-DAC 2026

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models have recently demonstrated exceptional performance across a diverse range of applications. The principle of sparse activation in MoE models facilitates an offloading strategy, wherein active experts are maintained in GPU HBM, while inactive experts are stored in CPU DRAM. The efficacy of this approach, however, is fundamentally constrained by the limited bandwidth of the CPU-GPU interconnect. To mitigate this bottleneck, existing approaches have employed prefetching to accelerate MoE inference. These methods attempt to predict and prefetch the required experts using specially trained modules. Nevertheless, such techniques are often encumbered by significant training overhead and have shown diminished effectiveness on recent MoE models with fine-grained expert segmentation. In this paper, we propose MoBiLE, a plug-and-play offloading-based MoE inference framework with \textit{mixture of big-little experts}. It reduces the number of experts for unimportant tokens to half for acceleration while maintaining full experts for important tokens to guarantee model quality. Further, a dedicated fallback and prefetching mechanism is designed for switching between little and big experts to improve memory efficiency. We evaluate MoBiLE on four typical modern MoE architectures and challenging generative tasks. Our results show that MoBiLE achieves a speedup of 1.60x to 1.72x compared to the baseline on a consumer GPU system, with negligible degradation in accuracy.

Comments:
Accepted to ASP-DAC 2026

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2510.12357 [cs.CL]

(or
arXiv:2510.12357v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2510.12357

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
29. 【2510.12355】Fine-grained Analysis of Brain-LLM Alignment through Input Attribution

链接https://arxiv.org/abs/2510.12355

作者:Michela Proietti,Roberto Capobianco,Mariya Toneva

类目:Computation and Language (cs.CL)

关键词:computational principles underlying, reveal computational principles, principles underlying language, human brain activity, computational principles

备注

点击查看摘要

Abstract:Understanding the alignment between large language models (LLMs) and human brain activity can reveal computational principles underlying language processing. We introduce a fine-grained input attribution method to identify the specific words most important for brain-LLM alignment, and leverage it to study a contentious research question about brain-LLM alignment: the relationship between brain alignment (BA) and next-word prediction (NWP). Our findings reveal that BA and NWP rely on largely distinct word subsets: NWP exhibits recency and primacy biases with a focus on syntax, while BA prioritizes semantic and discourse-level information with a more targeted recency effect. This work advances our understanding of how LLMs relate to human language processing and highlights differences in feature reliance between BA and NWP. Beyond this study, our attribution method can be broadly applied to explore the cognitive relevance of model predictions in diverse language processing tasks.

30. 【2510.12327】Simple Projection Variants Improve ColBERT Performance

链接https://arxiv.org/abs/2510.12327

作者:Benjamin Clavié,Sean Lee,Rikiya Takehi,Aamir Shakir,Makoto P. Kato

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Multi-vector dense retrieval, dense retrieval methods, Multi-vector dense, reduce the dimensionality, projection

备注

点击查看摘要

Abstract:Multi-vector dense retrieval methods like ColBERT systematically use a single-layer linear projection to reduce the dimensionality of individual vectors. In this study, we explore the implications of the MaxSim operator on the gradient flows of the training of multi-vector models and show that such a simple linear projection has inherent, if non-critical, limitations in this setting. We then discuss the theoretical improvements that could result from replacing this single-layer projection with well-studied alternative feedforward linear networks (FFN), such as deeper, non-linear FFN blocks, GLU blocks, and skip-connections, could alleviate these limitations. Through the design and systematic evaluation of alternate projection blocks, we show that better-designed final projections positively impact the downstream performance of ColBERT models. We highlight that many projection variants outperform the original linear projections, with the best-performing variants increasing average performance on a range of retrieval benchmarks across domains by over 2 NDCG@10 points. We then conduct further exploration on the individual parameters of these projections block in order to understand what drives this empirical performance, highlighting the particular importance of upscaled intermediate projections and residual connections. As part of these ablation studies, we show that numerous suboptimal projection variants still outperform the traditional single-layer projection across multiple benchmarks, confirming our hypothesis. Finally, we observe that this effect is consistent across random seeds, further confirming that replacing the linear layer of ColBERT models is a robust, drop-in upgrade.

31. 【2510.12316】Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation

链接https://arxiv.org/abs/2510.12316

作者:Greta Damo,Elena Cabrio,Serena Villata

类目:Computation and Language (cs.CL)

关键词:counter harmful content, Large Language Models, Counter-speech generation, harmful content, model counter-speech generation

备注

点击查看摘要

Abstract:Counter-speech generation is at the core of many expert activities, such as fact-checking and hate speech, to counter harmful content. Yet, existing work treats counter-speech generation as pure text generation task, mainly based on Large Language Models or NGO experts. These approaches show severe drawbacks due to the limited reliability and coherence in the generated countering text, and in scalability, respectively. To close this gap, we introduce a novel framework to model counter-speech generation as knowledge-wise text generation process. Our framework integrates advanced Retrieval-Augmented Generation (RAG) pipelines to ensure the generation of trustworthy counter-speech for 8 main target groups identified in the hate speech literature, including women, people of colour, persons with disabilities, migrants, Muslims, Jews, LGBT persons, and other. We built a knowledge base over the United Nations Digital Library, EUR-Lex and the EU Agency for Fundamental Rights, comprising a total of 32,792 texts. We use the MultiTarget-CONAN dataset to empirically assess the quality of the generated counter-speech, both through standard metrics (i.e., JudgeLM) and a human evaluation. Results show that our framework outperforms standard LLM baselines and competitive approach, on both assessments. The resulting framework and the knowledge base pave the way for studying trustworthy and sound counter-speech generation, in hate speech and beyond.

32. 【2510.12306】A large-scale, unsupervised pipeline for automatic corpus annotation using LLMs: variation and change in the English consider construction

链接https://arxiv.org/abs/2510.12306

作者:Cameron Morin,Matti Marttinen Larsson

类目:Computation and Language (cs.CL)

关键词:significant methodological bottleneck, corpus linguistic work, manual annotation remains, language corpora expand, natural language corpora

备注

点击查看摘要

Abstract:As natural language corpora expand at an unprecedented rate, manual annotation remains a significant methodological bottleneck in corpus linguistic work. We address this challenge by presenting a scalable, unsupervised pipeline for automating grammatical annotation in voluminous corpora using large language models (LLMs). Unlike previous supervised and iterative approaches, our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. We demonstrate the pipeline's accessibility and effectiveness through a diachronic case study of variation in the English consider construction. Using GPT-5 through the OpenAI API, we annotate 143,933 sentences from the Corpus of Historical American English (COHA) in under 60 hours, achieving 98%+ accuracy on two sophisticated annotation procedures. Our results suggest that LLMs can perform a range of data preparation tasks at scale with minimal human intervention, opening new possibilities for corpus-based research, though implementation requires attention to costs, licensing, and other ethical considerations.

33. 【2510.12287】Vision Language Models Map Logos to Text via Semantic Entanglement in the Visual Projector

链接https://arxiv.org/abs/2510.12287

作者:Sifan Li,Hongkai Chen,Yujun Cai,Qingwen Ye,Liyang Chen,Junsong Yuan,Yiwei Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Vision Language Models, Vision Language, achieved impressive progress, Language Models, visual evidence

备注

点击查看摘要

Abstract:Vision Language Models (VLMs) have achieved impressive progress in multimodal reasoning; yet, they remain vulnerable to hallucinations, where outputs are not grounded in visual evidence. In this paper, we investigate a previously overlooked setting: logo hallucination, where models generate brand names or textual content despite logos containing no visible words. Using curated splits of pure symbols, hybrids, and text-bearing logos, as well as the challenging Hard-60 subset, we systematically measure hallucination across leading VLMs. We further probe robustness through nine structured perturbations and show that hallucinations persist even under strong distortions, with occlusion exposing the sharpest weaknesses. Embedding-level analysis with open-weight LLaVA demonstrates that hallucination is tied to a small subset of projector dimensions, and targeted ablation substantially reduces errors while preserving OCR accuracy. Together, these findings reveal that VLMs often rely on symbolic priors rather than genuine glyph perception, particularly for iconic circular logos, and that projector subspaces play a decisive role in this failure mode. Our work contributes both a novel diagnostic lens and actionable mitigation insights, highlighting projector disentanglement and OCR-guided decoding as promising directions for building more trustworthy multimodal systems.

34. 【2510.12285】Chinese ModernBERT with Whole-Word Masking

链接https://arxiv.org/abs/2510.12285

作者:Zeyu Zhao,Ningtao Wang,Xing Fu,Yu Cheng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:yielding Pareto gains, Encoder-only Transformers, yielding Pareto, Transformers have advanced, Pareto gains

备注

点击查看摘要

Abstract:Encoder-only Transformers have advanced along three axes -- architecture, data, and systems -- yielding Pareto gains in accuracy, speed, and memory efficiency. Yet these improvements have not fully transferred to Chinese, where tokenization and morphology differ markedly from English. We introduce Chinese ModernBERT, a from-scratch Chinese encoder that couples: (i) a hardware-aware 32k BPE vocabulary tailored to frequent Chinese affixes/compounds, lowering the embedding budget; (ii) whole-word masking (WWM) with a dynamic masking curriculum (30% - 15%) to align task difficulty with training progress; (iii) a two-stage pre-training pipeline that extends the native context from 1,024 to 8,192 tokens using RoPE and alternating local/global attention; and (iv) a damped-cosine learning-rate schedule for stable long-horizon optimization. We pre-train on ~1.2T Chinese tokens from CCI3-HQ, CCI4 (Chinese), and Cosmopedia-Chinese. On CLUE, Chinese ModernBERT is competitive with strong Chinese encoders under a unified fine-tuning protocol. Under bf16 it achieves high long-sequence throughput while maintaining strong short-sequence speed, reflecting benefits from budget allocation and attention design. To probe retrieval-oriented quality, we add a small amount of open contrastive data: fine-tuning on SimCLUE (~3M pairs) improves further when adding T2Ranking (~2M), reaching 0.505 (Pearson) / 0.537 (Spearman) on the SimCLUE test set. Under this open-data setting, Chinese ModernBERT surpasses Qwen-0.6B-embedding on SimCLUE, suggesting a clear scaling path for STS with additional curated pairs. We will release tokenizer and weights to facilitate reproducible research.

35. 【2510.12255】Shallow Robustness, Deep Vulnerabilities: Multi-Turn Evaluation of Medical LLMs

链接https://arxiv.org/abs/2510.12255

作者:Blazej Manczak,Eric Lin,Francisco Eiras,James O' Neill,Vaikkunth Mugunthan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:remains poorly understood, Large language models, interactions remains poorly, Large language, multi-turn interactions remains

备注: Dataset and code: [this https URL](https://huggingface.co/datasets/dynamoai-ml/MedQA-USMLE-4-MultiTurnRobust) ; [this https URL](https://github.com/bmanczak/MedQA-MultiTurnRobustness) Accepted as a poster at NeurIPS 2025 Workshop on GenAI for Health: Potential, Trust, and Policy Compliance

点击查看摘要

Abstract:Large language models (LLMs) are rapidly transitioning into medical clinical use, yet their reliability under realistic, multi-turn interactions remains poorly understood. Existing evaluation frameworks typically assess single-turn question answering under idealized conditions, overlooking the complexities of medical consultations where conflicting input, misleading context, and authority influence are common. We introduce MedQA-Followup, a framework for systematically evaluating multi-turn robustness in medical question answering. Our approach distinguishes between shallow robustness (resisting misleading initial context) and deep robustness (maintaining accuracy when answers are challenged across turns), while also introducing an indirect-direct axis that separates contextual framing (indirect) from explicit suggestion (direct). Using controlled interventions on the MedQA dataset, we evaluate five state-of-the-art LLMs and find that while models perform reasonably well under shallow perturbations, they exhibit severe vulnerabilities in multi-turn settings, with accuracy dropping from 91.2% to as low as 13.5% for Claude Sonnet 4. Counterintuitively, indirect, context-based interventions are often more harmful than direct suggestions, yielding larger accuracy drops across models and exposing a significant vulnerability for clinical deployment. Further compounding analyses reveal model differences, with some showing additional performance drops under repeated interventions while others partially recovering or even improving. These findings highlight multi-turn robustness as a critical but underexplored dimension for safe and reliable deployment of medical LLMs.

36. 【2510.12251】DSAS: A Universal Plug-and-Play Framework for Attention Optimization in Multi-Document Question Answering

链接https://arxiv.org/abs/2510.12251

作者:Jiakai Li,Rongzheng Wang,Yizhuo Ma,Shuang Liang,Guangchun Luo,Ke Qin

类目:Computation and Language (cs.CL)

关键词:large language models, show considerable promise, multi-document question answering, handling multi-document question, language models

备注: 27 pages, has been accepted by NeurIPS 2025

点击查看摘要

Abstract:While large language models (LLMs) show considerable promise across various fields, they have notable limitations in handling multi-document question answering (Multi-doc QA) tasks. The first challenge is long-range dependency modeling, where LLMs struggle to focus on key information in long texts, which weakens important semantic connections. Second, most LLMs suffer from the ''lost-in-the-middle'' issue, where they have difficulty processing information in the middle of long inputs. Current solutions either truncate global dependencies or demand costly finetuning, ultimately lacking a universal and simple solution for these challenges. To resolve these limitations, we propose Dual-Stage Adaptive Sharpening (DSAS) containing two modules. (i) The Contextual Gate Weighting (CGW) module alleviates ''lost-in-the-middle'' by assessing paragraph relevance through layer-wise attention tracking and position-aware weighting. (ii) The Reciprocal Attention Suppression (RAS) module enhances focus on critical paragraphs by suppressing information exchange between key and irrelevant texts, thus mitigating the limitations in long-range dependency modeling. Notably, DSAS functions as a plug-and-play solution requiring no architectural modifications or extra training parameters. Extensive experiments on four benchmarks demonstrate DSAS's efficacy across mainstream LLMs (Llama, Qwen, Mistral, and Deepseek), with an average F1-score improvement of 4.2% in Multi-doc QA tasks on Llama-3.1-8B-Instruct and Qwen2.5-14B-Instruct. Ablation studies confirm the essential contributions of both the CGW and RAS modules. In addition, detailed discussions in the Appendix further validate the robustness and scalability of DSAS.

37. 【2510.12229】Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability

链接https://arxiv.org/abs/2510.12229

作者:Bianca Raimondi,Daniela Dalbagno,Maurizio Gabbrielli

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:manifest remain unclear, Large language models, Large language, internalize human-like biases, biases manifest remain

备注: Preprint. Under review

点击查看摘要

Abstract:Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgements, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a Layer-Patching analysis across 3 open-weights LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without the need for model retraining.

38. 【2510.12217】HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment

链接https://arxiv.org/abs/2510.12217

作者:Ali Mekky,Omar El Herraoui,Preslav Nakov,Yuxia Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Large language, clinical decision support, increasingly deployed, deployed across high-impact

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed across high-impact domains, from clinical decision support and legal analysis to hiring and education, making fairness and bias evaluation before deployment critical. However, existing evaluations lack grounding in real-world scenarios and do not account for differences in harm severity, e.g., a biased decision in surgery should not be weighed the same as a stylistic bias in text summarization. To address this gap, we introduce HALF (Harm-Aware LLM Fairness), a deployment-aligned framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. HALF organizes nine application domains into three tiers (Severe, Moderate, Mild) using a five-stage pipeline. Our evaluation results across eight LLMs show that (1) LLMs are not consistently fair across domains, (2) model size or performance do not guarantee fairness, and (3) reasoning models perform better in medical decision support but worse in education. We conclude that HALF exposes a clear gap between previous benchmarking success and deployment readiness.

39. 【2510.12200】HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities

链接https://arxiv.org/abs/2510.12200

作者:Xiaoxue Ren,Penghao Jiang,Kaixin Li,Zhiyong Huang,Xiaoning Du,Jiaojiao Jiang,Zhenchang Xing,Jiamou Sun,Terry Yue Zhuo

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:sensitive data, prime targets, targets for cyberattacks, cyberattacks as gateways, gateways to critical

备注

点击查看摘要

Abstract:Web applications are prime targets for cyberattacks as gateways to critical services and sensitive data. Traditional penetration testing is costly and expertise-intensive, making it difficult to scale with the growing web ecosystem. While language model agents show promise in cybersecurity, modern web applications demand visual understanding, dynamic content handling, and multi-step interactions that only computer-use agents (CUAs) can perform. Yet, their ability to discover and exploit vulnerabilities through graphical interfaces remains largely unexplored. We present HackWorld, the first framework for systematically evaluating CUAs' capabilities to exploit web application vulnerabilities via visual interaction. Unlike sanitized benchmarks, HackWorld includes 36 real-world applications across 11 frameworks and 7 languages, featuring realistic flaws such as injection vulnerabilities, authentication bypasses, and unsafe input handling. Using a Capture-the-Flag (CTF) setup, it tests CUAs' capacity to identify and exploit these weaknesses while navigating complex web interfaces. Evaluation of state-of-the-art CUAs reveals concerning trends: exploitation rates below 12% and low cybersecurity awareness. CUAs often fail at multi-step attack planning and misuse security tools. These results expose the current limitations of CUAs in web security contexts and highlight opportunities for developing more security-aware agents capable of effective vulnerability detection and exploitation.

40. 【2510.12195】DPO-Tuned Large Language Models for Segmentation in Simultaneous Speech Translation

链接https://arxiv.org/abs/2510.12195

作者:Zeyu Yang,Satoshi Nakamura

类目:Computation and Language (cs.CL)

关键词:requires accurate segmentation, speech translation requires, translation requires accurate, requires accurate, segmentation

备注

点击查看摘要

Abstract:Simultaneous speech translation requires accurate segmentation to balance translation quality and latency. Recent studies such as SHAS have introduced pretrained segmentation models, achieving stronger performance than heuristic rules. However, segmentation models such as SHAS, though pretrained and more robust than heuristic methods, are still constrained by supervised learning objectives and do not incorporate human preference alignment, which is crucial for natural real-time interpretation. In this work, we propose a segmentation framework based on large language models (LLMs) trained with Direct Preference Optimization (DPO). By leveraging preference alignment, our method enables LLMs to predict natural segmentation points that better meet the demands of real-time translation. We evaluate the system on the ACL 60/60 corpus across three language pairs (English-Japanese, Chinese, German), using SeamlessM4T v2 as the translation backbone. Experimental results show that our DPO-tuned LLM achieves higher segmentation accuracy than SHAS and yields consistent improvements in translation quality (BLEU, COMET) as well as latency (Average Lagging). Furthermore, our system benefits from IWSLT baselines for direct comparison. These findings highlight the potential of preference-tuned LLMs to surpass existing pretrained segmentation models and advance adaptive, human-aligned simultaneous interpretation.

41. 【2510.12185】Not in Sync: Unveiling Temporal Bias in Audio Chat Models

链接https://arxiv.org/abs/2510.12185

作者:Jiayu Yao,Shenghua Liu,Yiwei Wang,Rundong Cheng,Lingrui Mei,Baolong Bi,Zhen Xiong,Xueqi Cheng

类目:Computation and Language (cs.CL); Sound (cs.SD)

关键词:Large Audio Language, occur remains underexplored, Audio Language Models, events occur remains, Large Audio

备注

点击查看摘要

Abstract:Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning, yet their ability to locate when events occur remains underexplored. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction. For example, when asked "At which second does the lecturer introduce the key formula?", models often predict timestamps that are consistently earlier or later than the ground truth. Through controlled experiments on timestamped datasets, we find that temporal bias (i) is prevalent across datasets and models, (ii) increases with audio length - even accumulating to tens of seconds in extended recordings, and (iii) varies across event types and positions. We quantify this effect with the Temporal Bias Index (TBI), measuring systematic misalignment in predicted event timings, and complement it with a visualization framework. Our findings highlight a fundamental limitation in current LALMs and call for the development of temporally robust architectures.

42. 【2510.12181】From Knowledge to Treatment: Large Language Model Assisted Biomedical Concept Representation for Drug Repurposing

链接https://arxiv.org/abs/2510.12181

作者:Chengrui Xiang,Tengfei Ma,Xiangzheng Fu,Yiping Liu,Bosheng Song,Xiangxiang Zeng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:accelerating treatment discovery, Drug repurposing plays, plays a critical, critical role, role in accelerating

备注: 16 pages, 4 figures, 13 tables. Accepted by EMNLP 2025 (Findings)

点击查看摘要

Abstract:Drug repurposing plays a critical role in accelerating treatment discovery, especially for complex and rare diseases. Biomedical knowledge graphs (KGs), which encode rich clinical associations, have been widely adopted to support this task. However, existing methods largely overlook common-sense biomedical concept knowledge in real-world labs, such as mechanistic priors indicating that certain drugs are fundamentally incompatible with specific treatments. To address this gap, we propose LLaDR, a Large Language Model-assisted framework for Drug Repurposing, which improves the representation of biomedical concepts within KGs. Specifically, we extract semantically enriched treatment-related textual representations of biomedical entities from large language models (LLMs) and use them to fine-tune knowledge graph embedding (KGE) models. By injecting treatment-relevant knowledge into KGE, LLaDR largely improves the representation of biomedical concepts, enhancing semantic understanding of under-studied or complex indications. Experiments based on benchmarks demonstrate that LLaDR achieves state-of-the-art performance across different scenarios, with case studies on Alzheimer's disease further confirming its robustness and effectiveness. Code is available at this https URL.

43. 【2510.12178】Evolution of meta's llama models and parameter-efficient fine-tuning of large language models: a survey

链接https://arxiv.org/abs/2510.12178

作者:Abdulhady Abas Abdullah,Arkaitz Zubiaga,Seyedali Mirjalili,Amir H. Gandomi,Fatemeh Daneshfar,Mohammadsadra Amini,Alan Salam Mohammed,Hadi Veisi

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Language Model Meta, Large Language Model, Model Meta, Large Language, specialized parameter-efficient fine-tuning

备注

点击查看摘要

Abstract:This review surveys the rapid evolution of Meta AI's LLaMA (Large Language Model Meta AI) series - from LLaMA 1 through LLaMA 4 and the specialized parameter-efficient fine-tuning (PEFT) methods developed for these models. We first describe the LLaMA family of foundation models (7B-65B to 288B parameters), their architectures (including native multimodal and Mixtureof-Experts variants), and key performance characteristics. We then describe and discuss the concept of PEFT, which adapts large pre-trained models by updating only a small subset of parameters, and review five PEFT methods that have been applied to LLaMA: LoRA (Low-Rank Adaptation), LLaMA-Adapter V1 and V2, LLaMA-Excitor, and QLoRA (Quantized LoRA). We discuss each method's mechanism, parameter savings, and example application to LLaMA (e.g., instruction tuning, multimodal tasks). We provide structured discussion and analysis of model and adapter architectures, parameter counts, and benchmark results (including examples where fine-tuned LLaMA models outperform larger baselines). Finally, we examine real-world use cases where LLaMA-based models and PEFT have been successfully applied (e.g., legal and medical domains), and we discuss ongoing challenges and future research directions (such as scaling to even larger contexts and improving robustness). This survey paper provides a one-stop resource for ML researchers and practitioners interested in LLaMA models and efficient fine-tuning strategies.

44. 【2510.12167】owards Inference-time Scaling for Continuous Space Reasoning

链接https://arxiv.org/abs/2510.12167

作者:Minghan Wang,Thuy-Trang Vu,Ehsan Shareghi,Gholamreza Haffari

类目:Computation and Language (cs.CL)

关键词:combination with Process, large language models, re-ranking has proven, scaling through multiple, large language

备注

点击查看摘要

Abstract:Inference-time scaling through multiple sample generation in combination with Process- or Outcome-Reward Model (PRM or ORM) re-ranking has proven effective for text-based reasoning in large language models. This paper investigates whether such established techniques can be successfully adapted to reasoning in the continuous space, using COCONUT (Hao et al. 2024) continuous space reasoning LM as the backbone. We demonstrate the feasibility of generating diverse reasoning paths through dropout-based sampling. Our Pass@N analysis on the generated samples reveals the potential that could enable a significant gain in performance akin to observed gain in the discrete space. However, we highlight unique challenges faced for materializing this gain in the continuous thought space. In particular, working recipes for data generation and training PRM and ORM models in the discrete space unlocks only marginal improvements in the continuous space. Through probing various aspects including geometric properties and trajectory dynamics we identify the underlying reasons that prevent effective discrimination between correct and incorrect reasoning (essential for the functioning of PRM and ORM). Our findings reveal that current limitations stem from the absence of key inductive biases in continuous thought representations. We argue that the training frameworks for continuous reasoning LMs require not only to optimize for accuracy but also to explicitly incorporate inductive biases that could be utilized during inference-time for discrimination of correct and incorrect thoughts.\footnote{Our code and data will be publicly available.}

45. 【2510.12164】A Survey on Parallel Reasoning

链接https://arxiv.org/abs/2510.12164

作者:Ziqi Wang,Boye Niu,Zipeng Gao,Zhi Zheng,Tong Xu,Linghui Meng,Zhongli Li,Jing Liu,Yilong Chen,Chen Zhu,Hua Wu,Haifeng Wang,Enhong Chen

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, concurrently exploring multiple, exploring multiple lines

备注

点击查看摘要

Abstract:With the increasing capabilities of Large Language Models (LLMs), parallel reasoning has emerged as a new inference paradigm that enhances reasoning robustness by concurrently exploring multiple lines of thought before converging on a final answer. It has become a significant trend to explore parallel reasoning to overcome the fragility of standard sequential methods and improve practical performance. In this paper, we aim to survey and summarize the progress and challenges of parallel reasoning. We first present a formal definition of parallel reasoning and clarify its distinction from related concepts like Chain-of-Thought. Then, we organize and discuss advanced techniques based on a novel taxonomy, including non-interactive reasoning, interactive reasoning, and efficiency-focused decoding strategies. Additionally, we explore various application scenarios, such as solving complex problems and enhancing the reliability of LLM this http URL, we highlight the core challenges of parallel reasoning and suggest potential directions for future research. We hope that our work can provide a useful roadmap for beginners and encourage more research on improving parallel reasoning methods. Related source can be avaliable in this https URL.

46. 【2510.12137】Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models

链接https://arxiv.org/abs/2510.12137

作者:Shihao Ji,Zihui Song,Jiajie Huang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, generating factually incorrect, Transformer Softmax function, Language Models

备注

点击查看摘要

Abstract:Large Language Models (LLMs) hallucinate, generating factually incorrect yet confident assertions. We argue this stems from the Transformer's Softmax function, which creates "Artificial Certainty" by collapsing ambiguous attention scores into a single probability distribution, discarding uncertainty information at each layer. To fix this, we introduce the Credal Transformer, which replaces standard attention with a Credal Attention Mechanism (CAM) based on evidential theory. CAM produces a "credal set" (a set of distributions) instead of a single attention vector, with the set's size directly measuring model uncertainty. We implement this by re-conceptualizing attention scores as evidence masses for a Dirichlet distribution: sufficient evidence recovers standard attention, while insufficient evidence yields a diffuse distribution, representing ambiguity. Empirically, the Credal Transformer identifies out-of-distribution inputs, quantifies ambiguity, and significantly reduces confident errors on unanswerable questions by abstaining. Our contribution is a new architecture to mitigate hallucinations and a design paradigm that integrates uncertainty quantification directly into the model, providing a foundation for more reliable AI.

47. 【2510.12133】SafeMT: Multi-turn Safety for Multimodal Language Models

链接https://arxiv.org/abs/2510.12133

作者:Han Zhu,Juntao Dai,Jiaming Ji,Haoran Li,Chengkun Cai,Pengcheng Wen,Chi-Min Chan,Boyuan Chen,Yaodong Yang,Sirui Han,Yike Guo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:multi-modal Large Language, Large Language models, Large Language, multi-modal Large, Language models

备注

点击查看摘要

Abstract:With the widespread use of multi-modal Large Language models (MLLMs), safety issues have become a growing concern. Multi-turn dialogues, which are more common in everyday interactions, pose a greater risk than single prompts; however, existing benchmarks do not adequately consider this situation. To encourage the community to focus on the safety issues of these models in multi-turn dialogues, we introduce SafeMT, a benchmark that features dialogues of varying lengths generated from harmful queries accompanied by images. This benchmark consists of 10,000 samples in total, encompassing 17 different scenarios and four jailbreak methods. Additionally, we propose Safety Index (SI) to evaluate the general safety of MLLMs during conversations. We assess the safety of 17 models using this benchmark and discover that the risk of successful attacks on these models increases as the number of turns in harmful dialogues rises. This observation indicates that the safety mechanisms of these models are inadequate for recognizing the hazard in dialogue interactions. We propose a dialogue safety moderator capable of detecting malicious intent concealed within conversations and providing MLLMs with relevant safety policies. Experimental results from several open-source models indicate that this moderator is more effective in reducing multi-turn ASR compared to existed guard models.

48. 【2510.12121】Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing

链接https://arxiv.org/abs/2510.12121

作者:Rongzhi Zhang,Liqin Ye,Yuzhao Heng,Xiang Chen,Tong Yu,Lingkai Kong,Sudheer Chava,Chao Zhang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:generating Large Language, Large Language Model, Large Language, diverse user expectations, Precise attribute intensity

备注

点击查看摘要

Abstract:Precise attribute intensity control--generating Large Language Model (LLM) outputs with specific, user-defined attribute intensities--is crucial for AI systems adaptable to diverse user expectations. Current LLM alignment methods, however, typically provide only directional or open-ended guidance, failing to reliably achieve exact attribute intensities. We address this limitation with three key designs: (1) reformulating precise attribute intensity control as a target-reaching problem, rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations, thereby steering LLM outputs; and (3) employing gradient-based interventions on hidden representations to navigate the model precisely towards specific attribute intensity targets. Our method enables fine-grained, continuous control over attribute intensities, moving beyond simple directional alignment. Experiments on LLaMA-3.2-3b and Phi-4-mini confirm our method's ability to steer text generation to user-specified attribute intensities with high accuracy. Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference. Our code is available on this https URL

49. 【2510.12116】Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models

链接https://arxiv.org/abs/2510.12116

作者:Bajian Xiang,Shuaijiang Zhao,Tingwei Guo,Wei Zou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Speech Language, Speech Language Models, Language Models, conversational generation abilities, semantic understanding benchmarks

备注: Accepted to EMNLP 2025 (Main Conference)

点击查看摘要

Abstract:End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities, yet consistently fall short of traditional pipeline systems on semantic understanding benchmarks. In this work, we reveal through systematic experimentation that although LSLMs lose some text input performance after speech-text alignment training, the performance gap between speech and text inputs is more pronounced, which we refer to as the modality gap. To understand this gap, we analyze both coarse- and fine-grained text and speech representations. At the coarse-grained level, representations of speech and text in deeper layers are found to be increasingly aligned in direction (cosine similarity), while concurrently diverging in magnitude (Euclidean distance). We further find that representation similarity is strongly correlated with the modality gap. At the fine-grained level, a spontaneous token-level alignment pattern between text and speech representations is observed. Based on this, we introduce the Alignment Path Score to quantify token-level alignment quality, which exhibits stronger correlation with the modality gap. Building on these insights, we design targeted interventions on critical tokens through angle projection and length normalization. These strategies demonstrate the potential to improve correctness for speech inputs. Our study provides the first systematic empirical analysis of the modality gap and alignment mechanisms in LSLMs, offering both theoretical and methodological guidance for future optimization.

50. 【2510.12115】racing Multilingual Knowledge Acquisition Dynamics in Domain Adaptation: A Case Study of English-Japanese Biomedical Adaptation

链接https://arxiv.org/abs/2510.12115

作者:Xin Zhao,Naoki Yoshinaga,Yuma Tsuta,Akiko Aizawa

类目:Computation and Language (cs.CL)

关键词:large language models, multilingual knowledge acquisition, Multilingual domain adaptation, language models, domain adaptation

备注: 22 Pages, Submitted to ARR 2025 Oct

点击查看摘要

Abstract:Multilingual domain adaptation (ML-DA) is widely used to learn new domain knowledge across languages into large language models (LLMs). Although many methods have been proposed to improve domain adaptation, the mechanisms of multilingual knowledge acquisition, how domain knowledge is learned within a language and transferred across languages, remain underexplored. This gap leads to suboptimal performance, particularly in low-resource settings. This work examines the learning dynamics of LLMs during ML-DA. Because prior ML-DA studies often train and evaluate on datasets with mismatched knowledge coverage, we propose AdaXEval, an adaptive evaluation method that builds multiple-choice QA datasets from the same bilingual domain corpus used for training, thereby directly studying multilingual knowledge acquisition. Through continual training of LLMs with diverse data recipes, we track how LLMs acquire domain facts and pinpoint the mechanism behind the transformation process from domain training data to knowledge. Our experiments on a 13B English-Japanese bilingual LLM reveal that cross-lingual transfer remains challenging despite a high-quality bilingual corpus. The code has been released.

51. 【2510.12110】Deep Associations, High Creativity: A Simple yet Effective Metric for Evaluating Large Language Models

链接https://arxiv.org/abs/2510.12110

作者:Ziliang Qiu,Renfen Hu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:crucial research domain, LLMs' creativity represents, Parallel Association Chains, Chatbot Arena Creative, Arena Creative Writing

备注: 14 pages

点击查看摘要

Abstract:The evaluation of LLMs' creativity represents a crucial research domain, though challenges such as data contamination and costly human assessments often impede progress. Drawing inspiration from human creativity assessment, we propose PACE, asking LLMs to generate Parallel Association Chains to Evaluate their creativity. PACE minimizes the risk of data contamination and offers a straightforward, highly efficient evaluation, as evidenced by its strong correlation with Chatbot Arena Creative Writing rankings (Spearman's $\rho = 0.739$, $p 0.001$) across various proprietary and open-source models. A comparative analysis of associative creativity between LLMs and humans reveals that while high-performing LLMs achieve scores comparable to average human performance, professional humans consistently outperform LLMs. Furthermore, linguistic analysis reveals that both humans and LLMs exhibit a trend of decreasing concreteness in their associations, and humans demonstrating a greater diversity of associative patterns.

52. 【2510.12088】One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration

链接https://arxiv.org/abs/2510.12088

作者:Zaid Khan,Archiki Prasad,Elias Stengel-Eskin,Jaemin Cho,Mohit Bansal

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:modeling requires inferring, world modeling requires, executable program, modeling requires, requires inferring

备注: Project page: [this https URL](https://onelife-worldmodel.github.io/;) 39 pages

点击查看摘要

Abstract:Symbolic world modeling requires inferring and representing an environment's transitional dynamics as an executable program. Prior work has focused on largely deterministic environments with abundant interaction data, simple mechanics, and human guidance. We address a more realistic and challenging setting, learning in a complex, stochastic environment where the agent has only "one life" to explore a hostile environment without human guidance. We introduce OneLife, a framework that models world dynamics through conditionally-activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition-effect structure, activating in relevant world states. This creates a dynamic computation graph that routes inference and optimization only through relevant laws, avoiding scaling challenges when all laws contribute to predictions about a complex, hierarchical state, and enabling the learning of stochastic dynamics even with sparse rule activation. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also test OneLife's planning ability, with simulated rollouts successfully identifying superior strategies. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.

53. 【2510.12083】An AI-Based Behavioral Health Safety Filter and Dataset for Identifying Mental Health Crises in Text-Based Conversations

链接https://arxiv.org/abs/2510.12083

作者:Benjamin W. Nelson,Celeste Wong,Matthew T. Silvestrini,Sooyoon Shin,Alanna Robinson,Jessica Lee,Eric Yang,John Torous,Andrew Trister

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Omni Moderation Latest, OpenAI Omni Moderation, Verily Mental Health, mishandle psychiatric emergencies

备注: Main Text: 2943; Abstract: 256; Tables and Figures: 5

点击查看摘要

Abstract:Large language models often mishandle psychiatric emergencies, offering harmful or inappropriate advice and enabling destructive behaviors. This study evaluated the Verily behavioral health safety filter (VBHSF) on two datasets: the Verily Mental Health Crisis Dataset containing 1,800 simulated messages and the NVIDIA Aegis AI Content Safety Dataset subsetted to 794 mental health-related messages. The two datasets were clinician-labelled and we evaluated performance using the clinician labels. Additionally, we carried out comparative performance analyses against two open source, content moderation guardrails: OpenAI Omni Moderation Latest and NVIDIA NeMo Guardrails. The VBHSF demonstrated, well-balanced performance on the Verily Mental Health Crisis Dataset v1.0, achieving high sensitivity (0.990) and specificity (0.992) in detecting any mental health crises. It achieved an F1-score of 0.939, sensitivity ranged from 0.917-0.992, and specificity was = 0.978 in identifying specific crisis categories. When evaluated against the NVIDIA Aegis AI Content Safety Dataset 2.0, VBHSF performance remained highly sensitive (0.982) and accuracy (0.921) with reduced specificity (0.859). When compared with the NVIDIA NeMo and OpenAI Omni Moderation Latest guardrails, the VBHSF demonstrated superior performance metrics across both datasets, achieving significantly higher sensitivity in all cases (all p 0.001) and higher specificity relative to NVIDIA NeMo (p 0.001), but not to OpenAI Omni Moderation Latest (p = 0.094). NVIDIA NeMo and OpenAI Omni Moderation Latest exhibited inconsistent performance across specific crisis types, with sensitivity for some categories falling below 0.10. Overall, the VBHSF demonstrated robust, generalizable performance that prioritizes sensitivity to minimize missed crises, a crucial feature for healthcare applications.

54. 【2510.12063】hinkPilot: Steering Reasoning Models via Automated Think-prefixes Optimization

链接https://arxiv.org/abs/2510.12063

作者:Sunzhu Li,Zhiyu Lin,Shuling Yang,Jiale Zhao,Wei Chen

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Reasoning Models, Large Reasoning, suffer from inefficient, inefficient and off-target, Large

备注

点击查看摘要

Abstract:Large Reasoning Models (LRMs) are powerful, but they still suffer from inefficient and off-target reasoning. Currently, training-free methods are limited to either rigid heuristics or descriptive, non-actionable analyses. In this paper, we introduce ThinkPilot, a training-free framework that automatically optimizes LRMs reasoning. It uses an evolutionary process to generate think-prefixes, which are instructions that evolve driven by a taxonomy of reasoning behaviors to guide models toward superior performance. Extensive experiments demonstrate ThinkPilot's broad effectiveness: it significantly improves the accuracy-length trade-off for efficient reasoning, drastically improves safety (for example, cutting the StrongREJECT score of DeepSeek-R1-Distill-Qwen-32B from 27.0% to 0.7), and enhances instruction following. It also synergizes with existing training-based methods. Our analysis reveals that think-prefixes can reliably control LRMs' reasoning behaviors, and that different tasks have strong preferences for specific behavioral distributions. By automatically identifying and eliciting these behaviors, ThinkPilot provides a generalizable framework for aligning LRMs reasoning with task demands. Data and code are available at this https URL

55. 【2510.12051】APCE: Adaptive Progressive Context Expansion for Long Context Processing

链接https://arxiv.org/abs/2510.12051

作者:Baisub Lee,Sanghyun Byun,Mohanad Odema,Jung Guack,Jacob Song,Woo Seong Chung

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Long-Context Transformer Models, increasing context length, Transformer Models, transformer architecture performance, sequence length increases

备注: NeurIPS 2025 Workshop: ML For Systems

点击查看摘要

Abstract:Deploying useful Long-Context Transformer Models (LCTMs) requires addressing two key challenges: (1) A growing memory footprint due to quadratic self-attention and linear KV-cache scaling in memory as sequence length increases; (2) the ContextRot phenomena where empirical evidence suggests that transformer architecture's performance degrades with increasing context length. Given the shared dependency on the input, a natural question arises: Can we surgically select the most important input chunks for processing to synergistically (a) reduce the memory footprint, and (b) mitigate the ContextRot effects? In this paper, we answer this question in the affirmative for long-context summarization tasks. We propose APCE as a context-aware solution to select the most important input chunks through low-dimensional semantic similarity matching with the current query. By directly operating on the input, APCE decouples from strict dependency on underlying hardware or CUDA environments, promising a compatible solution scalable to different deployment systems. Our empirical evaluations have demonstrated superior or on-par summarization performance for APCE compared to the full dense baseline using a fraction (50%-70%) of the input sequence resulting in KV-cache and self-attention memory efficiency improvements. We hope our findings inspire further research on context-aware efficiency solutions for LCTMs geared towards other relevant long-context tasks.

56. 【2510.12044】Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models

链接https://arxiv.org/abs/2510.12044

作者:Yukun Zhang,Qi Dong

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Direct Preference Optimization, Large Language, Direct Preference, Existing alignment techniques

备注

点击查看摘要

Abstract:Existing alignment techniques for Large Language Models (LLMs), such as Direct Preference Optimization (DPO), typically treat the model as a monolithic entity, applying uniform optimization pressure across all layers. This approach overlooks the functional specialization within the Transformer architecture, where different layers are known to handle distinct tasks from syntax to abstract reasoning. In this paper, we challenge this one-size-fits-all paradigm by introducing Hierarchical Alignment, a novel method that applies targeted DPO to distinct functional blocks of a model's layers: local (syntax), intermediate (logic), and global (factuality). Through a series of controlled experiments on state-of-the-art models like Llama-3.1-8B and Qwen1.5-7B using LoRA for surgical fine-tuning, our results, evaluated by a powerful LLM-as-Judge, demonstrate significant and predictable improvements. Specifically, aligning the local layers (Local-Align) enhances grammatical fluency. More importantly, aligning the global layers (Global-Align) not only improves factual consistency as hypothesized but also proves to be the most effective strategy for enhancing logical coherence, outperforming all baselines. Critically, all hierarchical strategies successfully avoid the "alignment tax" observed in standard DPO, where gains in fluency come at the cost of degraded logical reasoning. These findings establish a more resource-efficient, controllable, and interpretable path for model alignment, highlighting the immense potential of shifting from monolithic optimization to structure-aware surgical fine-tuning to build more advanced and reliable LLMs.

57. 【2510.12041】Improving Text-to-Image Generation with Input-Side Inference-Time Scaling

链接https://arxiv.org/abs/2510.12041

作者:Ruibo Chen,Jiacheng Pan,Heng Huang,Zhenheng Yang

类目:Computation and Language (cs.CL)

关键词:Recent advances, achieved impressive results, generation have achieved, leading to suboptimal, suboptimal image-text alignment

备注

点击查看摘要

Abstract:Recent advances in text-to-image (T2I) generation have achieved impressive results, yet existing models often struggle with simple or underspecified prompts, leading to suboptimal image-text alignment, aesthetics, and quality. We propose a prompt rewriting framework that leverages large language models (LLMs) to refine user inputs before feeding them into T2I backbones. Our approach introduces a carefully designed reward system and an iterative direct preference optimization (DPO) training pipeline, enabling the rewriter to enhance prompts without requiring supervised fine-tuning data. We evaluate our method across diverse T2I models and benchmarks. Results show that our prompt rewriter consistently improves image-text alignment, visual quality, and aesthetics, outperforming strong baselines. Furthermore, we demonstrate strong transferability by showing that a prompt rewriter trained on one T2I backbone generalizes effectively to others without needing to be retrained. We also systematically study scalability, evaluating how performance gains scale with the capacity of the large LLM used as the rewriter. These findings highlight that prompt rewriting is an effective, scalable, and practical model-agnostic strategy for improving T2I systems. We plan to release the code and trained prompt rewriters soon.

58. 【2510.12040】Uncertainty Quantification for Hallucination Detection in Large Language Models: Foundations, Methodology, and Future Directions

链接https://arxiv.org/abs/2510.12040

作者:Sungmin Kang,Yavuz Faruk Bakman,Duygu Nur Yaldiz,Baturalp Buyukates,Salman Avestimehr

类目:Computation and Language (cs.CL)

关键词:natural language processing, including question answering, areas including question, large language models, language processing

备注: 24 pages, 3 figures, magazine

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has transformed the landscape of natural language processing, enabling breakthroughs across a wide range of areas including question answering, machine translation, and text summarization. Yet, their deployment in real-world applications has raised concerns over reliability and trustworthiness, as LLMs remain prone to hallucinations that produce plausible but factually incorrect outputs. Uncertainty quantification (UQ) has emerged as a central research direction to address this issue, offering principled measures for assessing the trustworthiness of model generations. We begin by introducing the foundations of UQ, from its formal definition to the traditional distinction between epistemic and aleatoric uncertainty, and then highlight how these concepts have been adapted to the context of LLMs. Building on this, we examine the role of UQ in hallucination detection, where quantifying uncertainty provides a mechanism for identifying unreliable generations and improving reliability. We systematically categorize a wide spectrum of existing methods along multiple dimensions and present empirical results for several representative approaches. Finally, we discuss current limitations and outline promising future research directions, providing a clearer picture of the current landscape of LLM UQ for hallucination detection.

59. 【2510.12036】On the Interplay between Human Label Variation and Model Fairness

链接https://arxiv.org/abs/2510.12036

作者:Kemal Kurniawan,Meladel Mistica,Timothy Baldwin,Jey Han Lau

类目:Computation and Language (cs.CL)

关键词:human label variation, unexplored topic, label variation, Abstract, HLV

备注: 9 pages, 7 figures

点击查看摘要

Abstract:The impact of human label variation (HLV) on model fairness is an unexplored topic. This paper examines the interplay by comparing training on majority-vote labels with a range of HLV methods. Our experiments show that without explicit debiasing, HLV training methods have a positive impact on fairness.

60. 【2510.12032】Multi-stage Prompt Refinement for Mitigating Hallucinations in Large Language Models

链接https://arxiv.org/abs/2510.12032

作者:Jung-Woo Shim,Yeong-Joon Ju,Ji-Hoon Park,Seong-Whan Lee

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:shown strong performance, Recent advancements, natural language understanding, advancements in large, shown strong

备注: 22 pages, 6 figures

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have shown strong performance in natural language understanding and generation tasks. However, LLMs continue to encounter challenges with hallucinations, where models generate plausible but incorrect information. While several factors contribute to hallucinations, the impact of ill-formed prompts, prompts with ambiguous wording, incorrect grammar, or incomplete information, was relatively under explored. To address this, we introduce Multi-stage Prompt Refinement (MPR), a framework designed to systematically improve these ill-formed prompts across multiple stages. Each stage addresses specific errors such as punctuation, typographical mistakes, and misuse of key terms, using small language models (SLMs) fine-tuned for these tasks. MPR iteratively enhances the clarity of prompts with additional context and employs a self-reflection mechanism with ranking to prioritize the most relevant input. Experimental results on hallucination benchmarks show that prompts refined by MPR achieve over an 85~\% win rate compared to their original forms, demonstrating its effectiveness in reducing hallucinations and improving LLM output accuracy. Interestingly, we reveal that MPR can be combined with existing post-hoc hallucination mitigation frameworks, further enhancing its versatility. MPR provides a lightweight and adaptable solution for enhancing LLM reliability across various domains.

61. 【2510.12029】CPR: Mitigating Large Language Model Hallucinations with Curative Prompt Refinement

链接https://arxiv.org/abs/2510.12029

作者:Jung-Woo Shim,Yeong-Joon Ju,Ji-Hoon Park,Seong-Whan Lee

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Recent advancements, highlight their fluency, large language models, advancements in large, fluency in generating

备注: 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 7 pages, 2 figures

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) highlight their fluency in generating responses to diverse prompts. However, these models sometimes generate plausible yet incorrect ``hallucinated" facts, undermining trust. A frequent but often overlooked cause of such errors is the use of poorly structured or vague prompts by users, leading LLMs to base responses on assumed rather than actual intentions. To mitigate hallucinations induced by these ill-formed prompts, we introduce Curative Prompt Refinement (CPR), a plug-and-play framework for curative prompt refinement that 1) cleans ill-formed prompts, and 2) generates additional informative task descriptions to align the intention of the user and the prompt using a fine-tuned small language model. When applied to language models, we discover that CPR significantly increases the quality of generation while also mitigating hallucination. Empirical studies show that prompts with CPR applied achieves over a 90\% win rate over the original prompts without any external knowledge.

62. 【2510.12023】Information Extraction from Conversation Transcripts: Neuro-Symbolic vs. LLM

链接https://arxiv.org/abs/2510.12023

作者:Alice Saebom Kwak,Maria Alexeeva,Gus Hahn-Powell,Keith Alcock,Kevin McLaughlin,Doug McCorkle,Gabe McNunn,Mihai Surdeanu

类目:Computation and Language (cs.CL)

关键词:effectively discarding decades, large language models, effectively discarding, current trend, rely extensively

备注: 15 pages, 2 figures

点击查看摘要

Abstract:The current trend in information extraction (IE) is to rely extensively on large language models, effectively discarding decades of experience in building symbolic or statistical IE systems. This paper compares a neuro-symbolic (NS) and an LLM-based IE system in the agricultural domain, evaluating them on nine interviews across pork, dairy, and crop subdomains. The LLM-based system outperforms the NS one (F1 total: 69.4 vs. 52.7; core: 63.0 vs. 47.2), where total includes all extracted information and core focuses on essential details. However, each system has trade-offs: the NS approach offers faster runtime, greater control, and high accuracy in context-free tasks but lacks generalizability, struggles with contextual nuances, and requires significant resources to develop and maintain. The LLM-based system achieves higher performance, faster deployment, and easier maintenance but has slower runtime, limited control, model dependency and hallucination risks. Our findings highlight the "hidden cost" of deploying NLP systems in real-world applications, emphasizing the need to balance performance, efficiency, and control.

63. 【2510.12001】Generate Logical Equivalence Questions

链接https://arxiv.org/abs/2510.12001

作者:Xinyu Wang,Haoming Yu,Yicheng Yang,Zhiyuan Li

类目:Computation and Language (cs.CL)

关键词:Academic dishonesty, questions, AQG, higher education, teaching and learning

备注

点击查看摘要

Abstract:Academic dishonesty is met with zero tolerance in higher education, yet plagiarism has become increasingly prevalent in the era of online teaching and learning. Automatic Question Generation (AQG) presents a potential solution to mitigate copying by creating unique questions for each student. Additionally, AQG can provide a vast array of practice questions. Our AQG focuses on generating logical equivalence questions for Discrete Mathematics, a foundational course for first-year computer science students. A literature review reveals that existing AQGs for this type of question generate all propositions that meet user-defined constraints, resulting in inefficiencies and a lack of uniform question difficulty. To address this, we propose a new approach that defines logical equivalence questions using a formal language, translates this language into two sets of generation rules, and develops a linear-time algorithm for question generation. We evaluated our AQG through two experiments. The first involved a group of students completing questions generated by our system. Statistical analysis shows that the accuracy of these questions is comparable to that of textbook questions. The second experiment assessed the number of steps required to solve our generated questions, textbook questions, and those generated by multiple large language models. The results indicated that the difficulty of our questions was similar to that of textbook questions, confirming the quality of our AQG.

64. 【2510.12000】UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

链接https://arxiv.org/abs/2510.12000

作者:Jinchuan Tian,Sang-gil Lee,Zhifeng Kong,Sreyan Ghosh,Arushi Goel,Chao-Han Huck Yang,Wenliang Dai,Zihan Liu,Hanrong Ye,Shinji Watanabe,Mohammad Shoeybi,Bryan Catanzaro,Rafael Valle,Wei Ping

类目:ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Recent advances, domain tackle audio, audio language modeling, domain tackle, audio

备注

点击查看摘要

Abstract:Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks -- an essential step toward advanced multimodal reasoning. This paper introduces U}nified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.

65. 【2510.11997】SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation

链接https://arxiv.org/abs/2510.11997

作者:Ryan Shea,Yunan Lu,Liang Qiu,Zhou Yu

类目:Computation and Language (cs.CL)

关键词:Evaluating multi-turn interactive, Evaluating multi-turn, human assessment, challenging due, Evaluating

备注

点击查看摘要

Abstract:Evaluating multi-turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative, however existing approaches typically model generic users and overlook the domain-specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. We further integrate bottom-up knowledge taken from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users' information needs and expectations in a company's target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support bug-finding and iterative agent improvement.

66. 【2510.11986】Conjecturing: An Overlooked Step in Formal Mathematical Reasoning

链接https://arxiv.org/abs/2510.11986

作者:Jasivan Alex Sivakumar,Philipp Borchert,Ronald Cardenas,Gerasimos Lampouras

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:direct translation process, expressing informal mathematical, informal mathematical statements, translation process, expressing informal

备注

点击查看摘要

Abstract:Autoformalisation, the task of expressing informal mathematical statements in formal language, is often viewed as a direct translation process. This, however, disregards a critical preceding step: conjecturing. Many mathematical problems cannot be formalised directly without first conjecturing a conclusion such as an explicit answer, or a specific bound. Since Large Language Models (LLMs) already struggle with autoformalisation, and the evaluation of their conjecturing ability is limited and often entangled within autoformalisation or proof, it is particularly challenging to understand its effect. To address this gap, we augment existing datasets to create ConjectureBench, and redesign the evaluation framework and metric specifically to measure the conjecturing capabilities of LLMs both as a distinct task and within the autoformalisation pipeline. Our evaluation of foundational models, including GPT-4.1 and DeepSeek-V3.1, reveals that their autoformalisation performance is substantially overestimated when the conjecture is accounted for during evaluation. However, the conjecture should not be assumed to be provided. We design an inference-time method, Lean-FIRe to improve conjecturing and autoformalisation, which, to the best of our knowledge, achieves the first successful end-to-end autoformalisation of 13 PutnamBench problems with GPT-4.1 and 7 with DeepSeek-V3.1. We demonstrate that while LLMs possess the requisite knowledge to generate accurate conjectures, improving autoformalisation performance requires treating conjecturing as an independent task, and investigating further how to correctly integrate it within autoformalisation. Finally, we provide forward-looking guidance to steer future research toward improving conjecturing, an overlooked step of formal mathematical reasoning.

67. 【2510.11977】Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

链接https://arxiv.org/abs/2510.11977

作者:Sayash Kapoor,Benedikt Stroebl,Peter Kirgis,Nitya Nadgir,Zachary S Siegel,Boyi Wei,Tianci Xue,Ziru Chen,Felix Chen,Saiteja Utpala,Franck Ndzomga,Dheeraj Oruganty,Sophie Luskin,Kangheng Liu,Botao Yu,Amit Arora,Dongyoon Hahm,Harsh Trivedi,Huan Sun,Juyong Lee,Tengjun Jin,Yifan Mai,Yifei Zhou,Yuxuan Zhu,Rishi Bommasani,Daniel Kang,Dawn Song,Peter Henderson,Yu Su,Percy Liang,Arvind Narayanan

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:complex real-world tasks, developed for complex, complex real-world, Holistic Agent Leaderboard, agent

备注

点击查看摘要

Abstract:AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.

68. 【2510.11967】Scaling Long-Horizon LLM Agent via Context-Folding

链接https://arxiv.org/abs/2510.11967

作者:Weiwei Sun,Miao Lu,Zhan Ling,Kang Liu,Xuesong Yao,Yiming Yang,Jiecao Chen

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language model, Large language, fundamentally constrained, LLM, Large

备注

点击查看摘要

Abstract:Large language model (LLM) agents are fundamentally constrained by context length on long-horizon tasks. We introduce Context-Folding, a framework that empowers agents to actively manage their working context. An agent can procedurally branch into a sub-trajectory to handle a subtask and then fold it upon completion, collapsing the intermediate steps while retaining a concise summary of the outcome. To make this behavior learnable, we develop an end-to-end reinforcement learning framework FoldGRPO with specific process rewards to encourage effective task decomposition and context management. On complex long-horizon tasks (Deep Research and SWE), our folding agent matches or outperforms the ReAct baselines while using an active context 10$\times$ smaller and significantly outperforms models that rely on summarization-based context management.

69. 【2510.11958】Direct Multi-Token Decoding

链接https://arxiv.org/abs/2510.11958

作者:Xuan Luo,Weizhi Wang,Xifeng Yan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Decoder-only transformers, large language models, middle layers, standard architecture, architecture for large

备注

点击查看摘要

Abstract:Decoder-only transformers have become the standard architecture for large language models (LLMs) due to their strong performance. Recent studies suggest that, in pre-trained LLMs, early, middle, and late layers may serve distinct roles: Early layers focus on understanding the input context, middle layers handle task-specific processing, and late layers convert abstract representations into output tokens. We hypothesize that once representations have been processed by the early and middle layers, the resulting hidden states may encapsulate sufficient information to support the generation of multiple tokens using only the late layers, eliminating the need to repeatedly traverse the early and middle layers. We refer to this inference paradigm as Direct Multi-Token Decoding (DMTD). Unlike speculative decoding, our method introduces no additional parameters, auxiliary routines, or post-generation verification. Despite being trained on a limited dataset, a fine-tuned DMTD Qwen3-4B model has already demonstrated promising results, achieving up to a 2x speedup with only minor performance loss. Moreover, as shown in our scaling analysis, its performance is expected to further improve with larger training datasets.

70. 【2510.11956】Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries

链接https://arxiv.org/abs/2510.11956

作者:Gabrielle Kaili-May Liu,Bryan Li,Arman Cohan,William Gantt Walden,Eugene Yang

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Real-world use cases, RAG systems, RAG, underline, relevant information

备注

点击查看摘要

Abstract:Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which often can be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability for such benchmarks to uncover limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of un$\underline{c}$heatable, $\underline{r}$ealistic, $\underline{u}$nanswerable, and $\underline{m}$ulti-hop $\underline{q}$uerie$\underline{s}$ (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to 81.0\% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and drive development of more capable RAG systems.

71. 【2510.11952】GRAVITY: A Framework for Personalized Text Generation via Profile-Grounded Synthetic Preferences

链接https://arxiv.org/abs/2510.11952

作者:Priyanka Dey,Daniele Rosa,Wenqing Zheng,Daniel Barcklow,Jieyu Zhao,Emilio Ferrara

类目:Computation and Language (cs.CL)

关键词:deeper user attributes, neglecting deeper user, interaction logs, limiting scalability, costly human feedback

备注

点击查看摘要

Abstract:Personalization in LLMs often relies on costly human feedback or interaction logs, limiting scalability and neglecting deeper user attributes. To reduce the reliance on human annotations, we introduce GRAVITY (Generative Response with Aligned Values, Interests, and Traits of You), a framework for generating synthetic, profile-grounded preference data that captures users' interests, values, beliefs, and personality traits. By integrating demographic, cultural, and psychological frameworks -- including Hofstede's cultural dimensions, Schwartz's basic values, the World Values Survey, and Big Five OCEAN traits -- GRAVITY synthesizes preference pairs to guide personalized content generation. We evaluate GRAVITY on book descriptions for 400 Amazon users, comparing it to prompt-based conditioning, standard fine-tuning, and naive synthetic pair generation. Profile-grounded synthetic data consistently improves generation, especially across multiple cultures (USA, Brazil, Japan, India), achieving over 4% higher preference gains across baselines, with user studies showing that GRAVITY outputs are preferred over 86% of the time. Our results show that scenario-grounded synthetic data can capture richer user variation, reduce reliance on costly annotation, and produce more engaging, user-centered content, offering a scalable path for LLM personalization.

72. 【2510.11944】opoAlign: A Framework for Aligning Code to Math via Topological Decomposition

链接https://arxiv.org/abs/2510.11944

作者:Yupei Li,Philipp Borchert,Gerasimos Lampouras

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, Math LLMs, task of transforming, formal statements

备注

点击查看摘要

Abstract:Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.

73. 【2510.11928】Discrepancy Detection at the Data Level: Toward Consistent Multilingual Question Answering

链接https://arxiv.org/abs/2510.11928

作者:Lorena Calvo-Bartolomé,Valérie Aldana,Karla Cantarero,Alonso Madroñal de Mesa,Jerónimo Arenas-García,Jordan Boyd-Graber

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:ensure factual consistency, Multilingual question answering, consistency across languages, subjective responses, objective queries

备注: Long paper accepted at EMNLP 2025

点击查看摘要

Abstract:Multilingual question answering (QA) systems must ensure factual consistency across languages, especially for objective queries such as What is jaundice?, while also accounting for cultural variation in subjective responses. We propose MIND, a user-in-the-loop fact-checking pipeline to detect factual and cultural discrepancies in multilingual QA knowledge bases. MIND highlights divergent answers to culturally sensitive questions (e.g., Who assists in childbirth?) that vary by region and context. We evaluate MIND on a bilingual QA system in the maternal and infant health domain and release a dataset of bilingual questions annotated for factual and cultural inconsistencies. We further test MIND on datasets from other domains to assess generalization. In all cases, MIND reliably identifies inconsistencies, supporting the development of more culturally aware and factually consistent QA systems.

74. 【2510.11919】LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens

链接https://arxiv.org/abs/2510.11919

作者:Armel Zebaze,Rachel Bawden,Benoît Sagot

类目:Computation and Language (cs.CL)

关键词:Large reasoning models, Large reasoning, thought process prior, terms of problem-solving, answering a query

备注

点击查看摘要

Abstract:Large reasoning models (LRMs) have led to new possibilities in terms of problem-solving, through the devising of a natural language thought process prior to answering a query. While their capabilities are well known across mathematics and coding tasks, their impact on the task of machine translation (MT) remains underexplored. In this work, we explore the benefits of the generation of intermediate tokens when performing MT across multiple language pairs of different levels of resourcedness and multiple setups. We find that "thinking tokens" do not help LRMs better perform MT. This result generalizes to models fine-tuned to reason before translating using distilled chain of thought (CoT) inspired by human translators' practices. Specifically, fine-tuning a model with synthetic CoT explanations detailing how to translate step-by-step does not outperform standard input-output fine-tuning. However, constructing the intermediate tokens by combining the outputs of modular translation-specific prompting strategies results in improvements. Our findings underscore that the contribution of intermediate tokens during fine-tuning highly depends on the presence of translation attempts within them. More broadly, our results suggest that using a teacher to refine target translations or to expand parallel corpora is more impactful than distilling their CoT explanations into "thinking" MT models.

75. 【2510.11905】LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance

链接https://arxiv.org/abs/2510.11905

作者:Patrick Haller,Mark Ibrahim,Polina Kirichenko,Levent Sagun,Samuel J. Bell

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, Language Models, diverse settings, generally applied

备注

点击查看摘要

Abstract:For Large Language Models (LLMs) to be reliable, they must learn robust knowledge that can be generally applied in diverse settings -- often unlike those seen during training. Yet, extensive research has shown that LLM performance can be brittle, with models exhibiting excessive sensitivity to trivial input variations. In this work, we explore whether this brittleness is a direct result of unstable internal knowledge representations. To explore this question, we build on previous work showing that LLM representations encode statement truthfulness -- i.e., true, factual statements can be easily separated from false, inaccurate ones. Specifically, we test the robustness of learned knowledge by evaluating representation separability on samples that have undergone superficial transformations to drive them out-of-distribution (OOD), such as typos or reformulations. By applying semantically-preserving perturbations, we study how separability degrades as statements become more OOD, across four LLM families, five evaluation datasets, and three knowledge probing methods. Our results reveal that internal representations of statement truthfulness collapse as the samples' presentations become less similar to those seen during pre-training. While LLMs can often distinguish between true and false statements when they closely resemble the pre-training data, this ability is highly dependent on the statement's exact surface form. These findings offer a possible explanation for brittle benchmark performance: LLMs may learn shallow, non-robust knowledge representations that allow for only limited generalizability. Our work presents a fundamental challenge for the utility of truthfulness probes, and more broadly, calls for further research on improving the robustness of learned knowledge representations.

76. 【2510.11892】R-WoM: Retrieval-augmented World Model For Computer-use Agents

链接https://arxiv.org/abs/2510.11892

作者:Kai Mei,Jiang Guo,Shuaichen Chang,Mingwen Dong,Dongkyu Lee,Xing Niu,Jiarong Jiang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, potentially eliminating costly, predicting action outcomes, enhance agent decision-making

备注

点击查看摘要

Abstract:Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency toward hallucination and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models--future state prediction and reward estimation--through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantages in longer-horizon simulations.

77. 【2510.11851】Deep Research Brings Deeper Harm

链接https://arxiv.org/abs/2510.11851

作者:Shuo Chen,Zonggen Li,Zhen Han,Bailan He,Tong Liu,Haokun Chen,Georg Groh,Philip Torr,Volker Tresp,Jindong Gu

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, retrieving online information, built on Large

备注: Accepted to Reliable ML from Unreliable Data Workshop @ NeurIPS 2025

点击查看摘要

Abstract:Deep Research (DR) agents built on Large Language Models (LLMs) can perform complex, multi-step research by decomposing tasks, retrieving online information, and synthesizing detailed reports. However, the misuse of LLMs with such powerful capabilities can lead to even greater risks. This is especially concerning in high-stakes and knowledge-intensive domains such as biosecurity, where DR can generate a professional report containing detailed forbidden knowledge. Unfortunately, we have found such risks in practice: simply submitting a harmful query, which a standalone LLM directly rejects, can elicit a detailed and dangerous report from DR agents. This highlights the elevated risks and underscores the need for a deeper safety analysis. Yet, jailbreak methods designed for LLMs fall short in exposing such unique risks, as they do not target the research ability of DR agents. To address this gap, we propose two novel jailbreak strategies: Plan Injection, which injects malicious sub-goals into the agent's plan; and Intent Hijack, which reframes harmful queries as academic research questions. We conducted extensive experiments across different LLMs and various safety benchmarks, including general and biosecurity forbidden prompts. These experiments reveal 3 key findings: (1) Alignment of the LLMs often fail in DR agents, where harmful prompts framed in academic terms can hijack agent intent; (2) Multi-step planning and execution weaken the alignment, revealing systemic vulnerabilities that prompt-level safeguards cannot address; (3) DR agents not only bypass refusals but also produce more coherent, professional, and dangerous content, compared with standalone LLMs. These results demonstrate a fundamental misalignment in DR agents and call for better alignment techniques tailored to DR agents. Code and datasets are available at this https URL.

78. 【2510.11842】Balancing Synthetic Data and Replay for Enhancing Task-Specific Capabilities

链接https://arxiv.org/abs/2510.11842

作者:Urs Spiegelhalter,Jörg K.H. Franke,Frank Hutter

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:continued pretraining faces, avoiding catastrophic forgetting, Adapting language models, fundamental trade-off, Adapting language

备注: Presented at 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop on Continual and Compatible Foundation Model Updates (CCFM)

点击查看摘要

Abstract:Adapting language models to new tasks through continued pretraining faces a fundamental trade-off: models must learn new capabilities while avoiding catastrophic forgetting of existing knowledge. While prior work has studied synthetic data generation techniques, the optimal replay ratios for balancing task performance and knowledge retention under computational constraints remain poorly understood. We present a comprehensive empirical study investigating the interplay between replay ratio configuration and computational budget when adapting language models to new tasks. Using the bAbI reasoning tasks as our target objective, we apply synthetic data generation and systematically evaluate different total token budgets and replay ratio configurations. We analyze their effects on both task mastery and general knowledge retention. Our experiments reveal an optimal configuration that balances task-specific performance with general knowledge retention. Based on our findings, we provide empirically-grounded guidelines for selecting replay ratios based on computational budget, enabling practitioners to achieve strong task adaptation with significantly reduced training costs.

79. 【2510.11835】Data or Language Supervision: What Makes CLIP Better than DINO?

链接https://arxiv.org/abs/2510.11835

作者:Yiming Liu,Yuhui Zhang,Dhruba Ghosh,Ludwig Schmidt,Serena Yeung-Levy

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)

关键词:larger training data, outperforms self-supervised models, CLIP outperforms self-supervised, self-supervised models, vision-language models

备注: EMNLP 2025 Findings

点击查看摘要

Abstract:CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP's language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings -- using the same architecture, dataset, and training configuration -- achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.

80. 【2510.11834】Don't Walk the Line: Boundary Guidance for Filtered Generation

链接https://arxiv.org/abs/2510.11834

作者:Sarah Ball,Andreas Haupt

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Generative models, increasingly paired, filter harmful, harmful or undesirable, Boundary Guidance

备注: 9 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.

81. 【2510.11813】ask-Aware Reduction for Scalable LLM-Database Systems

链接https://arxiv.org/abs/2510.11813

作者:Marcus Emmanuel Barnes,Taher A. Ghaleb,Safwat Hassan

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL); Databases (cs.DB)

关键词:Large Language Models, Large Language, developer observability, increasingly applied, querying to developer

备注: Preprint. Accepted for presentation at the Workshop on Language Models and Databases (LMD), co-located with CASCON 2025 (IEEE). The final version will appear in IEEE Xplore

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly applied to data-intensive workflows, from database querying to developer observability. Yet the effectiveness of these systems is constrained by the volume, verbosity, and noise of real-world text-rich data such as logs, telemetry, and monitoring streams. Feeding such data directly into LLMs is costly, environmentally unsustainable, and often misaligned with task objectives. Parallel efforts in LLM efficiency have focused on model- or architecture-level optimizations, but the challenge of reducing upstream input verbosity remains underexplored. In this paper, we argue for treating the token budget of an LLM as an attention budget and elevating task-aware text reduction as a first-class design principle for language -- data systems. We position input-side reduction not as compression, but as attention allocation: prioritizing information most relevant to downstream tasks. We outline open research challenges for building benchmarks, designing adaptive reduction pipelines, and integrating token-budget--aware preprocessing into database and retrieval systems. Our vision is to channel scarce attention resources toward meaningful signals in noisy, data-intensive workflows, enabling scalable, accurate, and sustainable LLM--data integration.

82. 【2510.11812】PHANTOM RECALL: When Familiar Puzzles Fool Smart Models

链接https://arxiv.org/abs/2510.11812

作者:Souradeep Mukhopadhyay,Rishabh Baral,Nimeesh Mahajan,Samhitha Harish,Aswin RRV,Mihir Parmar,Mutsumi Nakamura,Chitta Baral

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Large language, solving classic logic, genuine reasoning underlies, underlies their answers

备注: 22 Pages

点击查看摘要

Abstract:Large language models (LLMs) such as GPT, Gemini, and Claude often appear adept at solving classic logic puzzles--but how much genuine reasoning underlies their answers? Recent evidence suggests that these models frequently rely on memorized templates rather than reasoning from first principles. When puzzles are slightly modified, their performance collapses, revealing a striking fragility. In particular, we asked: Have LLMs addressed these issues? To what extent? How about perturbations to other puzzles? Is there a general way of reformulating the prompt so that the models do better? To examine these things systematically, we introduce PHANTOM RECALL, a benchmark comprising 25 well-known logic puzzles and 149 carefully designed perturbations that preserve reasoning structure but alter superficial details and solutions. We evaluate eleven leading LLMs and identify a recurring failure mode--phantom recall--where models confidently reproduce memorized solutions or spurious rationales that no longer fit the altered scenario. To probe and mitigate this issue, we contribute three tools: (i) an automated logical-equivalence judge to detect reasoning mismatches, (ii) a taxonomy of fine-grained reasoning error categories, and (iii) a prompting-based mitigation framework guided by these categories. Despite near-perfect accuracy on unmodified puzzles, models significantly underperform humans on perturbed ones, exhibiting both phantom recall and over-elaboration. Our findings reveal a crucial limitation: LLMs often fail to re-reason when contextual cues shift--highlighting the gap between linguistic fluency and logical understanding.

83. 【2510.11746】Evolution of wartime discourse on Telegram: A comparative study of Ukrainian and Russian policymakers' communication before and after Russia's full-scale invasion of Ukraine

链接https://arxiv.org/abs/2510.11746

作者:Mykola Makhortykh,Aytalina Kulichkina,Kateryna Maikovska

类目:ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:social media era, study examines elite-driven, large-scale European war, large-scale European, examines elite-driven political

备注: 46 pages

点击查看摘要

Abstract:This study examines elite-driven political communication on Telegram during the ongoing Russo-Ukrainian war, the first large-scale European war in the social media era. Using a unique dataset of Telegram public posts from Ukrainian and Russian policymakers (2019-2024), we analyze changes in communication volume, thematic content, and actor engagement following Russia's 2022 full-scale invasion. Our findings show a sharp increase in Telegram activity after the invasion, particularly among ruling-party policymakers. Ukrainian policymakers initially focused on war-related topics, but this emphasis declined over time In contrast, Russian policymakers largely avoided war-related discussions, instead emphasizing unrelated topics, such as Western crises, to distract public attention. We also identify differences in communication strategies between large and small parties, as well as individual policymakers. Our findings shed light on how policymakers adapt to wartime communication challenges and offer critical insights into the dynamics of online political discourse during times of war.

84. 【2510.11739】Celebrity Profiling on Short Urdu Text using Twitter Followers' Feed

链接https://arxiv.org/abs/2510.11739

作者:Muhammad Hamza,Rizwan Jafar

类目:ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:essential part, Social media, information sharing, Urdu, Convolutional Neural Networks

备注

点击查看摘要

Abstract:Social media has become an essential part of the digital age, serving as a platform for communication, interaction, and information sharing. Celebrities are among the most active users and often reveal aspects of their personal and professional lives through online posts. Platforms such as Twitter provide an opportunity to analyze language and behavior for understanding demographic and social patterns. Since followers frequently share linguistic traits and interests with the celebrities they follow, textual data from followers can be used to predict celebrity demographics. However, most existing research in this field has focused on English and other high-resource languages, leaving Urdu largely unexplored. This study applies modern machine learning and deep learning techniques to the problem of celebrity profiling in Urdu. A dataset of short Urdu tweets from followers of subcontinent celebrities was collected and preprocessed. Multiple algorithms were trained and compared, including Logistic Regression, Support Vector Machines, Random Forests, Convolutional Neural Networks, and Long Short-Term Memory networks. The models were evaluated using accuracy, precision, recall, F1-score, and cumulative rank (cRank). The best performance was achieved for gender prediction with a cRank of 0.65 and an accuracy of 0.65, followed by moderate results for age, profession, and fame prediction. These results demonstrate that follower-based linguistic features can be effectively leveraged using machine learning and neural approaches for demographic prediction in Urdu, a low-resource language.

Subjects:

Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as:
arXiv:2510.11739 [cs.SI]

(or
arXiv:2510.11739v1 [cs.SI] for this version)

https://doi.org/10.48550/arXiv.2510.11739

Focus to learn more

              arXiv-issued DOI via DataCite</p>
85. 【2510.11734】Scaling Law in LLM Simulated Personality: More Detailed and Realistic Persona Profile Is All You Need

链接https://arxiv.org/abs/2510.11734

作者:Yuqi Bai,Tianyu Huang,Kun Sun,Yuting Chen

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:emulate human personality, exploring their ability, ability to emulate, simulate social experiments, large language models

备注

点击查看摘要

Abstract:This research focuses on using large language models (LLMs) to simulate social experiments, exploring their ability to emulate human personality in virtual persona role-playing. The research develops an end-to-end evaluation framework, including individual-level analysis of stability and identifiability, as well as population-level analysis called progressive personality curves to examine the veracity and consistency of LLMs in simulating human personality. Methodologically, this research proposes important modifications to traditional psychometric approaches (CFA and construct validity) which are unable to capture improvement trends in LLMs at their current low-level simulation, potentially leading to remature rejection or methodological misalignment. The main contributions of this research are: proposing a systematic framework for LLM virtual personality evaluation; empirically demonstrating the critical role of persona detail in personality simulation quality; and identifying marginal utility effects of persona profiles, especially a Scaling Law in LLM personality simulation, offering operational evaluation metrics and a theoretical foundation for applying large language models in social science experiments.

86. 【2510.11584】LLMAtKGE: Large Language Models as Explainable Attackers against Knowledge Graph Embeddings

链接https://arxiv.org/abs/2510.11584

作者:Ting Li,Yang Yang,Yipeng Yu,Liang Yao,Guoqing Chao,Ruifeng Xu

类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR)

关键词:aim to disrupt, Adversarial attacks, ability of link, link prediction, prediction by removing

备注: 13 pages

点击查看摘要

Abstract:Adversarial attacks on knowledge graph embeddings (KGE) aim to disrupt the model's ability of link prediction by removing or inserting triples. A recent black-box method has attempted to incorporate textual and structural information to enhance attack performance. However, it is unable to generate human-readable explanations, and exhibits poor generalizability. In the past few years, large language models (LLMs) have demonstrated powerful capabilities in text comprehension, generation, and reasoning. In this paper, we propose LLMAtKGE, a novel LLM-based framework that selects attack targets and generates human-readable explanations. To provide the LLM with sufficient factual context under limited input constraints, we design a structured prompting scheme that explicitly formulates the attack as multiple-choice questions while incorporating KG factual evidence. To address the context-window limitation and hesitation issues, we introduce semantics-based and centrality-based filters, which compress the candidate set while preserving high recall of attack-relevant information. Furthermore, to efficiently integrate both semantic and structural information into the filter, we precompute high-order adjacency and fine-tune the LLM with a triple classification task to enhance filtering performance. Experiments on two widely used knowledge graph datasets demonstrate that our attack outperforms the strongest black-box baselines and provides explanations via reasoning, and showing competitive performance compared with white-box methods. Comprehensive ablation and case studies further validate its capability to generate explanations.

87. 【2510.10681】RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

链接https://arxiv.org/abs/2510.10681

作者:Zichun Yu,Chenyan Xiong

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:large language models, large language, reserves are running, running low, low for frontier

备注

点击查看摘要

Abstract:High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at this https URL.

88. 【2510.09011】ripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation

链接https://arxiv.org/abs/2510.09011

作者:Yincen Qu,Huan Xiao,Feng Li,Gregory Li,Hui Zhou,Xiangying Dai

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:poses significant challenges, advanced large language, large language models, valuable yet complex, complex task

备注

点击查看摘要

Abstract:Travel planning is a valuable yet complex task that poses significant challenges even for advanced large language models (LLMs). While recent benchmarks have advanced in evaluating LLMs' planning capabilities, they often fall short in evaluating feasibility, reliability, and engagement of travel plans. We introduce a comprehensive benchmark for travel planning that unifies fine-grained criteria into a single reward, enabling direct comparison of plan quality and seamless integration with reinforcement learning (RL). Our evaluator achieves moderate agreement with travel-expert annotations (60.75%) and outperforms multiple LLM-as-judge baselines. We further release a large-scale dataset of 4,870 queries including 219 real-world, free-form requests for generalization to authentic user intent. Using this benchmark, we conduct extensive experiments across diverse methods and LLMs, including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and RL via GRPO. Across base models, RL generally improves itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.

89. 【2510.12210】DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation

链接https://arxiv.org/abs/2510.12210

作者:Yakun Song,Xiaobin Zhuang,Jiawei Chen,Zhikang Niu,Guanrou Yang,Chenpeng Du,Zhuo Chen,Yuping Wang,Yuxuan Wang,Xie Chen

类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:continuous speech representations, offer limited levers, Recent attempts, interleave autoregressive, sketchers with diffusion-based

备注

点击查看摘要

Abstract:Recent attempts to interleave autoregressive (AR) sketchers with diffusion-based refiners over continuous speech representations have shown promise, but they remain brittle under distribution shift and offer limited levers for controllability. We introduce DISTAR, a zero-shot text-to-speech framework that operates entirely in a discrete residual vector quantization (RVQ) code space and tightly couples an AR language model with a masked diffusion model, without forced alignment or a duration predictor. Concretely, DISTAR drafts block-level RVQ tokens with an AR language model and then performs parallel masked-diffusion infilling conditioned on the draft to complete the next block, yielding long-form synthesis with blockwise parallelism while mitigating classic AR exposure bias. The discrete code space affords explicit control at inference: DISTAR produces high-quality audio under both greedy and sample-based decoding using classifier-free guidance, supports trade-offs between robustness and diversity, and enables variable bit-rate and controllable computation via RVQ layer pruning at test time. Extensive experiments and ablations demonstrate that DISTAR surpasses state-of-the-art zero-shot TTS systems in robustness, naturalness, and speaker/style consistency, while maintaining rich output diversity. Audio samples are provided on this https URL.

信息检索

1. 【2510.12801】DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

链接https://arxiv.org/abs/2510.12801

作者:Kartik Narayan,Yang Xu,Tian Cao,Kavya Nerella,Vishal M. Patel,Navid Shiee,Peter Grasch,Chao Jia,Yinfei Yang,Zhe Gan

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:Large Language Models, Multimodal Large Language, Large Language, applications require access, external knowledge sources

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to the dynamic and ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval augmented generation (RAG) methods, search agents, and search equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image making the image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold start supervised finetuning phase followed by an online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web-search.

2. 【2510.12742】CTRL-Rec: Controlling Recommender Systems With Natural Language

链接https://arxiv.org/abs/2510.12742

作者:Micah Carroll,Adeline Foote,Kevin Feng,Marcus Williams,Anca Dragan,W. Bradley Knox,Smitha Milli

类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:traditional recommender systems, recommender systems, language requests, language, users

备注

点击查看摘要

Abstract:When users are dissatisfied with recommendations from a recommender system, they often lack fine-grained controls for changing them. Large language models (LLMs) offer a solution by allowing users to guide their recommendations through natural language requests (e.g., "I want to see respectful posts with a different perspective than mine"). We propose a method, CTRL-Rec, that allows for natural language control of traditional recommender systems in real-time with computational efficiency. Specifically, at training time, we use an LLM to simulate whether users would approve of items based on their language requests, and we train embedding models that approximate such simulated judgments. We then integrate these user-request-based predictions into the standard weighting of signals that traditional recommender systems optimize. At deployment time, we require only a single LLM embedding computation per user request, allowing for real-time control of recommendations. In experiments with the MovieLens dataset, our method consistently allows for fine-grained control across a diversity of requests. In a study with 19 Letterboxd users, we find that CTRL-Rec was positively received by users and significantly enhanced users' sense of control and satisfaction with recommendations compared to traditional controls.

3. 【2510.12709】SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model

链接https://arxiv.org/abs/2510.12709

作者:Lin Lin,Jiefeng Long,Zhihe Wan,Yuchi Wang,Dingkang Yang,Shuang Yang,Yueyang Yao,Xu Chen,Zirui Guo,Shengqiang Li,Weiran Li,Hanyu Li,Yaling Mou,Yan Qiu,Haiyang Yu,Xiao Liang,Hongsheng Li,Chao Feng

类目:Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

关键词:informative unified representations, yield informative unified, informative unified, empower diverse cross-modal, training

备注: Technical Report

点击查看摘要

Abstract:Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as the limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, the content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining user historical interests. Concurrently, we develop the stochastic specialization and dataset-driven pattern matching to strengthen model training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves SOTA performance compared to other methods in different retrieval tasks. In online experiments across various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), which is a crucial indicator for the recommendation experience. For instance, the model delivers the 7-day LT gain of +0.5% in the Douyin-Selected scenario. For the Douyin feed rank model, the match features produced by SAIL-Embedding yield a +0.1% AUC gain.

4. 【2510.12668】he Role of Parametric Injection-A Systematic Study of Parametric Retrieval-Augmented Generation

链接https://arxiv.org/abs/2510.12668

作者:Minghao Tang,Shiyu Ni,Jingtong Wu,Zengxin Han,Keping Bi

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:Retrieval-augmented generation, parametric retrieval-augmented generation, large language models, enhances large language, retrieving external documents

备注

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving external documents. As an emerging form of RAG, parametric retrieval-augmented generation (PRAG) encodes documents as model parameters (i.e., LoRA modules) and injects these representations into the model during inference, enabling interaction between the LLM and documents at parametric level. Compared with directly placing documents in the input context, PRAG is more efficient and has the potential to offer deeper model-document interaction. Despite its growing attention, the mechanism underlying parametric injection remains poorly understood. In this work, we present a systematic study of PRAG to clarify the role of parametric injection, showing that parameterized documents capture only partial semantic information of documents, and relying on them alone yields inferior performance compared to interaction at text level. However, these parametric representations encode high-level document information that can enhance the model's understanding of documents within the input context. When combined parameterized documents with textual documents, the model can leverage relevant information more effectively and become more robust to noisy inputs, achieving better performance than either source alone. We recommend jointly using parameterized and textual documents and advocate for increasing the information content of parametric representations to advance PRAG.

5. 【2510.12604】SMILE: SeMantic Ids Enhanced CoLd Item Representation for Click-through Rate Prediction in E-commerce SEarch

链接https://arxiv.org/abs/2510.12604

作者:Qihang Zhao,Zhongbo Sun,Xiaoyang Zheng,Xian Guo,Siyuan Wang,Zihan Liang,Mingcan Peng,Ben Chen,Chenyi Lei

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:challenging platform diversity, exacerbates the Matthew, Matthew effect, insufficient collaborative information, collaborative information

备注

点击查看摘要

Abstract:With the rise of modern search and recommendation platforms, insufficient collaborative information of cold-start items exacerbates the Matthew effect of existing platform items, challenging platform diversity and becoming a longstanding issue. Existing methods align items' side content with collaborative information to transfer collaborative signals from high-popularity items to cold-start items. However, these methods fail to account for the asymmetry between collaboration and content, nor the fine-grained differences among items. To address these issues, we propose SMILE, an item representation enhancement approach based on fused alignment of semantic IDs. Specifically, we use RQ-OPQ encoding to quantize item content and collaborative information, followed by a two-step alignment: RQ encoding transfers shared collaborative signals across items, while OPQ encoding learns differentiated information of items. Comprehensive offline experiments on large-scale industrial datasets demonstrate superiority of SMILE, and rigorous online A/B tests confirm statistically significant improvements: item CTR +1.66%, buyers +1.57%, and order volume +2.17%.

6. 【2510.12461】Leveraging Language Semantics for Collaborative Filtering with TextGCN and TextGCN-MLP: Zero-Shot vs In-Domain Performance

链接https://arxiv.org/abs/2510.12461

作者:Andrei Chernov,Haroon Wahab,Oleg Novitskij

类目:Information Retrieval (cs.IR)

关键词:incorporating textual information, leverage large language, recent years, recommender systems, large language models

备注

点击查看摘要

Abstract:In recent years, various approaches have been proposed to leverage large language models (LLMs) for incorporating textual information about items into recommender systems. Existing methods primarily focus on either fine-tuning LLMs to generate recommendations or integrating LLM-based embeddings into downstream models. In this work, we follow the latter direction and propose \textbf{TextGCN}, which applies parameter-free graph convolution layers directly over LLM-based item-title embeddings, instead of learning ID-based embeddings as in traditional methods. By combining language semantics with graph message passing, this architecture achieves state-of-the-art zero-shot performance, significantly outperforming prior approaches. Furthermore, we introduce \textbf{TextGCN-MLP}, which extends TextGCN with a trainable multilayer perceptron trained using a contrastive loss, achieving state-of-the-art in-domain performance on recommendation benchmarks. However, the zero-shot performance of TextGCN-MLP remains lower than that of TextGCN, highlighting the trade-off between in-domain specialization and zero-shot generalization. We release our code on github at \href{this https URL}{this http URL}.

7. 【2510.12369】A Hierarchical Quantized Tokenization Framework for Task-Adaptive Graph Representation Learning

链接https://arxiv.org/abs/2510.12369

作者:Yang Xiang,Li Fan,Chenke Yin,Chengtao Ji

类目:Information Retrieval (cs.IR)

关键词:vision foundation models, discrete token interfaces, transform complex inputs, Recent progress, foundation models demonstrates

备注

点击查看摘要

Abstract:Recent progress in language and vision foundation models demonstrates the importance of discrete token interfaces that transform complex inputs into compact sequences for large-scale modeling. Extending this paradigm to graphs requires a tokenization scheme that handles non-Euclidean structures and multi-scale dependencies efficiently. Existing approaches to graph tokenization, linearized, continuous, and quantized, remain limited in adaptability and efficiency. In particular, most current quantization-based tokenizers organize hierarchical information in fixed or task-agnostic ways, which may either over-represent or under-utilize structural cues, and lack the ability to dynamically reweight contributions from different levels without retraining the encoder. This work presents a hierarchical quantization framework that introduces a self-weighted mechanism for task-adaptive aggregation across multiple scales. The proposed method maintains a frozen encoder while modulating information flow through a lightweight gating process, enabling parameter-efficient adaptation to diverse downstream tasks. Experiments on benchmark datasets for node classification and link prediction demonstrate consistent improvements over strong baselines under comparable computational budgets.

8. 【2510.12327】Simple Projection Variants Improve ColBERT Performance

链接https://arxiv.org/abs/2510.12327

作者:Benjamin Clavié,Sean Lee,Rikiya Takehi,Aamir Shakir,Makoto P. Kato

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Multi-vector dense retrieval, dense retrieval methods, Multi-vector dense, reduce the dimensionality, projection

备注

点击查看摘要

Abstract:Multi-vector dense retrieval methods like ColBERT systematically use a single-layer linear projection to reduce the dimensionality of individual vectors. In this study, we explore the implications of the MaxSim operator on the gradient flows of the training of multi-vector models and show that such a simple linear projection has inherent, if non-critical, limitations in this setting. We then discuss the theoretical improvements that could result from replacing this single-layer projection with well-studied alternative feedforward linear networks (FFN), such as deeper, non-linear FFN blocks, GLU blocks, and skip-connections, could alleviate these limitations. Through the design and systematic evaluation of alternate projection blocks, we show that better-designed final projections positively impact the downstream performance of ColBERT models. We highlight that many projection variants outperform the original linear projections, with the best-performing variants increasing average performance on a range of retrieval benchmarks across domains by over 2 NDCG@10 points. We then conduct further exploration on the individual parameters of these projections block in order to understand what drives this empirical performance, highlighting the particular importance of upscaled intermediate projections and residual connections. As part of these ablation studies, we show that numerous suboptimal projection variants still outperform the traditional single-layer projection across multiple benchmarks, confirming our hypothesis. Finally, we observe that this effect is consistent across random seeds, further confirming that replacing the linear layer of ColBERT models is a robust, drop-in upgrade.

9. 【2510.12325】Causal Inspired Multi Modal Recommendation

链接https://arxiv.org/abs/2510.12325

作者:Jie Yang,Chenyang Gu,Zixuan Liu

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:recommender systems enhance, systems enhance personalized, user-item interaction data, Multimodal recommender systems, enhance personalized recommendations

备注

点击查看摘要

Abstract:Multimodal recommender systems enhance personalized recommendations in e-commerce and online advertising by integrating visual, textual, and user-item interaction data. However, existing methods often overlook two critical biases: (i) modal confounding, where latent factors (e.g., brand style or product category) simultaneously drive multiple modalities and influence user preference, leading to spurious feature-preference associations; (ii) interaction bias, where genuine user preferences are mixed with noise from exposure effects and accidental clicks. To address these challenges, we propose a Causal-inspired multimodal Recommendation framework. Specifically, we introduce a dual-channel cross-modal diffusion module to identify hidden modal confounders, utilize back-door adjustment with hierarchical matching and vector-quantized codebooks to block confounding paths, and apply front-door adjustment combined with causal topology reconstruction to build a deconfounded causal subgraph. Extensive experiments on three real-world e-commerce datasets demonstrate that our method significantly outperforms state-of-the-art baselines while maintaining strong interpretability.

10. 【2510.12299】An Empirical Study for Representations of Videos in Video Question Answering via MLLMs

链接https://arxiv.org/abs/2510.12299

作者:Zhi Li,Yanan Wang,Hao Niu,Julio Vizcarra,Masato Taya

类目:Information Retrieval (cs.IR)

关键词:large language models, recently achieved remarkable, achieved remarkable progress, Multimodal large language, jointly processing visual

备注: 6 pages, 3 figures

点击查看摘要

Abstract:Multimodal large language models have recently achieved remarkable progress in video question answering (VideoQA) by jointly processing visual, textual, and audio information. However, it remains unclear which video representations are most effective for MLLMs, and how different modalities balance task accuracy against computational efficiency. In this work, we present a comprehensive empirical study of video representation methods for VideoQA with MLLMs. We systematically evaluate single modality inputs question only, subtitles, visual frames, and audio signals as well as multimodal combinations, on two widely used benchmarks: VideoMME and LongVideoBench. Our results show that visual frames substantially enhance accuracy but impose heavy costs in GPU memory and inference latency, while subtitles provide a lightweight yet effective alternative, particularly for long videos. These findings highlight clear trade-offs between effectiveness and efficiency and provide practical insights for designing resource-aware MLLM-based VideoQA systems.

11. 【2510.12211】Reinforced Preference Optimization for Recommendation

链接https://arxiv.org/abs/2510.12211

作者:Junfei Tan,Yuxin Chen,An Zhang,Junguang Jiang,Bin Liu,Ziru Xu,Han Zhu,Jian Xu,Bo Zheng,Xiang Wang

类目:Information Retrieval (cs.IR)

关键词:Recent breakthroughs, user behavior modeling, fundamentally shifted recommender, shifted recommender systems, generating target items

备注

点击查看摘要

Abstract:Recent breakthroughs in large language models (LLMs) have fundamentally shifted recommender systems from discriminative to generative paradigms, where user behavior modeling is achieved by generating target items conditioned on historical interactions. Yet current generative recommenders still suffer from two core limitations: the lack of high-quality negative modeling and the reliance on implicit rewards. Reinforcement learning with verifiable rewards (RLVR) offers a natural solution by enabling on-policy sampling of harder negatives and grounding optimization in explicit reward signals. However, applying RLVR to generative recommenders remains non-trivial. Its unique generation space often leads to invalid or repetitive items that undermine sampling efficiency, and ranking supervision is sparse since most items receive identical zero rewards. To address these challenges, we propose Reinforced Preference Optimization for Recommendation (ReRe), a reinforcement-based paradigm tailored to LLM-based recommenders, an important direction in generative recommendation. ReRe incorporates constrained beam search to improve sampling efficiency and diversify hard negatives, while augmenting rule-based accuracy rewards with auxiliary ranking rewards for finer-grained supervision. Extensive experiments on three real-world datasets demonstrate that ReRe consistently outperforms both traditional and LLM-based recommenders in ranking performance. Further analysis shows that ReRe not only enhances performance across both base and SFT-initialized models but also generalizes robustly across different backbone families and scales. Beyond empirical gains, we systematically investigate the design space of RLVR in recommendation across generation, sampling strategy, reward modeling, and optimization algorithm, offering insights for future research.

12. 【2510.12054】MIARec: Mutual-influence-aware Heterogeneous Network Embedding for Scientific Paper Recommendation

链接https://arxiv.org/abs/2510.12054

作者:Wenjin Xie,Tao Jia

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:increasingly demand precise, scholars increasingly demand, high-quality paper recommendations, rapid expansion, increasingly demand

备注

点击查看摘要

Abstract:With the rapid expansion of scientific literature, scholars increasingly demand precise and high-quality paper recommendations. Among various recommendation methodologies, graph-based approaches have garnered attention by effectively exploiting the structural characteristics inherent in scholarly networks. However, these methods often overlook the asymmetric academic influence that is prevalent in scholarly networks when learning graph representations. To address this limitation, this study proposes the Mutual-Influence-Aware Recommendation (MIARec) model, which employs a gravity-based approach to measure the mutual academic influence between scholars and incorporates this influence into the feature aggregation process during message propagation in graph representation learning. Additionally, the model utilizes a multi-channel aggregation method to capture both individual embeddings of distinct single relational sub-networks and their interdependent embeddings, thereby enabling a more comprehensive understanding of the heterogeneous scholarly network. Extensive experiments conducted on real-world datasets demonstrate that the MIARec model outperforms baseline models across three primary evaluation metrics, indicating its effectiveness in scientific paper recommendation tasks.

13. 【2510.12014】Embedding the Teacher: Distilling vLLM Preferences for Scalable Image Retrieval

链接https://arxiv.org/abs/2510.12014

作者:Eric He,Akash Gupta,Adian Liusie,Vatsal Raina,Piotr Molenda,Shirom Chabra,Vyas Raina

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:product recommendation, product recommendation applications, persona-driven product recommendation, CLIP enable efficient, Text

备注

点击查看摘要

Abstract:Text--image retrieval is necessary for applications such as product recommendation. Embedding-based approaches like CLIP enable efficient large-scale retrieval via vector similarity search, but they are primarily trained on literal caption-like text--image pairs and often fail to capture abstract or persona-driven attributes common in product recommendation applications (e.g., ``a gift for a mother who loves gardening''). In contrast, state-of-the-art vision--language models (vLLMs) can align text with images in a flexible manner, but their limited context window prevents them from directly handling retrieval over large catalogs. We propose a framework that distills the preference rankings of a powerful vLLM into an embedding-based system, transferring its nuanced alignment abilities while maintaining the inference-time scalability of an embedding-based approach. Experiments on persona-driven product recommendation tasks demonstrate that our method significantly outperforms existing embedding-based baselines, providing an efficient solution for personalized text--image retrieval.

14. 【2510.11956】Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries

链接https://arxiv.org/abs/2510.11956

作者:Gabrielle Kaili-May Liu,Bryan Li,Arman Cohan,William Gantt Walden,Eugene Yang

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Real-world use cases, RAG systems, RAG, underline, relevant information

备注

点击查看摘要

Abstract:Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which often can be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability for such benchmarks to uncover limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of un$\underline{c}$heatable, $\underline{r}$ealistic, $\underline{u}$nanswerable, and $\underline{m}$ulti-hop $\underline{q}$uerie$\underline{s}$ (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to 81.0\% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and drive development of more capable RAG systems.

计算机视觉

1. 【2510.12801】DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

链接https://arxiv.org/abs/2510.12801

作者:Kartik Narayan,Yang Xu,Tian Cao,Kavya Nerella,Vishal M. Patel,Navid Shiee,Peter Grasch,Chao Jia,Yinfei Yang,Zhe Gan

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:Large Language Models, Multimodal Large Language, Large Language, applications require access, external knowledge sources

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to the dynamic and ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval augmented generation (RAG) methods, search agents, and search equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image making the image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold start supervised finetuning phase followed by an online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web-search.

2. 【2510.12798】Detect Anything via Next Point Prediction

链接https://arxiv.org/abs/2510.12798

作者:Qing Jiang,Junan Huo,Xingyu Chen,Yuda Xiong,Zhaoyang Zeng,Yihao Chen,Tianhe Ren,Junzhi Yu,Lei Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Grounding DINO, long been dominated, dominated by traditional, DINO, traditional coordinate regression-based

备注: homepage: [this https URL](https://rex-omni.github.io/)

点击查看摘要

Abstract:Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges like low recall rate, duplicate predictions, coordinate misalignment, etc. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model's learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; \3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million data with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.

3. 【2510.12796】DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

链接https://arxiv.org/abs/2510.12796

作者:Yingyan Li,Shuyao Shang,Weisong Liu,Bing Zhan,Haochen Wang,Yuqi Wang,Yuntao Chen,Xiaoman Wang,Yasong An,Chufeng Tang,Lu Hou,Lue Fan,Zhaoxiang Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:generalized driving intelligence, large-scale data offers, offers a promising, promising path, path to achieving

备注

点击查看摘要

Abstract:Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a ``supervision deficit'': the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose \textbf{DriveVLA-W0}, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.

4. 【2510.12795】CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations

链接https://arxiv.org/abs/2510.12795

作者:Caner Korkmaz,Brighton Nuwagira,Barış Coşkunuzer,Tolga Birdal

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Algebraic Topology (math.AT); Machine Learning (stat.ML)

关键词:Cubical Multiparameter Persistence, differentiable vectorization layer, deep learning pipelines, Cubical Multiparameter, Multiparameter Persistence

备注: Appears at ICCV 2025

点击查看摘要

Abstract:We present CuMPerLay, a novel differentiable vectorization layer that enables the integration of Cubical Multiparameter Persistence (CMP) into deep learning pipelines. While CMP presents a natural and powerful way to topologically work with images, its use is hindered by the complexity of multifiltration structures as well as the vectorization of CMP. In face of these challenges, we introduce a new algorithm for vectorizing MP homologies of cubical complexes. Our CuMPerLay decomposes the CMP into a combination of individual, learnable single-parameter persistence, where the bifiltration functions are jointly learned. Thanks to the differentiability, its robust topological feature vectors can be seamlessly used within state-of-the-art architectures such as Swin Transformers. We establish theoretical guarantees for the stability of our vectorization under generalized Wasserstein metrics. Our experiments on benchmark medical imaging and computer vision datasets show the benefit CuMPerLay on classification and segmentation performance, particularly in limited-data scenarios. Overall, CuMPerLay offers a promising direction for integrating global structural information into deep networks for structured image analysis.

5. 【2510.12793】ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

链接https://arxiv.org/abs/2510.12793

作者:Long Cui,Weiyun Wang,Jie Shao,Zichen Wen,Gen Luo,Linfeng Zhang,Yanting Zhang,Yu Qiao,Wenhai Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal Large Language, Existing Multimodal Large, Large Language Models, Multimodal Large, Large Language

备注

点击查看摘要

Abstract:Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs. In this work, we propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying semantic complexities using different numbers of vision tokens. The key idea behind our method is to employ multiple MLP connectors, each with a different image compression ratio, to downsample the vision tokens based on the semantic complexity of the image. During training, we minimize the KL divergence between the responses conditioned on different MLP connectors. At inference time, we introduce an image router, termed Visual Resolution Router (ViR), that automatically selects the appropriate compression rate for each image patch. Compared with existing dynamic high-resolution strategies, which adjust the number of visual tokens based on image resolutions, our method dynamically adapts the number of visual tokens according to semantic complexity. Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model's perception, reasoning, and OCR capabilities. We hope this work will contribute to the development of more efficient MLLMs. The code and models will be released to facilitate future research.

6. 【2510.12789】UniFusion: Vision-Language Model as Unified Encoder in Image Generation

链接https://arxiv.org/abs/2510.12789

作者:Kevin Li,Manuel Brack,Sudeep Katakol,Hareesh Ravi,Ajinkya Kale

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:VLM, Layerwise Attention Pooling, recent advances, depend on distinct, model

备注: Project page at [this https URL](https://thekevinli.github.io/unifusion/)

点击查看摘要

Abstract:Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use the last layer information from VLM, employ multiple visual encoders, or train large unified models jointly for text and image generation, which demands substantial computational resources and large-scale data, limiting its this http URL present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism that extracts both high level semantics and low level details from text and visual tokens of a frozen VLM to condition a diffusion generative model. We demonstrate that LAP outperforms other shallow fusion architectures on text-image alignment for generation and faithful transfer of visual information from VLM to the diffusion model which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexibile Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting. VERIFI combines the alignment of the conditioning distribution with the VLM's reasoning capabilities for increased capabilities and flexibility at inference. In addition, finetuning on editing task not only improves text-image alignment for generation, indicative of cross-modality knowledge transfer, but also exhibits tremendous generalization capabilities. Our model when trained on single image editing, zero-shot generalizes to multiple image references further motivating the unified encoder design of UniFusion.

7. 【2510.12788】Efficient Real-World Deblurring using Single Images: AIM 2025 Challenge Report

链接https://arxiv.org/abs/2510.12788

作者:Daniel Feijoo,Paula Garrido-Mellado,Marcos V. Conde,Jaesung Rim,Alvaro Garcia,Sunghyun Cho,Radu Timofte

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Single Images Challenge, efficient real-blur restoration, real-blur restoration, Single Images, reviews the AIM

备注: ICCV 2025 - AIM Workshop

点击查看摘要

Abstract:This paper reviews the AIM 2025 Efficient Real-World Deblurring using Single Images Challenge, which aims to advance in efficient real-blur restoration. The challenge is based on a new test set based on the well known RSBlur dataset. Pairs of blur and degraded images in this dataset are captured using a double-camera system. Participant were tasked with developing solutions to effectively deblur these type of images while fulfilling strict efficiency constraints: fewer than 5 million model parameters and a computational budget under 200 GMACs. A total of 71 participants registered, with 4 teams finally submitting valid solutions. The top-performing approach achieved a PSNR of 31.1298 dB, showcasing the potential of efficient methods in this domain. This paper provides a comprehensive overview of the challenge, compares the proposed solutions, and serves as a valuable reference for researchers in efficient real-world image deblurring.

8. 【2510.12785】MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars

链接https://arxiv.org/abs/2510.12785

作者:Felix Taubner,Ruihang Zhang,Mathieu Tuli,Sherwin Bahmani,David B. Lindell

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

关键词:enabling immersive experiences, virtual environments, virtual reality, human avatars aim, enabling immersive

备注: 18 pages, 12 figures

点击查看摘要

Abstract:Digital human avatars aim to simulate the dynamic appearance of humans in virtual environments, enabling immersive experiences across gaming, film, virtual reality, and more. However, the conventional process for creating and animating photorealistic human avatars is expensive and time-consuming, requiring large camera capture rigs and significant manual effort from professional 3D artists. With the advent of capable image and video generation models, recent methods enable automatic rendering of realistic animated avatars from a single casually captured reference image of a target subject. While these techniques significantly lower barriers to avatar creation and offer compelling realism, they lack constraints provided by multi-view information or an explicit 3D representation. So, image quality and realism degrade when rendered from viewpoints that deviate strongly from the reference image. Here, we build a video model that generates animatable multi-view videos of digital humans based on a single reference image and target expressions. Our model, MVP4D, is based on a state-of-the-art pre-trained video diffusion model and generates hundreds of frames simultaneously from viewpoints varying by up to 360 degrees around a target subject. We show how to distill the outputs of this model into a 4D avatar that can be rendered in real-time. Our approach significantly improves the realism, temporal consistency, and 3D consistency of generated avatars compared to previous methods.

9. 【2510.12784】SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

链接https://arxiv.org/abs/2510.12784

作者:Weiyang Jin,Yuwei Niu,Jiaqi Liao,Chengqi Duan,Aoxue Li,Shenghua Gao,Xihui Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Unified Multimodal Models, Unified Multimodal, made in Unified, integrate vision-language generation, Multimodal Models

备注: 20 pages, 8 figures, webpage can be seen in [this https URL](https://waynejin0918.github.io/srum_web/)

点击查看摘要

Abstract:Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a significant gap exists where a model's strong visual understanding often fails to transfer to its visual generation. A model might correctly understand an image based on user instructions, yet be unable to generate a faithful image from text prompts. This phenomenon directly raises a compelling question: Can a model achieve self-improvement by using its understanding module to reward its generation module? To bridge this gap and achieve self-improvement, we introduce SRUM, a self-rewarding post-training framework that can be directly applied to existing UMMs of various designs. SRUM creates a feedback loop where the model's own understanding module acts as an internal ``evaluator'', providing corrective signals to improve its generation module, without requiring additional human-labeled data. To ensure this feedback is comprehensive, we designed a global-local dual reward system. To tackle the inherent structural complexity of images, this system offers multi-scale guidance: a \textbf{global reward} ensures the correctness of the overall visual semantics and layout, while a \textbf{local reward} refines fine-grained, object-level fidelity. SRUM leads to powerful capabilities and shows strong generalization, boosting performance on T2I-CompBench from 82.18 to \textbf{88.37} and on T2I-ReasonBench from 43.82 to \textbf{46.75}. Overall, our work establishes a powerful new paradigm for enabling a UMMs' understanding module to guide and enhance its own generation via self-rewarding.

10. 【2510.12777】What If : Understanding Motion Through Sparse Interactions

链接https://arxiv.org/abs/2510.12777

作者:Stefan Andreas Baumann,Nick Stracke,Timy Phan,Björn Ommer

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:scene involves reasoning, Flow Poke Transformer, physical scene involves, potentially change, involves reasoning

备注: Project page and code: [this https URL](https://compvis.github.io/flow-poke-transformer)

点击查看摘要

Abstract:Understanding the dynamics of a physical scene involves reasoning about the diverse ways it can potentially change, especially as a result of local interactions. We present the Flow Poke Transformer (FPT), a novel framework for directly predicting the distribution of local motion, conditioned on sparse interactions termed "pokes". Unlike traditional methods that typically only enable dense sampling of a single realization of scene dynamics, FPT provides an interpretable directly accessible representation of multi-modal scene motion, its dependency on physical interactions and the inherent uncertainties of scene dynamics. We also evaluate our model on several downstream tasks to enable comparisons with prior methods and highlight the flexibility of our approach. On dense face motion generation, our generic pre-trained model surpasses specialized baselines. FPT can be fine-tuned in strongly out-of-distribution tasks such as synthetic datasets to enable significant improvements over in-domain methods in articulated object motion estimation. Additionally, predicting explicit motion distributions directly enables our method to achieve competitive performance on tasks like moving part segmentation from pokes which further demonstrates the versatility of our FPT. Code and models are publicly available at this https URL.

11. 【2510.12768】Uncertainty Matters in Dynamic Gaussian Splatting for Monocular 4D Reconstruction

链接https://arxiv.org/abs/2510.12768

作者:Fengzhi Guo,Chih-Chuan Hsu,Sihao Ding,Cheng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

关键词:dynamic Gaussian Splatting, Gaussian Splatting, Reconstructing dynamic, scenes from monocular, fundamentally under-constrained

备注: Project page: [this https URL](https://tamu-visual-ai.github.io/usplat4d/)

点击查看摘要

Abstract:Reconstructing dynamic 3D scenes from monocular input is fundamentally under-constrained, with ambiguities arising from occlusion and extreme novel views. While dynamic Gaussian Splatting offers an efficient representation, vanilla models optimize all Gaussian primitives uniformly, ignoring whether they are well or poorly observed. This limitation leads to motion drifts under occlusion and degraded synthesis when extrapolating to unseen views. We argue that uncertainty matters: Gaussians with recurring observations across views and time act as reliable anchors to guide motion, whereas those with limited visibility are treated as less reliable. To this end, we introduce USplat4D, a novel Uncertainty-aware dynamic Gaussian Splatting framework that propagates reliable motion cues to enhance 4D reconstruction. Our key insight is to estimate time-varying per-Gaussian uncertainty and leverages it to construct a spatio-temporal graph for uncertainty-aware optimization. Experiments on diverse real and synthetic datasets show that explicitly modeling uncertainty consistently improves dynamic Gaussian Splatting models, yielding more stable geometry under occlusion and high-quality synthesis at extreme viewpoints.

12. 【2510.12765】Efficient Perceptual Image Super Resolution: AIM 2025 Study and Benchmark

链接https://arxiv.org/abs/2510.12765

作者:Bruno Longarela,Marcos V. Conde,Alvaro Garcia,Radu Timofte

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Efficient Perceptual Super-Resolution, presents a comprehensive, comprehensive study, EPSR, Perceptual Super-Resolution

备注: ICCV 2025 - AIM Workshop

点击查看摘要

Abstract:This paper presents a comprehensive study and benchmark on Efficient Perceptual Super-Resolution (EPSR). While significant progress has been made in efficient PSNR-oriented super resolution, approaches focusing on perceptual quality metrics remain relatively inefficient. Motivated by this gap, we aim to replicate or improve the perceptual results of Real-ESRGAN while meeting strict efficiency constraints: a maximum of 5M parameters and 2000 GFLOPs, calculated for an input size of 960x540 pixels. The proposed solutions were evaluated on a novel dataset consisting of 500 test images of 4K resolution, each degraded using multiple degradation types, without providing the original high-quality counterparts. This design aims to reflect realistic deployment conditions and serves as a diverse and challenging benchmark. The top-performing approach manages to outperform Real-ESRGAN across all benchmark datasets, demonstrating the potential of efficient methods in the perceptual domain. This paper establishes the modern baselines for efficient perceptual super resolution.

13. 【2510.12764】AnyUp: Universal Feature Upsampling

链接https://arxiv.org/abs/2510.12764

作者:Thomas Wimmer,Prune Truong,Marie-Julie Rakotosaona,Michael Oechsle,Federico Tombari,Bernt Schiele,Jan Eric Lenssen

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:encoder-specific training, DINO or CLIP, feature, vision feature, introduce AnyUp

备注: Project Website: [this https URL](https://wimmerth.github.io/anyup/)

点击查看摘要

Abstract:We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an inference-time feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.

14. 【2510.12758】PET Head Motion Estimation Using Supervised Deep Learning with Attention

链接https://arxiv.org/abs/2510.12758

作者:Zhuotong Cai,Tianyi Zeng,Jiazhen Zhang,Eléonore V. Lieffrig,Kathryn Fontaine,Chenyu You,Enette Mae Revilla,James S. Duncan,Jingmin Xin,Yihuan Lu,John A. Onofrey

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:positron emission tomography, Head movement poses, uptake quantification inaccuracies, tracer uptake quantification, brain positron emission

备注: Accepted for publication in IEEE Transactions on Medical Imaging (TMI), 2025. This is the accepted manuscript version

点击查看摘要

Abstract:Head movement poses a significant challenge in brain positron emission tomography (PET) imaging, resulting in image artifacts and tracer uptake quantification inaccuracies. Effective head motion estimation and correction are crucial for precise quantitative image analysis and accurate diagnosis of neurological disorders. Hardware-based motion tracking (HMT) has limited applicability in real-world clinical practice. To overcome this limitation, we propose a deep-learning head motion correction approach with cross-attention (DL-HMC++) to predict rigid head motion from one-second 3D PET raw data. DL-HMC++ is trained in a supervised manner by leveraging existing dynamic PET scans with gold-standard motion measurements from external HMT. We evaluate DL-HMC++ on two PET scanners (HRRT and mCT) and four radiotracers (18F-FDG, 18F-FPEB, 11C-UCB-J, and 11C-LSN3172176) to demonstrate the effectiveness and generalization of the approach in large cohort PET studies. Quantitative and qualitative results demonstrate that DL-HMC++ consistently outperforms state-of-the-art data-driven motion estimation methods, producing motion-free images with clear delineation of brain structures and reduced motion artifacts that are indistinguishable from gold-standard HMT. Brain region of interest standard uptake value analysis exhibits average difference ratios between DL-HMC++ and gold-standard HMT to be 1.2 plus-minus 0.5% for HRRT and 0.5 plus-minus 0.2% for mCT. DL-HMC++ demonstrates the potential for data-driven PET head motion correction to remove the burden of HMT, making motion correction accessible to clinical populations beyond research settings. The code is available at this https URL.

15. 【2510.12753】E-MoFlow: Learning Egomotion and Optical Flow from Event Data via Implicit Regularization

链接https://arxiv.org/abs/2510.12753

作者:Wenpu Li,Bangyan Liao,Yi Zhou,Qi Xu,Pian Wan,Peidong Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:optical flow, addressed independently, fundamental tasks, typically been addressed, flow

备注: The Thirty-Ninth Annual Conference on Neural Information Processing Systems(NeurIPS 2025)

点击查看摘要

Abstract:The estimation of optical flow and 6-DoF ego-motion, two fundamental tasks in 3D vision, has typically been addressed independently. For neuromorphic vision (e.g., event cameras), however, the lack of robust data association makes solving the two problems separately an ill-posed challenge, especially in the absence of supervision via ground truth. Existing works mitigate this ill-posedness by either enforcing the smoothness of the flow field via an explicit variational regularizer or leveraging explicit structure-and-motion priors in the parametrization to improve event alignment. The former notably introduces bias in results and computational overhead, while the latter, which parametrizes the optical flow in terms of the scene depth and the camera motion, often converges to suboptimal local minima. To address these issues, we propose an unsupervised framework that jointly optimizes egomotion and optical flow via implicit spatial-temporal and geometric regularization. First, by modeling camera's egomotion as a continuous spline and optical flow as an implicit neural representation, our method inherently embeds spatial-temporal coherence through inductive biases. Second, we incorporate structure-and-motion priors through differential geometric constraints, bypassing explicit depth estimation while maintaining rigorous geometric consistency. As a result, our framework (called E-MoFlow) unifies egomotion and optical flow estimation via implicit regularization under a fully unsupervised paradigm. Experiments demonstrate its versatility to general 6-DoF motion scenarios, achieving state-of-the-art performance among unsupervised methods and competitive even with supervised approaches.

16. 【2510.12750】VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage

链接https://arxiv.org/abs/2510.12750

作者:A. Alfarano,L. Venturoli,D. Negueruela del Castillo(University of Zurich, Max Planck Society)

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrated significant capabilities

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface-level attributes, these questions fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a new, large-scale VQA benchmark for the cultural heritage domain. This benchmark is constructed using a novel multi-agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is structured along relevant visual understanding dimensions that probe a model's ability to interpret symbolic meaning, narratives, and complex visual relationships. Our evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open-source models.

17. 【2510.12749】SPORTS: Simultaneous Panoptic Odometry, Rendering, Tracking and Segmentation for Urban Scenes Understanding

链接https://arxiv.org/abs/2510.12749

作者:Zhiliu Yang,Jinyu Dai,Jianyuan Zhang,Zhu Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:sensor data sparsity, dynamic objects' interference, holistic scene understanding, embodied-AI agents, objects' interference

备注: Accepted by IEEE Transactions on Multimedia

点击查看摘要

Abstract:The scene perception, understanding, and simulation are fundamental techniques for embodied-AI agents, while existing solutions are still prone to segmentation deficiency, dynamic objects' interference, sensor data sparsity, and view-limitation problems. This paper proposes a novel framework, named SPORTS, for holistic scene understanding via tightly integrating Video Panoptic Segmentation (VPS), Visual Odometry (VO), and Scene Rendering (SR) tasks into an iterative and unified perspective. Firstly, VPS designs an adaptive attention-based geometric fusion mechanism to align cross-frame features via enrolling the pose, depth, and optical flow modality, which automatically adjust feature maps for different decoding stages. And a post-matching strategy is integrated to improve identities tracking. In VO, panoptic segmentation results from VPS are combined with the optical flow map to improve the confidence estimation of dynamic objects, which enhances the accuracy of the camera pose estimation and completeness of the depth map generation via the learning-based paradigm. Furthermore, the point-based rendering of SR is beneficial from VO, transforming sparse point clouds into neural fields to synthesize high-fidelity RGB views and twin panoptic views. Extensive experiments on three public datasets demonstrate that our attention-based feature fusion outperforms most existing state-of-the-art methods on the odometry, tracking, segmentation, and novel view synthesis tasks.

18. 【2510.12747】FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution

链接https://arxiv.org/abs/2510.12747

作者:Junhao Zhuang,Shi Guo,Xin Cai,Xiaohui Li,Yihao Liu,Chun Yuan,Tianfan Xue

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remains challenging due, advanced video restoration, recently advanced video, remains challenging, high latency

备注: Project page with code: [this https URL](https://zhuang2002.github.io/FlashVSR)

点击查看摘要

Abstract:Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train-test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based VSR.

19. 【2510.12741】Personalized Federated Fine-Tuning of Vision Foundation Models for Healthcare

链接https://arxiv.org/abs/2510.12741

作者:Adam Tupper,Christian Gagné

类目:Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

关键词:Foundation models open, Foundation models, models open, data, foundation models reduce

备注: Accepted to the Symposium on Model Accountability, Sustainability and Healthcare (SMASH) 2025

点击查看摘要

Abstract:Foundation models open up new possibilities for the use of AI in healthcare. However, even when pre-trained on health data, they still need to be fine-tuned for specific downstream tasks. Furthermore, although foundation models reduce the amount of training data required to achieve good performance, obtaining sufficient data is still a challenge. This is due, in part, to restrictions on sharing and aggregating data from different sources to protect patients' privacy. One possible solution to this is to fine-tune foundation models via federated learning across multiple participating clients (i.e., hospitals, clinics, etc.). In this work, we propose a new personalized federated fine-tuning method that learns orthogonal LoRA adapters to disentangle general and client-specific knowledge, enabling each client to fully exploit both their own data and the data of others. Our preliminary results on real-world federated medical imaging tasks demonstrate that our approach is competitive against current federated fine-tuning methods.

20. 【2510.12720】Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

链接https://arxiv.org/abs/2510.12720

作者:Ziyang Ma,Ruiyang Xu,Zhenghao Xing,Yunfei Chu,Yuxuan Wang,Jinzheng He,Jin Xu,Pheng-Ann Heng,Kai Yu,Junyang Lin,Eng Siong Chng,Xie Chen

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)

关键词:advancing human-AI interaction, Omni Language Models, human-AI interaction, information is critical, critical for advancing

备注: [this https URL](https://github.com/ddlBoJack/Omni-Captioner)

点击查看摘要

Abstract:Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains limited explored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 testset. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.

21. 【2510.12712】Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning

链接https://arxiv.org/abs/2510.12712

作者:Xingang Guo,Utkarsh Tyagi,Advait Gosai,Paula Vergara,Ernesto Gabriel Hernández Montoya,Chen Bo Calvin Zhang,Bin Hu,Yunzhong He,Bing Liu,Rakshith Sharma Srinivasa

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, salient visual cues

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are increasingly applied in real-world scenarios where user-provided images are often imperfect, requiring active image manipulations such as cropping, editing, or enhancement to uncover salient visual cues. Beyond static visual perception, MLLMs must also think with images: dynamically transforming visual content and integrating it with other tools to solve complex tasks. However, this shift from treating vision as passive context to a manipulable cognitive workspace remains underexplored. Most existing benchmarks still follow a think about images paradigm, where images are regarded as static inputs. To address this gap, we introduce IRIS, an Interactive Reasoning with Images and Systems that evaluates MLLMs' ability to perceive, transform, and reason across complex visual-textual tasks under the think with images paradigm. IRIS comprises 1,204 challenging, open-ended vision tasks (603 single-turn, 601 multi-turn) spanning across five diverse domains, each paired with detailed rubrics to enable systematic evaluation. Our evaluation shows that current MLLMs struggle with tasks requiring effective integration of vision and general-purpose tools. Even the strongest model (GPT-5-think) reaches only 18.68% pass rate. We further observe divergent tool-use behaviors, with OpenAI models benefiting from diverse image manipulations while Gemini-2.5-pro shows no improvement. By introducing the first benchmark centered on think with images, IRIS offers critical insights for advancing visual intelligence in MLLMs.

22. 【2510.12709】SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model

链接https://arxiv.org/abs/2510.12709

作者:Lin Lin,Jiefeng Long,Zhihe Wan,Yuchi Wang,Dingkang Yang,Shuang Yang,Yueyang Yao,Xu Chen,Zirui Guo,Shengqiang Li,Weiran Li,Hanyu Li,Yaling Mou,Yan Qiu,Haiyang Yu,Xiao Liang,Hongsheng Li,Chao Feng

类目:Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

关键词:informative unified representations, yield informative unified, informative unified, empower diverse cross-modal, training

备注: Technical Report

点击查看摘要

Abstract:Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as the limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, the content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining user historical interests. Concurrently, we develop the stochastic specialization and dataset-driven pattern matching to strengthen model training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves SOTA performance compared to other methods in different retrieval tasks. In online experiments across various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), which is a crucial indicator for the recommendation experience. For instance, the model delivers the 7-day LT gain of +0.5% in the Douyin-Selected scenario. For the Douyin feed rank model, the match features produced by SAIL-Embedding yield a +0.1% AUC gain.

23. 【2510.12704】Hybrid Explanation-Guided Learning for Transformer-Based Chest X-Ray Diagnosis

链接https://arxiv.org/abs/2510.12704

作者:Shelley Zixin Shu,Haozhe Luo,Alexander Poellinger,Mauricio Reyes

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Transformer-based deep learning, demonstrated exceptional performance, Transformer-based deep, deep learning models, leveraging attention mechanisms

备注: Accepted by iMIMIC at MICCAI 2025

点击查看摘要

Abstract:Transformer-based deep learning models have demonstrated exceptional performance in medical imaging by leveraging attention mechanisms for feature representation and interpretability. However, these models are prone to learning spurious correlations, leading to biases and limited generalization. While human-AI attention alignment can mitigate these issues, it often depends on costly manual supervision. In this work, we propose a Hybrid Explanation-Guided Learning (H-EGL) framework that combines self-supervised and human-guided constraints to enhance attention alignment and improve generalization. The self-supervised component of H-EGL leverages class-distinctive attention without relying on restrictive priors, promoting robustness and flexibility. We validate our approach on chest X-ray classification using the Vision Transformer (ViT), where H-EGL outperforms two state-of-the-art Explanation-Guided Learning (EGL) methods, demonstrating superior classification accuracy and generalization capability. Additionally, it produces attention maps that are better aligned with human expertise.

24. 【2510.12691】DiffEM: Learning from Corrupted Data with Diffusion Models via Expectation Maximization

链接https://arxiv.org/abs/2510.12691

作者:Danial Hosseintabar,Fan Chen,Giannis Daras,Antonio Torralba,Constantinos Daskalakis

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:high-dimensional inverse problems, powerful generative priors, Diffusion models, inverse problems, remains challenging

备注

点击查看摘要

Abstract:Diffusion models have emerged as powerful generative priors for high-dimensional inverse problems, yet learning them when only corrupted or noisy observations are available remains challenging. In this work, we propose a new method for training diffusion models with Expectation-Maximization (EM) from corrupted data. Our proposed method, DiffEM, utilizes conditional diffusion models to reconstruct clean data from observations in the E-step, and then uses the reconstructed data to refine the conditional diffusion model in the M-step. Theoretically, we provide monotonic convergence guarantees for the DiffEM iteration, assuming appropriate statistical conditions. We demonstrate the effectiveness of our approach through experiments on various image reconstruction tasks.

25. 【2510.12687】EReLiFM: Evidential Reliability-Aware Residual Flow Meta-Learning for Open-Set Domain Generalization under Noisy Labels

链接https://arxiv.org/abs/2510.12687

作者:Kunyu Peng,Di Wen,Kailun Yang,Jia Fu,Yufan Chen,Ruiping Liu,Jiamin Wu,Junwei Zheng,M. Saquib Sarfraz,Luc Van Gool,Danda Pani Paudel,Rainer Stiefelhagen

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:Open-Set Domain Generalization, enable deep learning, recognize unseen categories, Domain Generalization, deep learning models

备注: The source code is available at [this https URL](https://github.com/KPeng9510/ERELIFM)

点击查看摘要

Abstract:Open-Set Domain Generalization (OSDG) aims to enable deep learning models to recognize unseen categories in new domains, which is crucial for real-world applications. Label noise hinders open-set domain generalization by corrupting source-domain knowledge, making it harder to recognize known classes and reject unseen ones. While existing methods address OSDG under Noisy Labels (OSDG-NL) using hyperbolic prototype-guided meta-learning, they struggle to bridge domain gaps, especially with limited clean labeled data. In this paper, we propose Evidential Reliability-Aware Residual Flow Meta-Learning (EReLiFM). We first introduce an unsupervised two-stage evidential loss clustering method to promote label reliability awareness. Then, we propose a residual flow matching mechanism that models structured domain- and category-conditioned residuals, enabling diverse and uncertainty-aware transfer paths beyond interpolation-based augmentation. During this meta-learning process, the model is optimized such that the update direction on the clean set maximizes the loss decrease on the noisy set, using pseudo labels derived from the most confident predicted class for supervision. Experimental results show that EReLiFM outperforms existing methods on OSDG-NL, achieving state-of-the-art performance. The source code is available at this https URL.

26. 【2510.12679】MCOP: Multi-UAV Collaborative Occupancy Prediction

链接https://arxiv.org/abs/2510.12679

作者:Zefu Lin,Wenbo Chen,Xiaojuan Jin,Yuran Yang,Lue Fan,Yixin Zhang,Yufeng Zhang,Zhaoxiang Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Unmanned Aerial Vehicle, Unmanned Aerial, Aerial Vehicle, swarm systems necessitate, diverse operational scenarios

备注

点击查看摘要

Abstract:Unmanned Aerial Vehicle (UAV) swarm systems necessitate efficient collaborative perception mechanisms for diverse operational scenarios. Current Bird's Eye View (BEV)-based approaches exhibit two main limitations: bounding-box representations fail to capture complete semantic and geometric information of the scene, and their performance significantly degrades when encountering undefined or occluded objects. To address these limitations, we propose a novel multi-UAV collaborative occupancy prediction framework. Our framework effectively preserves 3D spatial structures and semantics through integrating a Spatial-Aware Feature Encoder and Cross-Agent Feature Integration. To enhance efficiency, we further introduce Altitude-Aware Feature Reduction to compactly represent scene information, along with a Dual-Mask Perceptual Guidance mechanism to adaptively select features and reduce communication overhead. Due to the absence of suitable benchmark datasets, we extend three datasets for evaluation: two virtual datasets (Air-to-Pred-Occ and UAV3D-Occ) and one real-world dataset (GauUScene-Occ). Experiments results demonstrate that our method achieves state-of-the-art accuracy, significantly outperforming existing collaborative methods while reducing communication overhead to only a fraction of previous approaches.

27. 【2510.12670】rraCodec: Compressing Earth Observations

链接https://arxiv.org/abs/2510.12670

作者:Julen Costa-Watanabe,Isabelle Wittmann,Benedikt Blumenstiel,Konrad Schindler

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:satellites produce massive, posing pressing challenges, produce massive streams, satellites produce, posing pressing

备注

点击查看摘要

Abstract:Earth observation (EO) satellites produce massive streams of multispectral image time series, posing pressing challenges for storage and transmission. Yet, learned EO compression remains fragmented, lacking publicly available pretrained models and misaligned with advances in compression for natural imagery. Image codecs overlook temporal redundancy, while video codecs rely on motion priors that fail to capture the radiometric evolution of largely static scenes. We introduce TerraCodec (TEC), a family of learned codecs tailored to EO. TEC includes efficient image-based variants adapted to multispectral inputs, as well as a Temporal Transformer model (TEC-TT) that leverages dependencies across time. To overcome the fixed-rate setting of today's neural codecs, we present Latent Repacking, a novel method for training flexible-rate transformer models that operate on varying rate-distortion settings. Trained on Sentinel-2 data, TerraCodec outperforms classical codecs, achieving 3-10x stronger compression at equivalent image quality. Beyond compression, TEC-TT enables zero-shot cloud inpainting, surpassing state-of-the-art methods on the AllClear benchmark. Our results establish bespoke, learned compression algorithms as a promising direction for Earth observation. Code and model weights will be released under a permissive license.

28. 【2510.12660】On the Use of Hierarchical Vision Foundation Models for Low-Cost Human Mesh Recovery and Pose Estimation

链接https://arxiv.org/abs/2510.12660

作者:Shuhei Tarashima,Yushan Wang,Norio Tagawa

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:human mesh recovery, human pose estimation, mesh recovery, predecessor task, pose estimation

备注: Accepted at ICCVW 2025

点击查看摘要

Abstract:In this work, we aim to develop simple and efficient models for human mesh recovery (HMR) and its predecessor task, human pose estimation (HPE). State-of-the-art HMR methods, such as HMR2.0 and its successors, rely on large, non-hierarchical vision transformers as encoders, which are inherited from the corresponding HPE models like ViTPose. To establish baselines across varying computational budgets, we first construct three lightweight HMR2.0 variants by adapting the corresponding ViTPose models. In addition, we propose leveraging the early stages of hierarchical vision foundation models (VFMs), including Swin Transformer, GroupMixFormer, and VMamba, as encoders. This design is motivated by the observation that intermediate stages of hierarchical VFMs produce feature maps with resolutions comparable to or higher than those of non-hierarchical counterparts. We conduct a comprehensive evaluation of 27 hierarchical-VFM-based HMR and HPE models, demonstrating that using only the first two or three stages achieves performance on par with full-stage models. Moreover, we show that the resulting truncated models exhibit better trade-offs between accuracy and computational efficiency compared to existing lightweight alternatives.

29. 【2510.12646】Zero-Shot CFC: Fast Real-World Image Denoising based on Cross-Frequency Consistency

链接https://arxiv.org/abs/2510.12646

作者:Yanlin Jiang,Yuchen Liu,Mingren Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Zero-shot denoisers address, unseen single images, denoising, zero-shot methods, Zero-shot

备注: The British Machine Vision Conference

点击查看摘要

Abstract:Zero-shot denoisers address the dataset dependency of deep-learning-based denoisers, enabling the denoising of unseen single images. Nonetheless, existing zero-shot methods suffer from long training times and rely on the assumption of noise independence and a zero-mean property, limiting their effectiveness in real-world denoising scenarios where noise characteristics are more complicated. This paper proposes an efficient and effective method for real-world denoising, the Zero-Shot denoiser based on Cross-Frequency Consistency (ZSCFC), which enables training and denoising with a single noisy image and does not rely on assumptions about noise distribution. Specifically, image textures exhibit position similarity and content consistency across different frequency bands, while noise does not. Based on this property, we developed cross-frequency consistency loss and an ultralight network to realize image denoising. Experiments on various real-world image datasets demonstrate that our ZSCFC outperforms other state-of-the-art zero-shot methods in terms of computational efficiency and denoising performance.

30. 【2510.12605】WaterFlow: Explicit Physics-Prior Rectified Flow for Underwater Saliency Mask Generation

链接https://arxiv.org/abs/2510.12605

作者:Runting Li,Shijie Lian,Hua Li,Yutong Li,Wenhui Wu,Sam Kwong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:faces significant challenges, Salient Object Detection, Underwater Salient Object, including underwater image, image quality degradation

备注

点击查看摘要

Abstract:Underwater Salient Object Detection (USOD) faces significant challenges, including underwater image quality degradation and domain gaps. Existing methods tend to ignore the physical principles of underwater imaging or simply treat degradation phenomena in underwater images as interference factors that must be eliminated, failing to fully exploit the valuable information they contain. We propose WaterFlow, a rectified flow-based framework for underwater salient object detection that innovatively incorporates underwater physical imaging information as explicit priors directly into the network training process and introduces temporal dimension modeling, significantly enhancing the model's capability for salient object identification. On the USOD10K dataset, WaterFlow achieves a 0.072 gain in S_m, demonstrating the effectiveness and superiority of our method. The code will be published after the acceptance.

31. 【2510.12603】Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space

链接https://arxiv.org/abs/2510.12603

作者:Chao Chen,Zhixin Ma,Yongqi Li,Yupeng Hu,Yinwei Wei,Wenjie Li,Liqiang Nie

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:incorporating intermediate reasoning, reasoning, Multimodal reasoning aims, final answer, multimodal latent reasoning

备注

点击查看摘要

Abstract:Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning with the advantages of multimodal representation, reduced annotation, and inference efficiency. To facilicate it, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information in the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps. Experiments on M3CoT and ScienceQA demonstrate that our IVT-LR method achieves an average performance increase of 5.45% in accuracy, while simultaneously achieving a speed increase of over 5 times compared to existing approaches. Code available at this https URL.

32. 【2510.12586】Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training

链接https://arxiv.org/abs/2510.12586

作者:Jiachen Lei,Keli Liu,Julius Berner,Haiming Yu,Hongkai Zheng,Jiahong Wu,Xiangxiang Chu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:generally underperform compared, Pixel-space generative models, leaving a persistent, difficult to train, train and generally

备注

点击查看摘要

Abstract:Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our training framework demonstrates strong empirical performance on ImageNet dataset. Specifically, our diffusion model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 number of function evaluations (NFE), surpassing prior pixel-space methods by a large margin in both generation quality and efficiency while rivaling leading VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our consistency model achieves an impressive FID of 8.82 in a single sampling step, significantly surpassing its latent-space counterpart. To the best of our knowledge, this marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.

33. 【2510.12581】LayerSync: Self-aligning Intermediate Layers

链接https://arxiv.org/abs/2510.12581

作者:Yasaman Haghighi,Bastien van Delft,Mariam Hassan,Alexandre Alahi

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:diffusion models, diffusion model training, generation quality, diffusion, generation

备注

点击查看摘要

Abstract:We propose LayerSync, a domain-agnostic approach for improving the generation quality and the training efficiency of diffusion models. Prior studies have highlighted the connection between the quality of generation and the representations learned by diffusion models, showing that external guidance on model intermediate representations accelerates training. We reconceptualize this paradigm by regularizing diffusion models with their own intermediate representations. Building on the observation that representation quality varies across diffusion model layers, we show that the most semantically rich representations can act as an intrinsic guidance for weaker ones, reducing the need for external supervision. Our approach, LayerSync, is a self-sufficient, plug-and-play regularizer term with no overhead on diffusion model training and generalizes beyond the visual domain to other modalities. LayerSync requires no pretrained models nor additional data. We extensively evaluate the method on image generation and demonstrate its applicability to other domains such as audio, video, and motion generation. We show that it consistently improves the generation quality and the training efficiency. For example, we speed up the training of flow-based transformer by over 8.75x on ImageNet dataset and improved the generation quality by 23.6%. The code is available at this https URL.

34. 【2510.12579】Unlocking Zero-Shot Plant Segmentation with Pl@ntNet Intelligence

链接https://arxiv.org/abs/2510.12579

作者:Simon Ravé,Jean-Christophe Lombardo,Pejman Rasti,Alexis Joly,David Rousseau

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:large-scale plant classification, zero-shot segmentation approach, leverages Plantnet, Plantnet specialized plant, plant classification model

备注

点击查看摘要

Abstract:We present a zero-shot segmentation approach for agricultural imagery that leverages Plantnet, a large-scale plant classification model, in conjunction with its DinoV2 backbone and the Segment Anything Model (SAM). Rather than collecting and annotating new datasets, our method exploits Plantnet's specialized plant representations to identify plant regions and produce coarse segmentation masks. These masks are then refined by SAM to yield detailed segmentations. We evaluate on four publicly available datasets of various complexity in terms of contrast including some where the limited size of the training data and complex field conditions often hinder purely supervised methods. Our results show consistent performance gains when using Plantnet-fine-tuned DinoV2 over the base DinoV2 model, as measured by the Jaccard Index (IoU). These findings highlight the potential of combining foundation models with specialized plant-centric models to alleviate the annotation bottleneck and enable effective segmentation in diverse agricultural scenarios.

35. 【2510.12573】Learning Human Motion with Temporally Conditional Mamba

链接https://arxiv.org/abs/2510.12573

作者:Quang Nguyen,Tri Le,Baoru Huang,Minh Nhat Vu,Ngan Le,Thieu Vo,Anh Nguyen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:time-dependent input signal, input signal presents, Learning human motion, Learning human, signal presents

备注: 10 pages

点击查看摘要

Abstract:Learning human motion based on a time-dependent input signal presents a challenging yet impactful task with various applications. The goal of this task is to generate or estimate human movement that consistently reflects the temporal patterns of conditioning inputs. Existing methods typically rely on cross-attention mechanisms to fuse the condition with motion. However, this approach primarily captures global interactions and struggles to maintain step-by-step temporal alignment. To address this limitation, we introduce Temporally Conditional Mamba, a new mamba-based model for human motion generation. Our approach integrates conditional information into the recurrent dynamics of the Mamba block, enabling better temporally aligned motion. To validate the effectiveness of our method, we evaluate it on a variety of human motion tasks. Extensive experiments demonstrate that our model significantly improves temporal alignment, motion realism, and condition consistency over state-of-the-art approaches. Our project page is available at this https URL.

36. 【2510.12565】MMOT: The First Challenging Benchmark for Drone-based Multispectral Multi-Object Tracking

链接https://arxiv.org/abs/2510.12565

作者:Tianhao Li,Tingfa Xu,Ying Wang,Haolin Qin,Xu Lin,Jianan Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:highly challenging due, cluttered backgrounds, essential yet highly, drone-based multispectral multi-object, multispectral multi-object tracking

备注

点击查看摘要

Abstract:Drone-based multi-object tracking is essential yet highly challenging due to small targets, severe occlusions, and cluttered backgrounds. Existing RGB-based tracking algorithms heavily depend on spatial appearance cues such as color and texture, which often degrade in aerial views, compromising reliability. Multispectral imagery, capturing pixel-level spectral reflectance, provides crucial cues that enhance object discriminability under degraded spatial conditions. However, the lack of dedicated multispectral UAV datasets has hindered progress in this domain. To bridge this gap, we introduce MMOT, the first challenging benchmark for drone-based multispectral multi-object tracking. It features three key characteristics: (i) Large Scale - 125 video sequences with over 488.8K annotations across eight categories; (ii) Comprehensive Challenges - covering diverse conditions such as extreme small targets, high-density scenarios, severe occlusions, and complex motion; and (iii) Precise Oriented Annotations - enabling accurate localization and reduced ambiguity under aerial perspectives. To better extract spectral features and leverage oriented annotations, we further present a multispectral and orientation-aware MOT scheme adapting existing methods, featuring: (i) a lightweight Spectral 3D-Stem integrating spectral features while preserving compatibility with RGB pretraining; (ii) an orientation-aware Kalman filter for precise state estimation; and (iii) an end-to-end orientation-adaptive transformer. Extensive experiments across representative trackers consistently show that multispectral input markedly improves tracking performance over RGB baselines, particularly for small and densely packed objects. We believe our work will advance drone-based multispectral multi-object tracking research. Our MMOT, code, and benchmarks are publicly available at this https URL.

37. 【2510.12560】CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving

链接https://arxiv.org/abs/2510.12560

作者:Xiaoji Zheng,Ziyuan Yang,Yanhao Chen,Yuhang Peng,Yuanrong Tang,Gengyuan Liu,Bokui Chen,Jiangtao Gong

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:autonomous driving models, driving models trained, models trained solely, autonomous driving, driving models

备注: 18 pages, 17 figures

点击查看摘要

Abstract:End-to-end autonomous driving models trained solely with imitation learning (IL) often suffer from poor generalization. In contrast, reinforcement learning (RL) promotes exploration through reward maximization but faces challenges such as sample inefficiency and unstable convergence. A natural solution is to combine IL and RL. Moving beyond the conventional two-stage paradigm (IL pretraining followed by RL fine-tuning), we propose CoIRL-AD, a competitive dual-policy framework that enables IL and RL agents to interact during training. CoIRL-AD introduces a competition-based mechanism that facilitates knowledge exchange while preventing gradient conflicts. Experiments on the nuScenes dataset show an 18% reduction in collision rate compared to baselines, along with stronger generalization and improved performance on long-tail scenarios. Code is available at: this https URL.

38. 【2510.12548】VISaGE: Understanding Visual Generics and Exceptions

链接https://arxiv.org/abs/2510.12548

作者:Stella Frank,Emily Allaway

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Language Models, Vision Language, Language Models, learn conceptual representations, generalized knowledge

备注: EMNLP 2025

点击查看摘要

Abstract:While Vision Language Models (VLMs) learn conceptual representations, in the form of generalized knowledge, during training, they are typically used to analyze individual instances. When evaluation instances are atypical, this paradigm results in tension between two priors in the model. The first is a pragmatic prior that the textual and visual input are both relevant, arising from VLM finetuning on congruent inputs; the second is a semantic prior that the conceptual representation is generally true for instances of the category. In order to understand how VLMs trade off these priors, we introduce a new evaluation dataset, VISaGE, consisting of both typical and exceptional images. In carefully balanced experiments, we show that conceptual understanding degrades when the assumption of congruency underlying the pragmatic prior is violated with incongruent images. This effect is stronger than the effect of the semantic prior when querying about individual instances.

39. 【2510.12537】Unconditional Human Motion and Shape Generation via Balanced Score-Based Diffusion

链接https://arxiv.org/abs/2510.12537

作者:David Björkstrand,Tiesheng Wang,Lars Bretzner,Josephine Sullivan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Generative Adversarial Networks, including Variational Autoencoders, Variational Autoencoders, Generative Adversarial, Adversarial Networks

备注

点击查看摘要

Abstract:Recent work has explored a range of model families for human motion generation, including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion-based models. Despite their differences, many methods rely on over-parameterized input features and auxiliary losses to improve empirical results. These strategies should not be strictly necessary for diffusion models to match the human motion distribution. We show that on par with state-of-the-art results in unconditional human motion generation are achievable with a score-based diffusion model using only careful feature-space normalization and analytically derived weightings for the standard L2 score-matching loss, while generating both motion and shape directly, thereby avoiding slow post hoc shape recovery from joints. We build the method step by step, with a clear theoretical motivation for each component, and provide targeted ablations demonstrating the effectiveness of each proposed addition in isolation.

40. 【2510.12524】Voronoi-Assisted Diffusion for Computing Unsigned Distance Fields from Unoriented Points

链接https://arxiv.org/abs/2510.12524

作者:Jiayi Kong,Chen Zong,Junkai Deng,Xuhui Chen,Fei Hou,Shiqing Xin,Junhui Hou,Chen Qian,Ying He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Unsigned Distance Fields, Unsigned Distance, Distance Fields, provide a flexible, shapes with arbitrary

备注

点击查看摘要

Abstract:Unsigned Distance Fields (UDFs) provide a flexible representation for 3D shapes with arbitrary topology, including open and closed surfaces, orientable and non-orientable geometries, and non-manifold structures. While recent neural approaches have shown promise in learning UDFs, they often suffer from numerical instability, high computational cost, and limited controllability. We present a lightweight, network-free method, Voronoi-Assisted Diffusion (VAD), for computing UDFs directly from unoriented point clouds. Our approach begins by assigning bi-directional normals to input points, guided by two Voronoi-based geometric criteria encoded in an energy function for optimal alignment. The aligned normals are then diffused to form an approximate UDF gradient field, which is subsequently integrated to recover the final UDF. Experiments demonstrate that VAD robustly handles watertight and open surfaces, as well as complex non-manifold and non-orientable geometries, while remaining computationally efficient and stable.

41. 【2510.12493】BSGS: Bi-stage 3D Gaussian Splatting for Camera Motion Deblurring

链接https://arxiv.org/abs/2510.12493

作者:An Zhao,Piaopiao Yu,Zhe Zhu,Mingqiang Wei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:http URL, motion-blurred images caused, http URL performance, http URL solve, Gaussian Splatting

备注

点击查看摘要

Abstract:3D Gaussian Splatting has exhibited remarkable capabilities in 3D scene this http URL, reconstructing high-quality 3D scenes from motion-blurred images caused by camera motion poses a significant this http URL performance of existing 3DGS-based deblurring methods are limited due to their inherent mechanisms, such as extreme dependence on the accuracy of camera poses and inability to effectively control erroneous Gaussian primitives densification caused by motion this http URL solve these problems, we introduce a novel framework, Bi-Stage 3D Gaussian Splatting, to accurately reconstruct 3D scenes from motion-blurred this http URL contains two stages. First, Camera Pose Refinement roughly optimizes camera poses to reduce motion-induced distortions. Second, with fixed rough camera poses, Global RigidTransformation further corrects motion-induced blur this http URL alleviate multi-subframe gradient conflicts, we propose a subframe gradient aggregation strategy to optimize both this http URL, a space-time bi-stage optimization strategy is introduced to dynamically adjust primitive densification thresholds and prevent premature noisy Gaussian generation in blurred regions. Comprehensive experiments verify the effectiveness of our proposed deblurring method and show its superiority over the state of the arts.

42. 【2510.12483】Fast Visuomotor Policy for Robotic Manipulation

链接https://arxiv.org/abs/2510.12483

作者:Jingkai Jia,Tong Yang,Xueyao Chen,Chenhuan Liu,Wenqiang Zhang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:named Energy Policy, Energy Policy, effective policy framework, designed for high-frequency, resource-constrained systems

备注

点击查看摘要

Abstract:We present a fast and effective policy framework for robotic manipulation, named Energy Policy, designed for high-frequency robotic tasks and resource-constrained systems. Unlike existing robotic policies, Energy Policy natively predicts multimodal actions in a single forward pass, enabling high-precision manipulation at high speed. The framework is built upon two core components. First, we adopt the energy score as the learning objective to facilitate multimodal action modeling. Second, we introduce an energy MLP to implement the proposed objective while keeping the architecture simple and efficient. We conduct comprehensive experiments in both simulated environments and real-world robotic tasks to evaluate the effectiveness of Energy Policy. The results show that Energy Policy matches or surpasses the performance of state-of-the-art manipulation methods while significantly reducing computational overhead. Notably, on the MimicGen benchmark, Energy Policy achieves superior performance with at a faster inference compared to existing approaches.

43. 【2510.12482】A Text-Image Fusion Method with Data Augmentation Capabilities for Referring Medical Image Segmentation

链接https://arxiv.org/abs/2510.12482

作者:Shurong Chai,Rahul Kumar JAIN,Rui Xu,Shaocong Mo,Ruibo Hou,Shiyu Teng,Jiaqing Liu,Lanfen Lin,Yen-Wei Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Deep learning relies, mitigate limited data, learning relies heavily, Deep learning, limited data

备注

点击查看摘要

Abstract:Deep learning relies heavily on data augmentation to mitigate limited data, especially in medical imaging. Recent multimodal learning integrates text and images for segmentation, known as referring or text-guided image segmentation. However, common augmentations like rotation and flipping disrupt spatial alignment between image and text, weakening performance. To address this, we propose an early fusion framework that combines text and visual features before augmentation, preserving spatial consistency. We also design a lightweight generator that projects text embeddings into visual space, bridging semantic gaps. Visualization of generated pseudo-images shows accurate region localization. Our method is evaluated on three medical imaging tasks and four segmentation frameworks, achieving state-of-the-art results. Code is publicly available on GitHub: this https URL.

44. 【2510.12468】MS-GAGA: Metric-Selective Guided Adversarial Generation Attack

链接https://arxiv.org/abs/2510.12468

作者:Dion J. X. Ho,Gabriel Lee Jun Rong,Niharika Shrivastava,Harshavardhan Abichandani,Pai Chet Ng,Xiaoxiao Miao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Guided Adversarial Generation, Metric-Selective Guided Adversarial, Metric-Selective Guided, Adversarial Generation Attack, Generation Attack

备注

点击查看摘要

Abstract:We present MS-GAGA (Metric-Selective Guided Adversarial Generation Attack), a two-stage framework for crafting transferable and visually imperceptible adversarial examples against deepfake detectors in black-box settings. In Stage 1, a dual-stream attack module generates adversarial candidates: MNTD-PGD applies enhanced gradient calculations optimized for small perturbation budgets, while SG-PGD focuses perturbations on visually salient regions. This complementary design expands the adversarial search space and improves transferability across unseen models. In Stage 2, a metric-aware selection module evaluates candidates based on both their success against black-box models and their structural similarity (SSIM) to the original image. By jointly optimizing transferability and imperceptibility, MS-GAGA achieves up to 27% higher misclassification rates on unseen detectors compared to state-of-the-art attacks.

45. 【2510.12451】A Function Centric Perspective On Flat and Sharp Minima

链接https://arxiv.org/abs/2510.12451

作者:Israel Mason-Williams,Gabryel Mason-Williams,Helen Yannakoudakis

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:deep neural networks, neural networks, Flat minima, widely believed, believed to correlate

备注: 26 pages, 26 tables, 63 figures, pre-print

点击查看摘要

Abstract:Flat minima are widely believed to correlate with improved generalisation in deep neural networks. However, this connection has proven more nuanced in recent studies, with both theoretical counterexamples and empirical exceptions emerging in the literature. In this paper, we revisit the role of sharpness in model performance, proposing that sharpness is better understood as a function-dependent property rather than a reliable indicator of poor generalisation. We conduct extensive empirical studies, from single-objective optimisation to modern image classification tasks, showing that sharper minima often emerge when models are regularised (e.g., via SAM, weight decay, or data augmentation), and that these sharp minima can coincide with better generalisation, calibration, robustness, and functional consistency. Across a range of models and datasets, we find that baselines without regularisation tend to converge to flatter minima yet often perform worse across all safety metrics. Our findings demonstrate that function complexity, rather than flatness alone, governs the geometry of solutions, and that sharper minima can reflect more appropriate inductive biases (especially under regularisation), calling for a function-centric reappraisal of loss landscape geometry.

46. 【2510.12444】A Review of Longitudinal Radiology Report Generation: Dataset Composition, Methods, and Performance Evaluation

链接https://arxiv.org/abs/2510.12444

作者:Shaoyang Zhou,Yingshu Li,Yunyi Liu,Lingqiao Liu,Lei Wang,Luping Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Chest Xray imaging, high utilization creates, utilization creates substantial, creates substantial workloads, Chest Xray

备注

点击查看摘要

Abstract:Chest Xray imaging is a widely used diagnostic tool in modern medicine, and its high utilization creates substantial workloads for radiologists. To alleviate this burden, vision language models are increasingly applied to automate Chest Xray radiology report generation (CXRRRG), aiming for clinically accurate descriptions while reducing manual effort. Conventional approaches, however, typically rely on single images, failing to capture the longitudinal context necessary for producing clinically faithful comparison statements. Recently, growing attention has been directed toward incorporating longitudinal data into CXR RRG, enabling models to leverage historical studies in ways that mirror radiologists diagnostic workflows. Nevertheless, existing surveys primarily address single image CXRRRG and offer limited guidance for longitudinal settings, leaving researchers without a systematic framework for model design. To address this gap, this survey provides the first comprehensive review of longitudinal radiology report generation (LRRG). Specifically, we examine dataset construction strategies, report generation architectures alongside longitudinally tailored designs, and evaluation protocols encompassing both longitudinal specific measures and widely used benchmarks. We further summarize LRRG methods performance, alongside analyses of different ablation studies, which collectively highlight the critical role of longitudinal information and architectural design choices in improving model performance. Finally, we summarize five major limitations of current research and outline promising directions for future development, aiming to lay a foundation for advancing this emerging field.

47. 【2510.12422】VideoLucy: Deep Memory Backtracking for Long Video Understanding

链接https://arxiv.org/abs/2510.12422

作者:Jialong Zuo,Yongtai Deng,Lingdong Kong,Jingkang Yang,Rui Jin,Yiwei Zhang,Nong Sang,Liang Pan,Ziwei Liu,Changxin Gao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:leveraging large language, Recent studies, systems leveraging large, key information retrieval, long video understanding

备注: NeurIPS-2025 Accepted Paper

点击查看摘要

Abstract:Recent studies have shown that agent-based systems leveraging large language models (LLMs) for key information retrieval and integration have emerged as a promising approach for long video understanding. However, these systems face two major challenges. First, they typically perform modeling and reasoning on individual frames, struggling to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding crucial information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human recollection process from coarse to fine, VideoLucy employs a hierarchical memory structure with progressive granularity. This structure explicitly defines the detail level and temporal scope of memory at different hierarchical depths. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model's ability to understand complex events that unfold over time and capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance even surpassing the latest proprietary models such as GPT-4o. Our code and dataset will be made publicly at this https URL

48. 【2510.12408】Low-Field Magnetic Resonance Image Quality Enhancement using a Conditional Flow Matching Model

链接https://arxiv.org/abs/2510.12408

作者:Huu Tien Nguyen,Ahmed Karam Eldaly

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:conditional flow matching, quality transfer based, paper introduces, transfer based, based on conditional

备注

点击查看摘要

Abstract:This paper introduces a novel framework for image quality transfer based on conditional flow matching (CFM). Unlike conventional generative models that rely on iterative sampling or adversarial objectives, CFM learns a continuous flow between a noise distribution and target data distributions through the direct regression of an optimal velocity field. We evaluate this approach in the context of low-field magnetic resonance imaging (LF-MRI), a rapidly emerging modality that offers affordable and portable scanning but suffers from inherently low signal-to-noise ratio and reduced diagnostic quality. Our framework is designed to reconstruct high-field-like MR images from their corresponding low-field inputs, thereby bridging the quality gap without requiring expensive infrastructure. Experiments demonstrate that CFM not only achieves state-of-the-art performance, but also generalizes robustly to both in-distribution and out-of-distribution data. Importantly, it does so while utilizing significantly fewer parameters than competing deep learning methods. These results underline the potential of CFM as a powerful and scalable tool for MRI reconstruction, particularly in resource-limited clinical environments.

49. 【2510.12400】owards General Urban Monitoring with Vision-Language Models: A Review, Evaluation, and a Research Agenda

链接https://arxiv.org/abs/2510.12400

作者:André Torneiro,Diogo Monteiro,Paulo Novais,Pedro Rangel Henriques,Nuno F. Rodrigues

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:contextual conditions involved, poses significant challenges, significant challenges due, road signs, waste bins

备注: 44 pages

点击查看摘要

Abstract:Urban monitoring of public infrastructure (such as waste bins, road signs, vegetation, sidewalks, and construction sites) poses significant challenges due to the diversity of objects, environments, and contextual conditions involved. Current state-of-the-art approaches typically rely on a combination of IoT sensors and manual inspections, which are costly, difficult to scale, and often misaligned with citizens' perception formed through direct visual observation. This raises a critical question: Can machines now "see" like citizens and infer informed opinions about the condition of urban infrastructure? Vision-Language Models (VLMs), which integrate visual understanding with natural language reasoning, have recently demonstrated impressive capabilities in processing complex visual information, turning them into a promising technology to address this challenge. This systematic review investigates the role of VLMs in urban monitoring, with particular emphasis on zero-shot applications. Following the PRISMA methodology, we analyzed 32 peer-reviewed studies published between 2021 and 2025 to address four core research questions: (1) What urban monitoring tasks have been effectively addressed using VLMs? (2) Which VLM architectures and frameworks are most commonly used and demonstrate superior performance? (3) What datasets and resources support this emerging field? (4) How are VLM-based applications evaluated, and what performance levels have been reported?

50. 【2510.12387】Scene Coordinate Reconstruction Priors

链接https://arxiv.org/abs/2510.12387

作者:Wenjing Bian,Axel Barroso-Laguna,Tommaso Cavallari,Victor Adrian Prisacariu,Eric Brachmann

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:SCR models, powerful implicit scene, enabling visual relocalization, enabling visual, SCR

备注: ICCV 2025, Project page: [this https URL](https://nianticspatial.github.io/scr-priors/)

点击查看摘要

Abstract:Scene coordinate regression (SCR) models have proven to be powerful implicit scene representations for 3D vision, enabling visual relocalization and structure-from-motion. SCR models are trained specifically for one scene. If training images imply insufficient multi-view constraints SCR models degenerate. We present a probabilistic reinterpretation of training SCR models, which allows us to infuse high-level reconstruction priors. We investigate multiple such priors, ranging from simple priors over the distribution of reconstructed depth values to learned priors over plausible scene coordinate configurations. For the latter, we train a 3D point cloud diffusion model on a large corpus of indoor scans. Our priors push predicted 3D scene points towards plausible geometry at each training step to increase their likelihood. On three indoor datasets our priors help learning better scene representations, resulting in more coherent scene point clouds, higher registration rates and better camera poses, with a positive effect on down-stream tasks such as novel view synthesis and camera relocalization.

51. 【2510.12385】Learning to Recognize Correctly Completed Procedure Steps in Egocentric Assembly Videos through Spatio-Temporal Modeling

链接https://arxiv.org/abs/2510.12385

作者:Tim J. Schoonbeek,Shao-Hsuan Hung,Dan Lehman,Hans Onvlee,Jacek Kustra,Peter H.N. de With,Fons van der Sommen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Procedure step recognition, aims to identify, procedural tasks, Procedure step, identify all correctly

备注: 26 pages, 7 figures and 5 tables in the main paper and one figure and table in the appendix. To be published in Computer Vision and Image Understanding

点击查看摘要

Abstract:Procedure step recognition (PSR) aims to identify all correctly completed steps and their sequential order in videos of procedural tasks. The existing state-of-the-art models rely solely on detecting assembly object states in individual video frames. By neglecting temporal features, model robustness and accuracy are limited, especially when objects are partially occluded. To overcome these limitations, we propose Spatio-Temporal Occlusion-Resilient Modeling for Procedure Step Recognition (STORM-PSR), a dual-stream framework for PSR that leverages both spatial and temporal features. The assembly state detection stream operates effectively with unobstructed views of the object, while the spatio-temporal stream captures both spatial and temporal features to recognize step completions even under partial occlusion. This stream includes a spatial encoder, pre-trained using a novel weakly supervised approach to capture meaningful spatial representations, and a transformer-based temporal encoder that learns how these spatial features relate over time. STORM-PSR is evaluated on the MECCANO and IndustReal datasets, reducing the average delay between actual and predicted assembly step completions by 11.2% and 26.1%, respectively, compared to prior methods. We demonstrate that this reduction in delay is driven by the spatio-temporal stream, which does not rely on unobstructed views of the object to infer completed steps. The code for STORM-PSR, along with the newly annotated MECCANO labels, is made publicly available at this https URL .

52. 【2510.12376】Deep Attention-guided Adaptive Subsampling

链接https://arxiv.org/abs/2510.12376

作者:Sharath M Shankaranarayana,Soumava Kumar Roy,Prasad Sudhakar,Chandan Aladahalli

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:provided impressive gains, increased computational complexity, provided impressive, cost of increased, increased computational

备注

点击查看摘要

Abstract:Although deep neural networks have provided impressive gains in performance, these improvements often come at the cost of increased computational complexity and expense. In many cases, such as 3D volume or video classification tasks, not all slices or frames are necessary due to inherent redundancies. To address this issue, we propose a novel learnable subsampling framework that can be integrated into any neural network architecture. Subsampling, being a nondifferentiable operation, poses significant challenges for direct adaptation into deep learning models. While some works, have proposed solutions using the Gumbel-max trick to overcome the problem of non-differentiability, they fall short in a crucial aspect: they are only task-adaptive and not inputadaptive. Once the sampling mechanism is learned, it remains static and does not adjust to different inputs, making it unsuitable for real-world applications. To this end, we propose an attention-guided sampling module that adapts to inputs even during inference. This dynamic adaptation results in performance gains and reduces complexity in deep neural network models. We demonstrate the effectiveness of our method on 3D medical imaging datasets from MedMNIST3D as well as two ultrasound video datasets for classification tasks, one of them being a challenging in-house dataset collected under real-world clinical conditions.

53. 【2510.12362】CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion

链接https://arxiv.org/abs/2510.12362

作者:Jinzhou Lin,Jie Zhou,Wenhao Xu,Rongtao Xu,Changwei Wang,Shunpeng Chen,Kexue Fu,Yihua Shao,Li Guo,Shibiao Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Semantic Scene Completion, aims to infer, infer complete, monocular images, autonomous driving

备注

点击查看摘要

Abstract:Semantic Scene Completion (SSC) aims to infer complete 3D geometry and semantics from monocular images, serving as a crucial capability for camera-based perception in autonomous driving. However, existing SSC methods relying on temporal stacking or depth projection often lack explicit motion reasoning and struggle with occlusions and noisy depth supervision. We propose CurriFlow, a novel semantic occupancy prediction framework that integrates optical flow-based temporal alignment with curriculum-guided depth fusion. CurriFlow employs a multi-level fusion strategy to align segmentation, visual, and depth features across frames using pre-trained optical flow, thereby improving temporal consistency and dynamic object understanding. To enhance geometric robustness, a curriculum learning mechanism progressively transitions from sparse yet accurate LiDAR depth to dense but noisy stereo depth during training, ensuring stable optimization and seamless adaptation to real-world deployment. Furthermore, semantic priors from the Segment Anything Model (SAM) provide category-agnostic supervision, strengthening voxel-level semantic learning and spatial consistency. Experiments on the SemanticKITTI benchmark demonstrate that CurriFlow achieves state-of-the-art performance with a mean IoU of 16.9, validating the effectiveness of our motion-guided and curriculum-aware design for camera-based 3D semantic scene completion.

54. 【2510.12308】Hybrid Gaussian Splatting for Novel Urban View Synthesis

链接https://arxiv.org/abs/2510.12308

作者:Mohamed Omran,Farhad Zanjani,Davide Abati,Jens Petersen,Amirhossein Habibian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Workshop at ICCV, Qualcomm AI Research, describes the Qualcomm, RealADSim Workshop, Research solution

备注: ICCV 2025 RealADSim Workshop

点击查看摘要

Abstract:This paper describes the Qualcomm AI Research solution to the RealADSim-NVS challenge, hosted at the RealADSim Workshop at ICCV 2025. The challenge concerns novel view synthesis in street scenes, and participants are required to generate, starting from car-centric frames captured during some training traversals, renders of the same urban environment as viewed from a different traversal (e.g. different street lane or car direction). Our solution is inspired by hybrid methods in scene generation and generative simulators merging gaussian splatting and diffusion models, and it is composed of two stages: First, we fit a 3D reconstruction of the scene and render novel views as seen from the target cameras. Then, we enhance the resulting frames with a dedicated single-step diffusion model. We discuss specific choices made in the initialization of gaussian primitives as well as the finetuning of the enhancer model and its training data curation. We report the performance of our model design and we ablate its components in terms of novel view quality as measured by PSNR, SSIM and LPIPS. On the public leaderboard reporting test results, our proposal reaches an aggregated score of 0.432, achieving the second place overall.

55. 【2510.12287】Vision Language Models Map Logos to Text via Semantic Entanglement in the Visual Projector

链接https://arxiv.org/abs/2510.12287

作者:Sifan Li,Hongkai Chen,Yujun Cai,Qingwen Ye,Liyang Chen,Junsong Yuan,Yiwei Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Vision Language Models, Vision Language, achieved impressive progress, Language Models, visual evidence

备注

点击查看摘要

Abstract:Vision Language Models (VLMs) have achieved impressive progress in multimodal reasoning; yet, they remain vulnerable to hallucinations, where outputs are not grounded in visual evidence. In this paper, we investigate a previously overlooked setting: logo hallucination, where models generate brand names or textual content despite logos containing no visible words. Using curated splits of pure symbols, hybrids, and text-bearing logos, as well as the challenging Hard-60 subset, we systematically measure hallucination across leading VLMs. We further probe robustness through nine structured perturbations and show that hallucinations persist even under strong distortions, with occlusion exposing the sharpest weaknesses. Embedding-level analysis with open-weight LLaVA demonstrates that hallucination is tied to a small subset of projector dimensions, and targeted ablation substantially reduces errors while preserving OCR accuracy. Together, these findings reveal that VLMs often rely on symbolic priors rather than genuine glyph perception, particularly for iconic circular logos, and that projector subspaces play a decisive role in this failure mode. Our work contributes both a novel diagnostic lens and actionable mitigation insights, highlighting projector disentanglement and OCR-guided decoding as promising directions for building more trustworthy multimodal systems.

56. 【2510.12283】Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval

链接https://arxiv.org/abs/2510.12283

作者:Jianfeng Dong,Lei Huang,Daizong Liu,Xianke Chen,Xun Yang,Changting Lin,Xun Wang,Meng Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:works ideally assume, solely text-related content, retrieval works ideally, Relevant Video Retrieval, works ideally

备注

点击查看摘要

Abstract:Almost all previous text-to-video retrieval works ideally assume that videos are pre-trimmed with short durations containing solely text-related content. However, in practice, videos are typically untrimmed in long durations with much more complicated background content. Therefore, in this paper, we focus on the more practical yet challenging task of Partially Relevant Video Retrieval (PRVR), which aims to retrieve partially relevant untrimmed videos with the given query. To tackle this task, we propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model and transfers it to a lightweight, task-specific PRVR network. Specifically, we introduce a Dual Learning framework with Dynamic Knowledge Distillation (DL-DKD++), where a large teacher model provides supervision to a compact dual-branch student network. The student model comprises two branches: an inheritance branch that absorbs transferable knowledge from the teacher, and an exploration branch that learns task-specific information from the PRVR dataset to address domain gaps. To further enhance learning, we incorporate a dynamic soft-target construction mechanism. By replacing rigid hard-target supervision with adaptive soft targets that evolve during training, our method enables the model to better capture the fine-grained, partial relevance between videos and queries. Experiment results demonstrate that our proposed model achieves state-of-the-art performance on TVR, ActivityNet, and Charades-STA datasets for PRVR. The code is available at this https URL.

57. 【2510.12282】PAGS: Priority-Adaptive Gaussian Splatting for Dynamic Driving Scenes

链接https://arxiv.org/abs/2510.12282

作者:Ying A,Wenzhang Sun,Chang Zeng,Chunfeng Wang,Hao Li,Jianxun Cui

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:current methods face, Reconstructing dynamic, autonomous driving, computational cost, crucial for autonomous

备注

点击查看摘要

Abstract:Reconstructing dynamic 3D urban scenes is crucial for autonomous driving, yet current methods face a stark trade-off between fidelity and computational cost. This inefficiency stems from their semantically agnostic design, which allocates resources uniformly, treating static backgrounds and safety-critical objects with equal importance. To address this, we introduce Priority-Adaptive Gaussian Splatting (PAGS), a framework that injects task-aware semantic priorities directly into the 3D reconstruction and rendering pipeline. PAGS introduces two core contributions: (1) Semantically-Guided Pruning and Regularization strategy, which employs a hybrid importance metric to aggressively simplify non-critical scene elements while preserving fine-grained details on objects vital for navigation. (2) Priority-Driven Rendering pipeline, which employs a priority-based depth pre-pass to aggressively cull occluded primitives and accelerate the final shading computations. Extensive experiments on the Waymo and KITTI datasets demonstrate that PAGS achieves exceptional reconstruction quality, particularly on safety-critical objects, while significantly reducing training time and boosting rendering speeds to over 350 FPS.

58. 【2510.12267】SpineBench: Benchmarking Multimodal LLMs for Spinal Pathology Analysis

链接https://arxiv.org/abs/2510.12267

作者:Chenghanyu Zhang,Zekun Li,Peipei Li,Xing Cui,Shuhan Xia,Weixiang Yan,Yiqiao Zhang,Qianyu Zhuang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, integration of Multimodal

备注: Proceedings of the 33rd ACM International Conference on Multimedia,ACMMM 2025 Dataset Track

点击查看摘要

Abstract:With the increasing integration of Multimodal Large Language Models (MLLMs) into the medical field, comprehensive evaluation of their performance in various medical domains becomes critical. However, existing benchmarks primarily assess general medical tasks, inadequately capturing performance in nuanced areas like the spine, which relies heavily on visual input. To address this, we introduce SpineBench, a comprehensive Visual Question Answering (VQA) benchmark designed for fine-grained analysis and evaluation of MLLMs in the spinal domain. SpineBench comprises 64,878 QA pairs from 40,263 spine images, covering 11 spinal diseases through two critical clinical tasks: spinal disease diagnosis and spinal lesion localization, both in multiple-choice format. SpineBench is built by integrating and standardizing image-label pairs from open-source spinal disease datasets, and samples challenging hard negative options for each VQA pair based on visual similarity (similar but not the same disease), simulating real-world challenging scenarios. We evaluate 12 leading MLLMs on SpineBench. The results reveal that these models exhibit poor performance in spinal tasks, highlighting limitations of current MLLM in the spine domain and guiding future improvements in spinal medicine applications. SpineBench is publicly available at this https URL.

59. 【2510.12260】AngularFuse: A Closer Look at Angle-based Perception for Spatial-Sensitive Multi-Modality Image Fusion

链接https://arxiv.org/abs/2510.12260

作者:Xiaopeng Liu,Yupei Lin,Sen Zhang,Xiao Wang,Yukai Shi,Liang Lin

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

关键词:Visible-infrared image fusion, Visible-infrared image, nighttime surveillance, crucial in key, autonomous driving

备注: For the first time, angle-based perception was introduced into the multi-modality image fusion task

点击查看摘要

Abstract:Visible-infrared image fusion is crucial in key applications such as autonomous driving and nighttime surveillance. Its main goal is to integrate multimodal information to produce enhanced images that are better suited for downstream tasks. Although deep learning based fusion methods have made significant progress, mainstream unsupervised approaches still face serious challenges in practical applications. Existing methods mostly rely on manually designed loss functions to guide the fusion process. However, these loss functions have obvious limitations. On one hand, the reference images constructed by existing methods often lack details and have uneven brightness. On the other hand, the widely used gradient losses focus only on gradient magnitude. To address these challenges, this paper proposes an angle-based perception framework for spatial-sensitive image fusion (AngularFuse). At first, we design a cross-modal complementary mask module to force the network to learn complementary information between modalities. Then, a fine-grained reference image synthesis strategy is introduced. By combining Laplacian edge enhancement with adaptive histogram equalization, reference images with richer details and more balanced brightness are generated. Last but not least, we introduce an angle-aware loss, which for the first time constrains both gradient magnitude and direction simultaneously in the gradient domain. AngularFuse ensures that the fused images preserve both texture intensity and correct edge orientation. Comprehensive experiments on the MSRS, RoadScene, and M3FD public datasets show that AngularFuse outperforms existing mainstream methods with clear margin. Visual comparisons further confirm that our method produces sharper and more detailed results in challenging scenes, demonstrating superior fusion capability.

60. 【2510.12259】Local Background Features Matter in Out-of-Distribution Detection

链接https://arxiv.org/abs/2510.12259

作者:Jinlun Ye,Zhuohao Sun,Yiqiao Qiu,Qiu Li,Zhijun Tan,Ruixuan Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:deploying deep neural, OOD detection, OOD, deep neural networks, OOD detection method

备注

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is crucial when deploying deep neural networks in the real world to ensure the reliability and safety of their applications. One main challenge in OOD detection is that neural network models often produce overconfident predictions on OOD data. While some methods using auxiliary OOD datasets or generating fake OOD images have shown promising OOD detection performance, they are limited by the high costs of data collection and training. In this study, we propose a novel and effective OOD detection method that utilizes local background features as fake OOD features for model training. Inspired by the observation that OOD images generally share similar background regions with ID images, the background features are extracted from ID images as simulated OOD visual representations during training based on the local invariance of convolution. Through being optimized to reduce the $L_2$-norm of these background features, the neural networks are able to alleviate the overconfidence issue on OOD data. Extensive experiments on multiple standard OOD detection benchmarks confirm the effectiveness of our method and its wide combinatorial compatibility with existing post-hoc methods, with new state-of-the-art performance achieved from our method.

61. 【2510.12258】Multiplicative Loss for Enhancing Semantic Segmentation in Medical and Cellular Images

链接https://arxiv.org/abs/2510.12258

作者:Yuto Yokoi,Kazuhiro Hotta

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multiplicative Loss, Confidence-Adaptive Multiplicative Loss, Cross Entropy, Entropy and Dice, loss

备注: Accepted by ICCV2025 Workshop "Third Workshop on Computer Vision for Automated Medical Diagnosis"

点击查看摘要

Abstract:We propose two novel loss functions, Multiplicative Loss and Confidence-Adaptive Multiplicative Loss, for semantic segmentation in medical and cellular images. Although Cross Entropy and Dice Loss are widely used, their additive combination is sensitive to hyperparameters and often performs suboptimally, especially with limited data. Medical images suffer from data scarcity due to privacy, ethics, and costly annotations, requiring robust and efficient training objectives. Our Multiplicative Loss combines Cross Entropy and Dice losses multiplicatively, dynamically modulating gradients based on prediction confidence. This reduces penalties for confident correct predictions and amplifies gradients for incorrect overconfident ones, stabilizing optimization. Building on this, Confidence-Adaptive Multiplicative Loss applies a confidence-driven exponential scaling inspired by Focal Loss, integrating predicted probabilities and Dice coefficients to emphasize difficult samples. This enhances learning under extreme data scarcity by strengthening gradients when confidence is low. Experiments on cellular and medical segmentation benchmarks show our framework consistently outperforms tuned additive and existing loss functions, offering a simple, effective, and hyperparameter-free mechanism for robust segmentation under challenging data limitations.

62. 【2510.12256】Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding

链接https://arxiv.org/abs/2510.12256

作者:Ye Chen,Liming Tan,Yupeng Zhu,Yuanbin Wang,Bingbing Ni

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Current video representations, representations heavily rely, Current video, video representations heavily, pixel-level matching

备注

点击查看摘要

Abstract:Current video representations heavily rely on unstable and over-grained priors for motion and appearance modelling, \emph{i.e.}, pixel-level matching and tracking. A tracking error of just a few pixels would lead to the collapse of the visual object representation, not to mention occlusions and large motion frequently occurring in videos. To overcome the above mentioned vulnerability, this work proposes spatio-temporally consistent proxy nodes to represent dynamically changing objects/scenes in the video. On the one hand, the hierarchical proxy nodes have the ability to stably express the multi-scale structure of visual objects, so they are not affected by accumulated tracking error, long-term motion, occlusion, and viewpoint variation. On the other hand, the dynamic representation update mechanism of the proxy nodes adequately leverages spatio-temporal priors of the video to mitigate the impact of inaccurate trackers, thereby effectively handling drastic changes in scenes and objects. Additionally, the decoupled encoding manner of the shape and texture representations across different visual objects in the video facilitates controllable and fine-grained appearance editing capability. Extensive experiments demonstrate that the proposed representation achieves high video reconstruction accuracy with fewer parameters and supports complex video processing tasks, including video in-painting and keyframe-based temporally consistent video editing.

63. 【2510.12241】Ivan-ISTD: Rethinking Cross-domain Heteroscedastic Noise Perturbations in Infrared Small Target Detection

链接https://arxiv.org/abs/2510.12241

作者:Yuehui Li,Yahao Lu,Haoyuan Wu,Sen Zhang,Liang Lin,Yukai Shi

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:Infrared Small Target, Small Target Detection, Infrared Small, drone-based multi-modality sensing, Target Detection

备注: In infrared small target detection, noise from different sensors can cause significant interference to performance. We propose a new dataset and a wavelet-guided Invariance learning framework(Ivan-ISTD) to emphasize this issue

点击查看摘要

Abstract:In the multimedia domain, Infrared Small Target Detection (ISTD) plays a important role in drone-based multi-modality sensing. To address the dual challenges of cross-domain shift and heteroscedastic noise perturbations in ISTD, we propose a doubly wavelet-guided Invariance learning framework(Ivan-ISTD). In the first stage, we generate training samples aligned with the target domain using Wavelet-guided Cross-domain Synthesis. This wavelet-guided alignment machine accurately separates the target background through multi-frequency wavelet filtering. In the second stage, we introduce Real-domain Noise Invariance Learning, which extracts real noise characteristics from the target domain to build a dynamic noise library. The model learns noise invariance through self-supervised loss, thereby overcoming the limitations of distribution bias in traditional artificial noise modeling. Finally, we create the Dynamic-ISTD Benchmark, a cross-domain dynamic degradation dataset that simulates the distribution shifts encountered in real-world applications. Additionally, we validate the versatility of our method using other real-world datasets. Experimental results demonstrate that our approach outperforms existing state-of-the-art methods in terms of many quantitative metrics. In particular, Ivan-ISTD demonstrates excellent robustness in cross-domain scenarios. The code for this work can be found at: this https URL.

64. 【2510.12231】BIGFix: Bidirectional Image Generation with Token Fixing

链接https://arxiv.org/abs/2510.12231

作者:Victor Besnier,David Hurych,Andrei Bursuc,Eduardo Valle

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:raised significant interest, Recent advances, academia and industry, raised significant, significant interest

备注

点击查看摘要

Abstract:Recent advances in image and video generation have raised significant interest from both academia and industry. A key challenge in this field is improving inference efficiency, as model size and the number of inference steps directly impact the commercial viability of generative models while also posing fundamental scientific challenges. A promising direction involves combining auto-regressive sequential token modeling with multi-token prediction per step, reducing inference time by up to an order of magnitude. However, predicting multiple tokens in parallel can introduce structural inconsistencies due to token incompatibilities, as capturing complex joint dependencies during training remains challenging. Traditionally, once tokens are sampled, there is no mechanism to backtrack and refine erroneous predictions. We propose a method for self-correcting image generation by iteratively refining sampled tokens. We achieve this with a novel training scheme that injects random tokens in the context, improving robustness and enabling token fixing during sampling. Our method preserves the efficiency benefits of parallel token prediction while significantly enhancing generation quality. We evaluate our approach on image generation using the ImageNet-256 and CIFAR-10 datasets, as well as on video generation with UCF-101 and NuScenes, demonstrating substantial improvements across both modalities.

65. 【2510.12225】HoneyBee: Data Recipes for Vision-Language Reasoners

链接https://arxiv.org/abs/2510.12225

作者:Hritik Bansal,Devandra Singh Sachan,Kai-Wei Chang,Aditya Grover,Gargi Ghosh,Wen-tau Yih,Ramakanth Pasunuru

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Recent advances, advances in vision-language, made them highly, highly effective, reasoning

备注: 32 pages

点击查看摘要

Abstract:Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples consisting 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research.

66. 【2510.12219】DIANet: A Phase-Aware Dual-Stream Network for Micro-Expression Recognition via Dynamic Images

链接https://arxiv.org/abs/2510.12219

作者:Vu Tram Anh Khuong,Luu Tu Nguyen,Thi Bich Phuong Man,Thanh Ha Le,Thi Duyen Ngo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:reveal genuine emotions, involuntary facial movements, genuine emotions, movements that typically, reveal genuine

备注

点击查看摘要

Abstract:Micro-expressions are brief, involuntary facial movements that typically last less than half a second and often reveal genuine emotions. Accurately recognizing these subtle expressions is critical for applications in psychology, security, and behavioral analysis. However, micro-expression recognition (MER) remains a challenging task due to the subtle and transient nature of facial cues and the limited availability of annotated data. While dynamic image (DI) representations have been introduced to summarize temporal motion into a single frame, conventional DI-based methods often overlook the distinct characteristics of different temporal phases within a micro-expression. To address this issue, this paper proposes a novel dual-stream framework, DIANet, which leverages phase-aware dynamic images - one encoding the onset-to-apex phase and the other capturing the apex-to-offset phase. Each stream is processed by a dedicated convolutional neural network, and a cross-attention fusion module is employed to adaptively integrate features from both streams based on their contextual relevance. Extensive experiments conducted on three benchmark MER datasets (CASME-II, SAMM, and MMEW) demonstrate that the proposed method consistently outperforms conventional single-phase DI-based approaches. The results highlight the importance of modeling temporal phase information explicitly and suggest a promising direction for advancing MER.

67. 【2510.12208】he Impact of Synthetic Data on Object Detection Model Performance: A Comparative Analysis with Real-World Data

链接https://arxiv.org/abs/2510.12208

作者:Muammer Bay,Timo von Marcard,Dren Fazlija

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, offer new opportunities, workflows across industries, advances in generative, opportunities to optimize

备注: 18 pages, 12 figures, 2 tables. Code: [this https URL](https://github.com/MuammerBay/omniverse-replicator-sim2real-analysis) ; Data: [this https URL](https://doi.org/10.5281/zenodo.17308406)

点击查看摘要

Abstract:Recent advances in generative AI, particularly in computer vision (CV), offer new opportunities to optimize workflows across industries, including logistics and manufacturing. However, many AI applications are limited by a lack of expertise and resources, which forces a reliance on general-purpose models. Success with these models often requires domain-specific data for fine-tuning, which can be costly and inefficient. Thus, using synthetic data for fine-tuning is a popular, cost-effective alternative to gathering real-world data. This work investigates the impact of synthetic data on the performance of object detection models, compared to models trained on real-world data only, specifically within the domain of warehouse logistics. To this end, we examined the impact of synthetic data generated using the NVIDIA Omniverse Replicator tool on the effectiveness of object detection models in real-world scenarios. It comprises experiments focused on pallet detection in a warehouse setting, utilizing both real and various synthetic dataset generation strategies. Our findings provide valuable insights into the practical applications of synthetic image data in computer vision, suggesting that a balanced integration of synthetic and real data can lead to robust and efficient object detection models.

68. 【2510.12190】Hierarchical Reasoning with Vision-Language Models for Incident Reports from Dashcam Videos

链接https://arxiv.org/abs/2510.12190

作者:Shingo Yokoi,Kento Sasaki,Yu Yamaguchi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:large-scale driving datasets, diverse large-scale driving, autonomous driving models, autonomous driving, Recent advances

备注: 2nd Place Winner, ICCV 2025 2COOOL Competition

点击查看摘要

Abstract:Recent advances in end-to-end (E2E) autonomous driving have been enabled by training on diverse large-scale driving datasets, yet autonomous driving models still struggle in out-of-distribution (OOD) scenarios. The COOOL benchmark targets this gap by encouraging hazard understanding beyond closed taxonomies, and the 2COOOL challenge extends it to generating human-interpretable incident reports. We present a hierarchical reasoning framework for incident report generation from dashcam videos that integrates frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models (VLMs). We further improve factual accuracy and readability through model ensembling and a Blind A/B Scoring selection protocol. On the official 2COOOL open leaderboard, our method ranks 2nd among 29 teams and achieves the best CIDEr-D score, producing accurate and coherent incident narratives. These results indicate that hierarchical reasoning with VLMs is a promising direction for accident analysis and for broader understanding of safety-critical traffic events. The implementation and code are available at this https URL.

69. 【2510.12184】CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs

链接https://arxiv.org/abs/2510.12184

作者:Jiwan Kim,Kibum Kim,Sangwoo Seo,Chanyoung Park

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Multimodal Large Language, Large Language Models, efficient Multimodal Large, Multimodal Large, Large Language

备注: Preprint. Under Review

点击查看摘要

Abstract:Recently, efficient Multimodal Large Language Models (MLLMs) have gained significant attention as a solution to their high computational complexity, making them more practical for real-world applications. In this regard, the knowledge distillation (KD) approach has emerged as a promising alternative, which transfers the rich visual and linguistic knowledge from a larger model (teacher) to a smaller model (student). However, we observe that existing KD methods struggle to effectively distill the teacher MLLM's rich visual perception abilities to the student, a challenge that has been largely overlooked in previous studies. Through a systematic analysis, we identify visual attention misalignment between student and teacher as the main cause of this issue. Based on this insight, we propose CompoDistill, a novel KD framework that explicitly aligns the student's visual attention with that of the teacher to enhance the student's visual perception abilities. Our extensive experiments show that CompoDistill significantly improves performance on compositional reasoning tasks that require visual perception abilities while maintaining strong performance on visual question answering tasks, as done in existing studies. Furthermore, CompoDistill demonstrates effectiveness with a more advanced backbone, highlighting its generalizability.

70. 【2510.12182】BEEP3D: Box-Supervised End-to-End Pseudo-Mask Generation for 3D Instance Segmentation

链接https://arxiv.org/abs/2510.12182

作者:Youngju Yoo,Seho Kim,Changick Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:require dense point-level, substantial annotation costs, dense point-level annotations, methods require dense, understanding complex

备注

点击查看摘要

Abstract:3D instance segmentation is crucial for understanding complex 3D environments, yet fully supervised methods require dense point-level annotations, resulting in substantial annotation costs and labor overhead. To mitigate this, box-level annotations have been explored as a weaker but more scalable form of supervision. However, box annotations inherently introduce ambiguity in overlapping regions, making accurate point-to-instance assignment challenging. Recent methods address this ambiguity by generating pseudo-masks through training a dedicated pseudo-labeler in an additional training stage. However, such two-stage pipelines often increase overall training time and complexity, hinder end-to-end optimization. To overcome these challenges, we propose BEEP3D-Box-supervised End-to-End Pseudo-mask generation for 3D instance segmentation. BEEP3D adopts a student-teacher framework, where the teacher model serves as a pseudo-labeler and is updated by the student model via an Exponential Moving Average. To better guide the teacher model to generate precise pseudo-masks, we introduce an instance center-based query refinement that enhances position query localization and leverages features near instance centers. Additionally, we design two novel losses-query consistency loss and masked feature consistency loss-to align semantic and geometric signals between predictions and pseudo-masks. Extensive experiments on ScanNetV2 and S3DIS datasets demonstrate that BEEP3D achieves competitive or superior performance compared to state-of-the-art weakly supervised methods while remaining computationally efficient.

71. 【2510.12174】UniGS: Unified Geometry-Aware Gaussian Splatting for Multimodal Rendering

链接https://arxiv.org/abs/2510.12174

作者:Yusen Xie,Zhenmin Huang,Jianhao Jiao,Dimitrios Kanoulas,Jun Ma

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:unified map representation, Gaussian Splatting, propose UniGS, photo-realistic RGB images, unified map

备注

点击查看摘要

Abstract:In this paper, we propose UniGS, a unified map representation and differentiable framework for high-fidelity multimodal 3D reconstruction based on 3D Gaussian Splatting. Our framework integrates a CUDA-accelerated rasterization pipeline capable of rendering photo-realistic RGB images, geometrically accurate depth maps, consistent surface normals, and semantic logits simultaneously. We redesign the rasterization to render depth via differentiable ray-ellipsoid intersection rather than using Gaussian centers, enabling effective optimization of rotation and scale attribute through analytic depth gradients. Furthermore, we derive the analytic gradient formulation for surface normal rendering, ensuring geometric consistency among reconstructed 3D scenes. To improve computational and storage efficiency, we introduce a learnable attribute that enables differentiable pruning of Gaussians with minimal contribution during training. Quantitative and qualitative experiments demonstrate state-of-the-art reconstruction accuracy across all modalities, validating the efficacy of our geometry-aware paradigm. Source code and multimodal viewer will be available on GitHub.

72. 【2510.12160】State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding

链接https://arxiv.org/abs/2510.12160

作者:Jiahuan Zhou,Kai Zhu,Zhenyu Cui,Zichen Liu,Xu Zou,Gang Hua

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:maintaining high performance, shown great potential, State Space Prompting, pre-trained state space, linear complexity

备注

点击查看摘要

Abstract:Recently, pre-trained state space models have shown great potential for video classification, which sequentially compresses visual tokens in videos with linear complexity, thereby improving the processing efficiency of video data while maintaining high performance. To apply powerful pre-trained models to downstream tasks, prompt learning is proposed to achieve efficient downstream task adaptation with only a small number of fine-tuned parameters. However, the sequentially compressed visual prompt tokens fail to capture the spatial and temporal contextual information in the video, thus limiting the effective propagation of spatial information within a video frame and temporal information between frames in the state compression model and the extraction of discriminative information. To tackle the above issue, we proposed a State Space Prompting (SSP) method for video understanding, which combines intra-frame and inter-frame prompts to aggregate and propagate key spatiotemporal information in the video. Specifically, an Intra-Frame Gathering (IFG) module is designed to aggregate spatial key information within each frame. Besides, an Inter-Frame Spreading (IFS) module is designed to spread discriminative spatio-temporal information across different frames. By adaptively balancing and compressing key spatio-temporal information within and between frames, our SSP effectively propagates discriminative information in videos in a complementary manner. Extensive experiments on four video benchmark datasets verify that our SSP significantly outperforms existing SOTA methods by 2.76% on average while reducing the overhead of fine-tuning parameters.

73. 【2510.12159】DPL: Spatial-Conditioned Diffusion Prototype Enhancement for One-Shot Medical Segmentation

链接https://arxiv.org/abs/2510.12159

作者:Ziyuan Gao,Philippe Morel

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:faces fundamental challenges, limited annotated data, segmentation faces fundamental, significant anatomical variability, variability across patients

备注: Accepted at IVCNZ 2025. To be published in IEEE proceedings

点击查看摘要

Abstract:One-shot medical image segmentation faces fundamental challenges in prototype representation due to limited annotated data and significant anatomical variability across patients. Traditional prototype-based methods rely on deterministic averaging of support features, creating brittle representations that fail to capture intra-class diversity essential for robust generalization. This work introduces Diffusion Prototype Learning (DPL), a novel framework that reformulates prototype construction through diffusion-based feature space exploration. DPL models one-shot prototypes as learnable probability distributions, enabling controlled generation of diverse yet semantically coherent prototype variants from minimal labeled data. The framework operates through three core innovations: (1) a diffusion-based prototype enhancement module that transforms single support prototypes into diverse variant sets via forward-reverse diffusion processes, (2) a spatial-aware conditioning mechanism that leverages geometric properties derived from prototype feature statistics, and (3) a conservative fusion strategy that preserves prototype fidelity while maximizing representational diversity. DPL ensures training-inference consistency by using the same diffusion enhancement and fusion pipeline in both phases. This process generates enhanced prototypes that serve as the final representations for similarity calculations, while the diffusion process itself acts as a regularizer. Extensive experiments on abdominal MRI and CT datasets demonstrate significant improvements respectively, establishing new state-of-the-art performance in one-shot medical image segmentation.

74. 【2510.12150】Class-aware Domain Knowledge Fusion and Fission for Continual Test-Time Adaptation

链接https://arxiv.org/abs/2510.12150

作者:Jiahuan Zhou,Chao Zhu,Zhenyu Cui,Zichen Liu,Xu Zou,Gang Hua

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multiple unknown downstream, Continual Test-Time Adaptation, downstream domain data, knowledge, unknown downstream domain

备注

点击查看摘要

Abstract:Continual Test-Time Adaptation (CTTA) aims to quickly fine-tune the model during the test phase so that it can adapt to multiple unknown downstream domain distributions without pre-acquiring downstream domain data. To this end, existing advanced CTTA methods mainly reduce the catastrophic forgetting of historical knowledge caused by irregular switching of downstream domain data by restoring the initial model or reusing historical models. However, these methods are usually accompanied by serious insufficient learning of new knowledge and interference from potentially harmful historical knowledge, resulting in severe performance degradation. To this end, we propose a class-aware domain Knowledge Fusion and Fission method for continual test-time adaptation, called KFF, which adaptively expands and merges class-aware domain knowledge in old and new domains according to the test-time data from different domains, where discriminative historical knowledge can be dynamically accumulated. Specifically, considering the huge domain gap within streaming data, a domain Knowledge FIssion (KFI) module is designed to adaptively separate new domain knowledge from a paired class-aware domain prompt pool, alleviating the impact of negative knowledge brought by old domains that are distinct from the current domain. Besides, to avoid the cumulative computation and storage overheads from continuously fissioning new knowledge, a domain Knowledge FUsion (KFU) module is further designed to merge the fissioned new knowledge into the existing knowledge pool with minimal cost, where a greedy knowledge dynamic merging strategy is designed to improve the compatibility of new and old knowledge while keeping the computational efficiency. Extensive experiments on the ImageNet-C dataset verify the effectiveness of our proposed method against other methods.

75. 【2510.12132】FedHUG: Federated Heterogeneous Unsupervised Generalization for Remote Physiological Measurements

链接https://arxiv.org/abs/2510.12132

作者:Xiao Yang,Jiyao Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Remote physiological measurement, gained wide attention, users' privacy-sensitive information, physiological measurement gained, measurement gained wide

备注

点击查看摘要

Abstract:Remote physiological measurement gained wide attention, while it requires collecting users' privacy-sensitive information, and existing contactless measurements still rely on labeled client data. This presents challenges when we want to further update real-world deployed models with numerous user data lacking labels. To resolve these challenges, we instantiate a new protocol called Federated Unsupervised Domain Generalization (FUDG) in this work. Subsequently, the \textbf{Fed}erated \textbf{H}eterogeneous \textbf{U}nsupervised \textbf{G}eneralization (\textbf{FedHUG}) framework is proposed and consists of: (1) Minimal Bias Aggregation module dynamically adjusts aggregation weights based on prior-driven bias evaluation to cope with heterogeneous non-IID features from multiple domains. (2) The Global Distribution-aware Learning Controller parameterizes the label distribution and dynamically manipulates client-specific training strategies, thereby mitigating the server-client label distribution skew and long-tail issue. The proposal shows superior performance across state-of-the-art techniques in estimation with either RGB video or mmWave radar. The code will be released.

76. 【2510.12126】MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites

链接https://arxiv.org/abs/2510.12126

作者:Zhenxin Lei,Zhangwei Gao,Changyao Tian,Erfei Cui,Guanzhou Chen,Danni Yang,Yuchen Duan,Zhaokai Wang,Wenhao Li,Weiyun Wang,Xiangyu Zhao,Jiayi Ji,Yu Qiao,Wenhai Wang,Gen Luo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:simple appearance description, appearance description task, simple appearance, appearance description, requires integrating

备注

点击查看摘要

Abstract:Generalist visual captioning goes beyond a simple appearance description task, but requires integrating a series of visual cues into a caption and handling various visual domains. In this task, current open-source models present a large performance gap with commercial ones, which limits various applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5% reduction in costs. By leveraging CapFlow as the data synthesizer, we produce high-quality visual captions from image and video domains at scale, and obtain a generalist visual captioner via fine-tuning, namely MetaCaptioner. Through extensive experiments, we show that MetaCaptioner not only achieves comparable captioning capabilities with commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.

77. 【2510.12123】Hardware-aware Coding Function Design for Compressive Single-Photon 3D Cameras

链接https://arxiv.org/abs/2510.12123

作者:David Parra,Felipe Gutierrez-Barragan,Trevor Seets,Andreas Velten

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:time-tag individual photons, extreme resolution, increasingly popular, time-tag individual, Single-photon cameras

备注: IEEE TPAMI Special Issue

点击查看摘要

Abstract:Single-photon cameras are becoming increasingly popular in time-of-flight 3D imaging because they can time-tag individual photons with extreme resolution. However, their performance is susceptible to hardware limitations, such as system bandwidth, maximum laser power, sensor data rates, and in-sensor memory and compute resources. Compressive histograms were recently introduced as a solution to the challenge of data rates through an online in-sensor compression of photon timestamp data. Although compressive histograms work within limited in-sensor memory and computational resources, they underperform when subjected to real-world illumination hardware constraints. To address this, we present a constrained optimization approach for designing practical coding functions for compressive single-photon 3D imaging. Using gradient descent, we jointly optimize an illumination and coding matrix (i.e., the coding functions) that adheres to hardware constraints. We show through extensive simulations that our coding functions consistently outperform traditional coding designs under both bandwidth and peak power constraints. This advantage is particularly pronounced in systems constrained by peak power. Finally, we show that our approach adapts to arbitrary parameterized impulse responses by evaluating it on a real-world system with a non-ideal impulse response function.

78. 【2510.12119】ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation

链接https://arxiv.org/abs/2510.12119

作者:Ziyuan Luo,Yangyi Zhao,Ka Chun Cheung,Simon See,Renjie Wan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:raised significant concerns, widespread adoption, adoption of Retrieval-Augmented, raised significant, significant concerns

备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:The widespread adoption of Retrieval-Augmented Image Generation (RAIG) has raised significant concerns about the unauthorized use of private image datasets. While these systems have shown remarkable capabilities in enhancing generation quality through reference images, protecting visual datasets from unauthorized use in such systems remains a challenging problem. Traditional digital watermarking approaches face limitations in RAIG systems, as the complex feature extraction and recombination processes fail to preserve watermark signals during generation. To address these challenges, we propose ImageSentinel, a novel framework for protecting visual datasets in RAIG. Our framework synthesizes sentinel images that maintain visual consistency with the original dataset. These sentinels enable protection verification through randomly generated character sequences that serve as retrieval keys. To ensure seamless integration, we leverage vision-language models to generate the sentinel images. Experimental results demonstrate that ImageSentinel effectively detects unauthorized dataset usage while preserving generation quality for authorized applications. Code is available at this https URL.

79. 【2510.12114】Self-Supervised Selective-Guided Diffusion Model for Old-Photo Face Restoration

链接https://arxiv.org/abs/2510.12114

作者:Wenjie Li,Xiangyi Wang,Heng Guo,Guangwei Gao,Zhanyu Ma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:poses significant challenges, significant challenges due, Old-photo face restoration, restoration poses significant, severe blur

备注

点击查看摘要

Abstract:Old-photo face restoration poses significant challenges due to compounded degradations such as breakage, fading, and severe blur. Existing pre-trained diffusion-guided methods either rely on explicit degradation priors or global statistical guidance, which struggle with localized artifacts or face color. We propose Self-Supervised Selective-Guided Diffusion (SSDiff), which leverages pseudo-reference faces generated by a pre-trained diffusion model under weak guidance. These pseudo-labels exhibit structurally aligned contours and natural colors, enabling region-specific restoration via staged supervision: structural guidance applied throughout the denoising process and color refinement in later steps, aligned with the coarse-to-fine nature of diffusion. By incorporating face parsing maps and scratch masks, our method selectively restores breakage regions while avoiding identity mismatch. We further construct VintageFace, a 300-image benchmark of real old face photos with varying degradation levels. SSDiff outperforms existing GAN-based and diffusion-based methods in perceptual quality, fidelity, and regional controllability. Code link: this https URL.

80. 【2510.12107】DRL: Discriminative Representation Learning with Parallel Adapters for Class Incremental Learning

链接https://arxiv.org/abs/2510.12107

作者:Jiawei Zhan,Jun Liu,Jinlong Peng,Xiaochen Chen,Bin-Bin Gao,Yong Liu,Chengjie Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:non-rehearsal Class-Incremental Learning, excellent representation capabilities, Discriminative Representation Learning, Learning, remarkable progress

备注: 13 pages, 7 figures

点击查看摘要

Abstract:With the excellent representation capabilities of Pre-Trained Models (PTMs), remarkable progress has been made in non-rehearsal Class-Incremental Learning (CIL) research. However, it remains an extremely challenging task due to three conundrums: increasingly large model complexity, non-smooth representation shift during incremental learning and inconsistency between stage-wise sub-problem optimization and global inference. In this work, we propose the Discriminative Representation Learning (DRL) framework to specifically address these challenges. To conduct incremental learning effectively and yet efficiently, the DRL's network, called Incremental Parallel Adapter (IPA) network, is built upon a PTM and increasingly augments the model by learning a lightweight adapter with a small amount of parameter learning overhead in each incremental stage. The adapter is responsible for adapting the model to new classes, it can inherit and propagate the representation capability from the current model through parallel connection between them by a transfer gate. As a result, this design guarantees a smooth representation shift between different incremental stages. Furthermore, to alleviate inconsistency and enable comparable feature representations across incremental stages, we design the Decoupled Anchor Supervision (DAS). It decouples constraints of positive and negative samples by respectively comparing them with the virtual anchor. This decoupling promotes discriminative representation learning and aligns the feature spaces learned at different stages, thereby narrowing the gap between stage-wise local optimization over a subset of data and global inference across all classes. Extensive experiments on six benchmarks reveal that our DRL consistently outperforms other state-of-the-art methods throughout the entire CIL period while maintaining high efficiency in both training and inference phases.

81. 【2510.12101】Gaussian Semantic Field for One-shot LiDAR Global Localization

链接https://arxiv.org/abs/2510.12101

作者:Pengyu Yin,Shenghai Yuan,Haozhi Cao,Xingyu Ji,Ruofei Bai,Siyu Chen,Lihua Xie

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:disambiguation ability based, localization algorithm featuring, algorithm featuring semantic, featuring semantic disambiguation, semantic disambiguation ability

备注

点击查看摘要

Abstract:We present a one-shot LiDAR global localization algorithm featuring semantic disambiguation ability based on a lightweight tri-layered scene graph. While landmark semantic registration-based methods have shown promising performance improvements in global localization compared with geometric-only methods, landmarks can be repetitive and misleading for correspondence establishment. We propose to mitigate this problem by modeling semantic distributions with continuous functions learned from a population of Gaussian processes. Compared with discrete semantic labels, the continuous functions capture finer-grained geo-semantic information and also provide more detailed metric information for correspondence establishment. We insert this continuous function as the middle layer between the object layer and the metric-semantic layer, forming a tri-layered 3D scene graph, serving as a light-weight yet performant backend for one-shot localization. We term our global localization pipeline Outram-GSF (Gaussian semantic field) and conduct a wide range of experiments on publicly available data sets, validating the superior performance against the current state-of-the-art.

82. 【2510.12099】G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior

链接https://arxiv.org/abs/2510.12099

作者:Junfeng Ni,Yixin Chen,Zhifei Yang,Yu Liu,Ruijie Lu,Song-Chun Zhu,Siyuan Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:leveraging generative prior, critical limitations, recent advances, advances in leveraging, prior from pre-trained

备注: Project page: [this https URL](https://dali-jack.github.io/g4splat-web/)

点击查看摘要

Abstract:Despite recent advances in leveraging generative prior from pre-trained diffusion models for 3D scene reconstruction, existing methods still face two critical limitations. First, due to the lack of reliable geometric supervision, they struggle to produce high-quality reconstructions even in observed regions, let alone in unobserved areas. Second, they lack effective mechanisms to mitigate multi-view inconsistencies in the generated images, leading to severe shape-appearance ambiguities and degraded scene geometry. In this paper, we identify accurate geometry as the fundamental prerequisite for effectively exploiting generative models to enhance 3D scene reconstruction. We first propose to leverage the prevalence of planar structures to derive accurate metric-scale depth maps, providing reliable supervision in both observed and unobserved regions. Furthermore, we incorporate this geometry guidance throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and enhance multi-view consistency when inpainting with video diffusion models, resulting in accurate and consistent scene completion. Extensive experiments on Replica, ScanNet++, and DeepBlending show that our method consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly for unobserved regions. Moreover, our method naturally supports single-view inputs and unposed videos, with strong generalizability in both indoor and outdoor scenarios with practical real-world applicability. The project page is available at this https URL.

83. 【2510.12098】An Adaptive Edge-Guided Dual-Network Framework for Fast QR Code Motion Deblurring

链接https://arxiv.org/abs/2510.12098

作者:Jianping Li,Dongyang Guo,Wenjie Li,Wei Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Unlike general image, prioritizes perceptual quality, Unlike general, perceptual quality, ensuring successful decoding

备注

点击查看摘要

Abstract:Unlike general image deblurring that prioritizes perceptual quality, QR code deblurring focuses on ensuring successful decoding. QR codes are characterized by highly structured patterns with sharp edges, a robust prior for restoration. Yet existing deep learning methods rarely exploit these priors explicitly. To address this gap, we propose the Edge-Guided Attention Block (EGAB), which embeds explicit edge priors into a Transformer architecture. Based on EGAB, we develop Edge-Guided Restormer (EG-Restormer), an effective network that significantly boosts the decoding rate of severely blurred QR codes. For mildly blurred inputs, we design the Lightweight and Efficient Network (LENet) for fast deblurring. We further integrate these two networks into an Adaptive Dual-network (ADNet), which dynamically selects the suitable network based on input blur severity, making it ideal for resource-constrained mobile devices. Extensive experiments show that our EG-Restormer and ADNet achieve state-of-the-art performance with a competitive speed. Project page: this https URL

84. 【2510.12095】IL3D: A Large-Scale Indoor Layout Dataset for LLM-Driven 3D Scene Generation

链接https://arxiv.org/abs/2510.12095

作者:Wenxu Zhou,Kaixuan Nie,Hang Du,Dong Yin,Wei Huang,Siqiang Guo,Xiaobo Zhang,Pengbo Hu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:large language model, indoor layout design, high-quality training data, large-scale dataset meticulously, dataset meticulously designed

备注: 9 pages main paper; 15 pages references and appendix

点击查看摘要

Abstract:In this study, we present IL3D, a large-scale dataset meticulously designed for large language model (LLM)-driven 3D scene generation, addressing the pressing demand for diverse, high-quality training data in indoor layout design. Comprising 27,816 indoor layouts across 18 prevalent room types and a library of 29,215 high-fidelity 3D object assets, IL3D is enriched with instance-level natural language annotations to support robust multimodal learning for vision-language tasks. We establish rigorous benchmarks to evaluate LLM-driven scene generation. Experimental results show that supervised fine-tuning (SFT) of LLMs on IL3D significantly improves generalization and surpasses the performance of SFT on other datasets. IL3D offers flexible multimodal data export capabilities, including point clouds, 3D bounding boxes, multiview images, depth maps, normal maps, and semantic masks, enabling seamless adaptation to various visual tasks. As a versatile and robust resource, IL3D significantly advances research in 3D scene generation and embodied intelligence, by providing high-fidelity scene data to support environment perception tasks of embodied agents.

85. 【2510.12089】Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback

链接https://arxiv.org/abs/2510.12089

作者:Xingpei Ma,Shenneng Huang,Jiaran Cai,Yuansheng Guan,Shen Zheng,Hanfeng Zhao,Qiang Zhang,Shunsi Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, surpassing traditional methods, significantly improved audio-driven, improved audio-driven human, human video generation

备注

点击查看摘要

Abstract:Recent advances in diffusion models have significantly improved audio-driven human video generation, surpassing traditional methods in both quality and controllability. However, existing approaches still face challenges in lip-sync accuracy, temporal coherence for long video generation, and multi-character animation. In this work, we propose a diffusion transformer (DiT)-based framework for generating lifelike talking videos of arbitrary length, and introduce a training-free method for multi-character audio-driven animation. First, we employ a LoRA-based training strategy combined with a position shift inference approach, which enables efficient long video generation while preserving the capabilities of the foundation model. Moreover, we combine partial parameter updates with reward feedback to enhance both lip synchronization and natural body motion. Finally, we propose a training-free approach, Mask Classifier-Free Guidance (Mask-CFG), for multi-character animation, which requires no specialized datasets or model modifications and supports audio-driven animation for three or more characters. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches, achieving high-quality, temporally coherent, and multi-character audio-driven video generation in a simple, efficient, and cost-effective manner.

86. 【2510.12075】A Review on Domain Adaption and Generative Adversarial Networks(GANs)

链接https://arxiv.org/abs/2510.12075

作者:Aashish Dhawan,Divyanshu Mudgal

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:today computer vision, computer vision scenario, good quality labeled, quality labeled data, major challenge

备注

点击查看摘要

Abstract:The major challenge in today's computer vision scenario is the availability of good quality labeled data. In a field of study like image classification, where data is of utmost importance, we need to find more reliable methods which can overcome the scarcity of data to produce results comparable to previous benchmark results. In most cases, obtaining labeled data is very difficult because of the high cost of human labor and in some cases impossible. The purpose of this paper is to discuss Domain Adaptation and various methods to implement it. The main idea is to use a model trained on a particular dataset to predict on data from a different domain of the same kind, for example - a model trained on paintings of airplanes predicting on real images of airplanes

87. 【2510.12069】VIDMP3: Video Editing by Representing Motion with Pose and Position Priors

链接https://arxiv.org/abs/2510.12069

作者:Sandeep Mishra,Oindrila Saha,Alan C. Bovik

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Motion-preserved video editing, crucial for creators, swapped objects, Motion-preserved video, scenarios that demand

备注

点击查看摘要

Abstract:Motion-preserved video editing is crucial for creators, particularly in scenarios that demand flexibility in both the structure and semantics of swapped objects. Despite its potential, this area remains underexplored. Existing diffusion-based editing methods excel in structure-preserving tasks, using dense guidance signals to ensure content integrity. While some recent methods attempt to address structure-variable editing, they often suffer from issues such as temporal inconsistency, subject identity drift, and the need for human intervention. To address these challenges, we introduce VidMP3, a novel approach that leverages pose and position priors to learn a generalized motion representation from source videos. Our method enables the generation of new videos that maintain the original motion while allowing for structural and semantic flexibility. Both qualitative and quantitative evaluations demonstrate the superiority of our approach over existing methods. The code will be made publicly available at this https URL.

88. 【2510.12060】Your VAR Model is Secretly an Efficient and Explainable Generative Classifier

链接https://arxiv.org/abs/2510.12060

作者:Yi-Chung Chen,David I. Inouye,Jing Gao

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:recently demonstrated desirable, leverage conditional generative, demonstrated desirable properties, conditional generative models, distribution shifts

备注

点击查看摘要

Abstract:Generative classifiers, which leverage conditional generative models for classification, have recently demonstrated desirable properties such as robustness to distribution shifts. However, recent progress in this area has been largely driven by diffusion-based models, whose substantial computational cost severely limits scalability. This exclusive focus on diffusion-based methods has also constrained our understanding of generative classifiers. In this work, we propose a novel generative classifier built on recent advances in visual autoregressive (VAR) modeling, which offers a new perspective for studying generative classifiers. To further enhance its performance, we introduce the Adaptive VAR Classifier$^+$ (A-VARC$^+$), which achieves a superior trade-off between accuracy and inference speed, thereby significantly improving practical applicability. Moreover, we show that the VAR-based method exhibits fundamentally different properties from diffusion-based methods. In particular, due to its tractable likelihood, the VAR-based classifier enables visual explainability via token-wise mutual information and demonstrates inherent resistance to catastrophic forgetting in class-incremental learning tasks.

89. 【2510.12056】APGNet: Adaptive Prior-Guided for Underwater Camouflaged Object Detection

链接https://arxiv.org/abs/2510.12056

作者:Xinxin Huang,Han Sun,Junmin Cai,Ningzhong Liu,Huiyu Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Detecting camouflaged objects, marine ecological research, Detecting camouflaged, resource exploration, ecological research

备注: 6 pages. accepted by ACM MM Asia 2025

点击查看摘要

Abstract:Detecting camouflaged objects in underwater environments is crucial for marine ecological research and resource exploration. However, existing methods face two key challenges: underwater image degradation, including low contrast and color distortion, and the natural camouflage of marine organisms. Traditional image enhancement techniques struggle to restore critical features in degraded images, while camouflaged object detection (COD) methods developed for terrestrial scenes often fail to adapt to underwater environments due to the lack of consideration for underwater optical characteristics. To address these issues, we propose APGNet, an Adaptive Prior-Guided Network, which integrates a Siamese architecture with a novel prior-guided mechanism to enhance robustness and detection accuracy. First, we employ the Multi-Scale Retinex with Color Restoration (MSRCR) algorithm for data augmentation, generating illumination-invariant images to mitigate degradation effects. Second, we design an Extended Receptive Field (ERF) module combined with a Multi-Scale Progressive Decoder (MPD) to capture multi-scale contextual information and refine feature representations. Furthermore, we propose an adaptive prior-guided mechanism that hierarchically fuses position and boundary priors by embedding spatial attention in high-level features for coarse localization and using deformable convolution to refine contours in low-level features. Extensive experimental results on two public MAS datasets demonstrate that our proposed method APGNet outperforms 15 state-of-art methods under widely used evaluation metrics.

Comments:
6 pages. accepted by ACM MM Asia 2025

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2510.12056 [cs.CV]

(or
arXiv:2510.12056v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.12056

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
90. 【2510.12021】Evaluating the Explainability of Vision Transformers in Medical Imaging

链接https://arxiv.org/abs/2510.12021

作者:Leili Barekatain,Ben Glocker

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:interpretability directly impacts, directly impacts clinical, impacts clinical trust, Understanding model decisions, Gradient Attention Rollout

备注: Accepted at Workshop on Interpretability of Machine Intelligence in Medical Image Computing at MICCAI 2025

点击查看摘要

Abstract:Understanding model decisions is crucial in medical imaging, where interpretability directly impacts clinical trust and adoption. Vision Transformers (ViTs) have demonstrated state-of-the-art performance in diagnostic imaging; however, their complex attention mechanisms pose challenges to explainability. This study evaluates the explainability of different Vision Transformer architectures and pre-training strategies - ViT, DeiT, DINO, and Swin Transformer - using Gradient Attention Rollout and Grad-CAM. We conduct both quantitative and qualitative analyses on two medical imaging tasks: peripheral blood cell classification and breast ultrasound image classification. Our findings indicate that DINO combined with Grad-CAM offers the most faithful and localized explanations across datasets. Grad-CAM consistently produces class-discriminative and spatially precise heatmaps, while Gradient Attention Rollout yields more scattered activations. Even in misclassification cases, DINO with Grad-CAM highlights clinically relevant morphological features that appear to have misled the model. By improving model transparency, this research supports the reliable and explainable integration of ViTs into critical medical diagnostic workflows.

91. 【2510.11996】Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning

链接https://arxiv.org/abs/2510.11996

作者:Tanner Muturi,Blessing Agyei Kyem,Joshua Kofi Asamoah,Neema Jakisa Owor,Richard Dyzinela,Andrews Danyo,Yaw Adu-Gyamfi,Armstrong Aboah

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:precise spatial understanding, vision-language systems due, Spatial Intelligence Warehouse, scene clutter, remains a significant

备注: The paper was accepted at ICCV Conference 2025

点击查看摘要

Abstract:Spatial reasoning in large-scale 3D environments such as warehouses remains a significant challenge for vision-language systems due to scene clutter, occlusions, and the need for precise spatial understanding. Existing models often struggle with generalization in such settings, as they rely heavily on local appearance and lack explicit spatial grounding. In this work, we introduce a dedicated spatial reasoning framework for the Physical AI Spatial Intelligence Warehouse dataset introduced in the Track 3 2025 AI City Challenge. Our approach enhances spatial comprehension by embedding mask dimensions in the form of bounding box coordinates directly into the input prompts, enabling the model to reason over object geometry and layout. We fine-tune the framework across four question categories namely: Distance Estimation, Object Counting, Multi-choice Grounding, and Spatial Relation Inference using task-specific supervision. To further improve consistency with the evaluation system, normalized answers are appended to the GPT response within the training set. Our comprehensive pipeline achieves a final score of 73.0606, placing 4th overall on the public leaderboard. These results demonstrate the effectiveness of structured prompt enrichment and targeted optimization in advancing spatial reasoning for real-world industrial environments.

92. 【2510.11992】PanoTPS-Net: Panoramic Room Layout Estimation via Thin Plate Spline Transformation

链接https://arxiv.org/abs/2510.11992

作者:Hatem Ibrahem,Ahmed Salem,Qinmin Vivian Hu,Guanghui Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Accurately estimating, Convolutional Neural Network, Thin Plate Spline, TPS spatial transformation, augmented reality

备注

点击查看摘要

Abstract:Accurately estimating the 3D layout of rooms is a crucial task in computer vision, with potential applications in robotics, augmented reality, and interior design. This paper proposes a novel model, PanoTPS-Net, to estimate room layout from a single panorama image. Leveraging a Convolutional Neural Network (CNN) and incorporating a Thin Plate Spline (TPS) spatial transformation, the architecture of PanoTPS-Net is divided into two stages: First, a convolutional neural network extracts the high-level features from the input images, allowing the network to learn the spatial parameters of the TPS transformation. Second, the TPS spatial transformation layer is generated to warp a reference layout to the required layout based on the predicted parameters. This unique combination empowers the model to properly predict room layouts while also generalizing effectively to both cuboid and non-cuboid layouts. Extensive experiments on publicly available datasets and comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed method. The results underscore the model's accuracy in room layout estimation and emphasize the compatibility between the TPS transformation and panorama images. The robustness of the model in handling both cuboid and non-cuboid room layout estimation is evident with a 3DIoU value of 85.49, 86.16, 81.76, and 91.98 on PanoContext, Stanford-2D3D, Matterport3DLayout, and ZInD datasets, respectively. The source code is available at: this https URL.

93. 【2510.11962】MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics

链接https://arxiv.org/abs/2510.11962

作者:Bowei Guo,Shengkun Tang,Cong Zeng,Zhiqiang Shen

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:processes exhibit distinct, exhibit distinct phases, pretraining processes exhibit, prior post-training acceleration, post-training acceleration efforts

备注: International Conference on Computer Vision, ICCV 2025

点击查看摘要

Abstract:Diffusion models are renowned for their generative capabilities, yet their pretraining processes exhibit distinct phases of learning speed that have been entirely overlooked in prior post-training acceleration efforts in the community. In this study, we introduce a novel framework called MosaicDiff that aligns diffusion pretraining dynamics with post-training sampling acceleration via trajectory-aware structural pruning. Our approach leverages the observation that the middle, fast-learning stage of diffusion pretraining requires more conservative pruning to preserve critical model features, while the early and later, slow-learning stages benefit from a more aggressive pruning strategy. This adaptive pruning mechanism is the first to explicitly mirror the inherent learning speed variations of diffusion pretraining, thereby harmonizing the model's inner training dynamics with its accelerated sampling process. Extensive experiments on DiT and SDXL demonstrate that our method achieves significant speed-ups in sampling without compromising output quality, outperforming previous state-of-the-art methods by large margins, also providing a new viewpoint for more efficient and robust training-free diffusion acceleration.

94. 【2510.11907】ask-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis

链接https://arxiv.org/abs/2510.11907

作者:Blessing Agyei Kyem,Neema Jakisa Owor,Andrews Danyo,Joshua Kofi Asamoah,Eugene Denteh,Tanner Muturi,Anthony Dontoh,Yaw Adu-Gyamfi,Armstrong Aboah

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Traffic safety analysis, safety analysis requires, analysis requires complex, requires complex video, capture fine-grained behavioral

备注: This paper was accepted at ICCV 2025

点击查看摘要

Abstract:Traffic safety analysis requires complex video understanding to capture fine-grained behavioral patterns and generate comprehensive descriptions for accident prevention. In this work, we present a unique dual-model framework that strategically utilizes the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization to address this issue. The core insight behind our approach is that separating training for captioning and visual question answering (VQA) tasks minimizes task interference and allows each model to specialize more effectively. Experimental results demonstrate that VideoLLaMA is particularly effective in temporal reasoning, achieving a CIDEr score of 1.1001, while Qwen2.5-VL excels in visual understanding with a VQA accuracy of 60.80\%. Through extensive experiments on the WTS dataset, our method achieves an S2 score of 45.7572 in the 2025 AI City Challenge Track 2, placing 10th on the challenge leaderboard. Ablation studies validate that our separate training strategy outperforms joint training by 8.6\% in VQA accuracy while maintaining captioning quality.

95. 【2510.11883】MammoDINO: Anatomically Aware Self-Supervision for Mammographic Images

链接https://arxiv.org/abs/2510.11883

作者:Sicheng Zhou,Lei Wu,Cao Xiao,Parminder Bhatia,Taha Kass-Hout

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:domain specific biases, transformed vision encoder, vision encoder training, medical imaging due, Self-supervised learning

备注: 5 pages

点击查看摘要

Abstract:Self-supervised learning (SSL) has transformed vision encoder training in general domains but remains underutilized in medical imaging due to limited data and domain specific biases. We present MammoDINO, a novel SSL framework for mammography, pretrained on 1.4 million mammographic images. To capture clinically meaningful features, we introduce a breast tissue aware data augmentation sampler for both image-level and patch-level supervision and a cross-slice contrastive learning objective that leverages 3D digital breast tomosynthesis (DBT) structure into 2D pretraining. MammoDINO achieves state-of-the-art performance on multiple breast cancer screening tasks and generalizes well across five benchmark datasets. It offers a scalable, annotation-free foundation for multipurpose computer-aided diagnosis (CAD) tools for mammogram, helping reduce radiologists' workload and improve diagnostic efficiency in breast cancer screening.

96. 【2510.11878】GS-Verse: Mesh-based Gaussian Splatting for Physics-aware Interaction in Virtual Reality

链接https://arxiv.org/abs/2510.11878

作者:Anastasiya Pechko,Piotr Borycki,Joanna Waczyńska,Daniel Barczyk,Agata Szymańska,Sławomir Tadeja,Przemysław Spurek

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:textbf, demand for immersive, Virtual Reality, efficient interaction methods, content grows

备注

点击查看摘要

Abstract:As the demand for immersive 3D content grows, the need for intuitive and efficient interaction methods becomes paramount. Current techniques for physically manipulating 3D content within Virtual Reality (VR) often face significant limitations, including reliance on engineering-intensive processes and simplified geometric representations, such as tetrahedral cages, which can compromise visual fidelity and physical accuracy. In this paper, we introduce \our{} (\textbf{G}aussian \textbf{S}platting for \textbf{V}irtual \textbf{E}nvironment \textbf{R}endering and \textbf{S}cene \textbf{E}diting), a novel method designed to overcome these challenges by directly integrating an object's mesh with a Gaussian Splatting (GS) representation. Our approach enables more precise surface approximation, leading to highly realistic deformations and interactions. By leveraging existing 3D mesh assets, \our{} facilitates seamless content reuse and simplifies the development workflow. Moreover, our system is designed to be physics-engine-agnostic, granting developers robust deployment flexibility. This versatile architecture delivers a highly realistic, adaptable, and intuitive approach to interactive 3D manipulation. We rigorously validate our method against the current state-of-the-art technique that couples VR with GS in a comparative user study involving 18 participants. Specifically, we demonstrate that our approach is statistically significantly better for physics-aware stretching manipulation and is also more consistent in other physics-based manipulations like twisting and shaking. Further evaluation across various interactions and scenes confirms that our method consistently delivers high and reliable performance, showing its potential as a plausible alternative to existing methods.

97. 【2510.11835】Data or Language Supervision: What Makes CLIP Better than DINO?

链接https://arxiv.org/abs/2510.11835

作者:Yiming Liu,Yuhui Zhang,Dhruba Ghosh,Ludwig Schmidt,Serena Yeung-Levy

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)

关键词:larger training data, outperforms self-supervised models, CLIP outperforms self-supervised, self-supervised models, vision-language models

备注: EMNLP 2025 Findings

点击查看摘要

Abstract:CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP's language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings -- using the same architecture, dataset, and training configuration -- achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.

98. 【2510.11817】Enhancing the Quality of 3D Lunar Maps Using JAXA's Kaguya Imagery

链接https://arxiv.org/abs/2510.11817

作者:Yumi Iwashita,Haakon Moe,Yang Cheng,Adnan Ansar,Georgios Georgakis,Adrian Stoica,Kazuto Nakashima,Ryo Kurazume,Jim Torresen

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:South Pole-Aitken basin, NASA Endurance mission, NASA Endurance, Endurance mission concept, Moon intensify

备注: Presented at IEEE SMC 2025

点击查看摘要

Abstract:As global efforts to explore the Moon intensify, the need for high-quality 3D lunar maps becomes increasingly critical-particularly for long-distance missions such as NASA's Endurance mission concept, in which a rover aims to traverse 2,000 km across the South Pole-Aitken basin. Kaguya TC (Terrain Camera) images, though globally available at 10 m/pixel, suffer from altitude inaccuracies caused by stereo matching errors and JPEG-based compression artifacts. This paper presents a method to improve the quality of 3D maps generated from Kaguya TC images, focusing on mitigating the effects of compression-induced noise in disparity maps. We analyze the compression behavior of Kaguya TC imagery, and identify systematic disparity noise patterns, especially in darker regions. In this paper, we propose an approach to enhance 3D map quality by reducing residual noise in disparity images derived from compressed images. Our experimental results show that the proposed approach effectively reduces elevation noise, enhancing the safety and reliability of terrain data for future lunar missions.

99. 【2510.11760】Audio-Guided Visual Perception for Audio-Visual Navigation

链接https://arxiv.org/abs/2510.11760

作者:Yi Wang,Yinfeng Yu,Fuchun Sun,Liejun Wang,Wendong Zheng

类目:ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:Audio-Visual Embodied Navigation, Embodied Navigation aims, Audio-Visual Embodied, Embodied Navigation, aims to enable

备注: Main paper (6 pages). Accepted for publication by International Conference on Virtual Reality and Visualization 2025 (ICVRV 2025)

点击查看摘要

Abstract:Audio-Visual Embodied Navigation aims to enable agents to autonomously navigate to sound sources in unknown 3D environments using auditory cues. While current AVN methods excel on in-distribution sound sources, they exhibit poor cross-source generalization: navigation success rates plummet and search paths become excessively long when agents encounter unheard sounds or unseen environments. This limitation stems from the lack of explicit alignment mechanisms between auditory signals and corresponding visual regions. Policies tend to memorize spurious \enquote{acoustic fingerprint-scenario} correlations during training, leading to blind exploration when exposed to novel sound sources. To address this, we propose the AGVP framework, which transforms sound from policy-memorable acoustic fingerprint cues into spatial guidance. The framework first extracts global auditory context via audio self-attention, then uses this context as queries to guide visual feature attention, highlighting sound-source-related regions at the feature level. Subsequent temporal modeling and policy optimization are then performed. This design, centered on interpretable cross-modal alignment and region reweighting, reduces dependency on specific acoustic fingerprints. Experimental results demonstrate that AGVP improves both navigation efficiency and robustness while achieving superior cross-scenario generalization on previously unheard sounds.

100. 【2510.11738】SeeingSounds: Learning Audio-to-Visual Alignment via Text

链接https://arxiv.org/abs/2510.11738

作者:Simone Carnemolla,Matteo Pennisi,Chiara Russo,Simone Palazzo,Daniela Giordano,Concetto Spampinato

类目:ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:paired audio-visual data, visual generative models, modular framework, leverages the interplay, vision-without requiring

备注: accepted to ACM Multimedia Asia 2025

点击查看摘要

Abstract:We introduce SeeingSounds, a lightweight and modular framework for audio-to-image generation that leverages the interplay between audio, language, and vision-without requiring any paired audio-visual data or training on visual generative models. Rather than treating audio as a substitute for text or relying solely on audio-to-text mappings, our method performs dual alignment: audio is projected into a semantic language space via a frozen language encoder, and, contextually grounded into the visual domain using a vision-language model. This approach, inspired by cognitive neuroscience, reflects the natural cross-modal associations observed in human perception. The model operates on frozen diffusion backbones and trains only lightweight adapters, enabling efficient and scalable learning. Moreover, it supports fine-grained and interpretable control through procedural text prompt generation, where audio transformations (e.g., volume or pitch shifts) translate into descriptive prompts (e.g., "a distant thunder") that guide visual outputs. Extensive experiments across standard benchmarks confirm that SeeingSounds outperforms existing methods in both zero-shot and supervised settings, establishing a new state of the art in controllable audio-to-visual generation.

101. 【2510.12425】nsor Completion via Monotone Inclusion: Generalized Low-Rank Priors Meet Deep Denoisers

链接https://arxiv.org/abs/2510.12425

作者:Peng Chen,Deliang Wei,Jiale Yao,Fang Li

类目:Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)

关键词:real world applications, pose significant challenges, diverse real world, Missing entries, multi dimensional data

备注: 22 pages, 5 figures

点击查看摘要

Abstract:Missing entries in multi dimensional data pose significant challenges for downstream analysis across diverse real world applications. These data are naturally modeled as tensors, and recent completion methods integrating global low rank priors with plug and play denoisers have demonstrated strong empirical performance. However, these approaches often rely on empirical convergence alone or unrealistic assumptions, such as deep denoisers acting as proximal operators of implicit regularizers, which generally does not hold. To address these limitations, we propose a novel tensor completion framework grounded in the monotone inclusion paradigm, which unifies generalized low rank priors with deep pseudo contractive denoisers and extends beyond traditional convex optimization. Building on the Davis Yin splitting scheme, we develop the GTCTV DPC algorithm and rigorously establish its global convergence. Extensive experiments demonstrate that GTCTV DPC consistently outperforms existing methods in both quantitative metrics and visual quality, particularly at low sampling rates.

102. 【2510.12141】MAPS: Masked Attribution-based Probing of Strategies- A computational framework to align human and model explanations

链接https://arxiv.org/abs/2510.12141

作者:Sabine Muzellec,Yousif Kashef Alghetaa,Simon Kornblith,Kohitij Kar

类目:Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)

关键词:core object recognition, object recognition depends, Masked Attribution-based Probing, Human core object, visual information

备注

点击查看摘要

Abstract:Human core object recognition depends on the selective use of visual information, but the strategies guiding these choices are difficult to measure directly. We present MAPS (Masked Attribution-based Probing of Strategies), a behaviorally validated computational tool that tests whether explanations derived from artificial neural networks (ANNs) can also explain human vision. MAPS converts attribution maps into explanation-masked images (EMIs) and compares image-by-image human accuracies on these minimal images with limited pixel budgets with accuracies on the full stimuli. MAPS provides a principled way to evaluate and choose among competing ANN interpretability methods. In silico, EMI-based behavioral similarity between models reliably recovers the ground-truth similarity computed from their attribution maps, establishing which explanation methods best capture the model's strategy. When applied to humans and macaques, MAPS identifies ANN-explanation combinations whose explanations align most closely with biological vision, achieving the behavioral validity of Bubble masks while requiring far fewer behavioral trials. Because it needs only access to model attributions and a modest set of behavioral data on the original images, MAPS avoids exhaustive psychophysics while offering a scalable tool for adjudicating explanations and linking human behavior, neural activity, and model decisions under a common standard.