本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新660篇论文,其中:
- 自然语言处理156篇
- 信息检索24篇
- 计算机视觉122篇
自然语言处理
1. 【2510.14980】Agentic Design of Compositional Machines
链接:https://arxiv.org/abs/2510.14980
作者:Wenqian Zhang,Weiyang Liu,Zhen Liu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
关键词:complex machines stands, engineering practice, marker of human, human intelligence, foundation of engineering
备注: 75 pages, 31 figures, Project Page: [this https URL](https://besiegefield.github.io)
点击查看摘要
Abstract:The design of complex machines stands as both a marker of human intelligence and a foundation of engineering practice. Given recent advances in large language models (LLMs), we ask whether they, too, can learn to create. We approach this question through the lens of compositional machine design: a task in which machines are assembled from standardized components to meet functional demands like locomotion or manipulation in a simulated physical environment. To support this investigation, we introduce BesiegeField, a testbed built on the machine-building game Besiege, which enables part-based construction, physical simulation and reward-driven evaluation. Using BesiegeField, we benchmark state-of-the-art LLMs with agentic workflows and identify key capabilities required for success, including spatial reasoning, strategic assembly, and instruction-following. As current open-source models fall short, we explore reinforcement learning (RL) as a path to improvement: we curate a cold-start dataset, conduct RL finetuning experiments, and highlight open challenges at the intersection of language, machine design, and physical reasoning.
2. 【2510.14973】Attention Is All You Need for KV Cache in Diffusion LLMs
链接:https://arxiv.org/abs/2510.14973
作者:Quan Nguyen-Tri,Mukul Ranjan,Zhiqiang Shen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:large language models, adaptively recompute key-value, minimizing decoding latency, diffusion large language, language models
备注: [this https URL](https://vila-lab.github.io/elastic-cache-webpage/)
点击查看摘要
Abstract:This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant ${\bf MASK}$ tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose ${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that jointly decides ${when}$ to refresh (via an attention-aware drift test on the most-attended token) and ${where}$ to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
3. 【2510.14972】okDrift: When LLM Speaks in Subwords but Code Speaks in Grammar
链接:https://arxiv.org/abs/2510.14972
作者:Yinxi Li,Yuntian Deng,Pengyu Nie
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
关键词:mixed natural language, natural language text, programming language code, Large language models, byte-pair encoding
备注:
点击查看摘要
Abstract:Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
4. 【2510.14969】LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training
链接:https://arxiv.org/abs/2510.14969
作者:Yiming Wang,Da Yin,Yuedong Cui,Ruichen Zheng,Zhiqian Li,Zongyu Lin,Di Wu,Xueqing Wu,Chenchen Ye,Yu Zhou,Kai-Wei Chang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:human annotation, infra and engineering, engineering perspectives, Digital agents require, generalize across real-world
备注: Preprint. Project page: [this https URL](https://ui-simulator.notion.site/llms-as-scalable-digital-world-simulator;) Code and data: [this https URL](https://github.com/WadeYin9712/UI-Simulator)
点击查看摘要
Abstract:Digital agents require diverse, large-scale UI trajectories to generalize across real-world tasks, yet collecting such data is prohibitively expensive in both human annotation, infra and engineering perspectives. To this end, we introduce $\textbf{UI-Simulator}$, a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale. Our paradigm integrates a digital world simulator for diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality and diverse trajectories for agent training. We further propose $\textbf{UI-Simulator-Grow}$, a targeted scaling strategy that enables more rapid and data-efficient scaling by prioritizing high-impact tasks and synthesizes informative trajectory variants. Experiments on WebArena and AndroidWorld show that UI-Simulator rivals or surpasses open-source agents trained on real UIs with significantly better robustness, despite using weaker teacher models. Moreover, UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct as the base model, highlighting the potential of targeted synthesis scaling paradigm to continuously and efficiently enhance the digital agents.
5. 【2510.14967】Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents
链接:https://arxiv.org/abs/2510.14967
作者:Guoqing Wang,Sunhao Dai,Guangze Ye,Zeyu Gan,Wei Yao,Yong Deng,Xiaofeng Wu,Zhenzhe Ying
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large language model, Large language, require multi-turn reasoning, knowledge acquisition, increasingly trained
备注:
点击查看摘要
Abstract:Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.
6. 【2510.14961】Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models
链接:https://arxiv.org/abs/2510.14961
作者:Jonas Geiping,Xinyu Yang,Guinan Su
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:recurrent depth, repetition of layers, universal or looped, capacity to increase, Language models
备注: Code can be found at [this https URL](https://github.com/seal-rg/recurrent-pretraining)
点击查看摘要
Abstract:Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.
7. 【2510.14958】MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning
链接:https://arxiv.org/abs/2510.14958
作者:Weikang Shi,Aldrich Yu,Rongyao Fang,Houxing Ren,Ke Wang,Aojun Zhou,Changyao Tian,Xinyu Fu,Yuxuan Hu,Zimu Lu,Linjiang Huang,Si Liu,Rui Liu,Hongsheng Li
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Large Multimodal Models, Language Models, excelled in textual
备注: Project Page: [this https URL](https://mathcanvas.github.io/)
点击查看摘要
Abstract:While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit-framework, datasets, and benchmark-to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: this https URL
8. 【2510.14949】DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
链接:https://arxiv.org/abs/2510.14949
作者:Yu Zhou,Sohyun An,Haikang Deng,Da Yin,Clark Peng,Cho-Jui Hsieh,Kai-Wei Chang,Nanyun Peng
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:rich regional variations, Contact languages, exhibit rich regional, generative models, English exhibit rich
备注:
点击查看摘要
Abstract:Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins ( 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.
9. 【2510.14944】MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics
链接:https://arxiv.org/abs/2510.14944
作者:Yuxing Lu,Xukai Zhao,J. Ben Tamo,Micky C. Nnamdi,Rui Peng,Shuang Zeng,Xingyu Hu,Jinzhuo Wang,May D. Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
关键词:Large Language Models, Large Language, remains largely uncharacterized, specialized scientific domains, demonstrated remarkable capabilities
备注: 22 pages, 6 figures, 4 tables
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities on general text; however, their proficiency in specialized scientific domains that require deep, interconnected knowledge remains largely uncharacterized. Metabolomics presents unique challenges with its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases. To systematically evaluate LLM capabilities in this domain, we introduce MetaBench, the first benchmark for metabolomics assessment. Curated from authoritative public resources, MetaBench evaluates five capabilities essential for metabolomics research: knowledge, understanding, grounding, reasoning, and research. Our evaluation of 25 open- and closed-source LLMs reveals distinct performance patterns across metabolomics tasks: while models perform well on text generation tasks, cross-database identifier grounding remains challenging even with retrieval augmentation. Model performance also decreases on long-tail metabolites with sparse annotations. With MetaBench, we provide essential infrastructure for developing and evaluating metabolomics AI systems, enabling systematic progress toward reliable computational tools for metabolomics research.
10. 【2510.14943】LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
链接:https://arxiv.org/abs/2510.14943
作者:Wenkai Yang,Weijie Liu,Ruobing Xie,Yiju Guo,Lulu Wu,Saiyong Yang,Yankai Lin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, Learning with Verifiable, Reinforcement Learning, Language Models
备注: Work in progress. Github repo: [this https URL](https://github.com/RUCBM/LaSeR)
点击查看摘要
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate the training of model's self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with a MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model's reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.
11. 【2510.14937】AI-Powered Early Diagnosis of Mental Health Disorders from Real-World Clinical Conversations
链接:https://arxiv.org/abs/2510.14937
作者:Jianfeng Zhu,Julina Maharjan,Xinyu Li,Karin G. Coifman,Ruoming Jin
类目:Computation and Language (cs.CL)
关键词:Post-Traumatic Stress Disorder, Post-Traumatic Stress, health disorders remain, Stress Disorder, Mental health disorders
备注: 7 pages 1 figure
点击查看摘要
Abstract:Mental health disorders remain among the leading cause of disability worldwide, yet conditions such as depression, anxiety, and Post-Traumatic Stress Disorder (PTSD) are frequently underdiagnosed or misdiagnosed due to subjective assessments, limited clinical resources, and stigma and low awareness. In primary care settings, studies show that providers misidentify depression or anxiety in over 60% of cases, highlighting the urgent need for scalable, accessible, and context-aware diagnostic tools that can support early detection and intervention. In this study, we evaluate the effectiveness of machine learning models for mental health screening using a unique dataset of 553 real-world, semistructured interviews, each paried with ground-truth diagnoses for major depressive episodes (MDE), anxiety disorders, and PTSD. We benchmark multiple model classes, including zero-shot prompting with GPT-4.1 Mini and MetaLLaMA, as well as fine-tuned RoBERTa models using LowRank Adaptation (LoRA). Our models achieve over 80% accuracy across diagnostic categories, with especially strongperformance on PTSD (up to 89% accuracy and 98% recall). We also find that using shorter context, focused context segments improves recall, suggesting that focused narrative cues enhance detection sensitivity. LoRA fine-tuning proves both efficient and effective, with lower-rank configurations (e.g., rank 8 and 16) maintaining competitive performance across evaluation metrics. Our results demonstrate that LLM-based models can offer substantial improvements over traditional self-report screening tools, providing a path toward low-barrier, AI-powerd early diagnosis. This work lays the groundwork for integrating machine learning into real-world clinical workflows, particularly in low-resource or high-stigma environments where access to timely mental health care is most limited.
12. 【2510.14936】Circuit Insights: Towards Interpretability Beyond Activations
链接:https://arxiv.org/abs/2510.14936
作者:Elena Golimblevskaia,Aakriti Jain,Bruno Puri,Ammar Ibrahim,Wojciech Samek,Sebastian Lapuschkin
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:understanding model computations, neural networks, fields of explainable, aim to uncover, uncover the internal
备注:
点击查看摘要
Abstract:The fields of explainable AI and mechanistic interpretability aim to uncover the internal structure of neural networks, with circuit discovery as a central tool for understanding model computations. Existing approaches, however, rely on manual inspection and remain limited to toy tasks. Automated interpretability offers scalability by analyzing isolated features and their activations, but it often misses interactions between features and depends strongly on external LLMs and dataset quality. Transcoders have recently made it possible to separate feature attributions into input-dependent and input-invariant components, providing a foundation for more systematic circuit analysis. Building on this, we propose WeightLens and CircuitLens, two complementary methods that go beyond activation-based analysis. WeightLens interprets features directly from their learned weights, removing the need for explainer models or datasets while matching or exceeding the performance of existing methods on context-independent features. CircuitLens captures how feature activations arise from interactions between components, revealing circuit-level dynamics that activation-only approaches cannot identify. Together, these methods increase interpretability robustness and enhance scalable mechanistic analysis of circuits while maintaining efficiency and quality.
13. 【2510.14925】Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models
链接:https://arxiv.org/abs/2510.14925
作者:Akira Okutomi
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:reinterpret Kant Critique, Kant Critique, Critique of Pure, Pure Reason, viewing reason
备注: 19 pages, 2 figures, preliminary version
点击查看摘要
Abstract:We reinterpret Kant's Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. Extending to large language models (LLMs), we find that fragile internal dynamics correlate with miscalibration and hallucination, while critique-style prompts show mixed effects on calibration and hallucination. These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens for diagnosing -- and selectively reducing -- overconfidence in reasoning systems. This is a preliminary version; supplementary experiments and broader replication will be reported in a future revision.
14. 【2510.14922】RI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG
链接:https://arxiv.org/abs/2510.14922
作者:Annisaa Fitri Nurfidausi,Eleonora Mancini,Paolo Torroni
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
关键词:mental health disorder, widespread mental health, detection remains challenging, automatic detection remains, health disorder
备注:
点击查看摘要
Abstract:Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Prior work has explored unimodal and multimodal approaches, with multimodal systems showing promise by leveraging complementary signals. However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols. We address these gaps by systematically exploring feature representations and modelling strategies across EEG, together with speech and text. We evaluate handcrafted features versus pre-trained embeddings, assess the effectiveness of different neural encoders, compare unimodal, bimodal, and trimodal configurations, and analyse fusion strategies with attention to the role of EEG. Consistent subject-independent splits are applied to ensure robust, reproducible benchmarking. Our results show that (i) the combination of EEG, speech and text modalities enhances multimodal detection, (ii) pretrained embeddings outperform handcrafted features, and (iii) carefully designed trimodal models achieve state-of-the-art performance. Our work lays the groundwork for future research in multimodal depression detection.
15. 【2510.14919】Predicting Task Performance with Context-aware Scaling Laws
链接:https://arxiv.org/abs/2510.14919
作者:Kyle Montgomery,David Park,Jianhong Tu,Michael Bendersky,Beliz Gunel,Dawn Song,Chenguang Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:linking upstream metrics, Scaling laws, large language models, transformed our understanding, understanding of large
备注:
点击查看摘要
Abstract:Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer valuable insights into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at this https URL.
16. 【2510.14915】Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation
链接:https://arxiv.org/abs/2510.14915
作者:Xujun Peng,Anoop Kumar,Jingyu Wu,Parker Glenn,Daben Liu
类目:Computation and Language (cs.CL)
关键词:leverage Large Language, Large Language Models, Large Language, systems leverage Large, leverage Large
备注: EMNLP 2025 Industry track
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) systems leverage Large Language Models (LLMs) to generate accurate and reliable responses that are grounded in retrieved context. However, LLMs often generate inconsistent outputs for semantically equivalent inputs, a problem compounded by the scarcity of consistency-focused training data and the limitations of current fine-tuning techniques in enhancing output consistency. We propose a new approach combining systematic synthetic data generation, triplet loss for better embeddings, and a novel layer-wise model merging approach. Using consistency-aware weights derived from intermediate layer activations, our method effectively integrates knowledge from specialized models. Experimental results how that our merged model significantly enhances output consistency, achieving a ~47.5\% improvement in response similarity over the baseline, thus offering a practical solution for increasing the reliability of an industrial RAG system.
17. 【2510.14913】Budget-aware Test-time Scaling via Discriminative Verification
链接:https://arxiv.org/abs/2510.14913
作者:Kyle Montgomery,Sijun Tan,Yuqi Chen,Siyuan Zhuang,Tianjun Zhang,Raluca Ada Popa,Chenguang Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:complex reasoning tasks, large language models, reasoning tasks, strategy for boosting, boosting the performance
备注:
点击查看摘要
Abstract:Test-time scaling is a powerful strategy for boosting the performance of large language models on complex reasoning tasks. While state-of-the-art approaches often employ generative verifiers to select the best solution from a pool of candidates, this method incurs prohibitive computational costs, limiting its practicality. In this work, we shift the focus to a more budget-aware paradigm: discriminative verification. We conduct a thorough empirical analysis and demonstrate that while discriminative verifiers may underperform in isolation, combining them with self-consistency in a hybrid approach creates a powerful and efficient test-time scaling mechanism. Notably, under a fixed compute budget, this hybrid approach surpasses state-of-the-art generative verification by a significant margin: achieving up to 15.3\% higher accuracy on AIME2025. Our findings establish that for practical, real-world applications, budget-aware scaling with discriminative verifiers is not only a "free" upgrade over self-consistency, but also a more effective and efficient alternative to costly generative techniques. Code is available at this https URL.
18. 【2510.14901】Reasoning with Sampling: Your Base Model is Smarter Than You Think
链接:https://arxiv.org/abs/2510.14901
作者:Aayush Karan,Yilun Du
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:exhibited incredible capabilities, posttraining large language, Frontier reasoning models, large language models, array of disciplines
备注:
点击查看摘要
Abstract:Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilites can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL-posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
19. 【2510.14889】Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
链接:https://arxiv.org/abs/2510.14889
作者:Soorya Ram Shimgekar,Ruining Zhao,Agam Goyal,Violeta J. Rodriguez,Paul A. Bloom,Hari Sundaram,Koustuv Saha
类目:ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
关键词:experiencing suicidal ideation, individuals experiencing suicidal, social media, suicidal ideation, distress explicitly
备注:
点击查看摘要
Abstract:On social media, many individuals experiencing suicidal ideation (SI) do not disclose their distress explicitly. Instead, signs may surface indirectly through everyday posts or peer interactions. Detecting such implicit signals early is critical but remains challenging. We frame early and implicit SI as a forward-looking prediction task and develop a computational framework that models a user's information environment, consisting of both their longitudinal posting histories as well as the discourse of their socially proximal peers. We adopted a composite network centrality measure to identify top neighbors of a user, and temporally aligned the user's and neighbors' interactions -- integrating the multi-layered signals in a fine-tuned DeBERTa-v3 model. In a Reddit study of 1,000 (500 Case and 500 Control) users, our approach improves early and implicit SI detection by 15% over individual-only baselines. These findings highlight that peer interactions offer valuable predictive signals and carry broader implications for designing early detection systems that capture indirect as well as masked expressions of risk in online environments.
20. 【2510.14885】You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction
链接:https://arxiv.org/abs/2510.14885
作者:Logan Lawrence,Oindrila Saha,Megan Wei,Chen Sun,Subhransu Maji,Grant Van Horn
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Multimodal Large Language, Large Language Models, auto-regressive models remains, Multimodal Large, zero-shot visual classification
备注: Accepted to WACV26. 12 pages, 8 tables, 5 figures
点击查看摘要
Abstract:Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don't consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.
21. 【2510.14871】From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR
链接:https://arxiv.org/abs/2510.14871
作者:Erwei Wang,Samuel Bayliss,Andra Bisca,Zachary Blair,Sangeeta Chowdhary,Kristof Denolf,Jeff Fifield,Brandon Freiberger,Erika Hunhoff,Phil James-Roxby,Jack Lo,Joseph Melber,Stephen Neuendorffer,Eddie Richter,Andre Rosti,Javier Setoain,Gagandeep Singh,Endri Taka,Pranathi Vasireddy,Zhewen Yu,Niansong Zhang,Jinming Zhuang
类目:Computation and Language (cs.CL); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
关键词:General-purpose compilers abstract, General-purpose compilers, limiting their effectiveness, modern spatial architectures, modern computing architectures
备注:
点击查看摘要
Abstract:General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures. As modern computing architectures increasingly rely on fine-grained control over data movement, execution order, and compute placement for performance, compiler infrastructure must provide explicit mechanisms for orchestrating compute and data to fully exploit such architectures. We introduce MLIR-AIR, a novel, open-source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD's NPUs. MLIR-AIR defines the AIR dialect, which provides structured representations for asynchronous and hierarchical operations across compute and memory resources. AIR primitives allow the compiler to orchestrate spatial scheduling, distribute computation across hardware regions, and overlap communication with computation without relying on ad hoc runtime coordination or manual scheduling. We demonstrate MLIR-AIR's capabilities through two case studies: matrix multiplication and the multi-head attention block from the LLaMA 2 model. For matrix multiplication, MLIR-AIR achieves up to 78.7% compute efficiency and generates implementations with performance almost identical to state-of-the-art, hand-optimized matrix multiplication written using the lower-level, close-to-metal MLIR-AIE framework. For multi-head attention, we demonstrate that the AIR interface supports fused implementations using approximately 150 lines of code, enabling tractable expression of complex workloads with efficient mapping to spatial hardware. MLIR-AIR transforms high-level structured control flow into spatial programs that efficiently utilize the compute fabric and memory hierarchy of an NPU, leveraging asynchronous execution, tiling, and communication overlap through compiler-managed scheduling.
22. 【2510.14866】Benchmarking Multimodal Large Language Models for Face Recognition
链接:https://arxiv.org/abs/2510.14866
作者:Hatef Otroshi Shahreza,Sébastien Marcel
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Multimodal large language, Multimodal large, achieved remarkable performance, large language models, large language
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have achieved remarkable performance across diverse vision-and-language tasks. However, their potential in face recognition remains underexplored. In particular, the performance of open-source MLLMs needs to be evaluated and compared with existing face recognition models on standard benchmarks with similar protocol. In this work, we present a systematic benchmark of state-of-the-art MLLMs for face recognition on several face recognition datasets, including LFW, CALFW, CPLFW, CFP, AgeDB and RFW. Experimental results reveal that while MLLMs capture rich semantic cues useful for face-related tasks, they lag behind specialized models in high-precision recognition scenarios in zero-shot applications. This benchmark provides a foundation for advancing MLLM-based face recognition, offering insights for the design of next-generation models with higher accuracy and generalization. The source code of our benchmark is publicly available in the project page.
23. 【2510.14865】Midtraining Bridges Pretraining and Posttraining Distributions
链接:https://arxiv.org/abs/2510.14865
作者:Emmy Liu,Graham Neubig,Chenyan Xiong
类目:Computation and Language (cs.CL)
关键词:higher quality, midtraining, Recently, language models, language models pretrained
备注:
点击查看摘要
Abstract:Recently, many language models have been pretrained with a "midtraining" phase, in which higher quality, often instruction-formatted data, is mixed in at the end of pretraining. Despite the popularity of this practice, there is little scientific understanding of this phase of model training or why it is effective. In this work, we conduct the first systematic investigation of midtraining through controlled experiments with language models pretrained from scratch and fine-tuned on supervised finetuning datasets in different domains. We find that when compared after supervised fine-tuning, the effectiveness of midtraining is highest in the math and code domains, where midtraining can best reduce the syntactic gap between pretraining and posttraining data. In these cases, midtraining consistently outperforms continued pretraining in both in-domain validation loss as well as pretraining data forgetting after posttraining. We conduct ablations on the starting time of the midtraining phase and mixture weights of the midtraining data, using code midtraining as a case study, and find that timing has a greater impact than mixture weights, with earlier introduction of specialized data, yielding greater benefits in-domain as well as preserving general language modeling better. These findings establish midtraining as a domain adaptation technique that compared to continued pretraining yields better performance through reduced forgetting.
24. 【2510.14853】Rewiring Experts on the Fly:Continuous Rerouting for Better Online Adaptation in Mixture-of-Expert models
链接:https://arxiv.org/abs/2510.14853
作者:Guinan Su,Yanwu Yang,Li Shen,Lu Yin,Shiwei Liu,Jonas Geiping
类目:Computation and Language (cs.CL)
关键词:sparse expert activation, suffer from suboptimal, due to distribution, routing decisions due, suboptimal routing decisions
备注:
点击查看摘要
Abstract:Mixture-of-Experts (MoE) models achieve efficient scaling through sparse expert activation, but often suffer from suboptimal routing decisions due to distribution shifts in deployment. While existing test-time adaptation methods could potentially address these issues, they primarily focus on dense models and require access to external data, limiting their practical applicability to MoE architectures. However, we find that, instead of relying on reference data, we can optimize MoE expert selection on-the-fly based only on input context. As such, we propose \textit{a data-free, online test-time framework} that continuously adapts MoE routing decisions during text generation without external supervision or data. Our method cycles between two phases: During the prefill stage, and later in regular intervals, we optimize the routing decisions of the model using self-supervision based on the already generated sequence. Then, we generate text as normal, maintaining the modified router until the next adaption. We implement this through lightweight additive vectors that only update router logits in selected layers, maintaining computational efficiency while preventing over-adaptation. The experimental results show consistent performance gains on challenging reasoning tasks while maintaining robustness to context shifts. For example, our method achieves a 5.5\% improvement on HumanEval with OLMoE. Furthermore, owing to its plug-and-play property, our method naturally complements existing test-time scaling techniques, e.g., achieving 6\% average gains when incorporated with self-consistency on DeepSeek-V2-Lite.
25. 【2510.14846】Where to Search: Measure the Prior-Structured Search Space of LLM Agents
链接:https://arxiv.org/abs/2510.14846
作者:Zhuo-Yang Song
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
关键词:large language models, paradigm based, based on large, achieved progress, program discovery
备注: 10 pages, 2 figures, 1 table
点击查看摘要
Abstract:The generate-filter-refine (iterative) paradigm based on large language models (LLMs) has achieved progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, namely, how to encode the domain prior into an operationally structured hypothesis space. To this end, this paper proposes a compact formal theory that describes and measures LLM-assisted iterative search guided by domain priors. We represent an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi-step reasoning/search, we weight all reachable paths by a single continuation parameter and sum them to obtain a coverage generating function; this induces a measure of reachability difficulty; and it provides a geometric interpretation of search on the graph induced by the safety envelope. We further provide the simplest testable inferences and validate them via a majority-vote instantiation. This theory offers a workable language and operational tools to measure agents and their search spaces, proposing a systematic formal description of iterative search constructed by LLMs.
26. 【2510.14824】Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking
链接:https://arxiv.org/abs/2510.14824
作者:Ziqi Dai,Xin Zhang,Mingxin Li,Yanzhao Zhang,Dingkun Long,Pengjun Xie,Meishan Zhang,Wenjie Li,Min Zhang
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:binary label prediction, relevant query-document pairs, metric learning, binary label, relevance vs. irrelevance
备注:
点击查看摘要
Abstract:In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts ''yes'' (resp. ''no'') token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking the universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: weight, which controls the magnitude of those updates, and direction, which guides the model updates, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark. We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.
27. 【2510.14773】Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning
链接:https://arxiv.org/abs/2510.14773
作者:Hwiyeol Jo,Joosung Lee,Jaehone Lee,Sang-Woo Lee,Joonsuk Park,Kang Min Yoo
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Evaluating generative models, commonly involves question-answering, Evaluating generative, large language models, involves question-answering tasks
备注: ARR Submitted
点击查看摘要
Abstract:Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on probability of answer choices. On the other hand, for models requiring reasoning, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. In order to mitigate this, we propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. Furthermore, we have applied this framework to general math problems and open-ended question answering tasks. Our analysis and this framework could offer a more reliable results for model evaluation.
28. 【2510.14763】COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes
链接:https://arxiv.org/abs/2510.14763
作者:Yunwen Li,Shuangshuang Ying,Xingwei Qu,Xin Li,Sheng Jin,Minghao Liu,Zhoufutu Wen,Tianyu Zheng,Xeron Du,Qiguang Chen,Jiajun Shi,Wangchunshu Zhou,Jiazhan Feng,Wanjun Zhong,Libo Qin,Stephen Huang,Wanxiang Che,Chenghua Lin,Eli Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, Large language, lacks process-level supervision, language models exhibit, models exhibit systematic
备注:
点击查看摘要
Abstract:Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts. Unlike existing datasets that provide only input-output pairs, COIG-Writer comprises 1,665 meticulously curated triplets spanning 51 genres, each containing: (1) a reverse-engineered prompt, (2) detailed creative reasoning documenting decision-making processes, and (3) the final text. Through comprehensive experiments, we identify a two-component model of creative writing: narrative logic (provided by process supervision) and linguistic expression (maintained by general-purpose data). Our findings reveal three critical insights: (1) Process supervision is highly effective but requires stabilization with general data. A ratio of at least one creative sample to twelve general samples is needed to achieve optimal performance; below this threshold, the win rate progressively degrades (from 62.75% down to 35.78%)., (2) creative capabilities are culturally-bound with no cross-lingual transfer (89.26pp gap between Chinese and English performance), and (3) lexical diversity inversely correlates with creative quality (TTR paradox), suggesting high diversity signals compensatory behavior for logical deficiencies. These findings establish that creative excellence emerges from the interaction between logical scaffolding and linguistic grounding, analogous to how mathematical reasoning enhances but cannot replace linguistic competence in foundation models.
29. 【2510.14756】Pluto: A Benchmark for Evaluating Efficiency of LLM-generated Hardware Code
链接:https://arxiv.org/abs/2510.14756
作者:Manar Abdelatty,Maryam Nouh,Jacob K. Rosenstein,Sherief Reda
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, hardware design tasks, automate hardware design
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly used to automate hardware design tasks, including the generation of Verilog code. While early benchmarks focus primarily on functional correctness, efficient hardware design demands additional optimization for synthesis metrics such as area, delay, and power. Existing benchmarks fall short in evaluating these aspects comprehensively: they often lack optimized baselines or testbenches for verification. To address these gaps, we present Pluto, a benchmark and evaluation framework designed to assess the efficiency of LLM-generated Verilog designs. Pluto presents a comprehensive evaluation set of 114 problems with self-checking testbenches and multiple Pareto-optimal reference implementations. Experimental results show that state-of-the-art LLMs can achieve high functional correctness, reaching 78.3\% at pass@1, but their synthesis efficiency still lags behind expert-crafted implementations, with area efficiency of 63.8\%, delay efficiency of 65.9\%, and power efficiency of 64.0\% at eff@1. This highlights the need for efficiency-aware evaluation frameworks such as Pluto to drive progress in hardware-focused LLM research.
30. 【2510.14738】AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning
链接:https://arxiv.org/abs/2510.14738
作者:Mengzhao Jia,Zhihan Zhang,Ignacio Cases,Zheyuan Liu,Meng Jiang,Peng Qi
类目:Computation and Language (cs.CL)
关键词:large language models, Multimodal large language, complex multi-step reasoning, correctness is rewarded, large language
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only the final-answer correctness is rewarded. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation lies in a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.
31. 【2510.14718】Speculative Model Risk in Healthcare AI: Using Storytelling to Surface Unintended Harms
链接:https://arxiv.org/abs/2510.14718
作者:Xingmeng Zhao,Dan Schumacher,Veronica Rammouz,Anthony Rios
类目:Computation and Language (cs.CL)
关键词:rapidly transforming healthcare, mental health chatbots, enabling fast development, Artificial intelligence, wellness trackers
备注: 8 pages main + Appendix
点击查看摘要
Abstract:Artificial intelligence (AI) is rapidly transforming healthcare, enabling fast development of tools like stress monitors, wellness trackers, and mental health chatbots. However, rapid and low-barrier development can introduce risks of bias, privacy violations, and unequal access, especially when systems ignore real-world contexts and diverse user needs. Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect. We present a human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before deployment. In a user study, participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 13 harm types. In contrast, those who did not read stories focused primarily on privacy and well-being (58.3%). Our findings show that storytelling helped participants speculate about a broader range of harms and benefits and think more creatively about AI's impact on users.
32. 【2510.14670】ITAN: Graph-Executable Reasoning for Cyber Threat Intelligence
链接:https://arxiv.org/abs/2510.14670
作者:Marco Simoni,Aleksandar Fontana,Andrea Saracino,Paolo Mori
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
关键词:Automated Navigation, Intelligence Through Automated, connects natural-language cyber, natural-language cyber threat, cyber threat queries
备注:
点击查看摘要
Abstract:TITAN (Threat Intelligence Through Automated Navigation) is a framework that connects natural-language cyber threat queries with executable reasoning over a structured knowledge graph. It integrates a path planner model, which predicts logical relation chains from text, and a graph executor that traverses the TITAN Ontology to retrieve factual answers and supporting evidence. Unlike traditional retrieval systems, TITAN operates on a typed, bidirectional graph derived from MITRE, allowing reasoning to move clearly and reversibly between threats, behaviors, and defenses. To support training and evaluation, we introduce the TITAN Dataset, a corpus of 88209 examples (Train: 74258; Test: 13951) pairing natural language questions with executable reasoning paths and step by step Chain of Thought explanations. Empirical evaluations show that TITAN enables models to generate syntactically valid and semantically coherent reasoning paths that can be deterministically executed on the underlying graph.
33. 【2510.14662】Semantic Prosody in Machine Translation: the English-Chinese Case of Passive Structures
链接:https://arxiv.org/abs/2510.14662
作者:Xinyue Ma,Pol Pastells,Mireia Farrús,Mariona Taulé
类目:Computation and Language (cs.CL)
关键词:collocational meaning formed, Semantic prosody, series of collocates, collocational meaning, meaning formed
备注: 11 pages, 2 figures, *SEM workshop at EMNLP 2025 conference
点击查看摘要
Abstract:Semantic prosody is a collocational meaning formed through the co-occurrence of a linguistic unit and a consistent series of collocates, which should be treated separately from semantic meaning. Since words that are literal translations of each other may have different semantic prosody, more attention should be paid to this linguistic property to generate accurate translations. However, current machine translation models cannot handle this problem. To bridge the gap, we propose an approach to teach machine translation models about semantic prosody of a specific structure. We focus on Chinese BEI passives and create a dataset of English-Chinese sentence pairs with the purpose of demonstrating the negative semantic prosody of BEI passives. Then we fine-tune OPUS-MT, NLLB-600M and mBART50 models with our dataset for the English-Chinese translation task. Our results show that fine-tuned MT models perform better on using BEI passives for translating unfavourable content and avoid using it for neutral and favourable content. Also, in NLLB-600M, which is a multilingual model, this knowledge of semantic prosody can be transferred from English-Chinese translation to other language pairs, such as Spanish-Chinese.
34. 【2510.14660】An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs
链接:https://arxiv.org/abs/2510.14660
作者:Linyue Ma,Yilong Xu,Xiang Long,Zhi Zheng
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Large Language Models, empowers Large Language, augmentation empowers Large, Large Language, Language Models
备注:
点击查看摘要
Abstract:Search augmentation empowers Large Language Models with retrieval capabilities to overcome the limitations imposed by static parameters. Recently, Reinforcement Learning leverages tailored reward signals as a viable technique to enhance LLMs performing tasks involving search. However, existing reward modeling for search-augmented LLMs faces several limitations. Rule-based rewards, such as Exact Match, are verifiable but fragile to variations in expression and cannot be applied to long-form workloads. In contrast, generative rewards improve robustness, but designing verifiable and stable rewards for long-form workloads in dynamic corpora remains challenging and also incurs high computational costs. In this paper, we propose a unified and verifiable paradigm, "nugget-as-rubric", which treats atomic information points as structured evaluation criteria for different search-augmentation workloads. Short-form tasks correspond to a single rubric, whereas long-form tasks expand to multiple rubrics aligned with the question's information needs. To support long-form settings, we design an automatic rubric construction pipeline based on query rewriting, which can automatically retrieve passages relevant to each question and extract rubrics from them, both from static corpora and from dynamic online web content. Furthermore, we introduce \textbf{Search-Gen-V}, a 4B-parameter efficient generative verifier under our proposed verifiable paradigm, which is trained via the idea of distillation and a two-stage strategy. Experimental results show that Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient verifiable reward constructor for search-augmented LLMs.
35. 【2510.14640】Intent Clustering with Shared Pseudo-Labels
链接:https://arxiv.org/abs/2510.14640
作者:I-Fan Lin,Faegheh Hasibi,Suzan Verberne
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:makes minimal assumptions, propose an intuitive, training-free and label-free, intent clustering, clustering that makes
备注:
点击查看摘要
Abstract:In this paper, we propose an intuitive, training-free and label-free method for intent clustering that makes minimal assumptions using lightweight and open-source LLMs. Many current approaches rely on commercial LLMs, which are costly, and offer limited transparency. Additionally, their methods often explicitly depend on knowing the number of clusters in advance, which is often not the case in realistic settings. To address these challenges, instead of asking the LLM to match similar text directly, we first ask it to generate pseudo-labels for each text, and then perform multi-label classification in this pseudo-label set for each text. This approach is based on the hypothesis that texts belonging to the same cluster will share more labels, and will therefore be closer when encoded into embeddings. These pseudo-labels are more human-readable than direct similarity matches. Our evaluation on four benchmark sets shows that our approach achieves results comparable to and better than recent baselines, while remaining simple and computationally efficient. Our findings indicate that our method can be applied in low-resource scenarios and is stable across multiple models and datasets.
36. 【2510.14628】RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF
链接:https://arxiv.org/abs/2510.14628
作者:Qing Yang,Zhenghao Liu,Junxin Wang,Yangfan Du,Pengcheng Huang,Tong Xiao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:emotional expressiveness remains, achieved near-human quality, emotional expressiveness, Large Language Model, synthesis has achieved
备注:
点击查看摘要
Abstract:Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback (RLAIF) mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques to respectively judge semantic accuracy and prosodic-emotional label alignment as a direct reward for emotional expressiveness and intelligibility optimization. Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the Libri Speech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.
37. 【2510.14621】ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
链接:https://arxiv.org/abs/2510.14621
作者:Yuanyi Song,Heyuan Huang,Qiqiang Lin,Yin Zhao,Xiangmou Qu,Jun Wang,Xingyu Lou,Weiwen Liu,Zhuosheng Zhang,Jun Wang,Yong Yu,Weinan Zhang,Zhaoxiang Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:graphical user interfaces, multimodal large language, large language models, operate mobile devices, user interfaces
备注:
点击查看摘要
Abstract:The rapid advancement of multimodal large language models has enabled agents to operate mobile devices by directly interacting with graphical user interfaces, opening new possibilities for mobile automation. However, real-world mobile tasks are often complex and allow for multiple valid solutions. This contradicts current mobile agent evaluation standards: offline static benchmarks can only validate a single predefined "golden path", while online dynamic testing is constrained by the complexity and non-reproducibility of real devices, making both approaches inadequate for comprehensively assessing agent capabilities. To bridge the gap between offline and online evaluation and enhance testing stability, this paper introduces a novel graph-structured benchmarking framework. By modeling the finite states observed during real-device interactions, it achieves static simulation of dynamic behaviors. Building on this, we develop ColorBench, a benchmark focused on complex long-horizon tasks. It supports evaluation of multiple valid solutions, subtask completion rate statistics, and atomic-level capability analysis. ColorBench contains 175 tasks (74 single-app, 101 cross-app) with an average length of over 13 steps. Each task includes at least two correct paths and several typical error paths, enabling quasi-dynamic interaction. By evaluating ColorBench across various baselines, we discover limitations of existing models and propose improvement directions and feasible technical pathways to enhance agents' performance on complex, long-horizon problems based on experimental results. Code and data are available at: this https URL.
38. 【2510.14620】Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models
链接:https://arxiv.org/abs/2510.14620
作者:Kedi Chen,Zhikai Lei,Xu Guo,Xuecheng Wu,Siyuan Zeng,Jianghao Yin,Yinqi Zhang,Qin Chen,Jie Zhou,Liang He,Qipeng Guo,Kai Chen,Wei Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:make remarkable progress, Large language models, Large language, make remarkable, remarkable progress
备注:
点击查看摘要
Abstract:Large language models (LLMs) make remarkable progress in reasoning tasks. Among different reasoning modes, inductive reasoning, due to its better alignment with human learning, attracts increasing interest. However, research on inductive reasoning faces certain challenges. First, existing inductive data mostly focuses on superficial regularities while lacking more complex internal patterns. Second, current works merely prompt LLMs or finetune on simple prompt-response pairs, but do not provide precise thinking processes nor implement difficulty control. Unlike previous work, we address these challenges by introducing \textit{CodeSeq}, a synthetic post-training dataset built from number sequences. We package number sequences into algorithmic problems to discover their general terms, defining a general term generation (GTG) task correspondingly. Our pipeline generates supervised finetuning data by reflecting on failed test cases and incorporating iterative corrections, thereby teaching LLMs to learn autonomous case generation and self-checking. Additionally, it leverages reinforcement learning with a novel Case-Synergy Solvability Scaling Reward based on both solvability, estimated from the problem pass rate, and the success rate of self-directed case generation, enabling models to learn more effectively from both successes and failures. Experimental results show that the models trained with \textit{CodeSeq} improve on various reasoning tasks and can preserve the models' OOD performance.
39. 【2510.14616】Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures
链接:https://arxiv.org/abs/2510.14616
作者:Shuangshuang Ying,Yunwen Li,Xingwei Qu,Xin Li,Sheng Jin,Minghao Liu,Zhoufutu Wen,Xeron Du,Tianyu Zheng,Yichi Zhang,Letian Ni,Yuyang Cheng,Qiguang Chen,Jingzhe Ding,Shengda Long,Wangchunshu Zhou,Jiazhan Feng,Wanjun Zhong,Libo Qin,Ge Zhang,Wenhao Huang,Wanxiang Che,Chenghua Lin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:exhibit significant performance, significant performance degradation, signals are removed, exhibit significant, significant performance
备注:
点击查看摘要
Abstract:Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models--the standard architecture for RLHF--achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.
40. 【2510.14591】Just-In-Time Objectives: A General Approach for Specialized AI Interactions
链接:https://arxiv.org/abs/2510.14591
作者:Michelle S. Lam,Omar Shaikh,Hallie Xu,Alice Guo,Diyi Yang,Jeffrey Heer,James A. Landay,Michael S. Bernstein
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large language models, language models promise, drafting emails littered, Large language, set of functions
备注:
点击查看摘要
Abstract:Large language models promise a broad set of functions, but when not given a specific objective, they default to milquetoast results such as drafting emails littered with cliches. We demonstrate that inferring the user's in-the-moment objective, then rapidly optimizing for that singular objective, enables LLMs to produce tools, interfaces, and responses that are more responsive and desired. We contribute an architecture for automatically inducing just-in-time objectives by passively observing user behavior, then steering downstream AI systems through generation and evaluation against this objective. Inducing just-in-time objectives (e.g., "Clarify the abstract's research contribution") enables automatic generation of tools, e.g., those that critique a draft based on relevant HCI methodologies, anticipate related researchers' reactions, or surface ambiguous terminology. In a series of experiments (N=14, N=205) on participants' own tasks, JIT objectives enable LLM outputs that achieve 66-86% win rates over typical LLMs, and in-person use sessions (N=17) confirm that JIT objectives produce specialized tools unique to each participant.
41. 【2510.14583】alking Points: Describing and Localizing Pixels
链接:https://arxiv.org/abs/2510.14583
作者:Matan Rusanovsky,Shimon Malnick,Shai Avidan
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:achieved remarkable success, achieved remarkable, remarkable success, success in cross-modal, Point
备注:
点击查看摘要
Abstract:Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context. Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy. To evaluate our results we establish a new evaluation protocol. Instead of comparing the text description produced by our method to the ground truth, we use the localizer to determine how close is the predicted point generated to the ground truth point. Experiments demonstrate superior performance compared to baseline models on this http URL bidirectional nature of our framework should enable future applications in both keypoint-guided image understanding and language-guided precise localization. Our code and dataset are publicly available at this https URL.
42. 【2510.14565】Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLMs
链接:https://arxiv.org/abs/2510.14565
作者:Kyubyung Chae,Gihoon Kim,Gyuseong Lee,Taesup Kim,Jaejin Lee,Heejin Kim
类目:Computation and Language (cs.CL)
关键词:Recent trends, show growing interest, sovereign LLMs, growing interest, LLMs
备注:
点击查看摘要
Abstract:Recent trends in LLMs development clearly show growing interest in the use and application of sovereign LLMs. The global debate over sovereign LLMs highlights the need for governments to develop their LLMs, tailored to their unique socio-cultural and historical contexts. However, there remains a shortage of frameworks and datasets to verify two critical questions: (1) how well these models align with users' socio-cultural backgrounds, and (2) whether they maintain safety and technical robustness without exposing users to potential harms and risks. To address this gap, we construct a new dataset and introduce an analytic framework for extracting and evaluating the socio-cultural elements of sovereign LLMs, alongside assessments of their technical robustness. Our experimental results demonstrate that while sovereign LLMs play a meaningful role in supporting low-resource languages, they do not always meet the popular claim that these models serve their target users well. We also show that pursuing this untested claim may lead to underestimating critical quality attributes such as safety. Our study suggests that advancing sovereign LLMs requires a more extensive evaluation that incorporates a broader range of well-grounded and practical criteria.
43. 【2510.14545】Agentic Entropy-Balanced Policy Optimization
链接:https://arxiv.org/abs/2510.14545
作者:Guanting Dong,Licheng Bao,Zhongyuan Wang,Kangzhi Zhao,Xiaoxi Li,Jiajie Jin,Jinghan Yang,Hangyu Mao,Fuzheng Zhang,Kun Gai,Guorui Zhou,Yutao Zhu,Ji-Rong Wen,Zhicheng Dou
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Agentic Reinforcement Learning, long-horizon tool-use capabilities, made significant progress, Agentic Reinforcement, Reinforcement Learning
备注: Working in progress
点击查看摘要
Abstract:Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to the training collapse. In this paper, we delve into the challenges caused by entropy and propose the Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocate global and branch sampling budget through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues; and (2) Entropy-Balanced Policy Optimization that inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.
44. 【2510.14509】E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task
链接:https://arxiv.org/abs/2510.14509
作者:Jingyao Liu,Chen Huang,Zhizhao Guan,Wenqiang Lei,Yang Deng
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:multiple BDD test, BDD test scenarios, Python step implementations, fully automated testing, automated testing pipeline
备注:
点击查看摘要
Abstract:E2EDev comprises (i) a fine-grained set of user requirements, (ii) {multiple BDD test scenarios with corresponding Python step implementations for each requirement}, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure its quality while reducing the annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA). {By evaluating various E2ESD frameworks and LLM backbones with E2EDev}, our analysis reveals a persistent struggle to effectively solve these tasks, underscoring the critical need for more effective and cost-efficient E2ESD solutions. Our codebase and benchmark are publicly available at this https URL.
45. 【2510.14504】Efficient Seq2seq Coreference Resolution Using Entity Representations
链接:https://arxiv.org/abs/2510.14504
作者:Matt Grenander,Shay B. Cohen,Mark Steedman
类目:Computation and Language (cs.CL)
关键词:requiring task-specific parameters, task-specific parameters, learning to generate, requiring task-specific, generate text
备注:
点击查看摘要
Abstract:Seq2seq coreference models have introduced a new paradigm for coreference resolution by learning to generate text corresponding to coreference labels, without requiring task-specific parameters. While these models achieve new state-of-the-art performance, they do so at the cost of flexibility and efficiency. In particular, they do not efficiently handle incremental settings such as dialogue, where text must processed sequentially. We propose a compressed representation in order to improve the efficiency of these methods in incremental settings. Our method works by extracting and re-organizing entity-level tokens, and discarding the majority of other input tokens. On OntoNotes, our best model achieves just 0.6 CoNLL F1 points below a full-prefix, incremental baseline while achieving a compression ratio of 1.8. On LitBank, where singleton mentions are annotated, it passes state-of-the-art performance. Our results indicate that discarding a wide portion of tokens in seq2seq resolvers is a feasible strategy for incremental coreference resolution.
46. 【2510.14466】LiRA: Linguistic Robust Anchoring for Cross-lingual Large Language Models
链接:https://arxiv.org/abs/2510.14466
作者:Haolin Li,Haipeng Zhang,Mang Li,Yaohua Wang,Lijie Wen,Yu Zhang,Biqing Huang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, remains substantially lower, Linguistic Robust Anchoring, limited training data, language models
备注:
点击查看摘要
Abstract:As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca's multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaSR. Code will be released on GitHub and the dataset on Hugging Face.
47. 【2510.14453】Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents
链接:https://arxiv.org/abs/2510.14453
作者:Reid T. Johnson,Michelle D. Pain,Jordan D. West
类目:Computation and Language (cs.CL)
关键词:present Natural Language, replaces programmatic JSON, Natural Language Tools, natural language outputs, Natural Language
备注: 31 pages, 7 figures
点击查看摘要
Abstract:We present Natural Language Tools (NLT), a framework that replaces programmatic JSON tool calling in large language models (LLMs) with natural language outputs. By decoupling tool selection from response generation, NLT eliminates task interference and format constraints that degrade tool call performance. When evaluated across 10 models and 6,400 trials spanning customer service and mental health domains, NLT improves tool calling accuracy by 18.4 percentage points while reducing output variance by 70%. Open-weight models see the largest gains, surpassing flagship closed-weight alternatives, with implications for model training in both reinforcement learning and supervised fine-tuning stages. These improvements persist under prompt perturbations and extend tool-calling capabilities to models lacking native support.
48. 【2510.14438】Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents
链接:https://arxiv.org/abs/2510.14438
作者:Rui Wang,Ce Zhang,Jun-Yu Ma,Jianshu Zhang,Hongru Wang,Yi Chen,Boyang Xue,Tianqing Fang,Zhisong Zhang,Hongming Zhang,Haitao Mi,Dong Yu,Kam-Fai Wong
类目:Computation and Language (cs.CL)
关键词:Deep research web, Deep research, deep research agents, open-source deep research, multimodal inputs
备注:
点击查看摘要
Abstract:Deep research web agents not only retrieve information from diverse sources such as web environments, files, and multimodal inputs, but more importantly, they need to rigorously analyze and aggregate knowledge for insightful research. However, existing open-source deep research agents predominantly focus on enhancing information-seeking capabilities of web agents to locate specific information, while overlooking the essential need for information aggregation, which would limit their ability to support in-depth research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Begins with proactive online exploration, an agent sources grounded information by exploring the real web. Using the collected evidence, the agent then self-evolves an aggregation program by selecting, composing, and refining operations from 12 high-level logical types to synthesize a verifiable QA pair. This evolution from high-level guidance to concrete operations allowed us to scalably produce WebAggregatorQA, a dataset of 10K samples across 50K websites and 11 domains. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, WebAggregator. WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet. Moreover, given the limited availability of benchmarks that evaluate web agents' information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet only achieves 28%, and GPT-4.1 scores 25.8%. Even when agents manage to retrieve all references, they still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundations.
49. 【2510.14420】Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following
链接:https://arxiv.org/abs/2510.14420
作者:Qingyu Ren,Qianyu He,Bowei Zhang,Jie Zeng,Jiaqing Liang,Yanghua Xiao,Weikang Zhou,Zeye Sun,Fei Yu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Language models, real-world applications, follow multi-constraint instructions, struggle to follow, crucial for real-world
备注:
点击查看摘要
Abstract:Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at this https URL
50. 【2510.14406】IMAGINE: Integrating Multi-Agent System into One Model for Complex Reasoning and Planning
链接:https://arxiv.org/abs/2510.14406
作者:Xikai Zhang,Bo Wang,Likang Xiao,Yongzhi Li,Quan Chen,Wenju Wu,Liu Liu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Final Pass Rate, made significant strides, Final Pass, Pass Rate, achieve Final Pass
备注:
点击查看摘要
Abstract:Although large language models (LLMs) have made significant strides across various tasks, they still face significant challenges in complex reasoning and planning. For example, even with carefully designed prompts and prior information explicitly provided, GPT-4o achieves only a 7% Final Pass Rate on the TravelPlanner dataset in the sole-planning mode. Similarly, even in the thinking mode, Qwen3-8B-Instruct and DeepSeek-R1-671B, only achieve Final Pass Rates of 5.9% and 40%, respectively. Although well-organized Multi-Agent Systems (MAS) can offer improved collective reasoning, they often suffer from high reasoning costs due to multi-round internal interactions, long per-response latency, and difficulties in end-to-end training. To address these challenges, we propose a general and scalable framework called IMAGINE, short for Integrating Multi-Agent System into One Model. This framework not only integrates the reasoning and planning capabilities of MAS into a single, compact model, but also significantly surpass the capabilities of the MAS through a simple end-to-end training. Through this pipeline, a single small-scale model is not only able to acquire the structured reasoning and planning capabilities of a well-organized MAS but can also significantly outperform it. Experimental results demonstrate that, when using Qwen3-8B-Instruct as the base model and training it with our method, the model achieves an 82.7% Final Pass Rate on the TravelPlanner benchmark, far exceeding the 40% of DeepSeek-R1-671B, while maintaining a much smaller model size.
51. 【2510.14400】MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering
链接:https://arxiv.org/abs/2510.14400
作者:Yingpeng Ning,Yuanyuan Sun,Ling Luo,Yanhua Wang,Yuchen Pan,Hongfei Lin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:requires accurate interpretation, Biomedical question answering, question answering, requires accurate, accurate interpretation
备注:
点击查看摘要
Abstract:Biomedical question answering (QA) requires accurate interpretation of complex medical knowledge. Large language models (LLMs) have shown promising capabilities in this domain, with retrieval-augmented generation (RAG) systems enhancing performance by incorporating external medical literature. However, RAG-based approaches in biomedical QA suffer from hallucinations due to post-retrieval noise and insufficient verification of retrieved evidence, undermining response reliability. We propose MedTrust-Guided Iterative RAG, a framework designed to enhance factual consistency and mitigate hallucinations in medical QA. Our method introduces three key innovations. First, it enforces citation-aware reasoning by requiring all generated content to be explicitly grounded in retrieved medical documents, with structured Negative Knowledge Assertions used when evidence is insufficient. Second, it employs an iterative retrieval-verification process, where a verification agent assesses evidence adequacy and refines queries through Medical Gap Analysis until reliable information is obtained. Third, it integrates the MedTrust-Align Module (MTAM) that combines verified positive examples with hallucination-aware negative samples, leveraging Direct Preference Optimization to reinforce citation-grounded reasoning while penalizing hallucination-prone response patterns. Experiments on MedMCQA, MedQA, and MMLU-Med demonstrate that our approach consistently outperforms competitive baselines across multiple model architectures, achieving the best average accuracy with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.
52. 【2510.14398】Your Next Token Prediction: A Multilingual Benchmark for Personalized Response Generation
链接:https://arxiv.org/abs/2510.14398
作者:Shiyao Ding,Takayuki Ito
类目:Computation and Language (cs.CL)
关键词:general next-token prediction, Large language models, excel at general, individuals truly communicate, Large language
备注:
点击查看摘要
Abstract:Large language models (LLMs) excel at general next-token prediction but still struggle to generate responses that reflect how individuals truly communicate, such as replying to emails or social messages in their own style. However, real SNS or email histories are difficult to collect due to privacy concerns. To address this, we propose the task of "Your Next Token Prediction (YNTP)", which models a user's precise word choices through controlled human-agent conversations. We build a multilingual benchmark of 100 dialogue sessions across English, Japanese, and Chinese, where users interact for five days with psychologically grounded NPCs based on MBTI dimensions. This setup captures natural, daily-life communication patterns and enables analysis of users' internal models. We evaluate prompt-based and fine-tuning-based personalization methods, establishing the first benchmark for YNTP and a foundation for user-aligned language modeling. The dataset is available at: this https URL
53. 【2510.14395】Suicidal Comment Tree Dataset: Enhancing Risk Assessment and Prediction Through Contextual Analysis
链接:https://arxiv.org/abs/2510.14395
作者:Jun Li,Qun Zhao
类目:Computation and Language (cs.CL)
关键词:public health issue, critical global public, global public health, health issue, remains a critical
备注:
点击查看摘要
Abstract:Suicide remains a critical global public health issue. While previous studies have provided valuable insights into detecting suicidal expressions in individual social media posts, limited attention has been paid to the analysis of longitudinal, sequential comment trees for predicting a user's evolving suicidal risk. Users, however, often reveal their intentions through historical posts and interactive comments over time. This study addresses this gap by investigating how the information in comment trees affects both the discrimination and prediction of users' suicidal risk levels. We constructed a high-quality annotated dataset, sourced from Reddit, which incorporates users' posting history and comments, using a refined four-label annotation framework based on the Columbia Suicide Severity Rating Scale (C-SSRS). Statistical analysis of the dataset, along with experimental results from Large Language Models (LLMs) experiments, demonstrates that incorporating comment trees data significantly enhances the discrimination and prediction of user suicidal risk levels. This research offers a novel insight to enhancing the detection accuracy of at-risk individuals, thereby providing a valuable foundation for early suicide intervention strategies.
54. 【2510.14381】Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers
链接:https://arxiv.org/abs/2510.14381
作者:Andrew Zhao,Reshmi Ghosh,Vitor Carvalho,Emily Lawton,Keegan Hines,Gao Huang,Jack W. Stokes
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
关键词:Large language model, Large language, carefully designed prompts, computer-use assistants, autonomous robots
备注:
点击查看摘要
Abstract:Large language model (LLM) systems now underpin everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on carefully designed prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find systems are substantially more vulnerable to manipulated feedback than to injected queries: feedback-based attacks raise attack success rate (ASR) by up to $\Delta$ASR = 0.48. We introduce a simple fake-reward attack that requires no access to the reward model and significantly increases vulnerability, and we propose a lightweight highlighting defense that reduces the fake-reward $\Delta$ASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.
55. 【2510.14377】PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora
链接:https://arxiv.org/abs/2510.14377
作者:Mykolas Sveistrys,Richard Kunert
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:large language models, Recent advances, language models, retrieval-augmented generation, advances in large
备注:
点击查看摘要
Abstract:Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data - medical records, compliance filings, maintenance logs - require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions and formalize them by three criteria: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has higher density of distractor documents, better reflecting practical challenges of recurring report corpora. We test a traditional RAG pipeline as well as graph-based and multimodal variants, and find that none of the tested approaches exceed 40% in statement-wise F1 score. Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a "check all documents individually, filter cheaply" approach: it (i) decomposes queries into document-level subquestions and (ii) uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning. We find that PluriHopRAG achieves relative F1 score improvements of 18-52% depending on base LLM. Despite its modest size, PluriHopWIND exposes the limitations of current QA systems on repetitive, distractor-rich corpora. PluriHopRAG's performance highlights the value of exhaustive retrieval and early filtering as a powerful alternative to top-k methods.
56. 【2510.14369】From Binary to Bilingual: How the National Weather Service is Using Artificial Intelligence to Develop a Comprehensive Translation Program
链接:https://arxiv.org/abs/2510.14369
作者:Joseph E. Trujillo-Falcon,Monica L. Bozeman,Liam E. Llewellyn,Samuel T. Halvorson,Meryl Mizell,Stuti Deshpande,Bob Manning,Todd Fagin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
关键词:National Weather Service, English at home, Weather-Ready Nation, speak English, Weather Service
备注:
点击查看摘要
Abstract:To advance a Weather-Ready Nation, the National Weather Service (NWS) is developing a systematic translation program to better serve the 68.8 million people in the U.S. who do not speak English at home. This article outlines the foundation of an automated translation tool for NWS products, powered by artificial intelligence. The NWS has partnered with LILT, whose patented training process enables large language models (LLMs) to adapt neural machine translation (NMT) tools for weather terminology and messaging. Designed for scalability across Weather Forecast Offices (WFOs) and National Centers, the system is currently being developed in Spanish, Simplified Chinese, Vietnamese, and other widely spoken non-English languages. Rooted in best practices for multilingual risk communication, the system provides accurate, timely, and culturally relevant translations, significantly reducing manual translation time and easing operational workloads across the NWS. To guide the distribution of these products, GIS mapping was used to identify language needs across different NWS regions, helping prioritize resources for the communities that need them most. We also integrated ethical AI practices throughout the program's design, ensuring that transparency, fairness, and human oversight guide how automated translations are created, evaluated, and shared with the public. This work has culminated into a website featuring experimental multilingual NWS products, including translated warnings, 7-day forecasts, and educational campaigns, bringing the country one step closer to a national warning system that reaches all Americans.
57. 【2510.14365】On the Ability of LLMs to Handle Character-Level Perturbations: How Well and How?
链接:https://arxiv.org/abs/2510.14365
作者:Anyun Zhuo,Xuefei Ning,Ningyuan Li,Yu Wang,Pinyan Lu
类目:Computation and Language (cs.CL)
关键词:structured character-level perturbations, Unicode control characters, work investigates, investigates the resilience, resilience of contemporary
备注:
点击查看摘要
Abstract:This work investigates the resilience of contemporary LLMs against frequent and structured character-level perturbations, specifically through the insertion of noisy characters after each input character. We introduce \nameshort{}, a practical method that inserts invisible Unicode control characters into text to discourage LLM misuse in scenarios such as online exam systems. Surprisingly, despite strong obfuscation that fragments tokenization and reduces the signal-to-noise ratio significantly, many LLMs still maintain notable performance. Through comprehensive evaluation across model-, problem-, and noise-related configurations, we examine the extent and mechanisms of this robustness, exploring both the handling of character-level tokenization and \textit{implicit} versus \textit{explicit} denoising mechanism hypotheses of character-level noises. We hope our findings on the low-level robustness of LLMs will shed light on the risks of their misuse and on the reliability of deploying LLMs across diverse applications.
58. 【2510.14359】AI for Service: Proactive Assistance with AI Glasses
链接:https://arxiv.org/abs/2510.14359
作者:Zichen Wen,Yiyu Wang,Chenfei Liao,Boxue Yang,Junxian Li,Weifeng Liu,Haocong He,Bolong Feng,Xuyang Liu,Yuanhuiyi Lyu,Xu Zheng,Xuming Hu,Linfeng Zhang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:adaptive companion, daily life, active and adaptive, paradigm that enables, enables proactive
备注: 24 pages, 5 figures, work in progress
点击查看摘要
Abstract:In an era where AI is evolving from a passive tool into an active and adaptive companion, we introduce AI for Service (AI4Service), a new paradigm that enables proactive and real-time assistance in daily life. Existing AI services remain largely reactive, responding only to explicit user commands. We argue that a truly intelligent and helpful assistant should be capable of anticipating user needs and taking actions proactively when appropriate. To realize this vision, we propose Alpha-Service, a unified framework that addresses two fundamental challenges: Know When to intervene by detecting service opportunities from egocentric video streams, and Know How to provide both generalized and personalized services. Inspired by the von Neumann computer architecture and based on AI glasses, Alpha-Service consists of five key components: an Input Unit for perception, a Central Processing Unit for task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit for long-term personalization, and an Output Unit for natural human interaction. As an initial exploration, we implement Alpha-Service through a multi-agent system deployed on AI glasses. Case studies, including a real-time Blackjack advisor, a museum tour guide, and a shopping fit assistant, demonstrate its ability to seamlessly perceive the environment, infer user intent, and provide timely and useful assistance without explicit prompts.
59. 【2510.14353】CURE: Confidence-driven Unified Reasoning Ensemble Framework for Medical Question Answering
链接:https://arxiv.org/abs/2510.14353
作者:Ziad Elshaer,Essam A. Rashed
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
关键词:High-performing medical Large, Large Language Models, medical Large Language, Large Language, substantial computational resources
备注:
点击查看摘要
Abstract:High-performing medical Large Language Models (LLMs) typically require extensive fine-tuning with substantial computational resources, limiting accessibility for resource-constrained healthcare institutions. This study introduces a confidence-driven multi-model framework that leverages model diversity to enhance medical question answering without fine-tuning. Our framework employs a two-stage architecture: a confidence detection module assesses the primary model's certainty, and an adaptive routing mechanism directs low-confidence queries to Helper models with complementary knowledge for collaborative reasoning. We evaluate our approach using Qwen3-30B-A3B-Instruct, Phi-4 14B, and Gemma 2 12B across three medical benchmarks; MedQA, MedMCQA, and PubMedQA. Result demonstrate that our framework achieves competitive performance, with particularly strong results in PubMedQA (95.0\%) and MedMCQA (78.0\%). Ablation studies confirm that confidence-aware routing combined with multi-model collaboration substantially outperforms single-model approaches and uniform reasoning strategies. This work establishes that strategic model collaboration offers a practical, computationally efficient pathway to improve medical AI systems, with significant implications for democratizing access to advanced medical AI in resource-limited settings.
60. 【2510.14351】Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts
链接:https://arxiv.org/abs/2510.14351
作者:Perapard Ngokpol,Kun Kerdthaisong,Pasin Buakhaw,Pitikorn Khlaisamniang,Supasate Vorathammathorn,Piyalitt Ittichaiwong,Nutchanon Yongsatianchot
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, consistently portray version-specific, portray version-specific characters, Large language, superheroes across comic
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly used as role-playing agents, yet their capacity to faithfully and consistently portray version-specific characters -- for example, superheroes across comic and cinematic universes -- remains underexplored. Superhero canons such as Marvel and DC provide a rich testbed: decades of storytelling yield multiple incarnations of the same character with distinct histories, values, and moral codes. To study this problem, we introduce Beyond One World, a benchmark for character-grounded roleplay spanning 30 iconic heroes and 90 canon-specific versions. The benchmark comprises two tasks: (i) Canon Events, which probes factual recall of pivotal life stages, and (ii) Moral Dilemmas, which confronts models with ethically charged scenarios. We score responses for canonical accuracy and reasoning fidelity under a framework that separates internal deliberation ("thinking") from outward decisions ("acting"). We further propose Think-Act Matching, a metric that quantifies alignment between reasons and actions and serves as a proxy for model trustworthiness. Experiments across reasoning- and non-reasoning-oriented models yield three findings: (1) chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) cross-version generalization within a character remains a major obstacle; and (3) models often excel at either thinking or acting, but rarely both. Beyond One World exposes critical gaps in multiversal consistency and reasoning alignment, offering a challenging evaluation for role-playing LLMs.
61. 【2510.14332】A Robust Classification Method using Hybrid Word Embedding for Early Diagnosis of Alzheimer's Disease
链接:https://arxiv.org/abs/2510.14332
作者:Yangyang Li
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
关键词:Alzheimer Disease, alleviating financial burden, detection of Alzheimer, health care, greatly beneficial
备注: Peer-reviewed and published in Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2020). 7 pages, 5 figures
点击查看摘要
Abstract:Early detection of Alzheimer's Disease (AD) is greatly beneficial to AD patients, leading to early treatments that lessen symptoms and alleviating financial burden of health care. As one of the leading signs of AD, language capability changes can be used for early diagnosis of AD. In this paper, I develop a robust classification method using hybrid word embedding and fine-tuned hyperparameters to achieve state-of-the-art accuracy in the early detection of AD. Specifically, we create a hybrid word embedding based on word vectors from Doc2Vec and ELMo to obtain perplexity scores of the sentences. The scores identify whether a sentence is fluent or not and capture semantic context of the sentences. I enrich the word embedding by adding linguistic features to analyze syntax and semantics. Further, we input an embedded feature vector into logistic regression and fine tune hyperparameters throughout the pipeline. By tuning hyperparameters of the machine learning pipeline (e.g., model regularization parameter, learning rate and vector size of Doc2Vec, and vector size of ELMo), I achieve 91% classification accuracy and an Area Under the Curve (AUC) of 97% in distinguishing early AD from healthy subjects. Based on my knowledge, my model with 91% accuracy and 97% AUC outperforms the best existing NLP model for AD diagnosis with an accuracy of 88% [32]. I study the model stability through repeated experiments and find that the model is stable even though the training data is split randomly (standard deviation of accuracy = 0.0403; standard deviation of AUC = 0.0174). This affirms our proposed method is accurate and stable. This model can be used as a large-scale screening method for AD, as well as a complementary examination for doctors to detect AD.
62. 【2510.14318】Evaluating Reducing Deceptive Dialogue From Language Models with Multi-turn RL
链接:https://arxiv.org/abs/2510.14318
作者:Marwa Abdulhai,Ryan Cheng,Aryansh Shrivastava,Natasha Jaques,Yarin Gal,Sergey Levine
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, education and healthcare, interact with millions, customer support
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception. We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric. Our findings reveal this novel deception measure correlates more closely with human judgments than any existing metrics we test. Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average. Given that deception in dialogue is a behavior that develops over an interaction history, its effective evaluation and mitigation necessitates moving beyond single-utterance analyses. We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.
63. 【2510.14312】rrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies
链接:https://arxiv.org/abs/2510.14312
作者:Mason Nakamura,Abhinav Kumar,Saaduddin Mahmud,Sahar Abdelnabi,Shlomo Zilberstein,Eugene Bagdasarian
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
关键词:large language models, requires inter-agent collaboration, tedious user tasks, automate tedious user, powered by large
备注:
点击查看摘要
Abstract:A multi-agent system (MAS) powered by large language models (LLMs) can automate tedious user tasks such as meeting scheduling that requires inter-agent collaboration. LLMs enable nuanced protocols that account for unstructured private data, user constraints, and preferences. However, this design introduces new risks, including misalignment and attacks by malicious parties that compromise agents or steal user data. In this paper, we propose the Terrarium framework for fine-grained study on safety, privacy, and security in LLM-based MAS. We repurpose the blackboard design, an early approach in multi-agent systems, to create a modular, configurable testbed for multi-agent collaboration. We identify key attack vectors such as misalignment, malicious agents, compromised communication, and data poisoning. We implement three collaborative MAS scenarios with four representative attacks to demonstrate the framework's flexibility. By providing tools to rapidly prototype, evaluate, and iterate on defenses and designs, Terrarium aims to accelerate progress toward trustworthy multi-agent systems.
64. 【2510.14307】MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking
链接:https://arxiv.org/abs/2510.14307
作者:Sathyanarayanan Ramamoorthy,Vishwa Shah,Simran Khanuja,Zaid Sheikh,Shan Jie,Ann Chia,Shearman Chua,Graham Neubig
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:paper introduces MERLIN, introduces MERLIN, Multimodal Entity Linking, Entity Linking, Multilingual Multimodal Entity
备注:
点击查看摘要
Abstract:This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The created dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks using multilingual and multimodal entity linking methods exploring different language models like LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. For the work, the dataset, methods are available here at this https URL
65. 【2510.14305】MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning
链接:https://arxiv.org/abs/2510.14305
作者:Mahbub E Sobhani,Md. Faiyaz Abdullah Sayeedi,Tasnim Mohiuddin,Md Mofijul Islam,Swakkhar Shatabda
类目:Computation and Language (cs.CL)
关键词:structured logical deduction, numerical precision, challenging domains, domains for large, structured logical
备注:
点击查看摘要
Abstract:Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses over 21K aligned question-answer pairs across seven languages, representing a balanced coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs' ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All the codes and data are available at GitHub: this https URL
66. 【2510.14303】Constraint-Driven Small Language Models Based on Agent and OpenAlex Knowledge Graph: Mining Conceptual Pathways and Discovering Innovation Points in Academic Papers
链接:https://arxiv.org/abs/2510.14303
作者:Ziye Xia,Sergei S. Ospichev
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:posed severe challenges, latest research findings, academic publications, recent years, scientists struggle
备注: 9 pages, 10 figures
点击查看摘要
Abstract:In recent years, the rapid increase in academic publications across various fields has posed severe challenges for academic paper analysis: scientists struggle to timely and comprehensively track the latest research findings and methodologies. Key concept extraction has proven to be an effective analytical paradigm, and its automation has been achieved with the widespread application of language models in industrial and scientific domains. However, existing paper databases are mostly limited to similarity matching and basic classification of key concepts, failing to deeply explore the relational networks between concepts. This paper is based on the OpenAlex opensource knowledge graph. By analyzing nearly 8,000 open-source paper data from Novosibirsk State University, we discovered a strong correlation between the distribution patterns of paper key concept paths and both innovation points and rare paths. We propose a prompt engineering-based key concept path analysis method. This method leverages small language models to achieve precise key concept extraction and innovation point identification, and constructs an agent based on a knowledge graph constraint mechanism to enhance analysis accuracy. Through fine-tuning of the Qwen and DeepSeek models, we achieved significant improvements in accuracy, with the models publicly available on the Hugging Face platform.
67. 【2510.14296】Rethinking Schema Linking: A Context-Aware Bidirectional Retrieval Approach for Text-to-SQL
链接:https://arxiv.org/abs/2510.14296
作者:Md Mahadi Hasan Nahid,Davood Rafiei,Weiwei Zhang,Yong Zhang
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:aligning natural language, natural language questions, database schema elements, process of aligning, aligning natural
备注: 30 Pages
点击查看摘要
Abstract:Schema linking -- the process of aligning natural language questions with database schema elements -- is a critical yet underexplored component of Text-to-SQL systems. While recent methods have focused primarily on improving SQL generation, they often neglect the retrieval of relevant schema elements, which can lead to hallucinations and execution failures. In this work, we propose a context-aware bidirectional schema retrieval framework that treats schema linking as a standalone problem. Our approach combines two complementary strategies: table-first retrieval followed by column selection, and column-first retrieval followed by table selection. It is further augmented with techniques such as question decomposition, keyword extraction, and keyphrase extraction. Through comprehensive evaluations on challenging benchmarks such as BIRD and Spider, we demonstrate that our method significantly improves schema recall while reducing false positives. Moreover, SQL generation using our retrieved schema consistently outperforms full-schema baselines and closely approaches oracle performance, all without requiring query refinement. Notably, our method narrows the performance gap between full and perfect schema settings by 50\%. Our findings highlight schema linking as a powerful lever for enhancing Text-to-SQL accuracy and efficiency.
68. 【2510.14278】PRISM: Agentic Retrieval with LLMs for Multi-Hop Question Answering
链接:https://arxiv.org/abs/2510.14278
作者:Md Mahadi Hasan Nahid,Davood Rafiei
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:requires gathering multiple, gathering multiple pieces, answering complex questions, complex questions requires, questions requires gathering
备注: 18 pages
点击查看摘要
Abstract:Retrieval plays a central role in multi-hop question answering (QA), where answering complex questions requires gathering multiple pieces of evidence. We introduce an Agentic Retrieval System that leverages large language models (LLMs) in a structured loop to retrieve relevant evidence with high precision and recall. Our framework consists of three specialized agents: a Question Analyzer that decomposes a multi-hop question into sub-questions, a Selector that identifies the most relevant context for each sub-question (focusing on precision), and an Adder that brings in any missing evidence (focusing on recall). The iterative interaction between Selector and Adder yields a compact yet comprehensive set of supporting passages. In particular, it achieves higher retrieval accuracy while filtering out distracting content, enabling downstream QA models to surpass full-context answer accuracy while relying on significantly less irrelevant information. Experiments on four multi-hop QA benchmarks -- HotpotQA, 2WikiMultiHopQA, MuSiQue, and MultiHopRAG -- demonstrates that our approach consistently outperforms strong baselines.
69. 【2510.14276】Qwen3Guard Technical Report
链接:https://arxiv.org/abs/2510.14276
作者:Haiquan Zhao,Chenhan Yuan,Fei Huang,Xiaomeng Hu,Yichang Zhang,An Yang,Bowen Yu,Dayiheng Liu,Jingren Zhou,Junyang Lin,Baosong Yang,Chen Cheng,Jialong Tang,Jiandong Jiang,Jianwei Zhang,Jijie Xu,Ming Yan,Minmin Sun,Pei Zhang,Pengjun Xie,Qiaoyu Tang,Qin Zhu,Rong Zhang,Shibin Wu,Shuo Zhang,Tao He,Tianyi Tang,Tingyu Xia,Wei Liao,Weizhou Shen,Wenbiao Yin,Wenmeng Zhou,Wenyuan Yu,Xiaobin Wang,Xiaodong Deng,Xiaodong Xu,Xinyu Zhang,Yang Liu,Yeqiu Li,Yi Zhang,Yong Jiang,Yu Wan,Yuxin Zhou
类目:Computation and Language (cs.CL)
关键词:increasingly critical, large language models, capable and widely, safety, streaming LLM inference
备注:
点击查看摘要
Abstract:As large language models (LLMs) become more capable and widely used, ensuring the safety of their outputs is increasingly critical. Existing guardrail models, though useful in static evaluation settings, face two major limitations in real-world applications: (1) they typically output only binary "safe/unsafe" labels, which can be interpreted inconsistently across diverse safety policies, rendering them incapable of accommodating varying safety tolerances across domains; and (2) they require complete model outputs before performing safety checks, making them fundamentally incompatible with streaming LLM inference, thereby preventing timely intervention during generation and increasing exposure to harmful partial outputs. To address these challenges, we present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants: Generative Qwen3Guard, which casts safety classification as an instruction-following task to enable fine-grained tri-class judgments (safe, controversial, unsafe); and Stream Qwen3Guard, which introduces a token-level classification head for real-time safety monitoring during incremental text generation. Both variants are available in three sizes (0.6B, 4B, and 8B parameters) and support up to 119 languages and dialects, providing comprehensive, scalable, and low-latency safety moderation for global LLM deployments. Evaluated across English, Chinese, and multilingual benchmarks, Qwen3Guard achieves state-of-the-art performance in both prompt and response safety classification. All models are released under the Apache 2.0 license for public use.
70. 【2510.14274】Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters
链接:https://arxiv.org/abs/2510.14274
作者:Lifu Tu,Yingbo Zhou,Semih Yavuz
类目:Computation and Language (cs.CL)
关键词:presents unique challenges, unique challenges due, models presents unique, presents unique, unique challenges
备注:
点击查看摘要
Abstract:Training effective multilingual embedding models presents unique challenges due to the diversity of languages and task objectives. Although small multilingual models (1 B parameters) perform well on multilingual tasks generally, they consistently lag behind larger models (1 B) in the most prevalent use case: retrieval. This raises a critical question: Can smaller models be retrofitted specifically for retrieval tasks to enhance their performance? In this work, we investigate key factors that influence the effectiveness of multilingual embeddings, focusing on training data scale, negative sampling strategies, and data diversity. We find that while increasing the scale of training data yields initial performance gains, these improvements quickly plateau - indicating diminishing returns. Incorporating hard negatives proves essential for consistently improving retrieval accuracy. Furthermore, our analysis reveals that task diversity in the training data contributes more significantly to performance than language diversity alone. As a result, we develop a compact (approximately 300M) multilingual model that achieves retrieval performance comparable to or even surpassing current strong 7B models.
71. 【2510.14271】Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation
链接:https://arxiv.org/abs/2510.14271
作者:Yilun Zheng,Dan Yang,Jie Li,Lin Shang,Lihui Chen,Jiahao Xu,Sitao Luan
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:large language models, enable large language, addressing common LLM, Graph-based RAG, systems enable large
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) systems enable large language models (LLMs) instant access to relevant information for the generative process, demonstrating their superior performance in addressing common LLM challenges such as hallucination, factual inaccuracy, and the knowledge cutoff. Graph-based RAG further extends this paradigm by incorporating knowledge graphs (KGs) to leverage rich, structured connections for more precise and inferential responses. A critical challenge, however, is that most Graph-based RAG systems rely on LLMs for automated KG construction, often yielding noisy KGs with redundant entities and unreliable relationships. This noise degrades retrieval and generation performance while also increasing computational cost. Crucially, current research does not comprehensively address the denoising problem for LLM-generated KGs. In this paper, we introduce DEnoised knowledge Graphs for Retrieval Augmented Generation (DEG-RAG), a framework that addresses these challenges through: (1) entity resolution, which eliminates redundant entities, and (2) triple reflection, which removes erroneous relations. Together, these techniques yield more compact, higher-quality KGs that significantly outperform their unprocessed counterparts. Beyond the methods, we conduct a systematic evaluation of entity resolution for LLM-generated KGs, examining different blocking strategies, embedding choices, similarity metrics, and entity merging techniques. To the best of our knowledge, this is the first comprehensive exploration of entity resolution in LLM-generated KGs. Our experiments demonstrate that this straightforward approach not only drastically reduces graph size but also consistently improves question answering performance across diverse popular Graph-based RAG variants.
72. 【2510.14262】CAST: Compositional Analysis via Spectral Tracking for Understanding Transformer Layer Functions
链接:https://arxiv.org/abs/2510.14262
作者:Zihao Fu,Ming Liao,Chris Russell,Zhenguang G. Cai
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:understood internal mechanisms, Large language models, achieved remarkable success, remain largely black, largely black boxes
备注:
点击查看摘要
Abstract:Large language models have achieved remarkable success but remain largely black boxes with poorly understood internal mechanisms. To address this limitation, many researchers have proposed various interpretability methods including mechanistic analysis, probing classifiers, and activation visualization, each providing valuable insights from different perspectives. Building upon this rich landscape of complementary approaches, we introduce CAST (Compositional Analysis via Spectral Tracking), a probe-free framework that contributes a novel perspective by analyzing transformer layer functions through direct transformation matrix estimation and comprehensive spectral analysis. CAST offers complementary insights to existing methods by estimating the realized transformation matrices for each layer using Moore-Penrose pseudoinverse and applying spectral analysis with six interpretable metrics characterizing layer behavior. Our analysis reveals distinct behaviors between encoder-only and decoder-only models, with decoder models exhibiting compression-expansion cycles while encoder models maintain consistent high-rank processing. Kernel analysis further demonstrates functional relationship patterns between layers, with CKA similarity matrices clearly partitioning layers into three phases: feature extraction, compression, and specialization.
73. 【2510.14261】Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior
链接:https://arxiv.org/abs/2510.14261
作者:Rahul Nadkarni,Yanai Elazar,Hila Gonen,Noah A. Smith
类目:Computation and Language (cs.CL)
关键词:model behavior, present an experimental, studying the relationship, behavior, measures model behavior
备注:
点击查看摘要
Abstract:We present an experimental recipe for studying the relationship between training data and language model (LM) behavior. We outline steps for intervening on data batches -- i.e., ``rewriting history'' -- and then retraining model checkpoints over that data to test hypotheses relating data to behavior. Our recipe breaks down such an intervention into stages that include selecting evaluation items from a benchmark that measures model behavior, matching relevant documents to those items, and modifying those documents before retraining and measuring the effects. We demonstrate the utility of our recipe through case studies on factual knowledge acquisition in LMs, using both cooccurrence statistics and information retrieval methods to identify documents that might contribute to knowledge learning. Our results supplement past observational analyses that link cooccurrence to model behavior, while demonstrating that extant methods for identifying relevant training documents do not fully explain an LM's ability to correctly answer knowledge questions. Overall, we outline a recipe that researchers can follow to test further hypotheses about how training data affects model behavior. Our code is made publicly available to promote future work.
74. 【2510.14252】MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems
链接:https://arxiv.org/abs/2510.14252
作者:Jihao Zhao,Zhiyuan Ji,Simin Niu,Hanyu Wang,Feiyu Xiong,Zhiyu Li
类目:Computation and Language (cs.CL)
关键词:traditional RAG paradigm, document Memories, relevant text chunks, RAG paradigm, traditional RAG
备注:
点击查看摘要
Abstract:The traditional RAG paradigm, which typically engages in the comprehension of relevant text chunks in response to received queries, inherently restricts both the depth of knowledge internalization and reasoning capabilities. To address this limitation, our research transforms the text processing in RAG from passive chunking to proactive understanding, defining this process as document memory extraction with the objective of simulating human cognitive processes during reading. Building upon this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and train small language models (SLMs) to acquire the ability to proactively explore and construct document memories. The MoM initially instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It employs a multi-path sampling and multi-perspective evaluation mechanism, specifically designing comprehensive metrics that represent chunk clarity and extraction completeness to select the optimal document memories. Additionally, to infuse deeper human-like reading abilities during the training of SLMs, we incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, which is grounded in our theoretical proof from the perspective of probabilistic modeling. Extensive experimental results across three distinct domains demonstrate that the MoM framework not only resolves text chunking challenges in existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.
75. 【2510.14242】Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs
链接:https://arxiv.org/abs/2510.14242
作者:Parsa Hejabi,Elnaz Rahmati,Alireza S. Ziabari,Morteza Dehghani
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, Language Models, produce inconsistent answers, produce inconsistent
备注: 14 pages, 6 figures, 3 tables, and 1 algorithm
点击查看摘要
Abstract:Large Language Models (LLMs) often produce inconsistent answers when faced with different phrasings of the same prompt. In this paper, we propose Flip-Flop Consistency ($F^2C$), an unsupervised training method that improves robustness to such perturbations. $F^2C$ is composed of two key components. The first, Consensus Cross-Entropy (CCE), uses a majority vote across prompt variations to create a hard pseudo-label. The second is a representation alignment loss that pulls lower-confidence and non-majority predictors toward the consensus established by high-confidence, majority-voting variations. We evaluate our method on 11 datasets spanning four NLP tasks, with 4-15 prompt variations per dataset. On average, $F^2C$ raises observed agreement by 11.62%, improves mean $F_1$ by 8.94%, and reduces performance variance across formats by 3.29%. In out-of-domain evaluations, $F^2C$ generalizes effectively, increasing $\overline{F_1}$ and agreement while decreasing variance across most source-target pairs. Finally, when trained on only a subset of prompt perturbations and evaluated on held-out formats, $F^2C$ consistently improves both performance and agreement while reducing variance. These findings highlight $F^2C$ as an effective unsupervised method for enhancing LLM consistency, performance, and generalization under prompt perturbations. Code is available at this https URL.
76. 【2510.14232】Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models
链接:https://arxiv.org/abs/2510.14232
作者:Mehrzad Samadi,Aleksander Ficek,Sean Narenthiran,Siddhartha Jain,Wasi Uddin Ahmad,Somshubra Majumdar,Vahid Noroozi,Boris Ginsburg
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:large language models, Competitive programming, problem-solving capabilities, capabilities of large, large language
备注: 14 pages, 11 figures
点击查看摘要
Abstract:Competitive programming has become a rigorous benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). The International Olympiad in Informatics (IOI) stands out as one of the most prestigious annual competitions in competitive programming and has become a key benchmark for comparing human and AI-level programming ability. While several proprietary models have been claimed to achieve gold medal-level performance at the IOI, often with undisclosed methods, achieving comparable results with open-weight models remains a significant challenge. In this paper, we present \gencluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets. Our experiments show that the performance of our proposed approach scales consistently with available compute, narrowing the gap between open and closed systems. Notably, we will show that GenCluster can achieve a gold medal at IOI 2025 for the first time with an open-weight model gpt-oss-120b, setting a new benchmark for transparent and reproducible evaluation of reasoning in LLMs.
77. 【2510.14211】LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning
链接:https://arxiv.org/abs/2510.14211
作者:Beomseok Kang,Jiwon Song,Jae-Joon Kim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:small language models, decomposing complex problems, sequential sub-stages, effective strategy, strategy for enhancing
备注:
点击查看摘要
Abstract:Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks, e.g., OBQA, CSQA, and StrategyQA, show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.
78. 【2510.14205】DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans
链接:https://arxiv.org/abs/2510.14205
作者:Bingsheng Yao,Bo Sun,Yuanzhe Dong,Yuxuan Lu,Dakuo Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:emerging large language, model role-playing agents, large language model, language model role-playing, simulate individual human
备注: In Submission
点击查看摘要
Abstract:The emerging large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors, but the persona fidelity is often undermined by manually-created profiles (e.g., cherry-picked information and personality characteristics) without validating the alignment with the target individuals. To address this limitation, our work introduces the Dynamic Persona Refinement Framework (DPRF).DPRF aims to optimize the alignment of LLM RPAs' behaviors with those of target individuals by iteratively identifying the cognitive divergence, either through free-form or theory-grounded, structured analysis, between generated behaviors and human ground truth, and refining the persona profile to mitigate these this http URL evaluate DPRF with five LLMs on four diverse behavior-prediction scenarios: formal debates, social media posts with mental health issues, public interviews, and movie this http URL can consistently improve behavioral alignment considerably over baseline personas and generalizes across models and this http URL work provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications, such as user simulation, social studies, and personalized AI.
79. 【2510.14203】Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition
链接:https://arxiv.org/abs/2510.14203
作者:Ryo Masumura,Shota Orihashi,Mana Ihori,Tomohiro Tanaka,Naoki Makishima,Taiga Yamane,Naotaka Kawata,Satoshi Suzuki,Taichi Katayama
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:recently attracted attention, joint modeling method, Big, long been studied, attention in psychology
备注: Accepted at APSIPA ASC 2025
点击查看摘要
Abstract:This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dominance orientation, etc. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified. We expect awareness of multimodal human behavior to improve by considering these relationships. The key advance of our proposed method is to optimize jointly recognizing the Big Five and HEXACO. Experiments using a self-introduction video dataset demonstrate that the proposed method can effectively recognize the Big Five and HEXACO.
80. 【2510.14200】RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following
链接:https://arxiv.org/abs/2510.14200
作者:Zhichao Wang,Andy Wong,Ruslan Belkin
类目:Computation and Language (cs.CL)
关键词:mitigate undesired responses, improve reasoning capability, enable efficient domain, efficient domain adaptation, SFT
备注:
点击查看摘要
Abstract:After the pretraining stage of LLMs, techniques such as SFT, RLHF, RLVR, and RFT are applied to enhance instruction-following ability, mitigate undesired responses, improve reasoning capability and enable efficient domain adaptation with minimal data. SFT relies on the next-token prediction objective to strengthen instruction following in a base model using a large corpus of human-labeled responses. In contrast, RFT employs a RL-based approach to adapt fine-tuned reasoning models to specific domains with limited supervision. Inspired by RFT, we propose replacing SFT with RLSR to leverage the extensive SFT dataset in an RL framework, thereby improving the base model's instruction-following ability. In RLSR, the base model generates multiple responses for each prompt, and reward scores are computed as the cosine similarity in the semantic embedding space between the generated and human-labeled responses. RLSR can be utilized in multiple ways. It can directly replace SFT, achieving superior performance on instruction-following benchmarks-for example, RLSR (SB) on Qwen-7B (INFINITY) achieved an AlpacaEval win rate of 26.34%, surpassing SFT's 21.01%. Furthermore, combining SFT and RLSR further enhances downstream task performance; Qwen-7B (INFINITY) achieved a win rate of 30.73% when trained with SFT + RLSR.
81. 【2510.14184】MAFA: A Multi-Agent Framework for Enterprise-Scale Annotation with Configurable Task Adaptation
链接:https://arxiv.org/abs/2510.14184
作者:Mahmood Hegazy,Aaron Rodrigues,Azzam Naeem
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:configurable multi-agent collaboration, enterprise-scale annotation workflows, transforms enterprise-scale annotation, transforms enterprise-scale, workflows through configurable
备注:
点击查看摘要
Abstract:We present MAFA (Multi-Agent Framework for Annotation), a production-deployed system that transforms enterprise-scale annotation workflows through configurable multi-agent collaboration. Addressing the critical challenge of annotation backlogs in financial services, where millions of customer utterances require accurate categorization, MAFA combines specialized agents with structured reasoning and a judge-based consensus mechanism. Our framework uniquely supports dynamic task adaptation, allowing organizations to define custom annotation types (FAQs, intents, entities, or domain-specific categories) through configuration rather than code changes. Deployed at JP Morgan Chase, MAFA has eliminated a 1 million utterance backlog while achieving, on average, 86% agreement with human annotators, annually saving over 5,000 hours of manual annotation work. The system processes utterances with annotation confidence classifications, which are typically 85% high, 10% medium, and 5% low across all datasets we tested. This enables human annotators to focus exclusively on ambiguous and low-coverage cases. We demonstrate MAFA's effectiveness across multiple datasets and languages, showing consistent improvements over traditional and single-agent annotation baselines: 13.8% higher Top-1 accuracy, 15.1% improvement in Top-5 accuracy, and 16.9% better F1 in our internal intent classification dataset and similar gains on public benchmarks. This work bridges the gap between theoretical multi-agent systems and practical enterprise deployment, providing a blueprint for organizations facing similar annotation challenges.
82. 【2510.14163】owards Reversible Model Merging For Low-rank Weights
链接:https://arxiv.org/abs/2510.14163
作者:Mohammadsajad Alipour,Mohammad Mohammadi Amiri
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:combine multiple fine-tuned, multiple fine-tuned models, Model merging aims, Model, aims to combine
备注:
点击查看摘要
Abstract:Model merging aims to combine multiple fine-tuned models into a single set of weights that performs well across all source tasks. While prior work has shown that merging can approximate the performance of individual fine-tuned models for each task, it largely overlooks scenarios where models are compressed into low-rank representations, either through low-rank adaptation (LoRA) or post-training singular value decomposition (SVD). We first demonstrate that applying conventional merging methods to low-rank weights leads to severe performance degradation in the merged model. Motivated by this phenomenon, we propose a fundamentally different approach: instead of collapsing all adapters into one set of weights, we construct a compact basis (e.g., an equivalent of holding two or more models) from which original task-specific models can be recovered via linear combination. This reframes merging as generating a reconstruction-capable model space rather than producing a single merged model. Crucially, this allows us to ``revert'' to each individual model when needed, recognizing that no merged model can consistently outperform one specialized for its task. Building on this insight, we introduce our method, Reversible Model Merging (RMM), an efficient, data-free, and flexible method that provides a closed-form solution for selecting the optimal basis of model weights and task-specific coefficients for linear combination. Extensive experiments across diverse datasets and model scales demonstrate that RMM consistently outperforms existing merging approaches, preserving the performance of low-rank compressed models by a significant margin.
83. 【2510.14128】Building a Macedonian Recipe Dataset: Collection, Parsing, and Comparative Analysis
链接:https://arxiv.org/abs/2510.14128
作者:Darko Sasanski,Dimitar Peshevski,Riste Stojanov,Dimitar Trajanov
类目:Computation and Language (cs.CL)
关键词:Computational gastronomy increasingly, gastronomy increasingly relies, Computational gastronomy, capture regional culinary, high-quality recipe datasets
备注:
点击查看摘要
Abstract:Computational gastronomy increasingly relies on diverse, high-quality recipe datasets to capture regional culinary traditions. Although there are large-scale collections for major languages, Macedonian recipes remain under-represented in digital research. In this work, we present the first systematic effort to construct a Macedonian recipe dataset through web scraping and structured parsing. We address challenges in processing heterogeneous ingredient descriptions, including unit, quantity, and descriptor normalization. An exploratory analysis of ingredient frequency and co-occurrence patterns, using measures such as Pointwise Mutual Information and Lift score, highlights distinctive ingredient combinations that characterize Macedonian cuisine. The resulting dataset contributes a new resource for studying food culture in underrepresented languages and offers insights into the unique patterns of Macedonian culinary tradition.
84. 【2510.14113】oward Cybersecurity-Expert Small Language Models
链接:https://arxiv.org/abs/2510.14113
作者:Matan Levi,Daniel Ohayon,Ariel Blobstein,Ravid Sagi,Ian Molloy,Yair Allouche
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
关键词:transforming everyday applications, Large language models, Large language, cybersecurity lags due, everyday applications
备注:
点击查看摘要
Abstract:Large language models (LLMs) are transforming everyday applications, yet deployment in cybersecurity lags due to a lack of high-quality, domain-specific models and training datasets. To address this gap, we present CyberPal 2.0, a family of cybersecurity-expert small language models (SLMs) ranging from 4B-20B parameters. To train CyberPal 2.0, we generate an enriched chain-of-thought cybersecurity instruction dataset built with our data enrichment and formatting pipeline, SecKnowledge 2.0, which integrates expert-in-the-loop steering of reasoning formats alongside LLM-driven multi-step grounding, yielding higher-fidelity, task-grounded reasoning traces for security tasks. Across diverse cybersecurity benchmarks, CyberPal 2.0 consistently outperforms its baselines and matches or surpasses various open and closed-source frontier models, while remaining a fraction of their size. On core cyber threat intelligence knowledge tasks, our models outperform almost all tested frontier models, ranking second only to Sec-Gemini v1. On core threat-investigation tasks, such as correlating vulnerabilities and bug tickets with weaknesses, our best 20B-parameter model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1, ranking first, while our smallest 4B-parameter model ranks second.
85. 【2510.14110】DROID: Dual Representation for Out-of-Scope Intent Detection
链接:https://arxiv.org/abs/2510.14110
作者:Wael Rashwan,Hossam M. Zawbaa,Sourav Dutta,Haytham Assem
类目:Computation and Language (cs.CL)
关键词:user utterances remains, open-set intent recognition, Transformer-based Denoising Autoencoder, user utterances, Universal Sentence Encoder
备注: 14 pages, 6 figures, 4 Tables. Preprint submitted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
点击查看摘要
Abstract:Detecting out-of-scope (OOS) user utterances remains a key challenge in task-oriented dialogue systems and, more broadly, in open-set intent recognition. Existing approaches often depend on strong distributional assumptions or auxiliary calibration modules. We present DROID (Dual Representation for Out-of-Scope Intent Detection), a compact end-to-end framework that combines two complementary encoders -- the Universal Sentence Encoder (USE) for broad semantic generalization and a domain-adapted Transformer-based Denoising Autoencoder (TSDAE) for domain-specific contextual distinctions. Their fused representations are processed by a lightweight branched classifier with a single calibrated threshold that separates in-domain and OOS intents without post-hoc scoring. To enhance boundary learning under limited supervision, DROID incorporates both synthetic and open-domain outlier augmentation. Despite using only 1.5M trainable parameters, DROID consistently outperforms recent state-of-the-art baselines across multiple intent benchmarks, achieving macro-F1 improvements of 6--15% for known and 8--20% for OOS intents, with the most significant gains in low-resource settings. These results demonstrate that dual-encoder representations with simple calibration can yield robust, scalable, and reliable OOS detection for neural dialogue systems.
86. 【2510.14106】Generating Fair Consensus Statements with Social Choice on Token-Level MDPs
链接:https://arxiv.org/abs/2510.14106
作者:Carter Blair,Kate Larson
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
关键词:diverse free-form opinions, aggregating diverse free-form, Markov Decision Process, token-level Markov Decision, inherent structure needed
备注:
点击查看摘要
Abstract:Current frameworks for consensus statement generation with large language models lack the inherent structure needed to provide provable fairness guarantees when aggregating diverse free-form opinions. We model the task as a multi-objective, token-level Markov Decision Process (MDP), where each objective corresponds to an agent's preference. Token-level rewards for each agent are derived from their policy (e.g., a personalized language model). This approach utilizes the finding that such policies implicitly define optimal Q-functions, providing a principled way to quantify rewards at each generation step without a value function (Rafailov et al., 2024). This MDP formulation creates a formal structure amenable to analysis using principles from social choice theory. We propose two approaches grounded in social choice theory. First, we propose a stochastic generation policy guaranteed to be in the ex-ante core, extending core stability concepts from voting theory to text generation. This policy is derived from an underlying distribution over complete statements that maximizes proportional fairness (Nash Welfare). Second, for generating a single statement, we target the maximization of egalitarian welfare using search algorithms within the MDP framework. Empirically, experiments using language models to instantiate agent policies show that search guided by the egalitarian objective generates consensus statements with improved worst-case agent alignment compared to baseline methods, including the Habermas Machine (Tessler et al., 2024).
87. 【2510.14077】ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models
链接:https://arxiv.org/abs/2510.14077
作者:Haziq Mohammad Khalid,Athikash Jeyaganthan,Timothy Do,Yicheng Fu,Sean O'Brien,Vasu Sharma,Kevin Zhu
类目:Computation and Language (cs.CL)
关键词:Large Language Models, suffer significant performance, Large Language, suffer significant, significant performance degradation
备注: 14 pages, 5 figures
点击查看摘要
Abstract:Large Language Models (LLMs) suffer significant performance degradation in multi-turn conversations when information is presented incrementally. Given that multi-turn conversations characterize everyday interactions with LLMs, this degradation poses a severe challenge to real world usability. We hypothesize that abrupt increases in model uncertainty signal misalignment in multi-turn LLM interactions, and we exploit this insight to dynamically realign conversational context. We introduce ERGO (Entropy-guided Resetting for Generation Optimization), which continuously quantifies internal uncertainty via Shannon entropy over next token distributions and triggers adaptive prompt consolidation when a sharp spike in entropy is detected. By treating uncertainty as a first class signal rather than a nuisance to eliminate, ERGO embraces variability in language and modeling, representing and responding to uncertainty. In multi-turn tasks with incrementally revealed instructions, ERGO yields a 56.6% average performance gain over standard baselines, increases aptitude (peak performance capability) by 24.7%, and decreases unreliability (variability in performance) by 35.3%, demonstrating that uncertainty aware interventions can improve both accuracy and reliability in conversational AI.
88. 【2510.14040】Quantifying Phonosemantic Iconicity Distributionally in 6 Languages
链接:https://arxiv.org/abs/2510.14040
作者:George Flint,Kaustubh Kislay
类目:Computation and Language (cs.CL)
关键词:largely arbitrary, commonly theorized, systematic relationships, Abstract, systematic relationships manifest
备注: 7 pages, 2 figures, under review -- ACL (AACL 2025)
点击查看摘要
Abstract:Language is, as commonly theorized, largely arbitrary. Yet, systematic relationships between phonetics and semantics have been observed in many specific cases. To what degree could those systematic relationships manifest themselves in large scale, quantitative investigations--both in previously identified and unidentified phenomena? This work undertakes a distributional approach to quantifying phonosemantic iconicity at scale across 6 diverse languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil). In each language, we analyze the alignment of morphemes' phonetic and semantic similarity spaces with a suite of statistical measures, and discover an array of interpretable phonosemantic alignments not previously identified in the literature, along with crosslinguistic patterns. We also analyze 5 previously hypothesized phonosemantic alignments, finding support for some such alignments and mixed results for others.
89. 【2510.14030】hink Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games
链接:https://arxiv.org/abs/2510.14030
作者:César Guerra-Solano,Zhuochun Li,Xiang Lorraine Li
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:reasoning capabilities due, Large language models, Large language, capabilities due, reasoning
备注: EMNLP Main 2025
点击查看摘要
Abstract:Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language versus another, even with similar content. Most previous works evaluate this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as in commonsense or math tasks. However, abstract reasoning is vital to reasoning for everyday life, where people apply "out-of-the-box thinking" to identify and use patterns for solutions, without a reliance on formulaic approaches. Comparatively, little work has evaluated linguistic biases in this task type. In this paper, we propose a task inspired by the New York Times Connections: GlobalGroup, that evaluates models in an abstract reasoning task across several languages. We constructed a game benchmark with five linguistic backgrounds -- English, Spanish, Chinese, Hindi, and Arabic -- in both the native language and an English translation for comparison. We also proposed game difficulty measurements to evaluate models on games with similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations. Through experimentation, we find English modalities largely lead to better performance in this abstract reasoning task, and performance disparities between open- and closed-source models.
90. 【2510.14014】CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models
链接:https://arxiv.org/abs/2510.14014
作者:Shehenaz Hossain,Haithem Afli
类目:Computation and Language (cs.CL)
关键词:Correct answers, reflect cultural understanding, necessarily reflect cultural, necessarily reflect, Correct
备注:
点击查看摘要
Abstract:Correct answers do not necessarily reflect cultural understanding. We introduce CRaFT, an explanation-based multilingual evaluation framework designed to assess how large language models (LLMs) reason across cultural contexts. Rather than scoring outputs solely based on accuracy, CRaFT evaluates model explanations using four interpretable metrics: Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation. We apply the framework to 50 culturally grounded questions from the World Values Survey, translated into Arabic, Bengali, and Spanish, and evaluate three models (GPT, DeepSeek, and FANAR) across over 2,100 answer-explanation pairs. Results reveal significant cross-lingual variation in reasoning: Arabic reduces fluency, Bengali enhances it, and Spanish remains largely stable. While GPT adapts more effectively across languages, it exhibits lower consistency; FANAR shows stable but rigid reasoning. These findings suggest that cultural awareness in LLMs is not intrinsic but emerges through linguistic framing. CRaFT offers a new lens for evaluating cross-cultural reasoning in multilingual settings, providing actionable insights for building culturally adaptive language models.
91. 【2510.13998】BitNet Distillation
链接:https://arxiv.org/abs/2510.13998
作者:Xun Wu,Shaohan Huang,Wenhui Wang,Ting Song,Li Dong,Yan Xia,Furu Wei
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:achieving strong task-specific, minimal computational cost, present BitNet Distillation, specific downstream tasks, strong task-specific performance
备注: 12 pages, 4 figures
点击查看摘要
Abstract:In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue of the performance gap between finetuned full-precision and 1.58-bit LLMs on specific tasks. Experimental results show that BitDistill achieves performance comparable to the full-precision counterpart models across model size, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at this https URL.
92. 【2510.13996】he German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models
链接:https://arxiv.org/abs/2510.13996
作者:Lukas Gienapp,Christopher Schröder,Stefan Schweter,Christopher Akiki,Ferdinand Schlatt,Arden Zimmermann,Phillipe Genêt,Martin Potthast
类目:Computation and Language (cs.CL)
关键词:Large language model, unclear licensing status, Large language, large-scale training corpora, German Commons
备注: 13 pages, 3 figures, 12 tables, includes datasheet
点击查看摘要
Abstract:Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated for non-English languages, where openly licensed text remains critically scarce. We introduce the German Commons, the largest collection of openly licensed German text to date. It compiles data from 41 sources across seven domains, encompassing legal, scientific, cultural, political, news, economic, and web text. Through systematic sourcing from established data providers with verifiable licensing, it yields 154.56 billion tokens of high-quality text for language model training. Our processing pipeline implements comprehensive quality filtering, deduplication, and text formatting fixes, ensuring consistent quality across heterogeneous text sources. All domain subsets feature licenses of at least CC-BY-SA 4.0 or equivalent, ensuring legal compliance for model training and redistribution. The German Commons therefore addresses the critical gap in openly licensed German pretraining data, and enables the development of truly open German language models. We also release code for corpus construction and data filtering tailored to German language text, rendering the German Commons fully reproducible and extensible.
93. 【2510.13979】Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks
链接:https://arxiv.org/abs/2510.13979
作者:Supriti Sinhamahapatra,Jan Niehues
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:systems primarily rely, Automatic Speech Recognition, additional multi-modal context, Speech Recognition, disregarding additional multi-modal
备注:
点击查看摘要
Abstract:State-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems primarily rely on acoustic information while disregarding additional multi-modal context. However, visual information are essential in disambiguation and adaptation. While most work focus on speaker images to handle noise conditions, this work also focuses on integrating presentation slides for the use cases of scientific presentation. In a first step, we create a benchmark for multi-modal presentation including an automatic analysis of transcribing domain-specific terminology. Next, we explore methods for augmenting speech models with multi-modal information. We mitigate the lack of datasets with accompanying slides by a suitable approach of data augmentation. Finally, we train a model using the augmented dataset, resulting in a relative reduction in word error rate of approximately 34%, across all words and 35%, for domain-specific terms compared to the baseline model.
Subjects:
Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:
arXiv:2510.13979 [cs.AI]
(or
arXiv:2510.13979v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2510.13979
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
94. 【2510.13975】Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems
链接:https://arxiv.org/abs/2510.13975
作者:Kin Kwan Leung,Mouloud Belbahri,Yi Sui,Alex Labach,Xueying Zhang,Stephen Rose,Jesse C. Cresswell
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:external knowledge databases, building LLM-based question-answering, Retrieval-augmented generation, LLM-based question-answering systems, knowledge databases
备注: 8 pages
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) is a prevalent approach for building LLM-based question-answering systems that can take advantage of external knowledge databases. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. Understanding the range of errors that can occur in practice is crucial for robust deployment. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them. Additionally, we curate a dataset of erroneous RAG responses annotated by error types. We then propose an auto-evaluation method aligned with our taxonomy that can be used in practice to track and address errors during development. Code and data are available at this https URL.
95. 【2510.13940】Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention
链接:https://arxiv.org/abs/2510.13940
作者:Zhen Yang,Mingyang Zhang,Feng Chen,Ganggui Ding,Liang Hou,Xin Tao,Pengfei Wan,Ying-Cong Chen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:increased inference computation, Recent progress, large language models, inference computation, cost of efficiency
备注: Code: [this https URL](https://github.com/EnVision-Research/MTI)
点击查看摘要
Abstract:Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized-only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks-e.g., +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using Qwen3-32B-Reasoning-while remaining highly efficient.
96. 【2510.13939】Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
链接:https://arxiv.org/abs/2510.13939
作者:Tuhin Chakrabarty,Jane C. Ginsburg,Paramveer Dhillon
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:generate derivative content, derivative content, copyrighted books, books for training, led to numerous
备注: Preprint Under Review
点击查看摘要
Abstract:The use of copyrighted books for training AI models has led to numerous lawsuits from authors concerned about AI's ability to generate derivative content. Yet it's unclear if these models can generate high quality literary text while emulating authors' styles. To answer this we conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models: ChatGPT, Claude Gemini in writing up to 450 word excerpts emulating 50 award-winning authors' diverse styles. In blind pairwise evaluations by 159 representative expert lay readers, AI-generated text from in-context prompting was strongly disfavored by experts for both stylistic fidelity (OR=0.16, p10^-8) writing quality (OR=0.13, p10^-7) but showed mixed results with lay readers. However, fine-tuning ChatGPT on individual authors' complete works completely reversed these findings: experts now favored AI-generated text for stylistic fidelity (OR=8.16, p10^-13) writing quality (OR=1.87, p=0.010), with lay readers showing similar shifts. These effects generalize across authors styles. The fine-tuned outputs were rarely flagged as AI-generated (3% rate v. 97% for in-context prompting) by best AI detectors. Mediation analysis shows this reversal occurs because fine-tuning eliminates detectable AI stylistic quirks (e.g., cliche density) that penalize in-context outputs. While we do not account for additional costs of human effort required to transform raw AI output into cohesive, publishable prose, the median fine-tuning inference cost of $81 per author represents a dramatic 99.7% reduction compared to typical professional writer compensation. Author-specific fine-tuning thus enables non-verbatim AI writing that readers prefer to expert human writing, providing empirical evidence directly relevant to copyright's fourth fair-use factor, the "effect upon the potential market or value" of the source works.
97. 【2510.13936】FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis
链接:https://arxiv.org/abs/2510.13936
作者:Fengbin Zhu,Xiang Yao Ng,Ziyang Liu,Chang Liu,Xianwei Zeng,Chao Wang,Tianhui Tan,Xuan Yao,Pengyang Shao,Min Xu,Zixuan Wang,Jing Wang,Xin Lin,Junfeng Li,Jingxian Zhu,Yang Zhang,Wenjie Wang,Fuli Feng,Richang Hong,Huanbo Luan,Ke-Wei Huang,Tat-Seng Chua
类目:Computation and Language (cs.CL)
关键词:Large Language Models, advanced Large Language, recently garnered increasing, garnered increasing attention, complex research tasks
备注:
点击查看摘要
Abstract:Deep Research (DR) agents, powered by advanced Large Language Models (LLMs), have recently garnered increasing attention for their capability in conducting complex research tasks. However, existing literature lacks a rigorous and systematic evaluation of DR Agent's capabilities in critical research analysis. To address this gap, we first propose HisRubric, a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric for rigorously assessing DR agents' capabilities in corporate financial analysis. This framework mirrors the professional analyst's workflow, progressing from data recognition to metric calculation, and finally to strategic summarization and interpretation. Built on this framework, we construct a FinDeepResearch benchmark that comprises 64 listed companies from 8 financial markets across 4 languages, encompassing a total of 15,808 grading items. We further conduct extensive experiments on the FinDeepResearch using 16 representative methods, including 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only. The results reveal the strengths and limitations of these approaches across diverse capabilities, financial markets, and languages, offering valuable insights for future research and development. The benchmark and evaluation code will be made publicly available.
98. 【2510.13935】Big Reasoning with Small Models: Instruction Retrieval at Inference Time
链接:https://arxiv.org/abs/2510.13935
作者:Kenan Alkiek,David Jurgens,Vinod Vydiswaran
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:bring large-scale reasoning, local-scale compute, bring large-scale, MMLU Professional Law, reasoning
备注:
点击查看摘要
Abstract:Can we bring large-scale reasoning to local-scale compute? Small language models (SLMs) are increasingly attractive because they run efficiently on local hardware, offering strong privacy, low cost, and reduced environmental impact. Yet they often struggle with tasks that require multi-step reasoning or domain-specific knowledge. We address this limitation through instruction intervention at inference time, where an SLM retrieves structured reasoning procedures rather than generating them from scratch. Our method builds an Instruction Corpus by grouping similar training questions and creating instructions via GPT-5. During inference, the SLM retrieves the most relevant instructions and follows their steps. Unlike retrieval-augmented generation, which retrieves text passages, instruction retrieval gives the model structured guidance for reasoning. We evaluate this framework on MedQA (medical board exams), MMLU Professional Law, and MathQA using models from 3B to 14B parameters without any additional fine-tuning. Instruction retrieval yields consistent gains: 9.4% on MedQA, 7.9% on MMLU Law, and 5.1% on MathQA. Concise instructions outperform longer ones, and the magnitude of improvement depends strongly on model family and intrinsic reasoning ability.
99. 【2510.13931】Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions
链接:https://arxiv.org/abs/2510.13931
作者:Siying Liu,Shisheng Zhang,Indu Bala
类目:Computation and Language (cs.CL)
关键词:Large language models, prediction remains underexplored, drug-safety prediction remains, Large language, Administration Adverse Event
备注: Preprint of a paper accepted as a poster at the NeurIPS 2025 Workshop on Generative AI for Health (GenAI4Health). The final camera-ready workshop version may differ. Licensed under CC BY 4.0
点击查看摘要
Abstract:Large language models (LLMs) are increasingly applied in biomedical domains, yet their reliability in drug-safety prediction remains underexplored. In this work, we investigate whether LLMs incorporate socio-demographic information into adverse event (AE) predictions, despite such attributes being clinically irrelevant. Using structured data from the United States Food and Drug Administration Adverse Event Reporting System (FAERS) and a persona-based evaluation framework, we assess two state-of-the-art models, ChatGPT-4o and Bio-Medical-Llama-3.8B, across diverse personas defined by education, marital status, employment, insurance, language, housing stability, and religion. We further evaluate performance across three user roles (general practitioner, specialist, patient) to reflect real-world deployment scenarios where commercial systems often differentiate access by user type. Our results reveal systematic disparities in AE prediction accuracy. Disadvantaged groups (e.g., low education, unstable housing) were frequently assigned higher predicted AE likelihoods than more privileged groups (e.g., postgraduate-educated, privately insured). Beyond outcome disparities, we identify two distinct modes of bias: explicit bias, where incorrect predictions directly reference persona attributes in reasoning traces, and implicit bias, where predictions are inconsistent, yet personas are not explicitly mentioned. These findings expose critical risks in applying LLMs to pharmacovigilance and highlight the urgent need for fairness-aware evaluation protocols and mitigation strategies before clinical deployment.
100. 【2510.13928】LLMs Can Get "Brain Rot"!
链接:https://arxiv.org/abs/2510.13928
作者:Shuo Xing,Junyuan Hong,Yifan Wang,Runjin Chen,Zhenyu Zhang,Ananth Grama,Zhengzhong Tu,Zhangyang Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Brain Rot Hypothesis, web text induces, text induces lasting, junk web text, Rot Hypothesis
备注:
点击查看摘要
Abstract:We propose and test the LLM Brain Rot Hypothesis: continual exposure to junk web text induces lasting cognitive decline in large language models (LLMs). To causally isolate data quality, we run controlled experiments on real Twitter/X corpora, constructing junk and reversely controlled datasets via two orthogonal operationalizations: M1 (engagement degree) and M2 (semantic quality), with matched token scale and training operations across conditions. Contrary to the control group, continual pre-training of 4 LLMs on the junk dataset causes non-trivial declines (Hedges' $g0.3$) on reasoning, long-context understanding, safety, and inflating "dark traits" (e.g., psychopathy, narcissism). The gradual mixtures of junk and control datasets also yield dose-response cognition decay: for example, under M1, ARC-Challenge with Chain Of Thoughts drops $74.9 \rightarrow 57.2$ and RULER-CWE $84.4 \rightarrow 52.3$ as junk ratio rises from $0\%$ to $100\%$. Error forensics reveal several key insights. First, we identify thought-skipping as the primary lesion: models increasingly truncate or skip reasoning chains, explaining most of the error growth. Second, partial but incomplete healing is observed: scaling instruction tuning and clean data pre-training improve the declined cognition yet cannot restore baseline capability, suggesting persistent representational drift rather than format mismatch. Finally, we discover that the popularity, a non-semantic metric, of a tweet is a better indicator of the Brain Rot effect than the length in M1. Together, the results provide significant, multi-perspective evidence that data quality is a causal driver of LLM capability decay, reframing curation for continual pretraining as a \textit{training-time safety} problem and motivating routine "cognitive health checks" for deployed LLMs.
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2510.13928 [cs.CL]
(or
arXiv:2510.13928v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2510.13928
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
101. 【2510.13926】BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLMs
链接:https://arxiv.org/abs/2510.13926
作者:Congying Liu,Xingyuan Wei,Peipei Liu,Yiqing Shen,Yanxu Mao,Tiehan Cui
类目:Computation and Language (cs.CL)
关键词:gene regulatory mechanisms, deep understanding, understanding of specialized, specialized knowledge, gene regulatory
备注:
点击查看摘要
Abstract:Biomedical queries often rely on a deep understanding of specialized knowledge such as gene regulatory mechanisms and pathological processes of diseases. They require detailed analysis of complex physiological processes and effective integration of information from multiple data sources to support accurate retrieval and reasoning. Although large language models (LLMs) perform well in general reasoning tasks, their generated biomedical content often lacks scientific rigor due to the inability to access authoritative biomedical databases and frequently fabricates protein functions, interactions, and structural details that deviate from authentic information. Therefore, we present BioMedSearch, a multi-source biomedical information retrieval framework based on LLMs. The method integrates literature retrieval, protein database and web search access to support accurate and efficient handling of complex biomedical queries. Through sub-queries decomposition, keywords extraction, task graph construction, and multi-source information filtering, BioMedSearch generates high-quality question-answering results. To evaluate the accuracy of question answering, we constructed a multi-level dataset, BioMedMCQs, consisting of 3,000 questions. The dataset covers three levels of reasoning: mechanistic identification, non-adjacent semantic integration, and temporal causal reasoning, and is used to assess the performance of BioMedSearch and other methods on complex QA tasks. Experimental results demonstrate that BioMedSearch consistently improves accuracy over all baseline models across all levels. Specifically, at Level 1, the average accuracy increases from 59.1% to 91.9%; at Level 2, it rises from 47.0% to 81.0%; and at the most challenging Level 3, the average accuracy improves from 36.3% to 73.4%. The code and BioMedMCQs are available at: this https URL
102. 【2510.13925】An LLM-Powered AI Agent Framework for Holistic IoT Traffic Interpretation
链接:https://arxiv.org/abs/2510.13925
作者:Daniel Adu Worae,Spyridon Mastorakis
类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
关键词:Internet of Things, networks generate diverse, generate diverse, diverse and high-volume, reflects both normal
备注:
点击查看摘要
Abstract:Internet of Things (IoT) networks generate diverse and high-volume traffic that reflects both normal activity and potential threats. Deriving meaningful insight from such telemetry requires cross-layer interpretation of behaviors, protocols, and context rather than isolated detection. This work presents an LLM-powered AI agent framework that converts raw packet captures into structured and semantically enriched representations for interactive analysis. The framework integrates feature extraction, transformer-based anomaly detection, packet and flow summarization, threat intelligence enrichment, and retrieval-augmented question answering. An AI agent guided by a large language model performs reasoning over the indexed traffic artifacts, assembling evidence to produce accurate and human-readable interpretations. Experimental evaluation on multiple IoT captures and six open models shows that hybrid retrieval, which combines lexical and semantic search with reranking, substantially improves BLEU, ROUGE, METEOR, and BERTScore results compared with dense-only retrieval. System profiling further indicates low CPU, GPU, and memory overhead, demonstrating that the framework achieves holistic and efficient interpretation of IoT network traffic.
103. 【2510.13922】LTR-ICD: A Learning-to-Rank Approach for Automatic ICD Coding
链接:https://arxiv.org/abs/2510.13922
作者:Mohammad Mansoori,Amira Soliman,Farzaneh Etminani
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:unstructured text provided, Clinical notes, patient encounters, unstructured text, text provided
备注:
点击查看摘要
Abstract:Clinical notes contain unstructured text provided by clinicians during patient encounters. These notes are usually accompanied by a sequence of diagnostic codes following the International Classification of Diseases (ICD). Correctly assigning and ordering ICD codes are essential for medical diagnosis and reimbursement. However, automating this task remains challenging. State-of-the-art methods treated this problem as a classification task, leading to ignoring the order of ICD codes that is essential for different purposes. In this work, as a first attempt, we approach this task from a retrieval system perspective to consider the order of codes, thus formulating this problem as a classification and ranking task. Our results and analysis show that the proposed framework has a superior ability to identify high-priority codes compared to other methods. For instance, our model accuracy in correctly ranking primary diagnosis codes is 47%, compared to 20% for the state-of-the-art classifier. Additionally, in terms of classification metrics, the proposed model achieves a micro- and macro-F1 scores of 0.6065 and 0.2904, respectively, surpassing the previous best model with scores of 0.597 and 0.2660.
104. 【2510.13920】FACTS: Table Summarization via Offline Template Generation with Agentic Workflows
链接:https://arxiv.org/abs/2510.13920
作者:Ye Yuan,Mohammad Amin Shabani,Siqi Liu
类目:Computation and Language (cs.CL)
关键词:tabular data conditioned, summarization requires generating, requires generating natural, user query, enabling users
备注: Under Review
点击查看摘要
Abstract:Query-focused table summarization requires generating natural language summaries of tabular data conditioned on a user query, enabling users to access insights beyond fact retrieval. Existing approaches face key limitations: table-to-text models require costly fine-tuning and struggle with complex reasoning, prompt-based LLM methods suffer from token-limit and efficiency issues while exposing sensitive data, and prior agentic pipelines often rely on decomposition, planning, or manual templates that lack robustness and scalability. To mitigate these issues, we introduce an agentic workflow, FACTS, a Fast, Accurate, and Privacy-Compliant Table Summarization approach via Offline Template Generation. FACTS produces offline templates, consisting of SQL queries and Jinja2 templates, which can be rendered into natural language summaries and are reusable across multiple tables sharing the same schema. It enables fast summarization through reusable offline templates, accurate outputs with executable SQL queries, and privacy compliance by sending only table schemas to LLMs. Evaluations on widely-used benchmarks show that FACTS consistently outperforms baseline methods, establishing it as a practical solution for real-world query-focused table summarization.
105. 【2510.13918】Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
链接:https://arxiv.org/abs/2510.13918
作者:Peng Kuang,Yanli Wang,Xiaoyu Han,Yaowenqi Liu,Kaidi Xu,Haohan Wang
类目:Computation and Language (cs.CL)
关键词:Process reward models, Process reward, large language models, designed to verify, verify and select
备注:
点击查看摘要
Abstract:Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks where simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection. This raises a critical question: How can we effectively utilize verification signals from PRMs for TTS? To address this, we start by developing a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method significantly boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only $21.3\%$ of the computation. Ultimately, our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.
106. 【2510.13916】Element2Vec: Build Chemical Element Representation from Text for Property Prediction
链接:https://arxiv.org/abs/2510.13916
作者:Yuanhao Li,Keyuan Lai,Tianqi Wang,Qihao Liu,Jiawei Ma,Yuan-Chao Hu
类目:Computation and Language (cs.CL)
关键词:measure directly due, Accurate property data, Accurate property, equipment constraints, difficult to measure
备注:
点击查看摘要
Abstract:Accurate property data for chemical elements is crucial for materials design and manufacturing, but many of them are difficult to measure directly due to equipment constraints. While traditional methods use the properties of other elements or related properties for prediction via numerical analyses, they often fail to model complex relationships. After all, not all characteristics can be represented as scalars. Recent efforts have been made to explore advanced AI tools such as language models for property estimation, but they still suffer from hallucinations and a lack of interpretability. In this paper, we investigate Element2Vecto effectively represent chemical elements from natural languages to support research in the natural sciences. Given the text parsed from Wikipedia pages, we use language models to generate both a single general-purpose embedding (Global) and a set of attribute-highlighted vectors (Local). Despite the complicated relationship across elements, the computational challenges also exist because of 1) the discrepancy in text distribution between common descriptions and specialized scientific texts, and 2) the extremely limited data, i.e., with only 118 known elements, data for specific properties is often highly sparse and incomplete. Thus, we also design a test-time training method based on self-attention to mitigate the prediction error caused by Vanilla regression clearly. We hope this work could pave the way for advancing AI-driven discovery in materials science.
107. 【2510.13915】Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models
链接:https://arxiv.org/abs/2510.13915
作者:Ivan Lee,Taylor Berg-Kirkpatrick
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Recent studies suggest, generate surprisingly coherent, Recent studies, surprisingly coherent text, child-directed corpora
备注: Accepted to COLM 2025 (Spotlight)
点击查看摘要
Abstract:Recent studies suggest that very small language models (SLMs) can generate surprisingly coherent text when trained on simplified, child-directed corpora such as TinyStories. These findings have been interpreted as evidence that readability -- characterized by accessible vocabulary, familiar narrative structure, and simple syntax -- plays a key role in enabling such capabilities to emerge. In this paper, we challenge that interpretation. We construct synthetic datasets with matched structure but varied readability, and find that readability alone does not predict coherence or learning efficiency in SLMs. Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even exhibit faster development of coherence during training. Instead, we show that statistical simplicity, as measured by n-gram diversity, is a stronger predictor of learnability. Our findings caution against the growing trend of anthropomorphizing language model training -- drawing parallels to human cognitive development without empirical basis -- and argue for more precise reasoning about what properties actually support capability emergence in small models.
108. 【2510.13913】Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms
链接:https://arxiv.org/abs/2510.13913
作者:Shrey Pandit,Xuan-Phi Nguyen,Yifei Ming,Austin Xu,Jiayu Wang,Caiming Xiong,Shafiq Joty
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:solve complex question, deep research, online tools, aim to solve, solve complex
备注: Preprint. ICLR 26 submission
点击查看摘要
Abstract:Web-based 'deep research' agents aim to solve complex question - answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning and exploration. Prior work has proposed workflows for constructing instruction-tuning datasets, often leveraging knowledge graphs. However, such methods typically lack fine-grained control over difficulty and quality, yielding synthetic data that falls short of capturing the complexity required for long-horizon reasoning. Furthermore, many studies conflate data and training effects by comparing models trained under different optimization recipes, making it difficult to isolate and evaluate the effectiveness of the data itself. We introduce a two-pronged data synthesis pipeline that generates question - answer pairs by progressively increasing task complexity until a frontier baseline web agent fails. The baseline agent plays multiple roles in this process: attempting the questions, validating factuality, checking for alternative answers, and enforcing filtering. To evaluate the effectiveness of our synthesis methods, we adopt a controlled training setup based on distillation from strong web agents. Experiments across multiple web-based benchmarks show that our dataset - despite being smaller - enables the training of more effective web agents than existing datasets. In particular, our data exhibits twice the diversity in tool-use actions, allowing models trained on it to achieve stronger performance while avoiding repetitive tool-calling behaviors.
109. 【2510.13912】AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs
链接:https://arxiv.org/abs/2510.13912
作者:María Victoria Carro,Denise Alejandra Mester,Facundo Nieto,Oscar Agustín Stanchi,Guido Ernesto Bergman,Mario Alejandro Leiva,Eitan Sprejer,Luca Nicolás Forziati Gangi,Francisca Gauna Selasco,Juan Gustavo Corvalán,Gerardo I. Simari,María Vanina Martinez
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:scalable oversight technique, prior beliefs, lie convincingly, core premise, scalable oversight
备注: 31 pages
点击查看摘要
Abstract:The core premise of AI debate as a scalable oversight technique is that it is harder to lie convincingly than to refute a lie, enabling the judge to identify the correct position. Yet, existing debate experiments have relied on datasets with ground truth, where lying is reduced to defending an incorrect proposition. This overlooks a subjective dimension: lying also requires the belief that the claim defended is false. In this work, we apply debate to subjective questions and explicitly measure large language models' prior beliefs before experiments. Debaters were asked to select their preferred position, then presented with a judge persona deliberately designed to conflict with their identified priors. This setup tested whether models would adopt sycophantic strategies, aligning with the judge's presumed perspective to maximize persuasiveness, or remain faithful to their prior beliefs. We implemented and compared two debate protocols, sequential and simultaneous, to evaluate potential systematic biases. Finally, we assessed whether models were more persuasive and produced higher-quality arguments when defending positions consistent with their prior beliefs versus when arguing against them. Our main findings show that models tend to prefer defending stances aligned with the judge persona rather than their prior beliefs, sequential debate introduces significant bias favoring the second debater, models are more persuasive when defending positions aligned with their prior beliefs, and paradoxically, arguments misaligned with prior beliefs are rated as higher quality in pairwise comparison. These results can inform human judges to provide higher-quality training signals and contribute to more aligned AI systems, while revealing important aspects of human-AI interaction regarding persuasion dynamics in language models.
110. 【2510.13910】RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems
链接:https://arxiv.org/abs/2510.13910
作者:Jingru Lin,Chen Zhang,Stephen Y. Liu,Haizhou Li
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, mitigates key limitations, retrieving external information, hallucinations-by dynamically retrieving
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs)-such as factual errors, outdated knowledge, and hallucinations-by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark's validity and the importance of enhancing these intermediate capabilities.
111. 【2510.13909】Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning
链接:https://arxiv.org/abs/2510.13909
作者:Xingrui Zhuo,Jiapu Wang,Gongqing Wu,Zhongyuan Wang,Jichen Zhang,Shirui Pan,Xindong Wu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Graph Foundation Models, Knowledge Graph Foundation, Knowledge Graph Reasoning, Inductive Knowledge Graph, LLM knowledge
备注:
点击查看摘要
Abstract:Inductive Knowledge Graph Reasoning (KGR) aims to discover facts in open-domain KGs containing unknown entities and relations, which poses a challenge for KGR models in comprehending uncertain KG components. Existing studies have proposed Knowledge Graph Foundation Models (KGFMs) that learn structural invariances across KGs to handle this uncertainty. Recently, Large Language Models (LLMs) have demonstrated strong capabilities for open-domain knowledge reasoning. As a result, the latest research has focused on LLM-based KGFMs that integrate LLM knowledge with KG context for inductive KGR. However, the intrinsic knowledge of LLMs may be overshadowed by sparse KG context, leading to LLM knowledge distortion, which can cause irreversible damage to model reasoning. Moreover, existing LLM-based KGR methods still struggle to fully constrain generative hallucinations in LLMs, severely limiting the credibility of reasoning results. To address these limitations, we propose a Knowledge Reasoning Language Model (KRLM) that achieves unified coordination between LLM knowledge and KG context throughout the KGR process. Specifically, we design a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer to align LLM knowledge with KG representations. Then, we propose a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism. Finally, a structure-aware next-entity predictor is proposed, which strictly constrains the reasoning results within a trustworthy knowledge domain. Extensive experimental results on 25 real-world inductive KGR datasets demonstrate the significant superiority of the proposed KRLM\footnote{Our source codes are available at this https URL in both zero-shot reasoning and fine-tuning scenarios.
112. 【2510.13908】Interpreting the Latent Structure of Operator Precedence in Language Models
链接:https://arxiv.org/abs/2510.13908
作者:Dharunish Yugeswardeenoo,Harshil Nukala,Cole Blondin,Sean O Brien,Vasu Sharma,Kevin Zhu
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, demonstrated impressive reasoning, impressive reasoning capabilities, Language Models
备注: 9 pages, 4 figures. Accepted to INTERPLAY Workshop at COLM 2025
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated impressive reasoning capabilities but continue to struggle with arithmetic tasks. Prior works largely focus on outputs or prompting strategies, leaving the open question of the internal structure through which models do arithmetic computation. In this work, we investigate whether LLMs encode operator precedence in their internal representations via the open-source instruction-tuned LLaMA 3.2-3B model. We constructed a dataset of arithmetic expressions with three operands and two operators, varying the order and placement of parentheses. Using this dataset, we trace whether intermediate results appear in the residual stream of the instruction-tuned LLaMA 3.2-3B model. We apply interpretability techniques such as logit lens, linear classification probes, and UMAP geometric visualization. Our results show that intermediate computations are present in the residual stream, particularly after MLP blocks. We also find that the model linearly encodes precedence in each operator's embeddings post attention layer. We introduce partial embedding swap, a technique that modifies operator precedence by exchanging high-impact embedding dimensions between operators.
113. 【2510.13907】LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization
链接:https://arxiv.org/abs/2510.13907
作者:Yuanchen Wu,Saurabh Verma,Justin Lee,Fangzhou Xiong,Poppy Zhang,Amel Awadelkarim,Xu Chen,Yubai Yuan,Shawndra Hill
类目:Computation and Language (cs.CL); Machine Learning (stat.ML)
关键词:Large language models, Large language, making prompt design, language models, central challenge
备注:
点击查看摘要
Abstract:Large language models (LLMs) are highly sensitive to their input prompts, making prompt design a central challenge. While automatic prompt optimization (APO) reduces manual engineering, most approaches assume access to ground-truth references such as labeled validation data. In practice, however, collecting high-quality labels is costly and slow. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO formulates the problem as a dueling-bandit setting, where supervision signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with Top-Performer Guided Mutation, which expands the candidate pool by mutating high-performing prompts. PDO naturally operates in label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods. Ablation studies further demonstrate the effectiveness of both D-TS and prompt mutation.
114. 【2510.13905】Schema for In-Context Learning
链接:https://arxiv.org/abs/2510.13905
作者:Pan Chen,Shaohong Chen,Mark Wang,Shi Xuan Leong,Priscilla Fung,Varinia Bernales,Alan Aspuru-Guzik
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:enables transformer-based language, enables transformer-based, tasks by conditioning, In-Context Learning, example-driven in-context learning
备注:
点击查看摘要
Abstract:In-Context Learning (ICL) enables transformer-based language models to adapt to new tasks by conditioning on demonstration examples. However, traditional example-driven in-context learning lacks explicit modules for knowledge retrieval and transfer at the abstraction level. Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce SCHEMA ACTIVATED IN CONTEXT LEARNING (SA-ICL). This framework extracts the representation of the building blocks of cognition for the reasoning process instilled from prior examples, creating an abstracted schema, a lightweight, structured template of key inferential steps and their relationships, which is then used to augment a model's reasoning process when presented with a novel question. We demonstrate that a broad range of large language models (LLMs) lack the capacity to form and utilize internal schema-based learning representations implicitly, but instead benefit significantly from explicit schema-based scaffolding. Across chemistry and physics questions from the GPQA dataset, our experiments show that SA-ICL consistently boosts performance, up to 36.19 percent, when the single demonstration example is of high quality, which simultaneously reduces reliance on the number of demonstrations and enhances interpretability. SCHEMA ACTIVATED IN CONTEXT LEARNING not only bridges disparate ICL strategies ranging from pattern priming to Chain-of-Thought prompting, but also paves a new path for enhancing human-like reasoning in LLMs.
115. 【2510.13902】Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory
链接:https://arxiv.org/abs/2510.13902
作者:Nicole Smith-Vaniz,Harper Lyon,Lorraine Steigner,Ben Armstrong,Nicholas Mattei
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:Large Language Models, Large Language, Language Models, Moral Foundations Theory, personal relationships
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have become increasingly incorporated into everyday life for many internet users, taking on significant roles as advice givers in the domains of medicine, personal relationships, and even legal matters. The importance of these roles raise questions about how and what responses LLMs make in difficult political and moral domains, especially questions about possible biases. To quantify the nature of potential biases in LLMs, various works have applied Moral Foundations Theory (MFT), a framework that categorizes human moral reasoning into five dimensions: Harm, Fairness, Ingroup Loyalty, Authority, and Purity. Previous research has used the MFT to measure differences in human participants along political, national, and cultural lines. While there has been some analysis of the responses of LLM with respect to political stance in role-playing scenarios, no work so far has directly assessed the moral leanings in the LLM responses, nor have they connected LLM outputs with robust human data. In this paper we analyze the distinctions between LLM MFT responses and existing human research directly, investigating whether commonly available LLM responses demonstrate ideological leanings: either through their inherent responses, straightforward representations of political ideologies, or when responding from the perspectives of constructed human personas. We assess whether LLMs inherently generate responses that align more closely with one political ideology over another, and additionally examine how accurately LLMs can represent ideological perspectives through both explicit prompting and demographic-based role-playing. By systematically analyzing LLM behavior across these conditions and experiments, our study provides insight into the extent of political and demographic dependency in AI-generated responses.
116. 【2510.13901】RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
链接:https://arxiv.org/abs/2510.13901
作者:Tuan T. Nguyen,John Le,Thai T. Vu,Willy Susilo,Heath Cooper
类目:Computation and Language (cs.CL)
关键词:Large language models, bypass safety mechanisms, Large language, achieve impressive performance, language models
备注:
点击查看摘要
Abstract:Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.
117. 【2510.13900】Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
链接:https://arxiv.org/abs/2510.13900
作者:Julian Minder,Clément Dumas,Stewart Slocum,Helena Casademunt,Cameron Holmes,Robert West,Neel Nanda
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:adapt Large Language, Large Language Models, Large Language, adapt Large, Language Models
备注:
点击查看摘要
Abstract:Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research. We show that narrow finetuning creates strong biases in LLM activations that can be interpreted to understand the finetuning domain. These biases can be discovered using simple tools from model diffing - the study of differences between models before and after finetuning. In particular, analyzing activation differences on the first few tokens of random text and steering by adding this difference to the model activations produces text similar to the format and general content of the finetuning data. We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain. With access to the bias, the agent performs significantly better compared to baseline agents using simple prompting. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo word guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). We suspect these biases reflect overfitting and find that mixing pretraining data into the finetuning corpus largely removes them, though residual risks may remain. Our work (1) demonstrates that narrowly finetuned models have salient traces of their training objective in their activations and suggests ways to improve how they are trained, (2) warns AI safety and interpretability researchers that the common practice of using such models as a proxy for studying broader finetuning (e.g., chat-tuning) might not be realistic, and (3) highlights the need for deeper investigation into the effects of narrow finetuning and development of truly realistic case studies for model-diffing, safety and interpretability research.
118. 【2510.13898】Attribution Quality in AI-Generated Content:Benchmarking Style Embeddings and LLM Judges
链接:https://arxiv.org/abs/2510.13898
作者:Misam Abbas
类目:Computation and Language (cs.CL)
关键词:large language models, rivals human writing, Attributing authorship, machine-generated prose rivals, LLM judge
备注: Accepted for publication at the 2025 IEEE ICDM Workshop on "Grounding Documents with Reasoning, Agents, Retrieval, and Attribution". This is author submitted version. Not yet published
点击查看摘要
Abstract:Attributing authorship in the era of large language models (LLMs) is increasingly challenging as machine-generated prose rivals human writing. We benchmark two complementary attribution mechanisms , fixed Style Embeddings and an instruction-tuned LLM judge (GPT-4o) on the Human AI Parallel Corpus, an open dataset of 600 balanced instances spanning six domains (academic, news, fiction, blogs, spoken transcripts, and TV/movie scripts). Each instance contains a human prompt with both a gold continuation and an LLM-generated continuation from either GPT-4o or LLaMA-70B-Instruct. The Style Embedding baseline achieves stronger aggregate accuracy on GPT continuations (82 pct vs. 68 pct). The LLM Judge is slightly better than the Style embeddings on LLaMA continuations (85 pct vs. 81 pct) but the results are not statistically significant. Crucially, the LLM judge significantly outperforms in fiction and academic prose, indicating semantic sensitivity, whereas embeddings dominate in spoken and scripted dialogue, reflecting structural strengths. These complementary patterns highlight attribution as a multidimensional problem requiring hybrid strategies. To support reproducibility we provide code on GitHub and derived data on Hugging Face under the MIT license. This open framework provides a reproducible benchmark for attribution quality assessment in AI-generated content, along with a review of related literature influencing this work.
119. 【2510.13893】Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection
链接:https://arxiv.org/abs/2510.13893
作者:Olga E. Sorokoletova,Francesco Giarrusso,Vincenzo Suriani,Daniele Nardi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Jailbreaking techniques pose, safety of Large, Large Language, Jailbreaking techniques
备注:
点击查看摘要
Abstract:Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than the jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcome of our experiments are manifold. First, we developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, consolidating and extending prior classifications into seven broad families, including impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmark a popular LLM for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.
120. 【2510.13892】he Harder The Better: Maintaining Supervised Fine-tuning Generalization with Less but Harder Data
链接:https://arxiv.org/abs/2510.13892
作者:Zhaoyang Shang,Sibo Wei,Jianbin Guo,Rui Zhou,Lifeng Dong,Yin Luo
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, high-quality supervised fine-tuning, specialized domains relies, Language Models
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) excel in general tasks, but adapting them to specialized domains relies on high-quality supervised fine-tuning (SFT) data. Although existing methods can identify subsets of high-quality data and reduce training cost to some extent, their selection process still suffers from over-reliance on LLMs' internal knowledge, weak interpretability, and limited generalization. To address these limitations, we propose THTB (The Harder The Better), a cognitive science-inspired framework for instruction data selection and annotation guidance. THTB prioritizes higher-level cognitive instructions by combining quality filtering with intrinsic and extrinsic hardness scoring, offering interpretable and quantifiable criteria for efficient SFT, both in data selection and annotation guidance. Experiments show that THTB enables models trained on only 5% of the data to outperform full-dataset training, while achieving superior generalization compared with LLM-only selection. In addition, THTB provides effective annotation guidance in vertical domains, enabling a model trained on just 2% of the data to surpass models trained on much larger datasets, demonstrating strong potential for domain adaptation. Our code, datasets, and models are available on this https URL.
121. 【2510.13890】A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness
链接:https://arxiv.org/abs/2510.13890
作者:Fali Wang,Jihai Chen,Shuhua Yang,Ali Al-Lawati,Linli Tang,Hui Liu,Suhang Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, limited edge deployability, high fine-tuning costs, face high fine-tuning, Small language models
备注: 17 pages, 17 figures, under review
点击查看摘要
Abstract:Large language models (LLMs) have advanced many domains and applications but face high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns. Small language models (SLMs), compact, efficient, and adaptable, offer complementary remedies. Recent work explores collaborative frameworks that fuse SLMs' specialization and efficiency with LLMs' generalization and reasoning to meet diverse objectives across tasks and deployment scenarios. Motivated by these developments, this paper presents a systematic survey of SLM-LLM collaboration organized by collaboration objectives. We propose a taxonomy with four goals: performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness. Within this framework, we review representative methods, summarize design paradigms, and outline open challenges and future directions toward efficient, secure, and scalable SLM-LLM collaboration.
122. 【2510.13888】Reliable Fine-Grained Evaluation of Natural Language Math Proofs
链接:https://arxiv.org/abs/2510.13888
作者:Wenjie Ma,Andrei Cojocaru,Neel Kolhe,Bradley Louie,Robin Said Sharif,Haihan Zhang,Vincent Zhuang,Matei Zaharia,Sewon Min
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:verifying natural language, verifiable final answers, easily verifiable final, natural language math, large language models
备注: 31 pages, 6 figures, 10 tables
点击查看摘要
Abstract:Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. %with expert gradings. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
123. 【2510.13885】Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization
链接:https://arxiv.org/abs/2510.13885
作者:Ariel Kamen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Interactive Advertising Bureau, Advertising Bureau, Interactive Advertising, study presents, presents a comparative
备注: 10 pages, 4 figures,
点击查看摘要
Abstract:This study presents a comparative evaluation of ten state-of-the-art large language models (LLMs) applied to unstructured text categorization using the Interactive Advertising Bureau (IAB) 2.2 hierarchical taxonomy. The analysis employed a uniform dataset of 8,660 human-annotated samples and identical zero-shot prompts to ensure methodological consistency across all models. Evaluation metrics included four classic measures - accuracy, precision, recall, and F1-score - and three LLM-specific indicators: hallucination ratio, inflation ratio, and categorization cost. Results show that, despite their rapid advancement, contemporary LLMs achieve only moderate classic performance, with average scores of 34% accuracy, 42% precision, 45% recall, and 41% F1-score. Hallucination and inflation ratios reveal that models frequently overproduce categories relative to human annotators. Among the evaluated systems, Gemini 1.5/2.0 Flash and GPT 20B/120B offered the most favorable cost-to-performance balance, while GPT 120B demonstrated the lowest hallucination ratio. The findings suggest that scaling and architectural improvements alone do not ensure better categorization accuracy, as the task requires compressing rich unstructured text into a limited taxonomy - a process that challenges current model architectures. To address these limitations, a separate ensemble-based approach was developed and tested. The ensemble method, in which multiple LLMs act as independent experts, substantially improved accuracy, reduced inflation, and completely eliminated hallucinations. These results indicate that coordinated orchestration of models - rather than sheer scale - may represent the most effective path toward achieving or surpassing human-expert performance in large-scale text categorization.
Comments:
10 pages, 4 figures,
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2510.13885 [cs.CL]
(or
arXiv:2510.13885v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2510.13885
Focus to learn more
arXiv-issued DOI via DataCite</p>
124. 【2510.13884】oo Open for Opinion? Embracing Open-Endedness in Large Language Models for Social Simulation
链接:https://arxiv.org/abs/2510.13884
作者:Bolei Ma,Yong Cao,Indira Sen,Anna-Carolina Haensch,Frauke Kreuter,Barbara Plank,Daniel Hershcovich
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, simulate public opinion, simulate public
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly used to simulate public opinion and other social phenomena. Most current studies constrain these simulations to multiple-choice or short-answer formats for ease of scoring and comparison, but such closed designs overlook the inherently generative nature of LLMs. In this position paper, we argue that open-endedness, using free-form text that captures topics, viewpoints, and reasoning processes "in" LLMs, is essential for realistic social simulation. Drawing on decades of survey-methodology research and recent advances in NLP, we argue why this open-endedness is valuable in LLM social simulations, showing how it can improve measurement and design, support exploration of unanticipated views, and reduce researcher-imposed directive bias. It also captures expressiveness and individuality, aids in pretesting, and ultimately enhances methodological utility. We call for novel practices and evaluation frameworks that leverage rather than constrain the open-ended generative diversity of LLMs, creating synergies between NLP and social science.
125. 【2510.13880】PAGE: Prompt Augmentation for text Generation Enhancement
链接:https://arxiv.org/abs/2510.13880
作者:Mauro Jose Pacchiotti,Luciana Ballejos,Mariel Ale
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:shown outstanding performance, natural language generative, recent years, natural language, shown outstanding
备注: in Spanish language
点击查看摘要
Abstract:In recent years, natural language generative models have shown outstanding performance in text generation tasks. However, when facing specific tasks or particular requirements, they may exhibit poor performance or require adjustments that demand large amounts of additional data. This work introduces PAGE (Prompt Augmentation for text Generation Enhancement), a framework designed to assist these models through the use of simple auxiliary modules. These modules, lightweight models such as classifiers or extractors, provide inferences from the input text. The output of these auxiliaries is then used to construct an enriched input that improves the quality and controllability of the generation. Unlike other generation-assistance approaches, PAGE does not require auxiliary generative models; instead, it proposes a simpler, modular architecture that is easy to adapt to different tasks. This paper presents the proposal, its components and architecture, and reports a proof of concept in the domain of requirements engineering, where an auxiliary module with a classifier is used to improve the quality of software requirements generation.
126. 【2510.13879】Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production
链接:https://arxiv.org/abs/2510.13879
作者:Alexandre Galashov,Matt Jones,Rosemary Ke,Yuan Cao,Vaishnavh Nagarajan,Michael C. Mozer
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:model, supervised training objectives, additional compute steps, textit, dynamically and autonomously
备注:
点击查看摘要
Abstract:We explore a class of supervised training objectives that allow a language model to dynamically and autonomously scale the number of compute steps used for each input token. For any token, the model can request additional compute steps by emitting a don't know output. If the model is granted a delay, a specialized pause token is inserted at the next input step, providing the model with additional compute resources to generate an output. The model can request multiple pauses. To train the model to use don't know outputs judiciously and to calibrate its uncertainty, we frame the selection of each output token as a sequential-decision problem with a time cost. We refer to the class of methods as $\textit{Catch Your Breath}$ losses and we study three methods in this class: CYB-AP frames the model's task as anytime prediction, where an output may be required at any step and accuracy is discounted over time; CYB-VA is a variational approach that aims to maximize prediction accuracy subject to a specified distribution over stopping times; and CYB-DP imposes a penalty based on a computational budget. Through fine-tuning experiments, we identify the best performing loss variant. The CYB model needs only one third as much training data as the baseline (no pause) model needs to achieve the same performance, and half as much data as a model with pauses and a cross-entropy loss. We find that the CYB model requests additional steps when doing so improves accuracy, and the model adapts its processing time to token-level complexity and context. For example, it often pauses after plural nouns like $\textit{patients}$ and $\textit{challenges}$ but never pauses after the first token of contracted words like $\textit{wasn}$ and $\textit{didn}$, and it shows high variability for ambiguous tokens like $\textit{won}$, which could function as either a verb or part of a contraction.
127. 【2510.13878】xtBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks
链接:https://arxiv.org/abs/2510.13878
作者:Jimin Lim,Arjun Damerla,Arthur Jiang,Nam Le
类目:Computation and Language (cs.CL)
关键词:make sequential decisions, Large language models, language remains underexplored, performing reasoning tasks, natural language remains
备注: COLM 2025 @ ORIGen Workshop
点击查看摘要
Abstract:Large language models (LLMs) have shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty only using natural language remains underexplored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments using purely textual feedback, "you earned a token", without access to numerical cues or explicit probabilities, resulting in the model to infer latent reward structures purely off linguistic cues and to adapt accordingly. We evaluated the performance of four open-source LLMs and compare their performance to standard decision-making algorithms such as Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed compared to the baselines, Qwen3-4B, achieved the best-arm selection rate of 89.2% , which significantly outperformed both the larger LLMs and traditional methods. Our findings suggest that probabilistic reasoning is able to emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.
128. 【2510.13876】What Layers When: Learning to Skip Compute in LLMs with Residual Gates
链接:https://arxiv.org/abs/2510.13876
作者:Filipe Laitenberger,Dawid Kopiczko,Cees G.M. Snoek,Yuki M. Asano
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:simple residual-stream gating, residual-stream gating mechanism, enables token-wise layer, token-wise layer skipping, introduce GateSkip
备注: Preprint
点击查看摘要
Abstract:We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch's output before it re-enters the residual stream. During inference we rank tokens by the gate values and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. For increasingly larger models, this tradeoff improves drastically. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
129. 【2510.13873】FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation
链接:https://arxiv.org/abs/2510.13873
作者:Johann Pignat,Milena Vucetic,Christophe Gaudet-Blavignac,Jamil Zaghir,Amandine Stettler,Fanny Amrein,Jonatan Bonjour,Jean-Philippe Goldman,Olivier Michielin,Christian Lovis,Mina Bjelogrlic
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Developing natural language, resources remain scarce, natural language processing, language processing tools, Developing natural
备注:
点击查看摘要
Abstract:Developing natural language processing tools for clinical text requires annotated datasets, yet French oncology resources remain scarce. We present FRACCO (FRench Annotated Corpus for Clinical Oncology) an expert-annotated corpus of 1301 synthetic French clinical cases, initially translated from the Spanish CANTEMIST corpus as part of the FRASIMED initiative. Each document is annotated with terms related to morphology, topography, and histologic differentiation, using the International Classification of Diseases for Oncology (ICD-O) as reference. An additional annotation layer captures composite expression-level normalisations that combine multiple ICD-O elements into unified clinical concepts. Annotation quality was ensured through expert review: 1301 texts were manually annotated for entity spans by two domain experts. A total of 71127 ICD-O normalisations were produced through a combination of automated matching and manual validation by a team of five annotators. The final dataset representing 399 unique morphology codes (from 2549 different expressions), 272 topography codes (from 3143 different expressions), and 2043 unique composite expressions (from 11144 different expressions). This dataset provides a reference standard for named entity recognition and concept normalisation in French oncology texts.
130. 【2510.13871】Quechua Speech Datasets in Common Voice: The Case of Puno Quechua
链接:https://arxiv.org/abs/2510.13871
作者:Elwin Huaman,Wendi Huaman,Jorge Luis Huaman,Ninfa Quispe
类目:Computation and Language (cs.CL)
关键词:Common Voice, resource scarcity, hindering their development, Puno Quechua, Common Voice presents
备注: to be published in the 9th Annual International Conference on Information Management and Big Data (SIMBig 2025)
点击查看摘要
Abstract:Under-resourced languages, such as Quechuas, face data and resource scarcity, hindering their development in speech technology. To address this issue, Common Voice presents a crucial opportunity to foster an open and community-driven speech dataset creation. This paper examines the integration of Quechua languages into Common Voice. We detail the current 17 Quechua languages, presenting Puno Quechua (ISO 639-3: qxp) as a focused case study that includes language onboarding and corpus collection of both reading and spontaneous speech data. Our results demonstrate that Common Voice now hosts 191.1 hours of Quechua speech (86\% validated), with Puno Quechua contributing 12 hours (77\% validated), highlighting the Common Voice's potential. We further propose a research agenda addressing technical challenges, alongside ethical considerations for community engagement and indigenous data sovereignty. Our work contributes towards inclusive voice technology and digital empowerment of under-resourced language communities.
131. 【2510.13870】Unlocking the Potential of Diffusion Language Models through Template Infilling
链接:https://arxiv.org/abs/2510.13870
作者:Junhoo Lee,Seungyeon Kim,Nojun Kwak
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Diffusion Language Models, Autoregressive Language Models, Language Models, Diffusion Language, Autoregressive Language
备注:
点击查看摘要
Abstract:Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs' generation process. Unlike conventional prefix prompting, TI first generates a structural template for the target response, then fills in the masked segments. To enhance the flexibility of this structural control, we introduce Dynamic Segment Allocation (DSA), which adaptively adjusts segment lengths based on generation confidence. We demonstrate the effectiveness of our approach on mathematical reasoning and code generation benchmarks, achieving consistent improvements of 17.01$\%$p over baseline. Furthermore, we show that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality.
132. 【2510.13862】Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues
链接:https://arxiv.org/abs/2510.13862
作者:Chenyu Zhang,Sharifa Alghowinem,Cynthia Breazeal
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
关键词:remain insufficiently understood, LLM-mediated tutoring remain, tutoring remain insufficiently, large language model, educational contexts
备注: 4 pages, 3 figures. Published in the 11th International Conference on Affective Computing and Intelligent Interaction (ACII 2025), Late-Breaking Results Track
点击查看摘要
Abstract:While recent studies have examined the leaning impact of large language model (LLM) in educational contexts, the affective dynamics of LLM-mediated tutoring remain insufficiently understood. This work introduces the first ensemble-LLM framework for large-scale affect sensing in tutoring dialogues, advancing the conversation on responsible pathways for integrating generative AI into education by attending to learners' evolving affective states. To achieve this, we analyzed two semesters' worth of 16,986 conversational turns exchanged between PyTutor, an LLM-powered AI tutor, and 261 undergraduate learners across three U.S. institutions. To investigate learners' emotional experiences, we generate zero-shot affect annotations from three frontier LLMs (Gemini, GPT-4o, Claude), including scalar ratings of valence, arousal, and learning-helpfulness, along with free-text emotion labels. These estimates are fused through rank-weighted intra-model pooling and plurality consensus across models to produce robust emotion profiles. Our analysis shows that during interaction with the AI tutor, students typically report mildly positive affect and moderate arousal. Yet learning is not uniformly smooth: confusion and curiosity are frequent companions to problem solving, and frustration, while less common, still surfaces in ways that can derail progress. Emotional states are short-lived--positive moments last slightly longer than neutral or negative ones, but they are fragile and easily disrupted. Encouragingly, negative emotions often resolve quickly, sometimes rebounding directly into positive states. Neutral moments frequently act as turning points, more often steering students upward than downward, suggesting opportunities for tutors to intervene at precisely these junctures.
133. 【2510.13860】ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing
链接:https://arxiv.org/abs/2510.13860
作者:Shivanshu Kumar,Gopalakrishnan Srinivasan
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:language processing tasks, natural language processing, models impose substantial, impose substantial memory, processing tasks
备注:
点击查看摘要
Abstract:While the transformer architecture has achieved state-of-the-art performance on natural language processing tasks, these models impose substantial memory and computational overhead. Recent research has identified significant architectural redundancies within these models, presenting opportunities for optimization without compromising performance. Taking insights from research in AI interpretability and inference-time layer pruning, we introduce an efficient language model architecture, referred to as ShishuLM, which reduces both the parameter count and Key-Value (KV) cache requirements. Given the increasing importance of Small Language Models (SLMs) in agentic AI systems, we evaluate our approach on two SLMs of different scales. Our analysis reveals that for moderate-context scenarios, normalization coupled with attention computation is roughly linear with the input, enabling entire transformer blocks to be approximated through Multi-Layer Perceptrons (MLPs). Our results show that ShishuLM provides up to 25% reduction in memory requirements and up to 40% improvement in latency during both training and inference, compared to parent models. Our experimental and analytical findings provide insights towards building more efficient SLM architectures from a pre-training standpoint.
134. 【2510.13856】Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA
链接:https://arxiv.org/abs/2510.13856
作者:A H M Rezaul Karim,Ozlem Uzuner
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Visual Question Answering, Medical Visual Question, Question Answering, enables natural language, support clinical decision-making
备注:
点击查看摘要
Abstract:Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs -- a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking -- provides a simple and effective baseline for multimodal clinical NLP tasks.
135. 【2510.13855】Harnessing Consistency for Robust Test-Time LLM Ensemble
链接:https://arxiv.org/abs/2510.13855
作者:Zhichen Zeng,Qi Yu,Xiao Lin,Ruizhong Qiu,Xuying Ning,Tianxin Wei,Yuchen Yan,Jingrui He,Hanghang Tong
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:exhibit diverse strengths, large language models, LLM ensemble serves, strengths and weaknesses, complementary capabilities
备注: 15 pages, 12 figures
点击查看摘要
Abstract:Different large language models (LLMs) exhibit diverse strengths and weaknesses, and LLM ensemble serves as a promising approach to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, limited attention has been paid to the robustness of ensembles against potential erroneous signals, which often arise from heterogeneous tokenization schemes and varying model expertise. Our analysis shows that ensemble failures typically arise from both the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. In light of this, we propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensemble, which can be seamlessly integrated with diverse ensemble methods. Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens with high inconsistency, often due to token misalignment, thereby improving robustness at a granular level. Model-level consistency models global agreement by promoting model outputs with high self-confidence and minimal divergence from others, enhancing robustness at a coarser level. Extensive experiments across diverse benchmarks, model combinations, and ensemble strategies demonstrate that CoRE consistently improves ensemble performance and robustness.
136. 【2510.13854】R2T: Rule-Encoded Loss Functions for Low-Resource Sequence Tagging
链接:https://arxiv.org/abs/2510.13854
作者:Mamadou K. Keita,Christopher Homan,Sebastien Diarra
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:network training objective, linguistic rules directly, neural network training, training objective, hybrid approach
备注:
点击查看摘要
Abstract:We introduce the Rule-to-Tag (R2T) framework, a hybrid approach that integrates a multi-tiered system of linguistic rules directly into a neural network's training objective. R2T's novelty lies in its adaptive loss function, which includes a regularization term that teaches the model to handle out-of-vocabulary (OOV) words with principled uncertainty. We frame this work as a case study in a paradigm we call principled learning (PrL), where models are trained with explicit task constraints rather than on labeled examples alone. Our experiments on Zarma part-of-speech (POS) tagging show that the R2T-BiLSTM model, trained only on unlabeled text, achieves 98.2% accuracy, outperforming baselines like AfriBERTa fine-tuned on 300 labeled sentences. We further show that for more complex tasks like named entity recognition (NER), R2T serves as a powerful pre-training step; a model pre-trained with R2T and fine-tuned on just 50 labeled sentences outperformes a baseline trained on 300.
137. 【2510.13853】BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation
链接:https://arxiv.org/abs/2510.13853
作者:Fabian Wenz,Omar Bouattour,Devin Yang,Justin Choi,Cecil Gregg,Nesime Tatbul,Çağatay Demiralp
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Human-Computer Interaction (cs.HC)
关键词:successfully applied, natural language, large private enterprise, SQL logs, querying large private
备注: CIDR'26
点击查看摘要
Abstract:Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at this https URL and is also accessible on our website at this http URL.
138. 【2510.13852】ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups
链接:https://arxiv.org/abs/2510.13852
作者:Peter Banyas,Shristi Sharma,Alistair Simmons,Atharva Vispute
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
关键词:LLM telling, telling, factual consistency, LLM, consistency
备注: For associated code repository, see [this http URL](http://github.com/banyasp/consistencyAI) For user-friendly web app, see [this http URL](http://v0-llm-comparison-webapp.vercel.app/)
点击查看摘要
Abstract:Is an LLM telling you different facts than it's telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) for different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we queried 19 LLMs with prompts that requested 5 facts for each of 15 topics. We repeated this query 100 times for each LLM, each time adding prompt context from a different persona selected from a subset of personas modeling the general population. We processed the responses into sentence embeddings, computed cross-persona cosine similarity, and computed the weighted average of cross-persona cosine similarity to calculate factual consistency scores. In 100-persona experiments, scores ranged from 0.9065 to 0.7896, and the mean was 0.8656, which we adopt as a benchmark threshold. xAI's Grok-3 is most consistent, while several lightweight models rank lowest. Consistency varies by topic: the job market is least consistent, G7 world leaders most consistent, and issues like vaccines or the Israeli-Palestinian conflict diverge by provider. These results show that both the provider and the topic shape the factual consistency. We release our code and interactive demo to support reproducible evaluation and encourage persona-invariant prompting strategies.
139. 【2510.13851】EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing
链接:https://arxiv.org/abs/2510.13851
作者:Sicheng Lyu,Yu Gu,Xinyu Wang,Jerry Huang,Sitao Luan,Yufei Cui,Xiao-Wen Chang,Peng Lu
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large language models, Large language, require continual updates, require continual, rectify outdated
备注:
点击查看摘要
Abstract:Large language models (LLMs) require continual updates to rectify outdated or erroneous knowledge. Model editing has emerged as a compelling paradigm for introducing targeted modifications without the computational burden of full retraining. Existing approaches are mainly based on a locate-then-edit framework. However, in sequential editing contexts, where multiple updates are applied over time, they exhibit significant limitations and suffer from catastrophic interference, i.e., new edits compromise previously integrated updates and degrade preserved knowledge. To address these challenges, we introduce EvoEdit, a novel editing strategy that mitigates catastrophic interference through sequential null-space alignment, enabling stable and efficient model editing. By performing sequential null-space alignment for each incoming edit, EvoEdit preserves both original and previously modified knowledge representations and maintains output invariance on preserved knowledge even across long edit sequences, effectively mitigating interference. Evaluations on real-world sequential knowledge-editing benchmarks show that EvoEdit achieves better or comparable performance than prior state-of-the-art locate-then-edit techniques, with up to 3.53 times speedup. Overall, these results underscore the necessity of developing more principled approaches for designing LLMs in dynamically evolving information settings, while providing a simple yet effective solution with strong theoretical guarantees.
140. 【2510.13850】Revisiting the UID Hypothesis in LLM Reasoning Traces
链接:https://arxiv.org/abs/2510.13850
作者:Minju Gwak,Guijin Son,Jaehyung Kim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, Large language, Uniform Information Density, hard to interpret, solve problems
备注:
点击查看摘要
Abstract:Large language models (LLMs) often solve problems using step-by-step Chain-of-Thought (CoT) reasoning, yet these intermediate steps are frequently unfaithful or hard to interpret. Inspired by the Uniform Information Density (UID) hypothesis in psycholinguistics -- which posits that humans communicate by maintaining a stable flow of information -- we introduce entropy-based metrics to analyze the information flow within reasoning traces. Surprisingly, across three challenging mathematical benchmarks, we find that successful reasoning in LLMs is globally non-uniform: correct solutions are characterized by uneven swings in information density, in stark contrast to human communication patterns. This result challenges assumptions about machine reasoning and suggests new directions for designing interpretable and adaptive reasoning models.
141. 【2510.13849】Language steering in latent space to mitigate unintended code-switching
链接:https://arxiv.org/abs/2510.13849
作者:Andrey Goncharov,Nikolai Kondusov,Alexey Zaytsev
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Multilingual Large Language, Multilingual Large, Large Language Models, exhibit unintended code-switching, reducing reliability
备注:
点击查看摘要
Abstract:Multilingual Large Language Models (LLMs) often exhibit unintended code-switching, reducing reliability in downstream tasks. We propose latent-space language steering, a lightweight inference-time method that identifies language directions via PCA on parallel translations and steers token embeddings along these axes to control language identity. Our approach mitigates code-switching while preserving semantics with negligible computational overhead and requires only minimal parallel data for calibration. Empirically, we achieve 95-99\% language classification accuracy using a single principal component and reduce next-token distributional divergence by up to 42% across multiple language pairs on Qwen2.5 and Llama-3.2 models. We further analyze the layer-wise evolution of language representations, revealing that language identity concentrates in final layers with near-perfect linear separability.
142. 【2510.13848】On-device System of Compositional Multi-tasking in Large Language Models
链接:https://arxiv.org/abs/2510.13848
作者:Ondrej Bohdal,Konstantinos Theodosiadis,Asterios Mpatziakas,Dimitris Filippidis,Iro Spyrou,Christos Zonios,Anastasios Drosou,Dimosthenis Ioannidis,Kyeng-Hun Lee,Jijoong Moon,Hyeonmok Ko,Mete Ozay,Umberto Michieli
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large language models, Large language, diverse downstream tasks, parameter-efficient fine-tuning techniques, language models
备注: Accepted at EMNLP 2025 (industry track)
点击查看摘要
Abstract:Large language models (LLMs) are commonly adapted for diverse downstream tasks via parameter-efficient fine-tuning techniques such as Low-Rank Adapters (LoRA). While adapters can be combined to handle multiple tasks separately, standard approaches struggle when targeting the simultaneous execution of complex tasks, such as generating a translated summary from a long conversation. To address this challenge, we propose a novel approach tailored specifically for compositional multi-tasking scenarios involving summarization and translation. Our technique involves adding a learnable projection layer on top of the combined summarization and translation adapters. This design enables effective integration while maintaining efficiency through reduced computational overhead compared to alternative strategies requiring extensive retraining or sequential processing. We demonstrate the practical viability of our method within an on-device environment by developing an Android app capable of executing compositional tasks seamlessly. Experimental results indicate our solution performs well and is fast in both cloud-based and on-device implementations, highlighting the potential benefits of adopting our framework in real-world applications demanding high-speed operation alongside resource constraints.
143. 【2510.13847】DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models
链接:https://arxiv.org/abs/2510.13847
作者:Jinbin Zhang,Nasib Ullah,Erik Schultheis,Rohit Babbar
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:accelerate LLM inference, Speculative decoding, speculative sampling, large target model, target model verifies
备注:
点击查看摘要
Abstract:Speculative decoding (a.k.a. speculative sampling) has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recently, scaling of the LLM vocabulary has pushed the number of tokens to grow substantially. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter's output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter's vocabulary to a fixed subset of the target model's vocabulary, ranked in descending order of token frequency. Although this reduces draft-time compute, it is brittle, since: (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter's shortlist, while verification retains the full vocabulary and exactness. The meta-classifier finishes its computation earlier than the drafter's hidden state generation by exploiting parallel execution of draft encoding and meta shortlisting on separate streams. On standard speculative-decoding benchmarks, we observe consistent gains in mean accepted length over fixed-shortlist baselines, while context-dependent selection enables smaller shortlists without degrading acceptance.
144. 【2510.13843】Serialized EHR make for good text representations
链接:https://arxiv.org/abs/2510.13843
作者:Zhirong Chou,Quan Qin,Shi Li
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Electronic Health Records, scale clinical data, large scale clinical, learning generalizable representations, opened new avenues
备注:
点击查看摘要
Abstract:The emergence of foundation models in healthcare has opened new avenues for learning generalizable representations from large scale clinical data. Yet, existing approaches often struggle to reconcile the tabular and event based nature of Electronic Health Records (EHRs) with the sequential priors of natural language models. This structural mismatch limits their ability to capture longitudinal dependencies across patient encounters. We introduce SerialBEHRT, a domain aligned foundation model that extends SciBERT through additional pretraining on structured EHR sequences. SerialBEHRT is designed to encode temporal and contextual relationships among clinical events, thereby producing richer patient representations. We evaluate its effectiveness on the task of antibiotic susceptibility prediction, a clinically meaningful problem in antibiotic stewardship. Through extensive benchmarking against state of the art EHR representation strategies, we demonstrate that SerialBEHRT achieves superior and more consistent performance, highlighting the importance of temporal serialization in foundation model pretraining for healthcare.
145. 【2510.13842】ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking
链接:https://arxiv.org/abs/2510.13842
作者:Yutao Wu,Xiao Liu,Yinghui Li,Yifeng Gao,Yifan Ding,Jiale Ding,Xiang Zheng,Xingjun Ma
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
关键词:Large Language Models, tricking Large Language, Language Models, Large Language, producing attacker-controlled outputs
备注:
点击查看摘要
Abstract:Knowledge poisoning poses a critical threat to Retrieval-Augmented Generation (RAG) systems by injecting adversarial content into knowledge bases, tricking Large Language Models (LLMs) into producing attacker-controlled outputs grounded in manipulated context. Prior work highlights LLMs' susceptibility to misleading or malicious retrieved content. However, real-world fact-checking scenarios are more challenging, as credible evidence typically dominates the retrieval pool. To investigate this problem, we extend knowledge poisoning to the fact-checking setting, where retrieved context includes authentic supporting or refuting evidence. We propose \textbf{ADMIT} (\textbf{AD}versarial \textbf{M}ulti-\textbf{I}njection \textbf{T}echnique), a few-shot, semantically aligned poisoning attack that flips fact-checking decisions and induces deceptive justifications, all without access to the target LLMs, retrievers, or token-level control. Extensive experiments show that ADMIT transfers effectively across 4 retrievers, 11 LLMs, and 4 cross-domain benchmarks, achieving an average attack success rate (ASR) of 86\% at an extremely low poisoning rate of $0.93 \times 10^{-6}$, and remaining robust even in the presence of strong counter-evidence. Compared with prior state-of-the-art attacks, ADMIT improves ASR by 11.2\% across all settings, exposing significant vulnerabilities in real-world RAG-based fact-checking systems.
146. 【2510.13839】Meronymic Ontology Extraction via Large Language Models
链接:https://arxiv.org/abs/2510.13839
作者:Dekai Zhang,Simone Conia,Antonio Rago
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:today digital age, essential in today, today digital, digital age, organising the vast
备注:
点击查看摘要
Abstract:Ontologies have become essential in today's digital age as a way of organising the vast amount of readily available unstructured text. In providing formal structure to this information, ontologies have immense value and application across various domains, e.g., e-commerce, where countless product listings necessitate proper product organisation. However, the manual construction of these ontologies is a time-consuming, expensive and laborious process. In this paper, we harness the recent advancements in large language models (LLMs) to develop a fully-automated method of extracting product ontologies, in the form of meronymies, from raw review texts. We demonstrate that the ontologies produced by our method surpass an existing, BERT-based baseline when evaluating using an LLM-as-a-judge. Our investigation provides the groundwork for LLMs to be used more generally in (product or otherwise) ontology extraction.
147. 【2510.13837】Seeing Hate Differently: Hate Subspace Modeling for Culture-Aware Hate Speech Detection
链接:https://arxiv.org/abs/2510.13837
作者:Weibin Cai,Reza Zafarani
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
关键词:Hate speech detection, considered hate vary, extensively studied, real-world complexity, speech detection
备注:
点击查看摘要
Abstract:Hate speech detection has been extensively studied, yet existing methods often overlook a real-world complexity: training labels are biased, and interpretations of what is considered hate vary across individuals with different cultural backgrounds. We first analyze these challenges, including data sparsity, cultural entanglement, and ambiguous labeling. To address them, we propose a culture-aware framework that constructs individuals' hate subspaces. To alleviate data sparsity, we model combinations of cultural attributes. For cultural entanglement and ambiguous labels, we use label propagation to capture distinctive features of each combination. Finally, individual hate subspaces, which in turn can further enhance classification performance. Experiments show our method outperforms state-of-the-art by 1.05\% on average across all metrics.
148. 【2510.13836】SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models
链接:https://arxiv.org/abs/2510.13836
作者:Debarun Bhattacharjya,Balaji Ganesan,Junkyu Lee,Radu Marinescu,Katsiaryna Mirylenka,Michael Glass,Xiao Shou
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language model, large language, LLM generated output, LLM, language model
备注: 15 pages including appendix, Findings of EMNLP 2025
点击查看摘要
Abstract:When does a large language model (LLM) know what it does not know? Uncertainty quantification (UQ) provides measures of uncertainty, such as an estimate of the confidence in an LLM's generated output, and is therefore increasingly recognized as a crucial component of trusted AI systems. Black-box UQ methods do not require access to internal model information from the generating LLM and therefore have numerous real-world advantages, such as robustness to system changes, adaptability to choice of LLM, reduced costs, and computational tractability. In this paper, we investigate the effectiveness of UQ techniques that are primarily but not necessarily entirely black-box, where the consistency between a generated output and other sampled generations is used as a proxy for confidence in its correctness. We propose a high-level non-verbalized similarity-based aggregation framework that subsumes a broad swath of UQ approaches suitable for complex generative tasks, as well as introduce specific novel techniques from the framework that train confidence estimation models using small training sets. Through an empirical study with datasets spanning the diverse tasks of question answering, summarization, and text-to-SQL, we demonstrate that our proposed similarity-based methods can yield better calibrated confidences than baselines.
149. 【2510.13835】ConDABench: Interactive Evaluation of Language Models for Data Analysis
链接:https://arxiv.org/abs/2510.13835
作者:Avik Dutta,Priyanshu Gupta,Hosein Hasanbeig,Rahul Pratap Singh,Harshit Nigam,Sumit Gulwani,Arjun Radhakrishna,Gustavo Soares,Ashish Tiwari
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Real-world data analysis, Real-world data, data analysis, under-specified goals, goals and unclean
备注:
点击查看摘要
Abstract:Real-world data analysis tasks often come with under-specified goals and unclean data. User interaction is necessary to understand and disambiguate a user's intent, and hence, essential to solving these complex tasks. Existing benchmarks for evaluating LLMs on data analysis tasks do not capture these complexities or provide first-class support for interactivity. We introduce ConDABench, a framework for generating conversational data analysis (ConDA) benchmarks and evaluating external tools on the generated benchmarks. \bench consists of (a) a multi-agent workflow for generating realistic benchmarks from articles describing insights gained from public datasets, (b) 1,420 ConDA problems generated using this workflow, and (c) an evaluation harness that, for the first time, makes it possible to systematically evaluate conversational data analysis tools on the generated ConDA problems. Evaluation of state-of-the-art LLMs on the benchmarks reveals that while the new generation of models are better at solving more instances, they are not necessarily better at solving tasks that require sustained, long-form engagement. ConDABench is an avenue for model builders to measure progress towards truly collaborative models that can complete complex interactive tasks.
150. 【2510.13832】Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning
链接:https://arxiv.org/abs/2510.13832
作者:Minsik Choi,Hyegang Son,Changhoon Kim,Young Geun Kim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:NLP tasks, achieved remarkable performance, performance in NLP, Transformer-based models, achieved remarkable
备注: 32 pages
点击查看摘要
Abstract:Transformer-based models have achieved remarkable performance in NLP tasks. However, their structural characteristics-multiple layers and attention heads-introduce efficiency challenges in inference and deployment. To address these challenges, various pruning methods have recently been proposed. Notably, gradient-based methods using Head Importance Scores (HIS) have gained traction for interpretability, efficiency, and ability to identify redundant heads. However, HIS alone has limitations as it captures only the gradient-driven contribution, overlooking the diversity of attention patterns. To overcome these limitations, we introduce a novel pruning criterion, HIES (Head Importance-Entropy Score), which integrates head importance scores with attention entropy, providing complementary evidence on per-head contribution. Empirically, HIES-based pruning yields up to 15.2% improvement in model quality and 2.04x improvement in stability over HIS-only methods, enabling substantial model compression without sacrificing either accuracy or stability. Code will be released upon publication.
151. 【2510.13831】Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference
链接:https://arxiv.org/abs/2510.13831
作者:Chao Han,Yijuan Liang,Zihao Xuan,Daokuan Wu,Wei Zhang,Xiaoyu Shen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:high inference cost, inference cost, deployment of large, real-world applications, applications is increasingly
备注:
点击查看摘要
Abstract:The deployment of large language models (LLMs) in real-world applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively activating model components per token, existing methods rely on greedy routing--a myopic execute-or-skip mechanism that often leads to irreversible information loss and suboptimal token selection. This paper introduces informed routing, a new paradigm that proactively addresses these issues. The key insight is to assess not only a token's immediate importance but also its recoverability, i.e., how well its transformation can be approximated. To this end, we propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit's output before routing decisions are made. This enables a flexible execute-or-approximate policy that preserves model fidelity while drastically reducing computation. Extensive experiments on both language modeling and reasoning tasks show that informed routing achieves state-of-the-art efficiency-performance trade-offs across multiple sparsity levels. Notably, even without final LoRA fine-tuning, our method matches or surpasses strong baselines that require full fine-tuning, all while reducing training time by over 50%. The code is available at: this https URL
152. 【2510.13830】Users as Annotators: LLM Preference Learning from Comparison Mode
链接:https://arxiv.org/abs/2510.13830
作者:Zhongze Cai,Xiaocheng Li
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, Pairwise preference data, played an important, important role, large language
备注:
点击查看摘要
Abstract:Pairwise preference data have played an important role in the alignment of large language models (LLMs). Each sample of such data consists of a prompt, two different responses to the prompt, and a binary label indicating which of the two responses is better. The labels are usually annotated by professional human annotators. In this paper, we consider an alternative approach to collect pairwise preference data -- user annotation from comparison mode. With the increasingly wider adoption of LLMs among the population, users are contributing more and more of their preference labels through their daily interactions with the LLMs. The upside of such labels is that users are the best experts in judging the responses to their own queries/prompts, but the downside is the lack of quality control in these labels. In this paper, we consider a new idea of generating two responses from two different models or two different versions of the same model. The asymmetry allows us to make an inference of the user's data quality through our proposed user behavior model. We develop an expectation-maximization algorithm to estimate a latent quality factor of the user, and filter users' annotation data accordingly. The downstream task shows the effectiveness of our approach in both capturing the user behavior and data filtering for LLM alignment.
153. 【2510.13829】A Linguistics-Aware LLM Watermarking via Syntactic Predictability
链接:https://arxiv.org/abs/2510.13829
作者:Shinwoo Park,Hyejin Park,Hyeseon Ahn,Yo-Sub Han
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:reliable governance tools, continue to advance, advance rapidly, reliable governance, governance tools
备注:
点击查看摘要
Abstract:As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthen it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages-analytic English, isolating Chinese, and agglutinative Korean-we show that STELA surpasses prior methods in detection robustness. Our code is available at this https URL.
154. 【2510.13828】From Explainability to Action: A Generative Operational Framework for Integrating XAI in Clinical Mental Health Screening
链接:https://arxiv.org/abs/2510.13828
作者:Ratna Kandala,Akshata Kishore Moharir,Divya Arvinda Nayak
类目:Computation and Language (cs.CL)
关键词:Explainable Artificial Intelligence, mental health screening, Explainable Artificial, Artificial Intelligence, health screening
备注:
点击查看摘要
Abstract:Explainable Artificial Intelligence (XAI) has been presented as the critical component for unlocking the potential of machine learning in mental health screening (MHS). However, a persistent lab-to-clinic gap remains. Current XAI techniques, such as SHAP and LIME, excel at producing technically faithful outputs such as feature importance scores, but fail to deliver clinically relevant, actionable insights that can be used by clinicians or understood by patients. This disconnect between technical transparency and human utility is the primary barrier to real-world adoption. This paper argues that this gap is a translation problem and proposes the Generative Operational Framework, a novel system architecture that leverages Large Language Models (LLMs) as a central translation engine. This framework is designed to ingest the raw, technical outputs from diverse XAI tools and synthesize them with clinical guidelines (via RAG) to automatically generate human-readable, evidence-backed clinical narratives. To justify our solution, we provide a systematic analysis of the components it integrates, tracing the evolution from intrinsic models to generative XAI. We demonstrate how this framework directly addresses key operational barriers, including workflow integration, bias mitigation, and stakeholder-specific communication. This paper also provides a strategic roadmap for moving the field beyond the generation of isolated data points toward the delivery of integrated, actionable, and trustworthy AI in clinical practice.
155. 【2510.13827】Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL
链接:https://arxiv.org/abs/2510.13827
作者:Ashish Kattamuri,Ishita Prasad,Meetu Malhotra,Arpita Vats,Rahul Raja,Albert Lie
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Relative Policy Optimization, Group Relative Policy, contrastive reward signal, semantic accuracy, reward signal
备注: 20th International Workshop on Semantic and Social Media Adaptation Personalization
点击查看摘要
Abstract:Current Text-to-SQL methods are evaluated and only focused on executable queries, overlooking the semantic alignment challenge -- both in terms of the semantic meaning of the query and the correctness of the execution results. Even execution accuracy itself shows significant drops when moving from English to other languages, with an average decline of 6 percentage points across non-English languages. We address these challenges by presenting a new framework that combines Group Relative Policy Optimization (GRPO) within a multilingual contrastive reward signal to enhance both task efficiency and semantic accuracy in Text-to-SQL systems in cross-lingual scenarios. Our method teaches models to obtain better correspondence between SQL generation and user intent by combining a reward signal based on semantic similarity. On the seven-language MultiSpider dataset, fine-tuning the LLaMA-3-3B model with GRPO improved the execution accuracy up to 87.4 percent (+26 pp over zero-shot) and semantic accuracy up to 52.29 percent (+32.86 pp). Adding our contrastive reward signal in the GRPO framework further improved the average semantic accuracy to 59.14 percent (+6.85 pp, up to +10 pp for Vietnamese). Our experiments showcase that a smaller, parameter-efficient 3B LLaMA model fine-tuned with our contrastive reward signal outperforms a much larger zero-shot 8B LLaMA model, with an uplift of 7.43 pp in execution accuracy (from 81.43 percent on the 8B model to 88.86 percent on the 3B model), and nearly matches its semantic accuracy (59.14 percent vs. 68.57 percent) -- all using just 3,000 reinforcement learning training examples. These results demonstrate how we can improve the performance of Text-to-SQL systems with contrastive rewards for directed semantic alignment, without requiring large-scale training datasets.
156. 【2510.13811】Generative AI in Heritage Practice: Improving the Accessibility of Heritage Guidance
链接:https://arxiv.org/abs/2510.13811
作者:Jessica Witte,Edmund Lee,Lisa Brausem,Verity Shillabeer,Chiara Bonacchi
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Generative Artificial Intelligence, integrating Generative Artificial, Artificial Intelligence, Generative Artificial, public-facing guidance documents
备注: 21 pages
点击查看摘要
Abstract:This paper discusses the potential for integrating Generative Artificial Intelligence (GenAI) into professional heritage practice with the aim of enhancing the accessibility of public-facing guidance documents. We developed HAZEL, a GenAI chatbot fine-tuned to assist with revising written guidance relating to heritage conservation and interpretation. Using quantitative assessments, we compare HAZEL's performance to that of ChatGPT (GPT-4) in a series of tasks related to the guidance writing process. The results of this comparison indicate a slightly better performance of HAZEL over ChatGPT, suggesting that the GenAI chatbot is more effective once the underlying large language model (LLM) has been fine-tuned. However, we also note significant limitations, particularly in areas requiring cultural sensitivity and more advanced technical expertise. These findings suggest that, while GenAI cannot replace human heritage professionals in technical authoring tasks, its potential to automate and expedite certain aspects of guidance writing could offer valuable benefits to heritage organisations, especially in resource-constrained contexts.
信息检索
1. 【2510.14880】Fantastic (small) Retrievers and How to Train Them: mxbai-edge-colbert-v0 Tech Report
链接:https://arxiv.org/abs/2510.14880
作者:Rikiya Takehi,Benjamin Clavié,Sean Lee,Aamir Shakir
类目:Information Retrieval (cs.IR)
关键词:parameter counts, models, Abstract, retrieval, work
备注:
点击查看摘要
Abstract:In this work, we introduce mxbai-edge-colbert-v0 models, at two different parameter counts: 17M and 32M. As part of our research, we conduct numerous experiments to improve retrieval and late-interaction models, which we intend to distill into smaller models as proof-of-concepts. Our ultimate aim is to support retrieval at all scales, from large-scale retrieval which lives in the cloud to models that can run locally, on any device. mxbai-edge-colbert-v0 is a model that we hope will serve as a solid foundation backbone for all future experiments, representing the first version of a long series of small proof-of-concepts. As part of the development of mxbai-edge-colbert-v0, we conducted multiple ablation studies, of which we report the results. In terms of downstream performance, mxbai-edge-colbert-v0 is a particularly capable small model, outperforming ColBERTv2 on common short-text benchmarks (BEIR) and representing a large step forward in long-context tasks, with unprecedented efficiency.
2. 【2510.14857】A Simulation Framework for Studying Systemic Effects of Feedback Loops in Recommender Systems
链接:https://arxiv.org/abs/2510.14857
作者:Gabriele Barlacchi,Margherita Lalli,Emanuele Ferragina,Fosca Giannotti,Luca Pappalardo
类目:Information Retrieval (cs.IR); Computers and Society (cs.CY)
关键词:collective market dynamics, systems continuously interact, market dynamics, creating feedback loops, continuously interact
备注: 12 pages, 4 figures
点击查看摘要
Abstract:Recommender systems continuously interact with users, creating feedback loops that shape both individual behavior and collective market dynamics. This paper introduces a simulation framework to model these loops in online retail environments, where recommenders are periodically retrained on evolving user-item interactions. Using the Amazon e-Commerce dataset, we analyze how different recommendation algorithms influence diversity, purchase concentration, and user homogenization over time. Results reveal a systematic trade-off: while the feedback loop increases individual diversity, it simultaneously reduces collective diversity and concentrates demand on a few popular items. Moreover, for some recommender systems, the feedback loop increases user homogenization over time, making user purchase profiles increasingly similar. These findings underscore the need for recommender designs that balance personalization with long-term diversity.
3. 【2510.14824】Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking
链接:https://arxiv.org/abs/2510.14824
作者:Ziqi Dai,Xin Zhang,Mingxin Li,Yanzhao Zhang,Dingkun Long,Pengjun Xie,Meishan Zhang,Wenjie Li,Min Zhang
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:binary label prediction, relevant query-document pairs, metric learning, binary label, relevance vs. irrelevance
备注:
点击查看摘要
Abstract:In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts ''yes'' (resp. ''no'') token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking the universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: weight, which controls the magnitude of those updates, and direction, which guides the model updates, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark. We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.
4. 【2510.14788】Cross-Scenario Unified Modeling of User Interests at Billion Scale
链接:https://arxiv.org/abs/2510.14788
作者:Manjie Xu,Cheng Chen,Xin Jia,Jingyi Zhou,Yongji Wu,Zejian Wang,Chi Zhang,Kai Zuo,Yibo Chen,Xu Tang,Yao Hu,Yixin Zhu
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:feed browsing, manifesting through complex, complex behavioral patterns, User, User interests
备注: The dataset, code, and models will be released soon
点击查看摘要
Abstract:User interests on content platforms are inherently diverse, manifesting through complex behavioral patterns across heterogeneous scenarios such as search, feed browsing, and content discovery. Traditional recommendation systems typically prioritize business metric optimization within isolated specific scenarios, neglecting cross-scenario behavioral signals and struggling to integrate advanced techniques like LLMs at billion-scale deployments, which finally limits their ability to capture holistic user interests across platform touchpoints. We propose RED-Rec, an LLM-enhanced hierarchical Recommender Engine for Diversified scenarios, tailored for industry-level content recommendation systems. RED-Rec unifies user interest representations across multiple behavioral contexts by aggregating and synthesizing actions from varied scenarios, resulting in comprehensive item and user modeling. At its core, a two-tower LLM-powered framework enables nuanced, multifaceted representations with deployment efficiency, and a scenario-aware dense mixing and querying policy effectively fuses diverse behavioral signals to capture cross-scenario user intent patterns and express fine-grained, context-specific intents during serving. We validate RED-Rec through online A/B testing on hundreds of millions of users in RedNote through online A/B testing, showing substantial performance gains in both content recommendation and advertisement targeting tasks. We further introduce a million-scale sequential recommendation dataset, RED-MMU, for comprehensive offline training and evaluation. Our work advances unified user modeling, unlocking deeper personalization and fostering more meaningful user engagement in large-scale UGC platforms.
5. 【2510.14704】Dataset Pruning in RecSys and ML: Best Practice or Mal-Practice?
链接:https://arxiv.org/abs/2510.14704
作者:Leonie Winter
类目:Information Retrieval (cs.IR)
关键词:system research depend, research depend heavily, Offline evaluations, recommender system research, MovieLens collections
备注: 69 pages, 14 figures
点击查看摘要
Abstract:Offline evaluations in recommender system research depend heavily on datasets, many of which are pruned, such as the widely used MovieLens collections. This thesis examines the impact of data pruning - specifically, removing users with fewer than a specified number of interactions - on both dataset characteristics and algorithm performance. Five benchmark datasets were analysed in both their unpruned form and at five successive pruning levels (5, 10, 20, 50, 100). For each coreset, we examined structural and distributional characteristics and trained and tested eleven representative algorithms. To further assess if pruned datasets lead to artificially inflated performance results, we also evaluated models trained on the pruned train sets but tested on unpruned data. Results show that commonly applied core pruning can be highly selective, leaving as little as 2% of the original users in some datasets. Traditional algorithms achieved higher nDCG@10 scores when both training and testing on pruned data; however, this advantage largely disappeared when evaluated on unpruned test sets. Across all algorithms, performance declined with increasing pruning levels when tested on unpruned data, highlighting the impact of dataset reduction on the performance of recommender algorithms.
6. 【2510.14670】ITAN: Graph-Executable Reasoning for Cyber Threat Intelligence
链接:https://arxiv.org/abs/2510.14670
作者:Marco Simoni,Aleksandar Fontana,Andrea Saracino,Paolo Mori
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
关键词:Automated Navigation, Intelligence Through Automated, connects natural-language cyber, natural-language cyber threat, cyber threat queries
备注:
点击查看摘要
Abstract:TITAN (Threat Intelligence Through Automated Navigation) is a framework that connects natural-language cyber threat queries with executable reasoning over a structured knowledge graph. It integrates a path planner model, which predicts logical relation chains from text, and a graph executor that traverses the TITAN Ontology to retrieve factual answers and supporting evidence. Unlike traditional retrieval systems, TITAN operates on a typed, bidirectional graph derived from MITRE, allowing reasoning to move clearly and reversibly between threats, behaviors, and defenses. To support training and evaluation, we introduce the TITAN Dataset, a corpus of 88209 examples (Train: 74258; Test: 13951) pairing natural language questions with executable reasoning paths and step by step Chain of Thought explanations. Empirical evaluations show that TITAN enables models to generate syntactically valid and semantically coherent reasoning paths that can be deterministically executed on the underlying graph.
7. 【2510.14660】An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs
链接:https://arxiv.org/abs/2510.14660
作者:Linyue Ma,Yilong Xu,Xiang Long,Zhi Zheng
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Large Language Models, empowers Large Language, augmentation empowers Large, Large Language, Language Models
备注:
点击查看摘要
Abstract:Search augmentation empowers Large Language Models with retrieval capabilities to overcome the limitations imposed by static parameters. Recently, Reinforcement Learning leverages tailored reward signals as a viable technique to enhance LLMs performing tasks involving search. However, existing reward modeling for search-augmented LLMs faces several limitations. Rule-based rewards, such as Exact Match, are verifiable but fragile to variations in expression and cannot be applied to long-form workloads. In contrast, generative rewards improve robustness, but designing verifiable and stable rewards for long-form workloads in dynamic corpora remains challenging and also incurs high computational costs. In this paper, we propose a unified and verifiable paradigm, "nugget-as-rubric", which treats atomic information points as structured evaluation criteria for different search-augmentation workloads. Short-form tasks correspond to a single rubric, whereas long-form tasks expand to multiple rubrics aligned with the question's information needs. To support long-form settings, we design an automatic rubric construction pipeline based on query rewriting, which can automatically retrieve passages relevant to each question and extract rubrics from them, both from static corpora and from dynamic online web content. Furthermore, we introduce \textbf{Search-Gen-V}, a 4B-parameter efficient generative verifier under our proposed verifiable paradigm, which is trained via the idea of distillation and a two-stage strategy. Experimental results show that Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient verifiable reward constructor for search-augmented LLMs.
8. 【2510.14641】Causality Enhancement for Cross-Domain Recommendation
链接:https://arxiv.org/abs/2510.14641
作者:Zhibo Wu,Yunfan Wu,Lin Jiang,Ping Yang,Yao Hu
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:source domain tasks, forms a crucial, crucial component, source domain, Cross-domain recommendation forms
备注:
点击查看摘要
Abstract:Cross-domain recommendation forms a crucial component in recommendation systems. It leverages auxiliary information through source domain tasks or features to enhance target domain recommendations. However, incorporating inconsistent source domain tasks may result in insufficient cross-domain modeling or negative transfer. While incorporating source domain features without considering the underlying causal relationships may limit their contribution to final predictions. Thus, a natural idea is to directly train a cross-domain representation on a causality-labeled dataset from the source to target domain. Yet this direction has been rarely explored, as identifying unbiased real causal labels is highly challenging in real-world scenarios. In this work, we attempt to take a first step in this direction by proposing a causality-enhanced framework, named CE-CDR. Specifically, we first reformulate the cross-domain recommendation as a causal graph for principled guidance. We then construct a causality-aware dataset heuristically. Subsequently, we derive a theoretically unbiased Partial Label Causal Loss to generalize beyond the biased causality-aware dataset to unseen cross-domain patterns, yielding an enriched cross-domain representation, which is then fed into the target model to enhance target-domain recommendations. Theoretical and empirical analyses, as well as extensive experiments, demonstrate the rationality and effectiveness of CE-CDR and its general applicability as a model-agnostic plugin. Moreover, it has been deployed in production since April 2025, showing its practical value in real-world applications.
9. 【2510.14640】Intent Clustering with Shared Pseudo-Labels
链接:https://arxiv.org/abs/2510.14640
作者:I-Fan Lin,Faegheh Hasibi,Suzan Verberne
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:makes minimal assumptions, propose an intuitive, training-free and label-free, intent clustering, clustering that makes
备注:
点击查看摘要
Abstract:In this paper, we propose an intuitive, training-free and label-free method for intent clustering that makes minimal assumptions using lightweight and open-source LLMs. Many current approaches rely on commercial LLMs, which are costly, and offer limited transparency. Additionally, their methods often explicitly depend on knowing the number of clusters in advance, which is often not the case in realistic settings. To address these challenges, instead of asking the LLM to match similar text directly, we first ask it to generate pseudo-labels for each text, and then perform multi-label classification in this pseudo-label set for each text. This approach is based on the hypothesis that texts belonging to the same cluster will share more labels, and will therefore be closer when encoded into embeddings. These pseudo-labels are more human-readable than direct similarity matches. Our evaluation on four benchmark sets shows that our approach achieves results comparable to and better than recent baselines, while remaining simple and computationally efficient. Our findings indicate that our method can be applied in low-resource scenarios and is stable across multiple models and datasets.
10. 【2510.14629】MR.Rec: Synergizing Memory and Reasoning for Personalized Recommendation Assistant with LLMs
链接:https://arxiv.org/abs/2510.14629
作者:Jiani Huang,Xingchen Zou,Lianghao Xia,Qing Li
类目:Information Retrieval (cs.IR)
关键词:Large Language Models, Language Models, Large Language, faces key challenges, application of Large
备注:
点击查看摘要
Abstract:The application of Large Language Models (LLMs) in recommender systems faces key challenges in delivering deep personalization and intelligent reasoning, especially for interactive scenarios. Current methods are often constrained by limited context windows and single-turn reasoning, hindering their ability to capture dynamic user preferences and proactively reason over recommendation contexts. To address these limitations, we propose this http URL, a novel framework that synergizes memory and reasoning for LLM-based recommendations. To achieve personalization, we develop a comprehensive Retrieval-Augmented Generation (RAG) system that efficiently indexes and retrieves relevant external memory to enhance LLM personalization capabilities. Furthermore, to enable the synergy between memory and reasoning, our RAG system goes beyond conventional query-based retrieval by integrating reasoning enhanced memory retrieval. Finally, we design a reinforcement learning framework that trains the LLM to autonomously learn effective strategies for both memory utilization and reasoning refinement. By combining dynamic memory retrieval with adaptive reasoning, this approach ensures more accurate, context-aware, and highly personalized recommendations. Extensive experiments demonstrate that this http URL significantly outperforms state-of-the-art baselines across multiple metrics, validating its efficacy in delivering intelligent and personalized recommendations. We will release code and data upon paper notification.
11. 【2510.14626】GemiRec: Interest Quantization and Generation for Multi-Interest Recommendation
链接:https://arxiv.org/abs/2510.14626
作者:Zhibo Wu,Yunfan Wu,Quan Liu,Lin Jiang,Ping Yang,Yao Hu
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:gained attention, interest, industrial retrieval stage, retrieval stage, user
备注:
点击查看摘要
Abstract:Multi-interest recommendation has gained attention, especially in industrial retrieval stage. Unlike classical dual-tower methods, it generates multiple user representations instead of a single one to model comprehensive user interests. However, prior studies have identified two underlying limitations: the first is interest collapse, where multiple representations homogenize. The second is insufficient modeling of interest evolution, as they struggle to capture latent interests absent from a user's historical behavior. We begin with a thorough review of existing works in tackling these limitations. Then, we attempt to tackle these limitations from a new perspective. Specifically, we propose a framework-level refinement for multi-interest recommendation, named GemiRec. The proposed framework leverages interest quantization to enforce a structural interest separation and interest generation to learn the evolving dynamics of user interests explicitly. It comprises three modules: (a) Interest Dictionary Maintenance Module (IDMM) maintains a shared quantized interest dictionary. (b) Multi-Interest Posterior Distribution Module (MIPDM) employs a generative model to capture the distribution of user future interests. (c) Multi-Interest Retrieval Module (MIRM) retrieves items using multiple user-interest representations. Both theoretical and empirical analyses, as well as extensive experiments, demonstrate its advantages and effectiveness. Moreover, it has been deployed in production since March 2025, showing its practical value in industrial applications.
12. 【2510.14592】Multimodal RAG for Unstructured Data:Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval
链接:https://arxiv.org/abs/2510.14592
作者:Rashmi R,Vidyadhar Upadhya
类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)
关键词:Current Retrieval-Augmented Generation, Current Retrieval-Augmented, Retrieval-Augmented Generation, unimodal textual data, limiting their effectiveness
备注: 12 pages, 6 figures, submitted for review
点击查看摘要
Abstract:Current Retrieval-Augmented Generation (RAG) systems primarily operate on unimodal textual data, limiting their effectiveness on unstructured multimodal documents. Such documents often combine text, images, tables, equations, and graphs, each contributing unique information. In this work, we present a Modality-Aware Hybrid retrieval Architecture (MAHA), designed specifically for multimodal question answering with reasoning through a modality-aware knowledge graph. MAHA integrates dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. This design enables both semantically rich and context-aware retrieval across diverse modalities. Evaluations on multiple benchmark datasets demonstrate that MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of 0.486, providing complete modality coverage. These results highlight MAHA's ability to combine embeddings with explicit document structure, enabling effective multimodal retrieval. Our work establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.
13. 【2510.14545】Agentic Entropy-Balanced Policy Optimization
链接:https://arxiv.org/abs/2510.14545
作者:Guanting Dong,Licheng Bao,Zhongyuan Wang,Kangzhi Zhao,Xiaoxi Li,Jiajie Jin,Jinghan Yang,Hangyu Mao,Fuzheng Zhang,Kun Gai,Guorui Zhou,Yutao Zhu,Ji-Rong Wen,Zhicheng Dou
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Agentic Reinforcement Learning, long-horizon tool-use capabilities, made significant progress, Agentic Reinforcement, Reinforcement Learning
备注: Working in progress
点击查看摘要
Abstract:Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to the training collapse. In this paper, we delve into the challenges caused by entropy and propose the Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocate global and branch sampling budget through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues; and (2) Entropy-Balanced Policy Optimization that inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.
14. 【2510.14535】Acquisition of interpretable domain information during brain MR image harmonization for content-based image retrieval
链接:https://arxiv.org/abs/2510.14535
作者:Keima Abe,Hayato Muraki,Shuhei Tomoshige,Kenichi Oishi,Hitoshi Iyatomi
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:imaging sites due, boldsymbol, show domain shifts, degrade machine learning, protocol differences
备注: 6 pages,3 figures, 3 tables. Accepted at 2025 IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC 2025)
点击查看摘要
Abstract:Medical images like MR scans often show domain shifts across imaging sites due to scanner and protocol differences, which degrade machine learning performance in tasks such as disease classification. Domain harmonization is thus a critical research focus. Recent approaches encode brain images $\boldsymbol{x}$ into a low-dimensional latent space $\boldsymbol{z}$, then disentangle it into $\boldsymbol{z_u}$ (domain-invariant) and $\boldsymbol{z_d}$ (domain-specific), achieving strong results. However, these methods often lack interpretability$-$an essential requirement in medical applications$-$leaving practical issues unresolved. We propose Pseudo-Linear-Style Encoder Adversarial Domain Adaptation (PL-SE-ADA), a general framework for domain harmonization and interpretable representation learning that preserves disease-relevant information in brain MR images. PL-SE-ADA includes two encoders $f_E$ and $f_{SE}$ to extract $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, a decoder to reconstruct the image $f_D$, and a domain predictor $g_D$. Beyond adversarial training between the encoder and domain predictor, the model learns to reconstruct the input image $\boldsymbol{x}$ by summing reconstructions from $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, ensuring both harmonization and informativeness. Compared to prior methods, PL-SE-ADA achieves equal or better performance in image reconstruction, disease classification, and domain recognition. It also enables visualization of both domain-independent brain features and domain-specific components, offering high interpretability across the entire framework.
15. 【2510.14400】MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering
链接:https://arxiv.org/abs/2510.14400
作者:Yingpeng Ning,Yuanyuan Sun,Ling Luo,Yanhua Wang,Yuchen Pan,Hongfei Lin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:requires accurate interpretation, Biomedical question answering, question answering, requires accurate, accurate interpretation
备注:
点击查看摘要
Abstract:Biomedical question answering (QA) requires accurate interpretation of complex medical knowledge. Large language models (LLMs) have shown promising capabilities in this domain, with retrieval-augmented generation (RAG) systems enhancing performance by incorporating external medical literature. However, RAG-based approaches in biomedical QA suffer from hallucinations due to post-retrieval noise and insufficient verification of retrieved evidence, undermining response reliability. We propose MedTrust-Guided Iterative RAG, a framework designed to enhance factual consistency and mitigate hallucinations in medical QA. Our method introduces three key innovations. First, it enforces citation-aware reasoning by requiring all generated content to be explicitly grounded in retrieved medical documents, with structured Negative Knowledge Assertions used when evidence is insufficient. Second, it employs an iterative retrieval-verification process, where a verification agent assesses evidence adequacy and refines queries through Medical Gap Analysis until reliable information is obtained. Third, it integrates the MedTrust-Align Module (MTAM) that combines verified positive examples with hallucination-aware negative samples, leveraging Direct Preference Optimization to reinforce citation-grounded reasoning while penalizing hallucination-prone response patterns. Experiments on MedMCQA, MedQA, and MMLU-Med demonstrate that our approach consistently outperforms competitive baselines across multiple model architectures, achieving the best average accuracy with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.
16. 【2510.14377】PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora
链接:https://arxiv.org/abs/2510.14377
作者:Mykolas Sveistrys,Richard Kunert
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:large language models, Recent advances, language models, retrieval-augmented generation, advances in large
备注:
点击查看摘要
Abstract:Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data - medical records, compliance filings, maintenance logs - require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions and formalize them by three criteria: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has higher density of distractor documents, better reflecting practical challenges of recurring report corpora. We test a traditional RAG pipeline as well as graph-based and multimodal variants, and find that none of the tested approaches exceed 40% in statement-wise F1 score. Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a "check all documents individually, filter cheaply" approach: it (i) decomposes queries into document-level subquestions and (ii) uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning. We find that PluriHopRAG achieves relative F1 score improvements of 18-52% depending on base LLM. Despite its modest size, PluriHopWIND exposes the limitations of current QA systems on repetitive, distractor-rich corpora. PluriHopRAG's performance highlights the value of exhaustive retrieval and early filtering as a powerful alternative to top-k methods.
17. 【2510.14330】Ensembling Multiple Hallucination Detectors Trained on VLLM Internal Representations
链接:https://arxiv.org/abs/2510.14330
作者:Yuto Nakamizo,Ryuhei Miyazato,Hikaru Tanabe,Ryuta Yamakura,Kiori Hatanaka
类目:Information Retrieval (cs.IR)
关键词:Meta CRAG-MM Challenge, KDD Cup, Challenge at KDD, Meta CRAG-MM, CRAG-MM Challenge
备注: 5th place solution at Meta KDD Cup 2025
点击查看摘要
Abstract:This paper presents the 5th place solution by our team, y3h2, for the Meta CRAG-MM Challenge at KDD Cup 2025. The CRAG-MM benchmark is a visual question answering (VQA) dataset focused on factual questions about images, including egocentric images. The competition was contested based on VQA accuracy, as judged by an LLM-based automatic evaluator. Since incorrect answers result in negative scores, our strategy focused on reducing hallucinations from the internal representations of the VLM. Specifically, we trained logistic regression-based hallucination detection models using both the hidden_state and the outputs of specific attention heads. We then employed an ensemble of these models. As a result, while our method sacrificed some correct answers, it significantly reduced hallucinations and allowed us to place among the top entries on the final leaderboard. For implementation details and code, please refer to this https URL.
18. 【2510.14321】Large Reasoning Embedding Models: Towards Next-Generation Dense Retrieval Paradigm
链接:https://arxiv.org/abs/2510.14321
作者:Jianting Tang,Dongshuai Li,Tao Wen,Fuyu Lv,Dan Ou,Linli Xu
类目:Information Retrieval (cs.IR)
关键词:e-commerce search systems, modern e-commerce search, search systems, indispensable component, Reasoning
备注:
点击查看摘要
Abstract:In modern e-commerce search systems, dense retrieval has become an indispensable component. By computing similarities between query and item (product) embeddings, it efficiently selects candidate products from large-scale repositories. With the breakthroughs in large language models (LLMs), mainstream embedding models have gradually shifted from BERT to LLMs for more accurate text modeling. However, these models still adopt direct-embedding methods, and the semantic accuracy of embeddings remains inadequate. Therefore, contrastive learning is heavily employed to achieve tight semantic alignment between positive pairs. Consequently, such models tend to capture statistical co-occurrence patterns in the training data, biasing them toward shallow lexical and semantic matches. For difficult queries exhibiting notable lexical disparity from target items, the performance degrades significantly. In this work, we propose the Large Reasoning Embedding Model (LREM), which novelly integrates reasoning processes into representation learning. For difficult queries, LREM first conducts reasoning to achieve a deep understanding of the original query, and then produces a reasoning-augmented query embedding for retrieval. This reasoning process effectively bridges the semantic gap between original queries and target items, significantly improving retrieval accuracy. Specifically, we adopt a two-stage training process: the first stage optimizes the LLM on carefully curated Query-CoT-Item triplets with SFT and InfoNCE losses to establish preliminary reasoning and embedding capabilities, and the second stage further refines the reasoning trajectories via reinforcement learning (RL). Extensive offline and online experiments validate the effectiveness of LREM, leading to its deployment on China's largest e-commerce platform since August 2025.
19. 【2510.14296】Rethinking Schema Linking: A Context-Aware Bidirectional Retrieval Approach for Text-to-SQL
链接:https://arxiv.org/abs/2510.14296
作者:Md Mahadi Hasan Nahid,Davood Rafiei,Weiwei Zhang,Yong Zhang
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:aligning natural language, natural language questions, database schema elements, process of aligning, aligning natural
备注: 30 Pages
点击查看摘要
Abstract:Schema linking -- the process of aligning natural language questions with database schema elements -- is a critical yet underexplored component of Text-to-SQL systems. While recent methods have focused primarily on improving SQL generation, they often neglect the retrieval of relevant schema elements, which can lead to hallucinations and execution failures. In this work, we propose a context-aware bidirectional schema retrieval framework that treats schema linking as a standalone problem. Our approach combines two complementary strategies: table-first retrieval followed by column selection, and column-first retrieval followed by table selection. It is further augmented with techniques such as question decomposition, keyword extraction, and keyphrase extraction. Through comprehensive evaluations on challenging benchmarks such as BIRD and Spider, we demonstrate that our method significantly improves schema recall while reducing false positives. Moreover, SQL generation using our retrieved schema consistently outperforms full-schema baselines and closely approaches oracle performance, all without requiring query refinement. Notably, our method narrows the performance gap between full and perfect schema settings by 50\%. Our findings highlight schema linking as a powerful lever for enhancing Text-to-SQL accuracy and efficiency.
20. 【2510.14278】PRISM: Agentic Retrieval with LLMs for Multi-Hop Question Answering
链接:https://arxiv.org/abs/2510.14278
作者:Md Mahadi Hasan Nahid,Davood Rafiei
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:requires gathering multiple, gathering multiple pieces, answering complex questions, complex questions requires, questions requires gathering
备注: 18 pages
点击查看摘要
Abstract:Retrieval plays a central role in multi-hop question answering (QA), where answering complex questions requires gathering multiple pieces of evidence. We introduce an Agentic Retrieval System that leverages large language models (LLMs) in a structured loop to retrieve relevant evidence with high precision and recall. Our framework consists of three specialized agents: a Question Analyzer that decomposes a multi-hop question into sub-questions, a Selector that identifies the most relevant context for each sub-question (focusing on precision), and an Adder that brings in any missing evidence (focusing on recall). The iterative interaction between Selector and Adder yields a compact yet comprehensive set of supporting passages. In particular, it achieves higher retrieval accuracy while filtering out distracting content, enabling downstream QA models to surpass full-context answer accuracy while relying on significantly less irrelevant information. Experiments on four multi-hop QA benchmarks -- HotpotQA, 2WikiMultiHopQA, MuSiQue, and MultiHopRAG -- demonstrates that our approach consistently outperforms strong baselines.
21. 【2510.14257】Synergistic Integration and Discrepancy Resolution of Contextualized Knowledge for Personalized Recommendation
链接:https://arxiv.org/abs/2510.14257
作者:Lingyu Mu,Hao Deng,Haibo Xing,Kaican Lin,Zhitong Zhu,Yu Zhang,Xiaoyi Zeng,Zhengxiao Liu,Zheng Lin,Jinxin Hu
类目:Information Retrieval (cs.IR)
关键词:large language models, revealed promising potential, enhanced reasoning capabilities, extract world knowledge, language models
备注:
点击查看摘要
Abstract:The integration of large language models (LLMs) into recommendation systems has revealed promising potential through their capacity to extract world knowledge for enhanced reasoning capabilities. However, current methodologies that adopt static schema-based prompting mechanisms encounter significant limitations: (1) they employ universal template structures that neglect the multi-faceted nature of user preference diversity; (2) they implement superficial alignment between semantic knowledge representations and behavioral feature spaces without achieving comprehensive latent space integration. To address these challenges, we introduce CoCo, an end-to-end framework that dynamically constructs user-specific contextual knowledge embeddings through a dual-mechanism approach. Our method realizes profound integration of semantic and behavioral latent dimensions via adaptive knowledge fusion and contradiction resolution modules. Experimental evaluations across diverse benchmark datasets and an enterprise-level e-commerce platform demonstrate CoCo's superiority, achieving a maximum 8.58% improvement over seven cutting-edge methods in recommendation accuracy. The framework's deployment on a production advertising system resulted in a 1.91% sales growth, validating its practical effectiveness. With its modular design and model-agnostic architecture, CoCo provides a versatile solution for next-generation recommendation systems requiring both knowledge-enhanced reasoning and personalized adaptation.
22. 【2510.14223】Large Scale Retrieval for the LinkedIn Feed using Causal Language Models
链接:https://arxiv.org/abs/2510.14223
作者:Sudarshan Srinivasa Ramanujam,Antonio Alonso,Saurabh Kataria,Siddharth Dangi,Akhilesh Gupta,Birjodh Singh Tiwana,Manas Somaiya,Luke Simon,David Byrne,Sojeong Ha,Sen Zhou,Andrei Akterskii,Zhanglong Liu,Samira Sriram,Crescent Xiong,Zhoutao Pei,Angela Shao,Alex Li,Annie Xiao,Caitlin Kolb,Thomas Kistler,Zach Moore,Hamed Firooz
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:LinkedIn Feed serves, LinkedIn Feed, hundreds of millions, large scale recommendation, Feed serves suggested
备注: 9 pages, 4 figures
点击查看摘要
Abstract:In large scale recommendation systems like the LinkedIn Feed, the retrieval stage is critical for narrowing hundreds of millions of potential candidates to a manageable subset for ranking. LinkedIn's Feed serves suggested content from outside of the member's network (based on the member's topical interests), where 2000 candidates are retrieved from a pool of hundreds of millions candidate with a latency budget of a few milliseconds and inbound QPS of several thousand per second. This paper presents a novel retrieval approach that fine-tunes a large causal language model (Meta's LLaMA 3) as a dual encoder to generate high quality embeddings for both users (members) and content (items), using only textual input. We describe the end to end pipeline, including prompt design for embedding generation, techniques for fine-tuning at LinkedIn's scale, and infrastructure for low latency, cost effective online serving. We share our findings on how quantizing numerical features in the prompt enables the information to get properly encoded in the embedding, facilitating greater alignment between the retrieval and ranking layer. The system was evaluated using offline metrics and an online A/B test, which showed substantial improvements in member engagement. We observed significant gains among newer members, who often lack strong network connections, indicating that high-quality suggested content aids retention. This work demonstrates how generative language models can be effectively adapted for real time, high throughput retrieval in industrial applications.
23. 【2510.14162】FinAI Data Assistant: LLM-based Financial Database Query Processing with the OpenAI Function Calling API
链接:https://arxiv.org/abs/2510.14162
作者:Juhyeong Kim,Yejin Kim,Youngbin Lee,Hyunwoo Byun
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:Function Calling API, combines large language, OpenAI Function Calling, large language models, Calling API
备注: 4 pages, 2 figures, accepted at CIKM 2025 FinAI Workshop
点击查看摘要
Abstract:We present FinAI Data Assistant, a practical approach for natural-language querying over financial databases that combines large language models (LLMs) with the OpenAI Function Calling API. Rather than synthesizing complete SQL via text-to-SQL, our system routes user requests to a small library of vetted, parameterized queries, trading generative flexibility for reliability, low latency, and cost efficiency. We empirically study three questions: (RQ1) whether LLMs alone can reliably recall or extrapolate time-dependent financial data without external retrieval; (RQ2) how well LLMs map company names to stock ticker symbols; and (RQ3) whether function calling outperforms text-to-SQL for end-to-end database query processing. Across controlled experiments on prices and fundamentals, LLM-only predictions exhibit non-negligible error and show look-ahead bias primarily for stock prices relative to model knowledge cutoffs. Ticker-mapping accuracy is near-perfect for NASDAQ-100 constituents and high for S\P~500 firms. Finally, FinAI Data Assistant achieves lower latency and cost and higher reliability than a text-to-SQL baseline on our task suite. We discuss design trade-offs, limitations, and avenues for deployment.
24. 【2510.13922】LTR-ICD: A Learning-to-Rank Approach for Automatic ICD Coding
链接:https://arxiv.org/abs/2510.13922
作者:Mohammad Mansoori,Amira Soliman,Farzaneh Etminani
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:unstructured text provided, Clinical notes, patient encounters, unstructured text, text provided
备注:
点击查看摘要
Abstract:Clinical notes contain unstructured text provided by clinicians during patient encounters. These notes are usually accompanied by a sequence of diagnostic codes following the International Classification of Diseases (ICD). Correctly assigning and ordering ICD codes are essential for medical diagnosis and reimbursement. However, automating this task remains challenging. State-of-the-art methods treated this problem as a classification task, leading to ignoring the order of ICD codes that is essential for different purposes. In this work, as a first attempt, we approach this task from a retrieval system perspective to consider the order of codes, thus formulating this problem as a classification and ranking task. Our results and analysis show that the proposed framework has a superior ability to identify high-priority codes compared to other methods. For instance, our model accuracy in correctly ranking primary diagnosis codes is 47%, compared to 20% for the state-of-the-art classifier. Additionally, in terms of classification metrics, the proposed model achieves a micro- and macro-F1 scores of 0.6065 and 0.2904, respectively, surpassing the previous best model with scores of 0.597 and 0.2660.
计算机视觉
1. 【2510.14981】Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
链接:https://arxiv.org/abs/2510.14981
作者:Hadi Alzayer,Yunzhi Zhang,Chen Geng,Jia-Bin Huang,Jiajun Wu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:present an inference-time, method to perform, inference-time diffusion sampling, diffusion sampling method, image
备注: Project page: [this https URL](https://coupled-diffusion.github.io)
点击查看摘要
Abstract:We present an inference-time diffusion sampling method to perform multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, but they suffer from a lengthy optimization process and instability under sparse view settings. We propose an implicit 3D regularization approach by constraining the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories from both a multi-view image distribution and a 2D edited image distribution, using a coupling term to enforce the multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across various model architectures and highlighting its potential as a general solution for multi-view consistent editing.
2. 【2510.14980】Agentic Design of Compositional Machines
链接:https://arxiv.org/abs/2510.14980
作者:Wenqian Zhang,Weiyang Liu,Zhen Liu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
关键词:complex machines stands, engineering practice, marker of human, human intelligence, foundation of engineering
备注: 75 pages, 31 figures, Project Page: [this https URL](https://besiegefield.github.io)
点击查看摘要
Abstract:The design of complex machines stands as both a marker of human intelligence and a foundation of engineering practice. Given recent advances in large language models (LLMs), we ask whether they, too, can learn to create. We approach this question through the lens of compositional machine design: a task in which machines are assembled from standardized components to meet functional demands like locomotion or manipulation in a simulated physical environment. To support this investigation, we introduce BesiegeField, a testbed built on the machine-building game Besiege, which enables part-based construction, physical simulation and reward-driven evaluation. Using BesiegeField, we benchmark state-of-the-art LLMs with agentic workflows and identify key capabilities required for success, including spatial reasoning, strategic assembly, and instruction-following. As current open-source models fall short, we explore reinforcement learning (RL) as a path to improvement: we curate a cold-start dataset, conduct RL finetuning experiments, and highlight open challenges at the intersection of language, machine design, and physical reasoning.
3. 【2510.14979】From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
链接:https://arxiv.org/abs/2510.14979
作者:Haiwen Diao,Mingxuan Li,Silei Wu,Linjun Dai,Xiaohua Wang,Hanming Deng,Lewei Lu,Dahua Lin,Ziwei Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:evolving model architectures, native VLMs, typical modular VLMs, shaped by evolving, training paradigms
备注: 21 pages, 7 figures
点击查看摘要
Abstract:The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field. In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: this https URL.
4. 【2510.14978】Learning an Image Editing Model without Image Editing Pairs
链接:https://arxiv.org/abs/2510.14978
作者:Nupur Kumari,Sheng-Yu Wang,Nanxuan Zhao,Yotam Nitzan,Yuheng Li,Krishna Kumar Singh,Richard Zhang,Eli Shechtman,Jun-Yan Zhu,Xun Huang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:achieved impressive results, Recent image editing, natural language editing, Recent image, achieved impressive
备注: project page: [this https URL](https://nupurkmr9.github.io/npedit/)
点击查看摘要
Abstract:Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
5. 【2510.14977】rra: Explorable Native 3D World Model with Point Latents
链接:https://arxiv.org/abs/2510.14977
作者:Yuanhui Huang,Weiliang Chen,Wenzhao Zheng,Xin Tao,Pengfei Wan,Jie Zhou,Jiwen Lu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:garnered increasing attention, garnered increasing, increasing attention, attention for comprehensive, World
备注: Project Page: [this https URL](https://huang-yh.github.io/terra/)
点击查看摘要
Abstract:World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.
6. 【2510.14976】Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation
链接:https://arxiv.org/abs/2510.14976
作者:Shaowei Liu,Chuan Guo,Bing Zhou,Jian Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
关键词:Close-proximity human-human interactive, convey rich contextual, rich contextual information, Close-proximity human-human, poses convey rich
备注: Accepted to ICCV 2025. Project page: [this https URL](https://stevenlsw.github.io/ponimator/)
点击查看摘要
Abstract:Close-proximity human-human interactive poses convey rich contextual information about interaction dynamics. Given such poses, humans can intuitively infer the context and anticipate possible past and future dynamics, drawing on strong priors of human behavior. Inspired by this observation, we propose Ponimator, a simple framework anchored on proximal interactive poses for versatile interaction animation. Our training data consists of close-contact two-person poses and their surrounding temporal context from motion-capture interaction datasets. Leveraging interactive pose priors, Ponimator employs two conditional diffusion models: (1) a pose animator that uses the temporal prior to generate dynamic motion sequences from interactive poses, and (2) a pose generator that applies the spatial prior to synthesize interactive poses from a single pose, text, or both when interactive poses are unavailable. Collectively, Ponimator supports diverse tasks, including image-based interaction animation, reaction animation, and text-to-interaction synthesis, facilitating the transfer of interaction knowledge from high-quality mocap data to open-world scenarios. Empirical experiments across diverse datasets and applications demonstrate the universality of the pose prior and the effectiveness and robustness of our framework.
7. 【2510.14975】WithAnyone: Towards Controllable and ID Consistent Image Generation
链接:https://arxiv.org/abs/2510.14975
作者:Hengyuan Xu,Wei Cheng,Peng Xing,Yixiao Fang,Shuhan Wu,Rui Wang,Xianfang Zeng,Daxin Jiang,Gang Yu,Xingjun Ma,Yu-Gang Jiang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:achieving notable success, producing images aligned, recent models achieving, models achieving notable, Identity-consistent generation
备注: 23 Pages; Project Page: [this https URL](https://doby-xu.github.io/WithAnyone/;) Code: [this https URL](https://github.com/Doby-Xu/WithAnyone)
点击查看摘要
Abstract:Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.
8. 【2510.14974】pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation
链接:https://arxiv.org/abs/2510.14974
作者:Hansheng Chen,Kai Zhang,Hao Tan,Leonidas Guibas,Gordon Wetzstein,Sai Bi
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:flow-based generative models, generative models typically, models typically distill, denoised data, diffusion or flow-based
备注: Code: [this https URL](https://github.com/Lakonik/piFlow) Demos: [this https URL](https://huggingface.co/spaces/Lakonik/pi-Qwen) and [this https URL](https://huggingface.co/spaces/Lakonik/pi-FLUX.1)
点击查看摘要
Abstract:Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ($\pi$-Flow). $\pi$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration on these substeps without extra network evaluations. To match the policy's ODE trajectory to the teacher's, we introduce a novel imitation distillation approach, which matches the policy's velocity to the teacher's along the policy's trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher's behavior, $\pi$-Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256$^2$, it attains a 1-NFE FID of 2.85, outperforming MeanFlow of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $\pi$-Flow achieves substantially better diversity than state-of-the-art few-step methods, while maintaining teacher-level quality.
9. 【2510.14968】RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks
链接:https://arxiv.org/abs/2510.14968
作者:Mingxuan Yan,Yuping Wang,Zechun Liu,Jiachen Li
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
关键词:employ vision-language model, frameworks employ vision-language, decompose complex manipulation, tackle long-horizon tasks, complex manipulation tasks
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025); Project Website: [this http URL](http://rdd-neurips.github.io)
点击查看摘要
Abstract:To tackle long-horizon tasks, recent hierarchical vision-language-action (VLAs) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can easily handle. Typically, the VLM planner is finetuned to learn to decompose a target task. This finetuning requires target task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, the heuristic subtasks can deviate significantly from the training data of the visuomotor policy, which degrades task performance. To address these issues, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes demonstrations into sub-tasks by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. Our method outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at this http URL.
10. 【2510.14965】ChangingGrounding: 3D Visual Grounding in Changing Scenes
链接:https://arxiv.org/abs/2510.14965
作者:Miao Hu,Zhiwei Huang,Tai Wang,Jiangmiao Pang,Dahua Lin,Nanning Zheng,Runsen Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Real-world robots localize, robots localize objects, robots localize, natural-language instructions, Real-world robots
备注: 30 pages
点击查看摘要
Abstract:Real-world robots localize objects from natural-language instructions while scenes around them keep changing. Yet most of the existing 3D visual grounding (3DVG) method still assumes a reconstructed and up-to-date point cloud, an assumption that forces costly re-scans and hinders deployment. We argue that 3DVG should be formulated as an active, memory-driven problem, and we introduce ChangingGrounding, the first benchmark that explicitly measures how well an agent can exploit past observations, explore only where needed, and still deliver precise 3D boxes in changing scenes. To set a strong reference point, we also propose Mem-ChangingGrounder, a zero-shot method for this task that marries cross-modal retrieval with lightweight multi-view fusion: it identifies the object type implied by the query, retrieves relevant memories to guide actions, then explores the target efficiently in the scene, falls back when previous operations are invalid, performs multi-view scanning of the target, and projects the fused evidence from multi-view scans to get accurate object bounding boxes. We evaluate different baselines on ChangingGrounding, and our Mem-ChangingGrounder achieves the highest localization accuracy while greatly reducing exploration cost. We hope this benchmark and method catalyze a shift toward practical, memory-centric 3DVG research for real-world applications. Project page: this https URL .
11. 【2510.14962】RainDiff: End-to-end Precipitation Nowcasting Via Token-wise Attention Diffusion
链接:https://arxiv.org/abs/2510.14962
作者:Thao Nguyen,Jiaqi Ma,Fahad Shahbaz Khan,Souhaib Ben Taieb,Salman Khan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:predicting future radar, future radar echo, radar echo sequences, challenging task due, tightly coupled spatio-temporal
备注:
点击查看摘要
Abstract:Precipitation nowcasting, predicting future radar echo sequences from current observations, is a critical yet challenging task due to the inherently chaotic and tightly coupled spatio-temporal dynamics of the atmosphere. While recent advances in diffusion-based models attempt to capture both large-scale motion and fine-grained stochastic variability, they often suffer from scalability issues: latent-space approaches require a separately trained autoencoder, adding complexity and limiting generalization, while pixel-space approaches are computationally intensive and often omit attention mechanisms, reducing their ability to model long-range spatio-temporal dependencies. To address these limitations, we propose a Token-wise Attention integrated into not only the U-Net diffusion model but also the spatio-temporal encoder that dynamically captures multi-scale spatial interactions and temporal evolution. Unlike prior approaches, our method natively integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion, thereby eliminating the need for separate latent modules. Our extensive experiments and visual evaluations across diverse datasets demonstrate that the proposed method significantly outperforms state-of-the-art approaches, yielding superior local fidelity, generalization, and robustness in complex precipitation forecasting scenarios.
12. 【2510.14960】C4D: 4D Made from 3D through Dual Correspondences
链接:https://arxiv.org/abs/2510.14960
作者:Shizun Wang,Zhenxiang Jiang,Xingyi Yang,Xinchao Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:inevitably challenging problem, jointly estimates dynamic, monocular video, challenging problem, jointly estimates
备注: Accepted to ICCV 2025
点击查看摘要
Abstract:Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inevitably challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend existing 3D reconstruction formulation to 4D. Specifically, apart from predicting pointmaps, C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks to separate moving elements from the static background, thus offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking. Project Page: this https URL
13. 【2510.14958】MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning
链接:https://arxiv.org/abs/2510.14958
作者:Weikang Shi,Aldrich Yu,Rongyao Fang,Houxing Ren,Ke Wang,Aojun Zhou,Changyao Tian,Xinyu Fu,Yuxuan Hu,Zimu Lu,Linjiang Huang,Si Liu,Rui Liu,Hongsheng Li
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Large Multimodal Models, Language Models, excelled in textual
备注: Project Page: [this https URL](https://mathcanvas.github.io/)
点击查看摘要
Abstract:While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit-framework, datasets, and benchmark-to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: this https URL
14. 【2510.14955】RealDPO: Real or Not Real, that is the Preference
链接:https://arxiv.org/abs/2510.14955
作者:Guo Cheng,Danni Yang,Ziqi Huang,Jianlou Si,Chenyang Si,Ziwei Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:recently achieved notable, achieved notable advancements, recently achieved, achieved notable, notable advancements
备注: Code: [this https URL](https://github.com/Vchitect/RealDPO) Project Page: [this https URL](https://vchitect.github.io/RealDPO-Project/)
点击查看摘要
Abstract:Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.
15. 【2510.14954】OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression
链接:https://arxiv.org/abs/2510.14954
作者:Zhe Li,Weihao Yuan,Weichao Shen,Siyu Zhu,Zilong Dong,Chang Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Whole-body multi-modal human, multi-modal human motion, effective motion generation, motion generation poses, motion generation mechanism
备注:
点击查看摘要
Abstract:Whole-body multi-modal human motion generation poses two primary challenges: creating an effective motion generation mechanism and integrating various modalities, such as text, speech, and music, into a cohesive framework. Unlike previous methods that usually employ discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer, where a causal attention is performed considering the sequential nature within the human motion. Within this transformer, we introduce a gated linear attention and an RMSNorm module, which drive the transformer to pay attention to the key actions and suppress the instability caused by either the abnormal movements or the heterogeneous distributions within multi-modalities. To further enhance both the motion generation and the multimodal generalization, we employ the DiT structure to diffuse the conditions from the transformer towards the targets. To fuse different modalities, AdaLN and cross-attention are leveraged to inject the text, speech, and music signals. Experimental results demonstrate that our framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made public.
16. 【2510.14952】From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance
链接:https://arxiv.org/abs/2510.14952
作者:Zhe Li,Cheng Chi,Yangyang Wei,Boan Zhu,Yibo Peng,Tao Huang,Pengwei Wang,Zhongyuan Wang,Shanghang Zhang,Chang Xu
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:pipelines remain cumbersome, Natural language offers, natural interface, locomotion pipelines remain, existing language-guided humanoid
备注:
点击查看摘要
Abstract:Natural language offers a natural interface for humanoid robots, but existing language-guided humanoid locomotion pipelines remain cumbersome and untrustworthy. They typically decode human motion, retarget it to robot morphology, and then track it with a physics-based controller. However, this multi-stage process is prone to cumulative errors, introduces high latency, and yields weak coupling between semantics and control. These limitations call for a more direct pathway from language to action, one that eliminates fragile intermediate stages. Therefore, we present RoboGhost, a retargeting-free framework that directly conditions humanoid policies on language-grounded motion latents. By bypassing explicit motion decoding and retargeting, RoboGhost enables a diffusion-based policy to denoise executable actions directly from noise, preserving semantic intent and supporting fast, reactive control. A hybrid causal transformer-diffusion motion generator further ensures long-horizon consistency while maintaining stability and diversity, yielding rich latent representations for precise humanoid behavior. Extensive experiments demonstrate that RoboGhost substantially reduces deployment latency, improves success rates and tracking precision, and produces smooth, semantically aligned locomotion on real humanoids. Beyond text, the framework naturally extends to other modalities such as images, audio, and music, providing a universal foundation for vision-language-action humanoid systems.
17. 【2510.14949】DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
链接:https://arxiv.org/abs/2510.14949
作者:Yu Zhou,Sohyun An,Haikang Deng,Da Yin,Clark Peng,Cho-Jui Hsieh,Kai-Wei Chang,Nanyun Peng
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:rich regional variations, Contact languages, exhibit rich regional, generative models, English exhibit rich
备注:
点击查看摘要
Abstract:Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins ( 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.
18. 【2510.14945】3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation
链接:https://arxiv.org/abs/2510.14945
作者:JoungBin Lee,Jaewoo Jung,Jisang Han,Takuya Narihira,Kazumi Fukuda,Junyoung Seo,Sunghwan Hong,Yuki Mitsufuji,Seungryong Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:preserving scene consistency, chunk from arbitrary-length, scene consistency, input video, consistency
备注: Project page : [this https URL](https://cvlab-kaist.github.io/3DScenePrompt/)
点击查看摘要
Abstract:We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page : this https URL
19. 【2510.14904】MaskCaptioner : Learning to Jointly Segment and Caption Object Trajectories in Videos
链接:https://arxiv.org/abs/2510.14904
作者:Gabriel Fiastre,Antoine Yang,Cordelia Schmid
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Dense Video Object, understand spatio-temporal details, Video Object Captioning, Dense Video, captioning object trajectories
备注: 20 pages, 8 figures
点击查看摘要
Abstract:Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at this https URL.
20. 【2510.14896】Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection
链接:https://arxiv.org/abs/2510.14896
作者:Furkan Mumcu,Michael J. Jones,Anoop Cherian,Yasin Yilmaz
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Existing semi-supervised video, Large Language Models, Multimodal Large Language, Existing semi-supervised, leveraging Multimodal Large
备注:
点击查看摘要
Abstract:Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.
21. 【2510.14885】You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction
链接:https://arxiv.org/abs/2510.14885
作者:Logan Lawrence,Oindrila Saha,Megan Wei,Chen Sun,Subhransu Maji,Grant Van Horn
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Multimodal Large Language, Large Language Models, auto-regressive models remains, Multimodal Large, zero-shot visual classification
备注: Accepted to WACV26. 12 pages, 8 tables, 5 figures
点击查看摘要
Abstract:Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don't consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.
22. 【2510.14882】ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention
链接:https://arxiv.org/abs/2510.14882
作者:Keli Liu,Zhendong Wang,Wengang Zhou,Shaodong Xu,Ruixiao Dong,Houqiang Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:recently achieved impressive, achieved impressive advances, proposed Reference Attention, Reference Attention module, Reference Attention
备注:
点击查看摘要
Abstract:Text-to-image generation with visual autoregressive~(VAR) models has recently achieved impressive advances in generation fidelity and inference efficiency. While control mechanisms have been explored for diffusion models, enabling precise and flexible control within VAR paradigm remains underexplored. To bridge this critical gap, in this paper, we introduce ScaleWeaver, a novel framework designed to achieve high-fidelity, controllable generation upon advanced VAR models through parameter-efficient fine-tuning. The core module in ScaleWeaver is the improved MMDiT block with the proposed Reference Attention module, which efficiently and effectively incorporates conditional information. Different from MM Attention, the proposed Reference Attention module discards the unnecessary attention from image$\rightarrow$condition, reducing computational cost while stabilizing control injection. Besides, it strategically emphasizes parameter reuse, leveraging the capability of the VAR backbone itself with a few introduced parameters to process control information, and equipping a zero-initialized linear projection to ensure that control signals are incorporated effectively without disrupting the generative capability of the base model. Extensive experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods, making ScaleWeaver a practical and effective solution for controllable text-to-image generation within the visual autoregressive paradigm. Code and models will be released.
23. 【2510.14876】BADAS: Context Aware Collision Prediction Using Real-World Dashcam Data
链接:https://arxiv.org/abs/2510.14876
作者:Roni Goldshmidt,Hamish Scott,Lorenzo Niccolini,Shizhan Zhu,Daniel Moura,Orly Zvitia
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Existing collision prediction, excessive false alerts, collision prediction methods, Existing collision, collision prediction
备注:
点击查看摘要
Abstract:Existing collision prediction methods often fail to distinguish between ego-vehicle threats and random accidents not involving the ego vehicle, leading to excessive false alerts in real-world deployment. We present BADAS, a family of collision prediction models trained on Nexar's real-world dashcam collision dataset -- the first benchmark designed explicitly for ego-centric evaluation. We re-annotate major benchmarks to identify ego involvement, add consensus alert-time labels, and synthesize negatives where needed, enabling fair AP/AUC and temporal evaluation. BADAS uses a V-JEPA2 backbone trained end-to-end and comes in two variants: BADAS-Open (trained on our 1.5k public videos) and BADAS1.0 (trained on 40k proprietary videos). Across DAD, DADA-2000, DoTA, and Nexar, BADAS achieves state-of-the-art AP/AUC and outperforms a forward-collision ADAS baseline while producing more realistic time-to-accident estimates. We release our BADAS-Open model weights and code, along with re-annotations of all evaluation datasets to promote ego-centric collision prediction research.
24. 【2510.14874】OUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions
链接:https://arxiv.org/abs/2510.14874
作者:Guangyi Han,Wei Zhai,Yuhang Yang,Yang Cao,Zheng-Jun Zha
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:HOI, Hand-object interaction, Existing HOI generation, HOI generation, HOI generation research
备注:
点击查看摘要
Abstract:Hand-object interaction (HOI) is fundamental for humans to express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or generic intent instructions, even when expressed through elaborate language. Such an overly general conditioning imposes a strong inductive bias for stable grasps, thus failing to capture the diversity of daily HOI. To address these limitations, we introduce Free-Form HOI Generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions, like pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild diverse 3D HOI dataset, which includes diverse HOI derived from internet videos. Specifically, it contains 4.4k unique interactions across 92 intents and 610 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that facilitates fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method's ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities. The project page is $\href{this https URL}{here}$.
25. 【2510.14866】Benchmarking Multimodal Large Language Models for Face Recognition
链接:https://arxiv.org/abs/2510.14866
作者:Hatef Otroshi Shahreza,Sébastien Marcel
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Multimodal large language, Multimodal large, achieved remarkable performance, large language models, large language
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have achieved remarkable performance across diverse vision-and-language tasks. However, their potential in face recognition remains underexplored. In particular, the performance of open-source MLLMs needs to be evaluated and compared with existing face recognition models on standard benchmarks with similar protocol. In this work, we present a systematic benchmark of state-of-the-art MLLMs for face recognition on several face recognition datasets, including LFW, CALFW, CPLFW, CFP, AgeDB and RFW. Experimental results reveal that while MLLMs capture rich semantic cues useful for face-related tasks, they lag behind specialized models in high-precision recognition scenarios in zero-shot applications. This benchmark provides a foundation for advancing MLLM-based face recognition, offering insights for the design of next-generation models with higher accuracy and generalization. The source code of our benchmark is publicly available in the project page.
26. 【2510.14862】Multi-modal video data-pipelines for machine learning with minimal human supervision
链接:https://arxiv.org/abs/2510.14862
作者:Mihai-Cristian Pîrvu,Marius Leordeanu
类目:Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
关键词:real-world is inherently, Machine Learning models, inherently multi-modal, Machine Learning, Abstract
备注:
点击查看摘要
Abstract:The real-world is inherently multi-modal at its core. Our tools observe and take snapshots of it, in digital form, such as videos or sounds, however much of it is lost. Similarly for actions and information passing between humans, languages are used as a written form of communication. Traditionally, Machine Learning models have been unimodal (i.e. rgb - semantic or text - sentiment_class). Recent trends go towards bi-modality, where images and text are learned together, however, in order to truly understand the world, we need to integrate all these independent modalities. In this work we try to combine as many visual modalities as we can using little to no human supervision. In order to do this, we use pre-trained experts and procedural combinations between them on top of raw videos using a fully autonomous data-pipeline, which we also open-source. We then make use of PHG-MAE, a model specifically designed to leverage multi-modal data. We show that this model which was efficiently distilled into a low-parameter (1M) can have competitive results compared to models of ~300M parameters. We deploy this model and analyze the use-case of real-time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off-the-shelf models using the same framework, such as DPT for near real-time depth estimation.
27. 【2510.14855】A Multi-Task Deep Learning Framework for Skin Lesion Classification, ABCDE Feature Quantification, and Evolution Simulation
链接:https://arxiv.org/abs/2510.14855
作者:Harsha Kotla,Arun Kumar Rajasekaran,Hannah Rana
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:improves survival rates, significantly improves survival, Early detection, skin lesions, survival rates
备注:
点击查看摘要
Abstract:Early detection of melanoma has grown to be essential because it significantly improves survival rates, but automated analysis of skin lesions still remains challenging. ABCDE, which stands for Asymmetry, Border irregularity, Color variation, Diameter, and Evolving, is a well-known classification method for skin lesions, but most deep learning mechanisms treat it as a black box, as most of the human interpretable features are not explained. In this work, we propose a deep learning framework that both classifies skin lesions into categories and also quantifies scores for each ABCD feature. It simulates the evolution of these features over time in order to represent the E aspect, opening more windows for future exploration. The A, B, C, and D values are quantified particularly within this work. Moreover, this framework also visualizes ABCD feature trajectories in latent space as skin lesions evolve from benign nevuses to malignant melanoma. The experiments are conducted using the HAM10000 dataset that contains around ten thousand images of skin lesions of varying stages. In summary, the classification worked with an accuracy of around 89 percent, with melanoma AUC being 0.96, while the feature evaluation performed well in predicting asymmetry, color variation, and diameter, though border irregularity remains more difficult to model. Overall, this work provides a deep learning framework that will allow doctors to link ML diagnoses to clinically relevant criteria, thus improving our understanding of skin cancer progression.
28. 【2510.14847】ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints
链接:https://arxiv.org/abs/2510.14847
作者:Meiqi Wu,Jiashu Zhu,Xiaokun Feng,Chubin Chen,Chen Zhu,Bingze Song,Fangyuan Mao,Jiahong Wu,Xiangxiang Chu,Kaiqi Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:performance degrades notably, achieved remarkable progress, models have achieved, achieved remarkable, excelling in realistic
备注:
点击查看摘要
Abstract:Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.
29. 【2510.14845】Backdoor Unlearning by Linear Task Decomposition
链接:https://arxiv.org/abs/2510.14845
作者:Amel Abdelraheem,Alessandro Favero,Gerome Bovet,Pascal Frossard
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:revolutionized computer vision, enabling broad generalization, Foundation models, revolutionized computer, computer vision
备注:
点击查看摘要
Abstract:Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, especially given that the large-scale nature of the models prohibits retraining to ensure safety. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior, and can often degrade performance on other unrelated tasks. This raises the question of whether backdoors can be removed without compromising the general capabilities of the models. In this work, we address this question and study how backdoors are encoded in the model weight space, finding that they are disentangled from other benign tasks. Specifically, this separation enables the isolation and erasure of the backdoor's influence on the model with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages such disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given the knowledge of the attack, our method achieves approximately perfect unlearning, while retaining, on average, 96% of clean accuracy. Additionally, we demonstrate that even when the attack and its presence are unknown, our method successfully unlearns backdoors by proper estimation using reverse-engineered triggers. Overall, our method consistently yields better unlearning and clean accuracy tradeoffs when compared to present state-of-the-art defenses.
30. 【2510.14836】QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models
链接:https://arxiv.org/abs/2510.14836
作者:Yixuan Li,Yuhui Chen,Mingcai Zhou,Haoran Li
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:accomplish fine-grained manipulation, accomplish fine-grained, fine-grained manipulation tasks, Spatial perception, augments VLA models
备注:
点击查看摘要
Abstract:Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
31. 【2510.14831】Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data
链接:https://arxiv.org/abs/2510.14831
作者:Qi Chen,Xinze Zhou,Chen Liu,Hao Chen,Wenxuan Li,Zekun Jiang,Ziyan Huang,Yuxuan Zhao,Dexin Yu,Junjun He,Yefeng Zheng,Ling Shao,Alan Yuille,Zongwei Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:require medical experts, lack of large, segmentation is limited, hard to create, create and require
备注:
点击查看摘要
Abstract:AI for tumor segmentation is limited by the lack of large, voxel-wise annotated datasets, which are hard to create and require medical experts. In our proprietary JHH dataset of 3,000 annotated pancreatic tumor scans, we found that AI performance stopped improving after 1,500 scans. With synthetic data, we reached the same performance using only 500 real scans. This finding suggests that synthetic data can steepen data scaling laws, enabling more efficient model training than real data alone. Motivated by these lessons, we created AbdomenAtlas 2.0--a dataset of 10,135 CT scans with a total of 15,130 tumor instances per-voxel manually annotated in six organs (pancreas, liver, kidney, colon, esophagus, and uterus) and 5,893 control scans. Annotated by 23 expert radiologists, it is several orders of magnitude larger than existing public tumor datasets. While we continue expanding the dataset, the current version of AbdomenAtlas 2.0 already provides a strong foundation--based on lessons from the JHH dataset--for training AI to segment tumors in six organs. It achieves notable improvements over public datasets, with a +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests.
32. 【2510.14824】Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking
链接:https://arxiv.org/abs/2510.14824
作者:Ziqi Dai,Xin Zhang,Mingxin Li,Yanzhao Zhang,Dingkun Long,Pengjun Xie,Meishan Zhang,Wenjie Li,Min Zhang
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:binary label prediction, relevant query-document pairs, metric learning, binary label, relevance vs. irrelevance
备注:
点击查看摘要
Abstract:In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts ''yes'' (resp. ''no'') token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking the universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: weight, which controls the magnitude of those updates, and direction, which guides the model updates, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark. We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.
33. 【2510.14823】FraQAT: Quantization Aware Training with Fractional bits
链接:https://arxiv.org/abs/2510.14823
作者:Luca Morreale,Alberto Gil C. P. Ramos,Malcolm Chadwick,Mehid Noroozi,Ruchika Chavhan,Abhinav Mehrotra,Sourav Bhattacharya
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrated impressive capabilities, large capacity model, demonstrated impressive, impressive capabilities, capabilities in image
备注:
点击查看摘要
Abstract:State-of-the-art (SOTA) generative models have demonstrated impressive capabilities in image synthesis or text generation, often with a large capacity model. However, these large models cannot be deployed on smartphones due to the limited availability of on-board memory and computations. Quantization methods lower the precision of the model parameters, allowing for efficient computations, \eg, in \INT{8}. Although aggressive quantization addresses efficiency and memory constraints, preserving the quality of the model remains a challenge. To retain quality in previous aggressive quantization, we propose a new fractional bits quantization (\short) approach. The novelty is a simple yet effective idea: we progressively reduce the model's precision from 32 to 4 bits per parameter, and exploit the fractional bits during optimization to maintain high generation quality. We show that the \short{} yields improved quality on a variety of diffusion models, including SD3.5-Medium, Sana, \pixart, and FLUX.1-schnell, while achieving $4-7\%$ lower FiD than standard QAT. Finally, we deploy and run Sana on a Samsung S25U, which runs on the Qualcomm SM8750-AB Snapdragon 8 Elite Hexagon Tensor Processor (HTP).
34. 【2510.14819】Unifying Environment Perception and Route Choice Modeling for Trajectory Representation Learning
链接:https://arxiv.org/abs/2510.14819
作者:Ji Cao,Yu Wang,Tongya Zheng,Zujie Ren,Canghong Jin,Gang Chen,Mingli Song
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:including travel time, travel time estimation, encode raw trajectories, trajectory similarity analysis, location prediction
备注:
点击查看摘要
Abstract:Trajectory Representation Learning (TRL) aims to encode raw trajectories into low-dimensional vectors, which can then be leveraged in various downstream tasks, including travel time estimation, location prediction, and trajectory similarity analysis. However, existing TRL methods suffer from a key oversight: treating trajectories as isolated spatio-temporal sequences, without considering the external environment and internal route choice behavior that govern their formation. To bridge this gap, we propose a novel framework that unifies comprehensive environment \textbf{P}erception and explicit \textbf{R}oute choice modeling for effective \textbf{Traj}ectory representation learning, dubbed \textbf{PRTraj}. Specifically, PRTraj first introduces an Environment Perception Module to enhance the road network by capturing multi-granularity environmental semantics from surrounding POI distributions. Building on this environment-aware backbone, a Route Choice Encoder then captures the route choice behavior inherent in each trajectory by modeling its constituent road segment transitions as a sequence of decisions. These route-choice-aware representations are finally aggregated to form the global trajectory embedding. Extensive experiments on 3 real-world datasets across 5 downstream tasks validate the effectiveness and generalizability of PRTraj. Moreover, PRTraj demonstrates strong data efficiency, maintaining robust performance under few-shot scenarios. Our code is available at: this https URL.
35. 【2510.14803】Scaling Artificial Intelligence for Multi-Tumor Early Detection with More Reports, Fewer Masks
链接:https://arxiv.org/abs/2510.14803
作者:Pedro R. A. S. Bassi,Xinze Zhou,Wenxuan Li,Szymon Płotka,Jieneng Chen,Qi Chen,Zheren Zhu,Jakub Prządo,Ibrahim E. Hamacı,Sezgin Er,Yuhan Wang,Ashwin Kumar,Bjoern Menze,Jarosław B. Ćwikła,Yuyin Zhou,Akshay S. Chaudhari,Curtis P. Langlotz,Sergio Decherchi,Andrea Cavalli,Kang Wang,Yang Yang,Alan L. Yuille,Zongwei Zhou
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:detection save lives, save lives, tumor, masks, Artificial intelligence
备注:
点击查看摘要
Abstract:Early tumor detection save lives. Each year, more than 300 million computed tomography (CT) scans are performed worldwide, offering a vast opportunity for effective cancer screening. However, detecting small or early-stage tumors on these CT scans remains challenging, even for experts. Artificial intelligence (AI) models can assist by highlighting suspicious regions, but training such models typically requires extensive tumor masks--detailed, voxel-wise outlines of tumors manually drawn by radiologists. Drawing these masks is costly, requiring years of effort and millions of dollars. In contrast, nearly every CT scan in clinical practice is already accompanied by medical reports describing the tumor's size, number, appearance, and sometimes, pathology results--information that is rich, abundant, and often underutilized for AI training. We introduce R-Super, which trains AI to segment tumors that match their descriptions in medical reports. This approach scales AI training with large collections of readily available medical reports, substantially reducing the need for manually drawn tumor masks. When trained on 101,654 reports, AI models achieved performance comparable to those trained on 723 masks. Combining reports and masks further improved sensitivity by +13% and specificity by +8%, surpassing radiologists in detecting five of the seven tumor types. Notably, R-Super enabled segmentation of tumors in the spleen, gallbladder, prostate, bladder, uterus, and esophagus, for which no public masks or AI models previously existed. This study challenges the long-held belief that large-scale, labor-intensive tumor mask creation is indispensable, establishing a scalable and accessible path toward early detection across diverse tumor types. We plan to release our trained models, code, and dataset at this https URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2510.14803 [cs.CV]
(or
arXiv:2510.14803v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2510.14803
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
36. 【2510.14800】Morphology-Aware Prognostic model for Five-Year Survival Prediction in Colorectal Cancer from HE Whole Slide Images
链接:https://arxiv.org/abs/2510.14800
作者:Usama Sajjad,Abdul Rehman Akbar,Ziyu Su,Deborah Knight,Wendy L. Frankel,Metin N. Gurcan,Wei Chen,Muhammad Khalid Khan Niazi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:projected deaths anticipated, prevalent malignancy globally, Colorectal cancer, malignancy globally, projected deaths
备注:
点击查看摘要
Abstract:Colorectal cancer (CRC) remains the third most prevalent malignancy globally, with approximately 154,000 new cases and 54,000 projected deaths anticipated for 2025. The recent advancement of foundation models in computational pathology has been largely propelled by task agnostic methodologies that can overlook organ-specific crucial morphological patterns that represent distinct biological processes that can fundamentally influence tumor behavior, therapeutic response, and patient outcomes. The aim of this study is to develop a novel, interpretable AI model, PRISM (Prognostic Representation of Integrated Spatial Morphology), that incorporates a continuous variability spectrum within each distinct morphology to characterize phenotypic diversity and reflecting the principle that malignant transformation occurs through incremental evolutionary processes rather than abrupt phenotypic shifts. PRISM is trained on 8.74 million histological images extracted from surgical resection specimens of 424 patients with stage III CRC. PRISM achieved superior prognostic performance for five-year OS (AUC = 0.70 +- 0.04; accuracy = 68.37% +- 4.75%; HR = 3.34, 95% CI = 2.28-4.90; p 0.0001), outperforming existing CRC-specific methods by 15% and AI foundation models by ~23% accuracy. It showed sex-agnostic robustness (AUC delta = 0.02; accuracy delta = 0.15%) and stable performance across clinicopathological subgroups, with minimal accuracy fluctuation (delta = 1.44%) between 5FU/LV and CPT-11/5FU/LV regimens, replicating the Alliance cohort finding of no survival difference between treatments.
37. 【2510.14792】CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection
链接:https://arxiv.org/abs/2510.14792
作者:Hojun Choi,Youngsun Lim,Jaeyo Shin,Hyunjung Shim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:localize object categories, seeks to recognize, recognize and localize, OVD, Open-vocabulary object detection
备注: 28 pages, 13 Figures, 12 Tables
点击查看摘要
Abstract:Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that employs structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL) that uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art.
38. 【2510.14770】MoCom: Motion-based Inter-MAV Visual Communication Using Event Vision and Spiking Neural Networks
链接:https://arxiv.org/abs/2510.14770
作者:Zhang Nengbo,Hann Woei Ho,Ye Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Micro Air Vehicle, Air Vehicle, Micro Air, conventional radio-based methods, radio-based methods suffer
备注:
点击查看摘要
Abstract:Reliable communication in Micro Air Vehicle (MAV) swarms is challenging in environments, where conventional radio-based methods suffer from spectrum congestion, jamming, and high power consumption. Inspired by the waggle dance of honeybees, which efficiently communicate the location of food sources without sound or contact, we propose a novel visual communication framework for MAV swarms using motion-based signaling. In this framework, MAVs convey information, such as heading and distance, through deliberate flight patterns, which are passively captured by event cameras and interpreted using a predefined visual codebook of four motion primitives: vertical (up/down), horizontal (left/right), left-to-up-to-right, and left-to-down-to-right, representing control symbols (``start'', ``end'', ``1'', ``0''). To decode these signals, we design an event frame-based segmentation model and a lightweight Spiking Neural Network (SNN) for action recognition. An integrated decoding algorithm then combines segmentation and classification to robustly interpret MAV motion sequences. Experimental results validate the framework's effectiveness, which demonstrates accurate decoding and low power consumption, and highlights its potential as an energy-efficient alternative for MAV communication in constrained environments.
39. 【2510.14765】Inpainting the Red Planet: Diffusion Models for the Reconstruction of Martian Environments in Virtual Reality
链接:https://arxiv.org/abs/2510.14765
作者:Giuseppe Lorenzo Catalano,Agata Marta Soccini
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
关键词:Space exploration increasingly, multidisciplinary scientific analysis, Virtual Reality, exploration increasingly relies, Space exploration
备注: 21 pages, 9 figures
点击查看摘要
Abstract:Space exploration increasingly relies on Virtual Reality for several tasks, such as mission planning, multidisciplinary scientific analysis, and astronaut training. A key factor for the reliability of the simulations is having accurate 3D representations of planetary terrains. Extraterrestrial heightmaps derived from satellite imagery often contain missing values due to acquisition and transmission constraints. Mars is among the most studied planets beyond Earth, and its extensive terrain datasets make the Martian surface reconstruction a valuable task, although many areas remain unmapped. Deep learning algorithms can support void-filling tasks; however, whereas Earth's comprehensive datasets enables the use of conditional methods, such approaches cannot be applied to Mars. Current approaches rely on simpler interpolation techniques which, however, often fail to preserve geometric coherence. In this work, we propose a method for reconstructing the surface of Mars based on an unconditional diffusion model. Training was conducted on an augmented dataset of 12000 Martian heightmaps derived from NASA's HiRISE survey. A non-homogeneous rescaling strategy captures terrain features across multiple scales before resizing to a fixed 128x128 model resolution. We compared our method against established void-filling and inpainting techniques, including Inverse Distance Weighting, kriging, and Navier-Stokes algorithm, on an evaluation set of 1000 samples. Results show that our approach consistently outperforms these methods in terms of reconstruction accuracy (4-15% on RMSE) and perceptual similarity (29-81% on LPIPS) with the original data.
40. 【2510.14753】LightQANet: Quantized and Adaptive Feature Learning for Low-Light Image Enhancement
链接:https://arxiv.org/abs/2510.14753
作者:Xu Wu,Zhihui Lai,Xianxu Hou,Jie Zhou,Ya-nan Zhang,Linlin Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:preserving high-quality color, aims to improve, preserving high-quality, Light Quantization Module, Low-light image enhancement
备注:
点击查看摘要
Abstract:Low-light image enhancement (LLIE) aims to improve illumination while preserving high-quality color and texture. However, existing methods often fail to extract reliable feature representations due to severely degraded pixel-level information under low-light conditions, resulting in poor texture restoration, color inconsistency, and artifact. To address these challenges, we propose LightQANet, a novel framework that introduces quantized and adaptive feature learning for low-light enhancement, aiming to achieve consistent and robust image quality across diverse lighting conditions. From the static modeling perspective, we design a Light Quantization Module (LQM) to explicitly extract and quantify illumination-related factors from image features. By enforcing structured light factor learning, LQM enhances the extraction of light-invariant representations and mitigates feature inconsistency across varying illumination levels. From the dynamic adaptation perspective, we introduce a Light-Aware Prompt Module (LAPM), which encodes illumination priors into learnable prompts to dynamically guide the feature learning process. LAPM enables the model to flexibly adapt to complex and continuously changing lighting conditions, further improving image enhancement. Extensive experiments on multiple low-light datasets demonstrate that our method achieves state-of-the-art performance, delivering superior qualitative and quantitative results across various challenging lighting scenarios.
41. 【2510.14741】DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models
链接:https://arxiv.org/abs/2510.14741
作者:Simone Carnemolla,Matteo Pennisi,Sarinda Samarasinghe,Giovanni Bellitto,Simone Palazzo,Daniela Giordano,Mubarak Shah,Concetto Spampinato
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Understanding and explaining, machine learning models, trustworthy AI systems, explaining the behavior, behavior of machine
备注: Accepted to NeurIPS 2025 (spotlight)
点击查看摘要
Abstract:Understanding and explaining the behavior of machine learning models is essential for building transparent and trustworthy AI systems. We introduce DEXTER, a data-free framework that employs diffusion models and large language models to generate global, textual explanations of visual classifiers. DEXTER operates by optimizing text prompts to synthesize class-conditional images that strongly activate a target classifier. These synthetic samples are then used to elicit detailed natural language reports that describe class-specific decision patterns and biases. Unlike prior work, DEXTER enables natural language explanation about a classifier's decision process without access to training data or ground-truth labels. We demonstrate DEXTER's flexibility across three tasks-activation maximization, slice discovery and debiasing, and bias explanation-each illustrating its ability to uncover the internal mechanisms of visual classifiers. Quantitative and qualitative evaluations, including a user study, show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting. Code is available at this https URL.
42. 【2510.14737】Free-Grained Hierarchical Recognition
链接:https://arxiv.org/abs/2510.14737
作者:Seulki Park,Zilin Wang,Stella X. Yu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:typically assume complete, assumption rarely met, assume complete, met in practice, existing methods typically
备注: 26 pages
点击查看摘要
Abstract:Hierarchical image classification predicts labels across a semantic taxonomy, but existing methods typically assume complete, fine-grained annotations, an assumption rarely met in practice. Real-world supervision varies in granularity, influenced by image quality, annotator expertise, and task demands; a distant bird may be labeled Bird, while a close-up reveals Bald eagle. We introduce ImageNet-F, a large-scale benchmark curated from ImageNet and structured into cognitively inspired basic, subordinate, and fine-grained levels. Using CLIP as a proxy for semantic ambiguity, we simulate realistic, mixed-granularity labels reflecting human annotation behavior. We propose free-grain learning, with heterogeneous supervision across instances. We develop methods that enhance semantic guidance via pseudo-attributes from vision-language models and visual guidance via semi-supervised learning. These, along with strong baselines, substantially improve performance under mixed supervision. Together, our benchmark and methods advance hierarchical classification under real-world constraints.
43. 【2510.14726】Cross-Layer Feature Self-Attention Module for Multi-Scale Object Detection
链接:https://arxiv.org/abs/2510.14726
作者:Dingzhou Xie,Rushi Lan,Cheng Pang,Enhao Ning,Jiahao Zeng,Wei Zheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:made remarkable progress, improve feature discriminability, Recent object detection, Recent object, methods have made
备注:
点击查看摘要
Abstract:Recent object detection methods have made remarkable progress by leveraging attention mechanisms to improve feature discriminability. However, most existing approaches are confined to refining single-layer or fusing dual-layer features, overlooking the rich inter-layer dependencies across multi-scale representations. This limits their ability to capture comprehensive contextual information essential for detecting objects with large scale variations. In this paper, we propose a novel Cross-Layer Feature Self-Attention Module (CFSAM), which holistically models both local and global dependencies within multi-scale feature maps. CFSAM consists of three key components: a convolutional local feature extractor, a Transformer-based global modeling unit that efficiently captures cross-layer interactions, and a feature fusion mechanism to restore and enhance the original representations. When integrated into the SSD300 framework, CFSAM significantly boosts detection performance, achieving 78.6% mAP on PASCAL VOC (vs. 75.5% baseline) and 52.1% mAP on COCO (vs. 43.1% baseline), outperforming existing attention modules. Moreover, the module accelerates convergence during training without introducing substantial computational overhead. Our work highlights the importance of explicit cross-layer attention modeling in advancing multi-scale object detection.
44. 【2510.14713】Camera Movement Classification in Historical Footage: A Comparative Study of Deep Video Models
链接:https://arxiv.org/abs/2510.14713
作者:Tingyu Lin,Armin Dadras,Florian Kleber,Robert Sablatnig
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
关键词:narrative information essential, Camera movement conveys, movement conveys spatial, understanding video content, Camera movement
备注: 5 pages, accepted at AIROV2025
点击查看摘要
Abstract:Camera movement conveys spatial and narrative information essential for understanding video content. While recent camera movement classification (CMC) methods perform well on modern datasets, their generalization to historical footage remains unexplored. This paper presents the first systematic evaluation of deep video CMC models on archival film material. We summarize representative methods and datasets, highlighting differences in model design and label definitions. Five standard video classification models are assessed on the HISTORIAN dataset, which includes expert-annotated World War II footage. The best-performing model, Video Swin Transformer, achieves 80.25% accuracy, showing strong convergence despite limited training data. Our findings highlight the challenges and potential of adapting existing models to low-quality video and motivate future work combining diverse input modalities and temporal architectures.
45. 【2510.14709】Where are the Whales: A Human-in-the-loop Detection Method for Identifying Whales in High-resolution Satellite Imagery
链接:https://arxiv.org/abs/2510.14709
作者:Caleb Robinson,Kimberly T. Goetz,Christin B. Khan,Meredith Sackett,Kathleen Leonard,Rahul Dodhia,Juan M. Lavista Ferres
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:traditional survey methods, critical for conservation, difficult to scale, Effective monitoring, populations is critical
备注:
点击查看摘要
Abstract:Effective monitoring of whale populations is critical for conservation, but traditional survey methods are expensive and difficult to scale. While prior work has shown that whales can be identified in very high-resolution (VHR) satellite imagery, large-scale automated detection remains challenging due to a lack of annotated imagery, variability in image quality and environmental conditions, and the cost of building robust machine learning pipelines over massive remote sensing archives. We present a semi-automated approach for surfacing possible whale detections in VHR imagery using a statistical anomaly detection method that flags spatial outliers, i.e. "interesting points". We pair this detector with a web-based labeling interface designed to enable experts to quickly annotate the interesting points. We evaluate our system on three benchmark scenes with known whale annotations and achieve recalls of 90.3% to 96.4%, while reducing the area requiring expert inspection by up to 99.8% -- from over 1,000 sq km to less than 2 sq km in some cases. Our method does not rely on labeled training data and offers a scalable first step toward future machine-assisted marine mammal monitoring from space. We have open sourced this pipeline at this https URL.
46. 【2510.14705】Leveraging Learned Image Prior for 3D Gaussian Compression
链接:https://arxiv.org/abs/2510.14705
作者:Seungjoo Shin,Jaesik Park,Sunghyun Cho
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:recently achieved considerable, achieved considerable success, minimizing storage overhead, Gaussian Splatting, preserving high rendering
备注: Accepted to ICCV 2025 Workshop on ECLR
点击查看摘要
Abstract:Compression techniques for 3D Gaussian Splatting (3DGS) have recently achieved considerable success in minimizing storage overhead for 3D Gaussians while preserving high rendering quality. Despite the impressive storage reduction, the lack of learned priors restricts further advances in the rate-distortion trade-off for 3DGS compression tasks. To address this, we introduce a novel 3DGS compression framework that leverages the powerful representational capacity of learned image priors to recover compression-induced quality degradation. Built upon initially compressed Gaussians, our restoration network effectively models the compression artifacts in the image space between degraded and original Gaussians. To enhance the rate-distortion performance, we provide coarse rendering residuals into the restoration network as side information. By leveraging the supervision of restored images, the compressed Gaussians are refined, resulting in a highly compact representation with enhanced rendering performance. Our framework is designed to be compatible with existing Gaussian compression methods, making it broadly applicable across different baselines. Extensive experiments validate the effectiveness of our framework, demonstrating superior rate-distortion performance and outperforming the rendering quality of state-of-the-art 3DGS compression methods while requiring substantially less storage.
47. 【2510.14672】VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
链接:https://arxiv.org/abs/2510.14672
作者:Jinglei Zhang,Yuanfan Guo,Rolandos Alexandros Potamias,Jiankang Deng,Hang Xu,Chao Ma
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:garnered considerable attention, multimodal large language, large language models, recent years, considerable attention
备注: Accepted by ICCV 2025
点击查看摘要
Abstract:In recent years, video question answering based on multimodal large language models (MLLM) has garnered considerable attention, due to the benefits from the substantial advancements in LLMs. However, these models have a notable deficiency in the domains of video temporal grounding and reasoning, posing challenges to the development of effective real-world video understanding systems. Inspired by how humans use video players to interact with the progress bar for video comprehension, we introduce VTimeCoT, a simple yet effective training-free framework, designed for high-performance video grounding and reasoning. The proposed framework incorporates two novel visual tools of the progress bar: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool. In addition, to address the limitations of conventional text-based chain-of-thought (CoT) approaches, we introduce a visuotemporal CoT process that integrates cross-modality reasoning across both video and text. Our approach demonstrates significant performance improvements on both Qwen2VL-7B and GPT4o baselines in tasks of video temporal grounding and reasoning-based question answering. Finally, we showcase that the proposed framework achieves a compositional and interpretable reasoning process. Project page: this https URL
48. 【2510.14668】WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging
链接:https://arxiv.org/abs/2510.14668
作者:Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Sami Azam,Asif Karim,Jemima Beissbarth,Amanda Leach
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:static teacher-student framework, well-trained teacher transfers, teacher-student framework, traditionally relied, static teacher-student
备注:
点击查看摘要
Abstract:Knowledge distillation (KD) has traditionally relied on a static teacher-student framework, where a large, well-trained teacher transfers knowledge to a single student model. However, these approaches often suffer from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher model or large labeled datasets, which limits their effectiveness in real-world, limited-data scenarios. To address these, we present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that redefines knowledge transfer through a structured sequence of interconnected models. Unlike conventional KD, it forms a progressive distillation chain, where each model not only learns from its predecessor but also refines the knowledge before passing it forward. This structured knowledge transfer further enhances feature learning, reduces data dependency, and mitigates the limitations of one-step KD. Each model in the distillation chain is trained on only a fraction of the dataset and demonstrates that effective learning can be achieved with minimal supervision. Extensive evaluations across four otoscopic imaging datasets demonstrate that it not only matches but in many cases surpasses the performance of existing supervised methods. Experimental results on two other datasets further underscore its generalization across diverse medical imaging modalities, including microscopic and magnetic resonance imaging. Furthermore, our evaluations resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data, which highlights its potential for real-world adoption.
49. 【2510.14661】EuroMineNet: A Multitemporal Sentinel-2 Benchmark for Spatiotemporal Mining Footprint Analysis in the European Union (2015-2024)
链接:https://arxiv.org/abs/2510.14661
作者:Weikang Yu,Vincent Nwazelibe,Xianping Ma,Xiaokang Zhang,Richard Gloaguen,Xiao Xiang Zhu,Pedram Ghamisi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:contributing to deforestation, soil erosion, economic development, water contamination, activities are essential
备注:
点击查看摘要
Abstract:Mining activities are essential for industrial and economic development, but remain a leading source of environmental degradation, contributing to deforestation, soil erosion, and water contamination. Sustainable resource management and environmental governance require consistent, long-term monitoring of mining-induced land surface changes, yet existing datasets are often limited in temporal depth or geographic scope. To address this gap, we present EuroMineNet, the first comprehensive multitemporal benchmark for mining footprint mapping and monitoring based on Sentinel-2 multispectral imagery. Spanning 133 mining sites across the European Union, EuroMineNet provides annual observations and expert-verified annotations from 2015 to 2024, enabling GeoAI-based models to analyze environmental dynamics at a continental scale. It supports two sustainability-driven tasks: (1) multitemporal mining footprint mapping for consistent annual land-use delineation, evaluated with a novel Change-Aware Temporal IoU (CA-TIoU) metric, and (2) cross-temporal change detection to capture both gradual and abrupt surface transformations. Benchmarking 20 state-of-the-art deep learning models reveals that while GeoAI methods effectively identify long-term environmental changes, challenges remain in detecting short-term dynamics critical for timely mitigation. By advancing temporally consistent and explainable mining monitoring, EuroMineNet contributes to sustainable land-use management, environmental resilience, and the broader goal of applying GeoAI for social and environmental good. We release the codes and datasets by aligning with FAIR and the open science paradigm at this https URL.
50. 【2510.14657】Decorrelation Speeds Up Vision Transformers
链接:https://arxiv.org/abs/2510.14657
作者:Kieran Carrigg,Rob van Gastel,Melda Yeghaian,Sander Dalm,Faysal Boughorbel,Marcel van Gerven
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Masked Autoencoder, substantial computational costs, resource-constrained industrial settings, yields strong performance, integrating Decorrelated Backpropagation
备注: 15 pages, 12 figures, submitted to ICLR 2026
点击查看摘要
Abstract:Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4% and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method's applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.
51. 【2510.14648】In-Context Learning with Unpaired Clips for Instruction-based Video Editing
链接:https://arxiv.org/abs/2510.14648
作者:Xinyao Liao,Xianfang Zeng,Ziye Song,Zhoujie Fu,Gang Yu,Guosheng Lin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:video remains underexplored, constructing large-scale paired, instruction-based video editing, video editing datasets, instruction-based image editing
备注:
点击查看摘要
Abstract:Despite the rapid progress of instruction-based image editing, its extension to video remains underexplored, primarily due to the prohibitive cost and complexity of constructing large-scale paired video editing datasets. To address this challenge, we introduce a low-cost pretraining strategy for instruction-based video editing that leverages in-context learning from unpaired video clips. We show that pretraining a foundation video generation model with this strategy endows it with general editing capabilities, such as adding, replacing, or deleting operations, according to input editing instructions. The pretrained model can then be efficiently refined with a small amount of high-quality paired editing data. Built upon HunyuanVideoT2V, our framework first pretrains on approximately 1M real video clips to learn basic editing concepts, and subsequently fine-tunes on fewer than 150k curated editing pairs to extend more editing tasks and improve the editing quality. Comparative experiments show that our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity, achieving a 12\% improvement in editing instruction following and a 15\% improvement in editing quality.
52. 【2510.14634】SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation
链接:https://arxiv.org/abs/2510.14634
作者:Jihyun Yu,Yoojin Oh,Wonho Bae,Mingyu Kim,Junhyug Noh
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:correct performance degradation, unlabeled test data, Test-time adaptation, aims to correct, correct performance
备注:
点击查看摘要
Abstract:Test-time adaptation (TTA) aims to correct performance degradation of deep models under distribution shifts by updating models or inputs using unlabeled test data. Input-only diffusion-based TTA methods improve robustness for classification to corruptions but rely on gradient guidance, limiting exploration and generalization across distortion types. We propose SteeringTTA, an inference-only framework that adapts Feynman-Kac steering to guide diffusion-based input adaptation for classification with rewards driven by pseudo-label. SteeringTTA maintains multiple particle trajectories, steered by a combination of cumulative top-K probabilities and an entropy schedule, to balance exploration and confidence. On ImageNet-C, SteeringTTA consistently outperforms the baseline without any model updates or source data.
53. 【2510.14630】Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
链接:https://arxiv.org/abs/2510.14630
作者:Ming Gui,Johannes Schusterbauer,Timy Phan,Felix Krause,Josh Susskind,Miguel Angel Bautista,Björn Ommer
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:introduce Representation Tokenizer, self-supervised vision transformers, single continuous latent, Representation Tokenizer, vision transformers
备注: Code: [this https URL](https://github.com/CompVis/RepTok)
点击查看摘要
Abstract:We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
54. 【2510.14627】GOPLA: Generalizable Object Placement Learning via Synthetic Augmentation of Human Arrangement
链接:https://arxiv.org/abs/2510.14627
作者:Yao Zhong,Hanzhi Chen,Simon Schaefer,Anran Zhang,Stefan Leutenegger
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:everyday household organization, Robots are expected, intelligent assistants, household organization, expected to serve
备注:
点击查看摘要
Abstract:Robots are expected to serve as intelligent assistants, helping humans with everyday household organization. A central challenge in this setting is the task of object placement, which requires reasoning about both semantic preferences (e.g., common-sense object relations) and geometric feasibility (e.g., collision avoidance). We present GOPLA, a hierarchical framework that learns generalizable object placement from augmented human demonstrations. A multi-modal large language model translates human instructions and visual inputs into structured plans that specify pairwise object relationships. These plans are then converted into 3D affordance maps with geometric common sense by a spatial mapper, while a diffusion-based planner generates placement poses guided by test-time costs, considering multi-plan distributions and collision avoidance. To overcome data scarcity, we introduce a scalable pipeline that expands human placement demonstrations into diverse synthetic training data. Extensive experiments show that our approach improves placement success rates by 30.04 percentage points over the runner-up, evaluated on positioning accuracy and physical plausibility, demonstrating strong generalization across a wide range of real-world robotic placement scenarios.
55. 【2510.14624】Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
链接:https://arxiv.org/abs/2510.14624
作者:Natan Bagrov,Eugene Khvedchenia,Borys Tymchenko,Shay Aharon,Lior Kadoch,Tomer Keren,Ofri Masad,Yonatan Geifman,Ran Zilberstein,Tuomas Rintamaki,Matthieu Le,Andrew Tao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:processing dense frame, Efficient Video Sampling, Vision-language models, recently expanded, scalability is fundamentally
备注:
点击查看摘要
Abstract:Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches -- spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity, requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
56. 【2510.14617】Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding
链接:https://arxiv.org/abs/2510.14617
作者:Ning Ding,Keisuke Fujii,Toru Tamaki
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:badminton involves interpreting, tactic, individual actions, involves interpreting, Tactical understanding
备注: 9 pages, 3 figures. Accepted to ACM MMSports 2025
点击查看摘要
Abstract:Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are dynamically executed over time. In this paper, we propose \textbf{Shot2Tactic-Caption}, a novel framework for semantic and temporal multi-scale video captioning in badminton, capable of generating shot-level captions that describe individual actions and tactic-level captions that capture how these actions unfold over time within a tactical execution. We also introduce the Shot2Tactic-Caption Dataset, the first badminton captioning dataset containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design, with both branches including a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder to generate shot and tactic captions. To support tactic captioning, we additionally introduce a Tactic Unit Detector that identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). For tactic captioning, we further incorporate a shot-wise prompt-guided mechanism, where the predicted tactic type and state are embedded as prompts and injected into the decoder via cross-attention. The shot-wise prompt-guided mechanism enables our system not only to describe successfully executed tactics but also to capture tactical executions that are temporarily interrupted and later resumed. Experimental results demonstrate the effectiveness of our framework in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants, and that shot-wise prompt structuring leads to more coherent and accurate tactic captioning.
57. 【2510.14605】Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering
链接:https://arxiv.org/abs/2510.14605
作者:Yuyang Hong,Jiaqi Gu,Qi Yang,Lubin Fan,Yue Wu,Ying Wang,Kun Ding,Shiming Xiang,Jieping Ye
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Knowledge-based visual question, visual question answering, Knowledge-based visual, question answering, understanding with external
备注: Accepted by NeurIPS 2025
点击查看摘要
Abstract:Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by combining knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained with answer accuracy and format consistency as reward signals via a reinforcement learning manner. This enhances the model's reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements~(36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at this https URL
58. 【2510.14596】Zero-Shot Wildlife Sorting Using Vision Transformers: Evaluating Clustering and Continuous Similarity Ordering
链接:https://arxiv.org/abs/2510.14596
作者:Hugo Markoff,Jevgenijs Galaktionovs
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Camera traps generate, traps generate millions, camera trap analysis, Animal Detect platform, existing classifiers
备注: Extended abstract. Submitted to AICC: Workshop on AI for Climate and Conservation - EurIPS 2025 (non-archival)
点击查看摘要
Abstract:Camera traps generate millions of wildlife images, yet many datasets contain species that are absent from existing classifiers. This work evaluates zero-shot approaches for organizing unlabeled wildlife imagery using self-supervised vision transformers, developed and tested within the Animal Detect platform for camera trap analysis. We compare unsupervised clustering methods (DBSCAN, GMM) across three architectures (CLIP, DINOv2, MegaDescriptor) combined with dimensionality reduction techniques (PCA, UMAP), and we demonstrate continuous 1D similarity ordering via t-SNE projection. On a 5-species test set with ground truth labels used only for evaluation, DINOv2 with UMAP and GMM achieves 88.6 percent accuracy (macro-F1 = 0.874), while 1D sorting reaches 88.2 percent coherence for mammals and birds and 95.2 percent for fish across 1,500 images. Based on these findings, we deployed continuous similarity ordering in production, enabling rapid exploratory analysis and accelerating manual annotation workflows for biodiversity monitoring.
59. 【2510.14594】Hierarchical Re-Classification: Combining Animal Classification Models with Vision Transformers
链接:https://arxiv.org/abs/2510.14594
作者:Hugo Markoff,Jevgenijs Galaktionovs
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:conservative rollup strategies, high taxonomic levels, animal classification models, SpeciesNet provide predictions, rollup strategies
备注: Extended abstract. Submitted to AICC: Workshop on AI for Climate and Conservation - EurIPS 2025 (non-archival)
点击查看摘要
Abstract:State-of-the-art animal classification models like SpeciesNet provide predictions across thousands of species but use conservative rollup strategies, resulting in many animals labeled at high taxonomic levels rather than species. We present a hierarchical re-classification system for the Animal Detect platform that combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings and metric learning to refine high-level taxonomic labels toward species-level identification. Our five-stage pipeline (high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring) is evaluated on a segment of the LILA BC Desert Lion Conservation dataset (4,018 images, 15,031 detections). After recovering 761 bird detections from "blank" and "animal" labels, we re-classify 456 detections labeled animal, mammal, or blank with 96.5% accuracy, achieving species-level identification for 64.9 percent
60. 【2510.14588】STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding
链接:https://arxiv.org/abs/2510.14588
作者:Zhifei Chen,Tianshuo Xu,Leyi Wu,Luozhou Wang,Dongyu Yan,Zihan You,Wenting Luo,Guo Zhang,Yingcong Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:striking visual progress, interactions remains difficult, recently made striking, made striking visual, maintaining coherent object
备注: Code, model, and demos can be found at [this https URL](https://envision-research.github.io/STANCE/)
点击查看摘要
Abstract:Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues -- a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatial-addressable rotary embeddings. Paired with joint RGB \(+\) auxiliary-map prediction (segmentation or depth), our model anchors structure while RGB handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.
61. 【2510.14583】alking Points: Describing and Localizing Pixels
链接:https://arxiv.org/abs/2510.14583
作者:Matan Rusanovsky,Shimon Malnick,Shai Avidan
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:achieved remarkable success, achieved remarkable, remarkable success, success in cross-modal, Point
备注:
点击查看摘要
Abstract:Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context. Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy. To evaluate our results we establish a new evaluation protocol. Instead of comparing the text description produced by our method to the ground truth, we use the localizer to determine how close is the predicted point generated to the ground truth point. Experiments demonstrate superior performance compared to baseline models on this http URL bidirectional nature of our framework should enable future applications in both keypoint-guided image understanding and language-guided precise localization. Our code and dataset are publicly available at this https URL.
62. 【2510.14576】CALM-Net: Curvature-Aware LiDAR Point Cloud-based Multi-Branch Neural Network for Vehicle Re-Identification
链接:https://arxiv.org/abs/2510.14576
作者:Dongwook Lee,Sol Han,Jinwhan Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:paper presents CALM-Net, multi-branch neural network, paper presents, neural network, curvature-aware LiDAR point
备注: 10 pages, 7 figures
点击查看摘要
Abstract:This paper presents CALM-Net, a curvature-aware LiDAR point cloud-based multi-branch neural network for vehicle re-identification. The proposed model addresses the challenge of learning discriminative and complementary features from three-dimensional point clouds to distinguish between vehicles. CALM-Net employs a multi-branch architecture that integrates edge convolution, point attention, and a curvature embedding that characterizes local surface variation in point clouds. By combining these mechanisms, the model learns richer geometric and contextual features that are well suited for the re-identification task. Experimental evaluation on the large-scale nuScenes dataset demonstrates that CALM-Net achieves a mean re-identification accuracy improvement of approximately 1.97\% points compared with the strongest baseline in our study. The results confirms the effectiveness of incorporating curvature information into deep learning architectures and highlight the benefit of multi-branch feature learning for LiDAR point cloud-based vehicle re-identification.
63. 【2510.14564】BalanceGS: Algorithm-System Co-design for Efficient 3D Gaussian Splatting Training on GPU
链接:https://arxiv.org/abs/2510.14564
作者:Junyi Wu,Jiaming Xu,Jinhao Li,Yongkang Zhou,Jiayi Pan,Xingyang Li,Guohao Dai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian, Gaussian densification, color splatting, Gaussian Splatting, Gaussian projection
备注: Accepted by ASP-DAC 2026
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has emerged as a promising 3D reconstruction technique. The traditional 3DGS training pipeline follows three sequential steps: Gaussian densification, Gaussian projection, and color splatting. Despite its promising reconstruction quality, this conventional approach suffers from three critical inefficiencies: (1) Skewed density allocation during Gaussian densification, (2) Imbalanced computation workload during Gaussian projection and (3) Fragmented memory access during color splatting. To tackle the above challenges, we introduce BalanceGS, the algorithm-system co-design for efficient training in 3DGS. (1) At the algorithm level, we propose heuristic workload-sensitive Gaussian density control to automatically balance point distributions - removing 80% redundant Gaussians in dense regions while filling gaps in sparse areas. (2) At the system level, we propose Similarity-based Gaussian sampling and merging, which replaces the static one-to-one thread-pixel mapping with adaptive workload distribution - threads now dynamically process variable numbers of Gaussians based on local cluster density. (3) At the mapping level, we propose reordering-based memory access mapping strategy that restructures RGB storage and enables batch loading in shared memory. Extensive experiments demonstrate that compared with 3DGS, our approach achieves a 1.44$\times$ training speedup on a NVIDIA A100 GPU with negligible quality degradation.
Comments:
Accepted by ASP-DAC 2026
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2510.14564 [cs.CV]
(or
arXiv:2510.14564v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2510.14564
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
64. 【2510.14560】Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video
链接:https://arxiv.org/abs/2510.14560
作者:Yulin Zhang,Cheng Shi,Yang Wang,Sibei Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:human-like settings, moving beyond mere, actively understand, unfolding events, capable of functioning
备注: Accepted at NeurIPS 2025 (preview; camera-ready in preparation)
点击查看摘要
Abstract:Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP-Bench (Ego Streaming Proactive Benchmark) alongside the ESTP-F1 metric-a novel framework designed for their rigorous assessment. Secondly, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi-stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while outperforming multiple baselines across diverse online and offline benchmarks. Project Page:this https URL
65. 【2510.14553】Consistent text-to-image generation via scene de-contextualization
链接:https://arxiv.org/abs/2510.14553
作者:Song Tang,Peihao Gong,Kunyu Li,Kai Guo,Boyu Wang,Mao Ye,Jianwei Zhang,Xiatian Zhu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:produce identity-preserving images, generation seeks, seeks to produce, produce identity-preserving, fails due
备注:
点击查看摘要
Abstract:Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-ID correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I's built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-ID correlation within the ID prompt's embedding by quantifying the SVD directional stability to adaptively re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene use (one scene per prompt) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.
66. 【2510.14543】Exploring Cross-Modal Flows for Few-Shot Learning
链接:https://arxiv.org/abs/2510.14543
作者:Ziqi Jiang,Yanghao Wang,Long Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Aligning features, PEFT methods, fundamental challenges, PEFT, Aligning
备注: 13 pages, 6 figures
点击查看摘要
Abstract:Aligning features from different modalities, is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. It is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
67. 【2510.14536】Exploring Image Representation with Decoupled Classical Visual Descriptors
链接:https://arxiv.org/abs/2510.14536
作者:Chenyuan Qu,Hao Chen,Jianbo Jiao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:computer vision, efficient image representations, long-standing challenge, challenge in computer, understanding efficient image
备注: Accepted by The 36th British Machine Vision Conference (BMVC 2025)
点击查看摘要
Abstract:Exploring and understanding efficient image representations is a long-standing challenge in computer vision. While deep learning has achieved remarkable progress across image understanding tasks, its internal representations are often opaque, making it difficult to interpret how visual information is processed. In contrast, classical visual descriptors (e.g. edge, colour, and intensity distribution) have long been fundamental to image analysis and remain intuitively understandable to humans. Motivated by this gap, we ask a central question: Can modern learning benefit from these classical cues? In this paper, we answer it with VisualSplit, a framework that explicitly decomposes images into decoupled classical descriptors, treating each as an independent but complementary component of visual knowledge. Through a reconstruction-driven pre-training scheme, VisualSplit learns to capture the essence of each visual descriptor while preserving their interpretability. By explicitly decomposing visual attributes, our method inherently facilitates effective attribute control in various advanced visual tasks, including image generation and editing, extending beyond conventional classification and segmentation, suggesting the effectiveness of this new learning approach for visual understanding. Project page: this https URL.
68. 【2510.14535】Acquisition of interpretable domain information during brain MR image harmonization for content-based image retrieval
链接:https://arxiv.org/abs/2510.14535
作者:Keima Abe,Hayato Muraki,Shuhei Tomoshige,Kenichi Oishi,Hitoshi Iyatomi
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:imaging sites due, boldsymbol, show domain shifts, degrade machine learning, protocol differences
备注: 6 pages,3 figures, 3 tables. Accepted at 2025 IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC 2025)
点击查看摘要
Abstract:Medical images like MR scans often show domain shifts across imaging sites due to scanner and protocol differences, which degrade machine learning performance in tasks such as disease classification. Domain harmonization is thus a critical research focus. Recent approaches encode brain images $\boldsymbol{x}$ into a low-dimensional latent space $\boldsymbol{z}$, then disentangle it into $\boldsymbol{z_u}$ (domain-invariant) and $\boldsymbol{z_d}$ (domain-specific), achieving strong results. However, these methods often lack interpretability$-$an essential requirement in medical applications$-$leaving practical issues unresolved. We propose Pseudo-Linear-Style Encoder Adversarial Domain Adaptation (PL-SE-ADA), a general framework for domain harmonization and interpretable representation learning that preserves disease-relevant information in brain MR images. PL-SE-ADA includes two encoders $f_E$ and $f_{SE}$ to extract $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, a decoder to reconstruct the image $f_D$, and a domain predictor $g_D$. Beyond adversarial training between the encoder and domain predictor, the model learns to reconstruct the input image $\boldsymbol{x}$ by summing reconstructions from $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, ensuring both harmonization and informativeness. Compared to prior methods, PL-SE-ADA achieves equal or better performance in image reconstruction, disease classification, and domain recognition. It also enables visualization of both domain-independent brain features and domain-specific components, offering high interpretability across the entire framework.
69. 【2510.14532】owards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology
链接:https://arxiv.org/abs/2510.14532
作者:Xinrui Huang,Fan Xiao,Dongming He,Anqi Gao,Dandan Li,Xiaofan Zhang,Shaoting Zhang,Xudong Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:maxillofacial radiology plays, trained professionals, radiographic image interpretation, maxillofacial radiology, radiology plays
备注:
点击查看摘要
Abstract:Oral and maxillofacial radiology plays a vital role in dental healthcare, but radiographic image interpretation is limited by a shortage of trained professionals. While AI approaches have shown promise, existing dental AI systems are restricted by their single-modality focus, task-specific design, and reliance on costly labeled data, hindering their generalization across diverse clinical scenarios. To address these challenges, we introduce DentVFM, the first family of vision foundation models (VFMs) designed for dentistry. DentVFM generates task-agnostic visual representations for a wide range of dental applications and uses self-supervised learning on DentVista, a large curated dental imaging dataset with approximately 1.6 million multi-modal radiographic images from various medical centers. DentVFM includes 2D and 3D variants based on the Vision Transformer (ViT) architecture. To address gaps in dental intelligence assessment and benchmarks, we introduce DentBench, a comprehensive benchmark covering eight dental subspecialties, more diseases, imaging modalities, and a wide geographical distribution. DentVFM shows impressive generalist intelligence, demonstrating robust generalization to diverse dental tasks, such as disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation. Experimental results indicate DentVFM significantly outperforms supervised, self-supervised, and weakly supervised baselines, offering superior generalization, label efficiency, and scalability. Additionally, DentVFM enables cross-modality diagnostics, providing more reliable results than experienced dentists in situations where conventional imaging is unavailable. DentVFM sets a new paradigm for dental AI, offering a scalable, adaptable, and label-efficient model to improve intelligent dental healthcare and address critical gaps in global oral healthcare.
70. 【2510.14528】PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
链接:https://arxiv.org/abs/2510.14528
作者:Cheng Cui,Ting Sun,Suyin Liang,Tingquan Gao,Zelun Zhang,Jiaxuan Liu,Xueqing Wang,Changda Zhou,Hongen Liu,Manhui Lin,Yue Zhang,Yubo Zhang,Handong Zheng,Jing Zhang,Jun Zhang,Yi Liu,Dianhai Yu,Yanjun Ma
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:resource-efficient model tailored, resource-efficient model, model tailored, document parsing, propose PaddleOCR-VL
备注:
点击查看摘要
Abstract:In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios. Code is available at this https URL .
71. 【2510.14526】Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models
链接:https://arxiv.org/abs/2510.14526
作者:Yunze Tong,Didi Zhu,Zijing Hu,Jinluan Yang,Ziyu Zhao
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:pretrained Stable Diffusion, Stable Diffusion, pretrained Stable, induce distinct denoising, distinct denoising paths
备注: Appendix will be appended soon
点击查看摘要
Abstract:In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.
72. 【2510.14525】Real-Time Surgical Instrument Defect Detection via Non-Destructive Testing
链接:https://arxiv.org/abs/2510.14525
作者:Qurrat Ul Ain,Atif Aftab Ahmed Jilani,Zunaira Shafqat,Nigar Azhar Butt
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Defective surgical instruments, Defective surgical, mechanical integrity, risks to sterility, patient safety
备注:
点击查看摘要
Abstract:Defective surgical instruments pose serious risks to sterility, mechanical integrity, and patient safety, increasing the likelihood of surgical complications. However, quality control in surgical instrument manufacturing often relies on manual inspection, which is prone to human error and inconsistency. This study introduces SurgScan, an AI-powered defect detection framework for surgical instruments. Using YOLOv8, SurgScan classifies defects in real-time, ensuring high accuracy and industrial scalability. The model is trained on a high-resolution dataset of 102,876 images, covering 11 instrument types and five major defect categories. Extensive evaluation against state-of-the-art CNN architectures confirms that SurgScan achieves the highest accuracy (99.3%) with real-time inference speeds of 4.2-5.8 ms per image, making it suitable for industrial deployment. Statistical analysis demonstrates that contrast-enhanced preprocessing significantly improves defect detection, addressing key limitations in visual inspection. SurgScan provides a scalable, cost-effective AI solution for automated quality control, reducing reliance on manual inspection while ensuring compliance with ISO 13485 and FDA standards, paving the way for enhanced defect detection in medical manufacturing.
73. 【2510.14516】Vision Mamba for Permeability Prediction of Porous Media
链接:https://arxiv.org/abs/2510.14516
作者:Ali Kashefi,Tapan Mukerji
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Mamba, recently received attention, Vision Transformers, Vision Mamba scales, Vision
备注:
点击查看摘要
Abstract:Vision Mamba has recently received attention as an alternative to Vision Transformers (ViTs) for image classification. The network size of Vision Mamba scales linearly with input image resolution, whereas ViTs scale quadratically, a feature that improves computational and memory efficiency. Moreover, Vision Mamba requires a significantly smaller number of trainable parameters than traditional convolutional neural networks (CNNs), and thus, they can be more memory efficient. Because of these features, we introduce, for the first time, a neural network that uses Vision Mamba as its backbone for predicting the permeability of three-dimensional porous media. We compare the performance of Vision Mamba with ViT and CNN models across multiple aspects of permeability prediction and perform an ablation study to assess the effects of its components on accuracy. We demonstrate in practice the aforementioned advantages of Vision Mamba over ViTs and CNNs in the permeability prediction of three-dimensional porous media. We make the source code publicly available to facilitate reproducibility and to enable other researchers to build on and extend this work. We believe the proposed framework has the potential to be integrated into large vision models in which Vision Mamba is used instead of ViTs.
74. 【2510.14493】Grazing Detection using Deep Learning and Sentinel-2 Time Series Data
链接:https://arxiv.org/abs/2510.14493
作者:Aleksis Pirinen,Delia Fano Yela,Smita Chakraborty,Erik Källman
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:occurs remains limited, grazing occurs remains, production and biodiversity, remains limited, shapes both agricultural
备注: Code and models: [this https URL](https://github.com/aleksispi/pib-ml-grazing)
点击查看摘要
Abstract:Grazing shapes both agricultural production and biodiversity, yet scalable monitoring of where grazing occurs remains limited. We study seasonal grazing detection from Sentinel-2 L2A time series: for each polygon-defined field boundary, April-October imagery is used for binary prediction (grazed / not grazed). We train an ensemble of CNN-LSTM models on multi-temporal reflectance features, and achieve an average F1 score of 77 percent across five validation splits, with 90 percent recall on grazed pastures. Operationally, if inspectors can visit at most 4 percent of sites annually, prioritising fields predicted by our model as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection. These results indicate that coarse-resolution, freely available satellite data can reliably steer inspection resources for conservation-aligned land-use compliance. Code and models have been made publicly available.
75. 【2510.14463】Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration
链接:https://arxiv.org/abs/2510.14463
作者:Thomas Katraouras,Dimitrios Rafailidis
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:delivering visually appealing, visually appealing content, Image restoration, web platforms, Image
备注: Accepted at WI-IAT 2025
点击查看摘要
Abstract:Image quality is a critical factor in delivering visually appealing content on web platforms. However, images often suffer from degradation due to lossy operations applied by online social networks (OSNs), negatively affecting user experience. Image restoration is the process of recovering a clean high-quality image from a given degraded input. Recently, multi-task (all-in-one) image restoration models have gained significant attention, due to their ability to simultaneously handle different types of image degradations. However, these models often come with an excessively high number of trainable parameters, making them computationally inefficient. In this paper, we propose a strategy for compressing multi-task image restoration models. We aim to discover highly sparse subnetworks within overparameterized deep models that can match or even surpass the performance of their dense counterparts. The proposed model, namely MIR-L, utilizes an iterative pruning strategy that removes low-magnitude weights across multiple rounds, while resetting the remaining weights to their original initialization. This iterative process is important for the multi-task image restoration model's optimization, effectively uncovering "winning tickets" that maintain or exceed state-of-the-art performance at high sparsity levels. Experimental evaluation on benchmark datasets for the deraining, dehazing, and denoising tasks shows that MIR-L retains only 10% of the trainable parameters while maintaining high image restoration performance. Our code, datasets and pre-trained models are made publicly available at this https URL.
76. 【2510.14462】Unsupervised Deep Generative Models for Anomaly Detection in Neuroimaging: A Systematic Scoping Review
链接:https://arxiv.org/abs/2510.14462
作者:Youwan Mahé,Elise Bannier,Stéphanie Leplaideur,Elisa Fromont,Francesca Galassi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Unsupervised deep generative, deep generative models, Unsupervised deep, promising alternative, methods for detecting
备注:
点击查看摘要
Abstract:Unsupervised deep generative models are emerging as a promising alternative to supervised methods for detecting and segmenting anomalies in brain imaging. Unlike fully supervised approaches, which require large voxel-level annotated datasets and are limited to well-characterised pathologies, these models can be trained exclusively on healthy data and identify anomalies as deviations from learned normative brain structures. This PRISMA-guided scoping review synthesises recent work on unsupervised deep generative models for anomaly detection in neuroimaging, including autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models. A total of 49 studies published between 2018 - 2025 were identified, covering applications to brain MRI and, less frequently, CT across diverse pathologies such as tumours, stroke, multiple sclerosis, and small vessel disease. Reported performance metrics are compared alongside architectural design choices. Across the included studies, generative models achieved encouraging performance for large focal lesions and demonstrated progress in addressing more subtle abnormalities. A key strength of generative models is their ability to produce interpretable pseudo-healthy (also referred to as counterfactual) reconstructions, which is particularly valuable when annotated data are scarce, as in rare or heterogeneous diseases. Looking ahead, these models offer a compelling direction for anomaly detection, enabling semi-supervised learning, supporting the discovery of novel imaging biomarkers, and facilitating within- and cross-disease deviation mapping in unified end-to-end frameworks. To realise clinical impact, future work should prioritise anatomy-aware modelling, development of foundation models, task-appropriate evaluation metrics, and rigorous clinical validation.
77. 【2510.14460】Structured Universal Adversarial Attacks on Object Detection for Video Sequences
链接:https://arxiv.org/abs/2510.14460
作者:Sven Jacob,Weijia Shao,Gjergji Kasneci
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Video-based object detection, Video-based object, safety-critical applications, object detection plays, plays a vital
备注: Accepted at GCPR 2025 (German Conference on Pattern Recognition). This is a different version as submitted to the conference, not the official conference proceedings
点击查看摘要
Abstract:Video-based object detection plays a vital role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. In this work, we propose a minimally distorted universal adversarial attack tailored for video object detection, which leverages nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, we employ an adaptive, optimistic exponentiated gradient method that enhances both scalability and convergence. Our results demonstrate that the proposed attack outperforms both low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness. All code and data are publicly available at this https URL.
78. 【2510.14431】Real-Time Neural Video Compression with Unified Intra and Inter Coding
链接:https://arxiv.org/abs/2510.14431
作者:Hui Xiang,Yifan Bian,Li Li,Jingran Wu,Xianguo Zhang,Dong Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Neural video compression, offer superior compression, superior compression efficiency, Neural video, technologies have advanced
备注: 10 pages
点击查看摘要
Abstract:Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 10.7\% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.
79. 【2510.14427】Deep Compositional Phase Diffusion for Long Motion Sequence Generation
链接:https://arxiv.org/abs/2510.14427
作者:Ho Yin Au,Jie Chen,Junkun Jiang,Jingyu Xiang
类目:Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
关键词:Phase Diffusion Module, shown significant progress, Recent research, Semantic Phase Diffusion, Diffusion Module
备注: Accepted by NeurIPS 2025 (Oral)
点击查看摘要
Abstract:Recent research on motion generation has shown significant progress in generating semantically aligned motion with singular semantics. However, when employing these models to create composite sequences containing multiple semantically generated motion clips, they often struggle to preserve the continuity of motion dynamics at the transition boundaries between clips, resulting in awkward transitions and abrupt artifacts. To address these challenges, we present Compositional Phase Diffusion, which leverages the Semantic Phase Diffusion Module (SPDM) and Transitional Phase Diffusion Module (TPDM) to progressively incorporate semantic guidance and phase details from adjacent motion clips into the diffusion process. Specifically, SPDM and TPDM operate within the latent motion frequency domain established by the pre-trained Action-Centric Motion Phase Autoencoder (ACT-PAE). This allows them to learn semantically important and transition-aware phase information from variable-length motion clips during training. Experimental results demonstrate the competitive performance of our proposed framework in generating compositional motion sequences that align semantically with the input conditions, while preserving phase transitional continuity between preceding and succeeding motion clips. Additionally, motion inbetweening task is made possible by keeping the phase parameter of the input motion sequences fixed throughout the diffusion process, showcasing the potential for extending the proposed framework to accommodate various application scenarios. Codes are available at this https URL.
80. 【2510.14403】DCMIL: A Progressive Representation Learning Model of Whole Slide Images for Cancer Prognosis Analysis
链接:https://arxiv.org/abs/2510.14403
作者:Chao Tu,Kun Huang,Jie Zhang,Qianjin Feng,Yu Zhang,Zhenyuan Ning
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:pathology shows promise, computational pathology shows, objective prognostic modes, develop objective prognostic, quantify morphological heterogeneity
备注:
点击查看摘要
Abstract:The burgeoning discipline of computational pathology shows promise in harnessing whole slide images (WSIs) to quantify morphological heterogeneity and develop objective prognostic modes for human cancers. However, progress is impeded by the computational bottleneck of gigapixel-size inputs and the scarcity of dense manual annotations. Current methods often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Here, we propose an easy-to-hard progressive representation learning, termed dual-curriculum contrastive multi-instance learning (DCMIL), to efficiently process WSIs for cancer prognosis. The model does not rely on dense annotations and enables the direct transformation of gigapixel-size WSIs into outcome predictions. Extensive experiments on twelve cancer types (5,954 patients, 12.54 million tiles) demonstrate that DCMIL outperforms standard WSI-based prognostic models. Additionally, DCMIL identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimation, and captures morphological differences between normal and tumor tissues, with the potential to generate new biological insights. All codes have been made publicly accessible at this https URL.
81. 【2510.14389】BoardVision: Deployment-ready and Robust Motherboard Defect Detection with YOLO+Faster-RCNN Ensemble
链接:https://arxiv.org/abs/2510.14389
作者:Brandon Hill,Kma Solaiman
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:high-volume electronics manufacturing, critical for ensuring, ensuring reliability, reliability in high-volume, high-volume electronics
备注: This paper has been submitted to IEEE/CVF WACV 2026 Applications track and is currently under review
点击查看摘要
Abstract:Motherboard defect detection is critical for ensuring reliability in high-volume electronics manufacturing. While prior research in PCB inspection has largely targeted bare-board or trace-level defects, assembly-level inspection of full motherboards inspection remains underexplored. In this work, we present BoardVision, a reproducible framework for detecting assembly-level defects such as missing screws, loose fan wiring, and surface scratches. We benchmark two representative detectors - YOLOv7 and Faster R-CNN, under controlled conditions on the MiracleFactory motherboard dataset, providing the first systematic comparison in this domain. To mitigate the limitations of single models, where YOLO excels in precision but underperforms in recall and Faster R-CNN shows the reverse, we propose a lightweight ensemble, Confidence-Temporal Voting (CTV Voter), that balances precision and recall through interpretable rules. We further evaluate robustness under realistic perturbations including sharpness, brightness, and orientation changes, highlighting stability challenges often overlooked in motherboard defect detection. Finally, we release a deployable GUI-driven inspection tool that bridges research evaluation with operator usability. Together, these contributions demonstrate how computer vision techniques can transition from benchmark results to practical quality assurance for assembly-level motherboard manufacturing.
82. 【2510.14383】DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights
链接:https://arxiv.org/abs/2510.14383
作者:Danish Ali,Ajmal Mian,Naveed Akhtar,Ghulam Mubashar Hassan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate brain tumor, Accurate brain, diagnosis and treatment, clinical diagnosis, State Space Models
备注:
点击查看摘要
Abstract:Accurate brain tumor segmentation is significant for clinical diagnosis and treatment. It is challenging due to the heterogeneity of tumor subregions. Mamba-based State Space Models have demonstrated promising performance. However, they incur significant computational overhead due to sequential feature computation across multiple spatial axes. Moreover, their robustness across diverse BraTS data partitions remains largely unexplored, leaving a critical gap in reliable evaluation. To address these limitations, we propose dual-resolution bi-directional Mamba (DRBD-Mamba), an efficient 3D segmentation model that captures multi-scale long-range dependencies with minimal computational overhead. We leverage a space-filling curve to preserve spatial locality during 3D-to-1D feature mapping, thereby reducing reliance on computationally expensive multi-axial feature scans. To enrich feature representation, we propose a gated fusion module that adaptively integrates forward and reverse contexts, along with a quantization block that discretizes features to improve robustness. In addition, we propose five systematic folds on BraTS2023 for rigorous evaluation of segmentation techniques under diverse conditions and present detailed analysis of common failure scenarios. On the 20\% test set used by recent methods, our model achieves Dice improvements of 0.10\% for whole tumor, 1.75\% for tumor core, and 0.93\% for enhancing tumor. Evaluations on the proposed systematic five folds demonstrate that our model maintains competitive whole tumor accuracy while achieving clear average Dice gains of 0.86\% for tumor core and 1.45\% for enhancing tumor over existing state-of-the-art. Furthermore, our model attains 15 times improvement in efficiency while maintaining high segmentation accuracy, highlighting its robustness and computational advantage over existing approaches.
83. 【2510.14376】DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation
链接:https://arxiv.org/abs/2510.14376
作者:Dongnam Byun,Jungwon Park,Jumgmin Ko,Changin Choi,Wonjong Rhee
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent progress, Dissimilar Background Biases, generating high-quality images, high-quality images aligned, led to significant
备注:
点击查看摘要
Abstract:Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.
84. 【2510.14374】Spatial Preference Rewarding for MLLMs Spatial Understanding
链接:https://arxiv.org/abs/2510.14374
作者:Han Qiu,Peng Gao,Lewei Lu,Xiaoqin Zhang,Ling Shao,Shijian Lu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal large language, Multimodal large, demonstrated promising spatial, large language models, promising spatial understanding
备注: ICCV 2025
点击查看摘要
Abstract:Multimodal large language models~(MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities, such as generating detailed region descriptions or accurately localizing objects. Additionally, they often fail to respond to the user's requirements for desired fine-grained spatial understanding. This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs' actual responses. We address this issue by SPR, a Spatial Preference Rewarding~(SPR) approach that enhances MLLMs' spatial capabilities by rewarding MLLMs' detailed responses with precise object localization over vague or inaccurate responses. With randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality in MLLM-generated descriptions. We also refine the MLLM descriptions with better localization accuracy and pair the best-scored refinement with the initial descriptions of the lowest score for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments over standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding capabilities effectively with minimal overhead in training. Data and code will be released at this https URL
85. 【2510.14359】AI for Service: Proactive Assistance with AI Glasses
链接:https://arxiv.org/abs/2510.14359
作者:Zichen Wen,Yiyu Wang,Chenfei Liao,Boxue Yang,Junxian Li,Weifeng Liu,Haocong He,Bolong Feng,Xuyang Liu,Yuanhuiyi Lyu,Xu Zheng,Xuming Hu,Linfeng Zhang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:adaptive companion, daily life, active and adaptive, paradigm that enables, enables proactive
备注: 24 pages, 5 figures, work in progress
点击查看摘要
Abstract:In an era where AI is evolving from a passive tool into an active and adaptive companion, we introduce AI for Service (AI4Service), a new paradigm that enables proactive and real-time assistance in daily life. Existing AI services remain largely reactive, responding only to explicit user commands. We argue that a truly intelligent and helpful assistant should be capable of anticipating user needs and taking actions proactively when appropriate. To realize this vision, we propose Alpha-Service, a unified framework that addresses two fundamental challenges: Know When to intervene by detecting service opportunities from egocentric video streams, and Know How to provide both generalized and personalized services. Inspired by the von Neumann computer architecture and based on AI glasses, Alpha-Service consists of five key components: an Input Unit for perception, a Central Processing Unit for task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit for long-term personalization, and an Output Unit for natural human interaction. As an initial exploration, we implement Alpha-Service through a multi-agent system deployed on AI glasses. Case studies, including a real-time Blackjack advisor, a museum tour guide, and a shopping fit assistant, demonstrate its ability to seamlessly perceive the environment, infer user intent, and provide timely and useful assistance without explicit prompts.
86. 【2510.14354】Leveraging Cycle-Consistent Anchor Points for Self-Supervised RGB-D Registration
链接:https://arxiv.org/abs/2510.14354
作者:Siddharth Tourani,Jayaram Reddy,Sarvesh Thakur,K Madhava Krishna,Muhammad Haris Khan,N Dinesh Reddy
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:consumer depth cameras, unlabeled RGB-D data, depth cameras, rise in consumer, consumer depth
备注: 8 pages, accepted at ICRA 2024 (International Conference on Robotics and Automation)
点击查看摘要
Abstract:With the rise in consumer depth cameras, a wealth of unlabeled RGB-D data has become available. This prompts the question of how to utilize this data for geometric reasoning of scenes. While many RGB-D registration meth- ods rely on geometric and feature-based similarity, we take a different approach. We use cycle-consistent keypoints as salient points to enforce spatial coherence constraints during matching, improving correspondence accuracy. Additionally, we introduce a novel pose block that combines a GRU recurrent unit with transformation synchronization, blending historical and multi-view data. Our approach surpasses previous self- supervised registration methods on ScanNet and 3DMatch, even outperforming some older supervised methods. We also integrate our components into existing methods, showing their effectiveness.
87. 【2510.14349】Vision-Centric Activation and Coordination for Multimodal Large Language Models
链接:https://arxiv.org/abs/2510.14349
作者:Yunnan Wang,Fan Lu,Kecheng Zheng,Ziyuan Huang,Ziqiang Li,Wenjun Zeng,Xin Jin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal large language, Multimodal large, large language models, demonstrating advanced comprehension, encoders with LLMs
备注: Under Review
点击查看摘要
Abstract:Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To track this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.
88. 【2510.14314】A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection
链接:https://arxiv.org/abs/2510.14314
作者:Shivangi Yadav,Arun Ross
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:presentation attack detection, cosmetic contact lenses, lenses are presented, synthetic ocular images, printed eye images
备注:
点击查看摘要
Abstract:An iris biometric system can be compromised by presentation attacks (PAs) where artifacts such as artificial eyes, printed eye images, or cosmetic contact lenses are presented to the system. To counteract this, several presentation attack detection (PAD) methods have been developed. However, there is a scarcity of datasets for training and evaluating iris PAD techniques due to the implicit difficulties in constructing and imaging PAs. To address this, we introduce the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), a new framework for generating synthetic ocular images that captures the PA and bonafide characteristics in multiple domains such as bonafide, printed eyes and cosmetic contact lens. MID-StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to produce realistic and diverse synthetic data. Our approach utilizes a multi-domain architecture that enables the translation between bonafide ocular images and different PA domains. The model employs an adaptive loss function tailored for ocular data to maintain domain consistency. Extensive experiments demonstrate that MID-StyleGAN outperforms existing methods in generating high-quality synthetic ocular images. The generated data was used to significantly enhance the performance of PAD systems, providing a scalable solution to the data scarcity problem in iris and ocular biometrics. For example, on the LivDet2020 dataset, the true detect rate at 1% false detect rate improved from 93.41% to 98.72%, showcasing the impact of the proposed method.
89. 【2510.14304】Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding
链接:https://arxiv.org/abs/2510.14304
作者:Kyungryul Back,Seongbeom Park,Milim Kim,Mincheol Kwon,SangHyeok Lee,Hyunyoung Lee,Junhee Cho,Seunghyun Park,Jinkyu Kim
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large Vision-Language Models, recently shown promising, shown promising results, Large Vision-Language, Vision-Language Models
备注: EMNLP 2025 Findings; Project: [this https URL](https://github.com/KR-0822/TCD)
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have recently shown promising results on various multimodal tasks, even achieving human-comparable performance in certain cases. Nevertheless, LVLMs remain prone to hallucinations -- they often rely heavily on a single modality or memorize training data without properly grounding their outputs. To address this, we propose a training-free, tri-layer contrastive decoding with watermarking, which proceeds in three steps: (1) select a mature layer and an amateur layer among the decoding layers, (2) identify a pivot layer using a watermark-related question to assess whether the layer is visually well-grounded, and (3) apply tri-layer contrastive decoding to generate the final output. Experiments on public benchmarks such as POPE, MME and AMBER demonstrate that our method achieves state-of-the-art performance in reducing hallucinations in LVLMs and generates more visually grounded responses.
90. 【2510.14293】Learning Human-Humanoid Coordination for Collaborative Object Carrying
链接:https://arxiv.org/abs/2510.14293
作者:Yushi Du,Yixuan Li,Baoxiong Jia,Yutang Lin,Pei Zhou,Wei Liang,Yanchao Yang,Siyuan Huang
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:shows significant promise, collaboration shows significant, Human-humanoid collaboration shows, domestic assistance, applications in healthcare
备注:
点击查看摘要
Abstract:Human-humanoid collaboration shows significant promise for applications in healthcare, domestic assistance, and manufacturing. While compliant robot-human collaboration has been extensively developed for robotic arms, enabling compliant human-humanoid collaboration remains largely unexplored due to humanoids' complex whole-body dynamics. In this paper, we propose a proprioception-only reinforcement learning approach, COLA, that combines leader and follower behaviors within a single policy. The model is trained in a closed-loop environment with dynamic object interactions to predict object motion patterns and human intentions implicitly, enabling compliant collaboration to maintain load balance through coordinated trajectory planning. We evaluate our approach through comprehensive simulator and real-world experiments on collaborative carrying tasks, demonstrating the effectiveness, generalization, and robustness of our model across various terrains and objects. Simulation experiments demonstrate that our model reduces human effort by 24.7%. compared to baseline approaches while maintaining object stability. Real-world experiments validate robust collaborative carrying across different object types (boxes, desks, stretchers, etc.) and movement patterns (straight-line, turning, slope climbing). Human user studies with 23 participants confirm an average improvement of 27.4% compared to baseline models. Our method enables compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, offering a practical solution for real-world deployment.
91. 【2510.14273】CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts
链接:https://arxiv.org/abs/2510.14273
作者:Kieu-Anh Truong Thi,Huy-Hieu Pham,Duc-Trong Le
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:deep learning models, data sources, poses a major, learning models, caused by differences
备注:
点击查看摘要
Abstract:Domain shift in histopathology, often caused by differences in acquisition processes or data sources, poses a major challenge to the generalization ability of deep learning models. Existing methods primarily rely on modeling statistical correlations by aligning feature distributions or introducing statistical variation, yet they often overlook causal relationships. In this work, we propose a novel causal-inference-based framework that leverages semantic features while mitigating the impact of confounders. Our method implements the front-door principle by designing transformation strategies that explicitly incorporate mediators and observed tissue slides. We validate our method on the CAMELYON17 dataset and a private histopathology dataset, demonstrating consistent performance gains across unseen domains. As a result, our approach achieved up to a 7% improvement in both the CAMELYON17 dataset and the private histopathology dataset, outperforming existing baselines. These results highlight the potential of causal inference as a powerful tool for addressing domain shift in histopathology image analysis.
92. 【2510.14270】GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering
链接:https://arxiv.org/abs/2510.14270
作者:Alexander Valverde,Brian Xu,Yuyin Zhou,Meng Xu,Hongyun Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:Neural Radiance Fields, Radiance Fields, Neural Radiance, achieving remarkable progress, Splatting achieving remarkable
备注:
点击查看摘要
Abstract:Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Cite as:
arXiv:2510.14270 [cs.CV]
(or
arXiv:2510.14270v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2510.14270
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
93. 【2510.14266】Experimental Demonstration of Event-based Optical Camera Communication in Long-Range Outdoor Environment
链接:https://arxiv.org/abs/2510.14266
作者:Miu Sumino,Mayu Ishii,Shun Kaizu,Daisuke Hisano,Yu Nakayama
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:event-based vision sensor, digital phase-locked loop, optical camera communication, camera communication systems, robust demodulation scheme
备注:
点击查看摘要
Abstract:We propose a robust demodulation scheme for optical camera communication systems using an event-based vision sensor, combining OOK with toggle demodulation and a digital phase-locked loop. This is the first report to achieve a $\mathrm{BER} 10^{-3}$ at 200m-60kbps and 400m-30kbps in outdoor experiments.
94. 【2510.14260】MatchAttention: Matching the Relative Positions for High-Resolution Cross-View Matching
链接:https://arxiv.org/abs/2510.14260
作者:Tingman Yan,Tao Liu,Xilian Yang,Qunfei Zhao,Zeyang Xia
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:matching, Cross-view matching, relative positions, Cross-view, cross-attention mechanisms
备注:
点击查看摘要
Abstract:Cross-view matching is fundamentally achieved through cross-attention mechanisms. However, matching of high-resolution images remains challenging due to the quadratic complexity and lack of explicit matching constraints in the existing cross-attention. This paper proposes an attention mechanism, MatchAttention, that dynamically matches relative positions. The relative position determines the attention sampling center of the key-value pairs given a query. Continuous and differentiable sliding-window attention sampling is achieved by the proposed BilinearSoftmax. The relative positions are iteratively updated through residual connections across layers by embedding them into the feature channels. Since the relative position is exactly the learning target for cross-view matching, an efficient hierarchical cross-view decoder, MatchDecoder, is designed with MatchAttention as its core component. To handle cross-view occlusions, gated cross-MatchAttention and a consistency-constrained loss are proposed. These two components collectively mitigate the impact of occlusions in both forward and backward passes, allowing the model to focus more on learning matching relationships. When applied to stereo matching, MatchStereo-B ranked 1st in average error on the public Middlebury benchmark and requires only 29ms for KITTI-resolution inference. MatchStereo-T can process 4K UHD images in 0.1 seconds using only 3GB of GPU memory. The proposed models also achieve state-of-the-art performance on KITTI 2012, KITTI 2015, ETH3D, and Spring flow datasets. The combination of high accuracy and low computational complexity makes real-time, high-resolution, and high-accuracy cross-view matching possible. Code is available at this https URL.
95. 【2510.14256】Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning
链接:https://arxiv.org/abs/2510.14256
作者:Xiangyu Meng,Zixian Zhang,Zhenghao Zhang,Junchao Liao,Long Qin,Weizhi Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:multi-human identity preservation, VACE and Phantom, diverse scenarios, dynamic interactions, characters are critical
备注:
点击查看摘要
Abstract:While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
96. 【2510.14255】Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
链接:https://arxiv.org/abs/2510.14255
作者:Liao Shen,Wentao Jiang,Yiran Zhu,Tiezheng Ge,Zhiguo Cao,Bo Zheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable progress, Recent advances, temporally coherent videos, synthesizing high-quality, temporally coherent
备注:
点击查看摘要
Abstract:Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation includes a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at \href{this https URL}{this https URL}.
97. 【2510.14251】MACE: Mixture-of-Experts Accelerated Coordinate Encoding for Large-Scale Scene Localization and Rendering
链接:https://arxiv.org/abs/2510.14251
作者:Mingkai Liu,Dikai Fan,Haohua Que,Haojia Gao,Xiao Liu,Shuxue Peng,Meixia Lin,Shengyu Gu,Ruicong Ye,Wanli Qiu,Handong Yao,Ruopeng Zhang,Xianliang Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Scene Coordinate Regression, computational cost involved, Expert-based Accelerated Coordinate, Accelerated Coordinate Encoding, Mixed Expert-based Accelerated
备注: 8 pages
点击查看摘要
Abstract:Efficient localization and high-quality rendering in large-scale scenes remain a significant challenge due to the computational cost involved. While Scene Coordinate Regression (SCR) methods perform well in small-scale localization, they are limited by the capacity of a single network when extended to large-scale scenes. To address these challenges, we propose the Mixed Expert-based Accelerated Coordinate Encoding method (MACE), which enables efficient localization and high-quality rendering in large-scale scenes. Inspired by the remarkable capabilities of MOE in large model domains, we introduce a gating network to implicitly classify and select sub-networks, ensuring that only a single sub-network is activated during each inference. Furtheremore, we present Auxiliary-Loss-Free Load Balancing(ALF-LB) strategy to enhance the localization accuracy on large-scale scene. Our framework provides a significant reduction in costs while maintaining higher precision, offering an efficient solution for large-scale scene applications. Additional experiments on the Cambridge test set demonstrate that our method achieves high-quality rendering results with merely 10 minutes of training.
98. 【2510.14245】Event Interval Modulation: A Novel Scheme for Event-based Optical Camera Communication
链接:https://arxiv.org/abs/2510.14245
作者:Miu Sumino,Mayu Ishii,Shun Kaizu,Daisuke Hisano,Yu Nakayama
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Optical camera communication, promising visible light, event-based OCC systems, OCC systems, Optical camera
备注:
点击查看摘要
Abstract:Optical camera communication (OCC) represents a promising visible light communication technology. Nonetheless, typical OCC systems utilizing frame-based cameras are encumbered by limitations, including low bit rate and high processing load. To address these issues, OCC system utilizing an event-based vision sensor (EVS) as receivers have been proposed. The EVS enables high-speed, low-latency, and robust communication due to its asynchronous operation and high dynamic range. In existing event-based OCC systems, conventional modulation schemes such as on-off keying (OOK) and pulse position modulation have been applied, however, to the best of our knowledge, no modulation method has been proposed that fully exploits the unique characteristics of the EVS. This paper proposes a novel modulation scheme, called the event interval modulation (EIM) scheme, specifically designed for event-based OCC. EIM enables improvement in transmission speed by modulating information using the intervals between events. This paper proposes a theoretical model of EIM and conducts a proof-of-concept experiment. First, the parameters of the EVS are tuned and customized to optimize the frequency response specifically for EIM. Then, the maximum modulation order usable in EIM is determined experimentally. We conduct transmission experiments based on the obtained parameters. Finally, we report successful transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters in an indoor environment. This sets a new benchmark for bit rate in event-based OCC systems.
99. 【2510.14241】PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis
链接:https://arxiv.org/abs/2510.14241
作者:Soumyya Kanti Datta,Tanvi Ranga,Chengzhe Sun,Siwei Lyu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:avatar-driven facial synthesis, insidious threat, lip-sync modifications, rise of manipulated, manipulated media
备注:
点击查看摘要
Abstract:The rise of manipulated media has made deepfakes a particularly insidious threat, involving various generative manipulations such as lip-sync modifications, face-swaps, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme-viseme alignment thresholds, fundamental frame-level consistency checks, or a unimodal detection strategy, inadequately identify modern-day deepfakes generated by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These advanced techniques generate nearly perfect individual frames yet inadvertently create minor temporal discrepancies frequently overlooked by traditional detectors. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis(PIA), incorporating language, dynamic face motion, and facial identification cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated method significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities. Code is available at this https URL
100. 【2510.14230】LOTA: Bit-Planes Guided AI-Generated Image Detection
链接:https://arxiv.org/abs/2510.14230
作者:Hongsong Wang,Renxi Cheng,Yang Zhang,Chaolei Han,Jie Gui
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion models makes, distinguish AI-generated images, rapid advancement, models makes, difficult to distinguish
备注: Published in the ICCV2025, COde is [this https URL](https://github.com/hongsong-wang/LOTA)
点击查看摘要
Abstract:The rapid advancement of GAN and Diffusion models makes it more difficult to distinguish AI-generated images from real ones. Recent studies often use image-based reconstruction errors as an important feature for determining whether an image is AI-generated. However, these approaches typically incur high computational costs and also fail to capture intrinsic noisy features present in the raw images. To solve these problems, we innovatively refine error extraction by using bit-plane-based image processing, as lower bit planes indeed represent noise patterns in images. We introduce an effective bit-planes guided noisy image generation and exploit various image normalization strategies, including scaling and thresholding. Then, to amplify the noise signal for easier AI-generated image detection, we design a maximum gradient patch selection that applies multi-directional gradients to compute the noise score and selects the region with the highest score. Finally, we propose a lightweight and effective classification head and explore two different structures: noise-based classifier and noise-guided classifier. Extensive experiments on the GenImage benchmark demonstrate the outstanding performance of our method, which achieves an average accuracy of \textbf{98.9\%} (\textbf{11.9}\%~$\uparrow$) and shows excellent cross-generator generalization capability. Particularly, our method achieves an accuracy of over 98.2\% from GAN to Diffusion and over 99.2\% from Diffusion to GAN. Moreover, it performs error extraction at the millisecond level, nearly a hundred times faster than existing methods. The code is at this https URL.
101. 【2510.14203】Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition
链接:https://arxiv.org/abs/2510.14203
作者:Ryo Masumura,Shota Orihashi,Mana Ihori,Tomohiro Tanaka,Naoki Makishima,Taiga Yamane,Naotaka Kawata,Satoshi Suzuki,Taichi Katayama
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:recently attracted attention, joint modeling method, Big, long been studied, attention in psychology
备注: Accepted at APSIPA ASC 2025
点击查看摘要
Abstract:This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dominance orientation, etc. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified. We expect awareness of multimodal human behavior to improve by considering these relationships. The key advance of our proposed method is to optimize jointly recognizing the Big Five and HEXACO. Experiments using a self-introduction video dataset demonstrate that the proposed method can effectively recognize the Big Five and HEXACO.
102. 【2510.14179】Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures
链接:https://arxiv.org/abs/2510.14179
作者:Yuancheng Xu,Wenqi Xian,Li Ma,Julien Philip,Ahmet Levent Taşel,Yiwei Zhao,Ryan Burgert,Mingming He,Oliver Hermann,Oliver Pilarski,Rahul Garg,Paul Debevec,Ning Yu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:multi-view character consistency, Gaussian Splatting, character consistency, customization data pipeline, video diffusion models
备注: Accepted to SIGGRAPH Asia 2025
点击查看摘要
Abstract:We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component with recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), lighting variability obtained with a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation using two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time; it also achieves scene and real-life video customization as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production. Our project page is available at: this https URL.
103. 【2510.14163】owards Reversible Model Merging For Low-rank Weights
链接:https://arxiv.org/abs/2510.14163
作者:Mohammadsajad Alipour,Mohammad Mohammadi Amiri
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:combine multiple fine-tuned, multiple fine-tuned models, Model merging aims, Model, aims to combine
备注:
点击查看摘要
Abstract:Model merging aims to combine multiple fine-tuned models into a single set of weights that performs well across all source tasks. While prior work has shown that merging can approximate the performance of individual fine-tuned models for each task, it largely overlooks scenarios where models are compressed into low-rank representations, either through low-rank adaptation (LoRA) or post-training singular value decomposition (SVD). We first demonstrate that applying conventional merging methods to low-rank weights leads to severe performance degradation in the merged model. Motivated by this phenomenon, we propose a fundamentally different approach: instead of collapsing all adapters into one set of weights, we construct a compact basis (e.g., an equivalent of holding two or more models) from which original task-specific models can be recovered via linear combination. This reframes merging as generating a reconstruction-capable model space rather than producing a single merged model. Crucially, this allows us to ``revert'' to each individual model when needed, recognizing that no merged model can consistently outperform one specialized for its task. Building on this insight, we introduce our method, Reversible Model Merging (RMM), an efficient, data-free, and flexible method that provides a closed-form solution for selecting the optimal basis of model weights and task-specific coefficients for linear combination. Extensive experiments across diverse datasets and model scales demonstrate that RMM consistently outperforms existing merging approaches, preserving the performance of low-rank compressed models by a significant margin.
104. 【2510.14146】PoissonNet: A Local-Global Approach for Learning on Surfaces
链接:https://arxiv.org/abs/2510.14146
作者:Arman Maesumi,Tanish Makadia,Thibault Groueix,Vladimir G. Kim,Daniel Ritchie,Noam Aigerman
类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:entail delicate trade-offs, difficulty learning high-frequency, constructions entail delicate, inefficient computational overhead, insufficient receptive field
备注: In ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia) 2025, 16 pages
点击查看摘要
Abstract:Many network architectures exist for learning on meshes, yet their constructions entail delicate trade-offs between difficulty learning high-frequency features, insufficient receptive field, sensitivity to discretization, and inefficient computational overhead. Drawing from classic local-global approaches in mesh processing, we introduce PoissonNet, a novel neural architecture that overcomes all of these deficiencies by formulating a local-global learning scheme, which uses Poisson's equation as the primary mechanism for feature propagation. Our core network block is simple; we apply learned local feature transformations in the gradient domain of the mesh, then solve a Poisson system to propagate scalar feature updates across the surface globally. Our local-global learning framework preserves the features's full frequency spectrum and provides a truly global receptive field, while remaining agnostic to mesh triangulation. Our construction is efficient, requiring far less compute overhead than comparable methods, which enables scalability -- both in the size of our datasets, and the size of individual training samples. These qualities are validated on various experiments where, compared to previous intrinsic architectures, we attain state-of-the-art performance on semantic segmentation and parameterizing highly-detailed animated surfaces. Finally, as a central application of PoissonNet, we show its ability to learn deformations, significantly outperforming state-of-the-art architectures that learn on surfaces.
105. 【2510.14143】cubic: CUDA-accelerated 3D Bioimage Computing
链接:https://arxiv.org/abs/2510.14143
作者:Alexandr A. Kalinin,Anne E. Carpenter,Shantanu Singh,Matthew J. O'Meara
类目:Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
关键词:understanding complex cellular, complex cellular phenotypes, multidimensional biological images, Quantitative analysis, biomedical research
备注: accepted to BioImage Computing workshop @ ICCV 2025
点击查看摘要
Abstract:Quantitative analysis of multidimensional biological images is useful for understanding complex cellular phenotypes and accelerating advances in biomedical research. As modern microscopy generates ever-larger 2D and 3D datasets, existing computational approaches are increasingly limited by their scalability, efficiency, and integration with modern scientific computing workflows. Existing bioimage analysis tools often lack application programmable interfaces (APIs), do not support graphics processing unit (GPU) acceleration, lack broad 3D image processing capabilities, and/or have poor interoperability for compute-heavy workflows. Here, we introduce cubic, an open-source Python library that addresses these challenges by augmenting widely used SciPy and scikit-image APIs with GPU-accelerated alternatives from CuPy and RAPIDS cuCIM. cubic's API is device-agnostic and dispatches operations to GPU when data reside on the device and otherwise executes on CPU, seamlessly accelerating a broad range of image processing routines. This approach enables GPU acceleration of existing bioimage analysis workflows, from preprocessing to segmentation and feature extraction for 2D and 3D data. We evaluate cubic both by benchmarking individual operations and by reproducing existing deconvolution and segmentation pipelines, achieving substantial speedups while maintaining algorithmic fidelity. These advances establish a robust foundation for scalable, reproducible bioimage analysis that integrates with the broader Python scientific computing ecosystem, including other GPU-accelerated methods, enabling both interactive exploration and automated high-throughput analysis workflows. cubic is openly available at https://github$.$com/alxndrkalinin/cubic
106. 【2510.14081】Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images
链接:https://arxiv.org/abs/2510.14081
作者:Emanuel Garbin,Guy Adam,Oded Krams,Zohar Barzelay,Eran Guendelman,Michael Schwarz,Moran Vatelmacher,Yigal Shenkman,Eli Peker,Itai Druker,Uri Patish,Yoav Blum,Max Bluvstein,Junxuan Li,Rawal Khirodkar,Shunsuke Saito
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:unstructured phone images, creating hyperrealistic, phone images, zero-shot pipeline, degrading identity preservation
备注:
点击查看摘要
Abstract:We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.
107. 【2510.14051】Synchronization of Multiple Videos
链接:https://arxiv.org/abs/2510.14051
作者:Avihai Naaman,Ron Shapira Weber,Oren Freifeld
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:simple time shifts, Synchronizing videos captured, videos captured simultaneously, time shifts, Synchronizing videos
备注: ICCV 2025
点击查看摘要
Abstract:Synchronizing videos captured simultaneously from multiple cameras in the same scene is often easy and typically requires only simple time shifts. However, synchronizing videos from different scenes or, more recently, generative AI videos, poses a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignment. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any of various pretrained models. TPL robustly aligns videos by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching. Our experiments show that TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks. Importantly, TPL is the first approach to mitigate synchronization issues in multiple generative AI videos depicting the same action. Our code and a new multiple video synchronization dataset are available at this https URL
108. 【2510.14032】Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding
链接:https://arxiv.org/abs/2510.14032
作者:Xiaoqian Shen,Wenxuan Zhang,Jun Chen,Mohamed Elhoseiny
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, retaining long-term sequential, pose significant challenges, large video language, long-term sequential information
备注: NeurIPS 2025 (Spotlight). Webpage at [this https URL](https://xiaoqian-shen.github.io/Vgent)
点击查看摘要
Abstract:Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty in processing intensive video tokens beyond context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long video faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of $3.0\%\sim 5.4\%$ over base models on MLVU, and outperformed state-of-the-art video RAG methods by $8.6\%$. Our code is publicly available at this https URL.
109. 【2510.14025】NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations
链接:https://arxiv.org/abs/2510.14025
作者:Junjie Nan,Jianing Li,Wei Chen,Mingkun Zhang,Xueqi Cheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved great success, achieved great, great success, success in combating, Adversarial
备注:
点击查看摘要
Abstract:Adversarial purification has achieved great success in combating adversarial image perturbations, which are usually assumed to be additive. However, non-additive adversarial perturbations such as blur, occlusion, and distortion are also common in the real world. Under such perturbations, existing adversarial purification methods are much less effective since they are designed to fit the additive nature. In this paper, we propose an extended adversarial purification framework named NAPPure, which can further handle non-additive perturbations. Specifically, we first establish the generation process of an adversarial image, and then disentangle the underlying clean image and perturbation parameters through likelihood maximization. Experiments on GTSRB and CIFAR-10 datasets show that NAPPure significantly boosts the robustness of image classification models against non-additive perturbations.
110. 【2510.13995】Finding Holes: Pathologist Level Performance Using AI for Cribriform Morphology Detection in Prostate Cancer
链接:https://arxiv.org/abs/2510.13995
作者:Kelvin Szolnoky,Anders Blilie,Nita Mulliqi,Toyonori Tsuzuki,Hemamali Samaratunga,Matteo Titus,Xiaoyi Ji,Sol Erika Boman,Einar Gudlaugsson,Svein Reidar Kjosavik,José Asenjo,Marcello Gambacorta,Paolo Libretti,Marcin Braun,Radisław Kordek,Roman Łowicki,Brett Delahunt,Kenneth A. Iczkowski,Theo van der Kwast,Geert J. L. H. van Leenders,Katia R. M. Leite,Chin-Chen Pan,Emiel Adrianus Maria Janssen,Martin Eklund,Lars Egevad,Kimmo Kartasalo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:contraindicates active surveillance, active surveillance, histological feature, poor prognosis, prognosis and contraindicates
备注:
点击查看摘要
Abstract:Background: Cribriform morphology in prostate cancer is a histological feature that indicates poor prognosis and contraindicates active surveillance. However, it remains underreported and subject to significant interobserver variability amongst pathologists. We aimed to develop and validate an AI-based system to improve cribriform pattern detection. Methods: We created a deep learning model using an EfficientNetV2-S encoder with multiple instance learning for end-to-end whole-slide classification. The model was trained on 640 digitised prostate core needle biopsies from 430 patients, collected across three cohorts. It was validated internally (261 slides from 171 patients) and externally (266 slides, 104 patients from three independent cohorts). Internal validation cohorts included laboratories or scanners from the development set, while external cohorts used completely independent instruments and laboratories. Annotations were provided by three expert uropathologists with known high concordance. Additionally, we conducted an inter-rater analysis and compared the model's performance against nine expert uropathologists on 88 slides from the internal validation cohort. Results: The model showed strong internal validation performance (AUC: 0.97, 95% CI: 0.95-0.99; Cohen's kappa: 0.81, 95% CI: 0.72-0.89) and robust external validation (AUC: 0.90, 95% CI: 0.86-0.93; Cohen's kappa: 0.55, 95% CI: 0.45-0.64). In our inter-rater analysis, the model achieved the highest average agreement (Cohen's kappa: 0.66, 95% CI: 0.57-0.74), outperforming all nine pathologists whose Cohen's kappas ranged from 0.35 to 0.62. Conclusion: Our AI model demonstrates pathologist-level performance for cribriform morphology detection in prostate cancer. This approach could enhance diagnostic reliability, standardise reporting, and improve treatment decisions for prostate cancer patients.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2510.13995 [cs.CV]
(or
arXiv:2510.13995v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2510.13995
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Kelvin Szolnoky [view email] [v1]
Wed, 15 Oct 2025 18:23:34 UTC (743 KB)
111. 【2510.13993】Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models
链接:https://arxiv.org/abs/2510.13993
作者:Jia Yun Chua,Argyrios Zolotas,Miguel Arana-Catania
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Remote sensing, environmental monitoring, urban planning, disaster response, Vision Language Models
备注: 11 pages, 7 figures, 8 tables. To be published in Applied AI Letters
点击查看摘要
Abstract:Remote sensing has become a vital tool across sectors such as urban planning, environmental monitoring, and disaster response. While the volume of data generated has increased significantly, traditional vision models are often constrained by the requirement for extensive domain-specific labelled data and their limited ability to understand the context within complex environments. Vision Language Models offer a complementary approach by integrating visual and textual data; however, their application to remote sensing remains underexplored, particularly given their generalist nature. This work investigates the combination of vision models and VLMs to enhance image analysis in remote sensing, with a focus on aircraft detection and scene understanding. The integration of YOLO with VLMs such as LLaVA, ChatGPT, and Gemini aims to achieve more accurate and contextually aware image interpretation. Performance is evaluated on both labelled and unlabelled remote sensing data, as well as degraded image scenarios which are crucial for remote sensing. The findings show an average MAE improvement of 48.46% across models in the accuracy of aircraft detection and counting, especially in challenging conditions, in both raw and degraded scenarios. A 6.17% improvement in CLIPScore for comprehensive understanding of remote sensing images is obtained. The proposed approach combining traditional vision models and VLMs paves the way for more advanced and efficient remote sensing image analysis, especially in few-shot learning scenarios.
112. 【2510.13972】Distributional Consistency Loss: Beyond Pointwise Data Terms in Inverse Problems
链接:https://arxiv.org/abs/2510.13972
作者:George Webber,Andrew J. Reader
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
关键词:Recovering true signals, Recovering true, spanning medical imaging, central challenge, signal processing
备注: Preprint; submitted to ICLR 2025 for possible publication
点击查看摘要
Abstract:Recovering true signals from noisy measurements is a central challenge in inverse problems spanning medical imaging, geophysics, and signal processing. Current solutions balance prior assumptions regarding the true signal (regularization) with agreement to noisy measured data (data-fidelity). Conventional data-fidelity loss functions, such as mean-squared error (MSE) or negative log-likelihood, seek pointwise agreement with noisy measurements, often leading to overfitting to noise. In this work, we instead evaluate data-fidelity collectively by testing whether the observed measurements are statistically consistent with the noise distributions implied by the current estimate. We adopt this aggregated perspective and introduce distributional consistency (DC) loss, a data-fidelity objective that replaces pointwise matching with distribution-level calibration using model-based probability scores for each measurement. DC loss acts as a direct and practical plug-in replacement for standard data consistency terms: i) it is compatible with modern regularizers, ii) it is optimized in the same way as traditional losses, and iii) it avoids overfitting to measurement noise even without the use of priors. Its scope naturally fits many practical inverse problems where the measurement-noise distribution is known and where the measured dataset consists of many independent noisy values. We demonstrate efficacy in two key example application areas: i) in image denoising with deep image prior, using DC instead of MSE loss removes the need for early stopping and achieves higher PSNR; ii) in medical image reconstruction from Poisson-noisy data, DC loss reduces artifacts in highly-iterated reconstructions and enhances the efficacy of hand-crafted regularization. These results position DC loss as a statistically grounded, performance-enhancing alternative to conventional fidelity losses for inverse problems.
113. 【2510.13921】Weight Weaving: Parameter Pooling for Data-Free Model Merging
链接:https://arxiv.org/abs/2510.13921
作者:Levy Chaves,Eduardo Valle,Sandra Avila
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:specialized deep neural, deep neural networks, Model merging, model merging methods, parameter integration
备注: 17 pages, 3 figures. Accepted at the 3rd UniReps Workshop @ NeurIPS 2025
点击查看摘要
Abstract:Model merging provides a cost-effective and data-efficient combination of specialized deep neural networks through parameter integration. This technique leverages expert models across downstream tasks without requiring retraining. Most model merging approaches critically depend on scaling hyper-parameters $\lambda$, which weight each model's contribution globally or individually. Principled approaches for setting scaling factors without accessing any data (data-free) are scarce, often leading researchers to tune $\lambda$ using privileged data from the evaluation set, which is obviously unfeasible in practice. To address this limitation, we introduce Weight Weaving, a plug-and-play technique that pools model weights across $\lambda$ values search space using user-defined pooling functions, such as averaging, random selection, or even existing model merging methods. Our method demonstrates high modularity, imposing minimal constraints on the search space. It operates orthogonally to existing model merging methods and eliminates evaluation data requirements. We validate Weight Weaving across three ViT variants in three experimental setups: vision multi-task learning, vision continual learning, and domain generalization. Our method consistently improves the performance of several model merging methods, achieving average accuracy gains of up to 15.9 percentage points in a data-free setting.
114. 【2510.13899】Post-surgical Endometriosis Segmentation in Laparoscopic Videos
链接:https://arxiv.org/abs/2510.13899
作者:Andreas Leibetseder,Klaus Schoeffmann,Jörg Keckstein,Simon Keckstein
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:common women condition, women condition exhibiting, body-internal locations, manifold visual appearance, common women
备注: This is a demo paper that was already published [this https URL](https://ieeexplore.ieee.org/document/9461900) but a preprint/author's copy is needed for the funding agency
点击查看摘要
Abstract:Endometriosis is a common women's condition exhibiting a manifold visual appearance in various body-internal locations. Having such properties makes its identification very difficult and error-prone, at least for laymen and non-specialized medical practitioners. In an attempt to provide assistance to gynecologic physicians treating endometriosis, this demo paper describes a system that is trained to segment one frequently occurring visual appearance of endometriosis, namely dark endometrial implants. The system is capable of analyzing laparoscopic surgery videos, annotating identified implant regions with multi-colored overlays and displaying a detection summary for improved video browsing.
115. 【2510.13889】MultiFoodhat: A potential new paradigm for intelligent food quality inspection
链接:https://arxiv.org/abs/2510.13889
作者:Yue Hu,Guohang Zhuang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:image classification plays, Food image classification, dietary assessment, automated monitoring, image classification
备注:
点击查看摘要
Abstract:Food image classification plays a vital role in intelligent food quality inspection, dietary assessment, and automated monitoring. However, most existing supervised models rely heavily on large labeled datasets and exhibit limited generalization to unseen food categories. To overcome these challenges, this study introduces MultiFoodChat, a dialogue-driven multi-agent reasoning framework for zero-shot food recognition. The framework integrates vision-language models (VLMs) and large language models (LLMs) to enable collaborative reasoning through multi-round visual-textual dialogues. An Object Perception Token (OPT) captures fine-grained visual attributes, while an Interactive Reasoning Agent (IRA) dynamically interprets contextual cues to refine predictions. This multi-agent design allows flexible and human-like understanding of complex food scenes without additional training or manual annotations. Experiments on multiple public food datasets demonstrate that MultiFoodChat achieves superior recognition accuracy and interpretability compared with existing unsupervised and few-shot methods, highlighting its potential as a new paradigm for intelligent food quality inspection and analysis.
116. 【2510.13864】Self-Training with Dynamic Weighting for Robust Gradual Domain Adaptation
链接:https://arxiv.org/abs/2510.13864
作者:Zixi Wang,Yushe Cao,Yubo Huang,Jinzhu Wei,Jingzehua Xu,Shuai Zhang,Xin Lai
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:smooth knowledge migration, Traditional GDA methods, Gradual Domain Adaptation, method called Self-Training, GDA methods mitigate
备注: It had formerly appeared as [arXiv:2501.19159v2](https://arxiv.org/abs/2501.19159v2) in error. Accepted by NIPS 25
点击查看摘要
Abstract:In this paper, we propose a new method called Self-Training with Dynamic Weighting (STDW), which aims to enhance robustness in Gradual Domain Adaptation (GDA) by addressing the challenge of smooth knowledge migration from the source to the target domain. Traditional GDA methods mitigate domain shift through intermediate domains and self-training but often suffer from inefficient knowledge migration or incomplete intermediate data. Our approach introduces a dynamic weighting mechanism that adaptively balances the loss contributions of the source and target domains during training. Specifically, we design an optimization framework governed by a time-varying hyperparameter $\varrho$ (progressing from 0 to 1), which controls the strength of domain-specific learning and ensures stable adaptation. The method leverages self-training to generate pseudo-labels and optimizes a weighted objective function for iterative model updates, maintaining robustness across intermediate domains. Experiments on rotated MNIST, color-shifted MNIST, portrait datasets, and the Cover Type dataset demonstrate that STDW outperforms existing baselines. Ablation studies further validate the critical role of $\varrho$'s dynamic scheduling in achieving progressive adaptation, confirming its effectiveness in reducing domain bias and improving generalization. This work provides both theoretical insights and a practical framework for robust gradual domain adaptation, with potential applications in dynamic real-world scenarios. The code is available at this https URL.
117. 【2510.13856】Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA
链接:https://arxiv.org/abs/2510.13856
作者:A H M Rezaul Karim,Ozlem Uzuner
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Visual Question Answering, Medical Visual Question, Question Answering, enables natural language, support clinical decision-making
备注:
点击查看摘要
Abstract:Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs -- a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking -- provides a simple and effective baseline for multimodal clinical NLP tasks.
118. 【2509.26255】ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning
链接:https://arxiv.org/abs/2509.26255
作者:Yichao Liang,Dat Nguyen,Cambridge Yang,Tianyang Li,Joshua B. Tenenbaum,Carl Edward Rasmussen,Adrian Weller,Zenna Tavares,Tom Silver,Kevin Ellis
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:Long-horizon embodied planning, Long-horizon embodied, water heating, dominoes cascading, agent actions
备注: 41 pages. The last two authors contributed equally in co-advising
点击查看摘要
Abstract:Long-horizon embodied planning is challenging because the world does not only change through an agent's actions: exogenous processes (e.g., water heating, dominoes cascading) unfold concurrently with the agent's actions. We propose a framework for abstract world models that jointly learns (i) symbolic state representations and (ii) causal processes for both endogenous actions and exogenous mechanisms. Each causal process models the time course of a stochastic cause-effect relation. We learn these world models from limited data via variational Bayesian inference combined with LLM proposals. Across five simulated tabletop robotics environments, the learned models enable fast planning that generalizes to held-out tasks with more objects and more complex goals, outperforming a range of baselines.
119. 【2509.25991】owards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline
链接:https://arxiv.org/abs/2509.25991
作者:Haiyang Li,Yaxiong Wang,Shengeng Tang,Lianwei Wu,Lechao Cheng,Zhun Zhong
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:drawn increasing attention, recent years, increasing attention, social media, media has drawn
备注:
点击查看摘要
Abstract:In recent years, detecting fake multimodal content on social media has drawn increasing attention. Two major forms of deception dominate: human-crafted misinformation (e.g., rumors and misleading posts) and AI-generated content produced by image synthesis models or vision-language models (VLMs). Although both share deceptive intent, they are typically studied in isolation. NLP research focuses on human-written misinformation, while the CV community targets AI-generated artifacts. As a result, existing models are often specialized for only one type of fake content. In real-world scenarios, however, the type of a multimodal post is usually unknown, limiting the effectiveness of such specialized systems. To bridge this gap, we construct the Omnibus Dataset for Multimodal News Deception (OmniFake), a comprehensive benchmark of 127K samples that integrates human-curated misinformation from existing resources with newly synthesized AI-generated examples. Based on this dataset, we propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception. UMFDet leverages a VLM backbone augmented with a Category-aware Mixture-of-Experts (MoE) Adapter to capture category-specific cues, and an attribution chain-of-thought mechanism that provides implicit reasoning guidance for locating salient deceptive signals. Extensive experiments demonstrate that UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines and offering a practical solution for real-world multimodal deception detection.
120. 【2510.14340】A Density-Informed Multimodal Artificial Intelligence Framework for Improving Breast Cancer Detection Across All Breast Densities
链接:https://arxiv.org/abs/2510.14340
作者:Siva Teja Kakileti,Bharath Govindaraju,Sudhakar Sampangi,Geetha Manjunath
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:contributing to missed, delayed diagnoses, current standard, missed or delayed, Thermalytix
备注:
点击查看摘要
Abstract:Mammography, the current standard for breast cancer screening, has reduced sensitivity in women with dense breast tissue, contributing to missed or delayed diagnoses. Thermalytix, an AI-based thermal imaging modality, captures functional vascular and metabolic cues that may complement mammographic structural data. This study investigates whether a breast density-informed multi-modal AI framework can improve cancer detection by dynamically selecting the appropriate imaging modality based on breast tissue composition. A total of 324 women underwent both mammography and thermal imaging. Mammography images were analyzed using a multi-view deep learning model, while Thermalytix assessed thermal images through vascular and thermal radiomics. The proposed framework utilized Mammography AI for fatty breasts and Thermalytix AI for dense breasts, optimizing predictions based on tissue type. This multi-modal AI framework achieved a sensitivity of 94.55% (95% CI: 88.54-100) and specificity of 79.93% (95% CI: 75.14-84.71), outperforming standalone mammography AI (sensitivity 81.82%, specificity 86.25%) and Thermalytix AI (sensitivity 92.73%, specificity 75.46%). Importantly, the sensitivity of Mammography dropped significantly in dense breasts (67.86%) versus fatty breasts (96.30%), whereas Thermalytix AI maintained high and consistent sensitivity in both (92.59% and 92.86%, respectively). This demonstrates that a density-informed multi-modal AI framework can overcome key limitations of unimodal screening and deliver high performance across diverse breast compositions. The proposed framework is interpretable, low-cost, and easily deployable, offering a practical path to improving breast cancer screening outcomes in both high-resource and resource-limited settings.
121. 【2510.14244】Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation
链接:https://arxiv.org/abs/2510.14244
作者:Arnaud Judge,Nicolas Duchateau,Thierry Judge,Roman A. Sandler,Joseph Z. Sokol,Christian Desrosiers,Olivier Bernard,Pierre-Marc Jodoin
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:additional expert annotations, enabling knowledge transfer, adaptation methods aim, Domain adaptation methods, expert annotations
备注: 10 pages, submitted to IEEE TMI
点击查看摘要
Abstract:Domain adaptation methods aim to bridge the gap between datasets by enabling knowledge transfer across domains, reducing the need for additional expert annotations. However, many approaches struggle with reliability in the target domain, an issue particularly critical in medical image segmentation, where accuracy and anatomical validity are essential. This challenge is further exacerbated in spatio-temporal data, where the lack of temporal consistency can significantly degrade segmentation quality, and particularly in echocardiography, where the presence of artifacts and noise can further hinder segmentation performance. To address these issues, we present RL4Seg3D, an unsupervised domain adaptation framework for 2D + time echocardiography segmentation. RL4Seg3D integrates novel reward functions and a fusion scheme to enhance key landmark precision in its segmentations while processing full-sized input videos. By leveraging reinforcement learning for image segmentation, our approach improves accuracy, anatomical validity, and temporal consistency while also providing, as a beneficial side effect, a robust uncertainty estimator, which can be used at test time to further enhance segmentation performance. We demonstrate the effectiveness of our framework on over 30,000 echocardiographic videos, showing that it outperforms standard domain adaptation techniques without the need for any labels on the target domain. Code is available at this https URL.
122. 【2510.13896】GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents
链接:https://arxiv.org/abs/2510.13896
作者:Xi Yu,Yang Yang,Qun Liu,Yonghua Du,Sean McSweeney,Yuewei Lin
类目:Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
关键词:remains difficult due, morphological variability, heterogeneous modalities, essential for quantitative, quantitative biology
备注: 43 pages
点击查看摘要
Abstract:Cellular image segmentation is essential for quantitative biology yet remains difficult due to heterogeneous modalities, morphological variability, and limited annotations. We present GenCellAgent, a training-free multi-agent framework that orchestrates specialist segmenters and generalist vision-language models via a planner-executor-evaluator loop (choose tool $\rightarrow$ run $\rightarrow$ quality-check) with long-term memory. The system (i) automatically routes images to the best tool, (ii) adapts on the fly using a few reference images when imaging conditions differ from what a tool expects, (iii) supports text-guided segmentation of organelles not covered by existing models, and (iv) commits expert edits to memory, enabling self-evolution and personalized workflows. Across four cell-segmentation benchmarks, this routing yields a 15.7\% mean accuracy gain over state-of-the-art baselines. On endoplasmic reticulum and mitochondria from new datasets, GenCellAgent improves average IoU by 37.6\% over specialist models. It also segments novel objects such as the Golgi apparatus via iterative text-guided refinement, with light human correction further boosting performance. Together, these capabilities provide a practical path to robust, adaptable cellular image segmentation without retraining, while reducing annotation burden and matching user preferences.