本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新596篇论文,其中:
- 自然语言处理88篇
- 信息检索8篇
- 计算机视觉115篇
自然语言处理
1. 【2510.18876】Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
链接:https://arxiv.org/abs/2510.18876
作者:Haochen Wang,Yuhao Wang,Tao Zhang,Yikang Zhou,Yanwei Li,Jiacong Wang,Ye Tian,Jiahao Meng,Zilong Huang,Guangcan Mai,Anran Wang,Yunhai Tong,Zhuochen Wang,Xiangtai Li,Zhaoxiang Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language
备注:
点击查看摘要
Abstract:While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehen- sive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.
2. 【2510.18874】Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
链接:https://arxiv.org/abs/2510.18874
作者:Howard Chen,Noam Razin,Karthik Narasimhan,Danqi Chen
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Adapting language models, degrading existing capabilities, Adapting language, language models, existing capabilities
备注:
点击查看摘要
Abstract:Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
3. 【2510.18871】How Do LLMs Use Their Depth?
链接:https://arxiv.org/abs/2510.18871
作者:Akshat Gupta,Jay Yeung,Gopala Anumanchipalli,Anna Ivanova
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Growing evidence suggests, Growing evidence, large language models, evidence suggests, suggests that large
备注:
点击查看摘要
Abstract:Growing evidence suggests that large language models do not use their depth uniformly, yet we still lack a fine-grained understanding of their layer-wise prediction dynamics. In this paper, we trace the intermediate representations of several open-weight models during inference and reveal a structured and nuanced use of depth. Specifically, we propose a "Guess-then-Refine" framework that explains how LLMs internally structure their computations to make predictions. We first show that the top-ranked predictions in early LLM layers are composed primarily of high-frequency tokens, which act as statistical guesses proposed by the model early on due to the lack of appropriate contextual information. As contextual information develops deeper into the model, these initial guesses get refined into contextually appropriate tokens. Even high-frequency token predictions from early layers get refined 70% of the time, indicating that correct token prediction is not "one-and-done". We then go beyond frequency-based prediction to examine the dynamic usage of layer depth across three case studies. (i) Part-of-speech analysis shows that function words are, on average, the earliest to be predicted correctly. (ii) Fact recall task analysis shows that, in a multi-token answer, the first token requires more computational depth than the rest. (iii) Multiple-choice task analysis shows that the model identifies the format of the response within the first half of the layers, but finalizes its response only toward the end. Together, our results provide a detailed view of depth usage in LLMs, shedding light on the layer-by-layer computations that underlie successful predictions and providing insights for future works to improve computational efficiency in transformer-based models.
4. 【2510.18866】LightMem: Lightweight and Efficient Memory-Augmented Generation
链接:https://arxiv.org/abs/2510.18866
作者:Jizhan Fang,Xinle Deng,Haoming Xu,Ziyan Jiang,Yuqi Tang,Ziwen Xu,Shumin Deng,Yunzhi Yao,Mengru Wang,Shuofei Qiao,Huajun Chen,Ningyu Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:Large Language Models, Large Language, effectively leverage historical, leverage historical interaction, Language Models
备注: Work in progress
点击查看摘要
Abstract:Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at this https URL.
5. 【2510.18855】Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
链接:https://arxiv.org/abs/2510.18855
作者:Ling Team,Anqi Shen,Baihui Li,Bin Hu,Bin Jing,Cai Chen,Chao Huang,Chao Zhang,Chaokun Yang,Cheng Lin,Chengyao Wen,Congqi Li,Deng Zhao,Dingbo Yuan,Donghai You,Fagui Mao,Fanzhuang Meng,Feng Xu,Guojie Li,Guowei Wang,Hao Dai,Haonan Zheng,Hong Liu,Jia Guo,Jiaming Liu,Jian Liu,Jianhao Fu,Jiannan Shi,Jianwen Wang,Jianxin Lai,Jin Yang,Jun Mei,Jun Zhou,Junbo Zhao,Junping Zhao,Kuan Xu,Le Su,Lei Chen,Li Tang,Liang Jiang,Liangcheng Fu,Lianhao Xu,Linfeng Shi,Lisha Liao,Longfei Zheng,Meng Li,Mingchun Chen,Qi Zuo,Qiang Cheng,Qianggang Cao,Qitao Shi,Quanrui Guo,Senlin Zhu,Shaofei Wang,Shaomian Zheng,Shuaicheng Li,Shuwei Gu,Siba Chen,Tao Wu,Tao Zhang,Tianyu Zhang,Tianyu Zhou,Tiwei Bie,Tongkai Yang,Wang Hong,Wang Ren,Weihua Chen,Wenbo Yu,Wengang Zheng,Xiangchun Wang,Xiaodong Yan,Xiaopei Wan,Xin Zhao,Xinyu Kong,Xinyu Tang,Xudong Han,Xudong Wang,Xuemin Yang,Xueyu Hu,Yalin Zhang,Yan Sun,Yicheng Shan,Yilong Wang,Yingying Xu,Yongkang Liu,Yongzhen Guo,Yuanyuan Wang,Yuchen Yan,Yuefan Wang,Yuhong Guo,Zehuan Li,Zhankai Xu,Zhe Li,Zhenduo Zhang,Zhengke Gui,Zhenxuan Pan,Zhenyu Huang,Zhenzhong Lan,Zhiqiang Ding,Zhiqiang Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:trillion-scale parameter, thinking model, trillion total parameters, Abstract, model
备注: Technical Report
点击查看摘要
Abstract:We present Ring-1T, the first open-source, state-of-the-art thinking model with a trillion-scale parameter. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T parameter MoE model to the community, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
6. 【2510.18849】owards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
链接:https://arxiv.org/abs/2510.18849
作者:Chenghao Zhu,Meiling Tao,Tiannan Wang,Dongyi Ding,Yuchen Eleanor Jiang,Wangchunshu Zhou
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Faithfully personalizing large, personalizing large language, individual user preferences, Faithfully personalizing, large language models
备注: work in progress
点击查看摘要
Abstract:Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11\% win-rate improvement, and personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
7. 【2510.18840】See the Text: From Tokenization to Visual Reading
链接:https://arxiv.org/abs/2510.18840
作者:Ling Xing,Alex Jinpeng Wang,Rui Yan,Hongyu Qu,Zechao Li,Jinhui Tang
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:People, People see text, language models, Abstract, large language models
备注:
点击查看摘要
Abstract:People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.
8. 【2510.18830】MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
链接:https://arxiv.org/abs/2510.18830
作者:Wenxuan Li,Chengruidong Zhang,Huiqiang Jiang,Yucheng Li,Yuqing Yang,Lili Qiu
类目:Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, Dynamic sparse attention, Dynamic sparse, extended contexts significantly
备注:
点击查看摘要
Abstract:The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts-especially in distributed settings-remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy. Our code is available at this https URL.
9. 【2510.18817】Fine-Tuned Thoughts: Leveraging Chain-of-Thought Reasoning for Industrial Asset Health Monitoring
链接:https://arxiv.org/abs/2510.18817
作者:Shuxin Lin,Dhaval Patel,Christodoulos Constantinides
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Small Language Models, lower computational requirements, Small Language, Large Language Models, Language Models
备注: Accepted at EMNLP 2025
点击查看摘要
Abstract:Small Language Models (SLMs) are becoming increasingly popular in specialized fields, such as industrial applications, due to their efficiency, lower computational requirements, and ability to be fine-tuned for domain-specific tasks, enabling accurate and cost-effective solutions. However, performing complex reasoning using SLMs in specialized fields such as Industry 4.0 remains challenging. In this paper, we propose a knowledge distillation framework for industrial asset health, which transfers reasoning capabilities via Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) to smaller, more efficient models (SLMs). We discuss the advantages and the process of distilling LLMs using multi-choice question answering (MCQA) prompts to enhance reasoning and refine decision-making. We also perform in-context learning to verify the quality of the generated knowledge and benchmark the performance of fine-tuned SLMs with generated knowledge against widely used LLMs. The results show that the fine-tuned SLMs with CoT reasoning outperform the base models by a significant margin, narrowing the gap to their LLM counterparts. Our code is open-sourced at: this https URL.
10. 【2510.18798】WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection
链接:https://arxiv.org/abs/2510.18798
作者:Guanzhong He,Zhen Yang,Jinxin Liu,Bin Xu,Lei Hou,Juanzi Li
类目:Computation and Language (cs.CL)
关键词:achieved significant advancements, enabling intelligent information, intelligent information retrieval, intelligent search agent, achieved significant
备注:
点击查看摘要
Abstract:Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments. Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions. In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories. Our approach substantially extends tool-use chains and improves answer accuracy. Using a single 14B model, we achieve state-of-the-art results on HotpotQA and SimpleQA, with accuracies of 72.3% and 90.0%, respectively, and demonstrate strong generalization to out-of-distribution datasets. The code is available at this https URL
11. 【2510.18779】KAT-Coder Technical Report
链接:https://arxiv.org/abs/2510.18779
作者:Zizheng Zhan,Ken Deng,Xiaojiang Zhang,Jinghui Wang,Huaixi Tang,Zhiyi Lai,Haoyang Huang,Wen Xiang,Kun Wu,Wenhao Zhuang,Minglei Zhang,Shaojie Wang,Shangpeng Yan,Kepeng Lei,Zongxian Feng,Huiming Wang,Zheng Lin,Mengtong Li,Mengfei Xie,Yinghan Cui,Xuxing Chen,Chao Wang,Weihao Li,Wenqiang Zhu,Jiarong Zhang,Jingxuan Xu,Songwei Yu,Yifan Yao,Xinping Lei,Han Li,Junqi Xiong,Zuchen Gao,Dailin Li,Haimo Li,Jiaheng Liu,Yuqun Zhang,Junyi Peng,Haotian Zhang,Bin Chen
类目:Computation and Language (cs.CL)
关键词:Recent advances, models autonomously reason, autonomously reason, advances in large, enabled progress
备注:
点击查看摘要
Abstract:Recent advances in large language models (LLMs) have enabled progress in agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. However, bridging the gap between static text-based training and dynamic real-world agentic execution remains a core challenge. In this technical report, we present KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum encompassing Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances reasoning, planning, and reflection capabilities through a corpus of real software engineering data and synthetic agentic interactions. The SFT stage constructs a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes. The RFT stage introduces a novel multi-ground-truth reward formulation for stable and sample-efficient policy optimization. Finally, the Reinforcement-to-Deployment phase adapts the model to production-grade IDE environments using Error-Masked SFT and Tree-Structured Trajectory Training. In summary, these stages enable KAT-Coder to achieve robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. Our KAT series 32B model, KAT-Dev, has been open-sourced on this https URL.
12. 【2510.18774】AI use in American newspapers is widespread, uneven, and rarely disclosed
链接:https://arxiv.org/abs/2510.18774
作者:Jenna Russell,Marzena Karpinska,Destiny Akinode,Katherine Thai,Bradley Emi,Max Spero,Mohit Iyyer
类目:Computation and Language (cs.CL)
关键词:articles remains unclear, American newspapers published, rapidly transforming journalism, newspaper articles remains, remains unclear
备注:
点击查看摘要
Abstract:AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1.5K American newspapers published in the summer of 2025. Using Pangram, a state-of-the-art AI detector, we discover that approximately 9% of newly-published articles are either partially or fully AI-generated. This AI use is unevenly distributed, appearing more frequently in smaller, local outlets, in specific topics such as weather and technology, and within certain ownership groups. We also analyze 45K opinion pieces from Washington Post, New York Times, and Wall Street Journal, finding that they are 6.4 times more likely to contain AI-generated content than news articles from the same publications, with many AI-flagged op-eds authored by prominent public figures. Despite this prevalence, we find that AI use is rarely disclosed: a manual audit of 100 AI-flagged articles found only five disclosures of AI use. Overall, our audit highlights the immediate need for greater transparency and updated editorial standards regarding the use of AI in journalism to maintain public trust.
13. 【2510.18745】opoformer: brain-like topographic organization in Transformer language models through spatial querying and reweighting
链接:https://arxiv.org/abs/2510.18745
作者:Taha Binhuraib,Greta Tuckute,Nicholas Blauch
类目:Computation and Language (cs.CL)
关键词:Spatial functional organization, hallmark of biological, Spatial, topographic organization, organization
备注: ICLR 2024 Workshop on Representational Alignment (Re-Align) Camera Ready
点击查看摘要
Abstract:Spatial functional organization is a hallmark of biological brains: neurons are arranged topographically according to their response properties, at multiple scales. In contrast, representations within most machine learning models lack spatial biases, instead manifesting as disorganized vector spaces that are difficult to visualize and interpret. Here, we propose a novel form of self-attention that turns Transformers into "Topoformers" with topographic organization. We introduce spatial querying - where keys and queries are arranged on 2D grids, and local pools of queries are associated with a given key - and spatial reweighting, where we convert the standard fully connected layer of self-attention into a locally connected layer. We first demonstrate the feasibility of our approach by training a 1-layer Topoformer on a sentiment classification task. Training with spatial querying encourages topographic organization in the queries and keys, and spatial reweighting separately encourages topographic organization in the values and self-attention outputs. We then apply the Topoformer motifs at scale, training a BERT architecture with a masked language modeling objective. We find that the topographic variant performs on par with a non-topographic control model on NLP benchmarks, yet produces interpretable topographic organization as evaluated via eight linguistic test suites. Finally, analyzing an fMRI dataset of human brain responses to a large set of naturalistic sentences, we demonstrate alignment between low-dimensional topographic variability in the Topoformer model and human brain language network. Scaling up Topoformers further holds promise for greater interpretability in NLP research, and for more accurate models of the organization of linguistic information in the human brain.
14. 【2510.18731】Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation
链接:https://arxiv.org/abs/2510.18731
作者:Ming Li
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, Language Models demonstrate, demonstrate strong capabilities, Curriculum Reinforcement Learning
备注:
点击查看摘要
Abstract:Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by the current progress on Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building multi-turn reliable and trustworthy LLMs.
15. 【2510.18725】SemiAdapt and SemiLoRA: Efficient Domain Adaptation for Transformer-based Low-Resource Language Translation with a Case Study on Irish
链接:https://arxiv.org/abs/2510.18725
作者:Josh McGiff,Nikola S. Nikolov
类目:Computation and Language (cs.CL)
关键词:neural machine translation, specific tasks, neural machine, tailor large language, Irish translation
备注: 8 pages
点击查看摘要
Abstract:Fine-tuning is widely used to tailor large language models for specific tasks such as neural machine translation (NMT). However, leveraging transfer learning is computationally expensive when fine-tuning large multilingual models with billions of parameters, thus creating a barrier to entry for researchers working on low-resource domains such as Irish translation. Parameter-efficient fine-tuning (PEFT) bridges this gap by training on a fraction of the original model parameters, with the Low-Rank Adaptation (LoRA) approach introducing small, trainable adapter layers. We introduce SemiAdapt and SemiLoRA as semi-supervised inference-efficient approaches that strengthen domain adaptation and lead to improved overall performance in NMT. We demonstrate that SemiAdapt can outperform full-domain fine-tuning, while most notably, SemiLoRA can propel PEFT methods to match or even outperform full-model fine-tuning. We further evaluate domain-by-dataset fine-tuning and demonstrate that our embedding-based inference methods perform especially well on larger and noisier corpora. All Irish translation models developed in this work are released as open resources. These methods aim to make high-quality domain adaptation and fine-tuning more accessible to researchers working with low-resource languages.
16. 【2510.18724】Adapting Language Balance in Code-Switching Speech
链接:https://arxiv.org/abs/2510.18724
作者:Enes Yavuz Ugan,Ngoc-Quan Pham,Alexander Waibel
类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
关键词:achieving impressive results, large foundational models, code-switching test cases, standard benchmarks, large foundational
备注: Submitted to ICASSP 2026
点击查看摘要
Abstract:Despite achieving impressive results on standard benchmarks, large foundational models still struggle against code-switching test cases. When data scarcity cannot be used as the usual justification for poor performance, the reason may lie in the infrequent occurrence of code-switched moments, where the embedding of the second language appears subtly. Instead of expecting the models to learn this infrequency on their own, it might be beneficial to provide the training process with labels. Evaluating model performance on code-switching data requires careful localization of code-switching points where recognition errors are most consequential, so that the analysis emphasizes mistakes occurring at those moments. Building on this observation, we leverage the difference between the embedded and the main language to highlight those code-switching points and thereby emphasize learning at those locations. This simple yet effective differentiable surrogate mitigates context bias during generation -- the central challenge in code-switching -- thereby improving the model's robustness. Our experiments with Arabic and Chinese-English showed that the models are able to predict the switching places more correctly, reflected by the reduced substitution error.
17. 【2510.18723】Bayesian Low-Rank Factorization for Robust Model Adaptation
链接:https://arxiv.org/abs/2510.18723
作者:Enes Yavuz Ugan,Ngoc-Quan Pham,Alexander Waibel
类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
关键词:Large speech foundation, speakers mix languages, speech foundation models, Large speech, handle local
备注: Submitted to ICASSP 2026
点击查看摘要
Abstract:Large speech foundation models achieve strong performance across many domains, but they often require adaptation to handle local needs such as code-switching, where speakers mix languages within the same utterance. Direct fine-tuning of these models risks overfitting to the target domain and overwriting the broad capabilities of the base model. To address this challenge, we explore Bayesian factorized adapters for speech foundation models, which place priors near zero to achieve sparser adaptation matrices and thereby retain general performance while adapting to specific domains. We apply our approach to the Whisper model and evaluate on different multilingual code-switching scenarios. Our results show only minimal adaptation loss while significantly reducing catastrophic forgetting of the base model. Compared to LoRA, our method achieves a backward gain of 54% with only a 4% drop on the new domain. These findings highlight the effectiveness of Bayesian adaptation for fine-tuning speech foundation models without sacrificing generalization.
18. 【2510.18691】Investigating LLM Capabilities on Long Context Comprehension for Medical Question Answering
链接:https://arxiv.org/abs/2510.18691
作者:Feras AlMannaa,Talia Tseriotou,Jenny Chim,Maria Liakata
类目:Computation and Language (cs.CL)
关键词:investigate LLM comprehension, LLM comprehension capabilities, investigate LLM, clinical relevance, LLM comprehension
备注:
点击查看摘要
Abstract:This study is the first to investigate LLM comprehension capabilities over long-context (LC) medical QA of clinical relevance. Our comprehensive assessment spans a range of content-inclusion settings based on their relevance, LLM models of varying capabilities and datasets across task formulations, revealing insights on model size effects, limitations, underlying memorization issues and the benefits of reasoning models. Importantly, we examine the effect of RAG on medical LC comprehension, uncover best settings in single versus multi-document reasoning datasets and showcase RAG strategies for improvements over LC. We shed light into some of the evaluation aspects using a multi-faceted approach. Our qualitative and error analyses address open questions on when RAG is beneficial over LC, revealing common failure cases.
19. 【2510.18684】MLMA: Towards Multilingual with Mamba Based Architectures
链接:https://arxiv.org/abs/2510.18684
作者:Mohamed Nabih Ali,Daniele Falavigna,Alessio Brutti
类目:Computation and Language (cs.CL); Sound (cs.SD)
关键词:remains a challenging, challenging task, ASR, Mamba, multilingual ASR
备注: The paper is under review at ICASSP 2026
点击查看摘要
Abstract:Multilingual automatic speech recognition (ASR) remains a challenging task, especially when balancing performance across high- and low-resource languages. Recent advances in sequence modeling suggest that architectures beyond Transformers may offer better scalability and efficiency. In this work, we introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new approach that leverages the Mamba architecture--an efficient state-space model optimized for long-context sequence processing--for multilingual ASR. Using Mamba, MLMA implicitly incorporates language-aware conditioning and shared representations to support robust recognition across diverse languages. Experiments on standard multilingual benchmarks show that MLMA achieves competitive performance compared to Transformer-based architectures. These results highlight Mamba's potential as a strong backbone for scalable, efficient, and accurate multilingual speech recognition.
20. 【2510.18629】Dynamical model parameters from ultrasound tongue kinematics
链接:https://arxiv.org/abs/2510.18629
作者:Sam Kirkham,Patrycja Strycharczuk
类目:Computation and Language (cs.CL)
关键词:target positions, control of speech, articulators are driven, driven toward target, EMA
备注: Accepted for publication in JASA Express Letters
点击查看摘要
Abstract:The control of speech can be modelled as a dynamical system in which articulators are driven toward target positions. These models are typically evaluated using fleshpoint data, such as electromagnetic articulography (EMA), but recent methodological advances make ultrasound imaging a promising alternative. We evaluate whether the parameters of a linear harmonic oscillator can be reliably estimated from ultrasound tongue kinematics and compare these with parameters estimated from simultaneously-recorded EMA data. We find that ultrasound and EMA yield comparable dynamical parameters, while mandibular short tendon tracking also adequately captures jaw motion. This supports using ultrasound kinematics to evaluate dynamical articulatory models.
21. 【2510.18582】Beyond the Explicit: A Bilingual Dataset for Dehumanization Detection in Social Media
链接:https://arxiv.org/abs/2510.18582
作者:Dennis Assenmacher,Paloma Piot,Katarina Laken,David Jurgens,Claudia Wagner
类目:Computation and Language (cs.CL)
关键词:Natural Language Processing, Language Processing, Natural Language, remains largely overlooked, linguistics and Natural
备注:
点击查看摘要
Abstract:Digital dehumanization, although a critical issue, remains largely overlooked within the field of computational linguistics and Natural Language Processing. The prevailing approach in current research concentrating primarily on a single aspect of dehumanization that identifies overtly negative statements as its core marker. This focus, while crucial for understanding harmful online communications, inadequately addresses the broader spectrum of dehumanization. Specifically, it overlooks the subtler forms of dehumanization that, despite not being overtly offensive, still perpetuate harmful biases against marginalized groups in online interactions. These subtler forms can insidiously reinforce negative stereotypes and biases without explicit offensiveness, making them harder to detect yet equally damaging. Recognizing this gap, we use different sampling methods to collect a theory-informed bilingual dataset from Twitter and Reddit. Using crowdworkers and experts to annotate 16,000 instances on a document- and span-level, we show that our dataset covers the different dimensions of dehumanization. This dataset serves as both a training resource for machine learning models and a benchmark for evaluating future dehumanization detection techniques. To demonstrate its effectiveness, we fine-tune ML models on this dataset, achieving performance that surpasses state-of-the-art models in zero and few-shot in-context settings.
22. 【2510.18561】Large language models for folktale type automation based on motifs: Cinderella case study
链接:https://arxiv.org/abs/2510.18561
作者:Tjaša Arčon,Marko Robnik-Šikonja,Polona Tratnik
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Artificial intelligence approaches, including digital humanities, Artificial intelligence, research areas, including digital
备注:
点击查看摘要
Abstract:Artificial intelligence approaches are being adapted to many research areas, including digital humanities. We built a methodology for large-scale analyses in folkloristics. Using machine learning and natural language processing, we automatically detected motifs in a large collection of Cinderella variants and analysed their similarities and differences with clustering and dimensionality reduction. The results show that large language models detect complex interactions in tales, enabling computational analysis of extensive text collections and facilitating cross-lingual comparisons.
23. 【2510.18556】Building Trust in Clinical LLMs: Bias Analysis and Dataset Transparency
链接:https://arxiv.org/abs/2510.18556
作者:Svetlana Maslenkova,Clement Christophe,Marco AF Pimentel,Tathagata Raha,Muhammad Umar Salman,Ahmed Al Mahrooqi,Avani Gupta,Shadab Khan,Ronnie Rajan,Praveenkumar Kanithi
类目:Computation and Language (cs.CL)
关键词:equitable development depends, development depends critically, training data characteristics, data characteristics influence, Large language models
备注: Accepted to EMNLP Main 2025
点击查看摘要
Abstract:Large language models offer transformative potential for healthcare, yet their responsible and equitable development depends critically on a deeper understanding of how training data characteristics influence model behavior, including the potential for bias. Current practices in dataset curation and bias assessment often lack the necessary transparency, creating an urgent need for comprehensive evaluation frameworks to foster trust and guide improvements. In this study, we present an in-depth analysis of potential downstream biases in clinical language models, with a focus on differential opioid prescription tendencies across diverse demographic groups, such as ethnicity, gender, and age. As part of this investigation, we introduce HC4: Healthcare Comprehensive Commons Corpus, a novel and extensively curated pretraining dataset exceeding 89 billion tokens. Our evaluation leverages both established general benchmarks and a novel, healthcare-specific methodology, offering crucial insights to support fairness and safety in clinical AI applications.
24. 【2510.18510】Identity-Aware Large Language Models require Cultural Reasoning
链接:https://arxiv.org/abs/2510.18510
作者:Alistair Plum,Anne-Marie Lutgen,Christoph Purschke,Achim Rettinger
类目:Computation and Language (cs.CL)
关键词:natural language processing, Large language models, Large language, language processing, natural language
备注:
点击查看摘要
Abstract:Large language models have become the latest trend in natural language processing, heavily featuring in the digital tools we use every day. However, their replies often reflect a narrow cultural viewpoint that overlooks the diversity of global users. This missing capability could be referred to as cultural reasoning, which we define here as the capacity of a model to recognise culture-specific knowledge values and social norms, and to adjust its output so that it aligns with the expectations of individual users. Because culture shapes interpretation, emotional resonance, and acceptable behaviour, cultural reasoning is essential for identity-aware AI. When this capacity is limited or absent, models can sustain stereotypes, ignore minority perspectives, erode trust, and perpetuate hate. Recent empirical studies strongly suggest that current models default to Western norms when judging moral dilemmas, interpreting idioms, or offering advice, and that fine-tuning on survey data only partly reduces this tendency. The present evaluation methods mainly report static accuracy scores and thus fail to capture adaptive reasoning in context. Although broader datasets can help, they cannot alone ensure genuine cultural competence. Therefore, we argue that cultural reasoning must be treated as a foundational capability alongside factual accuracy and linguistic coherence. By clarifying the concept and outlining initial directions for its assessment, a foundation is laid for future systems to be able to respond with greater sensitivity to the complex fabric of human culture.
25. 【2510.18502】Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2510.18502
作者:Wei-Chia Chang,Yan-Ann Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:intelligent transportation systems, existing approaches struggle, newly released models, transportation systems, important task
备注: Accepted by The 38th Conference of Open Innovations Association FRUCT, 2025
点击查看摘要
Abstract:Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
26. 【2510.18480】How Efficient Are Diffusion Language Models? A Critical Examination of Efficiency Evaluation Practices
链接:https://arxiv.org/abs/2510.18480
作者:Han Peng,Peiyu Liu,Zican Dong,Daixuan Cheng,Junyi Li,Yiru Tang,Shuo Wang,Wayne Xin Zhao
类目:Computation and Language (cs.CL)
关键词:Diffusion language models, Diffusion language, yield greater efficiency, parallelable decoding process, long-dominant autoregressive
备注:
点击查看摘要
Abstract:Diffusion language models (DLMs) have emerged as a promising alternative to the long-dominant autoregressive (AR) paradigm, offering a parallelable decoding process that could yield greater efficiency. Yet, in practice, current open-source DLMs often underperform their AR counterparts in speed, limiting their real-world utility. This work presents a systematic study of DLM efficiency, identifying key issues in prior evaluation methods. Through empirical benchmarking and a roofline-based theoretical analysis, we demonstrate that AR models generally achieve higher throughput, while DLMs consistently lag. We also investigate acceleration strategies, finding that techniques like dual cache and parallel decoding mainly offer gains at small batch sizes, with their benefits diminishing upon scaling. Our findings underscore the necessity of robust evaluation methods and improved acceleration strategies to advance research on DLMs.
27. 【2510.18476】Probabilistic Modeling of Intentions in Socially Intelligent LLM Agents
链接:https://arxiv.org/abs/2510.18476
作者:Feifan Xia,Yuyang Fang,Defang Li,Yantong Xie,Weikang Li,Yang Li,Deguo Xia,Jizhou Huang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:large language model, multi-turn social dialogue, language model, large language, multi-turn social
备注:
点击查看摘要
Abstract:We present a probabilistic intent modeling framework for large language model (LLM) agents in multi-turn social dialogue. The framework maintains a belief distribution over a partner's latent intentions, initialized from contextual priors and dynamically updated through likelihood estimation after each utterance. The evolving distribution provides additional contextual grounding for the policy, enabling adaptive dialogue strategies under uncertainty. Preliminary experiments in the SOTOPIA environment show consistent improvements: the proposed framework increases the Overall score by 9.0% on SOTOPIA-All and 4.1% on SOTOPIA-Hard compared with the Qwen2.5-7B baseline, and slightly surpasses an oracle agent that directly observes partner intentions. These early results suggest that probabilistic intent modeling can contribute to the development of socially intelligent LLM agents.
28. 【2510.18475】DART: A Structured Dataset of Regulatory Drug Documents in Italian for Clinical NLP
链接:https://arxiv.org/abs/2510.18475
作者:Mariano Barone,Antonio Laudante,Giuseppe Riccio,Antonio Romano,Marco Postiglione,Vincenzo Moscato
类目:Computation and Language (cs.CL)
关键词:natural language processing, clinical decision support, biomedical natural language, adverse event monitoring, AI-assisted clinical decision
备注:
点击查看摘要
Abstract:The extraction of pharmacological knowledge from regulatory documents has become a key focus in biomedical natural language processing, with applications ranging from adverse event monitoring to AI-assisted clinical decision support. However, research in this field has predominantly relied on English-language corpora such as DrugBank, leaving a significant gap in resources tailored to other healthcare systems. To address this limitation, we introduce DART (Drug Annotation from Regulatory Texts), the first structured corpus of Italian Summaries of Product Characteristics derived from the official repository of the Italian Medicines Agency (AIFA). The dataset was built through a reproducible pipeline encompassing web-scale document retrieval, semantic segmentation of regulatory sections, and clinical summarization using a few-shot-tuned large language model with low-temperature decoding. DART provides structured information on key pharmacological domains such as indications, adverse drug reactions, and drug-drug interactions. To validate its utility, we implemented an LLM-based drug interaction checker that leverages the dataset to infer clinically meaningful interactions. Experimental results show that instruction-tuned LLMs can accurately infer potential interactions and their clinical implications when grounded in the structured textual fields of DART. We publicly release our code on GitHub: this https URL.
29. 【2510.18471】CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
链接:https://arxiv.org/abs/2510.18471
作者:Xue Jiang,Yihong Dong,Mengyang Liu,Hongyi Deng,Tian Wang,Yongding Tao,Rongyu Cao,Binhua Li,Zhi Jin,Wenpin Jiao,Fei Huang,Yongbin Li,Ge Li
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, vast code corpora, execution semantics
备注:
点击查看摘要
Abstract:While Large Language Models (LLMs) excel at code generation by learning from vast code corpora, a fundamental semantic gap remains between their training on textual patterns and the goal of functional correctness, which is governed by formal execution semantics. Reinforcement Learning with Verifiable Rewards (RLVR) approaches attempt to bridge this gap using outcome rewards from executing test cases. However, solely relying on binary pass/fail signals is inefficient for establishing a well-aligned connection between the textual representation of code and its execution semantics, especially for subtle logical errors within the code. In this paper, we propose CodeRL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation. CodeRL+ enables the model to infer variable-level execution trajectory, providing a direct learning signal of execution semantics. CodeRL+ can construct execution semantics alignment directly using existing on-policy rollouts and integrates seamlessly with various RL algorithms. Extensive experiments demonstrate that CodeRL+ outperforms post-training baselines (including RLVR and Distillation), achieving a 4.6% average relative improvement in pass@1. CodeRL+ generalizes effectively to other coding tasks, yielding 15.5% and 4.4% higher accuracy on code-reasoning and test-output-generation benchmarks, respectively. CodeRL+ shows strong applicability across diverse RL algorithms and LLMs. Furthermore, probe analyses provide compelling evidence that CodeRL+ strengthens the alignment between code's textual representations and its underlying execution semantics.
30. 【2510.18468】IMB: An Italian Medical Benchmark for Question Answering
链接:https://arxiv.org/abs/2510.18468
作者:Antonio Romano,Giuseppe Riccio,Mariano Barone,Marco Postiglione,Vincenzo Moscato
类目:Computation and Language (cs.CL)
关键词:professional healthcare advice, generating vast amounts, patients seek professional, seek professional healthcare, Online medical forums
备注:
点击查看摘要
Abstract:Online medical forums have long served as vital platforms where patients seek professional healthcare advice, generating vast amounts of valuable knowledge. However, the informal nature and linguistic complexity of forum interactions pose significant challenges for automated question answering systems, especially when dealing with non-English languages. We present two comprehensive Italian medical benchmarks: \textbf{IMB-QA}, containing 782,644 patient-doctor conversations from 77 medical categories, and \textbf{IMB-MCQA}, comprising 25,862 multiple-choice questions from medical specialty examinations. We demonstrate how Large Language Models (LLMs) can be leveraged to improve the clarity and consistency of medical forum data while retaining their original meaning and conversational style, and compare a variety of LLM architectures on both open and multiple-choice question answering tasks. Our experiments with Retrieval Augmented Generation (RAG) and domain-specific fine-tuning reveal that specialized adaptation strategies can outperform larger, general-purpose models in medical question answering tasks. These findings suggest that effective medical AI systems may benefit more from domain expertise and efficient information retrieval than from increased model scale. We release both datasets and evaluation frameworks in our GitHub repository to support further research on multilingual medical question answering: this https URL.
31. 【2510.18466】CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning
链接:https://arxiv.org/abs/2510.18466
作者:Masato Kikuchi,Masatsugu Ono,Toshioki Soga,Tetsu Tanabe,Tadachika Ozono
类目:Computation and Language (cs.CL)
关键词:valuable resource owing, Common European Framework, structured semantic networks, fine-grained sense distinctions, Vocabulary Profile Online
备注:
点击查看摘要
Abstract:Although WordNet is a valuable resource owing to its structured semantic networks and extensive vocabulary, its fine-grained sense distinctions can be challenging for second-language learners. To address this, we developed a WordNet annotated with the Common European Framework of Reference for Languages (CEFR), integrating its semantic networks with language-proficiency levels. We automated this process using a large language model to measure the semantic similarity between sense definitions in WordNet and entries in the English Vocabulary Profile Online. To validate our method, we constructed a large-scale corpus containing both sense and CEFR-level information from our annotated WordNet and used it to develop contextual lexical classifiers. Our experiments demonstrate that models fine-tuned on our corpus perform comparably to those trained on gold-standard annotations. Furthermore, by combining our corpus with the gold-standard data, we developed a practical classifier that achieves a Macro-F1 score of 0.81, indicating the high accuracy of our annotations. Our annotated WordNet, corpus, and classifiers are publicly available to help bridge the gap between natural language processing and language education, thereby facilitating more effective and efficient language learning.
32. 【2510.18462】DePass: Unified Feature Attributing by Simple Decomposed Forward Pass
链接:https://arxiv.org/abs/2510.18462
作者:Xiangyu Hong,Che Jiang,Kai Tian,Biqing Qi,Youbang Sun,Ning Ding,Bowen Zhou
类目:Computation and Language (cs.CL)
关键词:Attributing the behavior, internal computations, central challenge, challenge in mechanistic, Attributing
备注:
点击查看摘要
Abstract:Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them with attention scores and MLP's activations fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass across token-level, model component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.
33. 【2510.18455】ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks
链接:https://arxiv.org/abs/2510.18455
作者:Liyang He,Yuren Zhang,Ziwei Zhu,Zhenghui Li,Shiwei Tong
类目:Computation and Language (cs.CL)
关键词:Retrieval Augmented Generation, Retrieval Augmented, impeded standardized evaluation, Augmented Generation, systems are increasingly
备注:
点击查看摘要
Abstract:Retrieval Augmented Generation (RAG) systems are increasingly vital in dynamic domains like online gaming, yet the lack of a dedicated benchmark has impeded standardized evaluation in this area. The core difficulty lies in Dual Dynamics: the constant interplay between game content updates and the shifting focus of the player community. Furthermore, the necessity of automating such a benchmark introduces a critical requirement for player-centric authenticity to ensure generated questions are realistic. To address this integrated challenge, we introduce ChronoPlay, a novel framework for the automated and continuous generation of game RAG benchmarks. ChronoPlay utilizes a dual-dynamic update mechanism to track both forms of change, and a dual-source synthesis engine that draws from official sources and player community to ensure both factual correctness and authentic query patterns. We instantiate our framework on three distinct games to create the first dynamic RAG benchmark for the gaming domain, offering new insights into model performance under these complex and realistic conditions. Code is avaliable at: this https URL.
34. 【2510.18454】Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models
链接:https://arxiv.org/abs/2510.18454
作者:Atharvan Dogra,Soumya Suvra Ghosal,Ameet Deshpande,Ashwin Kalyan,Dinesh Manocha
类目:Computation and Language (cs.CL)
关键词:raising safety concerns, Large language models, Large language, raising safety, creative writing
备注:
点击查看摘要
Abstract:Large language models are increasingly used for creative writing and engagement content, raising safety concerns about the outputs. Therefore, casting humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity. This is further supplemented by analyzing incongruity signals through information-theoretic metrics. Across six models, we observe that harmful outputs receive higher humor scores which further increase under role-based prompting, indicating a bias amplification loop between generators and evaluators. Information-theoretic analyses show harmful cues widen predictive uncertainty and surprisingly, can even make harmful punchlines more expected for some models, suggesting structural embedding in learned humor distributions. External validation on an additional satire-generation task with human perceived funniness judgments shows that LLM satire increases stereotypicality and typically toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain $10-21\%$ in mean humor score, stereotypical jokes appear $11\%$ to $28\%$ more often among the jokes marked funny by LLM-based metric and up to $10\%$ more often in generations perceived as funny by humans.
35. 【2510.18439】Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation
链接:https://arxiv.org/abs/2510.18439
作者:Yasser Hamidullah,Koel Dutta Chowdury,Yusser Al-Ghussin,Shakib Yazdani,Cennet Oguz,Josef van Genabith,Cristina España-Bonet
类目:Computation and Language (cs.CL)
关键词:generate fluent text, fluent text unsupported, sign language translation, models generate fluent, remains a major
备注:
点击查看摘要
Abstract:Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.
36. 【2510.18434】Chain-of-Conceptual-Thought: Eliciting the Agent to Deeply Think within the Response
链接:https://arxiv.org/abs/2510.18434
作者:Qingqing Gu,Dan Wang,Yue Zhao,Xiaoyu Wang,Zhonglin Jiang,Yong Chen,Hongyan Li,Luo Ji
类目:Computation and Language (cs.CL)
关键词:capability in math, widely applied, applied to improve, LLM capability, LLM
备注:
点击查看摘要
Abstract:Chain-of-Thought (CoT) is widely applied to improve the LLM capability in math, coding and reasoning tasks. However, its performance is limited for open-domain tasks since there are no clearly defined reasoning steps or logical transitions. To mitigate such challenges, we propose another prompt-based paradigm called Chain of Conceptual Thought (CoCT), where the LLM first tags a concept, then generates the detailed content. The chain of concepts is allowed within the utterance, encouraging the LLM's deep and strategic thinking. We experiment with this paradigm in daily and emotional support conversations where the concept is comprised of emotions, strategies and topics. Automatic, human and model evaluations suggest that CoCT surpasses baselines such as Self-Refine, ECoT, ToT, SoT and RAG, suggesting a potential effective prompt-based paradigm of LLM for a wider scope of tasks.
37. 【2510.18413】Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference
链接:https://arxiv.org/abs/2510.18413
作者:Siyuan Yan,Guo-Qing Jiang,Yuchen Zhang,Xiaoxing Ma,Ran Zhu,Chun Cao,Jingwei Xu
类目:Computation and Language (cs.CL)
关键词:Large language models, large-scale code synthesis, persistent multi-turn dialogue, multi-document question answering, Large language
备注:
点击查看摘要
Abstract:Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selections. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity.
38. 【2510.18383】MENTOR: A Reinforcement Learning Framework for Model Enhancement via Teacher-Optimized Rewards in Small Models
链接:https://arxiv.org/abs/2510.18383
作者:ChangSu Choi,Hoyun Song,Dongyeon Kim,WooHyeon Jung,Minkyung Cho,Sunjin Park,NohHyeob Bae,Seona Yu,KyungTae Lim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:efficient small language, large language models, small language models, Distilling the tool-using, language models
备注:
点击查看摘要
Abstract:Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization as it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, the standard RL using sparse rewards fails to effectively guide SLMs, causing them to struggle with inefficient exploration and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses a teacher's reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.
39. 【2510.18374】owards Fair ASR For Second Language Speakers Using Fairness Prompted Finetuning
链接:https://arxiv.org/abs/2510.18374
作者:Monorama Swain,Bubai Maji,Jagabandhu Mishra,Markus Schedl,Anders Søgaard,Jesper Rindom Jensen
类目:Computation and Language (cs.CL)
关键词:fair English ASR, English ASR systems, building fair English, English ASR, fair English
备注: Submitted to ICASSP 2026
点击查看摘要
Abstract:In this work, we address the challenge of building fair English ASR systems for second-language speakers. Our analysis of widely used ASR models, Whisper and Seamless-M4T, reveals large fluctuations in word error rate (WER) across 26 accent groups, indicating significant fairness gaps. To mitigate this, we propose fairness-prompted finetuning with lightweight adapters, incorporating Spectral Decoupling (SD), Group Distributionally Robust Optimization (Group-DRO), and Invariant Risk Minimization (IRM). Our proposed fusion of traditional empirical risk minimization (ERM) with cross-entropy and fairness-driven objectives (SD, Group DRO, and IRM) enhances fairness across accent groups while maintaining overall recognition accuracy. In terms of macro-averaged word error rate, our approach achieves a relative improvement of 58.7% and 58.5% over the large pretrained Whisper and SeamlessM4T, and 9.7% and 7.8% over them, finetuning with standard empirical risk minimization with cross-entropy loss.
40. 【2510.18368】KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs
链接:https://arxiv.org/abs/2510.18368
作者:Donghyeon Ko,Yeguk Jin,Kyubyung Chae,Byungwook Lee,Chansong Jo,Sookyo In,Jaehong Lee,Taesup Kim,Donghyun Kwak
类目:Computation and Language (cs.CL)
关键词:large language models, Korean cultural knowledge, benchmark for evaluating, evaluating factuality, factuality in large
备注:
点击查看摘要
Abstract:We present $\textbf{Korean SimpleQA (KoSimpleQA)}$, a benchmark for evaluating factuality in large language models (LLMs) with a focus on Korean cultural knowledge. KoSimpleQA is designed to be challenging yet easy to grade, consisting of 1,000 short, fact-seeking questions with unambiguous answers. We conduct a comprehensive evaluation across a diverse set of open-source LLMs of varying sizes that support Korean, and find that even the strongest model generates correct answer only 33.7% of the time, underscoring the challenging nature of KoSimpleQA. Notably, performance rankings on KoSimpleQA differ substantially from those on the English SimpleQA, highlighting the unique value of our dataset. Furthermore, our analysis of reasoning LLMs shows that engaging reasoning capabilities in the factual QA task can both help models better elicit their latent knowledge and improve their ability to abstain when uncertain. KoSimpleQA can be found at this https URL.
41. 【2510.18355】KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers
链接:https://arxiv.org/abs/2510.18355
作者:Mohd Ruhul Ameen,Akif Islam,Farjana Aktar,M. Saifuzzaman Rafat
类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
关键词:Optical Character Recognition, accessing timely, continue to face, face challenges, challenges in accessing
备注: 6 pages, 7 figures, 5 tables, submitted to the 11th IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering (WIECON-ECE 2025)
点击查看摘要
Abstract:In Bangladesh, many farmers continue to face challenges in accessing timely, expert-level agricultural guidance. This paper presents KrishokBondhu, a voice-enabled, call-centre-integrated advisory platform built on a Retrieval-Augmented Generation (RAG) framework, designed specifically for Bengali-speaking farmers. The system aggregates authoritative agricultural handbooks, extension manuals, and NGO publications; applies Optical Character Recognition (OCR) and document-parsing pipelines to digitize and structure the content; and indexes this corpus in a vector database for efficient semantic retrieval. Through a simple phone-based interface, farmers can call the system to receive real-time, context-aware advice: speech-to-text converts the Bengali query, the RAG module retrieves relevant content, a large language model (Gemma 3-4B) generates a context-grounded response, and text-to-speech delivers the answer in natural spoken Bengali. In a pilot evaluation, KrishokBondhu produced high-quality responses for 72.7% of diverse agricultural queries covering crop management, disease control, and cultivation practices. Compared to the KisanQRS benchmark, the system achieved a composite score of 4.53 (vs. 3.13) on a 5-point scale, a 44.7% improvement, with especially large gains in contextual richness (+367%) and completeness (+100.4%), while maintaining comparable relevance and technical specificity. Semantic similarity analysis further revealed a strong correlation between retrieved context and answer quality, emphasizing the importance of grounding generative responses in curated documentation. KrishokBondhu demonstrates the feasibility of integrating call-centre accessibility, multilingual voice interaction, and modern RAG techniques to deliver expert-level agricultural guidance to remote Bangladeshi farmers, paving the way toward a fully AI-driven agricultural advisory ecosystem.
42. 【2510.18344】Combining Distantly Supervised Models with In Context Learning for Monolingual and Cross-Lingual Relation Extraction
链接:https://arxiv.org/abs/2510.18344
作者:Vipul Rathore,Malik Hammad Faisal,Parag Singla,Mausam
类目:Computation and Language (cs.CL)
关键词:Distantly Supervised Relation, Supervised Relation Extraction, Distantly Supervised, making sentence-level predictions, HYbrid Distantly Supervised
备注:
点击查看摘要
Abstract:Distantly Supervised Relation Extraction (DSRE) remains a long-standing challenge in NLP, where models must learn from noisy bag-level annotations while making sentence-level predictions. While existing state-of-the-art (SoTA) DSRE models rely on task-specific training, their integration with in-context learning (ICL) using large language models (LLMs) remains underexplored. A key challenge is that the LLM may not learn relation semantics correctly, due to noisy annotation. In response, we propose HYDRE -- HYbrid Distantly Supervised Relation Extraction framework. It first uses a trained DSRE model to identify the top-k candidate relations for a given test sentence, then uses a novel dynamic exemplar retrieval strategy that extracts reliable, sentence-level exemplars from training data, which are then provided in LLM prompt for outputting the final relation(s). We further extend HYDRE to cross-lingual settings for RE in low-resource languages. Using available English DSRE training data, we evaluate all methods on English as well as a newly curated benchmark covering four diverse low-resource Indic languages -- Oriya, Santali, Manipuri, and Tulu. HYDRE achieves up to 20 F1 point gains in English and, on average, 17 F1 points on Indic languages over prior SoTA DSRE models. Detailed ablations exhibit HYDRE's efficacy compared to other prompting strategies.
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2510.18344 [cs.CL]
(or
arXiv:2510.18344v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2510.18344
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
43. 【2510.18339】ECG-LLM -- training and evaluation of domain-specific large language models for electrocardiography
链接:https://arxiv.org/abs/2510.18339
作者:Lara Ahrens,Wilhelm Haverkamp,Nils Strodthoff
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Domain-adapted open-weight large, offer promising healthcare, promising healthcare applications, open-weight large language, Domain-adapted open-weight
备注: 34 pages, 8 figures, code available at [this https URL](https://github.com/AI4HealthUOL/ecg-llm)
点击查看摘要
Abstract:Domain-adapted open-weight large language models (LLMs) offer promising healthcare applications, from queryable knowledge bases to multimodal assistants, with the crucial advantage of local deployment for privacy preservation. However, optimal adaptation strategies, evaluation methodologies, and performance relative to general-purpose LLMs remain poorly characterized. We investigated these questions in electrocardiography, an important area of cardiovascular medicine, by finetuning open-weight models on domain-specific literature and implementing a multi-layered evaluation framework comparing finetuned models, retrieval-augmented generation (RAG), and Claude Sonnet 3.7 as a representative general-purpose model. Finetuned Llama 3.1 70B achieved superior performance on multiple-choice evaluations and automatic text metrics, ranking second to Claude 3.7 in LLM-as-a-judge assessments. Human expert evaluation favored Claude 3.7 and RAG approaches for complex queries. Finetuned models significantly outperformed their base counterparts across nearly all evaluation modes. Our findings reveal substantial performance heterogeneity across evaluation methodologies, underscoring assessment complexity. Nevertheless, domain-specific adaptation through finetuning and RAG achieves competitive performance with proprietary models, supporting the viability of privacy-preserving, locally deployable clinical solutions.
44. 【2510.18333】Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption
链接:https://arxiv.org/abs/2510.18333
作者:Yepeng Liu,Xuandong Zhao,Dawn Song,Gregory W. Wornell,Yuheng Bu
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:large language models, real-world deployment remains, deployment remains limited, real-world deployment, algorithms for large
备注:
点击查看摘要
Abstract:Despite progress in watermarking algorithms for large language models (LLMs), real-world deployment remains limited. We argue that this gap stems from misaligned incentives among LLM providers, platforms, and end users, which manifest as four key barriers: competitive risk, detection-tool governance, robustness concerns and attribution issues. We revisit three classes of watermarking through this lens. \emph{Model watermarking} naturally aligns with LLM provider interests, yet faces new challenges in open-source ecosystems. \emph{LLM text watermarking} offers modest provider benefit when framed solely as an anti-misuse tool, but can gain traction in narrowly scoped settings such as dataset de-contamination or user-controlled provenance. \emph{In-context watermarking} (ICW) is tailored for trusted parties, such as conference organizers or educators, who embed hidden watermarking instructions into documents. If a dishonest reviewer or student submits this text to an LLM, the output carries a detectable watermark indicating misuse. This setup aligns incentives: users experience no quality loss, trusted parties gain a detection tool, and LLM providers remain neutral by simply following watermark instructions. We advocate for a broader exploration of incentive-aligned methods, with ICW as an example, in domains where trusted parties need reliable tools to detect misuse. More broadly, we distill design principles for incentive-aligned, domain-specific watermarking and outline future research directions. Our position is that the practical adoption of LLM watermarking requires aligning stakeholder incentives in targeted application domains and fostering active community engagement.
45. 【2510.18304】he Impact of Image Resolution on Biomedical Multimodal Large Language Models
链接:https://arxiv.org/abs/2510.18304
作者:Liangyu Chen,James Burgess,Jeffrey J Nirschl,Orr Zohar,Serena Yeung-Levy
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Imaging technologies, modern medicine, technologies are fundamental, requiring analysis, biomedical image analysis
备注: Proceedings of the 10th Machine Learning for Healthcare Conference, PMLR 298, 2025
点击查看摘要
Abstract:Imaging technologies are fundamental to biomedical research and modern medicine, requiring analysis of high-resolution images across various modalities. While multimodal large language models (MLLMs) show promise for biomedical image analysis, most are designed for low-resolution images from general-purpose datasets, risking critical information loss. We investigate how image resolution affects MLLM performance in biomedical applications and demonstrate that: (1) native-resolution training and inference significantly improve performance across multiple tasks, (2) misalignment between training and inference resolutions severely degrades performance, and (3) mixed-resolution training effectively mitigates misalignment and balances computational constraints with performance requirements. Based on these findings, we recommend prioritizing native-resolution inference and mixed-resolution datasets to optimize biomedical MLLMs for transformative impact in scientific research and clinical applications.
46. 【2510.18297】From Retrieval to Generation: Unifying External and Parametric Knowledge for Medical Question Answering
链接:https://arxiv.org/abs/2510.18297
作者:Lei Li,Xiao Zhou,Yingying Zhang,Xian Wu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Medical question answering, requires extensive access, question answering, access to domain-specific, knowledge
备注: 13 pages, 4 figures
点击查看摘要
Abstract:Medical question answering (QA) requires extensive access to domain-specific knowledge. A promising direction is to enhance large language models (LLMs) with external knowledge retrieved from medical corpora or parametric knowledge stored in model parameters. Existing approaches typically fall into two categories: Retrieval-Augmented Generation (RAG), which grounds model reasoning on externally retrieved evidence, and Generation-Augmented Generation (GAG), which depends solely on the models internal knowledge to generate contextual documents. However, RAG often suffers from noisy or incomplete retrieval, while GAG is vulnerable to hallucinated or inaccurate information due to unconstrained generation. Both issues can mislead reasoning and undermine answer reliability. To address these challenges, we propose MedRGAG, a unified retrieval-generation augmented framework that seamlessly integrates external and parametric knowledge for medical QA. MedRGAG comprises two key modules: Knowledge-Guided Context Completion (KGCC), which directs the generator to produce background documents that complement the missing knowledge revealed by retrieval; and Knowledge-Aware Document Selection (KADS), which adaptively selects an optimal combination of retrieved and generated documents to form concise yet comprehensive evidence for answer generation. Extensive experiments on five medical QA benchmarks demonstrate that MedRGAG achieves a 12.5% improvement over MedRAG and a 4.5% gain over MedGENIE, highlighting the effectiveness of unifying retrieval and generation for knowledge-intensive reasoning. Our code and data are publicly available at this https URL
47. 【2510.18289】Food4All: A Multi-Agent Framework for Real-time Free Food Discovery with Integrated Nutritional Metadata
链接:https://arxiv.org/abs/2510.18289
作者:Zhengqing Yuan,Yiyang Li,Weixiang Sun,Zheyuan Zhang,Kaiwen Shi,Keerthiram Murugesan,Yanfang Ye
类目:Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
关键词:United States, persistent public health, public health emergency, mental illness, tightly interwoven
备注:
点击查看摘要
Abstract:Food insecurity remains a persistent public health emergency in the United States, tightly interwoven with chronic disease, mental illness, and opioid misuse. Yet despite the existence of thousands of food banks and pantries, access remains fragmented: 1) current retrieval systems depend on static directories or generic search engines, which provide incomplete and geographically irrelevant results; 2) LLM-based chatbots offer only vague nutritional suggestions and fail to adapt to real-world constraints such as time, mobility, and transportation; and 3) existing food recommendation systems optimize for culinary diversity but overlook survival-critical needs of food-insecure populations, including immediate proximity, verified availability, and contextual barriers. These limitations risk leaving the most vulnerable individuals, those experiencing homelessness, addiction, or digital illiteracy, unable to access urgently needed resources. To address this, we introduce Food4All, the first multi-agent framework explicitly designed for real-time, context-aware free food retrieval. Food4All unifies three innovations: 1) heterogeneous data aggregation across official databases, community platforms, and social media to provide a continuously updated pool of food resources; 2) a lightweight reinforcement learning algorithm trained on curated cases to optimize for both geographic accessibility and nutritional correctness; and 3) an online feedback loop that dynamically adapts retrieval policies to evolving user needs. By bridging information acquisition, semantic analysis, and decision support, Food4All delivers nutritionally annotated and guidance at the point of need. This framework establishes an urgent step toward scalable, equitable, and intelligent systems that directly support populations facing food insecurity and its compounding health risks.
48. 【2510.18288】BrailleLLM: Braille Instruction Tuning with Large Language Models for Braille Domain Tasks
链接:https://arxiv.org/abs/2510.18288
作者:Tianyuan Huang,Zepeng Zhu,Hangdi Xing,Zirui Shao,Zhi Yu,Chaoxiong Yang,Jiaxian He,Xiaozhong Liu,Jiajun Bu
类目:Computation and Language (cs.CL)
关键词:visually impaired individuals, Chinese Braille Mixed, Braille, impaired individuals, Braille Mixed Datasets
备注: Accepted to EMNLP 2025
点击查看摘要
Abstract:Braille plays a vital role in education and information accessibility for visually impaired individuals. However, Braille information processing faces challenges such as data scarcity and ambiguities in mixed-text contexts. We construct English and Chinese Braille Mixed Datasets (EBMD/CBMD) with mathematical formulas to support diverse Braille domain research, and propose a syntax tree-based augmentation method tailored for Braille data. To address the underperformance of traditional fine-tuning methods in Braille-related tasks, we investigate Braille Knowledge-Based Fine-Tuning (BKFT), which reduces the learning difficulty of Braille contextual features. BrailleLLM employs BKFT via instruction tuning to achieve unified Braille translation, formula-to-Braille conversion, and mixed-text translation. Experiments demonstrate that BKFT achieves significant performance improvements over conventional fine-tuning in Braille translation scenarios. Our open-sourced datasets and methodologies establish a foundation for low-resource multilingual Braille research.
49. 【2510.18279】xt or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
链接:https://arxiv.org/abs/2510.18279
作者:Yanhong Li,Zixuan Lan,Jiawei Zhou
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, Large language, multimodal variants, process visual inputs, Large
备注: Accepted to EMNLP 2025 Findings. Previously titled "Text or Pixels? Evaluating Efficiency and Understanding of LLMs with Visual Text Inputs"
点击查看摘要
Abstract:Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and provide it directly to the model. This leads to dramatically reduced number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks RULER (long-context retrieval) and CNN/DailyMail (document summarization) we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.
50. 【2510.18257】DelvePO: Direction-Guided Self-Evolving Framework for Flexible Prompt Optimization
链接:https://arxiv.org/abs/2510.18257
作者:Tao Tao,Guanghui Zhu,Lang Guo,Hongyi Chen,Chunfeng Yuan,Yihua Huang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, steering Large Language, Large Language, Language Models, crucial approach due
备注:
点击查看摘要
Abstract:Prompt Optimization has emerged as a crucial approach due to its capabilities in steering Large Language Models to solve various tasks. However, current works mainly rely on the random rewriting ability of LLMs, and the optimization process generally focus on specific influencing factors, which makes it easy to fall into local optimum. Besides, the performance of the optimized prompt is often unstable, which limits its transferability in different tasks. To address the above challenges, we propose $\textbf{DelvePO}$ ($\textbf{D}$irection-Guid$\textbf{e}$d Se$\textbf{l}$f-E$\textbf{v}$olving Framework for Fl$\textbf{e}$xible $\textbf{P}$rompt $\textbf{O}$ptimization), a task-agnostic framework to optimize prompts in self-evolve manner. In our framework, we decouple prompts into different components that can be used to explore the impact that different factors may have on various tasks. On this basis, we introduce working memory, through which LLMs can alleviate the deficiencies caused by their own uncertainties and further obtain key insights to guide the generation of new prompts. Extensive experiments conducted on different tasks covering various domains for both open- and closed-source LLMs, including DeepSeek-R1-Distill-Llama-8B, Qwen2.5-7B-Instruct and GPT-4o-mini. Experimental results show that DelvePO consistently outperforms previous SOTA methods under identical experimental settings, demonstrating its effectiveness and transferability across different tasks.
51. 【2510.18214】VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
链接:https://arxiv.org/abs/2510.18214
作者:Shruti Palaskar,Leon Gatys,Mona Abdelrahman,Mar Jacobo,Larry Lindsey,Rutika Moharir,Gunnar Lund,Yang Xu,Navid Shiee,Jeffrey Bigham,Charles Maalouf,Joseph Yitan Cheng
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:language inputs separately, inputs separately, missing risks, Vision Language Safety, interpretation where benign
备注: 10 pages, 5 figures, 4 tables. Under review
点击查看摘要
Abstract:Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
52. 【2510.18201】MARCUS: An Event-Centric NLP Pipeline that generates Character Arcs from Narratives
链接:https://arxiv.org/abs/2510.18201
作者:Sriharsh Bhyravajjula,Ujwal Narayan,Manish Shrivastava
类目:Computation and Language (cs.CL)
关键词:understand character journeys, important theoretical devices, theoretical devices employed, literary genres, identify tropes
备注:
点击查看摘要
Abstract:Character arcs are important theoretical devices employed in literary studies to understand character journeys, identify tropes across literary genres, and establish similarities between narratives. This work addresses the novel task of computationally generating event-centric, relation-based character arcs from narratives. Providing a quantitative representation for arcs brings tangibility to a theoretical concept and paves the way for subsequent applications. We present MARCUS (Modelling Arcs for Understanding Stories), an NLP pipeline that extracts events, participant characters, implied emotion, and sentiment to model inter-character relations. MARCUS tracks and aggregates these relations across the narrative to generate character arcs as graphical plots. We generate character arcs from two extended fantasy series, Harry Potter and Lord of the Rings. We evaluate our approach before outlining existing challenges, suggesting applications of our pipeline, and discussing future work.
53. 【2510.18196】Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
链接:https://arxiv.org/abs/2510.18196
作者:Yoshinari Fujinuma
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, Language Models, LLM judge outputs, outcomes remains
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are commonly used as evaluators in various applications, but the reliability of the outcomes remains a challenge. One such challenge is using LLMs-as-judges for direct assessment, i.e., assigning scores from a specified range without any references. We first show that this challenge stems from LLM judge outputs being associated with score range bias, i.e., LLM judge outputs are highly sensitive to pre-defined score ranges, preventing the search for optimal score ranges. We also show that similar biases exist among models from the same family. We then mitigate this bias through contrastive decoding, achieving up to 11.3% relative improvement on average in Spearman correlation with human judgments across different score ranges.
54. 【2510.18173】CMT-Bench: Cricket Multi-Table Generation Benchmark for Probing Robustness in Large Language Models
链接:https://arxiv.org/abs/2510.18173
作者:Ritam Upadhyay,Naman Ahuja,Rishabh Baral,Aparna Garimella,Vivek Gupta
类目:Computation and Language (cs.CL)
关键词:summarise key information, iterative event extraction, temporal evolving narratives, systems often rely, code-parsable formats
备注:
点击查看摘要
Abstract:LLM Driven text-to-table (T2T) systems often rely on extensive prompt-engineering or iterative event extraction in code-parsable formats, which boosts scores but are computationally expensive and obscure how models actually reason over temporal evolving narratives to summarise key information. We present CMT-Bench, a diagnostic benchmark built from live cricket commentary that requires dynamic table generation across two evolving schemas under a dense, rule-governed policy. CMT-Bench is designed to probe robustness via three semantics-preserving dimensions: (i) extractive-cue ablation to separate extractive shortcuts from state tracking, (ii) temporal prefixing to test long-context stability, and (iii) entity-form perturbations (anonymization, outof-distribution substitutions, role-entangling paraphrases) to assess sensitivity to surface variation. Across diverse long-context stateof-the-art LLMs, we find large drops without extractive summaries, monotonic degradation with input length, and consistent accuracy drop under entity-form changes. Complementary distributional tests confirm significant shifts in numeric error patterns, indicating drift in reasoning rather than mere noise. Our results show that current LLMs are brittle in dynamic Textto-table generation, motivating robustness-first evaluation as a prerequisite for developing efficient and scalable approaches for this task.
55. 【2510.18165】Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
链接:https://arxiv.org/abs/2510.18165
作者:Yihong Dong,Zhaoyu Ma,Xue Jiang,Zhiyuan Fan,Jiaru Qian,Yongmin Li,Jianha Xiao,Zhi Jin,Rongyu Cao,Binhua Li,Fei Huang,Yongbin Li,Ge Li
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
关键词:Diffusion language models, Diffusion language, bidirectional context modeling, dominant autoregressive paradigm, code generation
备注:
点击查看摘要
Abstract:Diffusion language models (DLMs) are emerging as a powerful and promising alternative to the dominant autoregressive paradigm, offering inherent advantages in parallel generation and bidirectional context modeling. However, the performance of DLMs on code generation tasks, which have stronger structural constraints, is significantly hampered by the critical trade-off between inference speed and output quality. We observed that accelerating the code generation process by reducing the number of sampling steps usually leads to a catastrophic collapse in performance. In this paper, we introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber), a novel training-free sampling algorithm for DLMs to achieve better inference speed and output quality in code generation. Specifically, Saber is motivated by two key insights in the DLM generation process: 1) it can be adaptively accelerated as more of the code context is established; 2) it requires a backtracking mechanism to reverse the generated tokens. Extensive experiments on multiple mainstream code generation benchmarks show that Saber boosts Pass@1 accuracy by an average improvement of 1.9% over mainstream DLM sampling methods, meanwhile achieving an average 251.4% inference speedup. By leveraging the inherent advantages of DLMs, our work significantly narrows the performance gap with autoregressive models in code generation.
56. 【2510.18162】Automatic Prompt Generation via Adaptive Selection of Prompting Techniques
链接:https://arxiv.org/abs/2510.18162
作者:Yohei Ikenoue,Hitomi Tashiro,Shigeru Kuroyanagi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, design requires specialized, requires specialized knowledge, language models, prompting techniques
备注: 35 pages, 29 figures, 5 tables
点击查看摘要
Abstract:Prompt engineering is crucial for achieving reliable and effective outputs from large language models (LLMs), but its design requires specialized knowledge of prompting techniques and a deep understanding of target tasks. To address this challenge, we propose a novel method that adaptively selects task-appropriate prompting techniques based on users' abstract task descriptions and automatically generates high-quality prompts without relying on pre-existing templates or frameworks. The proposed method constructs a knowledge base that associates task clusters, characterized by semantic similarity across diverse tasks, with their corresponding prompting techniques. When users input task descriptions, the system assigns them to the most relevant task cluster and dynamically generates prompts by integrating techniques drawn from the knowledge base. An experimental evaluation of the proposed method on 23 tasks from BIG-Bench Extra Hard (BBEH) demonstrates superior performance compared with standard prompts and existing automatic prompt-generation tools, as measured by both arithmetic and harmonic mean scores. This research establishes a foundation for streamlining and standardizing prompt creation, enabling non-experts to effectively leverage LLMs.
57. 【2510.18148】Extracting Rule-based Descriptions of Attention Features in Transformers
链接:https://arxiv.org/abs/2510.18148
作者:Dan Friedman,Adithya Bhaskar,Alexander Wettig,Danqi Chen
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Mechanistic interpretability strives, Mechanistic interpretability, explain model behavior, bottom-up primitives, interpretability strives
备注: Our code is available at [this https URL](https://github.com/princeton-nlp/AttentionRules)
点击查看摘要
Abstract:Mechanistic interpretability strives to explain model behavior in terms of bottom-up primitives. The leading paradigm is to express hidden states as a sparse linear combination of basis vectors, called features. However, this only identifies which text sequences (exemplars) activate which features; the actual interpretation of features requires subjective inspection of these exemplars. This paper advocates for a different solution: rule-based descriptions that match token patterns in the input and correspondingly increase or decrease the likelihood of specific output tokens. Specifically, we extract rule-based descriptions of SAE features trained on the outputs of attention layers. While prior work treats the attention layers as an opaque box, we describe how it may naturally be expressed in terms of interactions between input and output features, of which we study three types: (1) skip-gram rules of the form "[Canadian city]... speaks -- English", (2) absence rules of the form "[Montreal]... speaks -/- English," and (3) counting rules that toggle only when the count of a word exceeds a certain value or the count of another word. Absence and counting rules are not readily discovered by inspection of exemplars, where manual and automatic descriptions often identify misleading or incomplete explanations. We then describe a simple approach to extract these types of rules automatically from a transformer, and apply it to GPT-2 small. We find that a majority of features may be described well with around 100 skip-gram rules, though absence rules are abundant even as early as the first layer (in over a fourth of features). We also isolate a few examples of counting rules. This paper lays the groundwork for future research into rule-based descriptions of features by defining them, showing how they may be extracted, and providing a preliminary taxonomy of some of the behaviors they represent.
58. 【2510.18147】LLMs Encode How Difficult Problems Are
链接:https://arxiv.org/abs/2510.18147
作者:William Lugoloobi,Chris Russell
类目:Computation and Language (cs.CL)
关键词:Large language models, solve complex problems, Large language, puzzling inconsistency, solve complex
备注:
点击查看摘要
Abstract:Large language models exhibit a puzzling inconsistency: they solve complex problems yet frequently fail on seemingly simpler ones. We investigate whether LLMs internally encode problem difficulty in a way that aligns with human judgment, and whether this representation tracks generalization during reinforcement learning post-training. We train linear probes across layers and token positions on 60 models, evaluating on mathematical and coding subsets of Easy2HardBench. We find that human-labeled difficulty is strongly linearly decodable (AMC: $\rho \approx 0.88$) and exhibits clear model-size scaling, whereas LLM-derived difficulty is substantially weaker and scales poorly. Steering along the difficulty direction reveals that pushing models toward "easier" representations reduces hallucination and improves accuracy. During GRPO training on Qwen2.5-Math-1.5B, the human-difficulty probe strengthens and positively correlates with test accuracy across training steps, while the LLM-difficulty probe degrades and negatively correlates with performance. These results suggest that human annotations provide a stable difficulty signal that RL amplifies, while automated difficulty estimates derived from model performance become misaligned precisely as models improve. We release probe code and evaluation scripts to facilitate replication.
59. 【2510.18123】SafeCoop: Unravelling Full Stack Safety in Agentic Collaborative Driving
链接:https://arxiv.org/abs/2510.18123
作者:Xiangbo Gao,Tzu-Hsiang Lin,Ruojing Song,Yuheng Wu,Kuan-Ru Huang,Zicheng Jin,Fangzhou Lin,Shinan Liu,Zhengzhong Tu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
关键词:driving systems leverage, multiple agents, agents to enhance, enhance driving safety, Collaborative driving systems
备注:
点击查看摘要
Abstract:Collaborative driving systems leverage vehicle-to-everything (V2X) communication across multiple agents to enhance driving safety and efficiency. Traditional V2X systems take raw sensor data, neural features, or perception results as communication media, which face persistent challenges, including high bandwidth demands, semantic loss, and interoperability issues. Recent advances investigate natural language as a promising medium, which can provide semantic richness, decision-level reasoning, and human-machine interoperability at significantly lower bandwidth. Despite great promise, this paradigm shift also introduces new vulnerabilities within language communication, including message loss, hallucinations, semantic manipulation, and adversarial attacks. In this work, we present the first systematic study of full-stack safety and security issues in natural-language-based collaborative driving. Specifically, we develop a comprehensive taxonomy of attack strategies, including connection disruption, relay/replay interference, content spoofing, and multi-connection forgery. To mitigate these risks, we introduce an agentic defense pipeline, which we call SafeCoop, that integrates a semantic firewall, language-perception consistency checks, and multi-source consensus, enabled by an agentic transformation function for cross-frame spatial alignment. We systematically evaluate SafeCoop in closed-loop CARLA simulation across 32 critical scenarios, achieving 69.15% driving score improvement under malicious attacks and up to 67.32% F1 score for malicious detection. This study provides guidance for advancing research on safe, secure, and trustworthy language-driven collaboration in transportation systems. Our project page is this https URL.
60. 【2510.18112】Does Reasoning Help LLM Agents Play Dungeons and Dragons? A Prompt Engineering Experiment
链接:https://arxiv.org/abs/2510.18112
作者:Patricia Delafuente,Arya Honraopatil,Lara J. Martin
类目:Computation and Language (cs.CL)
关键词:predict Dungeons Dragons, Avrae Discord bot, Large Language Models, Discord bot commands, Dungeons Dragons
备注: Published at the Wordplay: When Language Meets Games Workshop (EMNLP 2025)
点击查看摘要
Abstract:This paper explores the application of Large Language Models (LLMs) and reasoning to predict Dungeons Dragons (DnD) player actions and format them as Avrae Discord bot commands. Using the FIREBALL dataset, we evaluated a reasoning model, DeepSeek-R1-Distill-LLaMA-8B, and an instruct model, LLaMA-3.1-8B-Instruct, for command generation. Our findings highlight the importance of providing specific instructions to models, that even single sentence changes in prompts can greatly affect the output of models, and that instruct models are sufficient for this task compared to reasoning models.
61. 【2510.18108】Na Prática, qual IA Entende o Direito? Um Estudo Experimental com IAs Generalistas e uma IA Jurídica
链接:https://arxiv.org/abs/2510.18108
作者:Marina Soares Marinho,Daniela Vianna,Livy Real,Altigran da Silva,Gabriela Migliorini
类目:Computation and Language (cs.CL)
关键词:Jusbrasil Study, AIs in Law, presents the Jusbrasil, combining legal theory, systematic coherence
备注: 22 pages, in Portuguese language
点击查看摘要
Abstract:This study presents the Jusbrasil Study on the Use of General-Purpose AIs in Law, proposing an experimental evaluation protocol combining legal theory, such as material correctness, systematic coherence, and argumentative integrity, with empirical assessment by 48 legal professionals. Four systems (JusIA, ChatGPT Free, ChatGPT Pro, and Gemini) were tested in tasks simulating lawyers' daily work. JusIA, a domain-specialized model, consistently outperformed the general-purpose systems, showing that both domain specialization and a theoretically grounded evaluation are essential for reliable legal AI outputs.
62. 【2510.18095】SMaRT: Select, Mix, and ReinvenT -- A Strategy Fusion Framework for LLM-Driven Reasoning and Planning
链接:https://arxiv.org/abs/2510.18095
作者:Nikhil Verma,Manasa Bharadwaj,Wonjun Jang,Harmanpreet Singh,Yixiao Wang,Homa Fashandi,Chul Lee
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, exceptional generalization capabilities, redefined complex task
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have redefined complex task automation with exceptional generalization capabilities. Despite these advancements, state-of-the-art methods rely on single-strategy prompting, missing the synergy of diverse reasoning approaches. No single strategy excels universally, highlighting the need for frameworks that fuse strategies to maximize performance and ensure robustness. We introduce the Select, Mix, and ReinvenT (SMaRT) framework, an innovative strategy fusion approach designed to overcome this constraint by creating balanced and efficient solutions through the seamless integration of diverse reasoning strategies. Unlike existing methods, which employ LLMs merely as evaluators, SMaRT uses them as intelligent integrators, unlocking the "best of all worlds" across tasks. Extensive empirical evaluations across benchmarks in reasoning, planning, and sequential decision-making highlight the robustness and adaptability of SMaRT. The framework consistently outperforms state-of-the-art baselines in solution quality, constraint adherence, and performance metrics. This work redefines LLM-driven decision-making by pioneering a new paradigm in cross-strategy calibration, unlocking superior outcomes for reasoning systems and advancing the boundaries of self-refining methodologies.
63. 【2510.18077】Chain-of-Thought Reasoning Improves Context-Aware Translation with Large Language Models
链接:https://arxiv.org/abs/2510.18077
作者:Shabnam Ataee,Andrei Popescu-Belis
类目:Computation and Language (cs.CL)
关键词:include inter-sentential dependencies, large language models, inter-sentential dependencies, paper assesses, assesses the capacity
备注:
点击查看摘要
Abstract:This paper assesses the capacity of large language models (LLMs) to translate texts that include inter-sentential dependencies. We use the English-French DiscEvalMT benchmark (Bawden et al., 2018) with pairs of sentences containing translation challenges either for pronominal anaphora or for lexical cohesion. We evaluate 12 LLMs from the DeepSeek-R1, GPT, Llama, Mistral and Phi families on two tasks: (1) distinguishing a correct translation from a wrong but plausible one; (2) generating a correct translation. We compare prompts that encourage chain-of-thought reasoning with those that do not. The best models take advantage of reasoning and reach about 90% accuracy on the first task, and COMET scores of about 92% on the second task, with GPT-4, GPT-4o and Phi standing out. Moreover, we observe a "wise get wiser" effect: the improvements through reasoning are positively correlated with the scores of the models without reasoning.
64. 【2510.18054】HouseTour: A Virtual Real Estate A(I)gent
链接:https://arxiv.org/abs/2510.18054
作者:Ata Çelen,Marc Pollefeys,Daniel Barath,Iro Armeni
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:natural language summary, language summary generation, natural language, language summary, collection of images
备注: Published on ICCV 2025
点击查看摘要
Abstract:We introduce HouseTour, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. We synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. To support this task, we present the HouseTour dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. We evaluate both individual and end-to-end performance, introducing a new joint metric. Our work enables automated, professional-quality video creation for real estate and touristic applications without requiring specialized expertise or equipment.
65. 【2510.18046】Language Models as Semantic Augmenters for Sequential Recommenders
链接:https://arxiv.org/abs/2510.18046
作者:Mahsa Valizadeh,Xiangjue Dong,Rui Tuo,James Caverlee
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, excel at capturing, diverse modalities, Large
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) excel at capturing latent semantics and contextual relationships across diverse modalities. However, in modeling user behavior from sequential interaction data, performance often suffers when such semantic context is limited or absent. We introduce LaMAR, a LLM-driven semantic enrichment framework designed to enrich such sequences automatically. LaMAR leverages LLMs in a few-shot setting to generate auxiliary contextual signals by inferring latent semantic aspects of a user's intent and item relationships from existing metadata. These generated signals, such as inferred usage scenarios, item intents, or thematic summaries, augment the original sequences with greater contextual depth. We demonstrate the utility of this generated resource by integrating it into benchmark sequential modeling tasks, where it consistently improves performance. Further analysis shows that LLM-generated signals exhibit high semantic novelty and diversity, enhancing the representational capacity of the downstream models. This work represents a new data-centric paradigm where LLMs serve as intelligent context generators, contributing a new method for the semi-automatic creation of training data and language resources.
66. 【2510.18040】Subject-Event Ontology Without Global Time: Foundations and Execution Semantics
链接:https://arxiv.org/abs/2510.18040
作者:Alexander Boldachev
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
关键词:modeling complex dynamic, complex dynamic systems, proposed for modeling, modeling complex, complex dynamic
备注: 32 pages
点击查看摘要
Abstract:A formalization of a subject-event ontology is proposed for modeling complex dynamic systems without reliance on global time. Key principles: (1) event as an act of fixation - a subject discerns and fixes changes according to models (conceptual templates) available to them; (2) causal order via happens-before - the order of events is defined by explicit dependencies, not timestamps; (3) making the ontology executable via a declarative dataflow mechanism, ensuring determinism; (4) models as epistemic filters - a subject can only fix what falls under its known concepts and properties; (5) presumption of truth - the declarative content of an event is available for computation from the moment of fixation, without external verification. The formalization includes nine axioms (A1-A9), ensuring the correctness of executable ontologies: monotonicity of history (I1), acyclicity of causality (I2), traceability (I3). Special attention is given to the model-based approach (A9): event validation via schemas, actor authorization, automatic construction of causal chains (W3) without global time. Practical applicability is demonstrated on the boldsea system - a workflow engine for executable ontologies, where the theoretical constructs are implemented in BSL (Boldsea Semantic Language). The formalization is applicable to distributed systems, microservice architectures, DLT platforms, and multiperspectivity scenarios (conflicting facts from different subjects).
67. 【2510.18030】From Local to Global: Revisiting Structured Pruning Paradigms for Large Language Models
链接:https://arxiv.org/abs/2510.18030
作者:Ziyan Wang,Enmao Diao,Qi Le,Pu Wang,Minwoo Lee,Shu-ping Yeh,Evgeny Stupachenko,Hao Feng,Li Yang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:large language models, deploying large language, hardware-friendly architectures, yields compact, Iterative Structured Pruning-a
备注: 16 pages, 4 figures
点击查看摘要
Abstract:Structured pruning is a practical approach to deploying large language models (LLMs) efficiently, as it yields compact, hardware-friendly architectures. However, the dominant local paradigm is task-agnostic: by optimizing layer-wise reconstruction rather than task objectives, it tends to preserve perplexity or generic zero-shot behavior but fails to capitalize on modest task-specific calibration signals, often yielding limited downstream gains. We revisit global structured pruning and present GISP-Global Iterative Structured Pruning-a post-training method that removes attention heads and MLP channels using first-order, loss-based important weights aggregated at the structure level with block-wise normalization. An iterative schedule, rather than one-shot pruning, stabilizes accuracy at higher sparsity and mitigates perplexity collapse without requiring intermediate fine-tuning; the pruning trajectory also forms nested subnetworks that support a "prune-once, deploy-many" workflow. Furthermore, because importance is defined by a model-level loss, GISP naturally supports task-specific objectives; we instantiate perplexity for language modeling and a margin-based objective for decision-style tasks. Extensive experiments show that across Llama2-7B/13B, Llama3-8B, and Mistral-0.3-7B, GISP consistently lowers WikiText-2 perplexity and improves downstream accuracy, with especially strong gains at 40-50% sparsity; on DeepSeek-R1-Distill-Llama-3-8B with GSM8K, task-aligned calibration substantially boosts exact-match accuracy.
68. 【2510.18019】Is Multilingual LLM Watermarking Truly Multilingual? A Simple Back-Translation Solution
链接:https://arxiv.org/abs/2510.18019
作者:Asim Mohamed,Martin Gubri
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language model, make large language, Multilingual watermarking aims, outputs traceable, fall short
备注:
点击查看摘要
Abstract:Multilingual watermarking aims to make large language model (LLM) outputs traceable across languages, yet current methods still fall short. Despite claims of cross-lingual robustness, they are evaluated only on high-resource languages. We show that existing multilingual watermarking methods are not truly multilingual: they fail to remain robust under translation attacks in medium- and low-resource languages. We trace this failure to semantic clustering, which fails when the tokenizer vocabulary contains too few full-word tokens for a given language. To address this, we introduce STEAM, a back-translation-based detection method that restores watermark strength lost through translation. STEAM is compatible with any watermarking method, robust across different tokenizers and languages, non-invasive, and easily extendable to new languages. With average gains of +0.19 AUC and +40%p TPR@1% on 17 languages, STEAM provides a simple and robust path toward fairer watermarking across diverse languages.
69. 【2510.17998】SimBA: Simplifying Benchmark Analysis Using Performance Matrices Alone
链接:https://arxiv.org/abs/2510.17998
作者:Nishant Subramani,Alfredo Gomez,Mona Diab
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Modern language models, Modern language, Simplify Benchmark Analysis, evaluated on large, difficult to make
备注: EMNLP 2025 Findings
点击查看摘要
Abstract:Modern language models are evaluated on large benchmarks, which are difficult to make sense of, especially for model selection. Looking at the raw evaluation numbers themselves using a model-centric lens, we propose SimBA, a three phase framework to Simplify Benchmark Analysis. The three phases of SimBA are: stalk, where we conduct dataset model comparisons, prowl, where we discover a representative subset, and pounce, where we use the representative subset to predict performance on a held-out set of models. Applying SimBA to three popular LM benchmarks: HELM, MMLU, and BigBenchLite reveals that across all three benchmarks, datasets and models relate strongly to one another (stalk). We develop an representative set discovery algorithm which covers a benchmark using raw evaluation scores alone. Using our algorithm, we find that with 6.25% (1/16), 1.7% (1/58), and 28.4% (21/74) of the datasets for HELM, MMLU, and BigBenchLite respectively, we achieve coverage levels of at least 95% (prowl). Additionally, using just these representative subsets, we can both preserve model ranks and predict performance on a held-out set of models with near zero mean-squared error (pounce). Taken together, SimBA can help model developers improve efficiency during model training and dataset creators validate whether their newly created dataset differs from existing datasets in a benchmark. Our code is open source, available at this https URL.
70. 【2510.17947】PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits
链接:https://arxiv.org/abs/2510.17947
作者:Neeladri Bhuiya,Madhav Aggarwal,Diptanshu Purwar
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:Large Language Models, Large Language, Language Models, Large, multi-turn
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they remain increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency and effectiveness continue to remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models in a lesser or comparable query budget. Particularly, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Claude's Opus 4.1, two models that are considered highly resistant to jailbreaks in safety literature. Our work offers tools and insights to understand the importance of plan initialization, context optimization and lifelong learning in crafting multi-turn attacks for a comprehensive model vulnerability evaluation.
71. 【2510.17941】Believe It or Not: How Deeply do LLMs Believe Implanted Facts?
链接:https://arxiv.org/abs/2510.17941
作者:Stewart Slocum,Julian Minder,Clément Dumas,Henry Sleight,Ryan Greenblatt,Samuel Marks,Rowan Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, editing techniques promise, Knowledge, Knowledge editing techniques, large language
备注:
点击查看摘要
Abstract:Knowledge editing techniques promise to implant new factual knowledge into large language models (LLMs). But do LLMs really believe these facts? We develop a framework to measure belief depth and use it to evaluate the success of knowledge editing techniques. We operationalize belief depth as the extent to which implanted knowledge 1) generalizes to related contexts (e.g. Fermi estimates several logical steps removed), 2) is robust to self-scrutiny and direct challenge, and 3) is represented similarly to genuine knowledge (as measured by linear probes). Our evaluations show that simple prompting and mechanistic editing techniques fail to implant knowledge deeply. In contrast, Synthetic Document Finetuning (SDF) - where models are trained on LLM-generated documents consistent with a fact - often succeeds at implanting beliefs that behave similarly to genuine knowledge. However, SDF's success is not universal, as implanted beliefs that contradict basic world knowledge are brittle and representationally distinct from genuine knowledge. Overall, our work introduces measurable criteria for belief depth and enables the rigorous evaluation necessary for deploying knowledge editing in real-world applications.
72. 【2510.17934】AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM
链接:https://arxiv.org/abs/2510.17934
作者:Haoyu Huang,Hong Ting Tsang,Jiaxin Bai,Xi Peng,Gong Zhang,Yangqiu Song
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Retrieval-augmented generation, large language models, augmenting large language, language models, RAG methods heavily
备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) has shown some success in augmenting large language models (LLMs) with external knowledge. However, as a non-parametric knowledge integration paradigm for LLMs, RAG methods heavily rely on external retrieval modules and the retrieved textual context prior. Especially for very large scale knowledge augmentation, they would introduce substantial inference latency due to expensive searches and much longer relevant context. In this paper, we propose a parametric knowledge integration method, called \textbf{AtlasKV}, a scalable, effective, and general way to augment LLMs with billion-scale knowledge graphs (KGs) (e.g. 1B triples) using very little GPU memory cost (e.g. less than 20GB VRAM). In AtlasKV, we introduce KG2KV and HiKVP to integrate KG triples into LLMs at scale with sub-linear time and memory complexity. It maintains strong knowledge grounding and generalization performance using the LLMs' inherent attention mechanism, and requires no external retrievers, long context priors, or retraining when adapting to new knowledge.
73. 【2510.17930】Diagnosing Representation Dynamics in NER Model Extension
链接:https://arxiv.org/abs/2510.17930
作者:Xirui Zhang,Philippe de La Chevasnerie,Benoit Fabre(papernest)
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Extending Named Entity, Named Entity Recognition, Extending Named, noisy spoken-language data, Entity Recognition
备注:
点击查看摘要
Abstract:Extending Named Entity Recognition (NER) models to new PII entities in noisy spoken-language data is a common need. We find that jointly fine-tuning a BERT model on standard semantic entities (PER, LOC, ORG) and new pattern-based PII (EMAIL, PHONE) results in minimal degradation for original classes. We investigate this "peaceful coexistence," hypothesizing that the model uses independent semantic vs. morphological feature mechanisms. Using an incremental learning setup as a diagnostic tool, we measure semantic drift and find two key insights. First, the LOC (location) entity is uniquely vulnerable due to a representation overlap with new PII, as it shares pattern-like features (e.g., postal codes). Second, we identify a "reverse O-tag representation drift." The model, initially trained to map PII patterns to 'O', blocks new learning. This is resolved only by unfreezing the 'O' tag's classifier, allowing the background class to adapt and "release" these patterns. This work provides a mechanistic diagnosis of NER model adaptation, highlighting feature independence, representation overlap, and 'O' tag plasticity.
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2510.17930 [cs.CL]
(or
arXiv:2510.17930v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2510.17930
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
74. 【2510.17924】Efficient Toxicity Detection in Gaming Chats: A Comparative Study of Embeddings, Fine-Tuned Transformers and LLMs
链接:https://arxiv.org/abs/2510.17924
作者:Yehor Tereshchenko,Mika Hämäläinen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Natural Language Processing, comprehensive comparative analysis, analysis of Natural, Natural Language, paper presents
备注: Published in the Journal of Data Mining Digital Humanities (JDMDH), special issue NLP4DH
点击查看摘要
Abstract:This paper presents a comprehensive comparative analysis of Natural Language Processing (NLP) methods for automated toxicity detection in online gaming chats. Traditional machine learning models with embeddings, large language models (LLMs) with zero-shot and few-shot prompting, fine-tuned transformer models, and retrieval-augmented generation (RAG) approaches are evaluated. The evaluation framework assesses three critical dimensions: classification accuracy, processing speed, and computational costs. A hybrid moderation system architecture is proposed that optimizes human moderator workload through automated detection and incorporates continuous learning mechanisms. The experimental results demonstrate significant performance variations across methods, with fine-tuned DistilBERT achieving optimal accuracy-cost trade-offs. The findings provide empirical evidence for deploying cost-effective, efficient content moderation systems in dynamic online gaming environments.
75. 【2510.17922】Select-Then-Decompose: From Empirical Analysis to Adaptive Selection Strategy for Task Decomposition in Large Language Models
链接:https://arxiv.org/abs/2510.17922
作者:Shuodi Liu,Yingzhuo Liu,Zi Wang,Yusheng Wang,Huijia Wu,Liuyu Xiang,Zhaofeng He
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, driving extensive research, demonstrated remarkable reasoning, Large language, task decomposition
备注: Accepted to the Main Conference of EMNLP 2025 (Oral)
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable reasoning and planning capabilities, driving extensive research into task decomposition. Existing task decomposition methods focus primarily on memory, tool usage, and feedback mechanisms, achieving notable success in specific domains, but they often overlook the trade-off between performance and cost. In this study, we first conduct a comprehensive investigation on task decomposition, identifying six categorization schemes. Then, we perform an empirical analysis of three factors that influence the performance and cost of task decomposition: categories of approaches, characteristics of tasks, and configuration of decomposition and execution models, uncovering three critical insights and summarizing a set of practical principles. Building on this analysis, we propose the Select-Then-Decompose strategy, which establishes a closed-loop problem-solving process composed of three stages: selection, execution, and verification. This strategy dynamically selects the most suitable decomposition approach based on task characteristics and enhances the reliability of the results through a verification module. Comprehensive evaluations across multiple benchmarks show that the Select-Then-Decompose consistently lies on the Pareto frontier, demonstrating an optimal balance between performance and cost. Our code is publicly available at this https URL.
76. 【2510.17921】CLAWS:Creativity detection for LLM-generated solutions using Attention Window of Sections
链接:https://arxiv.org/abs/2510.17921
作者:Keuntae Kim,Eunhye Jeong,Sehyeon Lee,Seohee Yoon,Yong Suk Choi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Recent advances, large language models, remarkably successful, advances in enhancing, ability of large
备注: NeurIPS 2025
点击查看摘要
Abstract:Recent advances in enhancing the reasoning ability of large language models (LLMs) have been remarkably successful. LLMs trained with reinforcement learning (RL) for reasoning demonstrate strong performance in challenging tasks such as mathematics and coding, even with relatively small model sizes. However, despite these improvements in task accuracy, the assessment of creativity in LLM generations has been largely overlooked in reasoning tasks, in contrast to writing tasks. The lack of research on creativity assessment in reasoning primarily stems from two challenges: (1) the difficulty of defining the range of creativity, and (2) the necessity of human evaluation in the assessment process. To address these challenges, we propose CLAWS, a method that defines and classifies mathematical solutions into typical, creative, and hallucinated categories without human evaluation, by leveraging attention weights across prompt sections and output. CLAWS outperforms five existing white-box detection methods (Perplexity, Logit Entropy, Window Entropy, Hidden Score, and Attention Score) on five 7-8B math RL models (DeepSeek, Qwen, Mathstral, OpenMath2, and Oreal). We validate CLAWS on 4545 math problems collected from 181 math contests (AJHSME, AMC, AIME).
77. 【2510.17918】JT-Safe: Intrinsically Enhancing the Safety and Trustworthiness of LLMs
链接:https://arxiv.org/abs/2510.17918
作者:Junlan Feng,Fanyu Meng,Chong Long,Pengyu Cong,Duqing Wang,Yan Zheng,Yuyao Zhang,Xuanchang Gao,Ye Yuan,Yunfei Ma,Zhijie Ren,Fan Yang,Na Wu,Di Jin,Chao Deng
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:data, pre-training data, large language models, collectively addressing, credibility concerns
备注:
点击查看摘要
Abstract:The hallucination and credibility concerns of large language models (LLMs) are global challenges that the industry is collectively addressing. Recently, a significant amount of advances have been made on post-training and inference techniques to mitigate these challenges. However, it is widely agreed that unsafe and hallucinations of LLMs intrinsically originate from pre-training, involving pre-training data and the next-token prediction learning mechanism. In this paper, we focus on enhancing pre-training data to improve the trustworthiness and safety of LLMs. Since the data is vast, it's almost impossible to entirely purge the data of factual errors, logical inconsistencies, or distributional biases. Moreover, the pre-training data lack grounding in real-world knowledge. Each piece of data is treated as a sequence of tokens rather than as a representation of a part of the world. To overcome these issues, we propose approaches to enhancing our pre-training data with its context in the world and increasing a substantial amount of data reflecting industrial scenarios. We argue that most source data are created by the authors for specific purposes in a certain spatial-temporal context. They have played a role in the real world. By incorporating related world context information, we aim to better anchor pre-training data within real-world scenarios, thereby reducing uncertainty in model training and enhancing the model's safety and trustworthiness. We refer to our Data with World Context as DWC. We continue pre-training an earlier checkpoint of JT-35B-Base with 1.5 trillion of DWC tokens. We introduce our post-training procedures to activate the potentials of DWC. Compared with the Qwen model of a similar scale, JT-Safe-35B achieves an average performance improvement of 1.79% on the Safety and Trustworthy evaluation benchmarks, while being pretrained with only 6.2 trillion tokens.
78. 【2510.17910】Interpretability Framework for LLMs in Undergraduate Calculus
链接:https://arxiv.org/abs/2510.17910
作者:Sagnik Dakshit,Sushmita Sinha Roy
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, capture the quality, multistep logic
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly being used in education, yet their correctness alone does not capture the quality, reliability, or pedagogical validity of their problem-solving behavior, especially in mathematics, where multistep logic, symbolic reasoning, and conceptual clarity are critical. Conventional evaluation methods largely focus on final answer accuracy and overlook the reasoning process. To address this gap, we introduce a novel interpretability framework for analyzing LLM-generated solutions using undergraduate calculus problems as a representative domain. Our approach combines reasoning flow extraction and decomposing solutions into semantically labeled operations and concepts with prompt ablation analysis to assess input salience and output stability. Using structured metrics such as reasoning complexity, phrase sensitivity, and robustness, we evaluated the model behavior on real Calculus I to III university exams. Our findings revealed that LLMs often produce syntactically fluent yet conceptually flawed solutions, with reasoning patterns sensitive to prompt phrasing and input variation. This framework enables fine-grained diagnosis of reasoning failures, supports curriculum alignment, and informs the design of interpretable AI-assisted feedback tools. This is the first study to offer a structured, quantitative, and pedagogically grounded framework for interpreting LLM reasoning in mathematics education, laying the foundation for the transparent and responsible deployment of AI in STEM learning environments.
79. 【2510.17909】Atomic Literary Styling: Mechanistic Manipulation of Prose Generation in Neural Language Models
链接:https://arxiv.org/abs/2510.17909
作者:Tsogt-Ochir Enkhbayar
类目:Computation and Language (cs.CL)
关键词:Herman Melville Bartleby, identifying individual neurons, rigid AI-generated text, identifying individual, discriminate between exemplary
备注: 12 pages, 3 figures, 4 tables
点击查看摘要
Abstract:We present a mechanistic analysis of literary style in GPT-2, identifying individual neurons that discriminate between exemplary prose and rigid AI-generated text. Using Herman Melville's Bartleby, the Scrivener as a corpus, we extract activation patterns from 355 million parameters across 32,768 neurons in late layers. We find 27,122 statistically significant discriminative neurons ($p 0.05$), with effect sizes up to $|d| = 1.4$. Through systematic ablation studies, we discover a paradoxical result: while these neurons correlate with literary text during analysis, removing them often improves rather than degrades generated prose quality. Specifically, ablating 50 high-discriminating neurons yields a 25.7% improvement in literary style metrics. This demonstrates a critical gap between observational correlation and causal necessity in neural networks. Our findings challenge the assumption that neurons which activate on desirable inputs will produce those outputs during generation, with implications for mechanistic interpretability research and AI alignment.
80. 【2510.17904】BreakFun: Jailbreaking LLMs via Schema Exploitation
链接:https://arxiv.org/abs/2510.17904
作者:Amirkia Rafiei Oskooei,Mehmet S. Aktas
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, proficiency of Large, processing structured data, Language Models
备注:
点击查看摘要
Abstract:The proficiency of Large Language Models (LLMs) in processing structured data and adhering to syntactic rules is a capability that drives their widespread adoption but also makes them paradoxically vulnerable. In this paper, we investigate this vulnerability through BreakFun, a jailbreak methodology that weaponizes an LLM's adherence to structured schemas. BreakFun employs a three-part prompt that combines an innocent framing and a Chain-of-Thought distraction with a core "Trojan Schema"--a carefully crafted data structure that compels the model to generate harmful content, exploiting the LLM's strong tendency to follow structures and schemas. We demonstrate this vulnerability is highly transferable, achieving an average success rate of 89% across 13 foundational and proprietary models on JailbreakBench, and reaching a 100% Attack Success Rate (ASR) on several prominent models. A rigorous ablation study confirms this Trojan Schema is the attack's primary causal factor. To counter this, we introduce the Adversarial Prompt Deconstruction guardrail, a defense that utilizes a secondary LLM to perform a "Literal Transcription"--extracting all human-readable text to isolate and reveal the user's true harmful intent. Our proof-of-concept guardrail demonstrates high efficacy against the attack, validating that targeting the deceptive schema is a viable mitigation strategy. Our work provides a look into how an LLM's core strengths can be turned into critical weaknesses, offering a fresh perspective for building more robustly aligned models.
81. 【2510.17900】Are LLMs Court-Ready? Evaluating Frontier Models on Indian Legal Reasoning
链接:https://arxiv.org/abs/2510.17900
作者:Kush Juvekar,Arghya Bhattacharya,Sai Khadloya,Utkarsh Saxena
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large language models, entering legal workflows, Large language, India public legal, language models
备注:
点击查看摘要
Abstract:Large language models (LLMs) are entering legal workflows, yet we lack a jurisdiction-specific framework to assess their baseline competence therein. We use India's public legal examinations as a transparent proxy. Our multi-year benchmark assembles objective screens from top national and state exams and evaluates open and frontier LLMs under real-world exam conditions. To probe beyond multiple-choice questions, we also include a lawyer-graded, paired-blinded study of long-form answers from the Supreme Court's Advocate-on-Record exam. This is, to our knowledge, the first exam-grounded, India-specific yardstick for LLM court-readiness released with datasets and protocols. Our work shows that while frontier systems consistently clear historical cutoffs and often match or exceed recent top-scorer bands on objective exams, none surpasses the human topper on long-form reasoning. Grader notes converge on three reliability failure modes: procedural or format compliance, authority or citation discipline, and forum-appropriate voice and structure. These findings delineate where LLMs can assist (checks, cross-statute consistency, statute and precedent lookups) and where human leadership remains essential: forum-specific drafting and filing, procedural and relief strategy, reconciling authorities and exceptions, and ethical, accountable judgment.
82. 【2510.17895】Hierarchical Federated Unlearning for Large Language Models
链接:https://arxiv.org/abs/2510.17895
作者:Yisheng Zhong,Zhengbang Yang,Zhuangdi Zhu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, remove undesirable knowledge, Large Language, Language Models, real-world applications
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly integrated into real-world applications, raising concerns about privacy, security and the need to remove undesirable knowledge. Machine Unlearning has emerged as a promising solution, yet faces two key challenges: (1) practical unlearning needs are often continuous and heterogeneous, and (2) they involve decentralized, sensitive data with asymmetric access. These factors result in inter-domain and intra-domain interference, which further amplifies the dilemma of unbalanced forgetting and retaining performance. In response, we propose a federated unlearning approach for LLMs that is scalable and privacy preserving. Our method decouples unlearning and retention via task-specific adapter learning and employs a hierarchical merging strategy to mitigate conflicting objectives and enables robust, adaptable unlearning updates. Comprehensive experiments on benchmarks of WMDP, MUSE, and TOFU showed that our approach effectively handles heterogeneous unlearning requests while maintaining strong LLM utility compared with baseline methods.
83. 【2510.17892】Advances in Pre-trained Language Models for Domain-Specific Text Classification: A Systematic Review
链接:https://arxiv.org/abs/2510.17892
作者:Zhyar Rzgar K. Rostam,Gábor Kertész
类目:Computation and Language (cs.CL)
关键词:online information necessitates, information necessitates efficient, necessitates efficient methods, domain-specific text classification, text classification
备注: 41 pages, 10 figures, 13 tables
点击查看摘要
Abstract:The exponential increase in scientific literature and online information necessitates efficient methods for extracting knowledge from textual data. Natural language processing (NLP) plays a crucial role in addressing this challenge, particularly in text classification tasks. While large language models (LLMs) have achieved remarkable success in NLP, their accuracy can suffer in domain-specific contexts due to specialized vocabulary, unique grammatical structures, and imbalanced data distributions. In this systematic literature review (SLR), we investigate the utilization of pre-trained language models (PLMs) for domain-specific text classification. We systematically review 41 articles published between 2018 and January 2024, adhering to the PRISMA statement (preferred reporting items for systematic reviews and meta-analyses). This review methodology involved rigorous inclusion criteria and a multi-step selection process employing AI-powered tools. We delve into the evolution of text classification techniques and differentiate between traditional and modern approaches. We emphasize transformer-based models and explore the challenges and considerations associated with using LLMs for domain-specific text classification. Furthermore, we categorize existing research based on various PLMs and propose a taxonomy of techniques used in the field. To validate our findings, we conducted a comparative experiment involving BERT, SciBERT, and BioBERT in biomedical sentence classification. Finally, we present a comparative study on the performance of LLMs in text classification tasks across different domains. In addition, we examine recent advancements in PLMs for domain-specific text classification and offer insights into future directions and limitations in this rapidly evolving domain.
84. 【2510.17885】Metrics and evaluations for computational and sustainable AI efficiency
链接:https://arxiv.org/abs/2510.17885
作者:Hongyuan Liu,Xinyang Liu,Guosheng Hu
类目:Performance (cs.PF); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:created unprecedented demands, Artificial Intelligence, models remain fragmented, deployed models remain, advancement of Artificial
备注: 11 pages, 2 tables
点击查看摘要
Abstract:The rapid advancement of Artificial Intelligence (AI) has created unprecedented demands for computational power, yet methods for evaluating the performance, efficiency, and environmental impact of deployed models remain fragmented. Current approaches often fail to provide a holistic view, making it difficult to compare and optimise systems across heterogeneous hardware, software stacks, and numeric precisions. To address this gap, we propose a unified and reproducible methodology for AI model inference that integrates computational and environmental metrics under realistic serving conditions. Our framework provides a pragmatic, carbon-aware evaluation by systematically measuring latency and throughput distributions, energy consumption, and location-adjusted carbon emissions, all while maintaining matched accuracy constraints for valid comparisons. We apply this methodology to multi-precision models across diverse hardware platforms, from data-centre accelerators like the GH200 to consumer-level GPUs such as the RTX 4090, running on mainstream software stacks including PyTorch, TensorRT, and ONNX Runtime. By systematically categorising these factors, our work establishes a rigorous benchmarking framework that produces decision-ready Pareto frontiers, clarifying the trade-offs between accuracy, latency, energy, and carbon. The accompanying open-source code enables independent verification and facilitates adoption, empowering researchers and practitioners to make evidence-based decisions for sustainable AI deployment.
85. 【2510.17882】Does GenAI Rewrite How We Write? An Empirical Study on Two-Million Preprints
链接:https://arxiv.org/abs/2510.17882
作者:Minfeng Qi,Zhongmin Cao,Qin Wang,Ningran Li,Tianqing Zhu
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
关键词:scholarly communication, central infrastructures, infrastructures for scholarly, LLMs, large language models
备注:
点击查看摘要
Abstract:Preprint repositories become central infrastructures for scholarly communication. Their expansion transforms how research is circulated and evaluated before journal publication. Generative large language models (LLMs) introduce a further potential disruption by altering how manuscripts are written. While speculation abounds, systematic evidence of whether and how LLMs reshape scientific publishing remains limited. This paper addresses the gap through a large-scale analysis of more than 2.1 million preprints spanning 2016--2025 (115 months) across four major repositories (i.e., arXiv, bioRxiv, medRxiv, SocArXiv). We introduce a multi-level analytical framework that integrates interrupted time-series models, collaboration and productivity metrics, linguistic profiling, and topic modeling to assess changes in volume, authorship, style, and disciplinary orientation. Our findings reveal that LLMs have accelerated submission and revision cycles, modestly increased linguistic complexity, and disproportionately expanded AI-related topics, while computationally intensive fields benefit more than others. These results show that LLMs act less as universal disruptors than as selective catalysts, amplifying existing strengths and widening disciplinary divides. By documenting these dynamics, the paper provides the first empirical foundation for evaluating the influence of generative AI on academic publishing and highlights the need for governance frameworks that preserve trust, fairness, and accountability in an AI-enabled research ecosystem.
Subjects:
Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
Cite as:
arXiv:2510.17882 [cs.CY]
(or
arXiv:2510.17882v1 [cs.CY] for this version)
https://doi.org/10.48550/arXiv.2510.17882
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
86. 【2510.17881】POPI: Personalizing LLMs via Optimized Natural Language Preference Inference
链接:https://arxiv.org/abs/2510.17881
作者:Yizhuo Chen,Xin Liu,Ruijie Wang,Zheng Li,Pei Chen,Changlong Yu,Priyanka Nigam,Meng Jiang,Bing Yin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:experiences remain inconsistent, remain inconsistent due, Direct Preference Optimization, user experiences remain, achieve strong benchmark
备注:
点击查看摘要
Abstract:Large language models (LLMs) achieve strong benchmark performance, yet user experiences remain inconsistent due to diverse preferences in style, tone, and reasoning mode. Nevertheless, existing alignment techniques such as reinforcement learning from human feedback (RLHF) or Direct Preference Optimization (DPO) largely optimize toward population-level averages and overlook individual variation. Naive personalization strategies like per-user fine-tuning are computationally prohibitive, and in-context approaches that prepend raw user signals often suffer from inefficiency and noise. To address these challenges, we propose POPI, a general framework that introduces a preference inference model to distill heterogeneous user signals into concise natural language summaries. These summaries act as transparent, compact, and transferable personalization representations that condition a shared generation model to produce personalized responses. POPI jointly optimizes both preference inference and personalized generation under a unified objective using reinforcement learning, ensuring summaries maximally encode useful preference information. Extensive experiments across four personalization benchmarks demonstrate that POPI consistently improves personalization accuracy while reducing context overhead by a large margin. Moreover, optimized summaries seamlessly transfer to frozen off-the-shelf LLMs, enabling plug-and-play personalization without weight updates.
87. 【2510.17880】Outraged AI: Large language models prioritise emotion over cost in fairness enforcement
链接:https://arxiv.org/abs/2510.17880
作者:Hao Liu,Yiqing Dai,Haotian Tan,Yu Lei,Yujia Zhou,Zhen Wu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:similarly remains unknown, emotion similarly remains, large language models, remains unknown, large language
备注:
点击查看摘要
Abstract:Emotions guide human decisions, but whether large language models (LLMs) use emotion similarly remains unknown. We tested this using altruistic third-party punishment, where an observer incurs a personal cost to enforce fairness, a hallmark of human morality and often driven by negative emotion. In a large-scale comparison of 4,068 LLM agents with 1,159 adults across 796,100 decisions, LLMs used emotion to guide punishment, sometimes even more strongly than humans did: Unfairness elicited stronger negative emotion that led to more punishment; punishing unfairness produced more positive emotion than accepting; and critically, prompting self-reports of emotion causally increased punishment. However, mechanisms diverged: LLMs prioritized emotion over cost, enforcing norms in an almost all-or-none manner with reduced cost sensitivity, whereas humans balanced fairness and cost. Notably, reasoning models (o3-mini, DeepSeek-R1) were more cost-sensitive and closer to human behavior than foundation models (GPT-3.5, DeepSeek-V3), yet remained heavily emotion-driven. These findings provide the first causal evidence of emotion-guided moral decisions in LLMs and reveal deficits in cost calibration and nuanced fairness judgements, reminiscent of early-stage human responses. We propose that LLMs progress along a trajectory paralleling human development; future models should integrate emotion with context-sensitive reasoning to achieve human-like emotional intelligence.
88. 【2510.17844】Modeling Layered Consciousness with Multi-Agent Large Language Models
链接:https://arxiv.org/abs/2510.17844
作者:Sang Hun Kim,Jongmin Lee,Dongkyu Park,So Young Lee,Yosep Chong
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
关键词:modeling artificial consciousness, large language models, grounded in psychoanalytic, psychoanalytic theory, Personalization Module combining
备注: 20 pages, 4 figures, accepted for presentation at EMNLP 2025 Workshop on Active and Passive LLM Personalization (PALS) OpenReview: [this https URL](https://openreview.net/forum?id=rUtNkYvGJI)
点击查看摘要
Abstract:We propose a multi-agent framework for modeling artificial consciousness in large language models (LLMs), grounded in psychoanalytic theory. Our \textbf{Psychodynamic Model} simulates self-awareness, preconsciousness, and unconsciousness through agent interaction, guided by a Personalization Module combining fixed traits and dynamic needs. Using parameter-efficient fine-tuning on emotionally rich dialogues, the system was evaluated across eight personalized conditions. An LLM as a judge approach showed a 71.2\% preference for the fine-tuned model, with improved emotional depth and reduced output variance, demonstrating its potential for adaptive, personalized cognition.
信息检索
1. 【2510.18527】LLMs as Sparse Retrievers:A Framework for First-Stage Product Search
链接:https://arxiv.org/abs/2510.18527
作者:Hongru Song,Yu-an Liu,Ruqing Zhang,Jiafeng Guo,Maarten de Rijke,Sen Li,Wenjun Peng,Fuyu Lv,Xueqi Cheng
类目:Information Retrieval (cs.IR)
关键词:modern e-commerce platforms, Product search, e-commerce platforms, crucial component, component of modern
备注: 16 pages
点击查看摘要
Abstract:Product search is a crucial component of modern e-commerce platforms, with billions of user queries every day. In product search systems, first-stage retrieval should achieve high recall while ensuring efficient online deployment. Sparse retrieval is particularly attractive in this context due to its interpretability and storage efficiency. However, sparse retrieval methods suffer from severe vocabulary mismatch issues, leading to suboptimal performance in product search scenarios. With their potential for semantic analysis, large language models (LLMs) offer a promising avenue for mitigating vocabulary mismatch issues and thereby improving retrieval quality. Directly applying LLMs to sparse retrieval in product search exposes two key challenges:(1)Queries and product titles are typically short and highly susceptible to LLM-induced hallucinations, such as generating irrelevant expansion terms or underweighting critical literal terms like brand names and model numbers;(2)The large vocabulary space of LLMs leads to difficulty in initializing training effectively, making it challenging to learn meaningful sparse representations in such ultra-high-dimensional this http URL address these challenges, we propose PROSPER, a framework for PROduct search leveraging LLMs as SParsE Retrievers. PROSPER incorporates: (1)A literal residual network that alleviates hallucination in lexical expansion by reinforcing underweighted literal terms through a residual compensation mechanism; and (2)A lexical focusing window that facilitates effective training initialization via a coarse-to-fine sparsification this http URL offline and online experiments show that PROSPER significantly outperforms sparse baselines and achieves recall performance comparable to advanced dense retrievers, while also achieving revenue increments online.
2. 【2510.18433】ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization
链接:https://arxiv.org/abs/2510.18433
作者:Yuanhe Guo,Linxi Xie,Zhuoran Chen,Kangrui Yu,Ryan Po,Guandao Yang,Gordon Wetztein,Hongyi Wen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:understand fine-grained individual, user preference annotations, user preference, studying generative models, fine-grained user preference
备注:
点击查看摘要
Abstract:We introduce ImageGem, a dataset for studying generative models that understand fine-grained individual preferences. We posit that a key challenge hindering the development of such a generative model is the lack of in-the-wild and fine-grained user preference annotations. Our dataset features real-world interaction data from 57K users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. With user preference annotations from our dataset, we were able to train better preference alignment models. In addition, leveraging individual user preference, we investigated the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation. Finally, we propose an end-to-end framework for editing customized diffusion models in a latent weight space to align with individual user preferences. Our results demonstrate that the ImageGem dataset enables, for the first time, a new paradigm for generative model personalization.
3. 【2510.18394】Censorship Chokepoints: New Battlegrounds for Regional Surveillance, Censorship and Influence on the Internet
链接:https://arxiv.org/abs/2510.18394
作者:Yong Zhang,Nishanth Sastry
类目:Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Networking and Internet Architecture (cs.NI); Social and Information Networks (cs.SI)
关键词:general public, important conduits, Undoubtedly, Internet, conduits to information
备注: 15 pages, 2 figures
点击查看摘要
Abstract:Undoubtedly, the Internet has become one of the most important conduits to information for the general public. Nonetheless, Internet access can be and has been limited systematically or blocked completely during political events in numerous countries and regions by various censorship mechanisms. Depending on where the core filtering component is situated, censorship techniques have been classified as client-based, server-based, or network-based. However, as the Internet evolves rapidly, new and sophisticated censorship techniques have emerged, which involve techniques that cut across locations and involve new forms of hurdles to information access. We argue that modern censorship can be better understood through a new lens that we term chokepoints, which identifies bottlenecks in the content production or delivery cycle where efficient new forms of large-scale client-side surveillance and filtering mechanisms have emerged.
4. 【2510.18364】Evaluating LLM-Based Mobile App Recommendations: An Empirical Study
链接:https://arxiv.org/abs/2510.18364
作者:Quim Motger,Xavier Franch,Vincenzo Gervasi,Jordi Marco
类目:Information Retrieval (cs.IR); Software Engineering (cs.SE)
关键词:Large Language Models, natural language prompts, Large Language, Language Models, App Store Optimization
备注: Under review
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly used to recommend mobile applications through natural language prompts, offering a flexible alternative to keyword-based app store search. Yet, the reasoning behind these recommendations remains opaque, raising questions about their consistency, explainability, and alignment with traditional App Store Optimization (ASO) metrics. In this paper, we present an empirical analysis of how widely-used general purpose LLMs generate, justify, and rank mobile app recommendations. Our contributions are: (i) a taxonomy of 16 generalizable ranking criteria elicited from LLM outputs; (ii) a systematic evaluation framework to analyse recommendation consistency and responsiveness to explicit ranking instructions; and (iii) a replication package to support reproducibility and future research on AI-based recommendation systems. Our findings reveal that LLMs rely on a broad yet fragmented set of ranking criteria, only partially aligned with standard ASO metrics. While top-ranked apps tend to be consistent across runs, variability increases with ranking depth and search specificity. LLMs exhibit varying sensitivity to explicit ranking instructions - ranging from substantial adaptations to near-identical outputs - highlighting their complex reasoning dynamics in conversational app discovery. Our results aim to support end-users, app developers, and recommender-systems researchers in navigating the emerging landscape of conversational app discovery.
5. 【2510.18355】KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers
链接:https://arxiv.org/abs/2510.18355
作者:Mohd Ruhul Ameen,Akif Islam,Farjana Aktar,M. Saifuzzaman Rafat
类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
关键词:Optical Character Recognition, accessing timely, continue to face, face challenges, challenges in accessing
备注: 6 pages, 7 figures, 5 tables, submitted to the 11th IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering (WIECON-ECE 2025)
点击查看摘要
Abstract:In Bangladesh, many farmers continue to face challenges in accessing timely, expert-level agricultural guidance. This paper presents KrishokBondhu, a voice-enabled, call-centre-integrated advisory platform built on a Retrieval-Augmented Generation (RAG) framework, designed specifically for Bengali-speaking farmers. The system aggregates authoritative agricultural handbooks, extension manuals, and NGO publications; applies Optical Character Recognition (OCR) and document-parsing pipelines to digitize and structure the content; and indexes this corpus in a vector database for efficient semantic retrieval. Through a simple phone-based interface, farmers can call the system to receive real-time, context-aware advice: speech-to-text converts the Bengali query, the RAG module retrieves relevant content, a large language model (Gemma 3-4B) generates a context-grounded response, and text-to-speech delivers the answer in natural spoken Bengali. In a pilot evaluation, KrishokBondhu produced high-quality responses for 72.7% of diverse agricultural queries covering crop management, disease control, and cultivation practices. Compared to the KisanQRS benchmark, the system achieved a composite score of 4.53 (vs. 3.13) on a 5-point scale, a 44.7% improvement, with especially large gains in contextual richness (+367%) and completeness (+100.4%), while maintaining comparable relevance and technical specificity. Semantic similarity analysis further revealed a strong correlation between retrieved context and answer quality, emphasizing the importance of grounding generative responses in curated documentation. KrishokBondhu demonstrates the feasibility of integrating call-centre accessibility, multilingual voice interaction, and modern RAG techniques to deliver expert-level agricultural guidance to remote Bangladeshi farmers, paving the way toward a fully AI-driven agricultural advisory ecosystem.
6. 【2510.18277】Enhancing Hotel Recommendations with AI: LLM-Based Review Summarization and Query-Driven Insights
链接:https://arxiv.org/abs/2510.18277
作者:Nikolaos Belibasakis,Anastasios Giannaros,Ioanna Giannoukou,Spyros Sioutas
类目:Information Retrieval (cs.IR)
关键词:AirBnB offers make, http URL platform, booking platform providers, booking platform, text-based user reviews
备注:
点击查看摘要
Abstract:The increasing number of data a booking platform such as this http URL and AirBnB offers make it challenging for interested parties to browse through the available accommodations and analyze reviews in an efficient way. Efforts have been made from the booking platform providers to utilize recommender systems in an effort to enable the user to filter the results by factors such as stars, amenities, cost but most valuable insights can be provided by the unstructured text-based reviews. Going through these reviews one-by-one requires a substantial amount of time to be devoted while a respectable percentage of the reviews won't provide to the user what they are actually looking for. This research publication explores how Large Language Models (LLMs) can enhance short rental apartments recommendations by summarizing and mining key insights from user reviews. The web application presented in this paper, named "instaGuide", automates the procedure of isolating the text-based user reviews from a property on the this http URL platform, synthesizing the summary of the reviews, and enabling the user to query specific aspects of the property in an effort to gain feedback on their personal questions/criteria. During the development of the instaGuide tool, numerous LLM models were evaluated based on accuracy, cost, and response quality. The results suggest that the LLM-powered summarization reduces significantly the amount of time the users need to devote on their search for the right short rental apartment, improving the overall decision-making procedure.
Subjects:
Information Retrieval (cs.IR)
Cite as:
arXiv:2510.18277 [cs.IR]
(or
arXiv:2510.18277v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2510.18277
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
7. 【2510.18239】LIME: Link-based user-item Interaction Modeling with decoupled xor attention for Efficient test time scaling
链接:https://arxiv.org/abs/2510.18239
作者:Yunjiang Jiang,Ayush Agarwal,Yang Liu,Bi Xue
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:increasing model capacity, longer user histories, processing longer user, Scaling large recommendation, systems requires advancing
备注: 16 pages
点击查看摘要
Abstract:Scaling large recommendation systems requires advancing three major frontiers: processing longer user histories, expanding candidate sets, and increasing model capacity. While promising, transformers' computational cost scales quadratically with the user sequence length and linearly with the number of candidates. This trade-off makes it prohibitively expensive to expand candidate sets or increase sequence length at inference, despite the significant performance improvements. We introduce \textbf{LIME}, a novel architecture that resolves this trade-off. Through two key innovations, LIME fundamentally reduces computational complexity. First, low-rank ``link embeddings" enable pre-computation of attention weights by decoupling user and candidate interactions, making the inference cost nearly independent of candidate set size. Second, a linear attention mechanism, \textbf{LIME-XOR}, reduces the complexity with respect to user sequence length from quadratic ($O(N^2)$) to linear ($O(N)$). Experiments on public and industrial datasets show LIME achieves near-parity with state-of-the-art transformers but with a 10$\times$ inference speedup on large candidate sets or long sequence lengths. When tested on a major recommendation platform, LIME improved user engagement while maintaining minimal inference costs with respect to candidate set size and user history length, establishing a new paradigm for efficient and expressive recommendation systems.
Comments:
16 pages
Subjects:
Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:
arXiv:2510.18239 [cs.IR]
(or
arXiv:2510.18239v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2510.18239
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
8. 【2510.18104】From AutoRecSys to AutoRecLab: A Call to Build, Evaluate, and Govern Autonomous Recommender-Systems Research Labs
链接:https://arxiv.org/abs/2510.18104
作者:Joeran Beel,Bela Gipp,Tobias Vente,Moritz Baumgart,Philipp Meister
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:largely neglects automating, Recommender-Systems Research Lab, Autonomous Recommender-Systems Research, Recommender-systems research, evaluation advances
备注:
点击查看摘要
Abstract:Recommender-systems research has accelerated model and evaluation advances, yet largely neglects automating the research process itself. We argue for a shift from narrow AutoRecSys tools -- focused on algorithm selection and hyper-parameter tuning -- to an Autonomous Recommender-Systems Research Lab (AutoRecLab) that integrates end-to-end automation: problem ideation, literature analysis, experimental design and execution, result interpretation, manuscript drafting, and provenance logging. Drawing on recent progress in automated science (e.g., multi-agent AI Scientist and AI Co-Scientist systems), we outline an agenda for the RecSys community: (1) build open AutoRecLab prototypes that combine LLM-driven ideation and reporting with automated experimentation; (2) establish benchmarks and competitions that evaluate agents on producing reproducible RecSys findings with minimal human input; (3) create review venues for transparently AI-generated submissions; (4) define standards for attribution and reproducibility via detailed research logs and metadata; and (5) foster interdisciplinary dialogue on ethics, governance, privacy, and fairness in autonomous research. Advancing this agenda can increase research throughput, surface non-obvious insights, and position RecSys to contribute to emerging Artificial Research Intelligence. We conclude with a call to organise a community retreat to coordinate next steps and co-author guidance for the responsible integration of automated research systems.
计算机视觉
1. 【2510.18876】Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
链接:https://arxiv.org/abs/2510.18876
作者:Haochen Wang,Yuhao Wang,Tao Zhang,Yikang Zhou,Yanwei Li,Jiacong Wang,Ye Tian,Jiahao Meng,Zilong Huang,Guangcan Mai,Anran Wang,Yunhai Tong,Zhuochen Wang,Xiangtai Li,Zhaoxiang Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language
备注:
点击查看摘要
Abstract:While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehen- sive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.
2. 【2510.18873】DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
链接:https://arxiv.org/abs/2510.18873
作者:Ziang Zhang,Zehan Wang,Guanghao Zhang,Weilong Dai,Yan Xia,Ziang Yan,Minjie Hong,Zhou Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Dynamic Spatial Intelligence, dynamic spatial, move simultaneously, dynamic spatial relationships, introduce Dynamic Spatial
备注:
点击查看摘要
Abstract:Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to fully understand dynamic 3D scenarios remains limited. We introduce Dynamic Spatial Intelligence and propose DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Spatially and temporally symmetric designs reduce biases and enable systematic evaluation of models' reasoning about self-motion and object motion. Our evaluation of 14 VLMs and expert models reveals key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios. Our DSI-Bench provides valuable findings and insights about the future development of general and expertise models with dynamic spatial intelligence.
3. 【2510.18866】LightMem: Lightweight and Efficient Memory-Augmented Generation
链接:https://arxiv.org/abs/2510.18866
作者:Jizhan Fang,Xinle Deng,Haoming Xu,Ziyan Jiang,Yuqi Tang,Ziwen Xu,Shumin Deng,Yunzhi Yao,Mengru Wang,Shuofei Qiao,Huajun Chen,Ningyu Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:Large Language Models, Large Language, effectively leverage historical, leverage historical interaction, Language Models
备注: Work in progress
点击查看摘要
Abstract:Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at this https URL.
4. 【2510.18851】DP$^2$O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution
链接:https://arxiv.org/abs/2510.18851
作者:Rongyuan Wu,Lingchen Sun,Zhengqiang Zhang,Shihao Wang,Tianhe Wu,Qiaosi Yi,Shuai Li,Lei Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Benefiting from pre-trained, methods can synthesize, realistic details, synthesize rich, rich and realistic
备注: Accept by NeurIPS 2025
点击查看摘要
Abstract:Benefiting from pre-trained text-to-image (T2I) diffusion models, real-world image super-resolution (Real-ISR) methods can synthesize rich and realistic details. However, due to the inherent stochasticity of T2I models, different noise inputs often lead to outputs with varying perceptual quality. Although this randomness is sometimes seen as a limitation, it also introduces a wider perceptual quality range, which can be exploited to improve Real-ISR performance. To this end, we introduce Direct Perceptual Preference Optimization for Real-ISR (DP$^2$O-SR), a framework that aligns generative models with perceptual preferences without requiring costly human annotations. We construct a hybrid reward signal by combining full-reference and no-reference image quality assessment (IQA) models trained on large-scale human preference datasets. This reward encourages both structural fidelity and natural appearance. To better utilize perceptual diversity, we move beyond the standard best-vs-worst selection and construct multiple preference pairs from outputs of the same model. Our analysis reveals that the optimal selection ratio depends on model capacity: smaller models benefit from broader coverage, while larger models respond better to stronger contrast in supervision. Furthermore, we propose hierarchical preference optimization, which adaptively weights training pairs based on intra-group reward gaps and inter-group diversity, enabling more efficient and stable learning. Extensive experiments across both diffusion- and flow-based T2I backbones demonstrate that DP$^2$O-SR significantly improves perceptual quality and generalizes well to real-world benchmarks.
5. 【2510.18840】See the Text: From Tokenization to Visual Reading
链接:https://arxiv.org/abs/2510.18840
作者:Ling Xing,Alex Jinpeng Wang,Rui Yan,Hongyu Qu,Zechao Li,Jinhui Tang
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:People, People see text, language models, Abstract, large language models
备注:
点击查看摘要
Abstract:People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.
6. 【2510.18837】FedDEAP: Adaptive Dual-Prompt Tuning for Multi-Domain Federated Learning
链接:https://arxiv.org/abs/2510.18837
作者:Yubin Zheng,Pak-Hei Yeung,Jing Xia,Tianjie Ju,Peng Tang,Weidong Qiu,Jagath C. Rajapakse
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:train machine learning, collaboratively train machine, machine learning models, exposing local data, enables multiple clients
备注: Accepted at MM 2025
点击查看摘要
Abstract:Federated learning (FL) enables multiple clients to collaboratively train machine learning models without exposing local data, balancing performance and privacy. However, domain shift and label heterogeneity across clients often hinder the generalization of the aggregated global model. Recently, large-scale vision-language models like CLIP have shown strong zero-shot classification capabilities, raising the question of how to effectively fine-tune CLIP across domains in a federated setting. In this work, we propose an adaptive federated prompt tuning framework, FedDEAP, to enhance CLIP's generalization in multi-domain scenarios. Our method includes the following three key components: (1) To mitigate the loss of domain-specific information caused by label-supervised tuning, we disentangle semantic and domain-specific features in images by using semantic and domain transformation networks with unbiased mappings; (2) To preserve domain-specific knowledge during global prompt aggregation, we introduce a dual-prompt design with a global semantic prompt and a local domain prompt to balance shared and personalized information; (3) To maximize the inclusion of semantic and domain information from images in the generated text features, we align textual and visual representations under the two learned transformations to preserve semantic and domain consistency. Theoretical analysis and extensive experiments on four datasets demonstrate the effectiveness of our method in enhancing the generalization of CLIP for federated image recognition across multiple domains.
7. 【2510.18825】Unifying and Enhancing Graph Transformers via a Hierarchical Mask Framework
链接:https://arxiv.org/abs/2510.18825
作者:Yujie Xing,Xiao Wang,Bin Wu,Hai Huang,Chuan Shi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:representation learning due, graph representation learning, representation learning, learning due, Dual Attention Computation
备注: Accepted by NeurIPS 2025 (Poster)
点击查看摘要
Abstract:Graph Transformers (GTs) have emerged as a powerful paradigm for graph representation learning due to their ability to model diverse node interactions. However, existing GTs often rely on intricate architectural designs tailored to specific interactions, limiting their flexibility. To address this, we propose a unified hierarchical mask framework that reveals an underlying equivalence between model architecture and attention mask construction. This framework enables a consistent modeling paradigm by capturing diverse interactions through carefully designed attention masks. Theoretical analysis under this framework demonstrates that the probability of correct classification positively correlates with the receptive field size and label consistency, leading to a fundamental design principle: an effective attention mask should ensure both a sufficiently large receptive field and a high level of label consistency. While no single existing mask satisfies this principle across all scenarios, our analysis reveals that hierarchical masks offer complementary strengths, motivating their effective integration. Then, we introduce M3Dphormer, a Mixture-of-Experts-based Graph Transformer with Multi-Level Masking and Dual Attention Computation. M3Dphormer incorporates three theoretically grounded hierarchical masks and employs a bi-level expert routing mechanism to adaptively integrate multi-level interaction information. To ensure scalability, we further introduce a dual attention computation scheme that dynamically switches between dense and sparse modes based on local mask sparsity. Extensive experiments across multiple benchmarks demonstrate that M3Dphormer achieves state-of-the-art performance, validating the effectiveness of our unified framework and model design.
8. 【2510.18822】SAM 2++: Tracking Anything at Any Granularity
链接:https://arxiv.org/abs/2510.18822
作者:Jiaming Zhang,Cheng Liang,Yichun Yang,Chenkai Zeng,Yutao Cui,Xinwen Zhang,Xin Zhou,Kai Ma,Gangshan Wu,Limin Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Video tracking aims, aims at finding, finding the specific, subsequent frames, tracking
备注: 8 pages, and 10 pages in Supplementary Material
点击查看摘要
Abstract:Video tracking aims at finding the specific target in subsequent frames given its initial state. Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task and heavily rely on custom-designed modules within the individual task, which limits their generalization and leads to redundancy in both model design and parameters. To unify video tracking tasks, we present SAM 2++, a unified model towards tracking at any granularity, including masks, boxes, and points. First, to extend target granularity, we design task-specific prompts to encode various task inputs into general prompt embeddings, and a unified decoder to unify diverse task results into a unified form pre-output. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities. Finally, we introduce a customized data engine to support tracking training at any granularity, producing a large and diverse video tracking dataset with rich annotations at three granularities, termed Tracking-Any-Granularity, which represents a comprehensive resource for training and benchmarking on unified tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.
9. 【2510.18819】An Explainable Hybrid AI Framework for Enhanced Tuberculosis and Symptom Detection
链接:https://arxiv.org/abs/2510.18819
作者:Neel Patel,Alexander Wong,Ashkan Ebadi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:global health issue, critical global health, health issue, remote areas, remains a critical
备注: 16 pages, 3 figures
点击查看摘要
Abstract:Tuberculosis remains a critical global health issue, particularly in resource-limited and remote areas. Early detection is vital for treatment, yet the lack of skilled radiologists underscores the need for artificial intelligence (AI)-driven screening tools. Developing reliable AI models is challenging due to the necessity for large, high-quality datasets, which are costly to obtain. To tackle this, we propose a teacher--student framework which enhances both disease and symptom detection on chest X-rays by integrating two supervised heads and a self-supervised head. Our model achieves an accuracy of 98.85% for distinguishing between COVID-19, tuberculosis, and normal cases, and a macro-F1 score of 90.09% for multilabel symptom detection, significantly outperforming baselines. The explainability assessments also show the model bases its predictions on relevant anatomical features, demonstrating promise for deployment in clinical screening and triage settings.
10. 【2510.18813】A Geometric Approach to Steerable Convolutions
链接:https://arxiv.org/abs/2510.18813
作者:Soumyabrata Kundu,Risi Kondor
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:group theoretical approach, theoretical approach adopted, convolutional neural networks, steerable convolutional neural, group theoretical
备注:
点击查看摘要
Abstract:In contrast to the somewhat abstract, group theoretical approach adopted by many papers, our work provides a new and more intuitive derivation of steerable convolutional neural networks in $d$ dimensions. This derivation is based on geometric arguments and fundamental principles of pattern matching. We offer an intuitive explanation for the appearance of the Clebsch--Gordan decomposition and spherical harmonic basis functions. Furthermore, we suggest a novel way to construct steerable convolution layers using interpolation kernels that improve upon existing implementation, and offer greater robustness to noisy data.
11. 【2510.18795】ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
链接:https://arxiv.org/abs/2510.18795
作者:Xiaoxing Hu,Kaicheng Yang,Ziyong Feng,Qi Ming,Zonghao Guo,Xiang An,Ziyong Feng,Junchi Yan,Xue Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:CLIP text encoder, CLIP image encoder, CLIP text, original CLIP text, CLIP
备注: 17 pages, 5 fiugres
点击查看摘要
Abstract:The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder to leverage CLIP's rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The Code is available at this https URL.
12. 【2510.18781】Rebellious Student: A Complementary Learning Framework for Background Feature Enhancement in Hyperspectral Anomaly Detection
链接:https://arxiv.org/abs/2510.18781
作者:Wenping Jin,Yuyang Tang,Li Zhu,Fei Guo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrated remarkable efficiency, universally deployed, parameter tuning, efficiency and robustness, recent class
备注:
点击查看摘要
Abstract:A recent class of hyperspectral anomaly detection methods that can be trained once on background datasets and then universally deployed -- without per-scene retraining or parameter tuning -- has demonstrated remarkable efficiency and robustness. Building upon this paradigm, we focus on the integration of spectral and spatial cues and introduce a novel "Rebellious Student" framework for complementary feature learning. Unlike conventional teacher-student paradigms driven by imitation, our method intentionally trains the spatial branch to diverge from the spectral teacher, thereby learning complementary spatial patterns that the teacher fails to capture. A two-stage learning strategy is adopted: (1) a spectral enhancement network is first trained via reverse distillation to obtain robust background spectral representations; and (2) a spatial network -- the rebellious student -- is subsequently optimized using decorrelation losses that enforce feature orthogonality while maintaining reconstruction fidelity to avoid irrelevant noise. Once trained, the framework enhances both spectral and spatial background features, enabling parameter-free and training-free anomaly detection when paired with conventional detectors. Extensive experiments on the HAD100 benchmark show substantial improvements over several established baselines with minimal computational overhead, confirming the effectiveness and generality of the proposed complementary learning paradigm. Our code is publicly available at this https URL.
13. 【2510.18775】UltraGen: High-Resolution Video Generation with Hierarchical Attention
链接:https://arxiv.org/abs/2510.18775
作者:Teng Hu,Jiangning Zhang,Zihan Su,Ran Yi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:produce visually compelling, Recent advances, visually compelling videos, video generation, virtual reality
备注:
点击查看摘要
Abstract:Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion transformer based video generation models are limited to low-resolution outputs (=720P) due to the quadratic computational complexity of the attention mechanism with respect to the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows. Extensive experiments demonstrate that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time, outperforming existing state-of-the-art methods and super-resolution based two-stage pipelines in both qualitative and quantitative evaluations.
14. 【2510.18773】Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model for Microclimate Impact Prediction
链接:https://arxiv.org/abs/2510.18773
作者:Jannis Fleckenstein,David Kreismann,Tamara Rosemary Govindasamy,Thomas Brunschwiler,Etienne Vos,Mattia Rigotti
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:climate change progress, change progress, frequent and severe, urban heat, urban heat island
备注: 10 pages, 9 figures. Accepted at the NeurIPS 2025 Workshop on Tackling Climate Change with Machine Learning
点击查看摘要
Abstract:As urbanization and climate change progress, urban heat island effects are becoming more frequent and severe. To formulate effective mitigation plans, cities require detailed air temperature data, yet conventional machine learning models with limited data often produce inaccurate predictions, particularly in underserved areas. Geospatial foundation models trained on global unstructured data offer a promising alternative by demonstrating strong generalization and requiring only minimal fine-tuning. In this study, an empirical ground truth of urban heat patterns is established by quantifying cooling effects from green spaces and benchmarking them against model predictions to evaluate the model's accuracy. The foundation model is subsequently fine-tuned to predict land surface temperatures under future climate scenarios, and its practical value is demonstrated through a simulated inpainting that highlights its role for mitigation support. The results indicate that foundation models offer a powerful way for evaluating urban heat island mitigation strategies in data-scarce regions to support more climate-resilient cities.
15. 【2510.18751】Seg the HAB: Language-Guided Geospatial Algae Bloom Reasoning and Segmentation
链接:https://arxiv.org/abs/2510.18751
作者:Patterson Hsieh,Jerry Yeh,Mao-Chi He,Wen-Han Hsieh,Elvis Hsieh
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:threaten aquatic ecosystems, harmful algal bloom, Climate change, toxin release, oxygen depletion
备注:
点击查看摘要
Abstract:Climate change is intensifying the occurrence of harmful algal bloom (HAB), particularly cyanobacteria, which threaten aquatic ecosystems and human health through oxygen depletion, toxin release, and disruption of marine biodiversity. Traditional monitoring approaches, such as manual water sampling, remain labor-intensive and limited in spatial and temporal coverage. Recent advances in vision-language models (VLMs) for remote sensing have shown potential for scalable AI-driven solutions, yet challenges remain in reasoning over imagery and quantifying bloom severity. In this work, we introduce ALGae Observation and Segmentation (ALGOS), a segmentation-and-reasoning system for HAB monitoring that combines remote sensing image understanding with severity estimation. Our approach integrates GeoSAM-assisted human evaluation for high-quality segmentation mask curation and fine-tunes vision language model on severity prediction using the Cyanobacteria Aggregated Manual Labels (CAML) from NASA. Experiments demonstrate that ALGOS achieves robust performance on both segmentation and severity-level estimation, paving the way toward practical and automated cyanobacterial monitoring systems.
16. 【2510.18740】SEAL: Semantic-Aware Hierarchical Learning for Generalized Category Discovery
链接:https://arxiv.org/abs/2510.18740
作者:Zhenqi He,Yuanpei Liu,Kai Han
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Generalized Category Discovery, Category Discovery, Generalized Category, problem of Generalized, paper investigates
备注: Accepted to NeurIPS 2025
点击查看摘要
Abstract:This paper investigates the problem of Generalized Category Discovery (GCD). Given a partially labelled dataset, GCD aims to categorize all unlabelled images, regardless of whether they belong to known or unknown classes. Existing approaches typically depend on either single-level semantics or manually designed abstract hierarchies, which limit their generalizability and scalability. To address these limitations, we introduce a SEmantic-aware hierArchical Learning framework (SEAL), guided by naturally occurring and easily accessible hierarchical structures. Within SEAL, we propose a Hierarchical Semantic-Guided Soft Contrastive Learning approach that exploits hierarchical similarity to generate informative soft negatives, addressing the limitations of conventional contrastive losses that treat all negatives equally. Furthermore, a Cross-Granularity Consistency (CGC) module is designed to align the predictions from different levels of granularity. SEAL consistently achieves state-of-the-art performance on fine-grained benchmarks, including the SSB benchmark, Oxford-Pet, and the Herbarium19 dataset, and further demonstrates generalization on coarse-grained datasets. Project page: this https URL
17. 【2510.18739】Moving Light Adaptive Colonoscopy Reconstruction via Illumination-Attenuation-Aware 3D Gaussian Splatting
链接:https://arxiv.org/abs/2510.18739
作者:Hao Wang,Ying Zhou,Haoyu Zhao,Rui Wang,Qiang Hu,Xing Zhang,Qiang Li,Zhiwei Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabling critical applications, Gaussian Splatting, enabling critical, lesion tracking, pivotal technique
备注:
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has emerged as a pivotal technique for real-time view synthesis in colonoscopy, enabling critical applications such as virtual colonoscopy and lesion tracking. However, the vanilla 3DGS assumes static illumination and that observed appearance depends solely on viewing angle, which causes incompatibility with the photometric variations in colonoscopic scenes induced by dynamic light source/camera. This mismatch forces most 3DGS methods to introduce structure-violating vaporous Gaussian blobs between the camera and tissues to compensate for illumination attenuation, ultimately degrading the quality of 3D reconstructions. Previous works only consider the illumination attenuation caused by light distance, ignoring the physical characters of light source and camera. In this paper, we propose ColIAGS, an improved 3DGS framework tailored for colonoscopy. To mimic realistic appearance under varying illumination, we introduce an Improved Appearance Modeling with two types of illumination attenuation factors, which enables Gaussians to adapt to photometric variations while preserving geometry accuracy. To ensure the geometry approximation condition of appearance modeling, we propose an Improved Geometry Modeling using high-dimensional view embedding to enhance Gaussian geometry attribute prediction. Furthermore, another cosine embedding input is leveraged to generate illumination attenuation solutions in an implicit manner. Comprehensive experimental results on standard benchmarks demonstrate that our proposed ColIAGS achieves the dual capabilities of novel view synthesis and accurate geometric reconstruction. It notably outperforms other state-of-the-art methods by achieving superior rendering fidelity while significantly reducing Depth MSE. Code will be available.
18. 【2510.18726】IF-VidCap: Can Video Caption Models Follow Instructions?
链接:https://arxiv.org/abs/2510.18726
作者:Shihao Li,Yuanxing Zhang,Jiangtao Wu,Zhide Lei,Yiwen He,Runzhe Wen,Chenxi Liao,Chengkang Jiang,An Ping,Shuo Gao,Suhan Wang,Zhaozhou Bian,Zijun Zhou,Jingyi Xie,Jiayi Zhou,Jing Wang,Yifan Yao,Weihao Xie,Yingshui Tan,Yanghai Wang,Qianqian Xie,Zhaoxiang Zhang,Jiaheng Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, practical applications require
备注: [this https URL](https://github.com/NJU-LINK/IF-VidCap)
点击查看摘要
Abstract:Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.
19. 【2510.18716】SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation
链接:https://arxiv.org/abs/2510.18716
作者:Siyong Jian,Huan Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Janus-Pro produce high-quality, ever-growing computational demands, computational demands due, produce high-quality images, Autoregressive image generation
备注:
点击查看摘要
Abstract:Autoregressive image generation models like Janus-Pro produce high-quality images, but at the significant cost of high memory and ever-growing computational demands due to the large number of visual tokens. While KV cache compression has been extensively studied in language modeling, it still remains largely unexplored for the image generation domain. In this work, we begin by identifying a distinct and prominent attention phenomenon, which we term spatial locality and emergent semantic sink. To leverage this key insight, we introduce a novel KV cache compression framework. Specifically, we compress the KV cache for all visual tokens by adaptively decoupling attention heads into two separate types: for spatial-locality heads, our method maintains a short recent token window; for semantic-sink heads, it strategically preserves a compact set of highly-attended tokens. Our extensive experiments demonstrate that the proposed method achieves a 5$\times$ reduction in memory usage and a notable 6.6$\times$ speedup in overall throughput with only minimal visual quality loss, thereby enabling highly efficient native autoregressive image generation on resource-constrained hardware.
20. 【2510.18714】PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting
链接:https://arxiv.org/abs/2510.18714
作者:Changkun Liu,Bin Tan,Zeran Ke,Shangzhan Zhang,Jiachen Liu,Ming Qian,Nan Xue,Yujun Shen,Tristan Braud
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:inherent geometric regularities, paper addresses metric, paper addresses, scenes by exploiting, exploiting their inherent
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025). The project page is available at: [this https URL](https://lck666666.github.io/plana3r)
点击查看摘要
Abstract:This paper addresses metric 3D reconstruction of indoor scenes by exploiting their inherent geometric regularities with compact representations. Using planar 3D primitives - a well-suited representation for man-made environments - we introduce PLANA3R, a pose-free framework for metric Planar 3D Reconstruction from unposed two-view images. Our approach employs Vision Transformers to extract a set of sparse planar primitives, estimate relative camera poses, and supervise geometry learning via planar splatting, where gradients are propagated through high-resolution rendered depth and normal maps of primitives. Unlike prior feedforward methods that require 3D plane annotations during training, PLANA3R learns planar 3D structures without explicit plane supervision, enabling scalable training on large-scale stereo datasets using only depth and normal annotations. We validate PLANA3R on multiple indoor-scene datasets with metric supervision and demonstrate strong generalization to out-of-domain indoor environments across diverse tasks under metric evaluation protocols, including 3D surface reconstruction, depth estimation, and relative pose estimation. Furthermore, by formulating with planar 3D representation, our method emerges with the ability for accurate plane segmentation. The project page is available at this https URL
21. 【2510.18705】A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition
链接:https://arxiv.org/abs/2510.18705
作者:Peiqin Zhuang,Lei Bai,Yichao Wu,Ding Liang,Luping Zhou,Yali Wang,Wanli Ouyang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:spatiotemporal contextual aggregation, motion modeling, motion modeling capacities, dominated by transformer-based, contextual aggregation capacities
备注: accepted by Pattern Recognition. We have been always curious to see whether our designs could be beneficial in other scenarios, such as embedding it into the DiT model or 3D-VAE for video generation. If you are interested in it, why not give it a shot?
点击查看摘要
Abstract:Recently, action recognition has been dominated by transformer-based methods, thanks to their spatiotemporal contextual aggregation capacities. However, despite the significant progress achieved on scene-related datasets, they do not perform well on motion-sensitive datasets due to the lack of elaborate motion modeling designs. Meanwhile, we observe that the widely-used cost volume in traditional action recognition is highly similar to the affinity matrix defined in self-attention, but equipped with powerful motion modeling capacities. In light of this, we propose to integrate those effective motion modeling properties into the existing transformer in a unified and neat way, with the proposal of the Explicit Motion Information Mining module (EMIM). In EMIM, we propose to construct the desirable affinity matrix in a cost volume style, where the set of key candidate tokens is sampled from the query-based neighboring area in the next frame in a sliding-window manner. Then, the constructed affinity matrix is used to aggregate contextual information for appearance modeling and is converted into motion features for motion modeling as well. We validate the motion modeling capacities of our method on four widely-used datasets, and our method performs better than existing state-of-the-art approaches, especially on motion-sensitive datasets, i.e., Something-Something V1 V2.
22. 【2510.18703】Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
链接:https://arxiv.org/abs/2510.18703
作者:Yiqi Lin,Alex Jinpeng Wang,Linjie Li,Zhengyuan Yang,Mike Zheng Shou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:CLIP have demonstrated, demonstrated strong performance, aligned image-text pairs, demonstrated strong, wide range
备注: Project page: this [this https URL](https://linyq17.github.io/VC2L/)
点击查看摘要
Abstract:Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly paired image-text data. To assess the effectiveness of this approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR, designed to evaluate cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively. Empirical results show that VC2L achieves competitive or superior performance compared to CLIP-style models on both the proposed benchmarks and established datasets such as M-BEIR and MTEB. These findings underscore the potential of multimodal web data as a valuable training resource for contrastive learning and illustrate the scalability of a unified, vision-centric approach for multimodal representation learning. Code and models are available at: this https URL.
23. 【2510.18701】UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
链接:https://arxiv.org/abs/2510.18701
作者:Yibin Wang,Zhimin Li,Yuhang Zang,Jiazi Bu,Yujie Zhou,Yi Xin,Junjun He,Chunyu Wang,Qinglin Lu,Cheng Jin,Jiaqi Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent progress, accurately generated images, generated images reflect, underscores the importance, evaluating how accurately
备注: Project page: [this http URL](http://codegoat24.github.io/UniGenBench/)
点击查看摘要
Abstract:Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks in evaluating how accurately generated images reflect the semantics of their textual prompt. However, (1) existing benchmarks lack the diversity of prompt scenarios and multilingual support, both essential for real-world applicability; (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions, and fall short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) spans across diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) comprehensively probes T2I models' semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, an effective pipeline is developed for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-sourced T2I models, we systematically reveal their strengths and weaknesses across various aspects.
24. 【2510.18692】MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
链接:https://arxiv.org/abs/2510.18692
作者:Weinan Jia,Yuning Lu,Mengqi Huang,Hualiang Wang,Binyuan Huang,Nan Chen,Mu Liu,Jidong Jiang,Zhendong Mao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion Transformers, quadratic scaling, scaling of full, Transformers, Diffusion
备注: 15 pages, 12 figures
点击查看摘要
Abstract:Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
25. 【2510.18671】Beyond the Pipeline: Analyzing Key Factors in End-to-End Deep Learning for Historical Writer Identification
链接:https://arxiv.org/abs/2510.18671
作者:Hanif Rasyidi,Moshiur Farazi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:deep learning approaches, remains challenging due, handwriting styles, paper investigates, investigates various factors
备注: Published in The 12th IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2025
点击查看摘要
Abstract:This paper investigates various factors that influence the performance of end-to-end deep learning approaches for historical writer identification (HWI), a task that remains challenging due to the diversity of handwriting styles, document degradation, and the limited number of labelled samples per writer. These conditions often make accurate recognition difficult, even for human experts. Traditional HWI methods typically rely on handcrafted image processing and clustering techniques, which tend to perform well on small and carefully curated datasets. In contrast, end-to-end pipelines aim to automate the process by learning features directly from document images. However, our experiments show that many of these models struggle to generalise in more realistic, document-level settings, especially under zero-shot scenarios where writers in the test set are not present in the training data. We explore different combinations of pre-processing methods, backbone architectures, and post-processing strategies, including text segmentation, patch sampling, and feature aggregation. The results suggest that most configurations perform poorly due to weak capture of low-level visual features, inconsistent patch representations, and high sensitivity to content noise. Still, we identify one end-to-end setup that achieves results comparable to the top-performing system, despite using a simpler design. These findings point to key challenges in building robust end-to-end systems and offer insight into design choices that improve performance in historical document writer identification.
26. 【2510.18668】Prototyping an End-to-End Multi-Modal Tiny-CNN for Cardiovascular Sensor Patches
链接:https://arxiv.org/abs/2510.18668
作者:Mustafa Fuad Rifet Ibrahim,Tunc Alkanat,Maurice Meijer,Felix Manthey,Alexander Schlaefer,Peer Stelldinger
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:factors are detected, vast majority, risk factors, cardiovascular diseases, medical edge devices
备注: Submitted to the IEEE Journal of Biomedical And Health Informatics
点击查看摘要
Abstract:The vast majority of cardiovascular diseases may be preventable if early signs and risk factors are detected. Cardiovascular monitoring with body-worn sensor devices like sensor patches allows for the detection of such signs while preserving the freedom and comfort of patients. However, the analysis of the sensor data must be robust, reliable, efficient, and highly accurate. Deep learning methods can automate data interpretation, reducing the workload of clinicians. In this work, we analyze the feasibility of applying deep learning models to the classification of synchronized electrocardiogram (ECG) and phonocardiogram (PCG) recordings on resource-constrained medical edge devices. We propose a convolutional neural network with early fusion of data to solve a binary classification problem. We train and validate our model on the synchronized ECG and PCG recordings from the Physionet Challenge 2016 dataset. Our approach reduces memory footprint and compute cost by three orders of magnitude compared to the state-of-the-art while maintaining competitive accuracy. We demonstrate the applicability of our proposed model on medical edge devices by analyzing energy consumption on a microcontroller and an experimental sensor device setup, confirming that on-device inference can be more energy-efficient than continuous data streaming.
27. 【2510.18660】Image augmentation with invertible networks in interactive satellite image change detection
链接:https://arxiv.org/abs/2510.18660
作者:Hichem Sahbi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:interactive satellite image, detection algorithm based, change detection algorithm, satellite image change, paper devises
备注:
点击查看摘要
Abstract:This paper devises a novel interactive satellite image change detection algorithm based on active learning. Our framework employs an iterative process that leverages a question-and-answer model. This model queries the oracle (user) about the labels of a small subset of images (dubbed as display), and based on the oracle's responses, change detection model is dynamically updated. The main contribution of our framework resides in a novel invertible network that allows augmenting displays, by mapping them from highly nonlinear input spaces to latent ones, where augmentation transformations become linear and more tractable. The resulting augmented data are afterwards mapped back to the input space, and used to retrain more effective change detection criteria in the subsequent iterations of active learning. Experimental results demonstrate superior performance of our proposed method compared to the related work.
28. 【2510.18650】Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression
链接:https://arxiv.org/abs/2510.18650
作者:Kyo Kuroki,Yasuyuki Okoshi,Thiem Van Chu,Kazushi Kawamura,Masato Motomura
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
关键词:Binary Quadratic, binary quadratic expressions, paper proposes, Binary Quadratic Quantization, Binary
备注: Accepted to NeurIPS 2025
点击查看摘要
Abstract:This paper proposes a novel matrix quantization method, Binary Quadratic Quantization (BQQ). In contrast to conventional first-order quantization approaches, such as uniform quantization and binary coding quantization, that approximate real-valued matrices via linear combinations of binary bases, BQQ leverages the expressive power of binary quadratic expressions while maintaining an extremely compact data format. We validate our approach with two experiments: a matrix compression benchmark and post-training quantization (PTQ) on pretrained Vision Transformer-based models. Experimental results demonstrate that BQQ consistently achieves a superior trade-off between memory efficiency and reconstruction error than conventional methods for compressing diverse matrix data. It also delivers strong PTQ performance, even though we neither target state-of-the-art PTQ accuracy under tight memory constraints nor rely on PTQ-specific binary matrix optimization. For example, our proposed method outperforms the state-of-the-art PTQ method by up to 2.2\% and 59.1% on the ImageNet dataset under the calibration-based and data-free scenarios, respectively, with quantization equivalent to 2 bits. These findings highlight the surprising effectiveness of binary quadratic expressions for efficient matrix approximation and neural network compression.
29. 【2510.18637】ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data
链接:https://arxiv.org/abs/2510.18637
作者:Sheida Rahnamai Kordasiabi,Damian Dalle Nogare,Florian Jug
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:biological samples remains, life sciences, samples remains, remains a challenge, Semantic segmentation
备注: 10 pages main text, 17 pages total
点击查看摘要
Abstract:Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce {\epsilon}-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster wrt. the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose a MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of {\epsilon}-Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that {\epsilon}-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.
30. 【2510.18636】C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression
链接:https://arxiv.org/abs/2510.18636
作者:Baptiste Bauvin,Loïc Baret,Ola Ahmad
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:overcoming deployment constraints, gained increasing attention, computer vision applications, deployment constraints, compression has gained
备注: 10 pages, BMVC2025
点击查看摘要
Abstract:Neural network compression has gained increasing attention in recent years, particularly in computer vision applications, where the need for model reduction is crucial for overcoming deployment constraints. Pruning is a widely used technique that prompts sparsity in model structures, e.g. weights, neurons, and layers, reducing size and inference costs. Structured pruning is especially important as it allows for the removal of entire structures, which further accelerates inference time and reduces memory overhead. However, it can be computationally expensive, requiring iterative retraining and optimization. To overcome this problem, recent methods considered one-shot setting, which applies pruning directly at post-training. Unfortunately, they often lead to a considerable drop in performance. In this paper, we focus on this issue by proposing a novel one-shot pruning framework that relies on explainable deep learning. First, we introduce a causal-aware pruning approach that leverages cause-effect relations between model predictions and structures in a progressive pruning process. It allows us to efficiently reduce the size of the network, ensuring that the removed structures do not deter the performance of the model. Then, through experiments conducted on convolution neural network and vision transformer baselines, pre-trained on classification tasks, we demonstrate that our method consistently achieves substantial reductions in model size, with minimal impact on performance, and without the need for fine-tuning. Overall, our approach outperforms its counterparts, offering the best trade-off. Our code is available on GitHub.
31. 【2510.18632】hink with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
链接:https://arxiv.org/abs/2510.18632
作者:Zhangquan Chen,Manyuan Zhang,Xinlei Yu,Xufang Luo,Mingze Sun,Zihao Pan,Yan Feng,Peng Pei,Xunliang Cai,Ruqi Huang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:achieved remarkable progress, limited views remains, significant challenge, recent advances, advances in vision-language
备注: 12 pages, 4 figures
点击查看摘要
Abstract:Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploits the rich geometric information embedded within images while reasoning, like humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code will be available at this https URL.
32. 【2510.18596】CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent
链接:https://arxiv.org/abs/2510.18596
作者:Haojia Lin,Xiaoyu Tan,Yulei Qin,Zihan Xu,Yuchen Shi,Zongyi Li,Gang Li,Shaofei Cai,Siqi Cai,Chaoyou Fu,Ke Li,Xing Sun
类目:oftware Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)
关键词:enable task completion, First-ever Comprehensive CUA, Computer-using agents, Comprehensive CUA Reward, CUA Reward Benchmark
备注: 24 pages, 6 figures
点击查看摘要
Abstract:Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. While script-based verifiers are widely adopted for evaluation, they suffer from limited scalability and inability to provide step-wise assessment. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. To address this gap, we present CUARewardBench, comprising four key contributions: (1) First-ever Comprehensive CUA Reward Benchmark: We introduce the first benchmark for evaluating both outcome reward models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic assessment across trajectory-level and step-level evaluation. (2) Diverse, Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10 software categories and 7 agent architectures with varying performance levels (25.9%-50.8% success rates). All trajectories are expertly annotated through carefully designed protocols, with rigorous quality control to ensure reliability and practical applicability. (3) Comprehensive Analysis and Insights: Through extensive experiments across 7 vision-language models and 3 prompt templates, we reveal critical limitations of current CUA RMs, including insufficient visual reasoning capabilities, knowledge deficiencies, and the superiority of general VLMs over specialized CUA models for reward evaluation. (4) Unanimous Prompt Ensemble (UPE): Based on the insights from our comprehensive analysis, we propose UPE, a novel ensemble method that significantly enhances reward model reliability through strict unanimous voting and strategic prompt-template configurations. UPE achieves 89.8% precision and 93.3% NPV for ORM, and 81.7% precision and 85.1% NPV for PRM, substantially outperforming single VLMs and traditional ensemble approaches.
33. 【2510.18583】CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder
链接:https://arxiv.org/abs/2510.18583
作者:Yongmin Lee,Hye Won Chung
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:large-scale vision-language models, enables efficient training, dataset distillation aims, vision-language models, aims to synthesize
备注: NeurIPS 2025
点击查看摘要
Abstract:Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and update only the image encoder and text projection layer. However, we find this severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to stronger cross-modal alignment and improved performance. Evaluated on Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs.
34. 【2510.18573】Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model
链接:https://arxiv.org/abs/2510.18573
作者:Zhenxing Zhang,Jiayan Teng,Zhuoyi Yang,Tiankun Cao,Cheng Wang,Xiaotao Gu,Jie Tang,Dan Guo,Meng Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:synthesize subject-consistent videos, subject-consistent videos conditioned, aims to synthesize, synthesize subject-consistent, subject-consistent videos
备注: 11 pages, 6 figures
点击查看摘要
Abstract:We present Kaleido, a subject-to-video~(S2V) generation framework, which aims to synthesize subject-consistent videos conditioned on multiple reference images of target subjects. Despite recent progress in S2V generation models, existing approaches remain inadequate at maintaining multi-subject consistency and at handling background disentanglement, often resulting in lower reference fidelity and semantic drift under multi-image conditioning. These shortcomings can be attributed to several factors. Primarily, the training dataset suffers from a lack of diversity and high-quality samples, as well as cross-paired data, i.e., paired samples whose components originate from different instances. In addition, the current mechanism for integrating multiple reference images is suboptimal, potentially resulting in the confusion of multiple subjects. To overcome these limitations, we propose a dedicated data construction pipeline, incorporating low-quality sample filtering and diverse data synthesis, to produce consistency-preserving training data. Moreover, we introduce Reference Rotary Positional Encoding (R-RoPE) to process reference images, enabling stable and precise multi-image integration. Extensive experiments across numerous benchmarks demonstrate that Kaleido significantly outperforms previous methods in consistency, fidelity, and generalization, marking an advance in S2V generation.
35. 【2510.18552】Descriptor: Occluded nuScenes: A Multi-Sensor Dataset for Evaluating Perception Robustness in Automated Driving
链接:https://arxiv.org/abs/2510.18552
作者:Sanjay Kumar,Tim Brophy,Reenu Mohandas,Eoin Martino Grua,Ganesh Sistu,Valentina Donzella,Ciaran Eising
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:requires reliable performance, driving requires reliable, automated driving requires, requires reliable, reliable performance
备注:
点击查看摘要
Abstract:Robust perception in automated driving requires reliable performance under adverse conditions, where sensors may be affected by partial failures or environmental occlusions. Although existing autonomous driving datasets inherently contain sensor noise and environmental variability, very few enable controlled, parameterised, and reproducible degradations across multiple sensing modalities. This gap limits the ability to systematically evaluate how perception and fusion architectures perform under well-defined adverse conditions. To address this limitation, we introduce the Occluded nuScenes Dataset, a novel extension of the widely used nuScenes benchmark. For the camera modality, we release both the full and mini versions with four types of occlusions, two adapted from public implementations and two newly designed. For radar and LiDAR, we provide parameterised occlusion scripts that implement three types of degradations each, enabling flexible and repeatable generation of occluded data. This resource supports consistent, reproducible evaluation of perception models under partial sensor failures and environmental interference. By releasing the first multi-sensor occlusion dataset with controlled and reproducible degradations, we aim to advance research on robust sensor fusion, resilience analysis, and safety-critical perception in automated driving.
36. 【2510.18539】GBlobs: Local LiDAR Geometry for Improved Sensor Placement Generalization
链接:https://arxiv.org/abs/2510.18539
作者:Dušan Malić,Christian Fruhwirth-Reisinger,Alexander Prutsch,Wei Lin,Samuel Schulter,Horst Possegger
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:technical report outlines, solution for RoboSense, technical report, report outlines, outlines the top-ranking
备注: 1st place at the IROS'25 RoboSense Challenge, Track #3: Cross-Sensor Placement 3D Object Detection
点击查看摘要
Abstract:This technical report outlines the top-ranking solution for RoboSense 2025: Track 3, achieving state-of-the-art performance on 3D object detection under various sensor placements. Our submission utilizes GBlobs, a local point cloud feature descriptor specifically designed to enhance model generalization across diverse LiDAR configurations. Current LiDAR-based 3D detectors often suffer from a \enquote{geometric shortcut} when trained on conventional global features (\ie, absolute Cartesian coordinates). This introduces a position bias that causes models to primarily rely on absolute object position rather than distinguishing shape and appearance characteristics. Although effective for in-domain data, this shortcut severely limits generalization when encountering different point distributions, such as those resulting from varying sensor placements. By using GBlobs as network input features, we effectively circumvent this geometric shortcut, compelling the network to learn robust, object-centric representations. This approach significantly enhances the model's ability to generalize, resulting in the exceptional performance demonstrated in this challenge.
37. 【2510.18521】RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation
链接:https://arxiv.org/abs/2510.18521
作者:Junwen Huang,Shishir Reddy Vutukur,Peter KT Yu,Nassir Navab,Slobodan Ilic,Benjamin Busam
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Typical template-based object, Typical template-based, pose pipelines estimate, closest matching template, pipelines estimate
备注:
点击查看摘要
Abstract:Typical template-based object pose pipelines estimate the pose by retrieving the closest matching template and aligning it with the observed image. However, failure to retrieve the correct template often leads to inaccurate pose predictions. To address this, we reformulate template-based object pose estimation as a ray alignment problem, where the viewing directions from multiple posed template images are learned to align with a non-posed query image. Inspired by recent progress in diffusion-based camera pose estimation, we embed this formulation into a diffusion transformer architecture that aligns a query image with a set of posed templates. We reparameterize object rotation using object-centered camera rays and model object translation by extending scale-invariant translation estimation to dense translation offsets. Our model leverages geometric priors from the templates to guide accurate query pose inference. A coarse-to-fine training strategy based on narrowed template sampling improves performance without modifying the network architecture. Extensive experiments across multiple benchmark datasets show competitive results of our method compared to state-of-the-art approaches in unseen object pose estimation.
38. 【2510.18513】DWaste: Greener AI for Waste Sorting using Mobile and Edge Devices
链接:https://arxiv.org/abs/2510.18513
作者:Suman Kunwar
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:efficient waste sorting, waste sorting crucial, waste sorting, making efficient waste, sustainable waste management
备注: 8 pages, 8 figures
点击查看摘要
Abstract:The rise of convenience packaging has led to generation of enormous waste, making efficient waste sorting crucial for sustainable waste management. To address this, we developed DWaste, a computer vision-powered platform designed for real-time waste sorting on resource-constrained smartphones and edge devices, including offline functionality. We benchmarked various image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection (YOLOv8n, YOLOv11n) using a subset of our own waste data set and annotated it using the custom tool Annotated Lab. We found a clear trade-off between accuracy and resource consumption: the best classifier, EfficientNetV2S, achieved high accuracy (~ 96%) but suffered from high latency (~ 0.22s) and elevated carbon emissions. In contrast, lightweight object detection models delivered strong performance (up to 77% mAP) with ultra-fast inference (~ 0.03s) and significantly smaller model sizes ( 7MB), making them ideal for real-time, low-power use. Model quantization further maximized efficiency, substantially reducing model size and VRAM usage by up to 75%. Our work demonstrates the successful implementation of "Greener AI" models to support real-time, sustainable waste sorting on edge devices.
39. 【2510.18502】Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2510.18502
作者:Wei-Chia Chang,Yan-Ann Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:intelligent transportation systems, existing approaches struggle, newly released models, transportation systems, important task
备注: Accepted by The 38th Conference of Open Innovations Association FRUCT, 2025
点击查看摘要
Abstract:Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
40. 【2510.18489】Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos
链接:https://arxiv.org/abs/2510.18489
作者:Jinfeng Liu,Lingtong Kong,Mi Zhou,Jinwen Chen,Dan Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:high dynamic range, low dynamic range, monocular low dynamic, dynamic range, unposed monocular low
备注: Project page is available at [this https URL](https://liujf1226.github.io/Mono4DGS-HDR/)
点击查看摘要
Abstract:We introduce Mono4DGS-HDR, the first system for reconstructing renderable 4D high dynamic range (HDR) scenes from unposed monocular low dynamic range (LDR) videos captured with alternating exposures. To tackle such a challenging problem, we present a unified framework with two-stage optimization approach based on Gaussian Splatting. The first stage learns a video HDR Gaussian representation in orthographic camera coordinate space, eliminating the need for camera poses and enabling robust initial HDR video reconstruction. The second stage transforms video Gaussians into world space and jointly refines the world Gaussians with camera poses. Furthermore, we propose a temporal luminance regularization strategy to enhance the temporal consistency of the HDR appearance. Since our task has not been studied before, we construct a new evaluation benchmark using publicly available datasets for HDR video reconstruction. Extensive experiments demonstrate that Mono4DGS-HDR significantly outperforms alternative solutions adapted from state-of-the-art methods in both rendering quality and speed.
41. 【2510.18457】Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
链接:https://arxiv.org/abs/2510.18457
作者:Tianci Bi,Xiaoyi Zhang,Yan Lu,Nanning Zheng
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Vision Foundation Models, Foundation Model Variational, Latent Diffusion Models, Model Variational Autoencoder, incorporating Vision Foundation
备注: Code and models available at: [this https URL](https://github.com/tianciB/VFM-VAE)
点击查看摘要
Abstract:The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
42. 【2510.18446】LAND: Lung and Nodule Diffusion for 3D Chest CT Synthesis with Anatomical Guidance
链接:https://arxiv.org/abs/2510.18446
作者:Anna Oliveras,Roger Marí,Rafael Redondo,Oriol Guardià,Ana Tost,Bhalaji Nagarajan,Carolina Migliorelli,Vicent Ribas,Petia Radeva
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:latent diffusion model, generate high-quality, chest CT scans, work introduces, latent diffusion
备注:
点击查看摘要
Abstract:This work introduces a new latent diffusion model to generate high-quality 3D chest CT scans conditioned on 3D anatomical masks. The method synthesizes volumetric images of size 256x256x256 at 1 mm isotropic resolution using a single mid-range GPU, significantly lowering the computational cost compared to existing approaches. The conditioning masks delineate lung and nodule regions, enabling precise control over the output anatomical features. Experimental results demonstrate that conditioning solely on nodule masks leads to anatomically incorrect outputs, highlighting the importance of incorporating global lung structure for accurate conditional synthesis. The proposed approach supports the generation of diverse CT volumes with and without lung nodules of varying attributes, providing a valuable tool for training AI models or healthcare professionals.
43. 【2510.18437】Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection
链接:https://arxiv.org/abs/2510.18437
作者:Ji Du,Xin Wang,Fangwei Hao,Mingyang Yu,Chunyuan Chen,Jiesheng Wu,Bin Wang,Jing Xu,Ping Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:highly similar surroundings, Camouflaged Object Detection, lies segmenting objects, Object Detection, lies segmenting
备注: ICCV 2025
点击查看摘要
Abstract:At the core of Camouflaged Object Detection (COD) lies segmenting objects from their highly similar surroundings. Previous efforts navigate this challenge primarily through image-level modeling or annotation-based optimization. Despite advancing considerably, this commonplace practice hardly taps valuable dataset-level contextual information or relies on laborious annotations. In this paper, we propose RISE, a RetrIeval SElf-augmented paradigm that exploits the entire training dataset to generate pseudo-labels for single images, which could be used to train COD models. RISE begins by constructing prototype libraries for environments and camouflaged objects using training images (without ground truth), followed by K-Nearest Neighbor (KNN) retrieval to generate pseudo-masks for each image based on these libraries. It is important to recognize that using only training images without annotations exerts a pronounced challenge in crafting high-quality prototype libraries. In this light, we introduce a Clustering-then-Retrieval (CR) strategy, where coarse masks are first generated through clustering, facilitating subsequent histogram-based image filtering and cross-category retrieval to produce high-confidence prototypes. In the KNN retrieval stage, to alleviate the effect of artifacts in feature maps, we propose Multi-View KNN Retrieval (MVKR), which integrates retrieval results from diverse views to produce more robust and precise pseudo-masks. Extensive experiments demonstrate that RISE outperforms state-of-the-art unsupervised and prompt-based methods. Code is available at this https URL.
44. 【2510.18433】ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization
链接:https://arxiv.org/abs/2510.18433
作者:Yuanhe Guo,Linxi Xie,Zhuoran Chen,Kangrui Yu,Ryan Po,Guandao Yang,Gordon Wetztein,Hongyi Wen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:understand fine-grained individual, user preference annotations, user preference, studying generative models, fine-grained user preference
备注:
点击查看摘要
Abstract:We introduce ImageGem, a dataset for studying generative models that understand fine-grained individual preferences. We posit that a key challenge hindering the development of such a generative model is the lack of in-the-wild and fine-grained user preference annotations. Our dataset features real-world interaction data from 57K users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. With user preference annotations from our dataset, we were able to train better preference alignment models. In addition, leveraging individual user preference, we investigated the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation. Finally, we propose an end-to-end framework for editing customized diffusion models in a latent weight space to align with individual user preferences. Our results demonstrate that the ImageGem dataset enables, for the first time, a new paradigm for generative model personalization.
45. 【2510.18431】ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters
链接:https://arxiv.org/abs/2510.18431
作者:Zhiwei Hao,Jianyuan Guo,Li Shen,Kai Han,Yehui Tang,Han Hu,Yunhe Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Recent advancements, demonstrated that larger, Recent, ScaleNet, models
备注:
点击查看摘要
Abstract:Recent advancements in vision transformers (ViTs) have demonstrated that larger models often achieve superior performance. However, training these models remains computationally intensive and costly. To address this challenge, we introduce ScaleNet, an efficient approach for scaling ViT models. Unlike conventional training from scratch, ScaleNet facilitates rapid model expansion with negligible increases in parameters, building on existing pretrained models. This offers a cost-effective solution for scaling up ViTs. Specifically, ScaleNet achieves model expansion by inserting additional layers into pretrained ViTs, utilizing layer-wise weight sharing to maintain parameters efficiency. Each added layer shares its parameter tensor with a corresponding layer from the pretrained model. To mitigate potential performance degradation due to shared weights, ScaleNet introduces a small set of adjustment parameters for each layer. These adjustment parameters are implemented through parallel adapter modules, ensuring that each instance of the shared parameter tensor remains distinct and optimized for its specific function. Experiments on the ImageNet-1K dataset demonstrate that ScaleNet enables efficient expansion of ViT models. With a 2$\times$ depth-scaled DeiT-Base model, ScaleNet achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs, highlighting its efficiency in scaling ViTs. Beyond image classification, our method shows significant potential for application in downstream vision areas, as evidenced by the validation in object detection task.
46. 【2510.18405】Automated Wicket-Taking Delivery Segmentation and Weakness Detection in Cricket Videos Using OCR-Guided YOLOv8 and Trajectory Modeling
链接:https://arxiv.org/abs/2510.18405
作者:Mst Jannatun Ferdous,Masum Billah,Joy Karmoker,Mohd Ruhul Ameen,Akif Islam,Md. Omar Faruqe
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:extract wicket-taking deliveries, deep learning techniques, leverages deep learning, model ball trajectories, detect cricket balls
备注: 6 figures, 5 tables, submitted to the 11th IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering 2025
点击查看摘要
Abstract:This paper presents an automated system for cricket video analysis that leverages deep learning techniques to extract wicket-taking deliveries, detect cricket balls, and model ball trajectories. The system employs the YOLOv8 architecture for pitch and ball detection, combined with optical character recognition (OCR) for scorecard extraction to identify wicket-taking moments. Through comprehensive image preprocessing, including grayscale transformation, power transformation, and morphological operations, the system achieves robust text extraction from video frames. The pitch detection model achieved 99.5% mean Average Precision at 50% IoU (mAP50) with a precision of 0.999, while the ball detection model using transfer learning attained 99.18% mAP50 with 0.968 precision and 0.978 recall. The system enables trajectory modeling on detected pitches, providing data-driven insights for identifying batting weaknesses. Experimental results on multiple cricket match videos demonstrate the effectiveness of this approach for automated cricket analytics, offering significant potential for coaching and strategic decision-making.
47. 【2510.18400】Bayesian Fully-Connected Tensor Network for Hyperspectral-Multispectral Image Fusion
链接:https://arxiv.org/abs/2510.18400
作者:Linsong Shan,Zecan Yang,Laurence T. Yang,Changlong Li,Honglu Zhao,Xin Nie
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:hyperspectral-multispectral image fusion, powerful tool, extensively employed, field of hyperspectral-multispectral, hyperspectral-multispectral image
备注:
点击查看摘要
Abstract:Tensor decomposition is a powerful tool for data analysis and has been extensively employed in the field of hyperspectral-multispectral image fusion (HMF). Existing tensor decomposition-based fusion methods typically rely on disruptive data vectorization/reshaping or impose rigid constraints on the arrangement of factor tensors, hindering the preservation of spatial-spectral structures and the modeling of cross-dimensional correlations. Although recent advances utilizing the Fully-Connected Tensor Network (FCTN) decomposition have partially alleviated these limitations, the process of reorganizing data into higher-order tensors still disrupts the intrinsic spatial-spectral structure. Furthermore, these methods necessitate extensive manual parameter tuning and exhibit limited robustness against noise and spatial degradation. To alleviate these issues, we propose the Bayesian FCTN (BFCTN) method. Within this probabilistic framework, a hierarchical sparse prior that characterizing the sparsity of physical elements, establishes connections between the factor tensors. This framework explicitly models the intrinsic physical coupling among spatial structures, spectral signatures, and local scene homogeneity. For model learning, we develop a parameter estimation method based on Variational Bayesian inference (VB) and the Expectation-Maximization (EM) algorithm, which significantly reduces the need for manual parameter tuning. Extensive experiments demonstrate that BFCTN not only achieves state-of-the-art fusion accuracy and strong robustness but also exhibits practical applicability in complex real-world scenarios.
48. 【2510.18396】Entropy-Enhanced Conformal Features from Ricci Flow for Robust Alzheimer's Disease Classification
链接:https://arxiv.org/abs/2510.18396
作者:F.Ahmadi,B.Bidabad,H.Nasiri
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Alzheimer disease, Alzheimer Disease Neuroimaging, brain imaging, anatomical structures, essential for analyzing
备注:
点击查看摘要
Abstract:Background and Objective: In brain imaging, geometric surface models are essential for analyzing the 3D shapes of anatomical structures. Alzheimer's disease (AD) is associated with significant cortical atrophy, making such shape analysis a valuable diagnostic tool. The objective of this study is to introduce and validate a novel local surface representation method for the automated and accurate diagnosis of AD. Methods: The study utilizes T1-weighted MRI scans from 160 participants (80 AD patients and 80 healthy controls) from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Cortical surface models were reconstructed from the MRI data using Freesurfer. Key geometric attributes were computed from the 3D meshes. Area distortion and conformal factor were derived using Ricci flow for conformal parameterization, while Gaussian curvature was calculated directly from the mesh geometry. Shannon entropy was applied to these three features to create compact and informative feature vectors. The feature vectors were used to train and evaluate a suite of classifiers (e.g. XGBoost, MLP, Logistic Regression, etc.). Results: Statistical significance of performance differences between classifiers was evaluated using paired Welch's t-test. The method proved highly effective in distinguishing AD patients from healthy controls. The Multi-Layer Perceptron (MLP) and Logistic Regression classifiers outperformed all others, achieving an accuracy and F$_1$ Score of 98.62%. Conclusions: This study confirms that the entropy of conformally-derived geometric features provides a powerful and robust metric for cortical morphometry. The high classification accuracy underscores the method's potential to enhance the study and diagnosis of Alzheimer's disease, offering a straightforward yet powerful tool for clinical research applications.
49. 【2510.18381】S2AP: Score-space Sharpness Minimization for Adversarial Pruning
链接:https://arxiv.org/abs/2510.18381
作者:Giorgio Piras,Qi Zhao,Fabio Brau,Maura Pintor,Christian Wressnegger,Battista Biggio
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:compressing neural networks, Adversarial pruning methods, Adversarial pruning, Sharpness-aware Adversarial Pruning, pruning methods
备注:
点击查看摘要
Abstract:Adversarial pruning methods have emerged as a powerful tool for compressing neural networks while preserving robustness against adversarial attacks. These methods typically follow a three-step pipeline: (i) pretrain a robust model, (ii) select a binary mask for weight pruning, and (iii) finetune the pruned model. To select the binary mask, these methods minimize a robust loss by assigning an importance score to each weight, and then keep the weights with the highest scores. However, this score-space optimization can lead to sharp local minima in the robust loss landscape and, in turn, to an unstable mask selection, reducing the robustness of adversarial pruning methods. To overcome this issue, we propose a novel plug-in method for adversarial pruning, termed Score-space Sharpness-aware Adversarial Pruning (S2AP). Through our method, we introduce the concept of score-space sharpness minimization, which operates during the mask search by perturbing importance scores and minimizing the corresponding robust loss. Extensive experiments across various datasets, models, and sparsity levels demonstrate that S2AP effectively minimizes sharpness in score space, stabilizing the mask selection, and ultimately improving the robustness of adversarial pruning methods.
50. 【2510.18377】Cross-Modal Scene Semantic Alignment for Image Complexity Assessment
链接:https://arxiv.org/abs/2510.18377
作者:Yuqing Luo,Yixiao Li,Jiang Liu,Jun Fu,Hadi Amirpour,Guanghui Yue,Baoquan Zhao,Padraig Corcoran,Hantao Liu,Wei Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Scene Semantic Alignment, scene semantic, scene semantic information, inherent semantic diversity, cross-modal scene semantic
备注: 14 pages,2 figures, British Machine Vision Conference
点击查看摘要
Abstract:Image complexity assessment (ICA) is a challenging task in perceptual evaluation due to the subjective nature of human perception and the inherent semantic diversity in real-world images. Existing ICA methods predominantly rely on hand-crafted or shallow convolutional neural network-based features of a single visual modality, which are insufficient to fully capture the perceived representations closely related to image complexity. Recently, cross-modal scene semantic information has been shown to play a crucial role in various computer vision tasks, particularly those involving perceptual understanding. However, the exploration of cross-modal scene semantic information in the context of ICA remains unaddressed. Therefore, in this paper, we propose a novel ICA method called Cross-Modal Scene Semantic Alignment (CM-SSA), which leverages scene semantic alignment from a cross-modal perspective to enhance ICA performance, enabling complexity predictions to be more consistent with subjective human perception. Specifically, the proposed CM-SSA consists of a complexity regression branch and a scene semantic alignment branch. The complexity regression branch estimates image complexity levels under the guidance of the scene semantic alignment branch, while the scene semantic alignment branch is used to align images with corresponding text prompts that convey rich scene semantic information by pair-wise learning. Extensive experiments on several ICA datasets demonstrate that the proposed CM-SSA significantly outperforms state-of-the-art approaches. Codes are available at this https URL.
51. 【2510.18362】FeatureFool: Zero-Query Fooling of Video Models via Feature Map
链接:https://arxiv.org/abs/2510.18362
作者:Duoxun Tang,Xi Xiao,Guangwu Hu,Kangkang Sun,Xiao Yang,Dongyang Chen,Qing Li,Yongjie Yin,Jiyao Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:deep neural networks, neural networks, preliminarily verified, vulnerability of deep, deep neural
备注:
点击查看摘要
Abstract:The vulnerability of deep neural networks (DNNs) has been preliminarily verified. Existing black-box adversarial attacks usually require multi-round interaction with the model and consume numerous queries, which is impractical in the real-world and hard to scale to recently emerged Video-LLMs. Moreover, no attack in the video domain directly leverages feature maps to shift the clean-video feature space. We therefore propose FeatureFool, a stealthy, video-domain, zero-query black-box attack that utilizes information extracted from a DNN to alter the feature space of clean videos. Unlike query-based methods that rely on iterative interaction, FeatureFool performs a zero-query attack by directly exploiting DNN-extracted information. This efficient approach is unprecedented in the video domain. Experiments show that FeatureFool achieves an attack success rate above 70\% against traditional video classifiers without any queries. Benefiting from the transferability of the feature map, it can also craft harmful content and bypass Video-LLM recognition. Additionally, adversarial videos generated by FeatureFool exhibit high quality in terms of SSIM, PSNR, and Temporal-Inconsistency, making the attack barely perceptible. This paper may contain violent or explicit content.
52. 【2510.18358】Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers
链接:https://arxiv.org/abs/2510.18358
作者:Firas Gabetni,Giuseppe Curci,Andrea Pilzer,Subhankar Roy,Elisa Ricci,Gianni Franchi
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:deploying deep neural, Deep Ensembles, deep neural networks, safety-critical settings, essential for deploying
备注:
点击查看摘要
Abstract:Uncertainty quantification (UQ) is essential for deploying deep neural networks in safety-critical settings. Although methods like Deep Ensembles achieve strong UQ performance, their high computational and memory costs hinder scalability to large models. We introduce Hydra Ensembles, an efficient transformer-based ensemble that prunes attention heads to create diverse members and merges them via a new multi-head attention with grouped fully-connected layers. This yields a compact model with inference speed close to a single network, matching or surpassing Deep Ensembles in UQ performance without retraining from scratch. We also provide an in-depth analysis of pruning, showing that naive approaches can harm calibration, whereas Hydra Ensembles preserves robust uncertainty. Experiments on image and text classification tasks, with various architectures, show consistent gains over Deep Ensembles. Remarkably, in zero-shot classification on ImageNet-1k, our approach surpasses state of the art methods, even without requiring additional training.
53. 【2510.18357】Learning Human-Object Interaction as Groups
链接:https://arxiv.org/abs/2510.18357
作者:Jiajun Hong,Jianan Wei,Wenguan Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:localize human-object pairs, Human-Object Interaction Detection, localize human-object, human-object pairs, aims to localize
备注:
点击查看摘要
Abstract:Human-Object Interaction Detection (HOI-DET) aims to localize human-object pairs and identify their interactive relationships. To aggregate contextual cues, existing methods typically propagate information across all detected entities via self-attention mechanisms, or establish message passing between humans and objects with bipartite graphs. However, they primarily focus on pairwise relationships, overlooking that interactions in real-world scenarios often emerge from collective behaviors (multiple humans and objects engaging in joint activities). In light of this, we revisit relation modeling from a group view and propose GroupHOI, a framework that propagates contextual information in terms of geometric proximity and semantic similarity. To exploit the geometric proximity, humans and objects are grouped into distinct clusters using a learnable proximity estimator based on spatial features derived from bounding boxes. In each group, a soft correspondence is computed via self-attention to aggregate and dispatch contextual cues. To incorporate the semantic similarity, we enhance the vanilla transformer-based interaction decoder with local contextual cues from HO-pair features. Extensive experiments on HICO-DET and V-COCO benchmarks demonstrate the superiority of GroupHOI over the state-of-the-art methods. It also exhibits leading performance on the more challenging Nonverbal Interaction Detection (NVI-DET) task, which involves varied forms of higher-order interactions within groups.
54. 【2510.18353】Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback
链接:https://arxiv.org/abs/2510.18353
作者:Yi-Lun Wu,Bo-Kai Ruan,Chiang Tseng,Hong-Han Shuai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Direct preference optimization, shown strong potential, Direct preference, potential in aligning, paired comparisons
备注:
点击查看摘要
Abstract:Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both both quantitative metrics and user studies. Our source code and pre-trained models are available at this https URL.
55. 【2510.18346】AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering
链接:https://arxiv.org/abs/2510.18346
作者:Jiayu Zhang,Qilang Ye,Shuo Ye,Xun Lin,Zihan Song,Zitong Yu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Audio-Visual Question Answering, Question Answering, utilize both visual, visual and auditory, auditory modalities
备注: 13 pages, 9 figures
点击查看摘要
Abstract:Audio-Visual Question Answering (AVQA) requires models to effectively utilize both visual and auditory modalities to answer complex and diverse questions about audio-visual scenes. However, existing methods lack sufficient flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it difficult to focus on key information based on the question. This limits their reasoning capability in complex scenarios. To address these challenges, we propose a novel framework named AV-Master. It enhances the model's ability to extract key information from complex audio-visual scenes with substantial redundant content by dynamically modeling both temporal and modality dimensions. In the temporal dimension, we introduce a dynamic adaptive focus sampling mechanism that progressively focuses on audio-visual segments most relevant to the question, effectively mitigating redundancy and segment fragmentation in traditional sampling methods. In the modality dimension, we propose a preference-aware strategy that models each modality's contribution independently, enabling selective activation of critical features. Furthermore, we introduce a dual-path contrastive loss to reinforce consistency and complementarity across temporal and modality dimensions, guiding the model to learn question-specific cross-modal collaborative representations. Experiments on four large-scale benchmarks show that AV-Master significantly outperforms existing methods, especially in complex reasoning tasks.
56. 【2510.18345】GPTFace: Generative Pre-training of Facial-Linguistic Transformer by Span Masking and Weakly Correlated Text-image Data
链接:https://arxiv.org/abs/2510.18345
作者:Yudong Li,Hao Li,Xianxu Hou,Linlin Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:natural image understanding, facial knowledge learning, pre-training models, knowledge learning, large-scale pre-training models
备注: This work was initially drafted in November 2022
点击查看摘要
Abstract:Compared to the prosperity of pre-training models in natural image understanding, the research on large-scale pre-training models for facial knowledge learning is still limited. Current approaches mainly rely on manually assembled and annotated face datasets for training, but labeling such datasets is labor-intensive and the trained models have limited scalability beyond the training data. To address these limitations, we present a generative pre-training model for facial knowledge learning that leverages large-scale web-built data for training. We use texts and images containing human faces crawled from the internet and conduct pre-training on self-supervised tasks, including masked image/language modeling (MILM) and image-text matching (ITM). During the generation stage, we further utilize the image-text matching loss to pull the generation distribution towards the control signal for controllable image/text generation. Experimental results demonstrate that our model achieves comparable performance to state-of-the-art pre-training models for various facial downstream tasks, such as attribution classification and expression recognition. Furthermore, our approach is also applicable to a wide range of face editing tasks, including face attribute editing, expression manipulation, mask removal, and photo inpainting.
57. 【2510.18341】ViSE: A Systematic Approach to Vision-Only Street-View Extrapolation
链接:https://arxiv.org/abs/2510.18341
作者:Kaiyuan Tan,Yingying Shen,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Realistic view extrapolation, RealADSim Workshop NVS, Workshop NVS track, Realistic view, View Synthesis
备注:
点击查看摘要
Abstract:Realistic view extrapolation is critical for closed-loop simulation in autonomous driving, yet it remains a significant challenge for current Novel View Synthesis (NVS) methods, which often produce distorted and inconsistent images beyond the original trajectory. This report presents our winning solution which ctook first place in the RealADSim Workshop NVS track at ICCV 2025. To address the core challenges of street view extrapolation, we introduce a comprehensive four-stage pipeline. First, we employ a data-driven initialization strategy to generate a robust pseudo-LiDAR point cloud, avoiding local minima. Second, we inject strong geometric priors by modeling the road surface with a novel dimension-reduced SDF termed 2D-SDF. Third, we leverage a generative prior to create pseudo ground truth for extrapolated viewpoints, providing auxilary supervision. Finally, a data-driven adaptation network removes time-specific artifacts. On the RealADSim-NVS benchmark, our method achieves a final score of 0.441, ranking first among all participants.
58. 【2510.18326】Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ATTBHFA-Net
链接:https://arxiv.org/abs/2510.18326
作者:Gao Yu Lee,Tanmoy Dam,Md Meftahul Ferdaus,Daniel Puiu Poenar,Vu Duong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:analyzing critical photographic, necessitates advanced visual, recognition techniques capable, human-induced disasters necessitates, disasters necessitates advanced
备注: Submitted to a SN journal
点击查看摘要
Abstract:The increasing frequency of natural and human-induced disasters necessitates advanced visual recognition techniques capable of analyzing critical photographic data. With progress in artificial intelligence and resilient computational systems, rapid and accurate disaster classification has become crucial for efficient rescue operations. However, visual recognition in disaster contexts faces significant challenges due to limited and diverse data from the difficulties in collecting and curating comprehensive, high-quality disaster imagery. Few-Shot Learning (FSL) provides a promising approach to data scarcity, yet current FSL research mainly relies on generic benchmark datasets lacking remote-sensing disaster imagery, limiting its practical effectiveness. Moreover, disaster images exhibit high intra-class variation and inter-class similarity, hindering the performance of conventional metric-based FSL methods. To address these issues, this paper introduces the Attention-based Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net), which linearly combines the Bhattacharyya coefficient and Hellinger distances to compare and aggregate feature probability distributions for robust prototype formation. The Bhattacharyya coefficient serves as a contrastive margin that enhances inter-class separability, while the Hellinger distance regularizes same-class alignment. This framework parallels contrastive learning but operates over probability distributions rather than embedded feature points. Furthermore, a Bhattacharyya-Hellinger distance-based contrastive loss is proposed as a distributional counterpart to cosine similarity loss, used jointly with categorical cross-entropy to significantly improve FSL performance. Experiments on four FSL benchmarks and two disaster image datasets demonstrate the superior effectiveness and generalization of ATTBHFA-Net compared to existing approaches.
59. 【2510.18321】Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding
链接:https://arxiv.org/abs/2510.18321
作者:Jinlin Li,Yuran Wang,Yifei Yuan,Xiao Zhou,Yingying Zhang,Xixian Yong,Yefeng Zheng,Xian Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Vision-Language Models, visual question answering, recently achieved impressive, achieved impressive results, Large Vision-Language
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. However, they remain prone to object hallucination -- generating descriptions of nonexistent or misidentified objects. Prior work has partially mitigated this via auxiliary training objectives or external modules, but challenges remain in terms of scalability, adaptability, and model independence. To address these limitations, we propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference. ATED dynamically computes uncertainty-based weights for each model, reflecting their reliability at each decoding step. It also integrates diverse decoding paths to improve contextual grounding and semantic consistency. Experiments on standard hallucination detection benchmarks demonstrate that ATED significantly outperforms state-of-the-art methods, reducing hallucination without compromising fluency or relevance. Our findings highlight the benefits of adaptive ensembling and point to a promising direction for improving LVLM robustness in high-stakes applications. The code is available at this https URL.
60. 【2510.18313】OmniNWM: Omniscient Driving Navigation World Models
链接:https://arxiv.org/abs/2510.18313
作者:Bohan Li,Zhuang Ma,Dalong Du,Baorui Peng,Zhujin Liang,Zhenqiang Liu,Chao Ma,Yueming Jin,Hao Zhao,Wenjun Zeng,Xin Jin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Autonomous driving world, Autonomous driving, expected to work, work effectively, Autonomous
备注: [this https URL](https://arlo0o.github.io/OmniNWM/)
点击查看摘要
Abstract:Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. Existing models, however, are typically restricted to limited state modalities, short video sequences, imprecise action control, and a lack of reward awareness. In this paper, we introduce OmniNWM, an omniscient panoramic navigation world model that addresses all three dimensions within a unified framework. For state, OmniNWM jointly generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy. A flexible forcing strategy enables high-quality long-horizon auto-regressive generation. For action, we introduce a normalized panoramic Plucker ray-map representation that encodes input trajectories into pixel-level signals, enabling highly precise and generalizable control over panoramic video generation. Regarding reward, we move beyond learning reward functions with external image-based models: instead, we leverage the generated 3D occupancy to directly define rule-based dense rewards for driving compliance and safety. Extensive experiments demonstrate that OmniNWM achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability, while providing a reliable closed-loop evaluation framework through occupancy-grounded rewards. Project page is available at this https URL.
61. 【2510.18304】he Impact of Image Resolution on Biomedical Multimodal Large Language Models
链接:https://arxiv.org/abs/2510.18304
作者:Liangyu Chen,James Burgess,Jeffrey J Nirschl,Orr Zohar,Serena Yeung-Levy
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Imaging technologies, modern medicine, technologies are fundamental, requiring analysis, biomedical image analysis
备注: Proceedings of the 10th Machine Learning for Healthcare Conference, PMLR 298, 2025
点击查看摘要
Abstract:Imaging technologies are fundamental to biomedical research and modern medicine, requiring analysis of high-resolution images across various modalities. While multimodal large language models (MLLMs) show promise for biomedical image analysis, most are designed for low-resolution images from general-purpose datasets, risking critical information loss. We investigate how image resolution affects MLLM performance in biomedical applications and demonstrate that: (1) native-resolution training and inference significantly improve performance across multiple tasks, (2) misalignment between training and inference resolutions severely degrades performance, and (3) mixed-resolution training effectively mitigates misalignment and balances computational constraints with performance requirements. Based on these findings, we recommend prioritizing native-resolution inference and mixed-resolution datasets to optimize biomedical MLLMs for transformative impact in scientific research and clinical applications.
62. 【2510.18303】Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models
链接:https://arxiv.org/abs/2510.18303
作者:Lehan Wang,Yi Qin,Honglong Yang,Xiaomeng Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Large Language, provide reliable diagnosis, Multimodal Large
备注: Work in progress
点击查看摘要
Abstract:Incentivizing the reasoning ability of Multimodal Large Language Models (MLLMs) is essential for medical applications to transparently analyze medical scans and provide reliable diagnosis. However, existing medical MLLMs rely solely on internal knowledge during reasoning, leading to hallucinated reasoning and factual inaccuracies when encountering cases beyond their training scope. Although recent Agentic Retrieval-Augmented Generation (RAG) methods elicit the medical model's proactive retrieval ability during reasoning, they are confined to unimodal LLMs, neglecting the crucial visual information during reasoning and retrieval. Consequently, we propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR, which actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning. Specifically, we design a two-stage reinforcement learning strategy with tailored rewards that stimulate the model to leverage both visual diagnostic findings and textual clinical information for effective retrieval. Building on this foundation, we further propose a Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when low prediction confidence is detected. Evaluation on various public medical benchmarks demonstrates Med-RwR's significant improvements over baseline models, proving the effectiveness of enhancing reasoning capabilities with external knowledge integration. Furthermore, Med-RwR demonstrates remarkable generalizability to unfamiliar domains, evidenced by 8.8% performance gain on our proposed EchoCardiography Benchmark (ECBench), despite the scarcity of echocardiography data in the training corpus. Our data, model, and codes will be made publicly available at this https URL.
63. 【2510.18291】GeoDiff: Geometry-Guided Diffusion for Metric Depth Estimation
链接:https://arxiv.org/abs/2510.18291
作者:Tuan Pham,Thanh-Tung Le,Xiaohui Xie,Stephan Mandt
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:stereo vision guidance, diffusion-based monocular depth, enhances pretrained diffusion-based, pretrained diffusion-based monocular, monocular depth estimation
备注: Accepted to ICCV Findings 2025. The first two authors contributed equally. The last two authors share co-corresponding authorship
点击查看摘要
Abstract:We introduce a novel framework for metric depth estimation that enhances pretrained diffusion-based monocular depth estimation (DB-MDE) models with stereo vision guidance. While existing DB-MDE methods excel at predicting relative depth, estimating absolute metric depth remains challenging due to scale ambiguities in single-image scenarios. To address this, we reframe depth estimation as an inverse problem, leveraging pretrained latent diffusion models (LDMs) conditioned on RGB images, combined with stereo-based geometric constraints, to learn scale and shift for accurate depth recovery. Our training-free solution seamlessly integrates into existing DB-MDE frameworks and generalizes across indoor, outdoor, and complex environments. Extensive experiments demonstrate that our approach matches or surpasses state-of-the-art methods, particularly in challenging scenarios involving translucent and specular surfaces, all without requiring retraining.
64. 【2510.18287】Efficient Few-shot Identity Preserving Attribute Editing for 3D-aware Deep Generative Models
链接:https://arxiv.org/abs/2510.18287
作者:Vishal Vinod
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:modifying expression etc., editing, removing eyeglasses, enables modifying, modifying expression
备注: 14 pages, 7 figures
点击查看摘要
Abstract:Identity preserving editing of faces is a generative task that enables modifying the illumination, adding/removing eyeglasses, face aging, editing hairstyles, modifying expression etc., while preserving the identity of the face. Recent progress in 2D generative models have enabled photorealistic editing of faces using simple techniques leveraging the compositionality in GANs. However, identity preserving editing for 3D faces with a given set of attributes is a challenging task as the generative model must reason about view consistency from multiple poses and render a realistic 3D face. Further, 3D portrait editing requires large-scale attribute labelled datasets and presents a trade-off between editability in low-resolution and inflexibility to editing in high resolution. In this work, we aim to alleviate some of the constraints in editing 3D faces by identifying latent space directions that correspond to photorealistic edits. To address this, we present a method that builds on recent advancements in 3D-aware deep generative models and 2D portrait editing techniques to perform efficient few-shot identity preserving attribute editing for 3D-aware generative models. We aim to show from experimental results that using just ten or fewer labelled images of an attribute is sufficient to estimate edit directions in the latent space that correspond to 3D-aware attribute editing. In this work, we leverage an existing face dataset with masks to obtain the synthetic images for few attribute examples required for estimating the edit directions. Further, to demonstrate the linearity of edits, we investigate one-shot stylization by performing sequential editing and use the (2D) Attribute Style Manipulation (ASM) technique to investigate a continuous style manifold for 3D consistent identity preserving face aging. Code and results are available at: this https URL
65. 【2510.18269】StreamingTOM: Streaming Token Compression for Efficient Video Understanding
链接:https://arxiv.org/abs/2510.18269
作者:Xueyi Chen,Keda Tao,Kele Shao,Huan Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:vision-language models face, Unlike offline processing, Unlike offline, video vision-language models, fundamental constraints
备注:
点击查看摘要
Abstract:Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
66. 【2510.18268】reeFedDG: Alleviating Global Drift in Federated Domain Generalization for Medical Image Segmentation
链接:https://arxiv.org/abs/2510.18268
作者:Yucheng Song,Chenxi Li,Haokang Ding,Zhining Liao,Zhifang Liao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:addressing challenges related, traditional federated learning, federated domain generalization, Federated Learning, federated learning methods
备注:
点击查看摘要
Abstract:In medical image segmentation tasks, Domain Generalization (DG) under the Federated Learning (FL) framework is crucial for addressing challenges related to privacy protection and data heterogeneity. However, traditional federated learning methods fail to account for the imbalance in information aggregation across clients in cross-domain scenarios, leading to the Global Drift (GD) problem and a consequent decline in model generalization performance. This motivates us to delve deeper and define a new critical issue: global drift in federated domain generalization for medical imaging (FedDG-GD). In this paper, we propose a novel tree topology framework called TreeFedDG. First, starting from the distributed characteristics of medical images, we design a hierarchical parameter aggregation method based on a tree-structured topology to suppress deviations in the global model direction. Second, we introduce a parameter difference-based style mixing method (FedStyle), which enforces mixing among clients with maximum parameter differences to enhance robustness against drift. Third, we develop a a progressive personalized fusion strategy during model distribution, ensuring a balance between knowledge transfer and personalized features. Finally, during the inference phase, we use feature similarity to guide the retrieval of the most relevant model chain from the tree structure for ensemble decision-making, thereby fully leveraging the advantages of hierarchical knowledge. We conducted extensive experiments on two publicly available datasets. The results demonstrate that our method outperforms other state-of-the-art domain generalization approaches in these challenging tasks and achieves better balance in cross-domain performance.
67. 【2510.18267】Latent-Info and Low-Dimensional Learning for Human Mesh Recovery and Parallel Optimization
链接:https://arxiv.org/abs/2510.18267
作者:Xiang Zhang,Suping Wu,Sheng Yang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:human mesh, human mesh recovery, reconstructed human mesh, mesh, complex scenes
备注: Accepted by ICME2025
点击查看摘要
Abstract:Existing 3D human mesh recovery methods often fail to fully exploit the latent information (e.g., human motion, shape alignment), leading to issues with limb misalignment and insufficient local details in the reconstructed human mesh (especially in complex scenes). Furthermore, the performance improvement gained by modelling mesh vertices and pose node interactions using attention mechanisms comes at a high computational cost. To address these issues, we propose a two-stage network for human mesh recovery based on latent information and low dimensional learning. Specifically, the first stage of the network fully excavates global (e.g., the overall shape alignment) and local (e.g., textures, detail) information from the low and high-frequency components of image features and aggregates this information into a hybrid latent frequency domain feature. This strategy effectively extracts latent information. Subsequently, utilizing extracted hybrid latent frequency domain features collaborates to enhance 2D poses to 3D learning. In the second stage, with the assistance of hybrid latent features, we model the interaction learning between the rough 3D human mesh template and the 3D pose, optimizing the pose and shape of the human mesh. Unlike existing mesh pose interaction methods, we design a low-dimensional mesh pose interaction method through dimensionality reduction and parallel optimization that significantly reduces computational costs without sacrificing reconstruction accuracy. Extensive experimental results on large publicly available datasets indicate superiority compared to the most state-of-the-art.
68. 【2510.18263】From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation
链接:https://arxiv.org/abs/2510.18263
作者:Ziwei Huang,Ying Shu,Hao Fang,Quanyu Long,Wenya Wang,Qiushi Guo,Tiezheng Ge,Leilei Gan
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:Subject-driven image generation, generation models face, Subject-driven image, face a fundamental, fundamental trade-off
备注:
点击查看摘要
Abstract:Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GPRO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient. (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early, identity preservation in the later. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
69. 【2510.18262】UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding
链接:https://arxiv.org/abs/2510.18262
作者:Da Zhang,Chenggang Rong,Bingyu Li,Feiyu Wang,Zhiyuan Zhao,Junyu Gao,Xuelong Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable success, Large vision-language models, remains largely unexplored, Large vision-language, largely unexplored
备注: We have released V1, which only reports the test results. Our work is still ongoing, and the next version will be coming soon
点击查看摘要
Abstract:Large vision-language models (VLMs) have achieved remarkable success in natural scene understanding, yet their application to underwater environments remains largely unexplored. Underwater imagery presents unique challenges including severe light attenuation, color distortion, and suspended particle scattering, while requiring specialized knowledge of marine ecosystems and organism taxonomy. To bridge this gap, we introduce UWBench, a comprehensive benchmark specifically designed for underwater vision-language understanding. UWBench comprises 15,003 high-resolution underwater images captured across diverse aquatic environments, encompassing oceans, coral reefs, and deep-sea habitats. Each image is enriched with human-verified annotations including 15,281 object referring expressions that precisely describe marine organisms and underwater structures, and 124,983 question-answer pairs covering diverse reasoning capabilities from object recognition to ecological relationship understanding. The dataset captures rich variations in visibility, lighting conditions, and water turbidity, providing a realistic testbed for model evaluation. Based on UWBench, we establish three comprehensive benchmarks: detailed image captioning for generating ecologically informed scene descriptions, visual grounding for precise localization of marine organisms, and visual question answering for multimodal reasoning about underwater environments. Extensive experiments on state-of-the-art VLMs demonstrate that underwater understanding remains challenging, with substantial room for improvement. Our benchmark provides essential resources for advancing vision-language research in underwater contexts and supporting applications in marine science, ecological monitoring, and autonomous underwater exploration. Our code and benchmark will be available.
70. 【2510.18256】Hyperbolic Space Learning Method Leveraging Temporal Motion Priors for Human Mesh Recovery
链接:https://arxiv.org/abs/2510.18256
作者:Xiang Zhang,Suping Wu,Weibin Qiu,Zhaocheng Jin,Sheng Yang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:temporal motion, temporal motion prior, motion, temporal, human meshes
备注: Accepted by ICME2025
点击查看摘要
Abstract:3D human meshes show a natural hierarchical structure (like torso-limbs-fingers). But existing video-based 3D human mesh recovery methods usually learn mesh features in Euclidean space. It's hard to catch this hierarchical structure accurately. So wrong human meshes are reconstructed. To solve this problem, we propose a hyperbolic space learning method leveraging temporal motion prior for recovering 3D human meshes from videos. First, we design a temporal motion prior extraction module. This module extracts the temporal motion features from the input 3D pose sequences and image feature sequences respectively. Then it combines them into the temporal motion prior. In this way, it can strengthen the ability to express features in the temporal motion dimension. Since data representation in non-Euclidean space has been proved to effectively capture hierarchical relationships in real-world datasets (especially in hyperbolic space), we further design a hyperbolic space optimization learning strategy. This strategy uses the temporal motion prior information to assist learning, and uses 3D pose and pose motion information respectively in the hyperbolic space to optimize and learn the mesh features. Then, we combine the optimized results to get an accurate and smooth human mesh. Besides, to make the optimization learning process of human meshes in hyperbolic space stable and effective, we propose a hyperbolic mesh optimization loss. Extensive experimental results on large publicly available datasets indicate superiority in comparison with most state-of-the-art.
71. 【2510.18253】OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion
链接:https://arxiv.org/abs/2510.18253
作者:Tianyu Huang,Runnan Chen,Dongting Hu,Fengming Huang,Mingming Gong,Tongliang Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:autonomous driving, augmented reality, pivotal for autonomous, Gaussian Splatting approaches, semantic Gaussian Splatting
备注:
点击查看摘要
Abstract:Understanding 3D scenes is pivotal for autonomous driving, robotics, and augmented reality. Recent semantic Gaussian Splatting approaches leverage large-scale 2D vision models to project 2D semantic features onto 3D scenes. However, they suffer from two major limitations: (1) insufficient contextual cues for individual masks during preprocessing and (2) inconsistencies and missing details when fusing multi-view features from these 2D models. In this paper, we introduce \textbf{OpenInsGaussian}, an \textbf{Open}-vocabulary \textbf{Ins}tance \textbf{Gaussian} segmentation framework with Context-aware Cross-view Fusion. Our method consists of two modules: Context-Aware Feature Extraction, which augments each mask with rich semantic context, and Attention-Driven Feature Aggregation, which selectively fuses multi-view features to mitigate alignment errors and incompleteness. Through extensive experiments on benchmark datasets, OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin. These findings underscore the robustness and generality of our proposed approach, marking a significant step forward in 3D scene understanding and its practical deployment across diverse real-world scenarios.
72. 【2510.18244】BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining
链接:https://arxiv.org/abs/2510.18244
作者:Ajinkya Khoche,Gergő László Nagy,Maciej Wozniak,Thomas Gustafsson,Patric Jensfelt
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:noisy LiDAR scans, LiDAR scans encountered, noisy LiDAR, significant domain gap, classification is crucial
备注: Under Review
点击查看摘要
Abstract:Zero-shot 3D object classification is crucial for real-world applications like autonomous driving, however it is often hindered by a significant domain gap between the synthetic data used for training and the sparse, noisy LiDAR scans encountered in the real-world. Current methods trained solely on synthetic data fail to generalize to outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects. We introduce BlendCLIP, a multimodal pretraining framework that bridges this synthetic-to-real gap by strategically combining the strengths of both domains. We first propose a pipeline to generate a large-scale dataset of object-level triplets -- consisting of a point cloud, image, and text description -- mined directly from real-world driving data and human annotated 3D boxes. Our core contribution is a curriculum-based data mixing strategy that first grounds the model in the semantically rich synthetic CAD data before progressively adapting it to the specific characteristics of real-world scans. Our experiments show that our approach is highly label-efficient: introducing as few as 1.5\% real-world samples per batch into training boosts zero-shot accuracy on the nuScenes benchmark by 27\%. Consequently, our final model achieves state-of-the-art performance on challenging outdoor datasets like nuScenes and TruckScenes, improving over the best prior method by 19.3\% on nuScenes, while maintaining strong generalization on diverse synthetic benchmarks. Our findings demonstrate that effective domain adaptation, not full-scale real-world annotation, is the key to unlocking robust open-vocabulary 3D perception. Our code and dataset will be released upon acceptance on this https URL.
Comments:
Under Review
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2510.18244 [cs.CV]
(or
arXiv:2510.18244v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2510.18244
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
73. 【2510.18234】DeepSeek-OCR: Contexts Optical Compression
链接:https://arxiv.org/abs/2510.18234
作者:Haoran Wei,Yaofeng Sun,Yukun Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:compressing long contexts, contexts via optical, initial investigation, feasibility of compressing, compressing long
备注:
点击查看摘要
Abstract:We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at this http URL.
74. 【2510.18229】Beyond Frequency: Scoring-Driven Debiasing for Object Detection via Blueprint-Prompted Image Synthesis
链接:https://arxiv.org/abs/2510.18229
作者:Xinhao Cai,Liulei Li,Gensheng Pei,Tao Chen,Jinshan Pan,Yazhou Yao,Wenguan Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generation-based debiasing framework, paper presents, presents a generation-based, generation-based debiasing, debiasing framework
备注:
点击查看摘要
Abstract:This paper presents a generation-based debiasing framework for object detection. Prior debiasing methods are often limited by the representation diversity of samples, while naive generative augmentation often preserves the biases it aims to solve. Moreover, our analysis reveals that simply generating more data for rare classes is suboptimal due to two core issues: i) instance frequency is an incomplete proxy for the true data needs of a model, and ii) current layout-to-image synthesis lacks the fidelity and control to generate high-quality, complex scenes. To overcome this, we introduce the representation score (RS) to diagnose representational gaps beyond mere frequency, guiding the creation of new, unbiased layouts. To ensure high-quality synthesis, we replace ambiguous text prompts with a precise visual blueprint and employ a generative alignment strategy, which fosters communication between the detector and generator. Our method significantly narrows the performance gap for underrepresented object groups, \eg, improving large/rare instances by 4.4/3.6 mAP over the baseline, and surpassing prior L2I synthesis models by 15.9 mAP for layout accuracy in generated images.
75. 【2510.18214】VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
链接:https://arxiv.org/abs/2510.18214
作者:Shruti Palaskar,Leon Gatys,Mona Abdelrahman,Mar Jacobo,Larry Lindsey,Rutika Moharir,Gunnar Lund,Yang Xu,Navid Shiee,Jeffrey Bigham,Charles Maalouf,Joseph Yitan Cheng
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:language inputs separately, inputs separately, missing risks, Vision Language Safety, interpretation where benign
备注: 10 pages, 5 figures, 4 tables. Under review
点击查看摘要
Abstract:Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
76. 【2510.18213】EMA-SAM: Exponential Moving-average for SAM-based PTMC Segmentation
链接:https://arxiv.org/abs/2510.18213
作者:Maryam Dialameh,Hossein Rajabzadeh,Jung Suk Sim,Hyock Ju Kwon
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Papillary thyroid microcarcinoma, accurate lesion segmentation, remains difficult due, Papillary thyroid, videos remains difficult
备注:
点击查看摘要
Abstract:Papillary thyroid microcarcinoma (PTMC) is increasingly managed with radio-frequency ablation (RFA), yet accurate lesion segmentation in ultrasound videos remains difficult due to low contrast, probe-induced motion, and heat-related artifacts. The recent Segment Anything Model 2 (SAM-2) generalizes well to static images, but its frame-independent design yields unstable predictions and temporal drift in interventional ultrasound. We introduce \textbf{EMA-SAM}, a lightweight extension of SAM-2 that incorporates a confidence-weighted exponential moving average pointer into the memory bank, providing a stable latent prototype of the tumour across frames. This design preserves temporal coherence through probe pressure and bubble occlusion while rapidly adapting once clear evidence reappears. On our curated PTMC-RFA dataset (124 minutes, 13 patients), EMA-SAM improves \emph{maxDice} from 0.82 (SAM-2) to 0.86 and \emph{maxIoU} from 0.72 to 0.76, while reducing false positives by 29\%. On external benchmarks, including VTUS and colonoscopy video polyp datasets, EMA-SAM achieves consistent gains of 2--5 Dice points over SAM-2. Importantly, the EMA pointer adds \textless0.1\% FLOPs, preserving real-time throughput of $\sim$30\,FPS on a single A100 GPU. These results establish EMA-SAM as a robust and efficient framework for stable tumour tracking, bridging the gap between foundation models and the stringent demands of interventional ultrasound. Codes are available here \hyperref[code {this https URL}.
77. 【2510.18193】FST.ai 2.0: An Explainable AI Ecosystem for Fair, Fast, and Inclusive Decision-Making in Olympic and Paralympic Taekwondo
链接:https://arxiv.org/abs/2510.18193
作者:Keivan Shariatmadar,Ahmad Osman,Ramin Ray,Usman Dildar,Kisam Kim
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
关键词:Olympic and Paralympic, Paralympic combat sports, explainable decision-making remains, challenge in Olympic, Paralympic combat
备注: 23 pages, 12 figures
点击查看摘要
Abstract:Fair, transparent, and explainable decision-making remains a critical challenge in Olympic and Paralympic combat sports. This paper presents \emph{this http URL 2.0}, an explainable AI ecosystem designed to support referees, coaches, and athletes in real time during Taekwondo competitions and training. The system integrates {pose-based action recognition} using graph convolutional networks (GCNs), {epistemic uncertainty modeling} through credal sets, and {explainability overlays} for visual decision support. A set of {interactive dashboards} enables human--AI collaboration in referee evaluation, athlete performance analysis, and Para-Taekwondo classification. Beyond automated scoring, this http URL~2.0 incorporates modules for referee training, fairness monitoring, and policy-level analytics within the World Taekwondo ecosystem. Experimental validation on competition data demonstrates an {85\% reduction in decision review time} and {93\% referee trust} in AI-assisted decisions. The framework thus establishes a transparent and extensible pipeline for trustworthy, data-driven officiating and athlete assessment. By bridging real-time perception, explainable inference, and governance-aware design, this http URL~2.0 represents a step toward equitable, accountable, and human-aligned AI in sports.
78. 【2510.18189】A Generalizable Light Transport 3D Embedding for Global Illumination
链接:https://arxiv.org/abs/2510.18189
作者:Bing Xu,Mukund Varma T,Cheng Wang,Tzumao Li,Lifan Wu,Bartlomiej Wronski,Ravi Ramamoorthi,Marco Salvi
类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
关键词:remains computationally expensive, computationally expensive due, simulating indirect light, essential for realistic, remains computationally
备注:
点击查看摘要
Abstract:Global illumination (GI) is essential for realistic rendering but remains computationally expensive due to the complexity of simulating indirect light transport. Recent neural methods have mainly relied on per-scene optimization, sometimes extended to handle changes in camera or geometry. Efforts toward cross-scene generalization have largely stayed in 2D screen space, such as neural denoising or G-buffer based GI prediction, which often suffer from view inconsistency and limited spatial understanding. We propose a generalizable 3D light transport embedding that approximates global illumination directly from 3D scene configurations, without using rasterized or path-traced cues. Each scene is represented as a point cloud with geometric and material features. A scalable transformer models global point-to-point interactions to encode these features into neural primitives. At render time, each query point retrieves nearby primitives via nearest-neighbor search and aggregates their latent features through cross-attention to predict the desired rendering quantity. We demonstrate results on diffuse global illumination prediction across diverse indoor scenes with varying layouts, geometry, and materials. The embedding trained for irradiance estimation can be quickly adapted to new rendering tasks with limited fine-tuning. We also present preliminary results for spatial-directional radiance field estimation for glossy materials and show how the normalized field can accelerate unbiased path guiding. This approach highlights a path toward integrating learned priors into rendering pipelines without explicit ray-traced illumination cues.
79. 【2510.18188】RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology
链接:https://arxiv.org/abs/2510.18188
作者:Chengrun Li,Corentin Royer,Haozhe Luo,Bastian Wittmann,Xia Li,Ibrahim Hamamci,Sezgin Er,Anjany Sekuboyina,Bjoern Menze
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:complex visual questions, jointly generate diagnostic, current medical vision, medical vision language, generate diagnostic text
备注:
点击查看摘要
Abstract:Most current medical vision language models struggle to jointly generate diagnostic text and pixel-level segmentation masks in response to complex visual questions. This represents a major limitation towards clinical application, as assistive systems that fail to provide both modalities simultaneously offer limited value to medical practitioners. To alleviate this limitation, we first introduce RadDiagSeg-D, a dataset combining abnormality detection, diagnosis, and multi-target segmentation into a unified and hierarchical task. RadDiagSeg-D covers multiple imaging modalities and is precisely designed to support the development of models that produce descriptive text and corresponding segmentation masks in tandem. Subsequently, we leverage the dataset to propose a novel vision-language model, RadDiagSeg-M, capable of joint abnormality detection, diagnosis, and flexible segmentation. RadDiagSeg-M provides highly informative and clinically useful outputs, effectively addressing the need to enrich contextual information for assistive diagnosis. Finally, we benchmark RadDiagSeg-M and showcase its strong performance across all components involved in the task of multi-target text-and-mask generation, establishing a robust and competitive baseline.
80. 【2510.18187】VelocityNet: Real-Time Crowd Anomaly Detection via Person-Specific Velocity Analysis
链接:https://arxiv.org/abs/2510.18187
作者:Fatima AlGhamdi,Omar Alharbi,Abdullah Aldwyish,Raied Aljadaany,Muhammad Kamran J Khan,Huda Alamri
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:severe inter-person occlusions, Detecting anomalies, highly dynamic, scenes is challenging, challenging due
备注: 8 pages, 3 figures
点击查看摘要
Abstract:Detecting anomalies in crowded scenes is challenging due to severe inter-person occlusions and highly dynamic, context-dependent motion patterns. Existing approaches often struggle to adapt to varying crowd densities and lack interpretable anomaly indicators. To address these limitations, we introduce VelocityNet, a dual-pipeline framework that combines head detection and dense optical flow to extract person-specific velocities. Hierarchical clustering categorizes these velocities into semantic motion classes (halt, slow, normal, and fast), and a percentile-based anomaly scoring system measures deviations from learned normal patterns. Experiments demonstrate the effectiveness of our framework in real-time detection of diverse anomalous motion patterns within densely crowded environments.
81. 【2510.18172】Adapting Stereo Vision From Objects To 3D Lunar Surface Reconstruction with the StereoLunar Dataset
链接:https://arxiv.org/abs/2510.18172
作者:Clementine Grethen,Simone Gasparini,Geraldine Morin,Jeremy Lebreton,Lucas Marti,Manuel Sanchez-Gestido
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:space exploration, essential for space, Accurate, lunar South Pole, lunar
备注: Accepted to ICCV workshop 2025. The project page can be accessed via this [this https URL](https://clementinegrethen.github.io/publications/3D-Vast-ICCV2025.html) URL. The source code is available at this [this https URL](https://github.com/clementinegrethen/StereoLunar) URL
点击查看摘要
Abstract:Accurate 3D reconstruction of lunar surfaces is essential for space exploration. However, existing stereo vision reconstruction methods struggle in this context due to the Moon's lack of texture, difficult lighting variations, and atypical orbital trajectories. State-of-the-art deep learning models, trained on human-scale datasets, have rarely been tested on planetary imagery and cannot be transferred directly to lunar conditions. To address this issue, we introduce LunarStereo, the first open dataset of photorealistic stereo image pairs of the Moon, simulated using ray tracing based on high-resolution topography and reflectance models. It covers diverse altitudes, lighting conditions, and viewing angles around the lunar South Pole, offering physically grounded supervision for 3D reconstruction tasks. Based on this dataset, we adapt the MASt3R model to the lunar domain through fine-tuning on LunarStereo. We validate our approach through extensive qualitative and quantitative experiments on both synthetic and real lunar data, evaluating 3D surface reconstruction and relative pose estimation. Extensive experiments on synthetic and real lunar data validate the approach, demonstrating significant improvements over zero-shot baselines and paving the way for robust cross-scale generalization in extraterrestrial environments.
82. 【2510.18135】World-in-World: World Models in a Closed-Loop World
链接:https://arxiv.org/abs/2510.18135
作者:Jiahan Zhang,Muqing Jiang,Nanru Dai,Taiming Lu,Arda Uzunoglu,Shunchi Zhang,Yana Wei,Jiahao Wang,Vishal M. Patel,Paul Pu Liang,Daniel Khashabi,Cheng Peng,Rama Chellappa,Tianmin Shu,Alan Yuille,Yilun Du,Jieneng Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:striking visual realism, Generative world models, endow embodied agents, Generative world, naturally raises
备注: Code is at [this https URL](https://github.com/World-In-World/world-in-world)
点击查看摘要
Abstract:Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.
83. 【2510.18123】SafeCoop: Unravelling Full Stack Safety in Agentic Collaborative Driving
链接:https://arxiv.org/abs/2510.18123
作者:Xiangbo Gao,Tzu-Hsiang Lin,Ruojing Song,Yuheng Wu,Kuan-Ru Huang,Zicheng Jin,Fangzhou Lin,Shinan Liu,Zhengzhong Tu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
关键词:driving systems leverage, multiple agents, agents to enhance, enhance driving safety, Collaborative driving systems
备注:
点击查看摘要
Abstract:Collaborative driving systems leverage vehicle-to-everything (V2X) communication across multiple agents to enhance driving safety and efficiency. Traditional V2X systems take raw sensor data, neural features, or perception results as communication media, which face persistent challenges, including high bandwidth demands, semantic loss, and interoperability issues. Recent advances investigate natural language as a promising medium, which can provide semantic richness, decision-level reasoning, and human-machine interoperability at significantly lower bandwidth. Despite great promise, this paradigm shift also introduces new vulnerabilities within language communication, including message loss, hallucinations, semantic manipulation, and adversarial attacks. In this work, we present the first systematic study of full-stack safety and security issues in natural-language-based collaborative driving. Specifically, we develop a comprehensive taxonomy of attack strategies, including connection disruption, relay/replay interference, content spoofing, and multi-connection forgery. To mitigate these risks, we introduce an agentic defense pipeline, which we call SafeCoop, that integrates a semantic firewall, language-perception consistency checks, and multi-source consensus, enabled by an agentic transformation function for cross-frame spatial alignment. We systematically evaluate SafeCoop in closed-loop CARLA simulation across 32 critical scenarios, achieving 69.15% driving score improvement under malicious attacks and up to 67.32% F1 score for malicious detection. This study provides guidance for advancing research on safe, secure, and trustworthy language-driven collaboration in transportation systems. Our project page is this https URL.
84. 【2510.18117】Online In-Context Distillation for Low-Resource Vision Language Models
链接:https://arxiv.org/abs/2510.18117
作者:Zhiqi Kang,Rahaf Aljundi,Vaggelis Dorovatas,Karteek Alahari
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:critical question, thrive in low-resource, field continues, continues its push, work turns
备注:
点击查看摘要
Abstract:As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher's zero-shot performance.
85. 【2510.18101】From Volume Rendering to 3D Gaussian Splatting: Theory and Applications
链接:https://arxiv.org/abs/2510.18101
作者:Vitor Pereira Matias,Daniel Perazzo,Vinicius Silva,Alberto Raposo,Luiz Velho,Afonso Paiva,Tiago Novello
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fundamental transformation, driven by continuous, posed images, images is undergoing, undergoing a fundamental
备注: Accepted at the Conference on Graphics, Patterns and Images (SIBGRAPI), math focused, 5 equations, 5 Figure, 5 pages of text and 1 of bibligraphy
点击查看摘要
Abstract:The problem of 3D reconstruction from posed images is undergoing a fundamental transformation, driven by continuous advances in 3D Gaussian Splatting (3DGS). By modeling scenes explicitly as collections of 3D Gaussians, 3DGS enables efficient rasterization through volumetric splatting, offering thus a seamless integration with common graphics pipelines. Despite its real-time rendering capabilities for novel view synthesis, 3DGS suffers from a high memory footprint, the tendency to bake lighting effects directly into its representation, and limited support for secondary-ray effects. This tutorial provides a concise yet comprehensive overview of the 3DGS pipeline, starting from its splatting formulation and then exploring the main efforts in addressing its limitations. Finally, we survey a range of applications that leverage 3DGS for surface reconstruction, avatar modeling, animation, and content generation-highlighting its efficient rendering and suitability for feed-forward pipelines.
86. 【2510.18091】Accelerating Vision Transformers with Adaptive Patch Sizes
链接:https://arxiv.org/abs/2510.18091
作者:Rohan Choudhury,JungEun Kim,Jinhyung Park,Eunho Yang,László A. Jeni,Kris M. Kitani
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Vision Transformers, Adaptive Patch Transformers, uniformly sized patches, long input sequence, input sequence lengths
备注: Project page at [this https URL](https://rccchoudhury.github.io/apt/)
点击查看摘要
Abstract:Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30\% faster training and inference in visual QA, object detection, and semantic segmentation.
87. 【2510.18089】Big Data, Tiny Targets: An Exploratory Study in Machine Learning-enhanced Detection of Microplastic from Filters
链接:https://arxiv.org/abs/2510.18089
作者:Paul-Tiberiu Miclea,Martin Sboron,Hardik Vaghasiya,Hoang Thinh Nguyen,Meet Gadara,Thomas Schmid
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Scanning Electron Microscopy, Atomic Force Microscopy, human health, ubiquitous pollutants, pollutants with demonstrated
备注:
点击查看摘要
Abstract:Microplastics (MPs) are ubiquitous pollutants with demonstrated potential to impact ecosystems and human health. Their microscopic size complicates detection, classification, and removal, especially in biological and environmental samples. While techniques like optical microscopy, Scanning Electron Microscopy (SEM), and Atomic Force Microscopy (AFM) provide a sound basis for detection, applying these approaches requires usually manual analysis and prevents efficient use in large screening studies. To this end, machine learning (ML) has emerged as a powerful tool in advancing microplastic detection. In this exploratory study, we investigate potential, limitations and future directions of advancing the detection and quantification of MP particles and fibres using a combination of SEM imaging and machine learning-based object detection. For simplicity, we focus on a filtration scenario where image backgrounds exhibit a symmetric and repetitive pattern. Our findings indicate differences in the quality of YOLO models for the given task and the relevance of optimizing preprocessing. At the same time, we identify open challenges, such as limited amounts of expert-labeled data necessary for reliable training of ML models.
88. 【2510.18083】Chimera: Compositional Image Generation using Part-based Concepting
链接:https://arxiv.org/abs/2510.18083
作者:Shivam Singh,Yiming Chen,Agneet Chatterjee,Amit Raj,James Hays,Yezhou Yang,Chitra Baral
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:lack explicit control, Personalized image generative, multiple source images, masks or annotations, Personalized image
备注:
点击查看摘要
Abstract:Personalized image generative models are highly proficient at synthesizing images from text or a single image, yet they lack explicit control for composing objects from specific parts of multiple source images without user specified masks or annotations. To address this, we introduce Chimera, a personalized image generation model that generates novel objects by combining specified parts from different source images according to textual instructions. To train our model, we first construct a dataset from a taxonomy built on 464 unique (part, subject) pairs, which we term semantic atoms. From this, we generate 37k prompts and synthesize the corresponding images with a high-fidelity text-to-image model. We train a custom diffusion prior model with part-conditional guidance, which steers the image-conditioning features to enforce both semantic identity and spatial layout. We also introduce an objective metric PartEval to assess the fidelity and compositional accuracy of generation pipelines. Human evaluations and our proposed metric show that Chimera outperforms other baselines by 14% in part alignment and compositional accuracy and 21% in visual quality.
89. 【2510.18054】HouseTour: A Virtual Real Estate A(I)gent
链接:https://arxiv.org/abs/2510.18054
作者:Ata Çelen,Marc Pollefeys,Daniel Barath,Iro Armeni
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:natural language summary, language summary generation, natural language, language summary, collection of images
备注: Published on ICCV 2025
点击查看摘要
Abstract:We introduce HouseTour, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. We synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. To support this task, we present the HouseTour dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. We evaluate both individual and end-to-end performance, introducing a new joint metric. Our work enables automated, professional-quality video creation for real estate and touristic applications without requiring specialized expertise or equipment.
90. 【2510.18038】riggerNet: A Novel Explainable AI Framework for Red Palm Mite Detection and Multi-Model Comparison and Heuristic-Guided Annotation
链接:https://arxiv.org/abs/2510.18038
作者:Harshini Suresha,Kavitha SH
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
关键词:extensive palm cultivation, red palm mite, leading to reduced, economic losses, palm
备注: 17 pages, 9 figures
点击查看摘要
Abstract:The red palm mite infestation has become a serious concern, particularly in regions with extensive palm cultivation, leading to reduced productivity and economic losses. Accurate and early identification of mite-infested plants is critical for effective management. The current study focuses on evaluating and comparing the ML model for classifying the affected plants and detecting the infestation. TriggerNet is a novel interpretable AI framework that integrates Grad-CAM, RISE, FullGrad, and TCAV to generate novel visual explanations for deep learning models in plant classification and disease detection. This study applies TriggerNet to address red palm mite (Raoiella indica) infestation, a major threat to palm cultivation and agricultural productivity. A diverse set of RGB images across 11 plant species, Arecanut, Date Palm, Bird of Paradise, Coconut Palm, Ginger, Citrus Tree, Palm Oil, Orchid, Banana Palm, Avocado Tree, and Cast Iron Plant was utilized for training and evaluation. Advanced deep learning models like CNN, EfficientNet, MobileNet, ViT, ResNet50, and InceptionV3, alongside machine learning classifiers such as Random Forest, SVM, and KNN, were employed for plant classification. For disease classification, all plants were categorized into four classes: Healthy, Yellow Spots, Reddish Bronzing, and Silk Webbing. Snorkel was used to efficiently label these disease classes by leveraging heuristic rules and patterns, reducing manual annotation time and improving dataset reliability.
91. 【2510.18034】SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection
链接:https://arxiv.org/abs/2510.18034
作者:Roberto Brusnicki,David Pop,Yuan Gao,Mattia Piccinini,Johannes Betz
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:remain critically vulnerable, systems remain critically, Vision Language Models, long-tail of rare, remain critically
备注: 8 pages, 5 figures
点击查看摘要
Abstract:Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution scenarios with semantic anomalies. While Vision Language Models (VLMs) offer promising reasoning capabilities, naive prompting approaches yield unreliable performance and depend on expensive proprietary models, limiting practical deployment. We introduce SAVANT (Semantic Analysis with Vision-Augmented Anomaly deTection), a structured reasoning framework that achieves high accuracy and recall in detecting anomalous driving scenarios from input images through layered scene analysis and a two-phase pipeline: structured scene description extraction followed by multi-modal evaluation. Our approach transforms VLM reasoning from ad-hoc prompting to systematic analysis across four semantic layers: Street, Infrastructure, Movable Objects, and Environment. SAVANT achieves 89.6% recall and 88.0% accuracy on real-world driving scenarios, significantly outperforming unstructured baselines. More importantly, we demonstrate that our structured framework enables a fine-tuned 7B parameter open-source model (Qwen2.5VL) to achieve 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By automatically labeling over 9,640 real-world images with high accuracy, SAVANT addresses the critical data scarcity problem in anomaly detection and provides a practical path toward reliable, accessible semantic monitoring for autonomous systems.
92. 【2510.18016】ViBED-Net: Video Based Engagement Detection Network Using Face-Aware and Scene-Aware Spatiotemporal Cues
链接:https://arxiv.org/abs/2510.18016
作者:Prateek Gothwal,Deeptimaan Banerjee,Ashis Kumer Biswas
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:online learning environments, improving student outcomes, personalizing instruction, environments is vital, vital for improving
备注: 10 pages, 4 figures, 2 tables
点击查看摘要
Abstract:Engagement detection in online learning environments is vital for improving student outcomes and personalizing instruction. We present ViBED-Net (Video-Based Engagement Detection Network), a novel deep learning framework designed to assess student engagement from video data using a dual-stream architecture. ViBED-Net captures both facial expressions and full-scene context by processing facial crops and entire video frames through EfficientNetV2 for spatial feature extraction. These features are then analyzed over time using two temporal modeling strategies: Long Short-Term Memory (LSTM) networks and Transformer encoders. Our model is evaluated on the DAiSEE dataset, a large-scale benchmark for affective state recognition in e-learning. To enhance performance on underrepresented engagement classes, we apply targeted data augmentation techniques. Among the tested variants, ViBED-Net with LSTM achieves 73.43\% accuracy, outperforming existing state-of-the-art approaches. ViBED-Net demonstrates that combining face-aware and scene-aware spatiotemporal cues significantly improves engagement detection accuracy. Its modular design allows flexibility for application across education, user experience research, and content personalization. This work advances video-based affective computing by offering a scalable, high-performing solution for real-world engagement analysis. The source code for this project is available on this https URL .
93. 【2510.18014】ManzaiSet: A Multimodal Dataset of Viewer Responses to Japanese Manzai Comedy
链接:https://arxiv.org/abs/2510.18014
作者:Kazuki Kawamura,Kengo Nakai,Jun Rekimoto
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Japanese manzai comedy, capturing facial videos, large scale multimodal, Japanese manzai, scale multimodal dataset
备注: ICCV 2025 Workshop on Affective Behavior Analysis in-the-Wild (ABAW), Honolulu, HI, USA (Oct 19, 2025, HST). 11 pages, 5 figures
点击查看摘要
Abstract:We present ManzaiSet, the first large scale multimodal dataset of viewer responses to Japanese manzai comedy, capturing facial videos and audio from 241 participants watching up to 10 professional performances in randomized order (94.6 percent watched = 8; analyses focus on n=228). This addresses the Western centric bias in affective computing. Three key findings emerge: (1) k means clustering identified three distinct viewer types: High and Stable Appreciators (72.8 percent, n=166), Low and Variable Decliners (13.2 percent, n=30), and Variable Improvers (14.0 percent, n=32), with heterogeneity of variance (Brown Forsythe p 0.001); (2) individual level analysis revealed a positive viewing order effect (mean slope = 0.488, t(227) = 5.42, p 0.001, permutation p 0.001), contradicting fatigue hypotheses; (3) automated humor classification (77 instances, 131 labels) plus viewer level response modeling found no type wise differences after FDR correction. The dataset enables culturally aware emotion AI development and personalized entertainment systems tailored to non Western contexts.
94. 【2510.17999】Investigating Demographic Bias in Brain MRI Segmentation: A Comparative Study of Deep-Learning and Non-Deep-Learning Methods
链接:https://arxiv.org/abs/2510.17999
作者:Ghazal Danaee,Marc Niethammer,Jarrett Rushmore,Sylvain Bouix
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:medical image analysis, segmentation, algorithms have substantially, substantially advanced, advanced the field
备注:
点击查看摘要
Abstract:Deep-learning-based segmentation algorithms have substantially advanced the field of medical image analysis, particularly in structural delineations in MRIs. However, an important consideration is the intrinsic bias in the data. Concerns about unfairness, such as performance disparities based on sensitive attributes like race and sex, are increasingly urgent. In this work, we evaluate the results of three different segmentation models (UNesT, nnU-Net, and CoTr) and a traditional atlas-based method (ANTs), applied to segment the left and right nucleus accumbens (NAc) in MRI images. We utilize a dataset including four demographic subgroups: black female, black male, white female, and white male. We employ manually labeled gold-standard segmentations to train and test segmentation models. This study consists of two parts: the first assesses the segmentation performance of models, while the second measures the volumes they produce to evaluate the effects of race, sex, and their interaction. Fairness is quantitatively measured using a metric designed to quantify fairness in segmentation performance. Additionally, linear mixed models analyze the impact of demographic variables on segmentation accuracy and derived volumes. Training on the same race as the test subjects leads to significantly better segmentation accuracy for some models. ANTs and UNesT show notable improvements in segmentation accuracy when trained and tested on race-matched data, unlike nnU-Net, which demonstrates robust performance independent of demographic matching. Finally, we examine sex and race effects on the volume of the NAc using segmentations from the manual rater and from our biased models. Results reveal that the sex effects observed with manual segmentation can also be observed with biased models, whereas the race effects disappear in all but one model.
95. 【2510.17991】Demystifying Transition Matching: When and Why It Can Beat Flow Matching
链接:https://arxiv.org/abs/2510.17991
作者:Jaihoon Kim,Rajarshi Saha,Minhyuk Sung,Youngsuk Park
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Flow Matching, Transition Matching, achieve higher quality, fewer sampling steps, generative models
备注:
点击查看摘要
Abstract:Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for finite number of steps. The improvement arises from stochastic difference latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget, establishing its advantage in the unimodal Gaussian setting. Second, we extend the analysis to Gaussian mixtures and identify local-unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.
96. 【2510.17914】NeuCo-Bench: A Novel Benchmark Framework for Neural Embeddings in Earth Observation
链接:https://arxiv.org/abs/2510.17914
作者:Rikard Vinge,Isabelle Wittmann,Jannik Schneider,Michael Marszalek,Luis Gilch,Thomas Brunschwiler,Conrad M Albrecht
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Earth Observation, context of Earth, framework for evaluating, benchmark framework, representation learning
备注:
点击查看摘要
Abstract:We introduce NeuCo-Bench, a novel benchmark framework for evaluating (lossy) neural compression and representation learning in the context of Earth Observation (EO). Our approach builds on fixed-size embeddings that act as compact, task-agnostic representations applicable to a broad range of downstream tasks. NeuCo-Bench comprises three core components: (i) an evaluation pipeline built around reusable embeddings, (ii) a new challenge mode with a hidden-task leaderboard designed to mitigate pretraining bias, and (iii) a scoring system that balances accuracy and stability. To support reproducibility, we release SSL4EO-S12-downstream, a curated multispectral, multitemporal EO dataset. We present initial results from a public challenge at the 2025 CVPR EARTHVISION workshop and conduct ablations with state-of-the-art foundation models. NeuCo-Bench provides a first step towards community-driven, standardized evaluation of neural embeddings for EO and beyond.
97. 【2510.17885】Metrics and evaluations for computational and sustainable AI efficiency
链接:https://arxiv.org/abs/2510.17885
作者:Hongyuan Liu,Xinyang Liu,Guosheng Hu
类目:Performance (cs.PF); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:created unprecedented demands, Artificial Intelligence, models remain fragmented, deployed models remain, advancement of Artificial
备注: 11 pages, 2 tables
点击查看摘要
Abstract:The rapid advancement of Artificial Intelligence (AI) has created unprecedented demands for computational power, yet methods for evaluating the performance, efficiency, and environmental impact of deployed models remain fragmented. Current approaches often fail to provide a holistic view, making it difficult to compare and optimise systems across heterogeneous hardware, software stacks, and numeric precisions. To address this gap, we propose a unified and reproducible methodology for AI model inference that integrates computational and environmental metrics under realistic serving conditions. Our framework provides a pragmatic, carbon-aware evaluation by systematically measuring latency and throughput distributions, energy consumption, and location-adjusted carbon emissions, all while maintaining matched accuracy constraints for valid comparisons. We apply this methodology to multi-precision models across diverse hardware platforms, from data-centre accelerators like the GH200 to consumer-level GPUs such as the RTX 4090, running on mainstream software stacks including PyTorch, TensorRT, and ONNX Runtime. By systematically categorising these factors, our work establishes a rigorous benchmarking framework that produces decision-ready Pareto frontiers, clarifying the trade-offs between accuracy, latency, energy, and carbon. The accompanying open-source code enables independent verification and facilitates adoption, empowering researchers and practitioners to make evidence-based decisions for sustainable AI deployment.
98. 【2510.17875】3D Weakly Supervised Semantic Segmentation via Class-Aware and Geometry-Guided Pseudo-Label Refinement
链接:https://arxiv.org/abs/2510.17875
作者:Xiaoxu Xu,Xuexun Liu,Jinlong Li,Yitian Yuan,Qiudan Zhang,Lin Ma,Nicu Sebe,Xu Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:low-cost annotated data, significantly reducing reliance, dense point-wise annotations, weakly supervised semantic, WSSS models
备注:
点击查看摘要
Abstract:3D weakly supervised semantic segmentation (3D WSSS) aims to achieve semantic segmentation by leveraging sparse or low-cost annotated data, significantly reducing reliance on dense point-wise annotations. Previous works mainly employ class activation maps or pre-trained vision-language models to address this challenge. However, the low quality of pseudo-labels and the insufficient exploitation of 3D geometric priors jointly create significant technical bottlenecks in developing high-performance 3D WSSS models. In this paper, we propose a simple yet effective 3D weakly supervised semantic segmentation method that integrates 3D geometric priors into a class-aware guidance mechanism to generate high-fidelity pseudo labels. Concretely, our designed methodology first employs Class-Aware Label Refinement module to generate more balanced and accurate pseudo labels for semantic categrories. This initial refinement stage focuses on enhancing label quality through category-specific optimization. Subsequently, the Geometry-Aware Label Refinement component is developed, which strategically integrates implicit 3D geometric constraints to effectively filter out low-confidence pseudo labels that fail to comply with geometric plausibility. Moreover, to address the challenge of extensive unlabeled regions, we propose a Label Update strategy that integrates Self-Training to propagate labels into these areas. This iterative process continuously enhances pseudo-label quality while expanding label coverage, ultimately fostering the development of high-performance 3D WSSS models. Comprehensive experimental validation reveals that our proposed methodology achieves state-of-the-art performance on both ScanNet and S3DIS benchmarks while demonstrating remarkable generalization capability in unsupervised settings, maintaining competitive accuracy through its robust design.
99. 【2510.17873】Auditing and Mitigating Bias in Gender Classification Algorithms: A Data-Centric Approach
链接:https://arxiv.org/abs/2510.17873
作者:Tadesse K Bahiru,Natnael Tilahun Sinshaw,Teshager Hailemariam Moges,Dheeraj Kumar Singh
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:amplify demographic imbalances, Gender classification systems, training data, systems often inherit, inherit and amplify
备注:
点击查看摘要
Abstract:Gender classification systems often inherit and amplify demographic imbalances in their training data. We first audit five widely used gender classification datasets, revealing that all suffer from significant intersectional underrepresentation. To measure the downstream impact of these flaws, we train identical MobileNetV2 classifiers on the two most balanced of these datasets, UTKFace and FairFace. Our fairness evaluation shows that even these models exhibit significant bias, misclassifying female faces at a higher rate than male faces and amplifying existing racial skew. To counter these data-induced biases, we construct BalancedFace, a new public dataset created by blending images from FairFace and UTKFace, supplemented with images from other collections to fill missing demographic gaps. It is engineered to equalize subgroup shares across 189 intersections of age, race, and gender using only real, unedited images. When a standard classifier is trained on BalancedFace, it reduces the maximum True Positive Rate gap across racial subgroups by over 50% and brings the average Disparate Impact score 63% closer to the ideal of 1.0 compared to the next-best dataset, all with a minimal loss of overall accuracy. These results underline the profound value of data-centric interventions and provide an openly available resource for fair gender classification research.
100. 【2510.17869】GAN-based Content-Conditioned Generation of Handwritten Musical Symbols
链接:https://arxiv.org/abs/2510.17869
作者:Gerard Asbert,Pau Torras,Lei Kang,Alicia Fornés,Josep Lladós
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Optical Music Recognition, real annotated data, handwritten historical musical, Handwritten Text Recognition, historical musical scores
备注: 15 pages, 5 figures, Accepted at ICDAR workshop GREC 2025
点击查看摘要
Abstract:The field of Optical Music Recognition (OMR) is currently hindered by the scarcity of real annotated data, particularly when dealing with handwritten historical musical scores. In similar fields, such as Handwritten Text Recognition, it was proven that synthetic examples produced with image generation techniques could help to train better-performing recognition architectures. This study explores the generation of realistic, handwritten-looking scores by implementing a music symbol-level Generative Adversarial Network (GAN) and assembling its output into a full score using the Smashcima engraving software. We have systematically evaluated the visual fidelity of these generated samples, concluding that the generated symbols exhibit a high degree of realism, marking significant progress in synthetic score generation.
101. 【2510.17866】MUSE: Model-based Uncertainty-aware Similarity Estimation for zero-shot 2D Object Detection and Segmentation
链接:https://arxiv.org/abs/2510.17866
作者:Sungmin Cho,Sungbum Park,Insoo Oh
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Uncertainty-aware Similarity Estimation, Model-based Uncertainty-aware Similarity, training-free framework designed, Similarity Estimation, Model-based Uncertainty-aware
备注: 11 pages with 6 figures
点击查看摘要
Abstract:In this work, we introduce MUSE (Model-based Uncertainty-aware Similarity Estimation), a training-free framework designed for model-based zero-shot 2D object detection and segmentation. MUSE leverages 2D multi-view templates rendered from 3D unseen objects and 2D object proposals extracted from input query images. In the embedding stage, it integrates class and patch embeddings, where the patch embeddings are normalized using generalized mean pooling (GeM) to capture both global and local representations efficiently. During the matching stage, MUSE employs a joint similarity metric that combines absolute and relative similarity scores, enhancing the robustness of matching under challenging scenarios. Finally, the similarity score is refined through an uncertainty-aware object prior that adjusts for proposal reliability. Without any additional training or fine-tuning, MUSE achieves state-of-the-art performance on the BOP Challenge 2025, ranking first across the Classic Core, H3, and Industrial tracks. These results demonstrate that MUSE offers a powerful and generalizable framework for zero-shot 2D object detection and segmentation.
102. 【2510.17864】InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation
链接:https://arxiv.org/abs/2510.17864
作者:Jungmin Lee,Seonghyuk Hong,Juyong Lee,Jaeyoon Lee,Jongwon Choi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:high-fidelity RGB surface, RGB surface details, subsurface X-ray structures, X-ray radiative Gaussian, RGB and X-ray
备注: Published at ICCV 2025
点击查看摘要
Abstract:We introduce InsideOut, an extension of 3D Gaussian splatting (3DGS) that bridges the gap between high-fidelity RGB surface details and subsurface X-ray structures. The fusion of RGB and X-ray imaging is invaluable in fields such as medical diagnostics, cultural heritage restoration, and manufacturing. We collect new paired RGB and X-ray data, perform hierarchical fitting to align RGB and X-ray radiative Gaussian splats, and propose an X-ray reference loss to ensure consistent internal structures. InsideOut effectively addresses the challenges posed by disparate data representations between the two modalities and limited paired datasets. This approach significantly extends the applicability of 3DGS, enhancing visualization, simulation, and non-destructive testing capabilities across various domains.
103. 【2510.17863】Robotic Classification of Divers' Swimming States using Visual Pose Keypoints as IMUs
链接:https://arxiv.org/abs/2510.17863
作者:Demetrious T. Kutzke,Ying-Kun Wu,Elizabeth Terveen,Junaed Sattar
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Traditional human activity, inertial measurement units, direct image analysis, wearable inertial measurement, challenging underwater environments
备注:
点击查看摘要
Abstract:Traditional human activity recognition uses either direct image analysis or data from wearable inertial measurement units (IMUs), but can be ineffective in challenging underwater environments. We introduce a novel hybrid approach that bridges this gap to monitor scuba diver safety. Our method leverages computer vision to generate high-fidelity motion data, effectively creating a ``pseudo-IMU'' from a stream of 3D human joint keypoints. This technique circumvents the critical problem of wireless signal attenuation in water, which plagues conventional diver-worn sensors communicating with an Autonomous Underwater Vehicle (AUV). We apply this system to the vital task of identifying anomalous scuba diver behavior that signals the onset of a medical emergency such as cardiac arrest -- a leading cause of scuba diving fatalities. By integrating our classifier onboard an AUV and conducting experiments with simulated distress scenarios, we demonstrate the utility and effectiveness of our method for advancing robotic monitoring and diver safety.
104. 【2510.17860】DMTrack: Deformable State-Space Modeling for UAV Multi-Object Tracking with Kalman Fusion and Uncertainty-Aware Association
链接:https://arxiv.org/abs/2510.17860
作者:Zenghuang Fu,Xiaofeng Han,Mingda Jia,Jin ming Yang,Qi Zeng,Muyang Zahng,Changwei Wang,Weiliang Meng,Xiaopeng Zhang
类目:ystems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV)
关键词:unmanned aerial vehicles, presents unique challenges, unique challenges due, unpredictable object motion, limited appearance cues
备注:
点击查看摘要
Abstract:Multi-object tracking (MOT) from unmanned aerial vehicles (UAVs) presents unique challenges due to unpredictable object motion, frequent occlusions, and limited appearance cues inherent to aerial viewpoints. These issues are further exacerbated by abrupt UAV movements, leading to unreliable trajectory estimation and identity switches. Conventional motion models, such as Kalman filters or static sequence encoders, often fall short in capturing both linear and non-linear dynamics under such conditions. To tackle these limitations, we propose DMTrack, a deformable motion tracking framework tailored for UAV-based MOT. Our DMTrack introduces three key components: DeformMamba, a deformable state-space predictor that dynamically aggregates historical motion states for adaptive trajectory modeling; MotionGate, a lightweight gating module that fuses Kalman and Mamba predictions based on motion context and uncertainty; and an uncertainty-aware association strategy that enhances identity preservation by aligning motion trends with prediction confidence. Extensive experiments on the VisDrone-MOT and UAVDT benchmarks demonstrate that our DMTrack achieves state-of-the-art performance in identity consistency and tracking accuracy, particularly under high-speed and non-linear motion. Importantly, our method operates without appearance models and maintains competitive efficiency, highlighting its practicality for robust UAV-based tracking.
105. 【2510.17858】Shortcutting Pre-trained Flow Matching Diffusion Models is Almost Free Lunch
链接:https://arxiv.org/abs/2510.17858
作者:Xu Cai,Yang Wu,Qianli Chen,Haoran Wu,Lichuan Xiang,Hongkai Wen
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:shortcutting large-scale pre-trained, large-scale pre-trained flow, ultra-efficient post-training method, pre-trained flow matching, velocity field self-distillation
备注: NeurIPS 2025
点击查看摘要
Abstract:We present an ultra-efficient post-training method for shortcutting large-scale pre-trained flow matching diffusion models into efficient few-step samplers, enabled by novel velocity field self-distillation. While shortcutting in flow matching, originally introduced by shortcut models, offers flexible trajectory-skipping capabilities, it requires a specialized step-size embedding incompatible with existing models unless retraining from scratch$\unicode{x2013}$a process nearly as costly as pretraining itself. Our key contribution is thus imparting a more aggressive shortcut mechanism to standard flow matching models (e.g., Flux), leveraging a unique distillation principle that obviates the need for step-size embedding. Working on the velocity field rather than sample space and learning rapidly from self-guided distillation in an online manner, our approach trains efficiently, e.g., producing a 3-step Flux less than one A100 day. Beyond distillation, our method can be incorporated into the pretraining stage itself, yielding models that inherently learn efficient, few-step flows without compromising quality. This capability also enables, to our knowledge, the first few-shot distillation method (e.g., 10 text-image pairs) for dozen-billion-parameter diffusion models, delivering state-of-the-art performance at almost free cost.
Comments:
NeurIPS 2025
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:
arXiv:2510.17858 [cs.CV]
(or
arXiv:2510.17858v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2510.17858
Focus to learn more
arXiv-issued DOI via DataCite</p>
106. 【2510.17855】CMIS-Net: A Cascaded Multi-Scale Individual Standardization Network for Backchannel Agreement Estimation
链接:https://arxiv.org/abs/2510.17855
作者:Yuxuan Huang,Kangzhong Wang,Eugene Yujun Fu,Grace Ngai,Peter H.F. Ng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:subtle listener responses, short verbal cues, subtle listener, short verbal, convey understanding
备注:
点击查看摘要
Abstract:Backchannels are subtle listener responses, such as nods, smiles, or short verbal cues like "yes" or "uh-huh," which convey understanding and agreement in conversations. These signals provide feedback to speakers, improve the smoothness of interaction, and play a crucial role in developing human-like, responsive AI systems. However, the expression of backchannel behaviors is often significantly influenced by individual differences, operating across multiple scales: from instant dynamics such as response intensity (frame-level) to temporal patterns such as frequency and rhythm preferences (sequence-level). This presents a complex pattern recognition problem that contemporary emotion recognition methods have yet to fully address. Particularly, existing individualized methods in emotion recognition often operate at a single scale, overlooking the complementary nature of multi-scale behavioral cues. To address these challenges, we propose a novel Cascaded Multi-Scale Individual Standardization Network (CMIS-Net) that extracts individual-normalized backchannel features by removing person-specific neutral baselines from observed expressions. Operating at both frame and sequence levels, this normalization allows model to focus on relative changes from each person's baseline rather than absolute expression values. Furthermore, we introduce an implicit data augmentation module to address the observed training data distributional bias, improving model generalization. Comprehensive experiments and visualizations demonstrate that CMIS-Net effectively handles individual differences and data imbalance, achieving state-of-the-art performance in backchannel agreement detection.
107. 【2510.17854】Provenance of AI-Generated Images: A Vector Similarity and Blockchain-based Approach
链接:https://arxiv.org/abs/2510.17854
作者:Jitendra Sharma,Arthur Carvalho,Suman Bhunia
类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
关键词:Rapid advancement, contextually relevant digital, large language models, relevant digital content, Stable Diffusion techniques
备注:
点击查看摘要
Abstract:Rapid advancement in generative AI and large language models (LLMs) has enabled the generation of highly realistic and contextually relevant digital content. LLMs such as ChatGPT with DALL-E integration and Stable Diffusion techniques can produce images that are often indistinguishable from those created by humans, which poses challenges for digital content authentication. Verifying the integrity and origin of digital data to ensure it remains unaltered and genuine is crucial to maintaining trust and legality in digital media. In this paper, we propose an embedding-based AI image detection framework that utilizes image embeddings and a vector similarity to distinguish AI-generated images from real (human-created) ones. Our methodology is built on the hypothesis that AI-generated images demonstrate closer embedding proximity to other AI-generated content, while human-created images cluster similarly within their domain. To validate this hypothesis, we developed a system that processes a diverse dataset of AI and human-generated images through five benchmark embedding models. Extensive experimentation demonstrates the robustness of our approach, and our results confirm that moderate to high perturbations minimally impact the embedding signatures, with perturbed images maintaining close similarity matches to their original versions. Our solution provides a generalizable framework for AI-generated image detection that balances accuracy with computational efficiency.
108. 【2510.17851】Pre to Post-Treatment Glioblastoma MRI Prediction using a Latent Diffusion Model
链接:https://arxiv.org/abs/2510.17851
作者:Alexandre G. Leclercq,Sébastien Bougleux,Noémie N. Moreau,Alexis Desmonts,Romain Hérault,Aurélien Corroyer-Dulmont
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:aggressive primary brain, primary brain tumor, aggressive primary, primary brain, MRI
备注: 10 pages, 4 figures. Presented to the Deep Generative Models Workshop of MICCAI (DGM4MICCAI)
点击查看摘要
Abstract:Glioblastoma (GBM) is an aggressive primary brain tumor with a median survival of approximately 15 months. In clinical practice, the Stupp protocol serves as the standard first-line treatment. However, patients exhibit highly heterogeneous therapeutic responses which required at least two months before first visual impact can be observed, typically with MRI. Early prediction treatment response is crucial for advancing personalized medicine. Disease Progression Modeling (DPM) aims to capture the trajectory of disease evolution, while Treatment Response Prediction (TRP) focuses on assessing the impact of therapeutic interventions. Whereas most TRP approaches primarly rely on timeseries data, we consider the problem of early visual TRP as a slice-to-slice translation model generating post-treatment MRI from a pre-treatment MRI, thus reflecting the tumor evolution. To address this problem we propose a Latent Diffusion Model with a concatenation-based conditioning from the pre-treatment MRI and the tumor localization, and a classifier-free guidance to enhance generation quality using survival information, in particular post-treatment tumor evolution. Our model were trained and tested on a local dataset consisting of 140 GBM patients collected at Centre François Baclesse. For each patient we collected pre and post T1-Gd MRI, tumor localization manually delineated in the pre-treatment MRI by medical experts, and survival information.
109. 【2510.17847】CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization
链接:https://arxiv.org/abs/2510.17847
作者:Yichen Yan,Ming Zhong,Qi Zhu,Xiaoling Gu,Jinpeng Chen,Huan Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal large language, Multimodal large, large-scale datasets remains, large language models, language capabilities
备注: 22 pages, 8 figures, 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
点击查看摘要
Abstract:Multimodal large language models (MLLMs) rely heavily on instruction tuning to align vision and language capabilities, yet the computational cost of training on large-scale datasets remains a major bottleneck. Existing data selection methods aim to mitigate this by selecting important and diverse subsets, but they often suffer from two critical drawbacks: high computational overhead from processing the entire dataset and suboptimal data selection due to separate treatment of importance and diversity. We introduce CoIDO, a novel dual-objective framework that jointly optimizes data importance and diversity to overcome these challenges. Unlike existing approaches that require costly evaluations across the whole dataset, CoIDO employs a lightweight plug-in scorer. This scorer is trained on just a small random sample of data to learn the distribution of the candidate set, drastically reducing computational demands. By leveraging a homoscedastic uncertainty-based formulation, CoIDO effectively balances importance and diversity during training, enabling efficient and scalable data selection. In our experiments, we trained the CoIDO scorer using only 20 percent of randomly sampled data. Once trained, CoIDO was applied to the entire dataset to select a 20 percent subset for instruction tuning. On the widely used LLaVA-1.5-7B model across ten downstream tasks, this selected subset achieved an impressive 98.2 percent of the performance of full-data fine-tuning, on average.
Comments:
22 pages, 8 figures, 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2510.17847 [cs.CV]
(or
arXiv:2510.17847v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2510.17847
Focus to learn more
arXiv-issued DOI via DataCite</p>
110. 【2510.17845】MAT-Agent: Adaptive Multi-Agent Training Optimization
链接:https://arxiv.org/abs/2510.17845
作者:Jusheng Zhang,Kaitong Cai,Yijia Fan,Ningyuan Liu,Keze Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Multi-label image classification, evolving visual-semantic landscapes, image classification demands, conventional methods rely, Multi-label image
备注: Acceptance to NeurIPS 2025 Main Track
点击查看摘要
Abstract:Multi-label image classification demands adaptive training strategies to navigate complex, evolving visual-semantic landscapes, yet conventional methods rely on static configurations that falter in dynamic settings. We propose MAT-Agent, a novel multi-agent framework that reimagines training as a collaborative, real-time optimization process. By deploying autonomous agents to dynamically tune data augmentation, optimizers, learning rates, and loss functions, MAT-Agent leverages non-stationary multi-armed bandit algorithms to balance exploration and exploitation, guided by a composite reward harmonizing accuracy, rare-class performance, and training stability. Enhanced with dual-rate exponential moving average smoothing and mixed-precision training, it ensures robustness and efficiency. Extensive experiments across Pascal VOC, COCO, and VG-256 demonstrate MAT-Agent's superiority: it achieves an mAP of 97.4 (vs. 96.2 for PAT-T), OF1 of 92.3, and CF1 of 91.4 on Pascal VOC; an mAP of 92.8 (vs. 92.0 for HSQ-CvN), OF1 of 88.2, and CF1 of 87.1 on COCO; and an mAP of 60.9, OF1 of 70.8, and CF1 of 61.1 on VG-256. With accelerated convergence and robust cross-domain generalization, MAT-Agent offers a scalable, intelligent solution for optimizing complex visual models, paving the way for adaptive deep learning advancements.
111. 【2510.17801】Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain
链接:https://arxiv.org/abs/2510.17801
作者:Yulin Luo,Chun-Kai Fan,Menghang Dong,Jiayu Shi,Mengdi Zhao,Bo-Wen Zhang,Cheng Chi,Jiaming Liu,Gaole Dai,Rongyu Zhang,Ruichuan An,Kun Wu,Zhengping Che,Shaoxuan Xie,Guocai Yao,Zhongxia Zhao,Pengwei Wang,Guang Liu,Zhongyuan Wang,Tiejun Huang,Shanghang Zhang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:unstructured environments remains, Building robots, act in dynamic, unstructured environments, environments remains
备注:
点击查看摘要
Abstract:Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential. Yet existing benchmarks emphasize execution success, or when targeting high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the critical roles across the full manipulation pipeline, RoboBench defines five dimensions-instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis-spanning 14 capabilities, 25 tasks, and 6092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, and multi-view scenes, drawing from large-scale real robotic data. For planning, RoboBench introduces an evaluation framework, MLLM-as-world-simulator. It evaluate embodied feasibility by simulating whether predicted plans can achieve critical object-state changes. Experiments on 14 MLLMs reveal fundamental limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis. RoboBench provides a comprehensive scaffold to quantify high-level cognition, and guide the development of next-generation embodied MLLMs. The project page is in this https URL.
112. 【1907.00330】Visual Space Optimization for Zero-shot Learning
链接:https://arxiv.org/abs/1907.00330
作者:Xinsheng Wang,Shanmin Pang,Jihua Zhu,Zhongyu Li,Zhiqiang Tian,Yaochen Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:gained popularity owing, visual space, visual, Zero-shot learning, embedding space
备注:
点击查看摘要
Abstract:Zero-shot learning, which aims to recognize new categories that are not included in the training set, has gained popularity owing to its potential ability in the real-word applications. Zero-shot learning models rely on learning an embedding space, where both semantic descriptions of classes and visual features of instances can be embedded for nearest neighbor search. Recently, most of the existing works consider the visual space formulated by deep visual features as an ideal choice of the embedding space. However, the discrete distribution of instances in the visual space makes the data structure unremarkable. We argue that optimizing the visual space is crucial as it allows semantic vectors to be embedded into the visual space more effectively. In this work, we propose two strategies to accomplish this purpose. One is the visual prototype based method, which learns a visual prototype for each visual class, so that, in the visual space, a class can be represented by a prototype feature instead of a series of discrete visual features. The other is to optimize the visual feature structure in an intermediate embedding space, and in this method we successfully devise a multilayer perceptron framework based algorithm that is able to learn the common intermediate embedding space and meanwhile to make the visual data structure more distinctive. Through extensive experimental evaluation on four benchmark datasets, we demonstrate that optimizing visual space is beneficial for zero-shot learning. Besides, the proposed prototype based method achieves the new state-of-the-art performance.
113. 【2510.18218】DualHash: A Stochastic Primal-Dual Algorithm with Theoretical Guarantee for Deep Hashing
链接:https://arxiv.org/abs/2510.18218
作者:Luxuan Li,Xiao Wang,Chunfeng Cui
类目:Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)
关键词:enabling efficient large-scale, converts high-dimensional feature, high-dimensional feature vectors, Deep hashing converts, hashing converts high-dimensional
备注:
点击查看摘要
Abstract:Deep hashing converts high-dimensional feature vectors into compact binary codes, enabling efficient large-scale retrieval. A fundamental challenge in deep hashing stems from the discrete nature of quantization in generating the codes. W-type regularizations, such as $||z|-1|$, have been proven effective as they encourage variables toward binary values. However, existing methods often directly optimize these regularizations without convergence guarantees. While proximal gradient methods offer a promising solution, the coupling between W-type regularizers and neural network outputs results in composite forms that generally lack closed-form proximal solutions. In this paper, we present a stochastic primal-dual hashing algorithm, referred to as DualHash, that provides rigorous complexity bounds. Using Fenchel duality, we partially transform the nonconvex W-type regularization optimization into the dual space, which results in a proximal operator that admits closed-form solutions. We derive two algorithm instances: a momentum-accelerated version with $\mathcal{O}(\varepsilon^{-4})$ complexity and an improved $\mathcal{O}(\varepsilon^{-3})$ version using variance reduction. Experiments on three image retrieval databases demonstrate the superior performance of DualHash.
114. 【2510.17897】Conformal Lesion Segmentation for 3D Medical Images
链接:https://arxiv.org/abs/2510.17897
作者:Binyu Tan,Zhiyuan Wang,Jinhao Duan,Kaidi Xu,Heng Tao Shen,Xiaoshuang Shi,Fumin Shen
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:Medical image segmentation, enabling accurate localization, Medical image, image segmentation serves, precision medicine
备注:
点击查看摘要
Abstract:Medical image segmentation serves as a critical component of precision medicine, enabling accurate localization and delineation of pathological regions, such as lesions. However, existing models empirically apply fixed thresholds (e.g., 0.5) to differentiate lesions from the background, offering no statistical guarantees on key metrics such as the false negative rate (FNR). This lack of principled risk control undermines their reliable deployment in high-stakes clinical applications, especially in challenging scenarios like 3D lesion segmentation (3D-LS). To address this issue, we propose a risk-constrained framework, termed Conformal Lesion Segmentation (CLS), that calibrates data-driven thresholds via conformalization to ensure the test-time FNR remains below a target tolerance $\varepsilon$ under desired risk levels. CLS begins by holding out a calibration set to analyze the threshold setting for each sample under the FNR tolerance, drawing on the idea of conformal prediction. We define an FNR-specific loss function and identify the critical threshold at which each calibration data point just satisfies the target tolerance. Given a user-specified risk level $\alpha$, we then determine the approximate $1-\alpha$ quantile of all the critical thresholds in the calibration set as the test-time confidence threshold. By conformalizing such critical thresholds, CLS generalizes the statistical regularities observed in the calibration set to new test data, providing rigorous FNR constraint while yielding more precise and reliable segmentations. We validate the statistical soundness and predictive performance of CLS on six 3D-LS datasets across five backbone models, and conclude with actionable insights for deploying risk-aware segmentation in clinical practice.
115. 【2510.17816】Cross-Domain Multi-Person Human Activity Recognition via Near-Field Wi-Fi Sensing
链接:https://arxiv.org/abs/2510.17816
作者:Xin Li,Jingzhi Hu,Yinghui He,Hongbo Wang,Jin Gan,Jun Luo
类目:ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
关键词:thriving research field, coarse spatial resolution, spatial resolution inherent, distinguish multiple subjects, Wi-Fi significantly hinders
备注:
点击查看摘要
Abstract:Wi-Fi-based human activity recognition (HAR) provides substantial convenience and has emerged as a thriving research field, yet the coarse spatial resolution inherent to Wi-Fi significantly hinders its ability to distinguish multiple subjects. By exploiting the near-field domination effect, establishing a dedicated sensing link for each subject through their personal Wi-Fi device offers a promising solution for multi-person HAR under native traffic. However, due to the subject-specific characteristics and irregular patterns of near-field signals, HAR neural network models require fine-tuning (FT) for cross-domain adaptation, which becomes particularly challenging with certain categories unavailable. In this paper, we propose WiAnchor, a novel training framework for efficient cross-domain adaptation in the presence of incomplete activity categories. This framework processes Wi-Fi signals embedded with irregular time information in three steps: during pre-training, we enlarge inter-class feature margins to enhance the separability of activities; in the FT stage, we innovate an anchor matching mechanism for cross-domain adaptation, filtering subject-specific interference informed by incomplete activity categories, rather than attempting to extract complete features from them; finally, the recognition of input samples is further improved based on their feature-level similarity with anchors. We construct a comprehensive dataset to thoroughly evaluate WiAnchor, achieving over 90% cross-domain accuracy with absent activity categories.