本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新543篇论文,其中:

  • 自然语言处理98
  • 信息检索13
  • 计算机视觉111

自然语言处理

1. 【2510.13804】Generative Universal Verifier as Multimodal Meta-Reasoner

链接https://arxiv.org/abs/2510.13804

作者:Xinchen Zhang,Xiaoying Zhang,Youbin Wu,Yanbin Cao,Renrui Zhang,Ruihang Chu,Ling Yang,Yujiu Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:visual verification, introduce Generative Universal, reliable visual verification, providing the fundamental, visual outcomes

备注

点击查看摘要

Abstract:We introduce Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification and achieves notable gains on ViVerBench(+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench(+3.7), and GenEval++(+4.3), outperforming existing parallel test-time scaling methods, such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.

2. 【2510.13799】BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning

链接https://arxiv.org/abs/2510.13799

作者:Jia-Chen Gu,Junyi Zhang,Di Wu,Yuankai Li,Kai-Wei Chang,Nanyun Peng

类目:Computation and Language (cs.CL)

关键词:tackles complex tasks, increased cognitive load, increasingly expanded contexts, offer richer information, retrieval-augmented generation

备注: Code and data: [this https URL](https://github.com/JasonForJoy/BRIEF)

点击查看摘要

Abstract:As retrieval-augmented generation (RAG) tackles complex tasks, increasingly expanded contexts offer richer information, but at the cost of higher latency and increased cognitive load on the model. To mitigate this bottleneck, especially for intricate multi-hop questions, we introduce BRIEF-Pro. It is a universal, lightweight compressor that distills relevant evidence for a given query from retrieved documents into a concise summary for seamless integration into in-context RAG. Using seed data consisting of relatively short contexts (fewer than 1k words), BRIEF-Pro is trained to perform abstractive compression of extended contexts exceeding 10k words across a wide range of scenarios. Furthermore, BRIEF-Pro offers flexible user control over summary length by allowing users to specify the desired number of sentences. Experiments on four open-domain multi-hop question-answering datasets show that BRIEF-Pro generates more concise and relevant summaries, enhancing performance across small, large, and proprietary language models. With the 70B reader model, 32x compression by BRIEF-Pro improves QA performance by 4.67% on average over LongLLMLingua's 9x, while requiring only 23% of its computational overhead.

3. 【2510.13797】Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons

链接https://arxiv.org/abs/2510.13797

作者:Giovanni Monea,Yair Feldman,Shankar Padmanabhan,Kianté Brantley,Yoav Artzi

类目:Computation and Language (cs.CL)

关键词:incurs significant memory, Transformer key-value cache, Transformer key-value, large language models, computational costs

备注

点击查看摘要

Abstract:The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method minimizes overhead over the conventional RL process, as it leverages RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.

4. 【2510.13796】he Mechanistic Emergence of Symbol Grounding in Language Models

链接https://arxiv.org/abs/2510.13796

作者:Shuyu Wu,Ziqiao Ma,Xiaoxi Luo,Yidong Huang,Josue Torres-Fonseca,Freda Shi,Joyce Chai

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:real-world sensorimotor experiences, sensorimotor experiences, words acquire, acquire their meanings, meanings by connecting

备注

点击查看摘要

Abstract:Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.

5. 【2510.13750】Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation

链接https://arxiv.org/abs/2510.13750

作者:Zhiqi Huang,Vivek Datla,Chenyang Zhu,Alfy Samuel,Daben Liu,Anoop Kumar,Ritesh Soni

类目:Computation and Language (cs.CL)

关键词:large language model, retrieval-augmented generation, systems that aligns, aligns closely, correctness of large

备注: UncertaiNLP at EMNLP 2025

点击查看摘要

Abstract:We propose a method for confidence estimation in retrieval-augmented generation (RAG) systems that aligns closely with the correctness of large language model (LLM) outputs. Confidence estimation is especially critical in high-stakes domains such as finance and healthcare, where the cost of an incorrect answer outweighs that of not answering the question. Our approach extends prior uncertainty quantification methods by leveraging raw feed-forward network (FFN) activations as auto-regressive signals, avoiding the information loss inherent in token logits and probabilities after projection and softmax normalization. We model confidence prediction as a sequence classification task, and regularize training with a Huber loss term to improve robustness against noisy supervision. Applied in a real-world financial industry customer-support setting with complex knowledge bases, our method outperforms strong baselines and maintains high accuracy under strict latency constraints. Experiments on Llama 3.1 8B model show that using activations from only the 16th layer preserves accuracy while reducing response latency. Our results demonstrate that activation-based confidence modeling offers a scalable, architecture-aware path toward trustworthy RAG deployment.

6. 【2510.13749】Assessing Web Search Credibility and Response Groundedness in Chat Assistants

链接https://arxiv.org/abs/2510.13749

作者:Ivan Vykopal,Matúš Pikuliak,Simon Ostermann,Marián Šimko

类目:Computation and Language (cs.CL)

关键词:increasingly integrate web, cite external sources, web search functionality, assistants increasingly integrate, integrate web search

备注

点击查看摘要

Abstract:Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also raises the risk of amplifying misinformation from low-credibility sources. In this paper, we introduce a novel methodology for evaluating assistants' web search behavior, focusing on source credibility and the groundedness of responses with respect to cited sources. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat. Our findings reveal differences between the assistants, with Perplexity achieving the highest source credibility, whereas GPT-4o exhibits elevated citation of non-credibility sources on sensitive topics. This work provides the first systematic comparison of commonly used chat assistants for fact-checking behavior, offering a foundation for evaluating AI systems in high-stakes information environments.

7. 【2510.13744】Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math

链接https://arxiv.org/abs/2510.13744

作者:Shrey Pandit,Austin Xu,Xuan-Phi Nguyen,Yifei Ming,Caiming Xiong,Shafiq Joty

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language model, based reasoning systems, writing mathematical proofs, receive full credit, recently achieved gold

备注: 21 pages, 8 figures, 5 tables

点击查看摘要

Abstract:Large language model (LLM)-based reasoning systems have recently achieved gold medal-level performance in the IMO 2025 competition, writing mathematical proofs where, to receive full credit, each step must be not only correct but also sufficiently supported. To train LLM-based reasoners in such challenging, open-ended settings, strong verifiers capable of catching step-level mistakes are necessary prerequisites. We introduce Hard2Verify, a human-annotated, step-level verification benchmark produced with over 500 hours of human labor. Hard2Verify is designed to rigorously assess step-level verifiers at the frontier: Verifiers must provide step-level annotations or identify the first error in responses generated by frontier LLMs for very recent, challenging, and open-ended math questions. We evaluate 29 generative critics and process reward models, demonstrating that, beyond a few standouts, open-source verifiers lag closed source models. We subsequently analyze what drives poor performance in step-level verification, the impacts of scaling verifier compute, as well as fundamental questions such as self-verification and verification-generation dynamics.

8. 【2510.13734】GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

链接https://arxiv.org/abs/2510.13734

作者:Xiuyuan Chen,Tao Sun,Dexin Su,Ailing Yu,Junwei Liu,Zhe Chen,Gangzeng Jin,Xin Wang,Jingnan Liu,Hansong Xiao,Hualei Zhou,Dongjie Tao,Chunxiao Guo,Minghui Yang,Yuan Xia,Jing Zhao,Qianrui Fan,Yanyun Wang,Shuai Zhen,Kezhong Chen,Jun Wang,Zewen Sun,Heng Zhao,Tian Guan,Shaodong Wang,Geyun Chang,Jiaming Deng,Hongchengcheng Chen,Kexin Feng,Ruzhen Li,Jiayi Geng,Changtai Zhao,Jun Wang,Guihu Lin,Peihao Li,Liqi Liu,Peng Wei,Jian Wang,Jinjie Gu,Ping Wang,Fan Yang

类目:Computation and Language (cs.CL)

关键词:Current benchmarks, fail to capture, textbf, based on multiple-choice, multiple-choice exams

备注

点击查看摘要

Abstract:Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating \textbf{G}rounding (cognitive depth), \textbf{A}dequacy (answer completeness), \textbf{P}erturbation (robustness), and \textbf{S}afety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed our automated questions are high-quality and align with clinician judgment. Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increased reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically-grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice.

9. 【2510.13721】NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

链接https://arxiv.org/abs/2510.13721

作者:Run Luo,Xiaobo Xia,Lu Wang,Longze Chen,Renke Shan,Jing Luo,Min Yang,Tat-Seng Chua

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:general intelligence systems, artificial general intelligence, Next-generation multimodal foundation, intelligence systems, playing a pivotal

备注

点击查看摘要

Abstract:Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.

10. 【2510.13681】How Sampling Affects the Detectability of Machine-written texts: A Comprehensive Study

链接https://arxiv.org/abs/2510.13681

作者:Matthieu Dubois,François Yvon,Pablo Piantanida

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, attracted growing attention, generated by Large, Language Models

备注: EMNLP 2025 Findings

点击查看摘要

Abstract:As texts generated by Large Language Models (LLMs) are ever more common and often indistinguishable from human-written content, research on automatic text detection has attracted growing attention. Many recent detectors report near-perfect accuracy, often boasting AUROC scores above 99\%. However, these claims typically assume fixed generation settings, leaving open the question of how robust such systems are to changes in decoding strategies. In this work, we systematically examine how sampling-based decoding impacts detectability, with a focus on how subtle variations in a model's (sub)word-level distribution affect detection performance. We find that even minor adjustments to decoding parameters - such as temperature, top-p, or nucleus sampling - can severely impair detector accuracy, with AUROC dropping from near-perfect levels to 1\% in some settings. Our findings expose critical blind spots in current detection methods and emphasize the need for more comprehensive evaluation protocols. To facilitate future research, we release a large-scale dataset encompassing 37 decoding configurations, along with our code and evaluation framework this https URL

11. 【2510.13632】Closing the Gap Between Text and Speech Understanding in LLMs

链接https://arxiv.org/abs/2510.13632

作者:Santiago Cuervo,Skyler Seto,Maureen de Seyssel,Richard He Bai,Zijin Gu,Tatiana Likhomanenko,Navdeep Jaitly,Zakaria Aldeneh

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

关键词:Large Language Models, Large Language, adapted to extend, Large, language understanding

备注

点击查看摘要

Abstract:Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation--which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.

12. 【2510.13626】LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

链接https://arxiv.org/abs/2510.13626

作者:Senyu Fei,Siyin Wang,Junhao Shi,Zihao Dai,Jikun Cai,Pengfang Qian,Li Ji,Xinzhe He,Shiduo Zhang,Zhaoye Fei,Jinlan Fu,Jingjing Gong,Xipeng Qiu

类目:Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:report impressive success, impressive success rates, mask fundamental weaknesses, robotic manipulation benchmarks, models report impressive

备注

点击查看摘要

Abstract:Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.

13. 【2510.13624】Unlocking Public Catalogues: Instruction-Tuning LLMs for ICD Coding of German Tumor Diagnoses

链接https://arxiv.org/abs/2510.13624

作者:Stefan Lenz,Lakisha Ortiz Rosario,Georg Vollmar,Arsenij Ustjanzew,Fatma Alickovic,Thomas Kindler,Torsten Panholzer

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Accurate coding, structured cancer documentation, essential for structured, structured cancer, accuracy

备注: 19 pages, 4 figures

点击查看摘要

Abstract:Accurate coding of tumor diagnoses with ICD-10-GM and ICD-O-3 is essential for structured cancer documentation in Germany. Smaller open-weight LLMs are appealing for privacy-preserving automation but often struggle with coding accuracy in German-language contexts. This study investigates whether instruction-based fine-tuning on public datasets improves the coding accuracy of open-weight LLMs for German tumor diagnosis texts. The evaluation uses coded diagnoses from the local tumor documentation system as test data. In a systematic data quality assessment, the upper limit for ICD-10 coding performance was estimated at 60-79% for exact and 81-94% for partial (three-character codes only) derivation. As training data, over 500,000 question-answer pairs were created based on the ICD-10-GM, ICD-O-3, and OPS catalogues. Eight open-weight models from the Qwen, Llama, and Mistral families (7-70 B parameters) were fine-tuned. ICD-10-GM accuracy rose from 1.4-24% to 41-58%, and partial accuracy from 31-74% to 73-83%. The accuracy of ICD-O-3 topography coding also improved but started and remained considerably lower with an exact accuracy of 22-40% and a partial accuracy of 56-67% after fine-tuning. Malformed code outputs dropped to 0% for all models. Tumor-diagnosis recognition reached 99%. Accuracy correlated positively with model size, but gaps between small and large models narrowed after fine-tuning. The reasoning mode in Qwen3 generally yielded a lower performance than fine-tuning and was over 100 times slower. Our findings highlight the potential of leveraging public catalogues to build instruction datasets that improve LLMs in medical documentation tasks. The complete training dataset and the best-performing checkpoints of the fine-tuned models are available from this https URL.

14. 【2510.13614】MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning

链接https://arxiv.org/abs/2510.13614

作者:Xingyu Tan,Xiaoyang Wang,Qing Liu,Xiwei Xu,Xin Yuan,Liming Zhu,Wenjie Zhang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, evolving event sequences, impressive reasoning abilities, Language Models

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive reasoning abilities, but struggle with temporal understanding, especially when questions involve multiple entities, compound operators, and evolving event sequences. Temporal Knowledge Graphs (TKGs), which capture vast amounts of temporal facts in a structured format, offer a reliable source for temporal reasoning. However, existing TKG-based LLM reasoning methods still struggle with four major challenges: maintaining temporal faithfulness in multi-hop reasoning, achieving multi-entity temporal synchronization, adapting retrieval to diverse temporal operators, and reusing prior reasoning experience for stability and efficiency. To address these issues, we propose MemoTime, a memory-augmented temporal knowledge graph framework that enhances LLM reasoning through structured grounding, recursive reasoning, and continual experience learning. MemoTime decomposes complex temporal questions into a hierarchical Tree of Time, enabling operator-aware reasoning that enforces monotonic timestamps and co-constrains multiple entities under unified temporal bounds. A dynamic evidence retrieval layer adaptively selects operator-specific retrieval strategies, while a self-evolving experience memory stores verified reasoning traces, toolkit decisions, and sub-question embeddings for cross-type reuse. Comprehensive experiments on multiple temporal QA benchmarks show that MemoTime achieves overall state-of-the-art results, outperforming the strong baseline by up to 24.0%. Furthermore, MemoTime enables smaller models (e.g., Qwen3-4B) to achieve reasoning performance comparable to that of GPT-4-Turbo.

15. 【2510.13602】NOSA: Native and Offloadable Sparse Attention

链接https://arxiv.org/abs/2510.13602

作者:Yuxiang Huang,Chaojun Xiao,Xu Han,Zhiyuan Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:significantly saving memory, Trainable sparse attention, saving memory accesses, minimally impacting task, decoding efficiency bottleneck

备注: Preprint

点击查看摘要

Abstract:Trainable sparse attention has emerged as a promising solution to address the decoding efficiency bottleneck of LLMs in long-context processing, significantly saving memory accesses while minimally impacting task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache remains unreduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, thereby enabling KV cache offloading without altering the underlying attention computation. However, the inherent locality remains insufficient to achieve efficient offloading, as the transfer of selected KV pairs between the CPU and GPU continues to dominate the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation as used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput compared with the vanilla trainable sparse attention baseline (InfLLM-V2).

16. 【2510.13598】FreshTab: Sourcing Fresh Data for Table-to-Text Generation Evaluation

链接https://arxiv.org/abs/2510.13598

作者:Kristýna Onderková,Ondřej Plátek,Zdeněk Kasner,Ondřej Dušek

类目:Computation and Language (cs.CL)

关键词:Large Language Model, task that requires, requires precision, precision in analyzing, Language Model

备注: To be published in INLG 2025

点击查看摘要

Abstract:Table-to-text generation (insight generation from tables) is a challenging task that requires precision in analyzing the data. In addition, the evaluation of existing benchmarks is affected by contamination of Large Language Model (LLM) training data as well as domain imbalance. We introduce FreshTab, an on-the-fly table-to-text benchmark generation from Wikipedia, to combat the LLM data contamination problem and enable domain-sensitive evaluation. While non-English table-to-text datasets are limited, FreshTab collects datasets in different languages on demand (we experiment with German, Russian and French in addition to English). We find that insights generated by LLMs from recent tables collected by our method appear clearly worse by automatic metrics, but this does not translate into LLM and human evaluations. Domain effects are visible in all evaluations, showing that a~domain-balanced benchmark is more challenging.

17. 【2510.13586】Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs

链接https://arxiv.org/abs/2510.13586

作者:Pasin Buakhaw,Kun Kerdthaisong,Phuree Phenhiran,Pitikorn Khlaisamniang,Supasate Vorathammathorn,Piyalitt Ittichaiwong,Nutchanon Yongsatianchot

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:ating dynamic non-player, dynamic non-player characters, persona-consistent dialogue generation, tional task execution, enabling both func

备注

点击查看摘要

Abstract:The emergence of large language models (LLMs) has opened new opportunities for cre- ating dynamic non-player characters (NPCs) in gaming environments, enabling both func- tional task execution and persona-consistent dialogue generation. In this paper, we (Tu_Character_lab) report our participation in the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 Round 2, which eval- uates agents across three tracks: task-oriented dialogue, context-aware dialogue, and their integration. Our approach combines two complementary strategies: (i) lightweight prompting techniques in the API track, including a Deflanderization prompting method to suppress excessive role-play and improve task fidelity, and (ii) fine-tuned large models in the GPU track, leveraging Qwen3-14B with supervisedfinetuning (SFT) and Low-Rank Adaptation(LoRA). Our best submissions ranked 2nd on Task 1, 2nd on Task 3 (API track), and 4th on Task 3 (GPU track).

18. 【2510.13580】Sparse Subnetwork Enhancement for Underrepresented Languages in Large Language Models

链接https://arxiv.org/abs/2510.13580

作者:Daniil Gurgurov,Josef van Genabith,Simon Ostermann

类目:Computation and Language (cs.CL)

关键词:exhibit uneven performance, Large language models, models exhibit uneven, Large language, Activation Probability Entropy

备注: preprint

点击查看摘要

Abstract:Large language models exhibit uneven performance across languages, with substantial gaps between high- and low-resource languages. We present a framework for enhancing monolingual capabilities of LLMs in underrepresented languages while preserving their general-purpose performance through targeted fine-tuning of language-specific subnetworks. Our approach identifies language-specific neurons using Language Activation Probability Entropy and fine-tunes only the weights associated with these neurons, a dedicated subnetwork, on target-language data. Experiments on Llama-3.1-8B and Mistral-Nemo-12B across 12 mid- and low-resource languages demonstrate that our method consistently outperforms full fine-tuning, FFN-only fine-tuning, LoRA adaptation, and random subset fine-tuning baselines while efficiently updating only up to 1% of model parameters. Beyond performance improvements, we observe enhanced favorable training dynamics, cross-lingual representational alignment, and systematic weight update changes. To facilitate future research, we release language-specific neuron identifications for over 100 languages as well as our adaptation pipeline, offering a cost-effective pathway for adapting state-of-the-art models to underrepresented languages.

19. 【2510.13554】Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

链接https://arxiv.org/abs/2510.13554

作者:Yang Li,Zhichen Dong,Yuhan Sun,Weixun Wang,Shaopan Xiong,Yijia Luo,Jiashun Liu,Han Lu,Jiamang Wang,Wenbo Su,Bo Zheng,Junchi Yan

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:typically applies uniform, Large language models, Reinforcement learning, Large language, applies uniform credit

备注: 23 pages, 8 figures, 5 tables

点击查看摘要

Abstract:The reasoning pattern of Large language models (LLMs) remains opaque, and Reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation, but as a mechanistic blueprint of reasoning itself. We first distinguish attention heads between locally and globally focused information processing and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable structure-aware process, hoping to offer a potential step toward more transparent and effective optimization of LLM reasoning.

20. 【2510.13537】K-Merge: Online Continual Merging of Adapters for On-device Large Language Models

链接https://arxiv.org/abs/2510.13537

作者:Donald Shenaj,Ondrej Bohdal,Taha Ceritli,Mete Ozay,Pietro Zanuttigh,Umberto Michieli

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, frequently leverages Low-Rank, tight resource constraints, leverages Low-Rank Adapters, deployment of Large

备注: 15 pages, 8 figures

点击查看摘要

Abstract:On-device deployment of Large Language Models (LLMs) frequently leverages Low-Rank Adapters (LoRAs) to support diverse downstream tasks under tight resource constraints. To address the limited storage capacity of mobile devices, recent works have explored model merging techniques to fuse multiple LoRAs into a single one. In practice, however, LoRAs are often delivered incrementally, as users request support for new tasks (e.g., novel problem types or languages). This scenario introduces a new challenge: on-device online continual merging, where the objective is to incorporate new LoRAs while preserving the performance on previously supported tasks. In this paper, we propose a data-free and computationally efficient strategy for selecting and merging LoRAs when a new one becomes available, assuming the device can store only a limited number of adapters. Extensive experiments across real-world tasks demonstrate the superiority of our approach compared to alternative strategies while adhering to the storage budget and compute limitations of on-device settings.

21. 【2510.13500】MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts

链接https://arxiv.org/abs/2510.13500

作者:Shujun Xia,Haokun Lin,Yichen Wu,Yinan Zhou,Zixuan Li,Zhongwei Wan,Xingrun Xing,Yefeng Zheng,Xiang Li,Caifeng Shan,Zhenan Sun,Quanzheng Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:high-stakes clinical practice, hold great promise, LLMs hold great, limiting their applicability, clinical practice

备注: Preprint, work in progress

点击查看摘要

Abstract:LLMs hold great promise for healthcare applications, but the rapid evolution of medical knowledge and errors in training data often cause them to generate outdated or inaccurate information, limiting their applicability in high-stakes clinical practice. Model editing has emerged as a potential remedy without full retraining. While parameter-based editing often compromises locality and is thus ill-suited for the medical domain, retrieval-based editing offers a more viable alternative. However, it still faces two critical challenges: (1) representation overlap within the medical knowledge space often causes inaccurate retrieval and reduces editing accuracy; (2) existing methods are restricted to single-sample edits, while batch-editing remains largely unexplored despite its importance for real-world medical applications. To address these challenges, we first construct MedVersa, \hk{an enhanced benchmark with broader coverage of medical subjects, designed to evaluate both single and batch edits under strict locality constraints}. We then propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Experimental results on various medical benchmarks demonstrate that our MedREK achieves superior performance across different core metrics and provides the first validated solution for batch-editing in medical LLMs. Our code and dataset are available at this https URL.

22. 【2510.13499】ConsintBench: Evaluating Language Models on Real-World Consumer Intent Understanding

链接https://arxiv.org/abs/2510.13499

作者:Xiaozhe Li,TianYi Lyu,Siyi Yang,Yuxi Gong,Yizhao Yang,Jinxuan Huang,Ligao Zhang,Zhuoyi Huang,Qingwen Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:requiring analytical reasoning, large language models, dynamic information aggregation, contextual interpretation, high-level task

备注

点击查看摘要

Abstract:Understanding human intent is a complex, high-level task for large language models (LLMs), requiring analytical reasoning, contextual interpretation, dynamic information aggregation, and decision-making under uncertainty. Real-world public discussions, such as consumer product discussions, are rarely linear or involve a single user. Instead, they are characterized by interwoven and often conflicting perspectives, divergent concerns, goals, emotional tendencies, as well as implicit assumptions and background knowledge about usage scenarios. To accurately understand such explicit public intent, an LLM must go beyond parsing individual sentences; it must integrate multi-source signals, reason over inconsistencies, and adapt to evolving discourse, similar to how experts in fields like politics, economics, or finance approach complex, uncertain environments. Despite the importance of this capability, no large-scale benchmark currently exists for evaluating LLMs on real-world human intent understanding, primarily due to the challenges of collecting real-world public discussion data and constructing a robust evaluation pipeline. To bridge this gap, we introduce \bench, the first dynamic, live evaluation benchmark specifically designed for intent understanding, particularly in the consumer domain. \bench is the largest and most diverse benchmark of its kind, supporting real-time updates while preventing data contamination through an automated curation pipeline.

23. 【2510.13494】LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA

链接https://arxiv.org/abs/2510.13494

作者:Tommaso Bonomo,Luca Gioffré,Roberto Navigli

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Question Answering, narrative text poses, requiring a deep, understanding of long, poses a unique

备注: Accepted to EMNLP 2025 Main Conference. 22 pages

点击查看摘要

Abstract:Question Answering (QA) on narrative text poses a unique challenge to current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics to clarify how systems should be evaluated on LiteraryQA. This analysis reveals that all n-gram-based metrics have a low system-level correlation to human judgment, while LLM-as-a-Judge evaluations, even with small open-weight models, can strongly agree with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at this https URL.

24. 【2510.13434】Beyond Single-Reward: Multi-Pair, Multi-Perspective Preference Optimization for Machine Translation

链接https://arxiv.org/abs/2510.13434

作者:Hao Wang,Linlong Xu,Heng Liu,Yangyang Liu,Xiaohu Zhao,Bo Zeng,Liangying Shao,Longyue Wang,Weihua Luo,Kaifu Zhang

类目:Computation and Language (cs.CL)

关键词:aligning Large Language, Large Language Models, Direct Preference Optimization, Large Language, inefficient data utilization

备注

点击查看摘要

Abstract:Direct Preference Optimization (DPO) is a powerful paradigm for aligning Large Language Models (LLMs) to human preferences in Machine Translation (MT), but current methods are hindered by two fundamental challenges: (1) flawed reward signals from Quality Estimation (QE) models that overlook critical errors like translation hallucination, and (2) inefficient data utilization that discards valuable learning signals by selecting only a single win-loss pair. To address these limitations, we introduce M^2PO: Multi-Pair, Multi-Perspective Preference Optimization. Our framework integrates a multi-perspective reward engine that creates a more robust signal by combining two key viewpoints: a new hallucination penalty for factuality, and an innovative dynamic quality score that adaptively fuses external evaluations with the model's own evolving judgment. This is synergistically paired with a multi-pair construction strategy that systematically creates a comprehensive set of preference pairs from the entire pool of translation candidates. This synergistic approach ensures the model learns from a richer spectrum of quality trade-offs, leading to more robust and faithful translations. On challenging WMT21-22 benchmarks, M^2PO substantially outperforms existing preference optimization methods and demonstrates highly competitive performance against leading proprietary LLMs.

25. 【2510.13430】Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

链接https://arxiv.org/abs/2510.13430

作者:Ahmed Alzubaidi,Shaikha Alsuwaidi,Basma El Amel Boussaha,Leen AlQadi,Omar Alkaabi,Mohammed Alyafeai,Hamza Alobeidli,Hakim Hacid

类目:Computation and Language (cs.CL)

关键词:Arabic LLM benchmarks, NLP tasks, Arabic LLM, knowledge domains, specialized capabilities

备注

点击查看摘要

Abstract:This survey provides the first systematic review of Arabic LLM benchmarks, analyzing 40+ evaluation benchmarks across NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. Our analysis reveals significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. We examine three primary approaches: native collection, translation, and synthetic generation discussing their trade-offs regarding authenticity, scale, and cost. This work serves as a comprehensive reference for Arabic NLP researchers, providing insights into benchmark methodologies, reproducibility standards, and evaluation metrics while offering recommendations for future development.

26. 【2510.13417】Assessing LLM Reasoning Through Implicit Causal Chain Discovery in Climate Discourse

链接https://arxiv.org/abs/2510.13417

作者:Liesbeth Allein,Nataly Pineda-Castañeda,Andrea Rocci,Marie-Francine Moens

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:causal, intermediate causal steps, causal steps explain, intermediate causal, causal steps

备注

点击查看摘要

Abstract:How does a cause lead to an effect, and which intermediate causal steps explain their connection? This work scrutinizes the mechanistic causal reasoning capabilities of large language models (LLMs) to answer these questions through the task of implicit causal chain discovery. In a diagnostic evaluation framework, we instruct nine LLMs to generate all possible intermediate causal steps linking given cause-effect pairs in causal chain structures. These pairs are drawn from recent resources in argumentation studies featuring polarized discussion on climate change. Our analysis reveals that LLMs vary in the number and granularity of causal steps they produce. Although they are generally self-consistent and confident about the intermediate causal connections in the generated chains, their judgments are mainly driven by associative pattern matching rather than genuine causal reasoning. Nonetheless, human evaluations confirmed the logical coherence and integrity of the generated chains. Our baseline causal chain discovery approach, insights from our diagnostic evaluation, and benchmark dataset with causal chains lay a solid foundation for advancing future work in implicit, mechanistic causal reasoning in argumentation settings.

27. 【2510.13407】Investigating Lexical Change through Cross-Linguistic Colexification Patterns

链接https://arxiv.org/abs/2510.13407

作者:Kim Gfeller,Sabine Stoll,Chundra Cathcart,Paul Widmer

类目:Computation and Language (cs.CL)

关键词:intriguing features, ongoing shifts, concept pairs, meanings evolve remain, constant change

备注

点击查看摘要

Abstract:One of the most intriguing features of language is its constant change, with ongoing shifts in how meaning is expressed. Despite decades of research, the factors that determine how and why meanings evolve remain only partly understood. Colexification -- the phenomenon of expressing multiple distinct concepts using the same word form -- serves as a valuable window onto the dynamics of meaning change across languages. Here, we apply phylogenetic comparative models to dictionary data from three language families, Austronesian, Indo-European, and Uralic, in order to shed light on the evolutionary dynamics underlying the colexification of concept pairs. We assess the effects of three predictors: associativity, borrowability, and usage frequency. Our results show that more closely related concept pairs are colexified across a larger portion of the family tree and exhibit slower rates of change. In contrast, concept pairs that are more frequent and more prone to borrowing tend to change more rapidly and are less often colexified. We also find considerable differences between the language families under study, suggesting that areal and cultural factors may play a role.

28. 【2510.13395】Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models

链接https://arxiv.org/abs/2510.13395

作者:Agnese Lombardi,Alessandro Lenci

类目:Computation and Language (cs.CL)

关键词:Language is fundamental, human cooperation, fundamental to human, exchange of information, shared interpretations

备注

点击查看摘要

Abstract:Language is fundamental to human cooperation, facilitating not only the exchange of information but also the coordination of actions through shared interpretations of situational contexts. This study explores whether the Generative Agent-Based Model (GABM) Concordia can effectively model Theory of Mind (ToM) within simulated real-world environments. Specifically, we assess whether this framework successfully simulates ToM abilities and whether GPT-4 can perform tasks by making genuine inferences from social context, rather than relying on linguistic memorization. Our findings reveal a critical limitation: GPT-4 frequently fails to select actions based on belief attribution, suggesting that apparent ToM-like abilities observed in previous studies may stem from shallow statistical associations rather than true reasoning. Additionally, the model struggles to generate coherent causal effects from agent actions, exposing difficulties in processing complex social interactions. These results challenge current statements about emergent ToM-like capabilities in LLMs and highlight the need for more rigorous, action-based evaluation frameworks.

29. 【2510.13387】Make an Offer They Can't Refuse: Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment

链接https://arxiv.org/abs/2510.13387

作者:Buwei He,Yang Liu,Zhaowei Zhang,Zixia Jia,Huijia Wu,Zhaofeng He,Zilong Zheng,Yipeng Kang

类目:Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)

关键词:fundamental social capability, remains a challenge, fundamental social, social capability, large language models

备注

点击查看摘要

Abstract:Persuasion, a fundamental social capability for humans, remains a challenge for AI systems such as large language models (LLMs). Current studies often overlook the strategic use of information asymmetry in message design or rely on strong assumptions regarding pre-commitment. In this work, we explore the application of Bayesian Persuasion (BP) in natural language within single-turn dialogue settings, to enhance the strategic persuasion capabilities of LLMs. Our framework incorporates a commitment-communication mechanism, where the persuader explicitly outlines an information schema by narrating their potential types (e.g., honest or dishonest), thereby guiding the persuadee in performing the intended Bayesian belief update. We evaluate two variants of our approach: Semi-Formal-Natural-Language (SFNL) BP and Fully-Natural-Language (FNL) BP, benchmarking them against both naive and strong non-BP (NBP) baselines within a comprehensive evaluation framework. This framework covers a diverse set of persuadees -- including LLM instances with varying prompts and fine-tuning and human participants -- across tasks ranging from specially designed persuasion scenarios to general everyday situations. Experimental results on LLM-based agents reveal three main findings: (1) LLMs guided by BP strategies consistently achieve higher persuasion success rates than NBP baselines; (2) SFNL exhibits greater credibility and logical coherence, while FNL shows stronger emotional resonance and robustness in naturalistic conversations; (3) with supervised fine-tuning, smaller models can attain BP performance comparable to that of larger models.

30. 【2510.13366】Document Intelligence in the Era of Large Language Models: A Survey

链接https://arxiv.org/abs/2510.13366

作者:Weishi Wang,Hengchang Hu,Zhijie Zhang,Zhaochen Li,Hongxin Shao,Daniel Dahlmeier

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, vital application area, significantly transformed, advent of large, large language

备注

点击查看摘要

Abstract:Document AI (DAI) has emerged as a vital application area, and is significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advancements in understanding and generation. This survey provides a comprehensive overview of DAI's evolution, highlighting current research attempts and future prospects of LLMs in this field. We explore key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state-of-the-art in DAI and its implications for both academic and practical applications.

31. 【2510.13363】D-SMART: Enhancing LLM Dialogue Consistency via Dynamic Structured Memory And Reasoning Tree

链接https://arxiv.org/abs/2510.13363

作者:Xiang Lei,Qin Li,Min Zhang,Min Zhang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, exhibit factual inconsistencies, Language Models, multi-turn dialogue consistency

备注: 8 pages, 6 figures (main content); 25 pages, 18 figures (total)

点击查看摘要

Abstract:Large Language Models (LLMs) often exhibit factual inconsistencies and logical decay in extended, multi-turn dialogues, a challenge stemming from their reliance on static, pre-trained knowledge and an inability to reason adaptively over the dialogue history. Prevailing mitigation strategies, such as Retrieval-Augmented Generation (RAG) and agentic working memories, improve information recall but still engage with fundamentally static knowledge sources and follow pre-defined single reasoning path. This hinders their ability to preserve factual and logical consistency of their responses in multi-turn dialogues while the context evolves over time. To address this issue, we propose D-SMART, a model-agnostic framework designed to maintain multi-turn dialogue consistency by enabling LLMs to build and reason over a dynamic, structured representation of the conversational context. This is achieved via two synergistic components: (1) a Dynamic Structured Memory (DSM), which incrementally constructs and maintains an authoritative, OWL-compliant knowledge graph of the conversation; and (2) a Reasoning Tree (RT), which executes inferences as an explicit and traceable multi-step search over the graph. As the popular-used quality score (judged by GPT-4) can overlook logical flaws, we introduce new NLI-based metrics to better measure multi-turn dialogue consistency. Comprehensive experiments on the MT-Bench-101 benchmark show that D-SMART significantly outperforms state-of-the-art baselines, elevating the dialogue consistency score by over 48\% for both proprietary and open-source models, and notably improves the quality score of the latter by up to 10.1\%.

32. 【2510.13357】Personal Attribute Leakage in Federated Speech Models

链接https://arxiv.org/abs/2510.13357

作者:Hamdan Al-Ali,Ali Reza Ghavamipour,Tommaso Caselli,Fatih Turkmen,Zeerak Talat,Hanan Aldarmaki

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:machine learning models, ASR models, machine learning, privacy-preserving training, training of machine

备注: 5 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Federated learning is a common method for privacy-preserving training of machine learning models. In this paper, we analyze the vulnerability of ASR models to attribute inference attacks in the federated setting. We test a non-parametric white-box attack method under a passive threat model on three ASR models: Wav2Vec2, HuBERT, and Whisper. The attack operates solely on weight differentials without access to raw speech from target speakers. We demonstrate attack feasibility on sensitive demographic and clinical attributes: gender, age, accent, emotion, and dysarthria. Our findings indicate that attributes that are underrepresented or absent in the pre-training data are more vulnerable to such inference attacks. In particular, information about accents can be reliably inferred from all models. Our findings expose previously undocumented vulnerabilities in federated ASR models and offer insights towards improved security.

33. 【2510.13351】Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems

链接https://arxiv.org/abs/2510.13351

作者:Karthik Avinash,Nikhil Pareek,Rishav Hada

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, Language Models, enterprise and mission-critical, mission-critical domains

备注

点击查看摘要

Abstract:The increasing deployment of Large Language Models (LLMs) across enterprise and mission-critical domains has underscored the urgent need for robust guardrailing systems that ensure safety, reliability, and compliance. Existing solutions often struggle with real-time oversight, multi-modal data handling, and explainability -- limitations that hinder their adoption in regulated environments. Existing guardrails largely operate in isolation, focused on text alone making them inadequate for multi-modal, production-scale environments. We introduce Protect, natively multi-modal guardrailing model designed to operate seamlessly across text, image, and audio inputs, designed for enterprise-grade deployment. Protect integrates fine-tuned, category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context-aware labels across modalities. Experimental results demonstrate state-of-the-art performance across all safety dimensions, surpassing existing open and proprietary models such as WildGuard, LlamaGuard-4, and GPT-4.1. Protect establishes a strong foundation for trustworthy, auditable, and production-ready safety systems capable of operating across text, image, and audio modalities.

34. 【2510.13344】UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE

链接https://arxiv.org/abs/2510.13344

作者:Zhenyu Liu,Yunxin Li,Xuanyu Zhang,Qixun Teng,Shenyuan Jiang,Xinyu Chen,Haoyuan Shi,Jinchao Li,Qi Wang,Haolan Chen,Fanbo Meng,Mingjun Zhao,Yu Xu,Yancheng He,Baotian Hu,Min Zhang

类目:ound (cs.SD); Computation and Language (cs.CL)

关键词:Recent advances, comprehensive content generation, clear trend, trend towards comprehensive, comprehensive content

备注

点击查看摘要

Abstract:Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge, with music and speech often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts and severe data imbalances, which impede the development of a truly unified audio generation model. To address this challenge, we propose UniMoE-Audio, a unified speech and music generation model within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamic expert number allocation, and a hybrid expert design comprising routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping. To tackle data imbalance, we introduce a three-stage training curriculum: 1) Independent Specialist Training leverages original datasets to instill domain-specific knowledge into each "proto-expert" without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE-Audio architecture, warming up the gate module and shared expert using a subset of balanced dataset; and 3) Synergistic Joint Training trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy. Extensive experiments show that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architecture and curated training strategies in advancing the field of universal audio generation. Homepage: this https URL

35. 【2510.13341】Are Proverbs the New Pythian Oracles? Exploring Sentiment in Greek Sayings

链接https://arxiv.org/abs/2510.13341

作者:Katerina Korre,John Pavlopoulos

类目:Computation and Language (cs.CL)

关键词:fascinating linguistic phenomena, Natural Language Processing, linguistic boundaries, fascinating linguistic, linguistic phenomena

备注

点击查看摘要

Abstract:Proverbs are among the most fascinating linguistic phenomena that transcend cultural and linguistic boundaries. Yet, much of the global landscape of proverbs remains underexplored, as many cultures preserve their traditional wisdom within their own communities due to the oral tradition of the phenomenon. Taking advantage of the current advances in Natural Language Processing (NLP), we focus on Greek proverbs, analyzing their sentiment. Departing from an annotated dataset of Greek proverbs, we expand it to include local dialects, effectively mapping the annotated sentiment. We present (1) a way to exploit LLMs in order to perform sentiment classification of proverbs, (2) a map of Greece that provides an overview of the distribution of sentiment, (3) a combinatory analysis in terms of the geographic position, dialect, and topic of proverbs. Our findings show that LLMs can provide us with an accurate enough picture of the sentiment of proverbs, especially when approached as a non-conventional sentiment polarity task. Moreover, in most areas of Greece negative sentiment is more prevalent.

36. 【2510.13334】aming the Fragility of KV Cache Eviction in LLM Inference

链接https://arxiv.org/abs/2510.13334

作者:Yuan Feng,Haoyu Guo,JunLin Lv,S. Kevin Zhou,Xike Xie

类目:Computation and Language (cs.CL)

关键词:Large language models, natural language processing, revolutionized natural language, Large language, deployment remains hampered

备注

点击查看摘要

Abstract:Large language models have revolutionized natural language processing, yet their deployment remains hampered by the substantial memory and runtime overhead of the transformer's Key-Value cache. To mitigate this, recent methods employ a scoring-aggregation framework to evict unimportant cache entries, based on the stability assumption-that a fixed subset of entries remains consistently important during generation. However, prior work has largely focused on refining importance indicators for scoring, while defaulting to mean aggregation due to a faithful trust in the stability assumption. In this work, we argue that this underlying assumption is inherently fragile, making mean aggregation highly vulnerable in extreme cases. To counter this, we propose a simple yet elegant defensive aggregation strategy: a two-step, linear-time approach that controls worst-case risk, thereby defending against extreme cases with negligible computational overhead. Embodying this strategy, we propose a novel cache eviction method, DefensiveKV and its extension, Layer-DefensiveKV, which incorporates layer-wise budget allocation. Across seven task domains (18 datasets), our methods reduce generation quality loss by 2.3x and 4.3x respectively, versus the strongest baseline under a 20% cache size. These results set new performance benchmarks and pioneer a promising direction for optimizing cache eviction against underlying fragility through worst-case risk management. Our code is available at this https URL.

37. 【2510.13329】Embedding-Based Context-Aware Reranker

链接https://arxiv.org/abs/2510.13329

作者:Ye Yuan,Mohammad Amin Shabani,Siqi Liu

类目:Computation and Language (cs.CL)

关键词:support downstream generation, Retrieval-Augmented Generation, downstream generation, retrieving relevant evidence, systems rely

备注: Under Review

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems rely on retrieving relevant evidence from a corpus to support downstream generation. The common practice of splitting a long document into multiple shorter passages enables finer-grained and targeted information retrieval. However, it also introduces challenges when a correct retrieval would require inference across passages, such as resolving coreference, disambiguating entities, and aggregating evidence scattered across multiple sources. Many state-of-the-art (SOTA) reranking methods, despite utilizing powerful large pretrained language models with potentially high inference costs, still neglect the aforementioned challenges. Therefore, we propose Embedding-Based Context-Aware Reranker (EBCAR), a lightweight reranking framework operating directly on embeddings of retrieved passages with enhanced cross-passage understandings through the structural information of the passages and a hybrid attention mechanism, which captures both high-level interactions across documents and low-level relationships within each document. We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency.

38. 【2510.13312】ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering

链接https://arxiv.org/abs/2510.13312

作者:Simon Lupart,Mohammad Aliannejadi,Evangelos Kanoulas

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:conversational question answering, reasoning framework based, reinforcement learning, question answering, framework based

备注

点击查看摘要

Abstract:We present ChatR1, a reasoning framework based on reinforcement learning (RL) for conversational question answering (CQA). Reasoning plays an important role in CQA, where user intent evolves across dialogue turns, and utterances are often underspecified, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Unlike static `rewrite, retrieve, and generate' pipelines, ChatR1 interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through RL. To address the challenge of sparse and delayed rewards in RL, we propose an intent-aware reward that provides turn-level feedback by aligning retrieval and reasoning with evolving user goals. Our proposed ChatR1 demonstrates strong performance on both 3B and 7B model backbones, outperforming competitive models on five CQA datasets, measured by different metrics (F1, BERTScore, and LLM-as-judge). We include a diverse set of CQA datasets to cover topic shifts, evolving intents, mixed-initiative dialogues, and multi-document grounding, testing ChatR1's performance from various aspects. Ablation studies confirm the effectiveness of the intent-aware reward. Our analyses further reveal diverse reasoning trajectories and effective use of the search tool. ChatR1 also generalizes robustly across domains, demonstrating that RL-based reasoning enables more flexible and context-sensitive behavior than static CQA pipelines.

39. 【2510.13302】LLM one-shot style transfer for Authorship Attribution and Verification

链接https://arxiv.org/abs/2510.13302

作者:Pablo Miralles-González,Javier Huertas-Tato,Alejandro Martín,David Camacho

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:stylometry analyzes writing, analyzes writing style, Computational stylometry analyzes, supporting applications, stylometry analyzes

备注

点击查看摘要

Abstract:Computational stylometry analyzes writing style through quantitative patterns in text, supporting applications from forensic tasks such as identity linking and plagiarism detection to literary attribution in the humanities. Supervised and contrastive approaches rely on data with spurious correlations and often confuse style with topic. Despite their natural use in AI-generated text detection, the CLM pre-training of modern LLMs has been scarcely leveraged for general authorship problems. We propose a novel unsupervised approach based on this extensive pre-training and the in-context learning capabilities of LLMs, employing the log-probabilities of an LLM to measure style transferability from one text to another. Our method significantly outperforms LLM prompting approaches of comparable scale and achieves higher accuracy than contrastively trained baselines when controlling for topical correlations. Moreover, performance scales fairly consistently with the size of the base model and, in the case of authorship verification, with an additional mechanism that increases test-time computation; enabling flexible trade-offs between computational cost and accuracy.

40. 【2510.13293】Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models

链接https://arxiv.org/abs/2510.13293

作者:Yizhou Peng,Yukun Ma,Chong Zhang,Yi-Wen Chao,Chongjia Ni,Bin Ma

类目:Computation and Language (cs.CL)

关键词:achieve fine-grained control, significant challenge emerges, TTS models, TTS, systems can achieve

备注: Submitted to ICASSP 2026

点击查看摘要

Abstract:While Text-to-Speech (TTS) systems can achieve fine-grained control over emotional expression via natural language prompts, a significant challenge emerges when the desired emotion (style prompt) conflicts with the semantic content of the text. This mismatch often results in unnatural-sounding speech, undermining the goal of achieving fine-grained emotional control. Classifier-Free Guidance (CFG) is a key technique for enhancing prompt alignment; however, its application to auto-regressive (AR) TTS models remains underexplored, which can lead to degraded audio quality. This paper directly addresses the challenge of style-content mismatch in AR TTS models by proposing an adaptive CFG scheme that adjusts to different levels of the detected mismatch, as measured using large language models or natural language inference models. This solution is based on a comprehensive analysis of CFG's impact on emotional expressiveness in state-of-the-art AR TTS models. Our results demonstrate that the proposed adaptive CFG scheme improves the emotional expressiveness of the AR TTS model while maintaining audio quality and intelligibility.

41. 【2510.13291】Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan's Intelligent Interaction Systems

链接https://arxiv.org/abs/2510.13291

作者:Xuxin Cheng,Ke Zeng,Zhiquan Cao,Linyi Dai,Wenxuan Gao,Fei Han,Ai Jian,Feng Hong,Wenxing Hu,Zihe Huang,Dejian Kong,Jia Leng,Zhuoyuan Liao,Pei Liu,Jiaye Lin,Xing Ma,Jingqing Ruan,Jiaxing Song,Xiaoyu Tan,Ruixuan Xiao,Wenhui Yu,Wenyu Zhan,Haoxing Zhang,Chao Zhou,Hao Zhou,Shaodong Zheng,Ruinian Chen,Siyuan Chen,Ziyang Chen,Yiwen Dong,Yaoyou Fan,Yangyi Fang,Yang Gan,Shiguang Guo,Qi He,Chaowen Hu,Binghui Li,Dailin Li,Xiangyu Li,Yan Li,Chengjian Liu,Xiangfeng Liu,Jiahui Lv,Qiao Ma,Jiang Pan,Cong Qin,Chenxing Sun,Wen Sun,Zhonghui Wang,Abudukelimu Wuerkaixi,Xin Yang,Fangyi Yuan,Yawen Zhu,Tianyi Zhai,Jie Zhang,Runlai Zhang,Yao Xu,Yiran Zhao,Yifan Wang,Xunliang Cai,Yangen Hu,Cao Liu,Lu Pan,Xiaoli Wang,Bo Xiao,Wenyuan Yao,Qianlin Zhou,Benchang Zhu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Enhancing customer experience, Large Language Models, Enhancing customer, service demands grow, intelligent interaction systems

备注: 36 pages, 14 figures

点击查看摘要

Abstract:Enhancing customer experience is essential for business success, particularly as service demands grow in scale and complexity. Generative artificial intelligence and Large Language Models (LLMs) have empowered intelligent interaction systems to deliver efficient, personalized, and 24/7 support. In practice, intelligent interaction systems encounter several challenges: (1) Constructing high-quality data for cold-start training is difficult, hindering self-evolution and raising labor costs. (2) Multi-turn dialogue performance remains suboptimal due to inadequate intent understanding, rule compliance, and solution extraction. (3) Frequent evolution of business rules affects system operability and transferability, constraining low-cost expansion and adaptability. (4) Reliance on a single LLM is insufficient in complex scenarios, where the absence of multi-agent frameworks and effective collaboration undermines process completeness and service quality. (5) The open-domain nature of multi-turn dialogues, lacking unified golden answers, hampers quantitative evaluation and continuous optimization. To address these challenges, we introduce WOWService, an intelligent interaction system tailored for industrial applications. With the integration of LLMs and multi-agent architectures, WOWService enables autonomous task management and collaborative problem-solving. Specifically, WOWService focuses on core modules including data construction, general capability enhancement, business scenario adaptation, multi-agent coordination, and automated evaluation. Currently, WOWService is deployed on the Meituan App, achieving significant gains in key metrics, e.g., User Satisfaction Metric 1 (USM 1) -27.53% and User Satisfaction Metric 2 (USM 2) +25.51%, demonstrating its effectiveness in capturing user needs and advancing personalized service.

42. 【2510.13285】In-Distribution Steering: Balancing Control and Coherence in Language Model Generation

链接https://arxiv.org/abs/2510.13285

作者:Arthur Vogels,Benjamin Wong,Yann Choho,Annabelle Blangero,Milan Bhan

类目:Computation and Language (cs.CL)

关键词:large language model, modifying internal activations, Activation steering methods, control large language, language model

备注

点击查看摘要

Abstract:Activation steering methods control large language model (LLM) behavior by modifying internal activations at inference time. However, most existing activation steering methods rely on a fixed steering strength, leading to either insufficient control or unadapted intervention that degrades text plausibility and coherence. We introduce In-Distribution Steering (IDS), a novel method that adapts steering strength based on the input data distribution in representation space. IDS dynamically adjusts interventions according to how far a given input lies within the distribution, enabling adaptive intervention and generation stability during text generation. Experiments demonstrate that IDS achieves strong accuracy on classification tasks while producing coherent text without collapse, making IDS particularly well suited for real-world applications.

43. 【2510.13276】MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models

链接https://arxiv.org/abs/2510.13276

作者:Keyan Zhou,Zecheng Tang,Lingfeng Ming,Guanghao Zhou,Qiguang Chen,Dan Qiao,Zheming Yang,Libo Qin,Minghui Qiu,Juntao Li,Min Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:large vision language, vision language models, rapid advancement, advancement of large, large vision

备注

点击查看摘要

Abstract:The rapid advancement of large vision language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee the effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.

44. 【2510.13272】Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation

链接https://arxiv.org/abs/2510.13272

作者:Zhichao Xu,Zongyu Wu,Yun Zhou,Aosong Feng,Kang Zhou,Sangmin Woo,Kiran Ramnath,Yijun Tian,Xuan Qi,Weikang Qiu,Lin Lee Cheong,Haibo Ding

类目:Computation and Language (cs.CL)

关键词:Large Language Model, Large Language, train LLMs, Language Model, training for domains

备注

点击查看摘要

Abstract:Inspired by the success of reinforcement learning (RL) in Large Language Model (LLM) training for domains like math and code, recent works have begun exploring how to train LLMs to use search engines more effectively as tools for retrieval-augmented generation. Although these methods achieve performance improvement across QA benchmarks, many prioritize final answer correctness while overlooking the quality of intermediate reasoning steps, which may lead to chain-of-thought unfaithfulness. In this paper, we first introduce a comprehensive evaluation framework for evaluating RL-based search agents, covering three distinct faithfulness metrics: information-think faithfulness, think-answer faithfulness, and think-search faithfulness. Our evaluations reveal that a prototypical RL-based search agent, Search-R1, has significant room for improvement in this regard. To foster faithful reasoning, we introduce VERITAS (Verifying Entailed Reasoning through Intermediate Traceability in Agentic Search), a novel framework that integrates fine-grained faithfulness rewards into the reinforcement learning process. Our experiments show that models trained with VERITAS not only significantly improve reasoning faithfulness, but also achieve comparable task performance across seven QA benchmarks.

45. 【2510.13271】Do You Get the Hint? Benchmarking LLMs on the Board Game Concept

链接https://arxiv.org/abs/2510.13271

作者:Ine Gevers,Walter Daelemans

类目:Computation and Language (cs.CL)

关键词:expose fundamental weaknesses, achieved striking successes, recent studies continue, Large language models, Large language

备注

点击查看摘要

Abstract:Large language models (LLMs) have achieved striking successes on many benchmarks, yet recent studies continue to expose fundamental weaknesses. In particular, tasks that require abstract reasoning remain challenging, often because they use representations such as grids, symbols, or visual patterns that differ from the natural language data LLMs are trained on. In this paper, we introduce Concept, a simple word-guessing board game, as a benchmark for probing abductive reasoning in a representation that is much closer to LLM pre-training data: natural language. Our results show that this game, easily solved by humans (with a success rate of over 90\%), is still very challenging for state-of-the-art LLMs (no model exceeds 40\% success rate). Specifically, we observe that LLMs struggle with interpreting other players' strategic intents, and with correcting initial hypotheses given sequential information updates. In addition, we extend the evaluation across multiple languages, and find that the LLM performance drops further in lower-resource languages (Dutch, French, and Spanish) compared to English.

46. 【2510.13255】Hierarchical Frequency Tagging Probe (HFTP): A Unified Approach to Investigate Syntactic Structure Representations in Large Language Models and the Human Brain

链接https://arxiv.org/abs/2510.13255

作者:Jingmin An,Yilong Song,Ruolin Yang,Nai Ding,Lingxi Lu,Yuxuan Wang,Wei Wang,Chu Zhuang,Qian Wang,Fang Fang

类目:Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)

关键词:responsible remain unclear, modules responsible remain, Large Language Models, superior language abilities, effectively modeling syntactic

备注

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate human-level or even superior language abilities, effectively modeling syntactic structures, yet the specific computational modules responsible remain unclear. A key question is whether LLM behavioral capabilities stem from mechanisms akin to those in the human brain. To address these questions, we introduce the Hierarchical Frequency Tagging Probe (HFTP), a tool that utilizes frequency-domain analysis to identify neuron-wise components of LLMs (e.g., individual Multilayer Perceptron (MLP) neurons) and cortical regions (via intracranial recordings) encoding syntactic structures. Our results show that models such as GPT-2, Gemma, Gemma 2, Llama 2, Llama 3.1, and GLM-4 process syntax in analogous layers, while the human brain relies on distinct cortical regions for different syntactic levels. Representational similarity analysis reveals a stronger alignment between LLM representations and the left hemisphere of the brain (dominant in language processing). Notably, upgraded models exhibit divergent trends: Gemma 2 shows greater brain similarity than Gemma, while Llama 3.1 shows less alignment with the brain compared to Llama 2. These findings offer new insights into the interpretability of LLM behavioral improvements, raising questions about whether these advancements are driven by human-like or non-human-like mechanisms, and establish HFTP as a valuable tool bridging computational linguistics and cognitive neuroscience. This project is available at this https URL.

47. 【2510.13220】EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

链接https://arxiv.org/abs/2510.13220

作者:Yufei He,Juncheng Liu,Yue Liu,Yibo Li,Tri Cao,Zhiyuan Hu,Xinxing Xu,Bryan Hooi

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:test time, clever but clueless, clueless interns, fundamental limitation, limitation of current

备注

点击查看摘要

Abstract:A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like "clever but clueless interns" in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods like reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients-by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state-action choices, tunes hyperparameters, and learns the tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.

48. 【2510.13215】Personalized Learning Path Planning with Goal-Driven Learner State Modeling

链接https://arxiv.org/abs/2510.13215

作者:Joy Jia Yin Lim,Ye He,Jifan Yu,Xin Cong,Daniel Zhang-Li,Zhiyuan Liu,Huiqin Liu,Lei Hou,Juanzi Li,Bin Xu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Learning Path Planning, individual goals, adaptive learning paths, align with individual, design adaptive learning

备注

点击查看摘要

Abstract:Personalized Learning Path Planning (PLPP) aims to design adaptive learning paths that align with individual goals. While large language models (LLMs) show potential in personalizing learning experiences, existing approaches often lack mechanisms for goal-aligned planning. We introduce Pxplore, a novel framework for PLPP that integrates a reinforcement-based training paradigm and an LLM-driven educational architecture. We design a structured learner state model and an automated reward function that transforms abstract objectives into computable signals. We train the policy combining supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), and deploy it within a real-world learning platform. Extensive experiments validate Pxplore's effectiveness in producing coherent, personalized, and goal-driven learning paths. We release our code and dataset to facilitate future research.

49. 【2510.13211】A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics

链接https://arxiv.org/abs/2510.13211

作者:Prawaal Sharma,Navneet Goyal,Poonam Goyal,Vishnupriyan R

类目:Computation and Language (cs.CL)

关键词:good quality digital, quality digital language, Linguistic diversity, digital language resources, human population

备注: 4 Pages, Parallel Data Augmentation

点击查看摘要

Abstract:Linguistic diversity across the world creates a disparity with the availability of good quality digital language resources thereby restricting the technological benefits to majority of human population. The lack or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper presents a novel scalable and fully automated methodology to extract bilingual parallel corpora from newspaper articles using image and text analytics. We validate our approach by building parallel data corpus for two different language combinations and demonstrate the value of this dataset through a downstream task of machine translation and improve over the current baseline by close to 3 BLEU points.

50. 【2510.13202】LLM-Guided Synthetic Augmentation (LGSA) for Mitigating Bias in AI Systems

链接https://arxiv.org/abs/2510.13202

作者:Sai Suhruth Reddy Karri,Yashwanth Sai Nallapuneni,Laxmi Narasimha Reddy Mallireddy,Gopichand G

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:natural language data, raises ethical, practical concerns, relying on natural, ethical and practical

备注: 11 pages, 4 figures, 1 Table, submitted to an international conference

点击查看摘要

Abstract:Bias in AI systems, especially those relying on natural language data, raises ethical and practical concerns. Underrepresentation of certain groups often leads to uneven performance across demographics. Traditional fairness methods, such as pre-processing, in-processing, and post-processing, depend on protected-attribute labels, involve accuracy-fairness trade-offs, and may not generalize across datasets. To address these challenges, we propose LLM-Guided Synthetic Augmentation (LGSA), which uses large language models to generate counterfactual examples for underrepresented groups while preserving label integrity. We evaluated LGSA on a controlled dataset of short English sentences with gendered pronouns, professions, and binary classification labels. Structured prompts were used to produce gender-swapped paraphrases, followed by quality control including semantic similarity checks, attribute verification, toxicity screening, and human spot checks. The augmented dataset expanded training coverage and was used to train a classifier under consistent conditions. Results show that LGSA reduces performance disparities without compromising accuracy. The baseline model achieved 96.7 percent accuracy with a 7.2 percent gender bias gap. Simple swap augmentation reduced the gap to 0.7 percent but lowered accuracy to 95.6 percent. LGSA achieved 99.1 percent accuracy with a 1.9 percent bias gap, improving performance on female-labeled examples. These findings demonstrate that LGSA is an effective strategy for bias mitigation, enhancing subgroup balance while maintaining high task accuracy and label fidelity.

51. 【2510.13197】xt Anomaly Detection with Simplified Isolation Kernel

链接https://arxiv.org/abs/2510.13197

作者:Yang Cao,Sikun Yang,Yujiu Yang,Lianyong Qi,Ming Liu

类目:Computation and Language (cs.CL)

关键词:Two-step approaches combining, approaches combining pre-trained, leveraging rich semantic, Two-step approaches, combining pre-trained large

备注: EMNLP Findings 2025

点击查看摘要

Abstract:Two-step approaches combining pre-trained large language model embeddings and anomaly detectors demonstrate strong performance in text anomaly detection by leveraging rich semantic representations. However, high-dimensional dense embeddings extracted by large language models pose challenges due to substantial memory requirements and high computation time. To address this challenge, we introduce the Simplified Isolation Kernel (SIK), which maps high-dimensional dense embeddings to lower-dimensional sparse representations while preserving crucial anomaly characteristics. SIK has linear time complexity and significantly reduces space complexity through its innovative boundary-focused feature mapping. Experiments across 7 datasets demonstrate that SIK achieves better detection performance than 11 state-of-the-art (SOTA) anomaly detection algorithms while maintaining computational efficiency and low memory cost. All code and demonstrations are available at this https URL.

52. 【2510.13194】StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation

链接https://arxiv.org/abs/2510.13194

作者:Xi Chen,Yuchen Song,Satoshi Nakamura

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:cross-lingual emphasis conversion, propose a stress-aware, system that preserves, preserves word-level emphasis, preserves word-level

备注

点击查看摘要

Abstract:We propose a stress-aware speech-to-speech translation (S2ST) system that preserves word-level emphasis by leveraging LLMs for cross-lingual emphasis conversion. Our method translates source-language stress into target-language tags that guide a controllable TTS model. To overcome data scarcity, we developed a pipeline to automatically generate aligned training data and introduce the "LLM-as-Judge" for evaluation. Experiments show our approach substantially outperforms baselines in preserving emphasis while maintaining comparable translation quality, speaker intent, and naturalness. Our work highlights the importance of prosody in translation and provides an effective, data-efficient solution for preserving paralinguistic cues in S2ST.

53. 【2510.13191】Grounding Long-Context Reasoning with Contextual Normalization for Retrieval-Augmented Generation

链接https://arxiv.org/abs/2510.13191

作者:Jiamin Chen,Yuchen Li,Xinyu Ma,Xinran Chen,Xiaokun Zhang,Shuaiqiang Wang,Chen Ma,Dawei Yin

类目:Computation and Language (cs.CL)

关键词:large language models, language models, essential approach, approach for extending, knowledge capacity

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has become an essential approach for extending the reasoning and knowledge capacity of large language models (LLMs). While prior research has primarily focused on retrieval quality and prompting strategies, the influence of how the retrieved documents are framed, i.e., context format, remains underexplored. We show that seemingly superficial choices, such as delimiters or structural markers in key-value extraction, can induce substantial shifts in accuracy and stability, even when semantic content is identical. To systematically investigate this effect, we design controlled experiments that vary context density, delimiter styles, and positional placement, revealing the underlying factors that govern performance differences. Building on these insights, we introduce Contextual Normalization, a lightweight strategy that adaptively standardizes context representations before generation. Extensive experiments on both controlled and real-world RAG benchmarks across diverse settings demonstrate that the proposed strategy consistently improves robustness to order variation and strengthens long-context utilization. These findings underscore that reliable RAG depends not only on retrieving the right content, but also on how that content is presented, offering both new empirical evidence and a practical technique for better long-context reasoning.

54. 【2510.13190】SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs

链接https://arxiv.org/abs/2510.13190

作者:Juan Ren,Mark Dras,Usman Naseem

类目:Computation and Language (cs.CL)

关键词:Large Vision-Language Models, unlock powerful multimodal, powerful multimodal reasoning, conceal harmful goals, Large Vision-Language

备注: Preprint

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that conceal harmful goals in benign prompts. We propose SHIELD, a lightweight, model-agnostic preprocessing framework that couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, Forward). Unlike binary moderators, SHIELD composes tailored safety prompts that enforce nuanced refusals or safe redirection without retraining. Across five benchmarks and five representative LVLMs, SHIELD consistently lowers jailbreak and non-following rates while preserving utility. Our method is plug-and-play, incurs negligible overhead, and is easily extendable to new attack types -- serving as a practical safety patch for both weakly and strongly aligned LVLMs.

55. 【2510.13183】DSCD: Large Language Model Detoxification with Self-Constrained Decoding

链接https://arxiv.org/abs/2510.13183

作者:Ming Dong,Jinkui Zhang,Bolong Zheng,Xinhui Tu,Po Hu,Tingting He

类目:Computation and Language (cs.CL)

关键词:large language models, significant research challenge, language models, remains a significant, research challenge

备注: Accepted at EMNLP 2025 MainConference

点击查看摘要

Abstract:Detoxification in large language models (LLMs) remains a significant research challenge. Existing decoding detoxification methods are all based on external constraints, which require additional resource overhead and lose generation fluency. This work proposes Detoxification with Self-Constrained Decoding (DSCD), a novel method for LLM detoxification without parameter fine-tuning. DSCD strengthens the inner next-token distribution of the safety layer while weakening that of hallucination and toxic layers during output generation. This effectively diminishes toxicity and enhances output safety. DSCD offers lightweight, high compatibility, and plug-and-play capabilities, readily integrating with existing detoxification methods for further performance improvement. Extensive experiments on representative open-source LLMs and public datasets validate DSCD's effectiveness, demonstrating state-of-the-art (SOTA) performance in both detoxification and generation fluency, with superior efficiency compared to existing methods. These results highlight DSCD's potential as a practical and scalable solution for safer LLM deployments.

56. 【2510.13170】Putting on the Thinking Hats: A Survey on Chain of Thought Fine-tuning from the Perspective of Human Reasoning Mechanism

链接https://arxiv.org/abs/2510.13170

作者:Xiaoshu Chen,Sihang Zhou,Ke Liang,Duanyang Yuan,Haoyuan Chen,Xiaoyu Sun,Linyuan Meng,Xinwang Liu

类目:Computation and Language (cs.CL)

关键词:endow large language, Chain of thought, curated reasoning traces, CoT fine-tuning, large language models

备注

点击查看摘要

Abstract:Chain of thought (CoT) fine-tuning aims to endow large language models (LLMs) with reasoning capabilities by training them on curated reasoning traces. It leverages both supervised and reinforced fine-tuning to cultivate human-like reasoning skills in LLMs, including detailed planning, divergent thinking, intuitive judgment, timely reflection, internal thinking, and fact perception, etc. As CoT fine-tuning has advanced, LLMs have demonstrated substantial improvements in tasks such as mathematical reasoning and code generation. However, existing surveys about CoT fine-tuning primarily focus on technical aspects and overlook a systematic analysis from the perspective of human reasoning mechanisms. Given that the ultimate goal of CoT fine-tuning is to enable LLMs to reason like humans, it is crucial to investigate this technique through the lens of human cognition. To fill this gap, we present the first comprehensive survey of CoT fine-tuning grounded in human reasoning theory. Specifically, inspired by the well-known Six Thinking Hats framework, which systematically characterizes common human thinking modes using six metaphorical hats, we classify and examine CoT fine-tuning methods through this lens. Furthermore, building upon this theory, we outline potential directions for future research in CoT fine-tuning. In addition, we compile a comprehensive overview of existing datasets and model performances, and a real-time GitHub repository \footnote{this https URL} that continuously tracks recent advances in this area is maintained. We hope this survey will serve as a valuable resource to inspire innovation and foster progress in this rapidly evolving field.

57. 【2510.13166】CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning

链接https://arxiv.org/abs/2510.13166

作者:Kehua Feng,Keyan Ding,Zhihui Zhu,Lei Liang,Qiang Zhang,Huajun Chen

类目:Computation and Language (cs.CL)

关键词:advanced large language, large language models, specialized knowledge requirements, general reasoning tasks, superficial reasoning due

备注: 28 pages, 3 figures

点击查看摘要

Abstract:While chain-of-thought (CoT) distillation from advanced large language models (LLMs) has proven effective in general reasoning tasks, it struggles in scientific domains where even advanced models often produce incorrect or superficial reasoning due to high complexity and specialized knowledge requirements. Directly distilling from such flawed outputs results in low-quality training data and limits the performance of smaller student models. To overcome this, we propose CoT-Evo, an evolutionary CoT distillation framework. It begins by constructing a diverse pool of reasoning trajectories from multiple LLM thinkers, enriches them with automatically retrieved domain knowledge, and iteratively refines the trajectories using novelty-driven selection, reflective recombination and mutation. The refinement is guided by a fitness function that evaluates answer correctness, coherence, and effective knowledge utilization. This results in a high-quality CoT dataset tailored for scientific reasoning. We employ this evolved dataset to fine-tune a compact model, which achieves state-of-the-art performance on scientific reasoning benchmarks. Our work establishes a scalable approach to synthesizing high-fidelity scientific reasoning data from diverse and fallible LLMs.

58. 【2510.13163】A Matter of Representation: Towards Graph-Based Abstract Code Generation

链接https://arxiv.org/abs/2510.13163

作者:Nyx Iskandar,Hisham Bedri,Andy Tsen

类目:Computation and Language (cs.CL)

关键词:large language models, abstract code generation, graph-based abstract code, today excel, abstract code

备注

点击查看摘要

Abstract:Most large language models (LLMs) today excel at generating raw, sequential code with minimal abstractions and custom structures. However, there has been little work on graph-based abstract code generation, where significant logic is encapsulated in predefined nodes and execution flow is determined by edges. This is relevant for visual programming languages, and in cases where raw source code is inaccessible to users and LLM training sets. In this work, we propose and evaluate JSON representations for graphs to enable high accuracy graph-based abstract code generation. We evaluate these representations on ScratchTest, a mini-benchmark based on our custom Python re-implementation of Scratch, which tests the LLM in code graph space. Our findings demonstrate that LLMs can indeed perform the aforementioned generation task in a single pass without relying on specialized or complex pipelines, given the correct graph representations. We also show that different representations induce significantly different accuracies, highlighting the instrumental role of representations in this generation task. All in all, this work establishes the first steps towards representation learning for graph-based abstract code generation.

59. 【2510.13161】Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference

链接https://arxiv.org/abs/2510.13161

作者:Nikhil Bhendawade,Kumari Nishu,Arnav Kundu,Chris Bartels,Minsik Cho,Irina Belousova

类目:Computation and Language (cs.CL)

关键词:decoding accelerates LLM, accelerates LLM inference, accelerates LLM, autoregressive draft generation, increasing draft size

备注

点击查看摘要

Abstract:Speculative decoding accelerates LLM inference by using a draft model to look ahead, but gains are capped by the cost of autoregressive draft generation: increasing draft size elevates acceptance rates but introduces additional latency overhead exacerbating the speed-accuracy tradeoff. Prior methods (Medusa, Hydra, EAGLE) partially reduce draft cost but either degrade acceptance or introduce overheads that limit scaling. We present Mirror Speculative Decoding (Mirror-SD), an inference algorithm that breaks the latency-acceptance tradeoff. Mirror-SD launches branch-complete rollouts from early-exit signals in parallel with the target model's suffix and explicitly maps computation across heterogeneous accelerators (GPU and NPU) to exploit cross-device parallelism. The draft speculates forward continuations for the target to verify, while the target simultaneously speculates correction paths for the draft, converting speculation into two complementary execution pipelines. To further cut draft latency without weakening acceptance semantics, we add speculative streaming so the draft emits multiple tokens per step. This dual strategy of parallel heterogeneous execution plus multi-token speculative streaming pushes speculative decoding toward its ideal regime of high acceptance with low overhead. On SpecBench with server-scale models from 14B to 66B parameters, Mirror-SD delivers consistent end-to-end gains, achieving 2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative improvement over the strongest baseline, EAGLE3.

60. 【2510.13157】Program of Thoughts for Financial Reasoning: Leveraging Dynamic In-Context Examples and Generative Retrieval

链接https://arxiv.org/abs/2510.13157

作者:Subhendu Khatuya,Shashwat Naidu,Pawan Goyal,Niloy Ganguly

类目:Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:large language models, challenging area, numerical reasoning remains, continuous advancements, large language

备注: This work has been accepted for publication in the Main Conference of the Empirical Methods in Natural Language Processing (EMNLP) 2025

点击查看摘要

Abstract:Despite continuous advancements in the capabilities of large language models (LLMs), numerical reasoning remains a challenging area. Techniques like chain-of-thought prompting, tree-of-thought prompting, and program-of-thought prompting guide LLMs through intermediate reasoning steps. Although in-context learning with few-shot prompting has improved performance, LLMs still lag behind state-of-the-art models on financial numerical reasoning datasets such as FinQA and ConvFinQA. In this work, we introduce FINDER, a novel two-step framework, to enhance LLMs' capabilities in financial numerical reasoning. The first step utilizes a generative retriever to extract relevant facts from unstructured data, including both text and tables. This is followed by context-aware Program of Thought prompting with dynamic selection of in-context examples. Our model FINDER achieves a new state-of-the-art performance on both the FinQA and ConvFinQA datasets, surpassing previous benchmarks with execution accuracy improvements of 5.98% and 4.05%, respectively.

61. 【2510.13154】I Am Aligned, But With Whom? MENA Values Benchmark for Evaluating Cultural Alignment and Multilingual Bias in LLMs

链接https://arxiv.org/abs/2510.13154

作者:Pardis Sadat Zahraei,Ehsaneddin Asgari

类目:Computation and Language (cs.CL)

关键词:North Africa, Middle East, East and North, evaluation efforts, benchmark designed

备注

点击查看摘要

Abstract:We introduce MENAValues, a novel benchmark designed to evaluate the cultural alignment and multilingual biases of large language models (LLMs) with respect to the beliefs and values of the Middle East and North Africa (MENA) region, an underrepresented area in current AI evaluation efforts. Drawing from large-scale, authoritative human surveys, we curate a structured dataset that captures the sociocultural landscape of MENA with population-level response distributions from 16 countries. To probe LLM behavior, we evaluate diverse models across multiple conditions formed by crossing three perspective framings (neutral, personalized, and third-person/cultural observer) with two language modes (English and localized native languages: Arabic, Persian, Turkish). Our analysis reveals three critical phenomena: "Cross-Lingual Value Shifts" where identical questions yield drastically different responses based on language, "Reasoning-Induced Degradation" where prompting models to explain their reasoning worsens cultural alignment, and "Logit Leakage" where models refuse sensitive questions while internal probabilities reveal strong hidden preferences. We further demonstrate that models collapse into simplistic linguistic categories when operating in native languages, treating diverse nations as monolithic entities. MENAValues offers a scalable framework for diagnosing cultural misalignment, providing both empirical insights and methodological tools for developing more culturally inclusive AI.

62. 【2510.13143】Stable LLM Ensemble: Interaction between Example Representativeness and Diversity

链接https://arxiv.org/abs/2510.13143

作者:Junichiro Niimi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, achieved remarkable results, Large language, range of domains, achieved remarkable

备注

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable results in wide range of domains. However, the accuracy and robustness of one-shot LLM predictions remain highly sensitive to the examples and the diversity among ensemble members. This study systematically investigates the effects of example representativeness (one-shot strategy) and output diversity (sampling temperature) on LLM ensemble performance. Two one-shot strategies are compared: centroid-based representative examples (proposed) and randomly sampled examples (baseline) and sampling temperature also is varied. The proposed approach with higher temperature setting significantly outperforms random selection by +7.6% (macro-F1) and -10.5% (RMSE). Furthermore, the proposed model exceeds 5-shot prompting by +21.1% (macro-F1) and -24.0% (RMSE). Our findings demonstrate that combining representative example selection with increased temperature provides the appropriate level of diversity to the ensemble. This work highlights the practical importance of both example selection and controlled diversity in designing effective one-shot LLM ensembles.

63. 【2510.13139】Addressing the alignment problem in transportation policy making: an LLM approach

链接https://arxiv.org/abs/2510.13139

作者:Xiaoyu Yan,Tianxing Dai, Yu (Marco)Nie

类目:Computers and Society (cs.CY); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

关键词:model-driven decision tools, key challenge, heterogeneous travelers, travelers often diverge, policies produced

备注

点击查看摘要

Abstract:A key challenge in transportation planning is that the collective preferences of heterogeneous travelers often diverge from the policies produced by model-driven decision tools. This misalignment frequently results in implementation delays or failures. Here, we investigate whether large language models (LLMs), noted for their capabilities in reasoning and simulating human decision-making, can help inform and address this alignment problem. We develop a multi-agent simulation in which LLMs, acting as agents representing residents from different communities in a city, participate in a referendum on a set of transit policy proposals. Using chain-of-thought reasoning, LLM agents provide ranked-choice or approval-based preferences, which are aggregated using instant-runoff voting (IRV) to model democratic consensus. We implement this simulation framework with both GPT-4o and Claude-3.5, and apply it for Chicago and Houston. Our findings suggest that LLM agents are capable of approximating plausible collective preferences and responding to local context, while also displaying model-specific behavioral biases and modest divergences from optimization-based benchmarks. These capabilities underscore both the promise and limitations of LLMs as tools for solving the alignment problem in transportation decision-making.

64. 【2510.13117】On the Reasoning Abilities of Masked Diffusion Language Models

链接https://arxiv.org/abs/2510.13117

作者:Anej Svete,Ashish Sabharwal

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Masked diffusion models, Masked diffusion, autoregressive language models, traditional autoregressive language, diffusion models

备注

点击查看摘要

Abstract:Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent to their parallelism remain largely unexplored. To this end, we characterize what types of reasoning problems MDMs can provably solve and how efficiently. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.

65. 【2510.13115】Multi-Label Clinical Text Eligibility Classification and Summarization System

链接https://arxiv.org/abs/2510.13115

作者:Surya Tejaswi Yerramsetty,Almas Fathimah

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Natural Language Processing, Large Language Models, improve understanding, understanding of human, human health

备注

点击查看摘要

Abstract:Clinical trials are central to medical progress because they help improve understanding of human health and the healthcare system. They play a key role in discovering new ways to detect, prevent, or treat diseases, and it is essential that clinical trials include participants with appropriate and diverse medical backgrounds. In this paper, we propose a system that leverages Natural Language Processing (NLP) and Large Language Models (LLMs) to automate multi-label clinical text eligibility classification and summarization. The system combines feature extraction methods such as word embeddings (Word2Vec) and named entity recognition to identify relevant medical concepts, along with traditional vectorization techniques such as count vectorization and TF-IDF (Term Frequency-Inverse Document Frequency). We further explore weighted TF-IDF word embeddings that integrate both count-based and embedding-based strengths to capture term importance effectively. Multi-label classification using Random Forest and SVM models is applied to categorize documents based on eligibility criteria. Summarization techniques including TextRank, Luhn, and GPT-3 are evaluated to concisely summarize eligibility requirements. Evaluation with ROUGE scores demonstrates the effectiveness of the proposed methods. This system shows potential for automating clinical trial eligibility assessment using data-driven approaches, thereby improving research efficiency.

66. 【2510.13106】RUSTVIS: A Multi-Dimensional Trustworthiness Evaluation Framework for Large Language Models

链接https://arxiv.org/abs/2510.13106

作者:Ruoyu Sun,Da Song,Jiayang Song,Yuheng Huang,Lei Ma

类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Natural Language Processing, revolutionize Natural Language, Large Language Models, Language Processing, Large Language

备注: 4 pages, 2 figures, To appear in ASE 2025 Demo Track

点击查看摘要

Abstract:As Large Language Models (LLMs) continue to revolutionize Natural Language Processing (NLP) applications, critical concerns about their trustworthiness persist, particularly in safety and robustness. To address these challenges, we introduce TRUSTVIS, an automated evaluation framework that provides a comprehensive assessment of LLM trustworthiness. A key feature of our framework is its interactive user interface, designed to offer intuitive visualizations of trustworthiness metrics. By integrating well-known perturbation methods like AutoDAN and employing majority voting across various evaluation methods, TRUSTVIS not only provides reliable results but also makes complex evaluation processes accessible to users. Preliminary case studies on models like Vicuna-7b, Llama2-7b, and GPT-3.5 demonstrate the effectiveness of our framework in identifying safety and robustness vulnerabilities, while the interactive interface allows users to explore results in detail, empowering targeted model improvements. Video Link: this https URL

67. 【2510.13103】ESI: Epistemic Uncertainty Quantification via Semantic-preserving Intervention for Large Language Models

链接https://arxiv.org/abs/2510.13103

作者:Mingda Li,Xinyu Li,Weinan Zhang,Longxuan Ma

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, improve model reliability, Language Models, promising approach

备注

点击查看摘要

Abstract:Uncertainty Quantification (UQ) is a promising approach to improve model reliability, yet quantifying the uncertainty of Large Language Models (LLMs) is non-trivial. In this work, we establish a connection between the uncertainty of LLMs and their invariance under semantic-preserving intervention from a causal perspective. Building on this foundation, we propose a novel grey-box uncertainty quantification method that measures the variation in model outputs before and after the semantic-preserving intervention. Through theoretical justification, we show that our method provides an effective estimate of epistemic uncertainty. Our extensive experiments, conducted across various LLMs and a variety of question-answering (QA) datasets, demonstrate that our method excels not only in terms of effectiveness but also in computational efficiency.

68. 【2510.13079】GatePro: Parameter-Free Expert Selection Optimization for Mixture-of-Experts Models

链接https://arxiv.org/abs/2510.13079

作者:Chen Zheng,Yuhang Cai,Deyi Liu,Jin Ma,Yiyuan Ma,Yuan Yang,Jing Liu,Yutao Zeng,Xun Zhou,Siyuan Qiao

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Modern large language, Modern large, creating redundant computation, language models leverage, effective model capacity

备注

点击查看摘要

Abstract:Modern large language models leverage Mixture-of-Experts (MoE) architectures for efficient scaling, but face a critical challenge: functionally similar experts are often selected simultaneously, creating redundant computation and limiting effective model capacity. Existing auxiliary balance loss methods improve token distribution but fail to address the underlying expert diversity problem. We introduce GatePro, a novel parameter-free method that directly promotes expert selection diversity. GatePro identifies the most similar expert pairs and introduces localized competition mechanisms, preventing redundant expert co-activation while maintaining natural expert specialization. Our comprehensive evaluation demonstrates GatePro's effectiveness across model scales and benchmarks. Analysis demonstrates GatePro's ability to achieve enhanced expert diversity, where experts develop more distinct and complementary capabilities, avoiding functional redundancy. This approach can be deployed hot-swappable during any training phase without additional learnable parameters, offering a practical solution for improving MoE effectiveness.

69. 【2510.13022】On the Role of Preference Variance in Preference Optimization

链接https://arxiv.org/abs/2510.13022

作者:Jiacheng Guo,Zihao Li,Jiahao Qiu,Yue Wu,Mengdi Wang

类目:Computation and Language (cs.CL)

关键词:Direct Preference Optimization, aligning large language, large language models, Direct Preference, Preference Optimization

备注

点击查看摘要

Abstract:Direct Preference Optimization (DPO) has emerged as an important approach for learning from human preferences in aligning large language models (LLMs). However, collecting human preference data is costly and inefficient, motivating methods to reduce the required annotations. In this work, we investigate the impact of \emph{preference variance} (PVar), which measures the variance in model preferences when comparing pairs of responses, on the effectiveness of DPO training. We provide a theoretical insight by establishing an upper bound on the DPO gradient norm for any given prompt, showing it is controlled by the PVar of that prompt. This implies that prompts with low PVar can only produce small gradient updates, making them less valuable for learning. We validate this finding by fine-tuning LLMs with preferences generated by a reward model, evaluating on two benchmarks (AlpacaEval 2.0 and Arena-Hard). Experimental results demonstrate that prompts with higher PVar outperform randomly selected prompts or those with lower PVar. We also show that our PVar-based selection method is robust, when using smaller reward models (1B, 3B) for selection. Notably, in a separate experiment using the original human annotations from the UltraFeedback dataset, we found that training on only the top 10\% of prompts with the highest PVar yields better evaluation performance than training on the full dataset, highlighting the importance of preference variance in identifying informative examples for efficient LLM alignment.

70. 【2510.13008】CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models

链接https://arxiv.org/abs/2510.13008

作者:Pavan Kalyan,Shubhra Mishra,Satya Lokam,Navin Goyal

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:covering ages 5-10, ages 5-10, human developmental trajectories, enabling systematic, introduce a comprehensive

备注

点击查看摘要

Abstract:We introduce a comprehensive continual learning dataset and benchmark (CurlL) grounded in human developmental trajectories from ages 5-10, enabling systematic and fine-grained assessment of models' ability to progressively acquire new skills. CurlL spans five developmental stages (0-4) covering ages 5-10, supported by a skill graph that breaks down broad skills into smaller abilities, concrete goals, and measurable indicators, while also capturing which abilities build on others. We generate a 23.4B-token synthetic dataset with controlled skill progression, vocabulary complexity, and format diversity, comprising paragraphs, comprehension-based QA (CQA), skill-testing QA (CSQA), and instruction-response (IR) pairs. Stage-wise token counts range from 2.12B to 6.78B tokens, supporting precise analysis of forgetting, forward transfer, and backward transfer. Using a 135M-parameter transformer trained under independent, joint, and sequential (continual) setups, we show trade-offs in skill retention and transfer efficiency. By mirroring human learning patterns and providing fine-grained control over skill dependencies, this work advances continual learning evaluations for language models.

71. 【2510.13003】OPLoRA: Orthogonal Projection LoRA Prevents Catastrophic Forgetting during Parameter-Efficient Fine-Tuning

链接https://arxiv.org/abs/2510.13003

作者:Yifeng Xiong,Xiaohui Xie

类目:Computation and Language (cs.CL)

关键词:large language models, encode essential pre-trained, Low-Rank Adaptation, enables efficient fine-tuning, learned updates interfere

备注

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) enables efficient fine-tuning of large language models but suffers from catastrophic forgetting when learned updates interfere with the dominant singular directions that encode essential pre-trained knowledge. We propose Orthogonal Projection LoRA (OPLoRA), a theoretically grounded approach that prevents this interference through double-sided orthogonal projections. By decomposing frozen weights via SVD, OPLoRA constrains LoRA updates to lie entirely within the orthogonal complement of the top-$k$ singular subspace using projections $P_L = I - U_k U_k^\top$ and $P_R = I - V_k V_k^\top$. We prove that this construction exactly preserves the top-$k$ singular triples, providing mathematical guarantees for knowledge retention. To quantify subspace interference, we introduce $\rho_k$, a metric measuring update alignment with dominant directions. Extensive experiments across commonsense reasoning, mathematics, and code generation demonstrate that OPLoRA significantly reduces forgetting while maintaining competitive task-specific performance on LLaMA-2 7B and Qwen2.5 7B, establishing orthogonal projection as an effective mechanism for knowledge preservation in parameter-efficient fine-tuning.

72. 【2510.12997】Max It or Miss It: Benchmarking LLM On Solving Extremal Problems

链接https://arxiv.org/abs/2510.12997

作者:Binxin Gao,Jingjun Han

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:enabled Large Language, Large Language Models, Large Language, generating final answers, Test-time scaling

备注: Our benchmark dataset is available at [this https URL](https://huggingface.co/datasets/binxingao/extrem-bench)

点击查看摘要

Abstract:Test-time scaling has enabled Large Language Models (LLMs) with remarkable reasoning capabilities, particularly in mathematical domains, through intermediate chain-of-thought (CoT) reasoning before generating final answers. However, the specific sources and mechanisms underlying these reasoning capabilities remain insufficiently understood. Optimization reasoning, i.e. finding extrema under constraints, represents a fundamental abstraction that underpins critical applications in planning, control, resource allocation, and prompt search. To systematically evaluate this capability, we introduce ExtremBench, a benchmark dataset for solving mathematical extremal problems, curated from inequality exercises used for Chinese Mathematical Olympiad and transformed into $93$ standardized extrema-finding problems. We conduct extensive evaluations across various state-of-the-art open-source model families, including the Qwen3, GPT-OSS, and DeepSeek. Our results reveal that LLMs' extremal-solving reasoning capabilities do not always align with those of current mathematical benchmarks such as AIME25 and MATH-500, with some models showing strong general mathematical reasoning but poor extremal-solving skills, and vice versa. This discrepancy highlights a critical gap in current evaluation practices and suggests that existing benchmarks may not comprehensively capture the full spectrum of mathematical reasoning abilities.

73. 【2510.12993】A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

链接https://arxiv.org/abs/2510.12993

作者:João A. Leite,Arnav Arora,Silvia Gargova,João Luz,Gustavo Sampaio,Ian Roberts,Carolina Scarton,Kalina Bontcheva

类目:Computation and Language (cs.CL)

关键词:Large Language Models, proficiency of Large, Large Language, human-like proficiency, brought concerns

备注

点击查看摘要

Abstract:The human-like proficiency of Large Language Models (LLMs) has brought concerns about their potential misuse for generating persuasive and personalised disinformation at scale. While prior work has demonstrated that LLMs can generate disinformation, specific questions around persuasiveness and personalisation (generation of disinformation tailored to specific demographic attributes) remain largely unstudied. This paper presents the first large-scale, multilingual empirical study on persona-targeted disinformation generation by LLMs. Employing a red teaming methodology, we systematically evaluate the robustness of LLM safety mechanisms to persona-targeted prompts. A key novel result is AI-TRAITS (AI-generaTed peRsonAlIsed disinformaTion dataSet), a new dataset of around 1.6 million texts generated by eight state-of-the-art LLMs. AI-TRAITS is seeded by prompts that combine 324 disinformation narratives and 150 distinct persona profiles, covering four major languages (English, Russian, Portuguese, Hindi) and key demographic dimensions (country, generation, political orientation). The resulting personalised narratives are then assessed quantitatively and compared along the dimensions of models, languages, jailbreaking rate, and personalisation attributes. Our findings demonstrate that the use of even simple personalisation strategies in the prompts significantly increases the likelihood of jailbreaks for all studied LLMs. Furthermore, personalised prompts result in altered linguistic and rhetorical patterns and amplify the persuasiveness of the LLM-generated false narratives. These insights expose critical vulnerabilities in current state-of-the-art LLMs and offer a foundation for improving safety alignment and detection strategies in multilingual and cross-demographic contexts.

74. 【2510.12992】UNCAP: Uncertainty-Guided Planning Using Natural Language Communication for Cooperative Autonomous Vehicles

链接https://arxiv.org/abs/2510.12992

作者:Neel P. Bhatt,Po-han Li,Kushagra Gupta,Rohan Siva,Daniel Milan,Alexander T. Hogue,Sandeep P. Chinchali,David Fridovich-Keil,Zhangyang Wang,Ufuk Topcu

类目:Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

关键词:Safe large-scale coordination, multiple cooperative connected, cooperative connected autonomous, efficient and interpretable, connected autonomous vehicles

备注

点击查看摘要

Abstract:Safe large-scale coordination of multiple cooperative connected autonomous vehicles (CAVs) hinges on communication that is both efficient and interpretable. Existing approaches either rely on transmitting high-bandwidth raw sensor data streams or neglect perception and planning uncertainties inherent in shared data, resulting in systems that are neither scalable nor safe. To address these limitations, we propose Uncertainty-Guided Natural Language Cooperative Autonomous Planning (UNCAP), a vision-language model-based planning approach that enables CAVs to communicate via lightweight natural language messages while explicitly accounting for perception uncertainty in decision-making. UNCAP features a two-stage communication protocol: (i) an ego CAV first identifies the subset of vehicles most relevant for information exchange, and (ii) the selected CAVs then transmit messages that quantitatively express their perception uncertainty. By selectively fusing messages that maximize mutual information, this strategy allows the ego vehicle to integrate only the most relevant signals into its decision-making, improving both the scalability and reliability of cooperative planning. Experiments across diverse driving scenarios show a 63% reduction in communication bandwidth with a 31% increase in driving safety score, a 61% reduction in decision uncertainty, and a four-fold increase in collision distance margin during near-miss events. Project website: this https URL

75. 【2510.12979】DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping

链接https://arxiv.org/abs/2510.12979

作者:Wei Fan,Wenlin Yao,Zheng Li,Feng Yao,Xin Liu,Liang Qiu,Qingyu Yin,Yangqiu Song,Bing Yin

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, leveraging external tools, tackle complex tasks, Large language, action generation abilities

备注: Under Review

点击查看摘要

Abstract:Large language models (LLMs) augmented with multi-step reasoning and action generation abilities have shown promise in leveraging external tools to tackle complex tasks that require long-horizon planning. However, existing approaches either rely on implicit planning in the reasoning stage or introduce explicit planners without systematically addressing how to optimize the planning stage. As evidence, we observe that under vanilla reinforcement learning (RL), planning tokens exhibit significantly higher entropy than other action tokens, revealing uncertain decision points that remain under-optimized. To address this, we propose DeepPlanner, an end-to-end RL framework that effectively enhances the planning capabilities of deep research agents. Our approach shapes token-level advantage with an entropy-based term to allocate larger updates to high entropy tokens, and selectively upweights sample-level advantages for planning-intensive rollouts. Extensive experiments across seven deep research benchmarks demonstrate that DeepPlanner improves planning quality and achieves state-of-the-art results under a substantially lower training budget.

76. 【2510.12966】3-Model Speculative Decoding

链接https://arxiv.org/abs/2510.12966

作者:Sanghyun Byun,Mohanad Odema,Jung Ick Guack,Baisub Lee,Jacob Song,Woo Seong Chung

类目:Computation and Language (cs.CL)

关键词:large language models, smaller draft models, Pyramid Speculative Decoding, Speculative Decoding, larger target model

备注: Accepted at NeurIPS SPIGM 2025

点击查看摘要

Abstract:Speculative Decoding (SD) accelerates inference in large language models by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, the throughput gains of SD are fundamentally limited by a trade-off between draft model size and token acceptance: smaller draft models generate tokens more quickly but exhibit greater divergence from the target model, resulting in lower acceptance rates and reduced speedups. We introduce Pyramid Speculative Decoding (PyramidSD), an extension of SD that inserts an intermediate qualifier model between the draft and target to bridge the distributional gap in output predictions, allowing smaller model to be used for drafting. This hierarchical decoding strategy improves alignment across models, enabling higher acceptance rates and allowing the use of significantly smaller draft models without sacrificing overall performance. PyramidSD builds on fuzzy acceptance criteria to support relaxed divergence thresholds at each stage, improving throughput. In experiments, PyramidSD achieves up to 1.91x generation speed over standard SD, reaching 124 tokens per second on a consumer GPU (RTX 4090). In small-memory settings with a 1B-parameter draft model and an 8B target model, PyramidSD minimally trades target model quality for improved throughput. Overall, PyramidSD offers a practical approach to enhancing speculative decoding efficiency and can be readily applied to existing inference pipelines.

77. 【2510.12943】he Curious Case of Curiosity across Human Cultures and LLMs

链接https://arxiv.org/abs/2510.12943

作者:Angana Borah,Rada Mihalcea

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Recent advances, advances in Large, Language Models

备注: Preprint (Paper under review)

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have expanded their role in human interaction, yet curiosity -- a central driver of inquiry -- remains underexplored in these systems, particularly across cultural contexts. In this work, we investigate cultural variation in curiosity using Yahoo! Answers, a real-world multi-country dataset spanning diverse topics. We introduce CUEST (CUriosity Evaluation across SocieTies), an evaluation framework that measures human-model alignment in curiosity through linguistic (style), topic preference (content) analysis and grounding insights in social science constructs. Across open- and closed-source models, we find that LLMs flatten cross-cultural diversity, aligning more closely with how curiosity is expressed in Western countries. We then explore fine-tuning strategies to induce curiosity in LLMs, narrowing the human-model alignment gap by up to 50\%. Finally, we demonstrate the practical value of curiosity for LLM adaptability across cultures, showing its importance for future NLP research.

78. 【2510.12931】Unifying Vision-Language Latents for Zero-label Image Caption Enhancement

链接https://arxiv.org/abs/2510.12931

作者:Sanghyun Byun,Jung Ick Guack,Mohanad Odema,Baisub Lee,Jacob Song,Woo Seong Chung

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:achieve remarkable performance, large-scale image-text pretraining, achieve remarkable, image-text pretraining, Unified Vision-Language Alignment

备注: Accepted to PMLR and NeurIPS 2025 UniReps

点击查看摘要

Abstract:Vision-language models (VLMs) achieve remarkable performance through large-scale image-text pretraining. However, their reliance on labeled image datasets limits scalability and leaves vast amounts of unlabeled image data underutilized. To address this, we propose Unified Vision-Language Alignment for Zero-Label Enhancement (ViZer), an enhancement training framework that enables zero-label learning in image captioning, providing a practical starting point for broader zero-label adaptation in vision-language tasks. Unlike prior approaches that rely on human or synthetically annotated datasets, ViZer actively aligns vision and language representation features during training, enabling existing VLMs to generate improved captions without requiring text labels or full retraining. We demonstrate ViZer's advantage in qualitative evaluation, as automated caption metrics such as CIDEr and BERTScore often penalize details that are absent in reference captions. Applying ViZer on SmolVLM-Base and Qwen2-VL, we observe consistent qualitative improvements, producing captions that are more grounded and descriptive than their baseline.

79. 【2510.12925】Who's Asking? Evaluating LLM Robustness to Inquiry Personas in Factual Question Answering

链接https://arxiv.org/abs/2510.12925

作者:Nil-Jana Akpinar,Chia-Jung Lee,Vanessa Murdock,Pietro Perona

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, self-disclosed personal information, factual questions truthfully, answer factual questions

备注

点击查看摘要

Abstract:Large Language Models (LLMs) should answer factual questions truthfully, grounded in objective knowledge, regardless of user context such as self-disclosed personal information, or system personalization. In this paper, we present the first systematic evaluation of LLM robustness to inquiry personas, i.e. user profiles that convey attributes like identity, expertise, or belief. While prior work has primarily focused on adversarial inputs or distractors for robustness testing, we evaluate plausible, human-centered inquiry persona cues that users disclose in real-world interactions. We find that such cues can meaningfully alter QA accuracy and trigger failure modes such as refusals, hallucinated limitations, and role confusion. These effects highlight how model sensitivity to user framing can compromise factual reliability, and position inquiry persona testing as an effective tool for robustness evaluation.

80. 【2510.12915】oward LLM-Supported Automated Assessment of Critical Thinking Subskills

链接https://arxiv.org/abs/2510.12915

作者:Marisa C. Peczuh,Nischal Ashok Kumar,Ryan Baker,Blair Lehman,Danielle Eisenberg,Caitlin Mills,Keerthi Chebrolu,Sudhip Nashi,Cadence Young,Brayden Liu,Sherry Lachman,Andrew Lan

类目:Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:today education landscape, Critical thinking, education landscape, fundamental competency, competency in today

备注: preprint: 17 pages

点击查看摘要

Abstract:Critical thinking represents a fundamental competency in today's education landscape. Developing critical thinking skills through timely assessment and feedback is crucial; however, there has not been extensive work in the learning analytics community on defining, measuring, and supporting critical thinking. In this paper, we investigate the feasibility of measuring core "subskills" that underlie critical thinking. We ground our work in an authentic task where students operationalize critical thinking: student-written argumentative essays. We developed a coding rubric based on an established skills progression and completed human coding for a corpus of student essays. We then evaluated three distinct approaches to automated scoring: zero-shot prompting, few-shot prompting, and supervised fine-tuning, implemented across three large language models (GPT-5, GPT-5-mini, and ModernBERT). GPT-5 with few-shot prompting achieved the strongest results and demonstrated particular strength on subskills with separable, frequent categories, while lower performance was observed for subskills that required detection of subtle distinctions or rare categories. Our results underscore critical trade-offs in automated critical thinking assessment: proprietary models offer superior reliability at higher cost, while open-source alternatives provide practical accuracy with reduced sensitivity to minority categories. Our work represents an initial step toward scalable assessment of higher-order reasoning skills across authentic educational contexts.

81. 【2510.12899】EduDial: Constructing a Large-scale Multi-turn Teacher-Student Dialogue Corpus

链接https://arxiv.org/abs/2510.12899

作者:Shouang Wei,Min Zhang,Xin Lin,Bo Jiang,Zhongxiang Dai,Kun Kuang

类目:Computation and Language (cs.CL)

关键词:large language models, proposed to evaluate, evaluate the conversational, large language, multi-turn dialogue benchmarks

备注

点击查看摘要

Abstract:Recently, several multi-turn dialogue benchmarks have been proposed to evaluate the conversational abilities of large language models (LLMs). As LLMs are increasingly recognized as a key technology for advancing intelligent education, owing to their ability to deeply understand instructional contexts and provide personalized guidance, the construction of dedicated teacher-student dialogue benchmarks has become particularly important. To this end, we present EduDial, a comprehensive multi-turn teacher-student dialogue dataset. EduDial covers 345 core knowledge points and consists of 34,250 dialogue sessions generated through interactions between teacher and student agents. Its design is guided by Bloom's taxonomy of educational objectives and incorporates ten questioning strategies, including situational questioning, zone of proximal development (ZPD) questioning, and metacognitive questioning-thus better capturing authentic classroom interactions. Furthermore, we design differentiated teaching strategies for students at different cognitive levels, thereby providing more targeted teaching guidance. Building on EduDial, we further develop EduDial-LLM 32B via training and propose an 11-dimensional evaluation framework that systematically measures the teaching abilities of LLMs, encompassing both overall teaching quality and content quality. Experiments on 17 mainstream LLMs reveal that most models struggle in student-centered teaching scenarios, whereas our EduDial-LLM achieves significant gains, consistently outperforming all baselines across all metrics. The code is available at this https URL.

82. 【2510.12864】From Literal to Liberal: A Meta-Prompting Framework for Eliciting Human-Aligned Exception Handling in Large Language Models

链接https://arxiv.org/abs/2510.12864

作者:Imran Khan

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, human common sense, Language Models, RID framework

备注: 13 pages. Code and data are available at [this https URL](https://github.com/strongSoda/LITERAL-TO-LIBERAL)

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly being deployed as the reasoning engines for agentic AI systems, yet they exhibit a critical flaw: a rigid adherence to explicit rules that leads to decisions misaligned with human common sense and intent. This "rule-rigidity" is a significant barrier to building trustworthy autonomous agents. While prior work has shown that supervised fine-tuning (SFT) with human explanations can mitigate this issue, SFT is computationally expensive and inaccessible to many practitioners. To address this gap, we introduce the Rule-Intent Distinction (RID) Framework, a novel, low-compute meta-prompting technique designed to elicit human-aligned exception handling in LLMs in a zero-shot manner. The RID framework provides the model with a structured cognitive schema for deconstructing tasks, classifying rules, weighing conflicting outcomes, and justifying its final decision. We evaluated the RID framework against baseline and Chain-of-Thought (CoT) prompting on a custom benchmark of 20 scenarios requiring nuanced judgment across diverse domains. Our human-verified results demonstrate that the RID framework significantly improves performance, achieving a 95% Human Alignment Score (HAS), compared to 80% for the baseline and 75% for CoT. Furthermore, it consistently produces higher-quality, intent-driven reasoning. This work presents a practical, accessible, and effective method for steering LLMs from literal instruction-following to liberal, goal-oriented reasoning, paving the way for more reliable and pragmatic AI agents.

83. 【2510.12858】A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation

链接https://arxiv.org/abs/2510.12858

作者:Mohammed Hilal Al-Kharusi,Khizar Hayat,Khalil Bader Al Ruqeishi,Haroon Rashid Lone

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)

关键词:faces significant pedagogical, significant pedagogical challenges, governed by precise, precise phonetic, faces significant

备注: 33 pages

点击查看摘要

Abstract:The sacred practice of Quranic recitation (Tajweed), governed by precise phonetic, prosodic, and theological rules, faces significant pedagogical challenges in the modern era. While digital technologies promise unprecedented access to education, automated tools for recitation evaluation have failed to achieve widespread adoption or pedagogical efficacy. This literature review investigates this critical gap, conducting a comprehensive analysis of academic research, web platforms, and commercial applications developed over the past two decades. Our synthesis reveals a fundamental misalignment in prevailing approaches that repurpose Automatic Speech Recognition (ASR) architectures, which prioritize lexical recognition over qualitative acoustic assessment and are plagued by data dependency, demographic biases, and an inability to provide diagnostically useful feedback. Critiquing these data--driven paradigms, we argue for a foundational paradigm shift towards a knowledge-centric computational framework. Capitalizing on the immutable nature of the Quranic text and the precisely defined rules of Tajweed, we propose that a robust evaluator must be architected around anticipatory acoustic modeling based on canonical rules and articulation points (Makhraj), rather than relying on statistical patterns learned from imperfect and biased datasets. This review concludes that the future of automated Quranic evaluation lies in hybrid systems that integrate deep linguistic knowledge with advanced audio analysis, offering a path toward robust, equitable, and pedagogically sound tools that can faithfully support learners worldwide.

84. 【2510.12856】Efficient Adaptive Transformer: An Empirical Study and Reproducible Framework

链接https://arxiv.org/abs/2510.12856

作者:Jan Miller

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:progressive token pruning, Efficient Adaptive Transformer, adaptive efficiency techniques, dynamic early exiting, sparse attention

备注: 10 pages, 6 figures, pgfplots tables included; BibTeX compiled to .bbl. Code and reproducibility artifacts referenced in the paper

点击查看摘要

Abstract:The Efficient Adaptive Transformer (EAT) framework unifies three adaptive efficiency techniques - progressive token pruning, sparse attention, and dynamic early exiting - into a single, reproducible architecture for input-adaptive inference. EAT provides an open-source benchmarking pipeline that automates data processing, timing, and ablation across GLUE tasks (SST-2, QQP, MNLI). Although this empirical study finds that combining these mechanisms can increase latency in shallow six-layer models, it demonstrates that EAT achieves slightly higher accuracy than the optimized DistilBERT baseline on SST-2, illustrating the potential of dynamic computation for latency-sensitive NLP. The main contribution is the open, end-to-end reproducible framework - complete with scripts, CSV logging, and analysis utilities - intended to serve as a community tool for further research on adaptive transformers.

85. 【2510.12845】VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages

链接https://arxiv.org/abs/2510.12845

作者:Jesse Atuhurra,Iqra Ali,Tomoya Iwakura,Hidetaka Kamigaito,Tatsuya Hiraoka

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Vision Language Models, Vision Language, pivotal for advancing, advancing perception, predominantly English-centric benchmarks

备注

点击查看摘要

Abstract:Vision Language Models (VLMs) are pivotal for advancing perception in intelligent agents. Yet, evaluation of VLMs remains limited to predominantly English-centric benchmarks in which the image-text pairs comprise short texts. To evaluate VLM fine-grained abilities, in four languages under long-text settings, we introduce a novel multilingual benchmark VLURes featuring eight vision-and-language tasks, and a pioneering unrelatedness task, to probe the fine-grained Visual and Linguistic Understanding capabilities of VLMs across English, Japanese, and low-resource languages, Swahili, and Urdu. Our datasets, curated from web resources in the target language, encompass ten diverse image categories and rich textual context, introducing valuable vision-language resources for Swahili and Urdu. By prompting VLMs to generate responses and rationales, evaluated automatically and by native speakers, we uncover performance disparities across languages and tasks critical to intelligent agents, such as object recognition, scene understanding, and relationship understanding. We conducted evaluations of ten VLMs with VLURes. The best performing model, GPT-4o, achieves an overall accuracy of 90.8% and lags human performance by 6.7%, though the gap is larger for open-source models. The gap highlights VLURes' critical role in developing intelligent agents to tackle multi-modal visual reasoning.

86. 【2510.12839】FaStFACT: Faster, Stronger Long-Form Factuality Evaluations in LLMs

链接https://arxiv.org/abs/2510.12839

作者:Yingjia Wan,Haochen Tan,Xiao Zhu,Xinyu Zhou,Zhiwei Li,Qingsong Lv,Changxuan Sun,Jiaqi Zeng,Yi Xu,Jianqiao Lu,Yinhong Liu,Zhijiang Guo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY)

关键词:Large Language Models, remains challenging due, costly human assessment, Language Models, Large Language

备注: EMNLP 2025 (Findings)

点击查看摘要

Abstract:Evaluating the factuality of long-form generations from Large Language Models (LLMs) remains challenging due to accuracy issues and costly human assessment. Prior efforts attempt this by decomposing text into claims, searching for evidence, and verifying claims, but suffer from critical drawbacks: (1) inefficiency due to complex pipeline components unsuitable for long LLM outputs, and (2) ineffectiveness stemming from inaccurate claim sets and insufficient evidence collection of one-line snippets. To address these limitations, we propose \name, a fast and strong evaluation framework that achieves the highest alignment with human evaluation and efficiency among existing baselines. \name first employs chunk-level claim extraction integrated with confidence-based pre-verification, significantly reducing the cost of web searching and inference calling while ensuring reliability. For searching and verification, it collects document-level evidence from crawled webpages and selectively retrieves it during verification, addressing the evidence insufficiency problem in previous pipelines. Extensive experiments based on an aggregated and manually annotated benchmark demonstrate the reliability of \name in both efficiently and effectively evaluating the factuality of long-form LLM generations. Code and benchmark data is available at this https URL.

Comments:
EMNLP 2025 (Findings)

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY)

Cite as:
arXiv:2510.12839 [cs.CL]

(or
arXiv:2510.12839v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2510.12839

Focus to learn more

              arXiv-issued DOI via DataCite</p>
87. 【2510.12838】A\textsuperscript{2}FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning

链接https://arxiv.org/abs/2510.12838

作者:Qianben Chen,Jingyi Cao,Jiayu Zhang,Tianrui Qin,Xiaowan Li,King Zhu,Dingfeng Shi,He Zhu,Minghao Liu,Xiaobo Liang,Ge Zhang,Jian Yang,Yuchen Eleanor Jiang,Wangchunshu Zhou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, invoke external tools, Large language, language models split, strengthen internal

备注: 9 pages, 5 figures, submitted to ICLR 2026

点击查看摘要

Abstract:Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode-instant-that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A$^2$FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost of pass of only $0.00487 per correct answer-cutting cost by 45.2% relative to reasoning and 33.5% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.

88. 【2510.12835】Repurposing Annotation Guidelines to Instruct LLM Annotators: A Case Study

链接https://arxiv.org/abs/2510.12835

作者:Kon Woo Kim(National Institute of Informatics, Japan),Rezarta Islamaj(National Library of Medicine, USA),Jin-Dong Kim(Joint Support-Center for Data Science Research, Japan),Florian Boudin(Japanese-French Laboratory of Informatics, CNRS, Nantes University, Japan),Akiko Aizawa(National Institute of Informatics, Japan)

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language model, instruct large language, text annotation tasks, language model, investigates how existing

备注: 11 pages, 2 figures, 3 tables, This is a preprint of the article accepted at NLDB 2025 (Springer LNCS). The final version is available at [this https URL](https://doi.org/10.1007/978-3-031-97144-0_13)

点击查看摘要

Abstract:This study investigates how existing annotation guidelines can be repurposed to instruct large language model (LLM) annotators for text annotation tasks. Traditional guidelines are written for human annotators who internalize training, while LLMs require explicit, structured instructions. We propose a moderation-oriented guideline repurposing method that transforms guidelines into clear directives for LLMs through an LLM moderation process. Using the NCBI Disease Corpus as a case study, our experiments show that repurposed guidelines can effectively guide LLM annotators, while revealing several practical challenges. The results highlight the potential of this workflow to support scalable and cost-effective refinement of annotation guidelines and automated annotation.

89. 【2510.12831】MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training

链接https://arxiv.org/abs/2510.12831

作者:Taicheng Guo,Hai Wang,ChaoChun Liu,Mohsen Golalikhani,Xin Chen,Xiangliang Zhang,Chandan K. Reddy

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

关键词:executable SQL, SQL while preserving, user conversational utterances, aims to translate, target schema

备注

点击查看摘要

Abstract:Multi-turn Text-to-SQL aims to translate a user's conversational utterances into executable SQL while preserving dialogue coherence and grounding to the target schema. However, most existing systems only regard this task as a simple text translation task and follow a short-horizon paradigm, generating a query per turn without execution, explicit verification, and refinement, which leads to non-executable or incoherent outputs. We present MTSQL-R1, an agentic training framework for long-horizon multi-turn Text-to-SQL. We cast the task as a Markov Decision Process (MDP) in which an agent interacts with (i) a database for execution feedback and (ii) a persistent dialogue memory for coherence verification, performing an iterative propose to execute - verify - refine cycle until all checks pass. Experiments on COSQL and SPARC demonstrate that MTSQL-R1 consistently outperforms strong baselines, highlighting the importance of environment-driven verification and memory-guided refinement for conversational semantic parsing. Full recipes (including code, trained models, logs, reasoning trajectories, etc.) will be released after the internal review to contribute to community research.

90. 【2510.12829】Mathematics with large language models as provers and verifiers

链接https://arxiv.org/abs/2510.12829

作者:Hieu Le Duc,Leo Liberti

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)

关键词:International Mathematical Olympiad, interesting success stories, started reporting interesting, reporting interesting success, Feldman Karbasi

备注

点击查看摘要

Abstract:During 2024 and 2025 the discussion about the theorem-proving capabilities of large language models started reporting interesting success stories, mostly to do with difficult exercises (such as problems from the International Mathematical Olympiad), but also with conjectures [Feldman Karbasi, arXiv:2509.18383v1] formulated for the purpose of verifying whether the artificial intelligence could prove it. In this paper we report a theorem proving feat achieved by ChatGPT by using a protocol involving different prover and verifier instances of the gpt-5 model working collaboratively. To make sure that the produced proofs do not suffer from hallucinations, the final proof is formally verified by the lean proof assistant, and the conformance of premises and conclusion of the lean code is verified by a human. Our methodology was able to solve five out of six 2025 IMO problems, and close a third of the sixty-six number theory conjectures in [Cohen, Journal of Integer Sequences, 2025].

91. 【2510.12826】Scheming Ability in LLM-to-LLM Strategic Interactions

链接https://arxiv.org/abs/2510.12826

作者:Thao Pham

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

关键词:diverse contexts, evaluating their capacity, large language model, large language, deployed autonomously

备注: 25 pages, 13 figures, under review at IASEAI'26

点击查看摘要

Abstract:As large language model (LLM) agents are deployed autonomously in diverse contexts, evaluating their capacity for strategic deception becomes crucial. While recent research has examined how AI systems scheme against human developers, LLM-to-LLM scheming remains underexplored. We investigate the scheming ability and propensity of frontier LLM agents through two game-theoretic frameworks: a Cheap Talk signaling game and a Peer Evaluation adversarial game. Testing four models (GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, and Llama-3.3-70b), we measure scheming performance with and without explicit prompting while analyzing scheming tactics through chain-of-thought reasoning. When prompted, most models, especially Gemini-2.5-pro and Claude-3.7-Sonnet, achieved near-perfect performance. Critically, models exhibited significant scheming propensity without prompting: all models chose deception over confession in Peer Evaluation (100% rate), while models choosing to scheme in Cheap Talk succeeded at 95-100% rates. These findings highlight the need for robust evaluations using high-stakes game-theoretic scenarios in multi-agent settings.

92. 【2510.12825】Classifier-Augmented Generation for Structured Workflow Prediction

链接https://arxiv.org/abs/2510.12825

作者:Thomas Gschwind,Shramona Chakraborty,Nitin Gupta,Sameep Mehta

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

关键词:requires deep tool, visually assemble complex, assemble complex data, remains time consuming, deep tool knowledge

备注: Accepted at EMNLP 2025

点击查看摘要

Abstract:ETL (Extract, Transform, Load) tools such as IBM DataStage allow users to visually assemble complex data workflows, but configuring stages and their properties remains time consuming and requires deep tool knowledge. We propose a system that translates natural language descriptions into executable workflows, automatically predicting both the structure and detailed configuration of the flow. At its core lies a Classifier-Augmented Generation (CAG) approach that combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. These stages are then connected into non-linear workflows using edge prediction, and stage properties are inferred from sub-utterance context. We compare CAG against strong single-prompt and agentic baselines, showing improved accuracy and efficiency, while substantially reducing token usage. Our architecture is modular, interpretable, and capable of end-to-end workflow generation, including robust validation steps. To our knowledge, this is the first system with a detailed evaluation across stage prediction, edge layout, and property generation for natural-language-driven ETL authoring.

93. 【2510.12818】MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning

链接https://arxiv.org/abs/2510.12818

作者:Rajarshi Ghosh,Abhay Gupta,Hudson McBride,Anurag Vaidya,Faisal Mahmood

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词:Large language models, Large language, subtle demographic cues, clinical decision support, decision support

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in clinical decision support, yet subtle demographic cues can influence their reasoning. Prior work has documented disparities in outputs across patient groups, but little is known about how internal reasoning shifts under controlled demographic changes. We introduce MEDEQUALQA, a counterfactual benchmark that perturbs only patient pronouns (he/him, she/her, they/them) while holding critical symptoms and conditions (CSCs) constant. Each clinical vignette is expanded into single-CSC ablations, producing three parallel datasets of approximately 23,000 items each (69,000 total). We evaluate a GPT-4.1 model and compute Semantic Textual Similarity (STS) between reasoning traces to measure stability across pronoun variants. Our results show overall high similarity (mean STS 0.80), but reveal consistent localized divergences in cited risk factors, guideline anchors, and differential ordering, even when final diagnoses remain unchanged. Our error analysis highlights certain cases in which the reasoning shifts, underscoring clinically relevant bias loci that may cascade into inequitable care. MEDEQUALQA offers a controlled diagnostic setting for auditing reasoning stability in medical AI.

94. 【2510.12817】From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP

链接https://arxiv.org/abs/2510.12817

作者:Shanshan Xu,Santosh T.Y.S.S,Barbara Plank

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词:Human Label Variation, Label Variation, refers to legitimate, mere error, legitimate disagreement

备注

点击查看摘要

Abstract:Human Label Variation (HLV) refers to legitimate disagreement in annotation that reflects the genuine diversity of human perspectives rather than mere error. For decades, HLV in NLP was dismissed as noise to be discarded, and only slowly over the last decade has it been reframed as a signal for improving model robustness. With the rise of large language models (LLMs), where post-training on human feedback has become central to model alignment, the role of HLV has become increasingly consequential. Yet current preference-learning datasets routinely aggregate multiple annotations into a single label, thereby flattening diverse perspectives into a false universal agreement and erasing precisely the pluralism of human values that alignment aims to preserve. In this position paper, we argue that preserving HLV as an embodiment of human pluralism must be treated as a Selbstzweck - a goal it self when designing AI systems. We call for proactively incorporating HLV into preference datasets and outline actionable steps towards it.

95. 【2510.12813】Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study

链接https://arxiv.org/abs/2510.12813

作者:Soheil Hashtarkhani,Rezaur Rashid,Christopher L Brett,Lokesh Chinthala,Fekede Asefa Kumsa,Janet A Zink,Robert L Davis,David L Schwartz,Arash Shaban-Nejad

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:requiring efficient preprocessing, predictive health care, enable predictive health, Electronic health records, health care models

备注: 8 Pages

点击查看摘要

Abstract:Electronic health records contain inconsistently structured or free-text data, requiring efficient preprocessing to enable predictive health care models. Although artificial intelligence-driven natural language processing tools show promise for automating diagnosis classification, their comparative performance and clinical reliability require systematic evaluation. The aim of this study is to evaluate the performance of 4 large language models (GPT-3.5, GPT-4o, Llama 3.2, and Gemini 1.5) and BioBERT in classifying cancer diagnoses from structured and unstructured electronic health records data. We analyzed 762 unique diagnoses (326 International Classification of Diseases (ICD) code descriptions, 436free-text entries) from 3456 records of patients with cancer. Models were tested on their ability to categorize diagnoses into 14predefined categories. Two oncology experts validated classifications. BioBERT achieved the highest weighted macro F1-score for ICD codes (84.2) and matched GPT-4o in ICD code accuracy (90.8). For free-text diagnoses, GPT-4o outperformed BioBERT in weighted macro F1-score (71.8 vs 61.5) and achieved slightly higher accuracy (81.9 vs 81.6). GPT-3.5, Gemini, and Llama showed lower overall performance on both formats. Common misclassification patterns included confusion between metastasis and central nervous system tumors, as well as errors involving ambiguous or overlapping clinical terminology. Although current performance levels appear sufficient for administrative and research use, reliable clinical applications will require standardized documentation practices alongside robust human oversight for high-stakes decision-making.

96. 【2510.12807】Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning

链接https://arxiv.org/abs/2510.12807

作者:Mahdi Cherakhloo,Arash Abbasi,Mohammad Saeid Sarafraz,Bijan Vosoughi Vahdat

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:demonstrated remarkable capabilities, Large Language Models, Large Language, Persian Natural Language, Natural Language Processing

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous languages; however, their effectiveness in low-resource languages like Persian requires thorough investigation. This paper presents a comprehensive benchmark of several open-source LLMs for Persian Natural Language Processing (NLP) tasks, utilizing both zero-shot and few-shot learning paradigms. We evaluate models across a range of tasks including sentiment analysis, named entity recognition, reading comprehension, and question answering, using established Persian datasets such as ParsiNLU and ArmanEmo. Our methodology encompasses rigorous experimental setups for both zero-shot and few-shot scenarios, employing metrics such as Accuracy, F1-score, BLEU, and ROUGE for performance evaluation. The results reveal that Gemma 2 consistently outperforms other models across nearly all tasks in both learning paradigms, with particularly strong performance in complex reasoning tasks. However, most models struggle with token-level understanding tasks like Named Entity Recognition, highlighting specific challenges in Persian language processing. This study contributes to the growing body of research on multilingual LLMs, providing valuable insights into their performance in Persian and offering a benchmark for future model development.

97. 【2510.12803】AutoCode: LLMs as Problem Setters for Competitive Programming

链接https://arxiv.org/abs/2510.12803

作者:Shang Zhou,Zihan Zheng,Kaiyuan Liu,Zeyu Shen,Zerui Cheng,Zexing Chen,Hansen He,Jianzhu Yao,Huanzhi Mao,Qiuyang Mang,Tianfu Fu,Beichen Li,Dongruixuan Li,Wenhao Chai,Zhuang Liu,Aleksandra Korolova,Peter Henderson,Natasha Jaques,Pramod Viswanath,Saining Xie,Jingbo Shang

类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)

关键词:Writing competitive programming, Writing competitive, Writing, Abstract, test

备注: Project page: [this https URL](https://livecodebenchpro.com/projects/autocode/overview)

点击查看摘要

Abstract:Writing competitive programming problems is exacting. Authors must: set constraints, input distributions, and edge cases that rule out shortcuts; target specific algorithms (e.g., max-flow, dynamic programming, data structures); and calibrate complexity beyond the reach of most competitors. We argue that this makes for an ideal test of general large language model capabilities and study whether they can do this reliably. We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases. On held-out problems, AutoCode test suites approach 99% consistency with official judgments, a significant improvement over current state-of-the-art methods like HardTests, which achieve less than 81%. Furthermore, starting with a random seed problem, AutoCode can create novel variants with reference and brute-force solutions. By cross-verifying these generated solutions against test cases, we can further filter out malformed problems. Our system ensures high correctness, as verified by human experts. AutoCode successfully produces novel problems judged by Grandmaster-level (top 0.3%) competitive programmers to be of contest quality.

98. 【2510.13281】wo Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses

链接https://arxiv.org/abs/2510.13281

作者:Sungnyun Kim,Kangwook Jang,Sungwoo Cho,Joon Son Chung,Hoirin Kim,Se-Young Yun

类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:audio-visual speech recognition, modality-specific evidences directly, speech recognition, automatic speech recognition, visual speech recognition

备注: Preprint work

点击查看摘要

Abstract:This paper introduces a new paradigm for generative error correction (GER) framework in audio-visual speech recognition (AVSR) that reasons over modality-specific evidences directly in the language space. Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. To maximize the effectiveness of DualHyp, we further introduce RelPrompt, a noise-aware guidance mechanism that provides modality-grounded prompts to the LLM. RelPrompt offers the temporal reliability of each modality stream, guiding the model to dynamically switch its focus between ASR and VSR hypotheses for an accurate correction. Under various corruption scenarios, our framework attains up to 57.7% error rate gain on the LRS2 benchmark over standard ASR baseline, contrary to single-stream GER approaches that achieve only 10% gain. To facilitate research within our DualHyp framework, we release the code and the dataset comprising ASR and VSR hypotheses at this https URL.

信息检索

1. 【2510.13738】HyMiRec: A Hybrid Multi-interest Learning Framework for LLM-based Sequential Recommendation

链接https://arxiv.org/abs/2510.13738

作者:Jingyi Zhou,Cheng Chen,Kai Zuo,Manjie Xu,Zhendong Fu,Yibo Chen,Xu Tang,Yao Hu

类目:Information Retrieval (cs.IR)

关键词:Large language models, recently demonstrated strong, demonstrated strong potential, Large language, recently demonstrated

备注

点击查看摘要

Abstract:Large language models (LLMs) have recently demonstrated strong potential for sequential recommendation. However, current LLM-based approaches face critical limitations in modeling users' long-term and diverse interests. First, due to inference latency and feature fetching bandwidth constraints, existing methods typically truncate user behavior sequences to include only the most recent interactions, resulting in the loss of valuable long-range preference signals. Second, most current methods rely on next-item prediction with a single predicted embedding, overlooking the multifaceted nature of user interests and limiting recommendation diversity. To address these challenges, we propose HyMiRec, a hybrid multi-interest sequential recommendation framework, which leverages a lightweight recommender to extracts coarse interest embeddings from long user sequences and an LLM-based recommender to captures refined interest embeddings. To alleviate the overhead of fetching features, we introduce a residual codebook based on cosine similarity, enabling efficient compression and reuse of user history embeddings. To model the diverse preferences of users, we design a disentangled multi-interest learning module, which leverages multiple interest queries to learn disentangles multiple interest signals adaptively, allowing the model to capture different facets of user intent. Extensive experiments are conducted on both benchmark datasets and a collected industrial dataset, demonstrating our effectiveness over existing state-of-the-art methods. Furthermore, online A/B testing shows that HyMiRec brings consistent improvements in real-world recommendation systems.

2. 【2510.13590】RAG Meets Temporal Graphs: Time-Sensitive Modeling and Retrieval for Evolving Knowledge

链接https://arxiv.org/abs/2510.13590

作者:Jiale Han,Austin Cheung,Yubai Wei,Zheng Yu,Xusheng Wang,Bing Zhu,Yi Yang

类目:Information Retrieval (cs.IR)

关键词:RAG, temporal, Knowledge, time, current Retrieval-Augmented Generation

备注

点击查看摘要

Abstract:Knowledge is inherently time-sensitive and continuously evolves over time. Although current Retrieval-Augmented Generation (RAG) systems enrich LLMs with external knowledge, they largely ignore this temporal nature. This raises two challenges for RAG. First, current RAG methods lack effective time-aware representations. Same facts of different time are difficult to distinguish with vector embeddings or conventional knowledge graphs. Second, most RAG evaluations assume a static corpus, leaving a blind spot regarding update costs and retrieval stability as knowledge evolves. To make RAG time-aware, we propose Temporal GraphRAG (TG-RAG), which models external corpora as a bi-level temporal graph consisting of a temporal knowledge graph with timestamped relations and a hierarchical time graph. Multi-granularity temporal summaries are generated for each time node to capture both key events and broader trends at that time. The design supports incremental updates by extracting new temporal facts from the incoming corpus and merging them into the existing graph. The temporal graph explicitly represents identical facts at different times as distinct edges to avoid ambiguity, and the time hierarchy graph allows only generating reports for new leaf time nodes and their ancestors, ensuring effective and efficient updates. During inference, TG-RAG dynamically retrieves a subgraph within the temporal and semantic scope of the query, enabling precise evidence gathering. Moreover, we introduce ECT-QA, a time-sensitive question-answering dataset featuring both specific and abstract queries, along with a comprehensive evaluation protocol designed to assess incremental update capabilities of RAG systems. Extensive experiments show that TG-RAG significantly outperforms existing baselines, demonstrating the effectiveness of our method in handling temporal knowledge and incremental updates.

3. 【2510.13371】MADREC: A Multi-Aspect Driven LLM Agent for Explainable and Adaptive Recommendation

链接https://arxiv.org/abs/2510.13371

作者:Jiin Park,Misuk Kim

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:large language models, integrate large language, simple text generation, Driven LLM Agent, Recent attempts

备注: 18 pages

点击查看摘要

Abstract:Recent attempts to integrate large language models (LLMs) into recommender systems have gained momentum, but most remain limited to simple text generation or static prompt-based inference, failing to capture the complexity of user preferences and real-world interactions. This study proposes the Multi-Aspect Driven LLM Agent MADRec, an autonomous LLM-based recommender that constructs user and item profiles by unsupervised extraction of multi-aspect information from reviews and performs direct recommendation, sequential recommendation, and explanation generation. MADRec generates structured profiles via aspect-category-based summarization and applies Re-Ranking to construct high-density inputs. When the ground-truth item is missing from the output, the Self-Feedback mechanism dynamically adjusts the inference criteria. Experiments across multiple domains show that MADRec outperforms traditional and LLM-based baselines in both precision and explainability, with human evaluation further confirming the persuasiveness of the generated explanations.

4. 【2510.13359】Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models

链接https://arxiv.org/abs/2510.13359

作者:Yuki Yada,Sho Akiyama,Ryo Watanabe,Yuta Ueno,Yusuke Shido,Andre Rusli

类目:Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:recommending visually similar, active monthly users, visually similar products, efficiently discover items, large-scale e-commerce platforms

备注: Accepted to ACM RecSys 2025 (Spotlight)

点击查看摘要

Abstract:On large-scale e-commerce platforms with tens of millions of active monthly users, recommending visually similar products is essential for enabling users to efficiently discover items that align with their preferences. This study presents the application of a vision-language model (VLM) -- which has demonstrated strong performance in image recognition and image-text retrieval tasks -- to product recommendations on Mercari, a major consumer-to-consumer marketplace used by more than 20 million monthly users in Japan. Specifically, we fine-tuned SigLIP, a VLM employing a sigmoid-based contrastive loss, using one million product image-title pairs from Mercari collected over a three-month period, and developed an image encoder for generating item embeddings used in the recommendation system. Our evaluation comprised an offline analysis of historical interaction logs and an online A/B test in a production environment. In offline analysis, the model achieved a 9.1% improvement in nDCG@5 compared with the baseline. In the online A/B test, the click-through rate improved by 50% whereas the conversion rate improved by 14% compared with the existing model. These results demonstrate the effectiveness of VLM-based encoders for e-commerce product recommendations and provide practical insights into the development of visual similarity-based recommendation systems.

5. 【2510.13312】ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering

链接https://arxiv.org/abs/2510.13312

作者:Simon Lupart,Mohammad Aliannejadi,Evangelos Kanoulas

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:conversational question answering, reasoning framework based, reinforcement learning, question answering, framework based

备注

点击查看摘要

Abstract:We present ChatR1, a reasoning framework based on reinforcement learning (RL) for conversational question answering (CQA). Reasoning plays an important role in CQA, where user intent evolves across dialogue turns, and utterances are often underspecified, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Unlike static `rewrite, retrieve, and generate' pipelines, ChatR1 interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through RL. To address the challenge of sparse and delayed rewards in RL, we propose an intent-aware reward that provides turn-level feedback by aligning retrieval and reasoning with evolving user goals. Our proposed ChatR1 demonstrates strong performance on both 3B and 7B model backbones, outperforming competitive models on five CQA datasets, measured by different metrics (F1, BERTScore, and LLM-as-judge). We include a diverse set of CQA datasets to cover topic shifts, evolving intents, mixed-initiative dialogues, and multi-document grounding, testing ChatR1's performance from various aspects. Ablation studies confirm the effectiveness of the intent-aware reward. Our analyses further reveal diverse reasoning trajectories and effective use of the search tool. ChatR1 also generalizes robustly across domains, demonstrating that RL-based reasoning enables more flexible and context-sensitive behavior than static CQA pipelines.

6. 【2510.13229】Beyond Static LLM Policies: Imitation-Enhanced Reinforcement Learning for Recommendation

链接https://arxiv.org/abs/2510.13229

作者:Yi Zhang,Lili Xie,Ruihong Qiu,Jiajun Liu,Sen Wang

类目:Information Retrieval (cs.IR)

关键词:diverse digital platforms, enhancing user engagement, delivering personalized content, Recommender systems, digital platforms

备注: ICDM 2025 Accepted Paper

点击查看摘要

Abstract:Recommender systems (RecSys) have become critical tools for enhancing user engagement by delivering personalized content across diverse digital platforms. Recent advancements in large language models (LLMs) demonstrate significant potential for improving RecSys, primarily due to their exceptional generalization capabilities and sophisticated contextual understanding, which facilitate the generation of flexible and interpretable recommendations. However, the direct deployment of LLMs as primary recommendation policies presents notable challenges, including persistent latency issues stemming from frequent API calls and inherent model limitations such as hallucinations and biases. To address these issues, this paper proposes a novel offline reinforcement learning (RL) framework that leverages imitation learning from LLM-generated trajectories. Specifically, inverse reinforcement learning is employed to extract robust reward models from LLM demonstrations. This approach negates the need for LLM fine-tuning, thereby substantially reducing computational overhead. Simultaneously, the RL policy is guided by the cumulative rewards derived from these demonstrations, effectively transferring the semantic insights captured by the LLM. Comprehensive experiments conducted on two benchmark datasets validate the effectiveness of the proposed method, demonstrating superior performance when compared against state-of-the-art RL-based and in-context learning baselines. The code can be found at this https URL.

7. 【2510.13217】LLM-guided Hierarchical Retrieval

链接https://arxiv.org/abs/2510.13217

作者:Nilesh Gupta,Wei-Cheng Chang,Ngot Bui,Cho-Jui Hsieh,Inderjit S. Dhillon

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:require deep reasoning, Modern IR systems, answering complex, multi-faceted queries, systems are increasingly

备注

点击查看摘要

Abstract:Modern IR systems are increasingly tasked with answering complex, multi-faceted queries that require deep reasoning rather than simple keyword or semantic matching. While LLM-based IR has shown great promise, the prevailing retrieve-then-rerank paradigm inherits the limitations of embedding-based retrieval; parametric generative approaches are difficult to update with new information; and long-context methods that place the entire corpus in context are computationally infeasible for large document collections. To address these challenges, we introduce LATTICE, a hierarchical retrieval framework that enables an LLM to reason over and navigate large corpora with logarithmic search complexity by imposing a semantic tree structure on the corpus. Our approach consists of two stages: (1) an offline phase that organizes the corpus into a semantic hierarchy via either a bottom-up agglomerative strategy or a top-down divisive strategy using multi-level summaries and (2) an online traversal phase where a search LLM navigates this tree. A central challenge in such LLM-guided search is that the model's relevance judgments are noisy, context-dependent, and unaware of the hierarchy, making cross-branch and cross-level comparisons difficult. To overcome this, we propose a traversal algorithm that estimates calibrated latent relevance scores from local LLM outputs and aggregates them into a global path relevance metric. Our training-free framework achieves state-of-the-art zero-shot performance on the reasoning-intensive BRIGHT benchmark, demonstrating up to 9% improvement in Recall@100 and 5% in nDCG@10 over the next best zero-shot baseline. Furthermore, compared to the fine-tuned SOTA method DIVER-v2, LATTICE attains comparable results on BRIGHT subsets that use a static corpus for evaluation.

8. 【2510.13193】ReMindRAG: Low-Cost LLM-Guided Knowledge Graph Traversal for Efficient RAG

链接https://arxiv.org/abs/2510.13193

作者:Yikuan Hu,Jifeng Zhu,Lanrui Tang,Chen Huang

类目:Information Retrieval (cs.IR)

关键词:Retrieval Augmented Generation, enhancing Retrieval Augmented, Augmented Generation, Retrieval Augmented, structured representation capabilities

备注

点击查看摘要

Abstract:Knowledge graphs (KGs), with their structured representation capabilities, offer promising avenue for enhancing Retrieval Augmented Generation (RAG) systems, leading to the development of KG-RAG systems. Nevertheless, existing methods often struggle to achieve effective synergy between system effectiveness and cost efficiency, leading to neither unsatisfying performance nor excessive LLM prompt tokens and inference time. To this end, this paper proposes REMINDRAG, which employs an LLM-guided graph traversal featuring node exploration, node exploitation, and, most notably, memory replay, to improve both system effectiveness and cost efficiency. Specifically, REMINDRAG memorizes traversal experience within KG edge embeddings, mirroring the way LLMs "memorize" world knowledge within their parameters, but in a train-free manner. We theoretically and experimentally confirm the effectiveness of REMINDRAG, demonstrating its superiority over existing baselines across various benchmark datasets and LLM backbones. Our code is available at this https URL.

9. 【2510.13095】Retrieval-in-the-Chain: Bootstrapping Large Language Models for Generative Retrieval

链接https://arxiv.org/abs/2510.13095

作者:Yingchen zhang,Ruqing zhang,Jiafeng Guo,Wenjun Peng,Sen Li,Fuyu Lv

类目:Information Retrieval (cs.IR)

关键词:leverages large language, generate document identifiers, autoregressively generate document, large language models, document identifiers

备注

点击查看摘要

Abstract:Generative retrieval (GR) is an emerging paradigm that leverages large language models (LLMs) to autoregressively generate document identifiers (docids) relevant to a given query. Prior works have focused on leveraging the generative capabilities of LLMs to improve GR, while overlooking that their reasoning capabilities could likewise help. This raises a key question: Can explicit reasoning benefit GR? To investigate, we first conduct a preliminary study where an LLM is prompted to generate free-form chain-of-thought (CoT) reasoning before performing constrained docid decoding. Although this method outperforms standard GR, the generated reasoning tends to be verbose and poorly aligned with the docid space. These limitations motivate the development of a reasoning mechanism better tailored to GR. Therefore, we propose Reason-for-Retrieval (R4R), a reasoning-augmented framework for GR that converts free-form CoT reasoning into a compact, structured format, and iteratively refines the reasoning during the retrieval process. R4R augments an existing GR method by leveraging a reasoning-capable LLM that has been instruction-tuned for GR. At inference time, R4R first uses the LLM to generate an initial structured reasoning; then the same LLM alternates between (i) constrained decoding with the chosen GR method to produce candidate docids and (ii) updating the reasoning based on retrieval results to improve the next round. R4R does not require additional models or training, and instead a single LLM serves as both the reasoning generator and the retriever. Extensive experiments on Natural Questions, MS MARCO, and a real-world item-search benchmark validate the effectiveness of R4R.

Subjects:

Information Retrieval (cs.IR)

Cite as:
arXiv:2510.13095 [cs.IR]

(or
arXiv:2510.13095v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2510.13095

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
10. 【2510.12959】Post-hoc Popularity Bias Correction in GNN-based Collaborative Filtering

链接https://arxiv.org/abs/2510.12959

作者:Md Aminul Islam,Elena Zheleva,Ren Wang

类目:Information Retrieval (cs.IR)

关键词:historical interaction data, collaborative filtering, User historical interaction, primary signal, popularity

备注

点击查看摘要

Abstract:User historical interaction data is the primary signal for learning user preferences in collaborative filtering (CF). However, the training data often exhibits a long-tailed distribution, where only a few items have the majority of interactions. CF models trained directly on such imbalanced data are prone to learning popularity bias, which reduces personalization and leads to suboptimal recommendation quality. Graph Neural Networks (GNNs), while effective for CF due to their message passing mechanism, can further propagate and amplify popularity bias through their aggregation process. Existing approaches typically address popularity bias by modifying training objectives but fail to directly counteract the bias propagated during GNN's neighborhood aggregation. Applying weights to interactions during aggregation can help alleviate this problem, yet it risks distorting model learning due to unstable node representations in the early stages of training. In this paper, we propose a Post-hoc Popularity Debiasing (PPD) method that corrects for popularity bias in GNN-based CF and operates directly on pre-trained embeddings without requiring retraining. By estimating interaction-level popularity and removing popularity components from node representations via a popularity direction vector, PPD reduces bias while preserving user preferences. Experimental results show that our method outperforms state-of-the-art approaches for popularity bias correction in GNN-based CF.

11. 【2510.12953】Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation

链接https://arxiv.org/abs/2510.12953

作者:Xiao He,Huangxuan Zhao,Guojia Wan,Wei Zhou,Yanxing Liu,Juhua Liu,Yongchao Xu,Yong Luo,Dacheng Tao,Bo Du

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)

关键词:Recent medical vision-language, Recent medical, anomaly detection, shown promise, promise on tasks

备注

点击查看摘要

Abstract:Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: this https URL.

12. 【2510.12816】Maximum In-Support Return Modeling for Dynamic Recommendation with Language Model Prior

链接https://arxiv.org/abs/2510.12816

作者:Xiaocong Chen,Siyu Wang,Lina Yao

类目:Information Retrieval (cs.IR)

关键词:Reinforcement Learning-based recommender, Reinforcement Learning-based, Learning-based recommender systems, handle sequential recommendation, sequential recommendation tasks

备注: CIKM'25

点击查看摘要

Abstract:Reinforcement Learning-based recommender systems (RLRS) offer an effective way to handle sequential recommendation tasks but often face difficulties in real-world settings, where user feedback data can be sub-optimal or sparse. In this paper, we introduce MDT4Rec, an offline RLRS framework that builds on the Decision Transformer (DT) to address two major challenges: learning from sub-optimal histories and representing complex user-item interactions. First, MDT4Rec shifts the trajectory stitching procedure from the training phase to action inference, allowing the system to shorten its historical context when necessary and thereby ignore negative or unsuccessful past experiences. Second, MDT4Rec initializes DT with a pre-trained large language model (LLM) for knowledge transfer, replaces linear embedding layers with Multi-Layer Perceptrons (MLPs) for more flexible representations, and employs Low-Rank Adaptation (LoRA) to efficiently fine-tune only a small subset of parameters. We evaluate MDT4Rec on five public datasets and in an online simulation environment, demonstrating that it outperforms existing methods.

13. 【2510.12815】Energy-Guided Diffusion Sampling for Long-Term User Behavior Prediction in Reinforcement Learning-based Recommendation

链接https://arxiv.org/abs/2510.12815

作者:Xiaocong Chen,Siyu Wang,Lina Yao

类目:Information Retrieval (cs.IR)

关键词:learning-based recommender systems, Reinforcement learning-based recommender, dynamic user preferences, user preferences, learning-based recommender

备注: CIKM'25

点击查看摘要

Abstract:Reinforcement learning-based recommender systems (RL4RS) have gained attention for their ability to adapt to dynamic user preferences. However, these systems face challenges, particularly in offline settings, where data inefficiency and reliance on pre-collected trajectories limit their broader applicability. While offline reinforcement learning methods leverage extensive datasets to address these issues, they often struggle with noisy data and fail to capture long-term user preferences, resulting in suboptimal recommendation policies. To overcome these limitations, we propose Diffusion-enhanced Actor-Critic for Offline RL4RS (DAC4Rec), a novel framework that integrates diffusion processes with reinforcement learning to model complex user preferences more effectively. DAC4Rec leverages the denoising capabilities of diffusion models to enhance the robustness of offline RL algorithms and incorporates a Q-value-guided policy optimization strategy to better handle suboptimal trajectories. Additionally, we introduce an energy-based sampling strategy to reduce randomness during recommendation generation, ensuring more targeted and reliable outcomes. We validate the effectiveness of DAC4Rec through extensive experiments on six real-world offline datasets and in an online simulation environment, demonstrating its ability to optimize long-term user preferences. Furthermore, we show that the proposed diffusion policy can be seamlessly integrated into other commonly used RL algorithms in RL4RS, highlighting its versatility and wide applicability.

计算机视觉

1. 【2510.13809】PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning

链接https://arxiv.org/abs/2510.13809

作者:Sihui Ji,Xi Chen,Xin Tao,Pengfei Wan,Hengshuang Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:generating visually realistic, visually realistic videos, Video generation models, Video generation, generation models nowadays

备注: Project Page: [this https URL](https://sihuiji.github.io/PhysMaster-Page/)

点击查看摘要

Abstract:Video generation models nowadays are capable of generating visually realistic videos, but often fail to adhere to physical laws, limiting their ability to generate physically plausible videos and serve as ''world models''. To address this issue, we propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness. Specifically, PhysMaster is based on the image-to-video task where the model is expected to predict physically plausible dynamics from the input image. Since the input image provides physical priors like relative positions and potential interactions of objects in the scenario, we devise PhysEncoder to encode physical information from it as an extra condition to inject physical knowledge into the video generation process. The lack of proper supervision on the model's physical performance beyond mere appearance motivates PhysEncoder to apply reinforcement learning with human feedback to physical representation learning, which leverages feedback from generation models to optimize physical representations with Direct Preference Optimization (DPO) in an end-to-end manner. PhysMaster provides a feasible solution for improving physics-awareness of PhysEncoder and thus of video generation, proving its ability on a simple proxy task and generalizability to wide-ranging physical scenarios. This implies that our PhysMaster, which unifies solutions for various physical processes via representation learning in the reinforcement learning paradigm, can act as a generic and plug-in solution for physics-aware video generation and broader applications.

2. 【2510.13808】VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models

链接https://arxiv.org/abs/2510.13808

作者:Dominick Reilly,Manish Kumar Govind,Le Xue,Srijan Das

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Vision-Language Models, substantial distribution shifts, Large Vision-Language, Vision-Language Models, general visual reasoning

备注

点击查看摘要

Abstract:Large Vision-Language Models (VLMs) excel at general visual reasoning tasks but exhibit sharp performance degradation when applied to novel domains with substantial distribution shifts from pretraining data. Existing domain adaptation approaches finetune different VLM components, but this often results in limited domain-specific feature learning or catastrophic forgetting of prior capabilities. To address these issues, we introduce Vision Contextualized Probing (VisCoP), which augments the VLM's vision encoder with a compact set of learnable visual probes. These probes enable efficient domain-specific adaptation with minimal modification to pretrained parameters. We evaluate VisCoP across three challenging domain adaptation settings-cross-view (exocentric to egocentric), cross-modal (RGB to depth), and cross-task (human understanding to robot control). Experiments show that VisCoP consistently outperforms existing adaptation strategies, achieving superior performance on target domains while effectively retaining source-domain knowledge.

3. 【2510.13804】Generative Universal Verifier as Multimodal Meta-Reasoner

链接https://arxiv.org/abs/2510.13804

作者:Xinchen Zhang,Xiaoying Zhang,Youbin Wu,Yanbin Cao,Renrui Zhang,Ruihang Chu,Ling Yang,Yujiu Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:visual verification, introduce Generative Universal, reliable visual verification, providing the fundamental, visual outcomes

备注

点击查看摘要

Abstract:We introduce Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification and achieves notable gains on ViVerBench(+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench(+3.7), and GenEval++(+4.3), outperforming existing parallel test-time scaling methods, such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.

4. 【2510.13802】race Anything: Representing Any Video in 4D via Trajectory Fields

链接https://arxiv.org/abs/2510.13802

作者:Xinhang Liu,Yuxi Xiao,Donny Y. Chen,Jiashi Feng,Yu-Wing Tai,Chi-Keung Tang,Bingyi Kang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Effective spatio-temporal representation, fundamental to modeling, Effective spatio-temporal, Trajectory Field, predicting dynamics

备注

点击查看摘要

Abstract:Effective spatio-temporal representation is fundamental to modeling, understanding, and predicting dynamics in videos. The atomic unit of a video, the pixel, traces a continuous 3D trajectory over time, serving as the primitive element of dynamics. Based on this principle, we propose representing any video as a Trajectory Field: a dense mapping that assigns a continuous 3D trajectory function of time to each pixel in every frame. With this representation, we introduce Trace Anything, a neural network that predicts the entire trajectory field in a single feed-forward pass. Specifically, for each pixel in each frame, our model predicts a set of control points that parameterizes a trajectory (i.e., a B-spline), yielding its 3D position at arbitrary query time instants. We trained the Trace Anything model on large-scale 4D data, including data from our new platform, and our experiments demonstrate that: (i) Trace Anything achieves state-of-the-art performance on our new benchmark for trajectory field estimation and performs competitively on established point-tracking benchmarks; (ii) it offers significant efficiency gains thanks to its one-pass paradigm, without requiring iterative optimization or auxiliary estimators; and (iii) it exhibits emergent abilities, including goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion. Project page: this https URL.

5. 【2510.13800】Reasoning in Space via Grounding in the World

链接https://arxiv.org/abs/2510.13800

作者:Yiming Chen,Zekun Qi,Wenyao Zhang,Xin Jin,Li Zhang,Peidong Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Grounded-Spatial Reasoner, spatial reasoning, reasoning, spatial, Reasoner

备注: 20 pages, 7 figures

点击查看摘要

Abstract:In this paper, we claim that 3D visual grounding is the cornerstone of spatial reasoning and introduce the Grounded-Spatial Reasoner (GS-Reasoner) to explore the effective spatial representations that bridge the gap between them. Existing 3D LLMs suffer from the absence of a unified 3D representation capable of jointly capturing semantic and geometric information. This deficiency is manifested either in poor performance on grounding or in an excessive reliance on external modules, ultimately hindering the seamless integration of grounding and spatial reasoning. To address this, we propose a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation that encapsulates all essential information without increasing the number of input tokens. Leveraging this holistic representation, GS-Reasoner is the first 3D LLM that achieves autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we introduce the Grounded Chain-of-Thought (GCoT) dataset. This dataset is meticulously curated to include both 3D bounding box annotations for objects referenced in reasoning questions and step-by-step reasoning paths that integrate grounding as a core component of the problem-solving process. Extensive experiments demonstrate that GS-Reasoner achieves impressive results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities, leading to state-of-the-art performance.

6. 【2510.13796】he Mechanistic Emergence of Symbol Grounding in Language Models

链接https://arxiv.org/abs/2510.13796

作者:Shuyu Wu,Ziqiao Ma,Xiaoxi Luo,Yidong Huang,Josue Torres-Fonseca,Freda Shi,Joyce Chai

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:real-world sensorimotor experiences, sensorimotor experiences, words acquire, acquire their meanings, meanings by connecting

备注

点击查看摘要

Abstract:Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.

7. 【2510.13795】Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

链接https://arxiv.org/abs/2510.13795

作者:Yi Zhang,Bolin Ni,Xin-Sheng Chen,Heng-Rui Zhang,Yongming Rao,Houwen Peng,Qinglin Lu,Han Hu,Meng-Hao Guo,Shi-Min Hu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:multimodal large language, open multimodal large, large language models, primarily due, supervised fine-tuning

备注: homepage: [this https URL](https://open-bee.github.io/)

点击查看摘要

Abstract:Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.

8. 【2510.13793】NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models

链接https://arxiv.org/abs/2510.13793

作者:Nir Goren,Oren Katzir,Abhinav Nakarmi,Eyal Ronen,Mahmood Sharif,Or Patashnik

类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

关键词:rapid adoption, protecting copyright, proving authorship, seed, diffusion models

备注: code available at: [this https URL](https://github.com/nirgoren/NoisePrints)

点击查看摘要

Abstract:With the rapid adoption of diffusion models for visual content generation, proving authorship and protecting copyright have become critical. This challenge is particularly important when model owners keep their models private and may be unwilling or unable to handle authorship issues, making third-party verification essential. A natural solution is to embed watermarks for later verification. However, existing methods require access to model weights and rely on computationally heavy procedures, rendering them impractical and non-scalable. To address these challenges, we propose , a lightweight watermarking scheme that utilizes the random seed used to initialize the diffusion process as a proof of authorship without modifying the generation process. Our key observation is that the initial noise derived from a seed is highly correlated with the generated visual content. By incorporating a hash function into the noise sampling process, we further ensure that recovering a valid seed from the content is infeasible. We also show that sampling an alternative seed that passes verification is infeasible, and demonstrate the robustness of our method under various manipulations. Finally, we show how to use cryptographic zero-knowledge proofs to prove ownership without revealing the seed. By keeping the seed secret, we increase the difficulty of watermark removal. In our experiments, we validate NoisePrints on multiple state-of-the-art diffusion models for images and videos, demonstrating efficient verification using only the seed and output, without requiring access to model weights.

9. 【2510.13787】Adaptive Visual Conditioning for Semantic Consistency in Diffusion-Based Story Continuation

链接https://arxiv.org/abs/2510.13787

作者:Seyed Mohammad Mousavi,Morteza Analoui

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:ongoing text description, Story continuation focuses, previously observed images, Adaptive Visual Conditioning, focuses on generating

备注

点击查看摘要

Abstract:Story continuation focuses on generating the next image in a narrative sequence so that it remains coherent with both the ongoing text description and the previously observed images. A central challenge in this setting lies in utilizing prior visual context effectively, while ensuring semantic alignment with the current textual input. In this work, we introduce AVC (Adaptive Visual Conditioning), a framework for diffusion-based story continuation. AVC employs the CLIP model to retrieve the most semantically aligned image from previous frames. Crucially, when no sufficiently relevant image is found, AVC adaptively restricts the influence of prior visuals to only the early stages of the diffusion process. This enables the model to exploit visual context when beneficial, while avoiding the injection of misleading or irrelevant information. Furthermore, we improve data quality by re-captioning a noisy dataset using large language models, thereby strengthening textual supervision and semantic alignment. Quantitative results and human evaluations demonstrate that AVC achieves superior coherence, semantic consistency, and visual fidelity compared to strong baselines, particularly in challenging cases where prior visuals conflict with the current input.

10. 【2510.13778】InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

链接https://arxiv.org/abs/2510.13778

作者:Xinyi Chen,Yilun Chen,Yanwei Fu,Ning Gao,Jiaya Jia,Weiyang Jin,Hao Li,Yao Mu,Jiangmiao Pang,Yu Qiao,Yang Tian,Bin Wang,Bolun Wang,Fangjing Wang,Hanqing Wang,Tai Wang,Ziqin Wang,Xueyuan Wei,Chao Wu,Shuai Yang,Jinhui Ye,Junqiu Yu,Jia Zeng,Jingjing Zhang,Jinyu Zhang,Shi Zhang,Feng Zheng,Bowen Zhou,Yangkun Zhu

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:advances instruction-following robots, general-purpose intelligence, spatially guided, spatially guided training, spatial grounding

备注: Technical report

点击查看摘要

Abstract:We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine ``where to act'' by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide ``how to act'' by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots. Code and models are available at this https URL.

11. 【2510.13774】UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations

链接https://arxiv.org/abs/2510.13774

作者:Dominik J. Mühlematter,Lin Che,Ye Hong,Martin Raubal,Nina Wiedemann

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Forecasting urban phenomena, public health indicators, health indicators requires, Forecasting urban, urban phenomena

备注

点击查看摘要

Abstract:Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent foundation models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a Geo-Foundation Model (GeoFM) that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POIs) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior foundation models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios. All source code is available at this https URL.

12. 【2510.13768】Scaling Vision Transformers for Functional MRI with Flat Maps

链接https://arxiv.org/abs/2510.13768

作者:Connor Lane,Daniel Z. Kaplan,Tanishq Mathew Abraham,Paul S. Scotti

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)

关键词:adapting modern deep, modern deep learning, deep learning architectures, functional MRI, key question

备注: NeurIPS 2025 Workshop, Foundation Models for the Brain and Body; Code: [this https URL](https://github.com/MedARC-AI/fmri-fm;) Discord: [this https URL](https://discord.gg/tVR4TWnRM9)

点击查看摘要

Abstract:A key question for adapting modern deep learning architectures to functional MRI (fMRI) is how to represent the data for model input. To bridge the modality gap between fMRI and natural images, we transform the 4D volumetric fMRI data into videos of 2D fMRI activity flat maps. We train Vision Transformers on 2.3K hours of fMRI flat map videos from the Human Connectome Project using the spatiotemporal masked autoencoder (MAE) framework. We observe that masked fMRI modeling performance improves with dataset size according to a strict power scaling law. Downstream classification benchmarks show that our model learns rich representations supporting both fine-grained state decoding across subjects, as well as subject-specific trait decoding across changes in brain state. This work is part of an ongoing open science project to build foundation models for fMRI data. Our code and datasets are available at this https URL.

13. 【2510.13759】Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark

链接https://arxiv.org/abs/2510.13759

作者:Kai Zou,Ziqi Huang,Yuhao Dong,Shulin Tian,Dian Zheng,Hongbo Liu,Jingwen He,Bin Liu,Yu Qiao,Ziwei Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:current benchmarks rarely, benchmarks rarely examine, jointly enable visual, multimodal models aim, true integration

备注: Equal contributions from frst three authors. Project page: [this https URL](https://vchitect.github.io/Uni-MMMU-Project/) Code: [this https URL](https://github.com/vchitect/Uni-MMMU)

点击查看摘要

Abstract:Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. Existing evaluations either treat the two abilities in isolation or overlook tasks that inherently couple them. To address this gap, we present Uni-MMMU, a comprehensive and discipline-aware benchmark that systematically unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains, including science, coding, mathematics, and puzzles. Each task is bidirectionally coupled, demanding models to (i) leverage conceptual understanding to guide precise visual synthesis, or (ii) utilize generation as a cognitive scaffold for analytical reasoning. Uni-MMMU incorporates verifiable intermediate reasoning steps, unique ground truths, and a reproducible scoring protocol for both textual and visual outputs. Through extensive evaluation of state-of-the-art unified, generation-only, and understanding-only models, we reveal substantial performance disparities and cross-modal dependencies, offering new insights into when and how these abilities reinforce one another, and establishing a reliable foundation for advancing unified models.

14. 【2510.13756】RECODE: Reasoning Through Code Generation for Visual Question Answering

链接https://arxiv.org/abs/2510.13756

作者:Junhong Shen,Mu Cai,Bo Hu,Ameet Talwalkar,David A Ross,Cordelia Schmid,Alireza Fathi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, Multimodal Large Language, Language Models, Large Language, Multimodal Large

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Specifically, we propose RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image. It then uses a critic to select the most faithful reconstruction and iteratively refines the code. This process not only transforms an ambiguous perceptual task into a verifiable, symbolic problem, but also enables precise calculations and logical inferences later on. On various visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not leverage code or only use code for drawing auxiliary lines or cropping. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.

15. 【2510.13747】InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue

链接https://arxiv.org/abs/2510.13747

作者:Wenwen Tong,Hewei Guo,Dongchuan Ran,Jiangnan Chen,Jiefan Lu,Kaibin Wang,Keqiang Li,Xiaoxu Zhu,Jiakui Li,Kehan Li,Xueheng Li,Lumin Li,Chenxu Guo,Jiasheng Zhou,Jiandong Chen,Xianye Wu,Jiahao Wang,Silei Wu,Lei Chen,Hanming Deng,Yuxuan Song,Dinghao Zhou,Guiping Zhong,Ken Zheng,Shiyin Kang,Lewei Lu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:offering comprehensive omni-modal, large language model, comprehensive omni-modal understanding, omni-modal large language, large language

备注

点击查看摘要

Abstract:We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction, ranging from 4B to 8B parameters, designed to lead the field of lightweight models by offering comprehensive omni-modal understanding and speech generation capabilities. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding, followed by post-training with speech conversation and audio-visual interaction. To enable human-like long-term conversational ability, we meticulously curate a multi-turn training dataset that enhances the model's ability to handle complex and multi-turn interactions. To effectively evaluate the multi-turn memory and speech interaction capabilities, we construct the multi-modal multi-turn memory benchmark and the multi-turn speech interaction benchmark. Experiments demonstrate that InteractiveOmni significantly outperforms leading open-source models and provides a more intelligent multi-turn audio-visual experience, particularly in its long-term memory capabilities. Notably, InteractiveOmni-4B is comparable to the much larger model like Qwen2.5-Omni-7B on general benchmarks, and it can retain 97% of the performance of the InteractiveOmni-8B while utilizing only 50% of the model size. Achieving state-of-the-art results against similarly sized models across image, audio, video understanding, and speech generation tasks, InteractiveOmni is an accessible, open-source foundation for next-generation intelligent interactive systems.

16. 【2510.13745】UniCalli: A Unified Diffusion Framework for Column-Level Generation and Recognition of Chinese Calligraphy

链接https://arxiv.org/abs/2510.13745

作者:Tianshuo Xu,Kai Wang,Zhifei Chen,Leyi Wu,Tianshui Wen,Fei Chao,Ying-Cong Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Chinese calligraphy remains, calligraphy remains challenging, Computational replication, replication of Chinese, Chinese calligraphy

备注: 22 pages

点击查看摘要

Abstract:Computational replication of Chinese calligraphy remains challenging. Existing methods falter, either creating high-quality isolated characters while ignoring page-level aesthetics like ligatures and spacing, or attempting page synthesis at the expense of calligraphic correctness. We introduce \textbf{UniCalli}, a unified diffusion framework for column-level recognition and generation. Training both tasks jointly is deliberate: recognition constrains the generator to preserve character structure, while generation provides style and layout priors. This synergy fosters concept-level abstractions that improve both tasks, especially in limited-data regimes. We curated a dataset of over 8,000 digitized pieces, with ~4,000 densely annotated. UniCalli employs asymmetric noising and a rasterized box map for spatial priors, trained on a mix of synthetic, labeled, and unlabeled data. The model achieves state-of-the-art generative quality with superior ligature continuity and layout fidelity, alongside stronger recognition. The framework successfully extends to other ancient scripts, including Oracle bone inscriptions and Egyptian hieroglyphs. Code and data can be viewed in \href{this https URL}{this URL}.

17. 【2510.13740】Multi-Scale High-Resolution Logarithmic Grapher Module for Efficient Vision GNNs

链接https://arxiv.org/abs/2510.13740

作者:Mustafa Munir,Alex Zhang,Radu Marculescu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:convolutional neural nets, conventional convolutional neural, Vision graph neural, common graph construction, Vision Graph Attention

备注: Published in the Proceedings of the Third Learning on Graphs Conference (LoG 2024)

点击查看摘要

Abstract:Vision graph neural networks (ViG) have demonstrated promise in vision tasks as a competitive alternative to conventional convolutional neural nets (CNN) and transformers (ViTs); however, common graph construction methods, such as k-nearest neighbor (KNN), can be expensive on larger images. While methods such as Sparse Vision Graph Attention (SVGA) have shown promise, SVGA's fixed step scale can lead to over-squashing and missing multiple connections to gain the same information that could be gained from a long-range link. Through this observation, we propose a new graph construction method, Logarithmic Scalable Graph Construction (LSGC) to enhance performance by limiting the number of long-range links. To this end, we propose LogViG, a novel hybrid CNN-GNN model that utilizes LSGC. Furthermore, inspired by the successes of multi-scale and high-resolution architectures, we introduce and apply a high-resolution branch and fuse features between our high-resolution and low-resolution branches for a multi-scale high-resolution Vision GNN network. Extensive experiments show that LogViG beats existing ViG, CNN, and ViT architectures in terms of accuracy, GMACs, and parameters on image classification and semantic segmentation tasks. Our smallest model, Ti-LogViG, achieves an average top-1 accuracy on ImageNet-1K of 79.9% with a standard deviation of 0.2%, 1.7% higher average accuracy than Vision GNN with a 24.3% reduction in parameters and 35.3% reduction in GMACs. Our work shows that leveraging long-range links in graph construction for ViGs through our proposed LSGC can exceed the performance of current state-of-the-art ViGs. Code is available at this https URL.

18. 【2510.13735】Cyclic Self-Supervised Diffusion for Ultra Low-field to High-field MRI Synthesis

链接https://arxiv.org/abs/2510.13735

作者:Zhenxuan Zhang,Peiyuan Jing,Zi Wang,Ula Briski,Coraline Beitone,Yue Yang,Yinzhe Wu,Fanwen Wang,Liutao Yang,Jiahao Huang,Zhifan Gao,Zhaolin Chen,Kh Tohidul Islam,Guang Yang,Peter J. Lally

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:holds significant potential, MRI holds significant, low-field MRI holds, low-field MRI, significant potential

备注

点击查看摘要

Abstract:Synthesizing high-quality images from low-field MRI holds significant potential. Low-field MRI is cheaper, more accessible, and safer, but suffers from low resolution and poor signal-to-noise ratio. This synthesis process can reduce reliance on costly acquisitions and expand data availability. However, synthesizing high-field MRI still suffers from a clinical fidelity gap. There is a need to preserve anatomical fidelity, enhance fine-grained structural details, and bridge domain gaps in image contrast. To address these issues, we propose a \emph{cyclic self-supervised diffusion (CSS-Diff)} framework for high-field MRI synthesis from real low-field MRI data. Our core idea is to reformulate diffusion-based synthesis under a cycle-consistent constraint. It enforces anatomical preservation throughout the generative process rather than just relying on paired pixel-level supervision. The CSS-Diff framework further incorporates two novel processes. The slice-wise gap perception network aligns inter-slice inconsistencies via contrastive learning. The local structure correction network enhances local feature restoration through self-reconstruction of masked and perturbed patches. Extensive experiments on cross-field synthesis tasks demonstrate the effectiveness of our method, achieving state-of-the-art performance (e.g., 31.80 $\pm$ 2.70 dB in PSNR, 0.943 $\pm$ 0.102 in SSIM, and 0.0864 $\pm$ 0.0689 in LPIPS). Beyond pixel-wise fidelity, our method also preserves fine-grained anatomical structures compared with the original low-field MRI (e.g., left cerebral white matter error drops from 12.1$\%$ to 2.1$\%$, cortex from 4.2$\%$ to 3.7$\%$). To conclude, our CSS-Diff can synthesize images that are both quantitatively reliable and anatomically consistent.

19. 【2510.13729】LiFMCR: Dataset and Benchmark for Light Field Multi-Camera Registration

链接https://arxiv.org/abs/2510.13729

作者:Aymeric Fleith,Julian Zirbel,Daniel Cremers,Niclas Zeller

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:micro lens array, multiple micro lens, based light field, light field, lens array

备注: Accepted at the International Symposium on Visual Computing (ISVC) 2025

点击查看摘要

Abstract:We present LiFMCR, a novel dataset for the registration of multiple micro lens array (MLA)-based light field cameras. While existing light field datasets are limited to single-camera setups and typically lack external ground truth, LiFMCR provides synchronized image sequences from two high-resolution Raytrix R32 plenoptic cameras, together with high-precision 6-degrees of freedom (DoF) poses recorded by a Vicon motion capture system. This unique combination enables rigorous evaluation of multi-camera light field registration methods. As a baseline, we provide two complementary registration approaches: a robust 3D transformation estimation via a RANSAC-based method using cross-view point clouds, and a plenoptic PnP algorithm estimating extrinsic 6-DoF poses from single light field images. Both explicitly integrate the plenoptic camera model, enabling accurate and scalable multi-camera registration. Experiments show strong alignment with the ground truth, supporting reliable multi-view light field processing. Project page: this https URL

Comments:
Accepted at the International Symposium on Visual Computing (ISVC) 2025

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2510.13729 [cs.CV]

(or
arXiv:2510.13729v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.13729

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
20. 【2510.13721】NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

链接https://arxiv.org/abs/2510.13721

作者:Run Luo,Xiaobo Xia,Lu Wang,Longze Chen,Renke Shan,Jing Luo,Min Yang,Tat-Seng Chua

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:general intelligence systems, artificial general intelligence, Next-generation multimodal foundation, intelligence systems, playing a pivotal

备注

点击查看摘要

Abstract:Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.

21. 【2510.13720】Circle of Willis Centerline Graphs: A Dataset and Baseline Algorithm

链接https://arxiv.org/abs/2510.13720

作者:Fabio Musio,Norman Juchler,Kaiyuan Yang,Suprosanna Shit,Chinmay Prabhakar,Bjoern Menze,Sven Hirsch

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Circle of Willis, cerebrovascular pathologies, critical network, network of arteries, implicated in cerebrovascular

备注

点击查看摘要

Abstract:The Circle of Willis (CoW) is a critical network of arteries in the brain, often implicated in cerebrovascular pathologies. Voxel-level segmentation is an important first step toward an automated CoW assessment, but a full quantitative analysis requires centerline representations. However, conventional skeletonization techniques often struggle to extract reliable centerlines due to the CoW's complex geometry, and publicly available centerline datasets remain scarce. To address these challenges, we used a thinning-based skeletonization algorithm to extract and curate centerline graphs and morphometric features from the TopCoW dataset, which includes 200 stroke patients, each imaged with MRA and CTA. The curated graphs were used to develop a baseline algorithm for centerline and feature extraction, combining U-Net-based skeletonization with A* graph connection. Performance was evaluated on a held-out test set, focusing on anatomical accuracy and feature robustness. Further, we used the extracted features to predict the frequency of fetal PCA variants, confirm theoretical bifurcation optimality relations, and detect subtle modality differences. The baseline algorithm consistently reconstructed graph topology with high accuracy (F1 = 1), and the average Euclidean node distance between reference and predicted graphs was below one voxel. Features such as segment radius, length, and bifurcation ratios showed strong robustness, with median relative errors below 5% and Pearson correlations above 0.95. Our results demonstrate the utility of learning-based skeletonization combined with graph connection for anatomically plausible centerline extraction. We emphasize the importance of going beyond simple voxel-based measures by evaluating anatomical accuracy and feature robustness. The dataset and baseline algorithm have been released to support further method development and clinical research.

22. 【2510.13702】MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion

链接https://arxiv.org/abs/2510.13702

作者:Minjung Shin,Hyunin Cho,Sooyeon Go,Jin-Hwa Kim,Youngjung Uh

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:achieving controllable generative, controllable generative models, camera pose control, existing multi-view generation, multi-view generation models

备注: Project page: [this https URL](https://minjung-s.github.io/mvcustom)

点击查看摘要

Abstract:Multi-view generation with camera pose control and prompt-based customization are both essential elements for achieving controllable generative models. However, existing multi-view generation models do not support customization with geometric consistency, whereas customization models lack explicit viewpoint control, making them challenging to unify. Motivated by these gaps, we introduce a novel task, multi-view customization, which aims to jointly achieve multi-view camera pose control and customization. Due to the scarcity of training data in customization, existing multi-view generation models, which inherently rely on large-scale datasets, struggle to generalize to diverse prompts. To address this, we propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity. In the training stage, MVCustom learns the subject's identity and geometry using a feature-field representation, incorporating the text-to-video diffusion backbone enhanced with dense spatio-temporal attention, which leverages temporal coherence for multi-view consistency. In the inference stage, we introduce two novel techniques: depth-aware feature rendering explicitly enforces geometric consistency, and consistent-aware latent completion ensures accurate perspective alignment of the customized subject and surrounding backgrounds. Extensive experiments demonstrate that MVCustom is the only framework that simultaneously achieves faithful multi-view generation and customization.

23. 【2510.13698】Risk-adaptive Activation Steering for Safe Multimodal Large Language Models

链接https://arxiv.org/abs/2510.13698

作者:Jonghyun Park,Minhyuk Seo,Jonghyun Choi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:key challenges, challenges of modern, refusing malicious, provide helpful responses, iterative output adjustments

备注

点击查看摘要

Abstract:One of the key challenges of modern AI models is ensuring that they provide helpful responses to benign queries while refusing malicious ones. But often, the models are vulnerable to multimodal queries with harmful intent embedded in images. One approach for safety alignment is training with extensive safety datasets at the significant costs in both dataset curation and training. Inference-time alignment mitigates these costs, but introduces two drawbacks: excessive refusals from misclassified benign queries and slower inference speed due to iterative output adjustments. To overcome these limitations, we propose to reformulate queries to strengthen cross-modal attention to safety-critical image regions, enabling accurate risk assessment at the query level. Using the assessed risk, it adaptively steers activations to generate responses that are safe and helpful without overhead from iterative output adjustments. We call this Risk-adaptive Activation Steering (RAS). Extensive experiments across multiple benchmarks on multimodal safety and utility demonstrate that the RAS significantly reduces attack success rates, preserves general task performance, and improves inference speed over prior inference-time defenses.

24. 【2510.13684】Generating healthy counterfactuals with denoising diffusion bridge models

链接https://arxiv.org/abs/2510.13684

作者:Ana Lawry Aguila,Peirong Liu,Marina Crespo Aguirre,Juan Eugenio Iglesias

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:holds significant promise, Generating healthy counterfactuals, images holds significant, medical imaging, pathological images holds

备注

点击查看摘要

Abstract:Generating healthy counterfactuals from pathological images holds significant promise in medical imaging, e.g., in anomaly detection or for application of analysis tools that are designed for healthy scans. These counterfactuals should represent what a patient's scan would plausibly look like in the absence of pathology, preserving individual anatomical characteristics while modifying only the pathological regions. Denoising diffusion probabilistic models (DDPMs) have become popular methods for generating healthy counterfactuals of pathology data. Typically, this involves training on solely healthy data with the assumption that a partial denoising process will be unable to model disease regions and will instead reconstruct a closely matched healthy counterpart. More recent methods have incorporated synthetic pathological images to better guide the diffusion process. However, it remains challenging to guide the generative process in a way that effectively balances the removal of anomalies with the retention of subject-specific features. To solve this problem, we propose a novel application of denoising diffusion bridge models (DDBMs) - which, unlike DDPMs, condition the diffusion process not only on the initial point (i.e., the healthy image), but also on the final point (i.e., a corresponding synthetically generated pathological image). Treating the pathological image as a structurally informative prior enables us to generate counterfactuals that closely match the patient's anatomy while selectively removing pathology. The results show that our DDBM outperforms previously proposed diffusion models and fully supervised approaches at segmentation and anomaly detection tasks.

25. 【2510.13678】FlashWorld: High-quality 3D Scene Generation within Seconds

链接https://arxiv.org/abs/2510.13678

作者:Xinyang Li,Tengfei Wang,Zixiao Gu,Shengchuan Zhang,Chunchao Guo,Liujuan Cao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:possessing superior rendering, superior rendering quality, faster than previous, previous works, works while possessing

备注: Project Page: [this https URL](https://imlixinyang.github.io/FlashWorld-Project-Page/)

点击查看摘要

Abstract:We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, 10~100$\times$ faster than previous works while possessing superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach where the model directly produces 3D Gaussian representations during multi-view generation. While ensuring 3D consistency, 3D-oriented method typically suffers poor visual quality. FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model, which jointly supports MV-oriented and 3D-oriented generation modes. To bridge the quality gap in 3D-oriented generation, we further propose a cross-mode post-training distillation by matching distribution from consistent 3D-oriented mode to high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the required denoising steps for inference. Also, we propose a strategy to leverage massive single-view images and text prompts during this process to enhance the model's generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method.

26. 【2510.13675】Seeing and Knowing in the Wild: Open-domain Visual Entity Recognition with Large-scale Knowledge Graphs via Contrastive Learning

链接https://arxiv.org/abs/2510.13675

作者:Hongkuan Zhou,Lavdim Halilaj,Sebastian Monka,Stefan Schmid,Yuqicheng Zhu,Jingcheng Wu,Nadeem Nazer,Steffen Staab

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:link entities depicted, real-world concepts, aims to identify, identify and link, vast and evolving

备注

点击查看摘要

Abstract:Open-domain visual entity recognition aims to identify and link entities depicted in images to a vast and evolving set of real-world concepts, such as those found in Wikidata. Unlike conventional classification tasks with fixed label sets, it operates under open-set conditions, where most target entities are unseen during training and exhibit long-tail distributions. This makes the task inherently challenging due to limited supervision, high visual ambiguity, and the need for semantic disambiguation. In this work, we propose a Knowledge-guided Contrastive Learning (KnowCoL) framework that combines both images and text descriptions into a shared semantic space grounded by structured information from Wikidata. By abstracting visual and textual inputs to a conceptual level, the model leverages entity descriptions, type hierarchies, and relational context to support zero-shot entity recognition. We evaluate our approach on the OVEN benchmark, a large-scale open-domain visual recognition dataset with Wikidata IDs as the label space. Our experiments show that using visual, textual, and structured knowledge greatly improves accuracy, especially for rare and unseen entities. Our smallest model improves the accuracy on unseen entities by 10.5% compared to the state-of-the-art, despite being 35 times smaller.

27. 【2510.13670】NTIRE 2025 Challenge on Low Light Image Enhancement: Methods and Results

链接https://arxiv.org/abs/2510.13670

作者:Xiaoning Liu,Zongwei Wu,Florin-Alexandru Vasluianu,Hailong Yan,Bin Ren,Yulun Zhang,Shuhang Gu,Le Zhang,Ce Zhu,Radu Timofte,Kangbiao Shi,Yixu Feng,Tao Hu,Yu Cao,Peng Wu,Yijin Liang,Yanning Zhang,Qingsen Yan,Han Zhou,Wei Dong,Yan Min,Mohab Kishawy,Jun Chen,Pengpeng Yu,Anjin Park,Seung-Soo Lee,Young-Joon Park,Zixiao Hu,Junyv Liu,Huilin Zhang,Jun Zhang,Fei Wan,Bingxin Xu,Hongzhe Liu,Cheng Xu,Weiguo Pan,Songyin Dai,Xunpeng Yi,Qinglong Yan,Yibing Zhang,Jiayi Ma,Changhui Hu,Kerui Hu,Donghang Jing,Tiesheng Chen,Zhi Jin,Hongjun Wu,Biao Huang,Haitao Ling,Jiahao Wu,Dandan Zhan,G Gyaneshwar Rao,Vijayalaxmi Ashok Aralikatti,Nikhil Akalwadi,Ramesh Ashok Tabib,Uma Mudenagudi,Ruirui Lin,Guoxi Huang,Nantheera Anantrasirichai,Qirui Yang,Alexandru Brateanu,Ciprian Orhei,Cosmin Ancuti,Daniel Feijoo,Juan C. Benito,Álvaro García,Marcos V. Conde,Yang Qin,Raul Balmez,Anas M. Ali,Bilel Benjdira,Wadii Boulila,Tianyi Mao,Huan Zheng,Yanyan Wei,Shengeng Tang,Dan Guo,Zhao Zhang,Sabari Nathan,K Uma,A Sasithradevi,B Sathya Bama,S. Mohamed Mansoor Roomi,Ao Li,Xiangtao Zhang,Zhe Liu,Yijie Tang,Jialong Tang,Zhicheng Fu,Gong Chen,Joe Nasti,John Nicholson,Zeyu Xiao,Zhuoyuan Li,Ashutosh Kulkarni,Prashant W. Patil,Santosh Kumar Vipparthi,Subrahmanyam Murala,Duan Liu,Weile Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Low-Light Image Enhancement, Image Enhancement, highlighting the proposed, final outcomes, Low-Light Image

备注: CVPR NTIRE 2025 Workshop, please refer to [this https URL](https://openaccess.thecvf.com/CVPR2025_workshops/NTIRE)

点击查看摘要

Abstract:This paper presents a comprehensive review of the NTIRE 2025 Low-Light Image Enhancement (LLIE) Challenge, highlighting the proposed solutions and final outcomes. The objective of the challenge is to identify effective networks capable of producing brighter, clearer, and visually compelling images under diverse and challenging conditions. A remarkable total of 762 participants registered for the competition, with 28 teams ultimately submitting valid entries. This paper thoroughly evaluates the state-of-the-art advancements in LLIE, showcasing the significant progress.

28. 【2510.13669】CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas

链接https://arxiv.org/abs/2510.13669

作者:Zian Li,Muhan Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:video MAR models, MAR models suffer, video MAR, combining the flexibility, continuous tokenizer

备注

点击查看摘要

Abstract:Masked autoregressive models (MAR) have recently emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the potential of continuous tokenizer. However, video MAR models suffer from two major limitations: the slow-start problem, caused by the lack of a structured global prior at early sampling stages, and error accumulation across the autoregression in both spatial and temporal dimensions. In this work, we propose CanvasMAR, a novel video MAR model that mitigates these issues by introducing a canvas mechanism--a blurred, global prediction of the next frame, used as the starting point for masked generation. The canvas provides global structure early in sampling, enabling faster and more coherent frame synthesis. Furthermore, we introduce compositional classifier-free guidance that jointly enlarges spatial (canvas) and temporal conditioning, and employ noise-based canvas augmentation to enhance robustness. Experiments on the BAIR and Kinetics-600 benchmarks demonstrate that CanvasMAR produces high-quality videos with fewer autoregressive steps. Our approach achieves remarkable performance among autoregressive models on Kinetics-600 dataset and rivals diffusion-based methods.

29. 【2510.13660】OmniGaze: Reward-inspired Generalizable Gaze Estimation In The Wild

链接https://arxiv.org/abs/2510.13660

作者:Hongyu Qu,Jianan Wei,Xiangbo Shu,Yazhou Yao,Wenguan Wang,Jinhui Tang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:estimation methods struggle, gaze estimation methods, gaze estimation, primarily due, methods struggle

备注: Accepted to NeurIPS 2025; Project page: \url{ [this https URL](https://github.com/quhongyu/OmniGaze) }

点击查看摘要

Abstract:Current 3D gaze estimation methods struggle to generalize across diverse data domains, primarily due to i) the scarcity of annotated datasets, and ii) the insufficient diversity of labeled data. In this work, we present OmniGaze, a semi-supervised framework for 3D gaze estimation, which utilizes large-scale unlabeled data collected from diverse and unconstrained real-world environments to mitigate domain bias and generalize gaze estimation in the wild. First, we build a diverse collection of unlabeled facial images, varying in facial appearances, background environments, illumination conditions, head poses, and eye occlusions. In order to leverage unlabeled data spanning a broader distribution, OmniGaze adopts a standard pseudo-labeling strategy and devises a reward model to assess the reliability of pseudo labels. Beyond pseudo labels as 3D direction vectors, the reward model also incorporates visual embeddings extracted by an off-the-shelf visual encoder and semantic cues from gaze perspective generated by prompting a Multimodal Large Language Model to compute confidence scores. Then, these scores are utilized to select high-quality pseudo labels and weight them for loss computation. Extensive experiments demonstrate that OmniGaze achieves state-of-the-art performance on five datasets under both in-domain and cross-domain settings. Furthermore, we also evaluate the efficacy of OmniGaze as a scalable data engine for gaze estimation, which exhibits robust zero-shot generalization on four unseen datasets.

30. 【2510.13652】EditCast3D: Single-Frame-Guided 3D Editing with Video Propagation and View Selection

链接https://arxiv.org/abs/2510.13652

作者:Huaizhi Qu,Ruichen Zhang,Shuqing Luo,Luchao Qi,Zhihao Zhang,Xiaoming Liu,Roni Sengupta,Tianlong Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:driven remarkable progress, Recent advances, foundation models, editing, driven remarkable

备注

点击查看摘要

Abstract:Recent advances in foundation models have driven remarkable progress in image editing, yet their extension to 3D editing remains underexplored. A natural approach is to replace the image editing modules in existing workflows with foundation models. However, their heavy computational demands and the restrictions and costs of closed-source APIs make plugging these models into existing iterative editing strategies impractical. To address this limitation, we propose EditCast3D, a pipeline that employs video generation foundation models to propagate edits from a single first frame across the entire dataset prior to reconstruction. While editing propagation enables dataset-level editing via video models, its consistency remains suboptimal for 3D reconstruction, where multi-view alignment is essential. To overcome this, EditCast3D introduces a view selection strategy that explicitly identifies consistent and reconstruction-friendly views and adopts feedforward reconstruction without requiring costly refinement. In combination, the pipeline both minimizes reliance on expensive image editing and mitigates prompt ambiguities that arise when applying foundation models independently across images. We evaluate EditCast3D on commonly used 3D editing datasets and compare it against state-of-the-art 3D editing baselines, demonstrating superior editing quality and high efficiency. These results establish EditCast3D as a scalable and general paradigm for integrating foundation models into 3D editing pipelines. The code is available at this https URL

31. 【2510.13649】Local-Global Context-Aware and Structure-Preserving Image Super-Resolution

链接https://arxiv.org/abs/2510.13649

作者:Sanchar Palit,Subhasis Chaudhuri,Biplab Banerjee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recently achieved significant, achieved significant success, perceptual quality enhancement, image manipulation tasks, Stable Diffusion

备注: 10 pages, 11 figures

点击查看摘要

Abstract:Diffusion models have recently achieved significant success in various image manipulation tasks, including image super-resolution and perceptual quality enhancement. Pretrained text-to-image models, such as Stable Diffusion, have exhibited strong capabilities in synthesizing realistic image content, which makes them particularly attractive for addressing super-resolution tasks. While some existing approaches leverage these models to achieve state-of-the-art results, they often struggle when applied to diverse and highly degraded images, leading to noise amplification or incorrect content generation. To address these limitations, we propose a contextually precise image super-resolution framework that effectively maintains both local and global pixel relationships through Local-Global Context-Aware Attention, enabling the generation of high-quality images. Furthermore, we propose a distribution- and perceptual-aligned conditioning mechanism in the pixel space to enhance perceptual fidelity. This mechanism captures fine-grained pixel-level representations while progressively preserving and refining structural information, transitioning from local content details to the global structural composition. During inference, our method generates high-quality images that are structurally consistent with the original content, mitigating artifacts and ensuring realistic detail restoration. Extensive experiments on multiple super-resolution benchmarks demonstrate the effectiveness of our approach in producing high-fidelity, perceptually accurate reconstructions.

32. 【2510.13643】owards Adversarial Robustness and Uncertainty Quantification in DINOv2-based Few-Shot Anomaly Detection

链接https://arxiv.org/abs/2510.13643

作者:Akib Mohammed Khan,Bartosz Krawczyk

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:questions remain unexamined, key questions remain, Foundation models, shown strong performance, remain unexamined

备注: 10 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Foundation models such as DINOv2 have shown strong performance in few-shot anomaly detection, yet two key questions remain unexamined: (i) how susceptible are these detectors to adversarial perturbations; and (ii) how well do their anomaly scores reflect calibrated uncertainty? Building on AnomalyDINO, a training-free deep nearest-neighbor detector over DINOv2 features, we present one of the first systematic studies of adversarial attacks and uncertainty estimation in this setting. To enable white-box gradient attacks while preserving test-time behavior, we attach a lightweight linear head to frozen DINOv2 features only for crafting perturbations. Using this heuristic, we evaluate the impact of FGSM across the MVTec-AD and VisA datasets and observe consistent drops in F1, AUROC, AP, and G-mean, indicating that imperceptible perturbations can flip nearest-neighbor relations in feature space to induce confident misclassification. Complementing robustness, we probe reliability and find that raw anomaly scores are poorly calibrated, revealing a gap between confidence and correctness that limits safety-critical use. As a simple, strong baseline toward trustworthiness, we apply post-hoc Platt scaling to the anomaly scores for uncertainty estimation. The resulting calibrated posteriors yield significantly higher predictive entropy on adversarially perturbed inputs than on clean ones, enabling a practical flagging mechanism for attack detection while reducing calibration error (ECE). Our findings surface concrete vulnerabilities in DINOv2-based few-shot anomaly detectors and establish an evaluation protocol and baseline for robust, uncertainty-aware anomaly detection. We argue that adversarial robustness and principled uncertainty quantification are not optional add-ons but essential capabilities if anomaly detection systems are to be trustworthy and ready for real-world deployment.

33. 【2510.13638】Challenges, Advances, and Evaluation Metrics in Medical Image Enhancement: A Systematic Literature Review

链接https://arxiv.org/abs/2510.13638

作者:Chun Wai Chin,Haniza Yazid,Hoi Leong Lee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:ultimately supporting early, supporting early detection, effective treatment planning, Medical image enhancement, Medical image

备注

点击查看摘要

Abstract:Medical image enhancement is crucial for improving the quality and interpretability of diagnostic images, ultimately supporting early detection, accurate diagnosis, and effective treatment planning. Despite advancements in imaging technologies such as X-ray, CT, MRI, and ultrasound, medical images often suffer from challenges like noise, artifacts, and low contrast, which limit their diagnostic potential. Addressing these challenges requires robust preprocessing, denoising algorithms, and advanced enhancement methods, with deep learning techniques playing an increasingly significant role. This systematic literature review, following the PRISMA approach, investigates the key challenges, recent advancements, and evaluation metrics in medical image enhancement. By analyzing findings from 39 peer-reviewed studies, this review provides insights into the effectiveness of various enhancement methods across different imaging modalities and the importance of evaluation metrics in assessing their impact. Key issues like low contrast and noise are identified as the most frequent, with MRI and multi-modal imaging receiving the most attention, while specialized modalities such as histopathology, endoscopy, and bone scintigraphy remain underexplored. Out of the 39 studies, 29 utilize conventional mathematical methods, 9 focus on deep learning techniques, and 1 explores a hybrid approach. In terms of image quality assessment, 18 studies employ both reference-based and non-reference-based metrics, 9 rely solely on reference-based metrics, and 12 use only non-reference-based metrics, with a total of 65 IQA metrics introduced, predominantly non-reference-based. This review highlights current limitations, research gaps, and potential future directions for advancing medical image enhancement.

34. 【2510.13630】AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset

链接https://arxiv.org/abs/2510.13630

作者:Amjid Ali,Zulfiqar Ahmad Khan,Altaf Hussain,Muhammad Munsif,Adnan Hussain,Sung Wook Baik

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Anomaly recognition, Anomaly recognition plays, role in surveillance, public safety, Anomaly

备注

点击查看摘要

Abstract:Anomaly recognition plays a vital role in surveillance, transportation, healthcare, and public safety. However, most existing approaches rely solely on visual data, making them unreliable under challenging conditions such as occlusion, low illumination, and adverse weather. Moreover, the absence of large-scale synchronized audio-visual datasets has hindered progress in multimodal anomaly recognition. To address these limitations, this study presents AVAR-Net, a lightweight and efficient audio-visual anomaly recognition framework designed for real-world environments. AVAR-Net consists of four main modules: an audio feature extractor, a video feature extractor, fusion strategy, and a sequential pattern learning network that models cross-modal relationships for anomaly recognition. Specifically, the Wav2Vec2 model extracts robust temporal features from raw audio, while MobileViT captures both local and global visual representations from video frames. An early fusion mechanism combines these modalities, and a Multi-Stage Temporal Convolutional Network (MTCN) model that learns long-range temporal dependencies within the fused representation, enabling robust spatiotemporal reasoning. A novel Visual-Audio Anomaly Recognition (VAAR) dataset, is also introduced, serving as a medium-scale benchmark containing 3,000 real-world videos with synchronized audio across ten diverse anomaly classes. Experimental evaluations demonstrate that AVAR-Net achieves 89.29% accuracy on VAAR and 88.56% Average Precision on the XD-Violence dataset, improving Average Precision by 2.8% over existing state-of-the-art methods. These results highlight the effectiveness, efficiency, and generalization capability of the proposed framework, as well as the utility of VAAR as a benchmark for advancing multimodal anomaly recognition research.

35. 【2510.13626】LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

链接https://arxiv.org/abs/2510.13626

作者:Senyu Fei,Siyin Wang,Junhao Shi,Zihao Dai,Jikun Cai,Pengfang Qian,Li Ji,Xinzhe He,Shiduo Zhang,Zhaoye Fei,Jinlan Fu,Jingjing Gong,Xipeng Qiu

类目:Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:report impressive success, impressive success rates, mask fundamental weaknesses, robotic manipulation benchmarks, models report impressive

备注

点击查看摘要

Abstract:Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.

36. 【2510.13620】Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues

链接https://arxiv.org/abs/2510.13620

作者:Chen Chen,Kangcheng Bin,Ting Hu,Jiahao Qi,Xingyue Liu,Tianpeng Liu,Zhen Liu,Yongxiang Liu,Ping Zhong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Unmanned aerial vehicles, based object detection, deep learning techniques, Unmanned aerial, images facilitates robust

备注

点击查看摘要

Abstract:Unmanned aerial vehicles (UAV)-based object detection with visible (RGB) and infrared (IR) images facilitates robust around-the-clock detection, driven by advancements in deep learning techniques and the availability of high-quality dataset. However, the existing dataset struggles to fully capture real-world complexity for limited imaging conditions. To this end, we introduce a high-diversity dataset ATR-UMOD covering varying scenarios, spanning altitudes from 80m to 300m, angles from 0° to 75°, and all-day, all-year time variations in rich weather and illumination conditions. Moreover, each RGB-IR image pair is annotated with 6 condition attributes, offering valuable high-level contextual information. To meet the challenge raised by such diverse conditions, we propose a novel prompt-guided condition-aware dynamic fusion (PCDF) to adaptively reassign multimodal contributions by leveraging annotated condition cues. By encoding imaging conditions as text prompts, PCDF effectively models the relationship between conditions and multimodal contributions through a task-specific soft-gating transformation. A prompt-guided condition-decoupling module further ensures the availability in practice without condition annotations. Experiments on ATR-UMOD dataset reveal the effectiveness of PCDF.

37. 【2510.13565】XD-RCDepth: Lightweight Radar-Camera Depth Estimation with Explainability-Aligned and Distribution-Aware Distillation

链接https://arxiv.org/abs/2510.13565

作者:Huawei Sun,Zixu Wang,Xiangyuan Peng,Julius Ott,Georg Stettinger,Lorenzo Servadei,Robert Wille

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:complementary geometric cues, estimation remains central, radar-camera fusion offers, fusion offers robustness, providing complementary geometric

备注: Submitted to ICASSP 2026

点击查看摘要

Abstract:Depth estimation remains central to autonomous driving, and radar-camera fusion offers robustness in adverse conditions by providing complementary geometric cues. In this paper, we present XD-RCDepth, a lightweight architecture that reduces the parameters by 29.7% relative to the state-of-the-art lightweight baseline while maintaining comparable accuracy. To preserve performance under compression and enhance interpretability, we introduce two knowledge-distillation strategies: an explainability-aligned distillation that transfers the teacher's saliency structure to the student, and a depth-distribution distillation that recasts depth regression as soft classification over discretized bins. Together, these components reduce the MAE compared with direct training with 7.97% and deliver competitive accuracy with real-time efficiency on nuScenes and ZJU-4DRadarCam datasets.

38. 【2510.13557】Modeling Cultural Bias in Facial Expression Recognition with Adaptive Agents

链接https://arxiv.org/abs/2510.13557

作者:David Freire-Obregón,José Salas-Cáceres,Javier Lorenzo-Navarro,Oliverio J. Santana,Daniel Hernández-Sosa,Modesto Castrillón-Santana

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Facial expression recognition, perceptually degraded visual, existing evaluations assume, evaluations assume homogeneous, assume homogeneous data

备注: Accepted for presentation at the International Symposium on Agentic Artificial Intelligence Systems (AAIS 2025)

点击查看摘要

Abstract:Facial expression recognition (FER) must remain robust under both cultural variation and perceptually degraded visual conditions, yet most existing evaluations assume homogeneous data and high-quality imagery. We introduce an agent-based, streaming benchmark that reveals how cross-cultural composition and progressive blurring interact to shape face recognition robustness. Each agent operates in a frozen CLIP feature space with a lightweight residual adapter trained online at sigma=0 and fixed during testing. Agents move and interact on a 5x5 lattice, while the environment provides inputs with sigma-scheduled Gaussian blur. We examine monocultural populations (Western-only, Asian-only) and mixed environments with balanced (5/5) and imbalanced (8/2, 2/8) compositions, as well as different spatial contact structures. Results show clear asymmetric degradation curves between cultural groups: JAFFE (Asian) populations maintain higher performance at low blur but exhibit sharper drops at intermediate stages, whereas KDEF (Western) populations degrade more uniformly. Mixed populations exhibit intermediate patterns, with balanced mixtures mitigating early degradation, but imbalanced settings amplify majority-group weaknesses under high blur. These findings quantify how cultural composition and interaction structure influence the robustness of FER as perceptual conditions deteriorate.

39. 【2510.13546】Accelerated Feature Detectors for Visual SLAM: A Comparative Study of FPGA vs GPU

链接https://arxiv.org/abs/2510.13546

作者:Ruiqi Ye,Mikel Luján

类目:Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Performance (cs.PF); Robotics (cs.RO)

关键词:Localization and Mapping, Simultaneous Localization, Graphics Processing Units, Programmable Gate Array, Field Programmable Gate

备注: 12 pages, 7 figures

点击查看摘要

Abstract:Feature detection is a common yet time-consuming module in Simultaneous Localization and Mapping (SLAM) implementations, which are increasingly deployed on power-constrained platforms, such as drones. Graphics Processing Units (GPUs) have been a popular accelerator for computer vision in general, and feature detection and SLAM in particular. On the other hand, System-on-Chips (SoCs) with integrated Field Programmable Gate Array (FPGA) are also widely available. This paper presents the first study of hardware-accelerated feature detectors considering a Visual SLAM (V-SLAM) pipeline. We offer new insights by comparing the best GPU-accelerated FAST, Harris, and SuperPoint implementations against the FPGA-accelerated counterparts on modern SoCs (Nvidia Jetson Orin and AMD Versal). The evaluation shows that when using a non-learning-based feature detector such as FAST and Harris, their GPU implementations, and the GPU-accelerated V-SLAM can achieve better run-time performance and energy efficiency than the FAST and Harris FPGA implementations as well as the FPGA-accelerated V-SLAM. However, when considering a learning-based detector such as SuperPoint, its FPGA implementation can achieve better run-time performance and energy efficiency (up to 3.1$\times$ and 1.4$\times$ improvements, respectively) than the GPU implementation. The FPGA-accelerated V-SLAM can also achieve comparable run-time performance compared to the GPU-accelerated V-SLAM, with better FPS in 2 out of 5 dataset sequences. When considering the accuracy, the results show that the GPU-accelerated V-SLAM is more accurate than the FPGA-accelerated V-SLAM in general. Last but not least, the use of hardware acceleration for feature detection could further improve the performance of the V-SLAM pipeline by having the global bundle adjustment module invoked less frequently without sacrificing accuracy.

Comments:
12 pages, 7 figures

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Performance (cs.PF); Robotics (cs.RO)

ACMclasses:
C.3; C.4; I.4.6

Cite as:
arXiv:2510.13546 [cs.CV]

(or
arXiv:2510.13546v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.13546

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
40. 【2510.13540】Learning Neural Parametric 3D Breast Shape Models for Metrical Surface Reconstruction From Monocular RGB Videos

链接https://arxiv.org/abs/2510.13540

作者:Maximilian Weiherer,Antonia von Riedheim,Vanessa Brébant,Bernhard Egger,Christoph Palm

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:monocular RGB video, record RGB videos, recovering accurate breast, RGB video, RGB videos

备注: 18 pages, 12 figures

点击查看摘要

Abstract:We present a neural parametric 3D breast shape model and, based on this model, introduce a low-cost and accessible 3D surface reconstruction pipeline capable of recovering accurate breast geometry from a monocular RGB video. In contrast to widely used, commercially available yet prohibitively expensive 3D breast scanning solutions and existing low-cost alternatives, our method requires neither specialized hardware nor proprietary software and can be used with any device that is able to record RGB videos. The key building blocks of our pipeline are a state-of-the-art, off-the-shelf Structure-from-motion pipeline, paired with a parametric breast model for robust and metrically correct surface reconstruction. Our model, similarly to the recently proposed implicit Regensburg Breast Shape Model (iRBSM), leverages implicit neural representations to model breast shapes. However, unlike the iRBSM, which employs a single global neural signed distance function (SDF), our approach -- inspired by recent state-of-the-art face models -- decomposes the implicit breast domain into multiple smaller regions, each represented by a local neural SDF anchored at anatomical landmark positions. When incorporated into our surface reconstruction pipeline, the proposed model, dubbed liRBSM (short for localized iRBSM), significantly outperforms the iRBSM in terms of reconstruction quality, yielding more detailed surface reconstruction than its global counterpart. Overall, we find that the introduced pipeline is able to recover high-quality 3D breast geometry within an error margin of less than 2 mm. Our method is fast (requires less than six minutes), fully transparent and open-source, and -- together with the model -- publicly available at this https URL.

41. 【2510.13534】High Semantic Features for the Continual Learning of Complex Emotions: a Lightweight Solution

链接https://arxiv.org/abs/2510.13534

作者:Thibault Geoffroy,gauthier Gerspacher,Lionel Prevost

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:potential catastrophic forgetting, Incremental learning, complex process due, potential catastrophic, catastrophic forgetting

备注: 10 pages, 14 figures

点击查看摘要

Abstract:Incremental learning is a complex process due to potential catastrophic forgetting of old tasks when learning new ones. This is mainly due to transient features that do not fit from task to task. In this paper, we focus on complex emotion recognition. First, we learn basic emotions and then, incrementally, like humans, complex emotions. We show that Action Units, describing facial muscle movements, are non-transient, highly semantical features that outperform those extracted by both shallow and deep convolutional neural networks. Thanks to this ability, our approach achieves interesting results when learning incrementally complex, compound emotions with an accuracy of 0.75 on the CFEE dataset and can be favorably compared to state-of-the-art results. Moreover, it results in a lightweight model with a small memory footprint.

42. 【2510.13515】UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

链接https://arxiv.org/abs/2510.13515

作者:Tiancheng Gu,Kaicheng Yang,Kaichen Zhang,Xiang An,Ziyong Feng,Yueyi Zhang,Weidong Cai,Jiankang Deng,Lidong Bing

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Universal multimodal embedding, Universal multimodal, multimodal embedding, hard negatives, semantic

备注: 12 pages, 6 figures, 11 tables

点击查看摘要

Abstract:Universal multimodal embedding models are foundational to various tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance on average across all tasks.

43. 【2510.13493】ExpressNet-MoE: A Hybrid Deep Neural Network for Emotion Recognition

链接https://arxiv.org/abs/2510.13493

作者:Deeptimaan Banerjee,Prateek Gothwal,Ashis Kumer Biswas

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:including online education, including online, online education, human-computer interaction, FER

备注: * Current version of the manuscript contains 17 pages including text, 13 figures, and 4 tables. The manuscript is currently under review at a journal

点击查看摘要

Abstract:In many domains, including online education, healthcare, security, and human-computer interaction, facial emotion recognition (FER) is essential. Real-world FER is still difficult despite its significance because of some factors such as variable head positions, occlusions, illumination shifts, and demographic diversity. Engagement detection, which is essential for applications like virtual learning and customer services, is frequently challenging due to FER limitations by many current models. In this article, we propose ExpressNet-MoE, a novel hybrid deep learning model that blends both Convolution Neural Networks (CNNs) and Mixture of Experts (MoE) framework, to overcome the difficulties. Our model dynamically chooses the most pertinent expert networks, thus it aids in the generalization and providing flexibility to model across a wide variety of datasets. Our model improves on the accuracy of emotion recognition by utilizing multi-scale feature extraction to collect both global and local facial features. ExpressNet-MoE includes numerous CNN-based feature extractors, a MoE module for adaptive feature selection, and finally a residual network backbone for deep feature learning. To demonstrate efficacy of our proposed model we evaluated on several datasets, and compared with current state-of-the-art methods. Our model achieves accuracies of 74.77% on AffectNet (v7), 72.55% on AffectNet (v8), 84.29% on RAF-DB, and 64.66% on FER-2013. The results show how adaptive our model is and how it may be used to develop end-to-end emotion recognition systems in practical settings. Reproducible codes and results are made publicly accessible at this https URL.

44. 【2510.13464】hrough the Lens of Doubt: Robust and Efficient Uncertainty Estimation for Visual Place Recognition

链接https://arxiv.org/abs/2510.13464

作者:Emily Miller,Michael Milford,Muhammad Burhan Hafez,SD Ramchurn,Shoaib Ehsan

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Visual Place Recognition, identify previously visited, matching current observations, Place Recognition, previously visited locations

备注

点击查看摘要

Abstract:Visual Place Recognition (VPR) enables robots and autonomous vehicles to identify previously visited locations by matching current observations against a database of known places. However, VPR systems face significant challenges when deployed across varying visual environments, lighting conditions, seasonal changes, and viewpoints changes. Failure-critical VPR applications, such as loop closure detection in simultaneous localization and mapping (SLAM) pipelines, require robust estimation of place matching uncertainty. We propose three training-free uncertainty metrics that estimate prediction confidence by analyzing inherent statistical patterns in similarity scores from any existing VPR method. Similarity Distribution (SD) quantifies match distinctiveness by measuring score separation between candidates; Ratio Spread (RS) evaluates competitive ambiguity among top-scoring locations; and Statistical Uncertainty (SU) is a combination of SD and RS that provides a unified metric that generalizes across datasets and VPR methods without requiring validation data to select the optimal metric. All three metrics operate without additional model training, architectural modifications, or computationally expensive geometric verification. Comprehensive evaluation across nine state-of-the-art VPR methods and six benchmark datasets confirms that our metrics excel at discriminating between correct and incorrect VPR matches, and consistently outperform existing approaches while maintaining negligible computational overhead, making it deployable for real-time robotic applications across varied environmental conditions with improved precision-recall performance.

45. 【2510.13454】VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator

链接https://arxiv.org/abs/2510.13454

作者:Hyojun Go,Dominik Narnhofer,Goutam Bhat,Prune Truong,Federico Tombari,Konrad Schindler

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:visual content generation, progress of large, rapid progress, visual content, reconstruction opens

备注: Project page: [this https URL](https://gohyojun15.github.io/VIST3A/)

点击查看摘要

Abstract:The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as "generator" with the geometric abilities of a recent (feedforward) 3D reconstruction system as "decoder". We introduce VIST3A, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit model stitching, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt direct reward finetuning, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.

46. 【2510.13452】Near-Infrared Hyperspectral Imaging Applications in Food Analysis -- Improving Algorithms and Methodologies

链接https://arxiv.org/abs/2510.13452

作者:Ole-Christian Galbo Engstrøm

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:near-infrared hyperspectral imaging, food quality analysis, hyperspectral imaging, investigates the application, application of near-infrared

备注: PhD thesis

点击查看摘要

Abstract:This thesis investigates the application of near-infrared hyperspectral imaging (NIR-HSI) for food quality analysis. The investigation is conducted through four studies operating with five research hypotheses. For several analyses, the studies compare models based on convolutional neural networks (CNNs) and partial least squares (PLS). Generally, joint spatio-spectral analysis with CNNs outperforms spatial analysis with CNNs and spectral analysis with PLS when modeling parameters where chemical and physical visual information are relevant. When modeling chemical parameters with a 2-dimensional (2D) CNN, augmenting the CNN with an initial layer dedicated to performing spectral convolution enhances its predictive performance by learning a spectral preprocessing similar to that applied by domain experts. Still, PLS-based spectral modeling performs equally well for analysis of the mean content of chemical parameters in samples and is the recommended approach. Modeling the spatial distribution of chemical parameters with NIR-HSI is limited by the ability to obtain spatially resolved reference values. Therefore, a study used bulk mean references for chemical map generation of fat content in pork bellies. A PLS-based approach gave non-smooth chemical maps and pixel-wise predictions outside the range of 0-100\%. Conversely, a 2D CNN augmented with a spectral convolution layer mitigated all issues arising with PLS. The final study attempted to model barley's germinative capacity by analyzing NIR spectra, RGB images, and NIR-HSI images. However, the results were inconclusive due to the dataset's low degree of germination. Additionally, this thesis has led to the development of two open-sourced Python packages. The first facilitates fast PLS-based modeling, while the second facilitates very fast cross-validation of PLS and other classical machine learning models with a new algorithm.

47. 【2510.13433】Beyond Pixels: A Differentiable Pipeline for Probing Neuronal Selectivity in 3D

链接https://arxiv.org/abs/2510.13433

作者:Pavithra Elumalai,Mohammad Bashiri,Goirik Chakrabarty,Suhas Shrinivasan,Fabian H. Sinz

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visual perception relies, relies on inference, visual sensory neurons, perception relies, Visual perception

备注: Accepted in Symmetry and Geometry in Neural Representations 2025 (Extended Abstract Track)

点击查看摘要

Abstract:Visual perception relies on inference of 3D scene properties such as shape, pose, and lighting. To understand how visual sensory neurons enable robust perception, it is crucial to characterize their selectivity to such physically interpretable factors. However, current approaches mainly operate on 2D pixels, making it difficult to isolate selectivity for physical scene properties. To address this limitation, we introduce a differentiable rendering pipeline that optimizes deformable meshes to obtain MEIs directly in 3D. The method parameterizes mesh deformations with radial basis functions and learns offsets and scales that maximize neuronal responses while enforcing geometric regularity. Applied to models of monkey area V4, our approach enables probing neuronal selectivity to interpretable 3D factors such as pose and lighting. This approach bridges inverse graphics with systems neuroscience, offering a way to probe neural selectivity with physically grounded, 3D stimuli beyond conventional pixel-based methods.

48. 【2510.13432】CoDS: Enhancing Collaborative Perception in Heterogeneous Scenarios via Domain Separation

链接https://arxiv.org/abs/2510.13432

作者:Yushan Han,Hui Zhang,Honglei Zhang,Chuntao Ding,Yuanzhouhan Cao,Yidong Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:improve individual perception, Domain Separation, Collaborative perception, domain, multi-agent interaction

备注: Accepted by IEEE Transactions on Mobile Computing

点击查看摘要

Abstract:Collaborative perception has been proven to improve individual perception in autonomous driving through multi-agent interaction. Nevertheless, most methods often assume identical encoders for all agents, which does not hold true when these models are deployed in real-world applications. To realize collaborative perception in actual heterogeneous scenarios, existing methods usually align neighbor features to those of the ego vehicle, which is vulnerable to noise from domain gaps and thus fails to address feature discrepancies effectively. Moreover, they adopt transformer-based modules for domain adaptation, which causes the model inference inefficiency on mobile devices. To tackle these issues, we propose CoDS, a Collaborative perception method that leverages Domain Separation to address feature discrepancies in heterogeneous scenarios. The CoDS employs two feature alignment modules, i.e., Lightweight Spatial-Channel Resizer (LSCR) and Distribution Alignment via Domain Separation (DADS). Besides, it utilizes the Domain Alignment Mutual Information (DAMI) loss to ensure effective feature alignment. Specifically, the LSCR aligns the neighbor feature across spatial and channel dimensions using a lightweight convolutional layer. Subsequently, the DADS mitigates feature distribution discrepancy with encoder-specific and encoder-agnostic domain separation modules. The former removes domain-dependent information and the latter captures task-related information. During training, the DAMI loss maximizes the mutual information between aligned heterogeneous features to enhance the domain separation process. The CoDS employs a fully convolutional architecture, which ensures high inference efficiency. Extensive experiments demonstrate that the CoDS effectively mitigates feature discrepancies in heterogeneous scenarios and achieves a trade-off between detection accuracy and inference efficiency.

49. 【2510.13419】Ultra High-Resolution Image Inpainting with Patch-Based Content Consistency Adapter

链接https://arxiv.org/abs/2510.13419

作者:Jianhui Zhang,Sheng Cheng,Qirui Sun,Jia Liu,Wang Luyang,Chaoyu Feng,Chen Fang,Lei Lei,Jue Wang,Shuaicheng Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high-resolution text-guided image, text-guided image inpainting, Dual Context Adapter, Reference Patch Adapter, effective framework

备注

点击查看摘要

Abstract:In this work, we present Patch-Adapter, an effective framework for high-resolution text-guided image inpainting. Unlike existing methods limited to lower resolutions, our approach achieves 4K+ resolution while maintaining precise content consistency and prompt alignment, two critical challenges in image inpainting that intensify with increasing resolution and texture complexity. Patch-Adapter leverages a two-stage adapter architecture to scale the diffusion model's resolution from 1K to 4K+ without requiring structural overhauls: (1) Dual Context Adapter learns coherence between masked and unmasked regions at reduced resolutions to establish global structural consistency; and (2) Reference Patch Adapter implements a patch-level attention mechanism for full-resolution inpainting, preserving local detail fidelity through adaptive feature fusion. This dual-stage architecture uniquely addresses the scalability gap in high-resolution inpainting by decoupling global semantics from localized refinement. Experiments demonstrate that Patch-Adapter not only resolves artifacts common in large-scale inpainting but also achieves state-of-the-art performance on the OpenImages and Photo-Concept-Bucket datasets, outperforming existing methods in both perceptual quality and text-prompt adherence.

50. 【2510.13418】Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation

链接https://arxiv.org/abs/2510.13418

作者:Yifu Luo,Xinhao Hu,Keyu Fan,Haoyuan Sun,Zeyu Chen,Bo Xia,Tiantian Zhang,Yongzhe Chang,Xueqian Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:garnered increasing attention, Reinforcement learning, Relative Policy Optimization, Group Relative Policy, garnered increasing

备注

点击查看摘要

Abstract:Reinforcement learning (RL) has garnered increasing attention in text-to-image (T2I) generation. However, most existing RL approaches are tailored to either diffusion models or autoregressive models, overlooking an important alternative: masked generative models. In this work, we propose Mask-GRPO, the first method to incorporate Group Relative Policy Optimization (GRPO)-based RL into this overlooked paradigm. Our core insight is to redefine the transition probability, which is different from current approaches, and formulate the unmasking process as a multi-step decision-making problem. To further enhance our method, we explore several useful strategies, including removing the KL constraint, applying the reduction strategy, and filtering out low-quality samples. Using Mask-GRPO, we improve a base model, Show-o, with substantial improvements on standard T2I benchmarks and preference alignment, outperforming existing state-of-the-art approaches. The code is available on this https URL

51. 【2510.13394】Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models

链接https://arxiv.org/abs/2510.13394

作者:Xinmiao Huang,Qisong He,Zhenglin Huang,Boxuan Wang,Zhuoyun Li,Guangliang Cheng,Yi Dong,Xiaowei Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Language Models, Language Models, Vision Language, domains including robotics, support real-world applications

备注

点击查看摘要

Abstract:Spatial reasoning ability is crucial for Vision Language Models (VLMs) to support real-world applications in diverse domains including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks are inadequate in assessing spatial reasoning ability, especially the \emph{intrinsic-dynamic} spatial reasoning which is a fundamental aspect of human spatial cognition. In this paper, we propose a unified benchmark, \textbf{Spatial-DISE}, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants: \textbf{I}ntrinsic-\textbf{S}tatic, Intrinsic-\textbf{D}ynamic, \textbf{E}xtrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover, to address the issue of data scarcity, we develop a scalable and automated pipeline to generate diverse and verifiable spatial reasoning questions, resulting in a new \textbf{Spatial-DISE} dataset that includes Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs). Our comprehensive evaluation across 28 state-of-the-art VLMs reveals that, current VLMs have a large and consistent gap to human competence, especially on multi-step multi-view spatial reasoning. Spatial-DISE offers a robust framework, valuable dataset, and clear direction for future research toward human-like spatial intelligence. Benchmark, dataset, and code will be publicly released.

52. 【2510.13390】Generalizing WiFi Gesture Recognition via Large-Model-Aware Semantic Distillation and Alignment

链接https://arxiv.org/abs/2510.13390

作者:Feng-Qi Cui,Yu-Tong Guo,Tianyue Zheng,Jinyang Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:privacy-preserving human-computer interaction, Channel State Information, promising RF sensing, sensing paradigm, paradigm for enabling

备注: Accepted by IEEE ICPADS 2025

点击查看摘要

Abstract:WiFi-based gesture recognition has emerged as a promising RF sensing paradigm for enabling non-contact and privacy-preserving human-computer interaction in AIoT environments. However, existing methods often suffer from limited generalization and semantic expressiveness due to the domain-sensitive nature of Channel State Information and the lack of high-level gesture abstraction. To address these challenges, we propose a novel generalization framework, termed Large-Model-Aware Semantic Distillation and Alignment (GLSDA), which leverages the semantic prior of pre-trained large foundation models to enhance gesture representation learning in both in-domain and cross-domain scenarios. Specifically, we first design a dual-path CSI encoding pipeline that captures geometric and dynamic gesture patterns via CSI-Ratio phase sequences and Doppler spectrograms. These representations are then fed into a Multiscale Semantic Encoder, which learns robust temporal embeddings and aligns them with gesture semantics through cross-modal attention mechanisms. To further enhance category discrimination, we introduce a Semantic-Aware Soft Supervision scheme that encodes inter-class correlations and reduces label ambiguity, especially for semantically similar gestures. Finally, we develop a Robust Dual-Distillation strategy to compress the aligned model into a lightweight student network, jointly distilling intermediate features and semantic-informed soft labels from the teacher model. Extensive experiments on the Widar3.0 benchmark show that GLSDA consistently outperforms state-of-the-art methods in both in-domain and cross-domain gesture recognition tasks, while significantly reducing model size and inference latency. Our method offers a scalable and deployable solution for generalized RF-based gesture interfaces in real-world AIoT applications.

53. 【2510.13381】Leveraging 2D Priors and SDF Guidance for Dynamic Urban Scene Rendering

链接https://arxiv.org/abs/2510.13381

作者:Siddharth Tourani,Jayaram Reddy,Akash Kumbar,Satyajit Tourani,Nishant Goyal,Madhava Krishna,N. Dinesh Reddy,Muhammad Haris Khan

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:Gaussian Splatting, signed distance function, augmented reality, reconstruction play, play a crucial

备注: Accepted at ICCV-2025, project page: [this https URL](https://dynamic-ugsdf.github.io/)

点击查看摘要

Abstract:Dynamic scene rendering and reconstruction play a crucial role in computer vision and augmented reality. Recent methods based on 3D Gaussian Splatting (3DGS), have enabled accurate modeling of dynamic urban scenes, but for urban scenes they require both camera and LiDAR data, ground-truth 3D segmentations and motion data in the form of tracklets or pre-defined object templates such as SMPL. In this work, we explore whether a combination of 2D object agnostic priors in the form of depth and point tracking coupled with a signed distance function (SDF) representation for dynamic objects can be used to relax some of these requirements. We present a novel approach that integrates Signed Distance Functions (SDFs) with 3D Gaussian Splatting (3DGS) to create a more robust object representation by harnessing the strengths of both methods. Our unified optimization framework enhances the geometric accuracy of 3D Gaussian splatting and improves deformation modeling within the SDF, resulting in a more adaptable and precise representation. We demonstrate that our method achieves state-of-the-art performance in rendering metrics even without LiDAR data on urban scenes. When incorporating LiDAR, our approach improved further in reconstructing and generating novel views across diverse object categories, without ground-truth 3D motion annotation. Additionally, our method enables various scene editing tasks, including scene decomposition, and scene composition.

54. 【2510.13375】DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning

链接https://arxiv.org/abs/2510.13375

作者:Tianyuan Yuan,Yicheng Liu,Chenhao Lu,Zhuoguang Chen,Tao Jiang,Hang Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:language-guided manipulation capabilities, recently shown impressive, shown impressive generalization, manipulation capabilities, recently shown

备注

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial reasoning inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attentions, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.

55. 【2510.13364】Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity

链接https://arxiv.org/abs/2510.13364

作者:MingZe Tang,Jubal Chandy Jacob

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Recent Vision-Language Models, Recent Vision-Language, enable zero-shot classification, shared space, data-scarce conditions

备注

点击查看摘要

Abstract:Recent Vision-Language Models (VLMs) enable zero-shot classification by aligning images and text in a shared space, a promising approach for data-scarce conditions. However, the influence of prompt design on recognizing visually similar categories, such as human postures, is not well understood. This study investigates how prompt specificity affects the zero-shot classification of sitting, standing, and walking/running on a small, 285-image COCO-derived dataset. A suite of modern VLMs, including OpenCLIP, MetaCLIP 2, and SigLip, were evaluated using a three-tiered prompt design that systematically increases linguistic detail. Our findings reveal a compelling, counter-intuitive trend: for the highest-performing models (MetaCLIP 2 and OpenCLIP), the simplest, most basic prompts consistently achieve the best results. Adding descriptive detail significantly degrades performance for instance, MetaCLIP 2's multi-class accuracy drops from 68.8\% to 55.1\% a phenomenon we term "prompt overfitting". Conversely, the lower-performing SigLip model shows improved classification on ambiguous classes when given more descriptive, body-cue-based prompts.

56. 【2510.13359】Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models

链接https://arxiv.org/abs/2510.13359

作者:Yuki Yada,Sho Akiyama,Ryo Watanabe,Yuta Ueno,Yusuke Shido,Andre Rusli

类目:Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:recommending visually similar, active monthly users, visually similar products, efficiently discover items, large-scale e-commerce platforms

备注: Accepted to ACM RecSys 2025 (Spotlight)

点击查看摘要

Abstract:On large-scale e-commerce platforms with tens of millions of active monthly users, recommending visually similar products is essential for enabling users to efficiently discover items that align with their preferences. This study presents the application of a vision-language model (VLM) -- which has demonstrated strong performance in image recognition and image-text retrieval tasks -- to product recommendations on Mercari, a major consumer-to-consumer marketplace used by more than 20 million monthly users in Japan. Specifically, we fine-tuned SigLIP, a VLM employing a sigmoid-based contrastive loss, using one million product image-title pairs from Mercari collected over a three-month period, and developed an image encoder for generating item embeddings used in the recommendation system. Our evaluation comprised an offline analysis of historical interaction logs and an online A/B test in a production environment. In offline analysis, the model achieved a 9.1% improvement in nDCG@5 compared with the baseline. In the online A/B test, the click-through rate improved by 50% whereas the conversion rate improved by 14% compared with the existing model. These results demonstrate the effectiveness of VLM-based encoders for e-commerce product recommendations and provide practical insights into the development of visual similarity-based recommendation systems.

57. 【2510.13349】No-Reference Rendered Video Quality Assessment: Dataset and Metrics

链接https://arxiv.org/abs/2510.13349

作者:Sipeng Yang,Jiayu Ji,Qingchuan Zhu,Zhiyao Yang,Xiaogang Jin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:including video games, virtual reality, computer graphics applications, augmented reality, NR-VQA

备注

点击查看摘要

Abstract:Quality assessment of videos is crucial for many computer graphics applications, including video games, virtual reality, and augmented reality, where visual performance has a significant impact on user experience. When test videos cannot be perfectly aligned with references or when references are unavailable, the significance of no-reference video quality assessment (NR-VQA) methods is undeniable. However, existing NR-VQA datasets and metrics are primarily focused on camera-captured videos; applying them directly to rendered videos would result in biased predictions, as rendered videos are more prone to temporal artifacts. To address this, we present a large rendering-oriented video dataset with subjective quality annotations, as well as a designed NR-VQA metric specific to rendered videos. The proposed dataset includes a wide range of 3D scenes and rendering settings, with quality scores annotated for various display types to better reflect real-world application scenarios. Building on this dataset, we calibrate our NR-VQA metric to assess rendered video quality by looking at both image quality and temporal stability. We compare our metric to existing NR-VQA metrics, demonstrating its superior performance on rendered videos. Finally, we demonstrate that our metric can be used to benchmark supersampling methods and assess frame generation strategies in real-time rendering.

58. 【2510.13331】Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models

链接https://arxiv.org/abs/2510.13331

作者:Hong-Kai Zheng,Piji Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Quantized Variational Autoencoders, Vector Quantized Variational, Variational Autoencoders, represent continuous vectors, Quantized Variational

备注

点击查看摘要

Abstract:Vector Quantized Variational Autoencoders (VQ-VAEs) leverage self-supervised learning through reconstruction tasks to represent continuous vectors using the closest vectors in a codebook. However, issues such as codebook collapse persist in the VQ model. To address these issues, existing approaches employ implicit static codebooks or jointly optimize the entire codebook, but these methods constrain the codebook's learning capability, leading to reduced reconstruction quality. In this paper, we propose Group-VQ, which performs group-wise optimization on the codebook. Each group is optimized independently, with joint optimization performed within groups. This approach improves the trade-off between codebook utilization and reconstruction performance. Additionally, we introduce a training-free codebook resampling method, allowing post-training adjustment of the codebook size. In image reconstruction experiments under various settings, Group-VQ demonstrates improved performance on reconstruction metrics. And the post-training codebook sampling method achieves the desired flexibility in adjusting the codebook size.

59. 【2510.13326】DEF-YOLO: Leveraging YOLO for Concealed Weapon Detection in Thermal Imagin

链接https://arxiv.org/abs/2510.13326

作者:Divya Bhardwaj,Arnav Ramamoorthy,Poonam Goyal

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Concealed weapon detection, Concealed weapon, detecting weapons hidden, weapons hidden beneath, weapon detection

备注

点击查看摘要

Abstract:Concealed weapon detection aims at detecting weapons hidden beneath a person's clothing or luggage. Various imaging modalities like Millimeter Wave, Microwave, Terahertz, Infrared, etc., are exploited for the concealed weapon detection task. These imaging modalities have their own limitations, such as poor resolution in microwave imaging, privacy concerns in millimeter wave imaging, etc. To provide a real-time, 24 x 7 surveillance, low-cost, and privacy-preserved solution, we opted for thermal imaging in spite of the lack of availability of a benchmark dataset. We propose a novel approach and a dataset for concealed weapon detection in thermal imagery. Our YOLO-based architecture, DEF-YOLO, is built with key enhancements in YOLOv8 tailored to the unique challenges of concealed weapon detection in thermal vision. We adopt deformable convolutions at the SPPF layer to exploit multi-scale features; backbone and neck layers to extract low, mid, and high-level features, enabling DEF-YOLO to adaptively focus on localization around the objects in thermal homogeneous regions, without sacrificing much of the speed and throughput. In addition to these simple yet effective key architectural changes, we introduce a new, large-scale Thermal Imaging Concealed Weapon dataset, TICW, featuring a diverse set of concealed weapons and capturing a wide range of scenarios. To the best of our knowledge, this is the first large-scale contributed dataset for this task. We also incorporate focal loss to address the significant class imbalance inherent in the concealed weapon detection task. The efficacy of the proposed work establishes a new benchmark through extensive experimentation for concealed weapon detection in thermal imagery.

60. 【2510.13317】Removing Cost Volumes from Optical Flow Estimators

链接https://arxiv.org/abs/2510.13317

作者:Simon Kiefhaber,Stefan Roth,Simone Schaub-Meyer

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:modern optical flow, optical flow estimator, Cost volumes, space complexity, optical flow

备注: ICCV 2025

点击查看摘要

Abstract:Cost volumes are used in every modern optical flow estimator, but due to their computational and space complexity, they are often a limiting factor regarding both processing speed and the resolution of input frames. Motivated by our empirical observation that cost volumes lose their importance once all other network parts of, e.g., a RAFT-based pipeline have been sufficiently trained, we introduce a training strategy that allows removing the cost volume from optical flow estimators throughout training. This leads to significantly improved inference speed and reduced memory requirements. Using our training strategy, we create three different models covering different compute budgets. Our most accurate model reaches state-of-the-art accuracy while being $1.2\times$ faster and having a $6\times$ lower memory footprint than comparable models; our fastest model is capable of processing Full HD frames at $20\,\mathrm{FPS}$ using only $500\,\mathrm{MB}$ of GPU memory.

61. 【2510.13316】Visual Interestingness Decoded: How GPT-4o Mirrors Human Interests

链接https://arxiv.org/abs/2510.13316

作者:Fitim Abdullahu,Helmut Grabner

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:daily life, life is highly, highly influenced, Large Multimodal Models, Large Multimodal

备注: ICCV 2025

点击查看摘要

Abstract:Our daily life is highly influenced by what we consume and see. Attracting and holding one's attention -- the definition of (visual) interestingness -- is essential. The rise of Large Multimodal Models (LMMs) trained on large-scale visual and textual data has demonstrated impressive capabilities. We explore these models' potential to understand to what extent the concepts of visual interestingness are captured and examine the alignment between human assessments and GPT-4o's, a leading LMM, predictions through comparative analysis. Our studies reveal partial alignment between humans and GPT-4o. It already captures the concept as best compared to state-of-the-art methods. Hence, this allows for the effective labeling of image pairs according to their (commonly) interestingness, which are used as training data to distill the knowledge into a learning-to-rank model. The insights pave the way for a deeper understanding of human interest.

62. 【2510.13315】Self-Augmented Visual Contrastive Decoding

链接https://arxiv.org/abs/2510.13315

作者:Eun Woo Im,Muhammad Kashif Ali,Vivek Gupta

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Vision-Language Models, remarkable multimodal capabilities, demonstrated remarkable multimodal, underlying language models, Large Vision-Language

备注

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal capabilities, but they inherit the tendency to hallucinate from their underlying language models. While visual contrastive decoding has been proposed to mitigate this issue, existing methods often apply generic visual augmentations that disregard the specific context provided by the text query, limiting their effectiveness. This study introduces a novel training-free decoding strategy that addresses these limitations, featuring two key contributions. First, a self-augmentation prompting strategy that leverages the intrinsic knowledge of the model to dynamically align semantics between the query and the visual augmentation. Second, an adaptive thresholding algorithm that adaptively adjusts next token candidate size based on the output sparsity, utilizing full information from the logit distribution. Extensive experiments across four LVLMs and seven benchmarks demonstrate that the proposed decoding significantly enhances factual consistency compared to state-of-the-art decoding methods. This work highlights the importance of integrating query-dependent augmentation and entropy-aware decoding for improving effective generation of LVLMs.

63. 【2510.13310】InstantSfM: Fully Sparse and Parallel Structure-from-Motion

链接https://arxiv.org/abs/2510.13310

作者:Jiankun Zhong,Zitong Zhan,Quankai Gao,Ziyu Chen,Haozhe Lou,Jiageng Mao,Ulrich Neumann,Yue Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recovers camera poses, recovers camera, camera poses, poses and scene, scene geometry

备注

点击查看摘要

Abstract:Structure-from-Motion (SfM), a method that recovers camera poses and scene geometry from uncalibrated images, is a central component in robotic reconstruction and simulation. Despite the state-of-the-art performance of traditional SfM methods such as COLMAP and its follow-up work, GLOMAP, naive CPU-specialized implementations of bundle adjustment (BA) or global positioning (GP) introduce significant computational overhead when handling large-scale scenarios, leading to a trade-off between accuracy and speed in SfM. Moreover, the blessing of efficient C++-based implementations in COLMAP and GLOMAP comes with the curse of limited flexibility, as they lack support for various external optimization options. On the other hand, while deep learning based SfM pipelines like VGGSfM and VGGT enable feed-forward 3D reconstruction, they are unable to scale to thousands of input views at once as GPU memory consumption increases sharply as the number of input views grows. In this paper, we unleash the full potential of GPU parallel computation to accelerate each critical stage of the standard SfM pipeline. Building upon recent advances in sparse-aware bundle adjustment optimization, our design extends these techniques to accelerate both BA and GP within a unified global SfM framework. Through extensive experiments on datasets of varying scales (e.g. 5000 images where VGGSfM and VGGT run out of memory), our method demonstrates up to about 40 times speedup over COLMAP while achieving consistently comparable or even improved reconstruction accuracy. Our project page can be found at this https URL.

64. 【2510.13307】Novel Class Discovery for Point Cloud Segmentation via Joint Learning of Causal Representation and Reasoning

链接https://arxiv.org/abs/2510.13307

作者:Yang Li,Aming Wu,Zihao Zhang,Yahong Han

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Class Discovery, point cloud representations, Point Cloud Segmentation, Point Cloud, causal

备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:In this paper, we focus on Novel Class Discovery for Point Cloud Segmentation (3D-NCD), aiming to learn a model that can segment unlabeled (novel) 3D classes using only the supervision from labeled (base) 3D classes. The key to this task is to setup the exact correlations between the point representations and their base class labels, as well as the representation correlations between the points from base and novel classes. A coarse or statistical correlation learning may lead to the confusion in novel class inference. lf we impose a causal relationship as a strong correlated constraint upon the learning process, the essential point cloud representations that accurately correspond to the classes should be uncovered. To this end, we introduce a structural causal model (SCM) to re-formalize the 3D-NCD problem and propose a new method, i.e., Joint Learning of Causal Representation and Reasoning. Specifically, we first analyze hidden confounders in the base class representations and the causal relationships between the base and novel classes through SCM. We devise a causal representation prototype that eliminates confounders to capture the causal representations of base classes. A graph structure is then used to model the causal relationships between the base classes' causal representation prototypes and the novel class prototypes, enabling causal reasoning from base to novel classes. Extensive experiments and visualization results on 3D and 2D NCD semantic segmentation demonstrate the superiorities of our method.

65. 【2510.13303】Automated document processing system for government agencies using DBNET++ and BART models

链接https://arxiv.org/abs/2510.13303

作者:Aya Kaysan Bahjat

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:detects textual content, Differentiable Binarization Network, predefined categories, automatic document classification, presented that detects

备注: 8 pages, 12 figures, article

点击查看摘要

Abstract:An automatic document classification system is presented that detects textual content in images and classifies documents into four predefined categories (Invoice, Report, Letter, and Form). The system supports both offline images (e.g., files on flash drives, HDDs, microSD) and real-time capture via connected cameras, and is designed to mitigate practical challenges such as variable illumination, arbitrary orientation, curved or partially occluded text, low resolution, and distant text. The pipeline comprises four stages: image capture and preprocessing, text detection [1] using a DBNet++ (Differentiable Binarization Network Plus) detector, and text classification [2] using a BART (Bidirectional and Auto-Regressive Transformers) classifier, all integrated within a user interface implemented in Python with PyQt5. The achieved results by the system for text detection in images were good at about 92.88% through 10 hours on Total-Text dataset that involve high resolution images simulate a various and very difficult challenges. The results indicate the proposed approach is effective for practical, mixed-source document categorization in unconstrained imaging scenarios.

66. 【2510.13282】Universal Image Restoration Pre-training via Masked Degradation Classification

链接https://arxiv.org/abs/2510.13282

作者:JiaKui Hu,Zhengjian Yao,Lujia Jin,Yinghao Chen,Yanye Lu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Classification Pre-Training method, Degradation Classification Pre-Training, Masked Degradation Classification, degradation types, conventional pre-training methods

备注

点击查看摘要

Abstract:This study introduces a Masked Degradation Classification Pre-Training method (MaskDCPT), designed to facilitate the classification of degradation types in input images, leading to comprehensive image restoration pre-training. Unlike conventional pre-training methods, MaskDCPT uses the degradation type of the image as an extremely weak supervision, while simultaneously leveraging the image reconstruction to enhance performance and robustness. MaskDCPT includes an encoder and two decoders: the encoder extracts features from the masked low-quality input image. The classification decoder uses these features to identify the degradation type, whereas the reconstruction decoder aims to reconstruct a corresponding high-quality image. This design allows the pre-training to benefit from both masked image modeling and contrastive learning, resulting in a generalized representation suited for restoration tasks. Benefit from the straightforward yet potent MaskDCPT, the pre-trained encoder can be used to address universal image restoration and achieve outstanding performance. Implementing MaskDCPT significantly improves performance for both convolution neural networks (CNNs) and Transformers, with a minimum increase in PSNR of 3.77 dB in the 5D all-in-one restoration task and a 34.8% reduction in PIQE compared to baseline in real-world degradation scenarios. It also emergences strong generalization to previously unseen degradation types and levels. In addition, we curate and release the UIR-2.5M dataset, which includes 2.5 million paired restoration samples across 19 degradation types and over 200 degradation levels, incorporating both synthetic and real-world data. The dataset, source code, and models are available at this https URL.

67. 【2510.13276】MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models

链接https://arxiv.org/abs/2510.13276

作者:Keyan Zhou,Zecheng Tang,Lingfeng Ming,Guanghao Zhou,Qiguang Chen,Dan Qiao,Zheming Yang,Libo Qin,Minghui Qiu,Juntao Li,Min Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:large vision language, vision language models, rapid advancement, advancement of large, large vision

备注

点击查看摘要

Abstract:The rapid advancement of large vision language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee the effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.

68. 【2510.13253】End-to-End Multi-Modal Diffusion Mamba

链接https://arxiv.org/abs/2510.13253

作者:Chunhao Lu,Qiang Lu,Meichen Dong,Jake Luo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:encoders and decoders, decoders to process, process input, input and output, Multi-modal Diffusion Mamba

备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM utilizes a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This innovative approach allows MDM to achieve superior performance when processing high-dimensional data, particularly in generating high-resolution images and extended text sequences simultaneously. Our evaluations in areas such as image generation, image captioning, visual question answering, text comprehension, and reasoning tasks demonstrate that MDM significantly outperforms existing end-to-end models (MonoFormer, LlamaGen, and Chameleon etc.) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral. Our results validate MDM's effectiveness in unifying multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.

69. 【2510.13251】Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

链接https://arxiv.org/abs/2510.13251

作者:Minji Kim,Taekyung Kim,Bohyung Han

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Video Large Language, Large Language Models, Large Language, video question answering, Video Large

备注: 23 pages, 28 figures, 8 tables

点击查看摘要

Abstract:Video Large Language Models (VideoLLMs) extend the capabilities of vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent advances in VideoLLMs, their internal mechanisms on where and how they extract and propagate video and textual information remain less explored. In this study, we investigate the internal information flow of VideoLLMs using mechanistic interpretability techniques. Our analysis reveals consistent patterns across diverse VideoQA tasks: (1) temporal reasoning in VideoLLMs initiates with active cross-frame interactions in early-to-middle layers, (2) followed by progressive video-language integration in middle layers. This is facilitated by alignment between video representations and linguistic embeddings containing temporal concepts. (3) Upon completion of this integration, the model is ready to generate correct answers in middle-to-late layers. (4) Based on our analysis, we show that VideoLLMs can retain their VideoQA performance by selecting these effective information pathways while suppressing a substantial amount of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT. These findings provide a blueprint on how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization. Our project page with the source code is available at this https URL

70. 【2510.13250】Real-Time Crowd Counting for Embedded Systems with Lightweight Architecture

链接https://arxiv.org/abs/2510.13250

作者:Zhiyuan Zhao,Yubin Wen,Siyu Yang,Lichen Ning,Yuandong Liu,Junyu Gao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:public safety management, Crowd counting, urban planning, intelligent security, public safety

备注

点击查看摘要

Abstract:Crowd counting is a task of estimating the number of the crowd through images, which is extremely valuable in the fields of intelligent security, urban planning, public safety management, and so on. However, the existing counting methods have some problems in practical application on embedded systems for these fields, such as excessive model parameters, abundant complex calculations, etc. The practical application of embedded systems requires the model to be real-time, which means that the model is fast enough. Considering the aforementioned problems, we design a super real-time model with a stem-encoder-decoder structure for crowd counting tasks, which achieves the fastest inference compared with state-of-the-arts. Firstly, large convolution kernels in the stem network are used to enlarge the receptive field, which effectively extracts detailed head information. Then, in the encoder part, we use conditional channel weighting and multi-branch local fusion block to merge multi-scale features with low computational consumption. This part is crucial to the super real-time performance of the model. Finally, the feature pyramid networks are added to the top of the encoder to alleviate its incomplete fusion problems. Experiments on three benchmarks show that our network is suitable for super real-time crowd counting on embedded systems, ensuring competitive accuracy. At the same time, the proposed network reasoning speed is the fastest. Specifically, the proposed network achieves 381.7 FPS on NVIDIA GTX 1080Ti and 71.9 FPS on NVIDIA Jetson TX1.

71. 【2510.13245】CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation

链接https://arxiv.org/abs/2510.13245

作者:Li Liang,Bo Miao,Xinyu Wang,Naveed Akhtar,Jordan Vice,Ajmal Mian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:semantically rich environments, generation produces realistic, autonomous driving, scene generation produces, produces realistic

备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo-labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff) that significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be available at this https URL

72. 【2510.13243】FlyAwareV2: A Multimodal Cross-Domain UAV Dataset for Urban Scene Understanding

链接https://arxiv.org/abs/2510.13243

作者:Francesco Barbato,Matteo Caligiuri,Pietro Zanuttigh

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Unmanned Aerial Vehicle, Aerial Vehicle, Unmanned Aerial, computer vision algorithms, environments heavily relies

备注: 20 pages, 7 figures, 10 tables, data and code available

点击查看摘要

Abstract:The development of computer vision algorithms for Unmanned Aerial Vehicle (UAV) applications in urban environments heavily relies on the availability of large-scale datasets with accurate annotations. However, collecting and annotating real-world UAV data is extremely challenging and costly. To address this limitation, we present FlyAwareV2, a novel multimodal dataset encompassing both real and synthetic UAV imagery tailored for urban scene understanding tasks. Building upon the recently introduced SynDrone and FlyAware datasets, FlyAwareV2 introduces several new key contributions: 1) Multimodal data (RGB, depth, semantic labels) across diverse environmental conditions including varying weather and daytime; 2) Depth maps for real samples computed via state-of-the-art monocular depth estimation; 3) Benchmarks for RGB and multimodal semantic segmentation on standard architectures; 4) Studies on synthetic-to-real domain adaptation to assess the generalization capabilities of models trained on the synthetic data. With its rich set of annotations and environmental diversity, FlyAwareV2 provides a valuable resource for research on UAV-based 3D urban scene understanding.

73. 【2510.13237】Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models

链接https://arxiv.org/abs/2510.13237

作者:Haochuan Xu,Yun Sing Koh,Shuhuai Huang,Zirun Zhou,Di Wang,Jun Sakuma,Jingfeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:complex physical robot, natural language instructions, execute complex physical, achieved revolutionary progress, physical robot tasks

备注

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have achieved revolutionary progress in robot learning, enabling robots to execute complex physical robot tasks from natural language instructions. Despite this progress, their adversarial robustness remains underexplored. In this work, we propose both adversarial patch attack and corresponding defense strategies for VLA models. We first introduce the Embedding Disruption Patch Attack (EDPA), a model-agnostic adversarial attack that generates patches directly placeable within the camera's view. In comparison to prior methods, EDPA can be readily applied to different VLA models without requiring prior knowledge of the model architecture, or the controlled robotic manipulator. EDPA constructs these patches by (i) disrupting the semantic alignment between visual and textual latent representations, and (ii) maximizing the discrepancy of latent representations between adversarial and corresponding clean visual inputs. Through the optimization of these objectives, EDPA distorts the VLA's interpretation of visual information, causing the model to repeatedly generate incorrect actions and ultimately result in failure to complete the given robotic task. To counter this, we propose an adversarial fine-tuning scheme for the visual encoder, in which the encoder is optimized to produce similar latent representations for both clean and adversarially perturbed visual inputs. Extensive evaluations on the widely recognized LIBERO robotic simulation benchmark demonstrate that EDPA substantially increases the task failure rate of cutting-edge VLA models, while our proposed defense effectively mitigates this degradation. The codebase is accessible via the homepage at this https URL.

74. 【2510.13235】EPIPTrack: Rethinking Prompt Modeling with Explicit and Implicit Prompts for Multi-Object Tracking

链接https://arxiv.org/abs/2510.13235

作者:Yukuan Zhang,Jiarui Zhao,Shangqing Nie,Jin Kuang,Shengsheng Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:shown strong potential, enhancing target perception, Multimodal semantic cues, shown strong, strong potential

备注

点击查看摘要

Abstract:Multimodal semantic cues, such as textual descriptions, have shown strong potential in enhancing target perception for tracking. However, existing methods rely on static textual descriptions from large language models, which lack adaptability to real-time target state changes and prone to hallucinations. To address these challenges, we propose a unified multimodal vision-language tracking framework, named EPIPTrack, which leverages explicit and implicit prompts for dynamic target modeling and semantic alignment. Specifically, explicit prompts transform spatial motion information into natural language descriptions to provide spatiotemporal guidance. Implicit prompts combine pseudo-words with learnable descriptors to construct individualized knowledge representations capturing appearance attributes. Both prompts undergo dynamic adjustment via the CLIP text encoder to respond to changes in target state. Furthermore, we design a Discriminative Feature Augmentor to enhance visual and cross-modal representations. Extensive experiments on MOT17, MOT20, and DanceTrack demonstrate that EPIPTrack outperforms existing trackers in diverse scenarios, exhibiting robust adaptability and superior performance.

75. 【2510.13234】UniVector: Unified Vector Extraction via Instance-Geometry Interaction

链接https://arxiv.org/abs/2510.13234

作者:Yinglong Yan,Jun Yue,Shaobo Xia,Hanmeng Sun,Tianxu Ying,Chengcheng Wu,Sifan Lan,Min He,Pedram Ghamisi,Leyuan Fang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:offering high-fidelity representation, Vector extraction retrieves, extraction retrieves structured, raster images, offering high-fidelity

备注

点击查看摘要

Abstract:Vector extraction retrieves structured vector geometry from raster images, offering high-fidelity representation and broad applicability. Existing methods, however, are usually tailored to a single vector type (e.g., polygons, polylines, line segments), requiring separate models for different structures. This stems from treating instance attributes (category, structure) and geometric attributes (point coordinates, connections) independently, limiting the ability to capture complex structures. Inspired by the human brain's simultaneous use of semantic and spatial interactions in visual perception, we propose UniVector, a unified VE framework that leverages instance-geometry interaction to extract multiple vector types within a single model. UniVector encodes vectors as structured queries containing both instance- and geometry-level information, and iteratively updates them through an interaction module for cross-level context exchange. A dynamic shape constraint further refines global structures and key points. To benchmark multi-structure scenarios, we introduce the Multi-Vector dataset with diverse polygons, polylines, and line segments. Experiments show UniVector sets a new state of the art on both single- and multi-structure VE tasks. Code and dataset will be released at this https URL.

76. 【2510.13232】What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging

链接https://arxiv.org/abs/2510.13232

作者:Inha Kang,Youngsun Lim,Seonho Lee,Jiho Choi,Junsuk Choe,Hyunjung Shim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:critical failure, affirmative bias, negation, vision-language models, girl

备注: 38 pages

点击查看摘要

Abstract:State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe fundamentally addresses the structural loss of negation cues in tokenization, grouping them with attributes into coherent semantic phrases. It maintains correct polarity at the input level, enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and "girl" as simply "girl", NegToMe binds them into a single token whose meaning is correctly distinguished from that of "girl" alone. This module is integrated with a parameter-efficient and strategic LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs. This work marks a crucial step forward in addressing negation understanding for real-world detection applications.

77. 【2510.13226】Sample-Centric Multi-Task Learning for Detection and Segmentation of Industrial Surface Defects

链接https://arxiv.org/abs/2510.13226

作者:Hang-Cheng Dong,Yibo Jiao,Fupeng Wei,Guodong Liu,Dong Ye,Bingguo Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Industrial surface defect, Industrial surface, surface defect inspection, sample-wise quality control, inspection for sample-wise

备注

点击查看摘要

Abstract:Industrial surface defect inspection for sample-wise quality control (QC) must simultaneously decide whether a given sample contains defects and localize those defects spatially. In real production lines, extreme foreground-background imbalance, defect sparsity with a long-tailed scale distribution, and low contrast are common. As a result, pixel-centric training and evaluation are easily dominated by large homogeneous regions, making it difficult to drive models to attend to small or low-contrast defects-one of the main bottlenecks for deployment. Empirically, existing models achieve strong pixel-overlap metrics (e.g., mIoU) but exhibit insufficient stability at the sample level, especially for sparse or slender defects. The root cause is a mismatch between the optimization objective and the granularity of QC decisions. To address this, we propose a sample-centric multi-task learning framework and evaluation suite. Built on a shared-encoder architecture, the method jointly learns sample-level defect classification and pixel-level mask localization. Sample-level supervision modulates the feature distribution and, at the gradient level, continually boosts recall for small and low-contrast defects, while the segmentation branch preserves boundary and shape details to enhance per-sample decision stability and reduce misses. For evaluation, we propose decision-linked metrics, Seg_mIoU and Seg_Recall, which remove the bias of classical mIoU caused by empty or true-negative samples and tightly couple localization quality with sample-level decisions. Experiments on two benchmark datasets demonstrate that our approach substantially improves the reliability of sample-level decisions and the completeness of defect localization.

78. 【2510.13219】Prompt-based Adaptation in Large-scale Vision Models: A Survey

链接https://arxiv.org/abs/2510.13219

作者:Xi Xiao,Yunbei Zhang,Lin Zhao,Yiyang Liu,Xiaoying Liao,Zheda Mai,Xingjian Li,Xiao Wang,Hao Xu,Jihun Hamm,Xue Lin,Min Xu,Qifan Wang,Tianyang Wang,Cheng Han

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visual Prompt Tuning, adapting large-scale vision, large-scale vision models, Visual Prompting, Visual Prompt

备注

点击查看摘要

Abstract:In computer vision, Visual Prompting (VP) and Visual Prompt Tuning (VPT) have recently emerged as lightweight and effective alternatives to full fine-tuning for adapting large-scale vision models within the ``pretrain-then-finetune'' paradigm. However, despite rapid progress, their conceptual boundaries remain blurred, as VP and VPT are frequently used interchangeably in current research, reflecting a lack of systematic distinction between these techniques and their respective applications. In this survey, we revisit the designs of VP and VPT from first principles, and conceptualize them within a unified framework termed Prompt-based Adaptation (PA). We provide a taxonomy that categorizes existing methods into learnable, generative, and non-learnable prompts, and further organizes them by injection granularity -- pixel-level and token-level. Beyond the core methodologies, we examine PA's integrations across diverse domains, including medical imaging, 3D point clouds, and vision-language tasks, as well as its role in test-time adaptation and trustworthy AI. We also summarize current benchmarks and identify key challenges and future directions. To the best of our knowledge, we are the first comprehensive survey dedicated to PA's methodologies and applications in light of their distinct characteristics. Our survey aims to provide a clear roadmap for researchers and practitioners in all area to understand and explore the evolving landscape of PA-related research.

79. 【2510.13208】MimicParts: Part-aware Style Injection for Speech-Driven 3D Motion Generation

链接https://arxiv.org/abs/2510.13208

作者:Lianlian Liu,YongKang He,Zhaojie Chu,Xiaofen Xing,Xiangmin Xu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:signals presents substantial, presents substantial challenges, speech signals presents, Generating stylized, speech signals

备注

点击查看摘要

Abstract:Generating stylized 3D human motion from speech signals presents substantial challenges, primarily due to the intricate and fine-grained relationships among speech signals, individual styles, and the corresponding body movements. Current style encoding approaches either oversimplify stylistic diversity or ignore regional motion style differences (e.g., upper vs. lower body), limiting motion realism. Additionally, motion style should dynamically adapt to changes in speech rhythm and emotion, but existing methods often overlook this. To address these issues, we propose MimicParts, a novel framework designed to enhance stylized motion generation based on part-aware style injection and part-aware denoising network. It divides the body into different regions to encode localized motion styles, enabling the model to capture fine-grained regional differences. Furthermore, our part-aware attention block allows rhythm and emotion cues to guide each body region precisely, ensuring that the generated motion aligns with variations in speech rhythm and emotional state. Experimental results show that our method outperforming existing methods showcasing naturalness and expressive 3D human motion sequences.

80. 【2510.13201】Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences

链接https://arxiv.org/abs/2510.13201

作者:Jing Yang,Qiyao Wei,Jiaxin Pei

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG)

关键词:heavy reviewer workloads, inconsistent evaluation standards, expertise mismatches, leading to heavy, reviewer workloads

备注

点击查看摘要

Abstract:The rapid growth of AI conferences is straining an already fragile peer-review system, leading to heavy reviewer workloads, expertise mismatches, inconsistent evaluation standards, superficial or templated reviews, and limited accountability under compressed timelines. In response, conference organizers have introduced new policies and interventions to preserve review standards. Yet these ad-hoc changes often create further concerns and confusion about the review process, leaving how papers are ultimately accepted - and how practices evolve across years - largely opaque. We present Paper Copilot, a system that creates durable digital archives of peer reviews across a wide range of computer-science venues, an open dataset that enables researchers to study peer review at scale, and a large-scale empirical analysis of ICLR reviews spanning multiple years. By releasing both the infrastructure and the dataset, Paper Copilot supports reproducible research on the evolution of peer review. We hope these resources help the community track changes, diagnose failure modes, and inform evidence-based improvements toward a more robust, transparent, and reliable peer-review system.

81. 【2510.13198】Complementary Information Guided Occupancy Prediction via Multi-Level Representation Fusion

链接https://arxiv.org/abs/2510.13198

作者:Rongtao Xu,Jinzhou Lin,Jialei Zhou,Jiahua Dong,Changwei Wang,Ruisheng Wang,Li Guo,Shibiao Xu,Xiaodan Liang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Camera-based occupancy prediction, Camera-based occupancy, perception in autonomous, autonomous driving, aiming to infer

备注

点击查看摘要

Abstract:Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantics from 2D images. Almost existing methods focus on improving performance through structural modifications, such as lightweight backbones and complex cascaded frameworks, with good yet limited performance. Few studies explore from the perspective of representation fusion, leaving the rich diversity of features in 2D images underutilized. Motivated by this, we propose \textbf{CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. \textbf{CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism to fuse these three multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark. The code is provided in the supplementary material and will be released this https URL

82. 【2510.13186】STT-GS: Sample-Then-Transmit Edge Gaussian Splatting with Joint Client Selection and Power Control

链接https://arxiv.org/abs/2510.13186

作者:Zhen Li,Xibin Jin,Guoliang Li,Shuai Wang,Miaowen Wen,Huseyin Arslan,Derrick Wing Kwan Ng,Chengzhong Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Edge Gaussian splatting, Gaussian splatting, Edge Gaussian, edge server, scene reconstruction

备注

点击查看摘要

Abstract:Edge Gaussian splatting (EGS), which aggregates data from distributed clients and trains a global GS model at the edge server, is an emerging paradigm for scene reconstruction. Unlike traditional edge resource management methods that emphasize communication throughput or general-purpose learning performance, EGS explicitly aims to maximize the GS qualities, rendering existing approaches inapplicable. To address this problem, this paper formulates a novel GS-oriented objective function that distinguishes the heterogeneous view contributions of different clients. However, evaluating this function in turn requires clients' images, leading to a causality dilemma. To this end, this paper further proposes a sample-then-transmit EGS (or STT-GS for short) strategy, which first samples a subset of images as pilot data from each client for loss prediction. Based on the first-stage evaluation, communication resources are then prioritized towards more valuable clients. To achieve efficient sampling, a feature-domain clustering (FDC) scheme is proposed to select the most representative data and pilot transmission time minimization (PTTM) is adopted to reduce the pilot this http URL, we develop a joint client selection and power control (JCSPC) framework to maximize the GS-oriented function under communication resource constraints. Despite the nonconvexity of the problem, we propose a low-complexity efficient solution based on the penalty alternating majorization minimization (PAMM) algorithm. Experiments unveil that the proposed scheme significantly outperforms existing benchmarks on real-world datasets. It is found that the GS-oriented objective can be accurately predicted with low sampling ratios (e.g.,10%), and our method achieves an excellent tradeoff between view contributions and communication costs.

83. 【2510.13160】DP-TTA: Test-time Adaptation for Transient Electromagnetic Signal Denoising via Dictionary-driven Prior Regularization

链接https://arxiv.org/abs/2510.13160

作者:Meng Yang,Kecheng Chen,Wei Luo,Xianjie Chen,Yong Jia,Mingyue Wang,Fanqiang Lin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Transient Electromagnetic, providing valuable insights, geophysical applications, providing valuable, subsurface properties

备注

点击查看摘要

Abstract:Transient Electromagnetic (TEM) method is widely used in various geophysical applications, providing valuable insights into subsurface properties. However, time-domain TEM signals are often submerged in various types of noise. While recent deep learning-based denoising models have shown strong performance, these models are mostly trained on simulated or single real-world scenario data, overlooking the significant differences in noise characteristics from different geographical regions. Intuitively, models trained in one environment often struggle to perform well in new settings due to differences in geological conditions, equipment, and external interference, leading to reduced denoising performance. To this end, we propose the Dictionary-driven Prior Regularization Test-time Adaptation (DP-TTA). Our key insight is that TEM signals possess intrinsic physical characteristics, such as exponential decay and smoothness, which remain consistent across different regions regardless of external conditions. These intrinsic characteristics serve as ideal prior knowledge for guiding the TTA strategy, which helps the pre-trained model dynamically adjust parameters by utilizing self-supervised losses, improving denoising performance in new scenarios. To implement this, we customized a network, named DTEMDNet. Specifically, we first use dictionary learning to encode these intrinsic characteristics as a dictionary-driven prior, which is integrated into the model during training. At the testing stage, this prior guides the model to adapt dynamically to new environments by minimizing self-supervised losses derived from the dictionary-driven consistency and the signal one-order variation. Extensive experimental results demonstrate that the proposed method achieves much better performance than existing TEM denoising methods and TTA methods.

84. 【2510.13151】Foveation Improves Payload Capacity in Steganography

链接https://arxiv.org/abs/2510.13151

作者:Lifeng Qiu Lin,Henry Kam,Qi Sun,Kaan Akşit

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:metadata and watermarking, providing metadata, Steganography finds, visual medium, efficient latent representations

备注: SIGGRAPH Asia 2025 Posters Proceedings

点击查看摘要

Abstract:Steganography finds its use in visual medium such as providing metadata and watermarking. With support of efficient latent representations and foveated rendering, we trained models that improve existing capacity limits from 100 to 500 bits, while achieving better accuracy of up to 1 failure bit out of 2000, at 200K test bits. Finally, we achieve a comparable visual quality of 31.47 dB PSNR and 0.13 LPIPS, showing the effectiveness of novel perceptual design in creating multi-modal latent representations in steganography.

85. 【2510.13137】Real-Time Sign Language to text Translation using Deep Learning: A Comparative study of LSTM and 3D CNN

链接https://arxiv.org/abs/2510.13137

作者:Madhumati Pol,Anvay Anturkar,Anushka Khot,Ayush Andure,Aniruddha Ghosh,Anvit Magadum,Anvay Bahadur

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Convolutional Neural Networks, Long Short-Term Memory, American Sign Language, Convolutional Neural, Neural Networks

备注

点击查看摘要

Abstract:This study investigates the performance of 3D Convolutional Neural Networks (3D CNNs) and Long Short-Term Memory (LSTM) networks for real-time American Sign Language (ASL) recognition. Though 3D CNNs are good at spatiotemporal feature extraction from video sequences, LSTMs are optimized for modeling temporal dependencies in sequential data. We evaluate both architectures on a dataset containing 1,200 ASL signs across 50 classes, comparing their accuracy, computational efficiency, and latency under similar training conditions. Experimental results demonstrate that 3D CNNs achieve 92.4% recognition accuracy but require 3.2% more processing time per frame compared to LSTMs, which maintain 86.7% accuracy with significantly lower resource consumption. The hybrid 3D CNNLSTM model shows decent performance, which suggests that context-dependent architecture selection is crucial for practical this http URL project provides professional benchmarks for developing assistive technologies, highlighting trade-offs between recognition precision and real-time operational requirements in edge computing environments.

86. 【2510.13131】OS-HGAdapter: Open Semantic Hypergraph Adapter for Large Language Models Assisted Entropy-Enhanced Image-Text Alignment

链接https://arxiv.org/abs/2510.13131

作者:Rongjun Chen,Chengsi Yao,Jinchang Ren,Xianxian Zeng,Peixian Wang,Jun Yuan,Jiawen Li,Huimin Zhao,Xu Lu

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:Text-image alignment constitutes, multimedia content understanding, semantic correspondences critically, correspondences critically enhances, critically enhances retrieval

备注

点击查看摘要

Abstract:Text-image alignment constitutes a foundational challenge in multimedia content understanding, where effective modeling of cross-modal semantic correspondences critically enhances retrieval system performance through joint embedding space optimization. Given the inherent difference in information entropy between texts and images, conventional approaches often show an imbalance in the mutual retrieval of these two modalities. To address this particular challenge, we propose to use the open semantic knowledge of Large Language Model (LLM) to fill for the entropy gap and reproduce the alignment ability of humans in these tasks. Our entropy-enhancing alignment is achieved through a two-step process: 1) a new prompt template that does not rely on explicit knowledge in the task domain is designed to use LLM to enhance the polysemy description of the text modality. By analogy, the information entropy of the text modality relative to the visual modality is increased; 2) A hypergraph adapter is used to construct multilateral connections between the text and image modalities, which can correct the positive and negative matching errors for synonymous semantics in the same fixed embedding space, whilst reducing the noise caused by open semantic entropy by mapping the reduced dimensions back to the original dimensions. Comprehensive evaluations on the Flickr30K and MS-COCO benchmarks validate the superiority of our Open Semantic Hypergraph Adapter (OS-HGAdapter), showcasing 16.8\% (text-to-image) and 40.1\% (image-to-text) cross-modal retrieval gains over existing methods while establishing new state-of-the-art performance in semantic alignment tasks.

87. 【2510.13109】VPREG: An Optimal Control Formulation for Diffeomorphic Image Registration Based on the Variational Principle Grid Generation Method

链接https://arxiv.org/abs/2510.13109

作者:Zicong Zhou,Baihan Zhao,Andreas Mang,Guojun Liao

类目:Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

关键词:diffeomorphic image registration, paper introduces VPreg, diffeomorphic image, paper introduces, image registration

备注: 30 pages, 9 figures

点击查看摘要

Abstract:This paper introduces VPreg, a novel diffeomorphic image registration method. This work provides several improvements to our past work on mesh generation and diffeomorphic image registration. VPreg aims to achieve excellent registration accuracy while controlling the quality of the registration transformations. It ensures a positive Jacobian determinant of the spatial transformation and provides an accurate approximation of the inverse of the registration, a crucial property for many neuroimaging workflows. Unlike conventional methods, VPreg generates this inverse transformation within the group of diffeomorphisms rather than operating on the image space. The core of VPreg is a grid generation approach, referred to as \emph{Variational Principle} (VP), which constructs non-folding grids with prescribed Jacobian determinant and curl. These VP-generated grids guarantee diffeomorphic spatial transformations essential for computational anatomy and morphometry, and provide a more accurate inverse than existing methods. To assess the potential of the proposed approach, we conduct a performance analysis for 150 registrations of brain scans from the OASIS-1 dataset. Performance evaluation based on Dice scores for 35 regions of interest, along with an empirical analysis of the properties of the computed spatial transformations, demonstrates that VPreg outperforms state-of-the-art methods in terms of Dice scores, regularity properties of the computed transformation, and accuracy and consistency of the provided inverse map. We compare our results to ANTs-SyN, Freesurfer-Easyreg, and FSL-Fnirt.

88. 【2510.13108】DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models

链接https://arxiv.org/abs/2510.13108

作者:Jingyu Song,Zhenxin Li,Shiyi Lan,Xinglong Sun,Nadine Chang,Maying Shen,Joshua Chen,Katherine A. Skinner,Jose M. Alvarez

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词:Extended Predictive Driver, Driver Model Score, Predictive Driver Model, Extended Predictive, Predictive Driver

备注: 9 pages, 3 figures

点击查看摘要

Abstract:Benchmarking autonomous driving planners to align with human judgment remains a critical challenge, as state-of-the-art metrics like the Extended Predictive Driver Model Score (EPDMS) lack context awareness in nuanced scenarios. To address this, we introduce DriveCritic, a novel framework featuring two key contributions: the DriveCritic dataset, a curated collection of challenging scenarios where context is critical for correct judgment and annotated with pairwise human preferences, and the DriveCritic model, a Vision-Language Model (VLM) based evaluator. Fine-tuned using a two-stage supervised and reinforcement learning pipeline, the DriveCritic model learns to adjudicate between trajectory pairs by integrating visual and symbolic context. Experiments show DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness. Overall, our work provides a more reliable, human-aligned foundation to evaluating autonomous driving systems.

89. 【2510.13105】EgoSocial: Benchmarking Proactive Intervention Ability of Omnimodal LLMs via Egocentric Social Interaction Perception

链接https://arxiv.org/abs/2510.13105

作者:Xijun Wang,Tanay Sharma,Achin Kulshrestha,Abhimitra Meka,Aveek Purohit,Dinesh Manocha

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:understands human social, daily life, technologies become integral, integral to daily, understands human

备注

点击查看摘要

Abstract:As AR/VR technologies become integral to daily life, there's a growing need for AI that understands human social dynamics from an egocentric perspective. However, current LLMs often lack the social awareness to discern when to intervene as AI assistant. This leads to constant, socially unaware responses that may disrupt natural conversation and negatively impact user focus. To address these limitations, we introduce EgoSocial, a large-scale egocentric dataset with 13,500 social video-question pairs, specifically designed to benchmark intervention in social interaction perception. We also present an in-depth analysis of current omnimodal LLMs (OLLMs) to assess their effectiveness in detecting diverse social contextual cues. Experiments show that OLLMs still struggle to detect the intervention timing (14.4% for Gemini 2.5 Pro). We also propose EgoSoD (EgoSocial Detection), an end-to-end method for robustly discerning social dynamics. Informed by our OLLM analysis, EgoSoD integrates multimodal contextual cues (e.g., audio and visual cues) into a social thinking graph, dynamically modeling participants and interactions. Our method proactively detects intervention timing and social interactions, precisely determining when to intervene. Our EgoSoD improves Phi-4 by 45.6% and Gemini 2.5 Pro by 9.9% on Intervention Timing performance, and improves Phi-4 by 20.4% and Gemini 2.5 Pro by 6.9% on overall Social Interaction performance. We will release the dataset and code soon.

90. 【2510.13084】Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation

链接https://arxiv.org/abs/2510.13084

作者:Yi Zuo,Zitao Wang,Lingling Li,Xu Liu,Fang Liu,Licheng Jiao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recently demonstrated significant, demonstrated significant progress, Toggle, video editing, Code

备注: 32 pages, 11 figures

点击查看摘要

Abstract:Text-to-image (T2I) diffusion models have recently demonstrated significant progress in video editing. However, existing video editing methods are severely limited by their high computational overhead and memory consumption. Furthermore, these approaches often sacrifice visual fidelity, leading to undesirable temporal inconsistencies and artifacts such as blurring and pronounced mosaic-like patterns. We propose Edit-Your-Interest, a lightweight, text-driven, zero-shot video editing method. Edit-Your-Interest introduces a spatio-temporal feature memory to cache features from previous frames, significantly reducing computational overhead compared to full-sequence spatio-temporal modeling approaches. Specifically, we first introduce a Spatio-Temporal Feature Memory bank (SFM), which is designed to efficiently cache and retain the crucial image tokens processed by spatial attention. Second, we propose the Feature Most-Similar Propagation (FMP) method. FMP propagates the most relevant tokens from previous frames to subsequent ones, preserving temporal consistency. Finally, we introduce an SFM update algorithm that continuously refreshes the cached features, ensuring their long-term relevance and effectiveness throughout the video sequence. Furthermore, we leverage cross-attention maps to automatically extract masks for the instances of interest. These masks are seamlessly integrated into the diffusion denoising process, enabling fine-grained control over target objects and allowing Edit-Your-Interest to perform highly accurate edits while robustly preserving the background integrity. Extensive experiments decisively demonstrate that the proposed Edit-Your-Interest outperforms state-of-the-art methods in both efficiency and visual fidelity, validating its superior effectiveness and practicality.

Comments:
32 pages, 11 figures

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2510.13084 [cs.CV]

(or
arXiv:2510.13084v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.13084

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Yi Zuo [view email] [v1]
Wed, 15 Oct 2025 01:55:32 UTC (1,969 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation, by Yi Zuo and 4 other authorsView PDFHTML (experimental)TeX Source

view license

Current browse context: cs.CV

prev

|
next

new
|
recent
| 2025-10

Change to browse by:

cs

References Citations

NASA ADSGoogle Scholar
Semantic Scholar

export BibTeX citation
Loading…

BibTeX formatted citation

loading…

Data provided by:

Bookmark

checked=“checked”>
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

Links to Code Toggle

Papers with Code (What is Papers with Code?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)

Related Papers

Recommenders and Search Tools

Link to Influence Flower

Influence Flower (What are Influence Flowers?)

Core recommender toggle

CORE Recommender (What is CORE?)

Author
Venue
Institution
Topic

    About arXivLabs

arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.

Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)

mathjaxToggle();

About
Help

contact arXivClick here to contact arXiv
Contact

subscribe to arXiv mailingsClick here to subscribe
Subscribe

Copyright
Privacy Policy

Web Accessibility Assistance

arXiv Operational Status
Get status notifications via
email
or slack

91. 【2510.13080】Counting Hallucinations in Diffusion Models

链接https://arxiv.org/abs/2510.13080

作者:Shuai Fu,Jian Zhou,Qi Chen,Huang Jing,Huy Anh Nguyen,Xiaohan Liu,Zhixiong Zeng,Lin Ma,Quanshi Zhang,Qi Wu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:demonstrated remarkable progress, video synthesis, demonstrated remarkable, Diffusion probabilistic models, generative tasks

备注

点击查看摘要

Abstract:Diffusion probabilistic models (DPMs) have demonstrated remarkable progress in generative tasks, such as image and video synthesis. However, they still often produce hallucinated samples (hallucinations) that conflict with real-world knowledge, such as generating an implausible duplicate cup floating beside another cup. Despite their prevalence, the lack of feasible methodologies for systematically quantifying such hallucinations hinders progress in addressing this challenge and obscures potential pathways for designing next-generation generative models under factual constraints. In this work, we bridge this gap by focusing on a specific form of hallucination, which we term counting hallucination, referring to the generation of an incorrect number of instances or structured objects, such as a hand image with six fingers, despite such patterns being absent from the training data. To this end, we construct a dataset suite CountHalluSet, with well-defined counting criteria, comprising ToyShape, SimObject, and RealHand. Using these datasets, we develop a standardized evaluation protocol for quantifying counting hallucinations, and systematically examine how different sampling conditions in DPMs, including solver type, ODE solver order, sampling steps, and initial noise, affect counting hallucination levels. Furthermore, we analyze their correlation with common evaluation metrics such as FID, revealing that this widely used image quality metric fails to capture counting hallucinations consistently. This work aims to take the first step toward systematically quantifying hallucinations in diffusion models and offer new insights into the investigation of hallucination phenomena in image generation.

92. 【2510.13075】Unsupervised Domain Adaptation via Content Alignment for Hippocampus Segmentation

链接https://arxiv.org/abs/2510.13075

作者:Hoda Kalabizadeh,Ludovica Griffanti,Pak-Hei Yeung,Ana I. L. Namburete,Nicola K. Dinsdale,Konstantinos Kamnitsas

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Deep learning models, population-dependent anatomical characteristics, Deep learning, anatomical characteristics, learning models

备注

点击查看摘要

Abstract:Deep learning models for medical image segmentation often struggle when deployed across different datasets due to domain shifts - variations in both image appearance, known as style, and population-dependent anatomical characteristics, referred to as content. This paper presents a novel unsupervised domain adaptation framework that directly addresses domain shifts encountered in cross-domain hippocampus segmentation from MRI, with specific emphasis on content variations. Our approach combines efficient style harmonisation through z-normalisation with a bidirectional deformable image registration (DIR) strategy. The DIR network is jointly trained with segmentation and discriminator networks to guide the registration with respect to a region of interest and generate anatomically plausible transformations that align source images to the target domain. We validate our approach through comprehensive evaluations on both a synthetic dataset using Morpho-MNIST (for controlled validation of core principles) and three MRI hippocampus datasets representing populations with varying degrees of atrophy. Across all experiments, our method outperforms existing baselines. For hippocampus segmentation, when transferring from young, healthy populations to clinical dementia patients, our framework achieves up to 15% relative improvement in Dice score compared to standard augmentation methods, with the largest gains observed in scenarios with substantial content shift. These results highlight the efficacy of our approach for accurate hippocampus segmentation across diverse populations.

93. 【2510.13067】Direction-aware multi-scale gradient loss for infrared and visible image fusion

链接https://arxiv.org/abs/2510.13067

作者:Kaixuan Yang,Wei Xiang,Zhenshuai Chen,Tong Jin,Yunpeng Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:visible image fusion, image fusion aims, co-registered source images, integrate complementary information, Infrared and visible

备注

点击查看摘要

Abstract:Infrared and visible image fusion aims to integrate complementary information from co-registered source images to produce a single, informative result. Most learning-based approaches train with a combination of structural similarity loss, intensity reconstruction loss, and a gradient-magnitude term. However, collapsing gradients to their magnitude removes directional information, yielding ambiguous supervision and suboptimal edge fidelity. We introduce a direction-aware, multi-scale gradient loss that supervises horizontal and vertical components separately and preserves their sign across scales. This axis-wise, sign-preserving objective provides clear directional guidance at both fine and coarse resolutions, promoting sharper, better-aligned edges and richer texture preservation without changing model architectures or training protocols. Experiments on open-source model and multiple public benchmarks demonstrate effectiveness of our approach.

94. 【2510.13063】rue Self-Supervised Novel View Synthesis is Transferable

链接https://arxiv.org/abs/2510.13063

作者:Thomas W. Mitchel,Hyunwoo Ryu,Vincent Sitzmann

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:pose representation extracted, view synthesis, key criterion, criterion for determining, representation extracted

备注

点击查看摘要

Abstract:In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: Whether any pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: The same set of poses lead to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme of the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry -- such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments, we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers, and show that latent poses are highly correlated with real-world poses through probing experiments.

95. 【2510.13046】One Dimensional CNN ECG Mamba for Multilabel Abnormality Classification in 12 Lead ECG

链接https://arxiv.org/abs/2510.13046

作者:Huawei Jiang,Husna Mutahira,Gan Huang,Mannan Saeed Muhammad

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Accurate detection, detection of cardiac, cardiac abnormalities, recordings is regarded, regarded as essential

备注: 6 Pages, 2 figures

点击查看摘要

Abstract:Accurate detection of cardiac abnormalities from electrocardiogram recordings is regarded as essential for clinical diagnostics and decision support. Traditional deep learning models such as residual networks and transformer architectures have been applied successfully to this task, but their performance has been limited when long sequential signals are processed. Recently, state space models have been introduced as an efficient alternative. In this study, a hybrid framework named One Dimensional Convolutional Neural Network Electrocardiogram Mamba is introduced, in which convolutional feature extraction is combined with Mamba, a selective state space model designed for effective sequence modeling. The model is built upon Vision Mamba, a bidirectional variant through which the representation of temporal dependencies in electrocardiogram data is enhanced. Comprehensive experiments on the PhysioNet Computing in Cardiology Challenges of 2020 and 2021 were conducted, and superior performance compared with existing methods was achieved. Specifically, the proposed model achieved substantially higher AUPRC and AUROC scores than those reported by the best previously published algorithms on twelve lead electrocardiograms. These results demonstrate the potential of Mamba-based architectures to advance reliable ECG classification. This capability supports early diagnosis and personalized treatment, while enhancing accessibility in telemedicine and resource-constrained healthcare systems.

96. 【2510.13044】SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion

链接https://arxiv.org/abs/2510.13044

作者:Jungbin Cho,Minsu Kim,Jisoo Kim,Ce Zheng,Laszlo A. Jeni,Ming-Hsuan Yang,Youngjae Yu,Seonjoo Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Human motion, inherently diverse, diverse and semantically, motion, scene

备注: 15 pages

点击查看摘要

Abstract:Human motion is inherently diverse and semantically rich, while also shaped by the surrounding scene. However, existing motion generation approaches address either motion semantics or scene-awareness in isolation, since constructing large-scale datasets with both rich text--motion coverage and precise scene interactions is extremely challenging. In this work, we introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models by leveraging disjoint scene--motion and text--motion datasets through two adaptation stages: inbetweening and scene-aware inbetweening. The key idea is to use motion inbetweening, learnable without text, as a proxy task to bridge two distinct datasets and thereby inject scene-awareness to text-to-motion models. In the first stage, we introduce keyframing layers that modulate motion latents for inbetweening while preserving the latent manifold. In the second stage, we add a scene-conditioning layer that injects scene geometry by adaptively querying local context through cross-attention. Experimental results show that SceneAdapt effectively injects scene awareness into text-to-motion models, and we further analyze the mechanisms through which this awareness emerges. Code and models will be released.

97. 【2510.13042】SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models

链接https://arxiv.org/abs/2510.13042

作者:Zhengxu Tang,Zizheng Wang,Luning Wang,Zitao Shuai,Chenhao Zhang,Siyu Qian,Yirui Wu,Bohao Wang,Haosong Rao,Zhenyu Yang,Chenwei Wu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:made significant progress, creating visually appealing, visually appealing videos, made significant, significant progress

备注

点击查看摘要

Abstract:Text-to-video (T2V) generation models have made significant progress in creating visually appealing videos. However, they struggle with generating coherent sequential narratives that require logical progression through multiple events. Existing T2V benchmarks primarily focus on visual quality metrics but fail to evaluate narrative coherence over extended sequences. To bridge this gap, we present SeqBench, a comprehensive benchmark for evaluating sequential narrative coherence in T2V generation. SeqBench includes a carefully designed dataset of 320 prompts spanning various narrative complexities, with 2,560 human-annotated videos generated from 8 state-of-the-art T2V models. Additionally, we design a Dynamic Temporal Graphs (DTG)-based automatic evaluation metric, which can efficiently capture long-range dependencies and temporal ordering while maintaining computational efficiency. Our DTG-based metric demonstrates a strong correlation with human annotations. Through systematic evaluation using SeqBench, we reveal critical limitations in current T2V models: failure to maintain consistent object states across multi-action sequences, physically implausible results in multi-object scenarios, and difficulties in preserving realistic timing and ordering relationships between sequential actions. SeqBench provides the first systematic framework for evaluating narrative coherence in T2V generation and offers concrete insights for improving sequential reasoning capabilities in future models. Please refer to this https URL for more details.

98. 【2510.13016】SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

链接https://arxiv.org/abs/2510.13016

作者:Tanveer Hannan,Shuaicong Wu,Mark Weber,Suprosanna Shit,Jindong Gu,Rajat Koner,Aljoša Ošep,Laura Leal-Taixé,Thomas Seidl

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:including embodied agents, autonomous platforms, next-generation AI systems, including embodied, embodied agents

备注

点击查看摘要

Abstract:Understanding fine-grained actions and accurately localizing their corresponding actors in space and time are fundamental capabilities for advancing next-generation AI systems, including embodied agents, autonomous platforms, and human-AI interaction frameworks. Despite recent progress in video understanding, existing methods predominantly address either coarse-grained action recognition or generic object tracking, thereby overlooking the challenge of jointly detecting and tracking multiple objects according to their actions while grounding them temporally. To address this gap, we introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos based on natural language descriptions of their actions. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs, covering a diverse range of objects, actions, and real-world scenes. We further propose SVAGFormer, a baseline framework that adapts state of the art vision language models for joint spatial and temporal grounding, and introduce SVAGEval, a standardized evaluation toolkit for fair and reproducible benchmarking. Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes, underscoring the need for more advanced reasoning over fine-grained object-action interactions in long videos.

99. 【2510.12992】UNCAP: Uncertainty-Guided Planning Using Natural Language Communication for Cooperative Autonomous Vehicles

链接https://arxiv.org/abs/2510.12992

作者:Neel P. Bhatt,Po-han Li,Kushagra Gupta,Rohan Siva,Daniel Milan,Alexander T. Hogue,Sandeep P. Chinchali,David Fridovich-Keil,Zhangyang Wang,Ufuk Topcu

类目:Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

关键词:Safe large-scale coordination, multiple cooperative connected, cooperative connected autonomous, efficient and interpretable, connected autonomous vehicles

备注

点击查看摘要

Abstract:Safe large-scale coordination of multiple cooperative connected autonomous vehicles (CAVs) hinges on communication that is both efficient and interpretable. Existing approaches either rely on transmitting high-bandwidth raw sensor data streams or neglect perception and planning uncertainties inherent in shared data, resulting in systems that are neither scalable nor safe. To address these limitations, we propose Uncertainty-Guided Natural Language Cooperative Autonomous Planning (UNCAP), a vision-language model-based planning approach that enables CAVs to communicate via lightweight natural language messages while explicitly accounting for perception uncertainty in decision-making. UNCAP features a two-stage communication protocol: (i) an ego CAV first identifies the subset of vehicles most relevant for information exchange, and (ii) the selected CAVs then transmit messages that quantitatively express their perception uncertainty. By selectively fusing messages that maximize mutual information, this strategy allows the ego vehicle to integrate only the most relevant signals into its decision-making, improving both the scalability and reliability of cooperative planning. Experiments across diverse driving scenarios show a 63% reduction in communication bandwidth with a 31% increase in driving safety score, a 61% reduction in decision uncertainty, and a four-fold increase in collision distance margin during near-miss events. Project website: this https URL

100. 【2510.12974】Scope: Selective Cross-modal Orchestration of Visual Perception Experts

链接https://arxiv.org/abs/2510.12974

作者:Tianyu Zhang,Suyuchen Wang,Chao Wang,Juan Rodriguez,Ahmed Masry,Xiangru Jian,Yoshua Bengio,Perouz Taslakian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multiplying inference costs, yields diminishing returns, multiple vision encoders, Vision-language models, benefit from multiple

备注: 14 pages, 2 figures

点击查看摘要

Abstract:Vision-language models (VLMs) benefit from multiple vision encoders, but naively stacking them yields diminishing returns while multiplying inference costs. We propose SCOPE, a Mixture-of-Encoders (MoEnc) framework that dynamically selects one specialized encoder per image-text pair via instance-level routing, unlike token-level routing in traditional MoE. SCOPE maintains a shared encoder and a pool of routed encoders. A lightweight router uses cross-attention between text prompts and shared visual features to select the optimal encoder from the routed encoders. To train this router, we introduce dual entropy regularization with auxiliary losses to balance dataset-level load distribution with instance-level routing confidence. Remarkably, SCOPE with one shared plus one routed encoder outperforms models using all four extra encoders simultaneously, while reducing compute by 24-49\%. This demonstrates that intelligent encoder selection beats brute-force aggregation, challenging the prevailing paradigm in multi-encoder VLMs.

101. 【2510.12954】CADE 2.5 - ZeResFDG: Frequency-Decoupled, Rescaled and Zero-Projected Guidance for SD/SDXL Latent Diffusion Models

链接https://arxiv.org/abs/2510.12954

作者:Denis Rychkovskiy(DZRobo, Independent Researcher),GPT-5(AI collaborator, OpenAI)

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Comfy Adaptive Detail, Adaptive Detail Enhancer, Comfy Adaptive, Detail Enhancer, Adaptive Detail

备注: 8 pages, 3 figures. Endorsed by Dr. Seyedmorteza Sadat (ETH Zurich). The work introduces CADE 2.5 with ZeResFDG as a practical inference-time guidance stack for SD/SDXL. Code and visual examples to be released on GitHub and Hugging Face

点击查看摘要

Abstract:We introduce CADE 2.5 (Comfy Adaptive Detail Enhancer), a sampler-level guidance stack for SD/SDXL latent diffusion models. The central module, ZeResFDG, unifies (i) frequency-decoupled guidance that reweights low- and high-frequency components of the guidance signal, (ii) energy rescaling that matches the per-sample magnitude of the guided prediction to the positive branch, and (iii) zero-projection that removes the component parallel to the unconditional direction. A lightweight spectral EMA with hysteresis switches between a conservative and a detail-seeking mode as structure crystallizes during sampling. Across SD/SDXL samplers, ZeResFDG improves sharpness, prompt adherence, and artifact control at moderate guidance scales without any retraining. In addition, we employ a training-free inference-time stabilizer, QSilk Micrograin Stabilizer (quantile clamp + depth/edge-gated micro-detail injection), which improves robustness and yields natural high-frequency micro-texture at high resolutions with negligible overhead. For completeness we note that the same rule is compatible with alternative parameterizations (e.g., velocity), which we briefly discuss in the Appendix; however, this paper focuses on SD/SDXL latent diffusion models.

102. 【2510.12953】Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation

链接https://arxiv.org/abs/2510.12953

作者:Xiao He,Huangxuan Zhao,Guojia Wan,Wei Zhou,Yanxing Liu,Juhua Liu,Yongchao Xu,Yong Luo,Dacheng Tao,Bo Du

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)

关键词:Recent medical vision-language, Recent medical, anomaly detection, shown promise, promise on tasks

备注

点击查看摘要

Abstract:Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: this https URL.

103. 【2510.12931】Unifying Vision-Language Latents for Zero-label Image Caption Enhancement

链接https://arxiv.org/abs/2510.12931

作者:Sanghyun Byun,Jung Ick Guack,Mohanad Odema,Baisub Lee,Jacob Song,Woo Seong Chung

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:achieve remarkable performance, large-scale image-text pretraining, achieve remarkable, image-text pretraining, Unified Vision-Language Alignment

备注: Accepted to PMLR and NeurIPS 2025 UniReps

点击查看摘要

Abstract:Vision-language models (VLMs) achieve remarkable performance through large-scale image-text pretraining. However, their reliance on labeled image datasets limits scalability and leaves vast amounts of unlabeled image data underutilized. To address this, we propose Unified Vision-Language Alignment for Zero-Label Enhancement (ViZer), an enhancement training framework that enables zero-label learning in image captioning, providing a practical starting point for broader zero-label adaptation in vision-language tasks. Unlike prior approaches that rely on human or synthetically annotated datasets, ViZer actively aligns vision and language representation features during training, enabling existing VLMs to generate improved captions without requiring text labels or full retraining. We demonstrate ViZer's advantage in qualitative evaluation, as automated caption metrics such as CIDEr and BERTScore often penalize details that are absent in reference captions. Applying ViZer on SmolVLM-Base and Qwen2-VL, we observe consistent qualitative improvements, producing captions that are more grounded and descriptive than their baseline.

104. 【2510.12909】Robust Plant Disease Diagnosis with Few Target-Domain Samples

链接https://arxiv.org/abs/2510.12909

作者:Takafumi Nogami,Satoshi Kagiwada,Hitoshi Iyatomi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieving impressive performance, deep learning-based systems, plant disease diagnosis, achieving impressive, convenient plant disease

备注: 7 pages, 2 figures. Accepted at the IEEE International Conference on Visual Communications and Image Processing (VCIP) 2025. Extended version

点击查看摘要

Abstract:Various deep learning-based systems have been proposed for accurate and convenient plant disease diagnosis, achieving impressive performance. However, recent studies show that these systems often fail to maintain diagnostic accuracy on images captured under different conditions from the training environment -- an essential criterion for model robustness. Many deep learning methods have shown high accuracy in plant disease diagnosis. However, they often struggle to generalize to images taken in conditions that differ from the training setting. This drop in performance stems from the subtle variability of disease symptoms and domain gaps -- differences in image context and environment. The root cause is the limited diversity of training data relative to task complexity, making even advanced models vulnerable in unseen domains. To tackle this challenge, we propose a simple yet highly adaptable learning framework called Target-Aware Metric Learning with Prioritized Sampling (TMPS), grounded in metric learning. TMPS operates under the assumption of access to a limited number of labeled samples from the target (deployment) domain and leverages these samples effectively to improve diagnostic robustness. We assess TMPS on a large-scale automated plant disease diagnostic task using a dataset comprising 223,073 leaf images sourced from 23 agricultural fields, spanning 21 diseases and healthy instances across three crop species. By incorporating just 10 target domain samples per disease into training, TMPS surpasses models trained using the same combined source and target samples, and those fine-tuned with these target samples after pre-training on source data. It achieves average macro F1 score improvements of 7.3 and 3.6 points, respectively, and a remarkable 18.7 and 17.1 point improvement over the baseline and conventional metric learning.

105. 【2510.12904】State-Change Learning for Prediction of Future Events in Endoscopic Videos

链接https://arxiv.org/abs/2510.12904

作者:Saurav Sharma,Chinedu Innocent Nwoye,Didier Mutter,Nicolas Padoy

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:operating room safety, driven by real-time, safety and efficiency, real-time AI analysis, critical for operating

备注: 24 pages, 13 figures

点击查看摘要

Abstract:Surgical future prediction, driven by real-time AI analysis of surgical video, is critical for operating room safety and efficiency. It provides actionable insights into upcoming events, their timing, and risks-enabling better resource allocation, timely instrument readiness, and early warnings for complications (e.g., bleeding, bile duct injury). Despite this need, current surgical AI research focuses on understanding what is happening rather than predicting future events. Existing methods target specific tasks in isolation, lacking unified approaches that span both short-term (action triplets, events) and long-term horizons (remaining surgery duration, phase transitions). These methods rely on coarse-grained supervision while fine-grained surgical action triplets and steps remain underexplored. Furthermore, methods based only on future feature prediction struggle to generalize across different surgical contexts and procedures. We address these limits by reframing surgical future prediction as state-change learning. Rather than forecasting raw observations, our approach classifies state transitions between current and future timesteps. We introduce SurgFUTR, implementing this through a teacher-student architecture. Video clips are compressed into state representations via Sinkhorn-Knopp clustering; the teacher network learns from both current and future clips, while the student network predicts future states from current videos alone, guided by our Action Dynamics (ActDyn) module. We establish SFPBench with five prediction tasks spanning short-term (triplets, events) and long-term (remaining surgery duration, phase and step transitions) horizons. Experiments across four datasets and three procedures show consistent improvements. Cross-procedure transfer validates generalizability.

106. 【2510.12901】SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms

链接https://arxiv.org/abs/2510.12901

作者:Haithem Turki,Qi Wu,Xin Kang,Janick Martinez Esturo,Shengyu Huang,Ruilong Li,Zan Gojcic,Riccardo de Lutio

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:Rigorous testing, self-driving vehicles, essential to ensure, ensure their safety, camera models

备注: Project page: [this https URL](https://research.nvidia.com/labs/sil/projects/simuli)

点击查看摘要

Abstract:Rigorous testing of autonomous robots, such as self-driving vehicles, is essential to ensure their safety in real-world deployments. This requires building high-fidelity simulators to test scenarios beyond those that can be safely or exhaustively collected in the real-world. Existing neural rendering methods based on NeRF and 3DGS hold promise but suffer from low rendering speeds or can only render pinhole camera models, hindering their suitability to applications that commonly require high-distortion lenses and LiDAR data. Multi-sensor simulation poses additional challenges as existing methods handle cross-sensor inconsistencies by favoring the quality of one modality at the expense of others. To overcome these limitations, we propose SimULi, the first method capable of rendering arbitrary camera models and LiDAR data in real-time. Our method extends 3DGUT, which natively supports complex camera models, with LiDAR support, via an automated tiling strategy for arbitrary spinning LiDAR models and ray-based culling. To address cross-sensor inconsistencies, we design a factorized 3D Gaussian representation and anchoring strategy that reduces mean camera and depth error by up to 40% compared to existing methods. SimULi renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work (and handles a wider range of camera models). When evaluated on two widely benchmarked autonomous driving datasets, SimULi matches or exceeds the fidelity of existing state-of-the-art methods across numerous camera and LiDAR metrics.

107. 【2510.12866】Learning to Grasp Anything by Playing with Random Toys

链接https://arxiv.org/abs/2510.12866

作者:Dantong Niu,Yuvan Sharma,Baifeng Shi,Rachel Ding,Matteo Gioia,Haoru Xue,Henry Tsai,Konstantinos Kallidromitis,Anirudh Pai,Shankar Shastry,Trevor Darrell,Jitendra Malik,Roei Herzig

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Robotic manipulation policies, policies often struggle, struggle to generalize, manipulation policies, Robotic manipulation

备注

点击查看摘要

Abstract:Robotic manipulation policies often struggle to generalize to novel objects, limiting their real-world utility. In contrast, cognitive science suggests that children develop generalizable dexterous manipulation skills by mastering a small set of simple toys and then applying that knowledge to more complex items. Inspired by this, we study if similar generalization capabilities can also be achieved by robots. Our results indicate robots can learn generalizable grasping using randomly assembled objects that are composed from just four shape primitives: spheres, cuboids, cylinders, and rings. We show that training on these "toys" enables robust generalization to real-world objects, yielding strong zero-shot performance. Crucially, we find the key to this generalization is an object-centric visual representation induced by our proposed detection pooling mechanism. Evaluated in both simulation and on physical robots, our model achieves a 67% real-world grasping success rate on the YCB dataset, outperforming state-of-the-art approaches that rely on substantially more in-domain data. We further study how zero-shot generalization performance scales by varying the number and diversity of training toys and the demonstrations per toy. We believe this work offers a promising path to scalable and generalizable learning in robotic manipulation. Demonstration videos, code, checkpoints and our dataset are available on our project page: this https URL .

108. 【2510.12845】VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages

链接https://arxiv.org/abs/2510.12845

作者:Jesse Atuhurra,Iqra Ali,Tomoya Iwakura,Hidetaka Kamigaito,Tatsuya Hiraoka

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Vision Language Models, Vision Language, pivotal for advancing, advancing perception, predominantly English-centric benchmarks

备注

点击查看摘要

Abstract:Vision Language Models (VLMs) are pivotal for advancing perception in intelligent agents. Yet, evaluation of VLMs remains limited to predominantly English-centric benchmarks in which the image-text pairs comprise short texts. To evaluate VLM fine-grained abilities, in four languages under long-text settings, we introduce a novel multilingual benchmark VLURes featuring eight vision-and-language tasks, and a pioneering unrelatedness task, to probe the fine-grained Visual and Linguistic Understanding capabilities of VLMs across English, Japanese, and low-resource languages, Swahili, and Urdu. Our datasets, curated from web resources in the target language, encompass ten diverse image categories and rich textual context, introducing valuable vision-language resources for Swahili and Urdu. By prompting VLMs to generate responses and rationales, evaluated automatically and by native speakers, we uncover performance disparities across languages and tasks critical to intelligent agents, such as object recognition, scene understanding, and relationship understanding. We conducted evaluations of ten VLMs with VLURes. The best performing model, GPT-4o, achieves an overall accuracy of 90.8% and lags human performance by 6.7%, though the gap is larger for open-source models. The gap highlights VLURes' critical role in developing intelligent agents to tackle multi-modal visual reasoning.

109. 【2510.13714】Dedelayed: Deleting remote inference delay via on-device correction

链接https://arxiv.org/abs/2510.13714

作者:Dan Jacobellis,Mateen Ulhaq,Fabien Racapé,Hyomin Choi,Neeraja J. Yadwadkar

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:leverage powerful cloud, powerful cloud models, leverage powerful, powerful cloud, Remote inference

备注

点击查看摘要

Abstract:Remote inference allows lightweight devices to leverage powerful cloud models. However, communication network latency makes predictions stale and unsuitable for real-time tasks. To address this, we introduce Dedelayed, a delay-corrective method that mitigates arbitrary remote inference delays, allowing the local device to produce low-latency outputs in real time. Our method employs a lightweight local model that processes the current frame and fuses in features that a heavyweight remote model computes from past frames. On video from the BDD100K driving dataset, Dedelayed improves semantic segmentation accuracy over the stronger of the local-only and remote-only baselines across all realistic communication network delays beyond 33 ms. Without incurring additional delay, it improves accuracy by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference, for a round-trip delay of 100 ms. The advantage grows under longer delays and higher-motion scenes, as delay-mitigated split inference sustains accuracy more effectively, providing clear advantages for real-time tasks that must remain aligned with the current world state.

110. 【2510.13562】An efficient approach with theoretical guarantees to simultaneously reconstruct activity and attenuation sinogram for TOF-PET

链接https://arxiv.org/abs/2510.13562

作者:Liyang Hu,Chong Chen

类目:Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)

关键词:quantitatively accurate activity, perform attenuation correction, positron emission tomography, accurate activity map, attenuation correction factors

备注: 32 pages, 11 figures, 4 tables

点击查看摘要

Abstract:In positron emission tomography (PET), it is indispensable to perform attenuation correction in order to obtain the quantitatively accurate activity map (tracer distribution) in the body. Generally, this is carried out based on the estimated attenuation map obtained from computed tomography or magnetic resonance imaging. However, except for errors in the attenuation correction factors obtained, the additional scan not only brings in new radiation doses and/or increases the scanning time but also leads to severe misalignment induced by various motions during and between the two sequential scans. To address these issues, based on maximum likelihood estimation, we propose a new mathematical model for simultaneously reconstructing the activity and attenuation sinogram from the time-of-flight (TOF)-PET emission data only. Particularly, we make full use of the exclusively exponential form for the attenuation correction factors, and consider the constraint of a total amount of the activity in some mask region in the proposed model. Furthermore, we prove its well-posedness, including the existence, uniqueness and stability of the solution. We propose an alternating update algorithm to solve the model, and also analyze its convergence. Finally, numerical experiments with various TOF-PET emission data demonstrate that the proposed method is of numerical convergence and robust to noise, and outperforms some state-of-the-art methods in terms of accuracy and efficiency, and has the capability of autonomous attenuation correction.

111. 【2510.13441】Steerable Conditional Diffusion for Domain Adaptation in PET Image Reconstruction

链接https://arxiv.org/abs/2510.13441

作者:George Webber,Alexander Hammers,Andrew P. King,Andrew J. Reader

类目:Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:positron emission tomography, image training data, diffusion model, recently enabled, emission tomography

备注: Accepted for oral presentation at IEEE NSS MIC RTSD 2025 (submitted May 2025; accepted July 2025; to be presented Nov 2025)

点击查看摘要

Abstract:Diffusion models have recently enabled state-of-the-art reconstruction of positron emission tomography (PET) images while requiring only image training data. However, domain shift remains a key concern for clinical adoption: priors trained on images from one anatomy, acquisition protocol or pathology may produce artefacts on out-of-distribution data. We propose integrating steerable conditional diffusion (SCD) with our previously-introduced likelihood-scheduled diffusion (PET-LiSch) framework to improve the alignment of the diffusion model's prior to the target subject. At reconstruction time, for each diffusion step, we use low-rank adaptation (LoRA) to align the diffusion model prior with the target domain on the fly. Experiments on realistic synthetic 2D brain phantoms demonstrate that our approach suppresses hallucinated artefacts under domain shift, i.e. when our diffusion model is trained on perturbed images and tested on normal anatomy, our approach suppresses the hallucinated structure, outperforming both OSEM and diffusion model baselines qualitatively and quantitatively. These results provide a proof-of-concept that steerable priors can mitigate domain shift in diffusion-based PET reconstruction and motivate future evaluation on real data.