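This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

474 papers were updated today, including:

Natural language processing: 45 papers
Information retrieval: 9 papers
Computer vision: 120 papers

Natural Language Processing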

1. [2511.16671] Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

Link: https://arxiv.org/abs/2511.16671

Authors: Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, Pheng-Ann Heng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Recent advances, increasingly explored, explored the integration, textual reasoning, generation process

Comments: Project Page: [this https URL](https://think-while-gen.github.io) Code: [this https URL](https://github.com/ZiyuGuo99/Thinking-while-Generating)

Abstract:Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: this https URL.

2. [2511.16664] Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

Link: https://arxiv.org/abs/2511.16664

Authors: Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov

Categories: Computation and Language (cs.CL)

Keywords: requiring separate training, targeting multiple scales, separate training runs, large language models, language models targeting

Abstract:Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows having a many-in-one reasoning model that has constant deployment memory against the number of models in the family.

3. [2511.16654] Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems

Link: https://arxiv.org/abs/2511.16654

Authors: Elias Lumer, Alex Cardenas, Matt Melich, Myles Mason, Sara Dieter, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, Roberto Hernandez

Categories: Computation and Language (cs.CL)

Keywords: enabled Large Language, Large Language Models, Large Language, multimodal RAG systems, Recent advancements

Abstract:Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems: text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate both approaches across 6 LLMs and two multimodal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain (nDCG@5). These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
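The abstract above reports both absolute and relative gains in mAP@5 and nDCG@5. As a reading aid, here is a minimal Python sketch of how such metrics are commonly computed for a single query and how the two kinds of gain relate. The function names and toy relevance lists are illustrative assumptions, not the paper's evaluation code, and AP@k has several variants; this one normalizes by the number of relevant items found in the top k.

```python
import math

def average_precision_at_k(relevance, k):
    """AP@k for one query; `relevance` is a ranked list of 0/1 labels.
    Variant assumed here: normalize by the relevant items found in the top k."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / hits if hits else 0.0

def ndcg_at_k(relevance, k):
    """nDCG@k for one query with graded or binary relevance labels."""
    def dcg(rels):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal else 0.0

def implied_baseline(absolute_gain, relative_gain):
    """Baseline score implied by an absolute gain and a relative gain."""
    return absolute_gain / relative_gain

# Hypothetical ranked results for one query (1 = relevant document).
ranked = [1, 0, 1, 0, 0]
print(average_precision_at_k(ranked, 5))
print(ndcg_at_k(ranked, 5))
print(implied_baseline(0.13, 0.32))
```

If the 13-point absolute gain and the 32% relative gain describe the same pair of scores, the baseline mAP@5 would be roughly 0.13 / 0.32 ≈ 0.41. That figure is an inference from the reported gains, not a number stated in the abstract.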

4. [2511.16635] SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction

Link: https://arxiv.org/abs/2511.16635

Authors: Guolin Huang, Wenting Chen, Jiaqi Yang, Xinheng Lyu, Xiaoling Luo, Sen Yang, Xiaohan Xing, Linlin Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: existing methods lack, treatment planning, clinical adoption, critical for cancer, cancer prognosis

Comments: 20 pages

Abstract:Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superiority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.

5. [2511.16595] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

Link: https://arxiv.org/abs/2511.16595

Authors: Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: vision-language model designed, designed to tackle, tackle challenges, Processing long videos, long videos demands

Comments: Project page: [this https URL](https://xuboshen.github.io/TimeViper)

Abstract:We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.

6. [2511.16590] D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies

Link: https://arxiv.org/abs/2511.16590

Authors: Sen Chen, Tong Zhao, Yi Bin, Fei Ma, Wenqi Shao, Zheng Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Graphical User Interfaces, Artificial General Intelligence, Developing intelligent agents, User Interfaces, General Intelligence

Comments: Accepted to AAAI 2026

Abstract:Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. However, most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on the D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.

7. [2511.16577] Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation

Link: https://arxiv.org/abs/2511.16577

Authors: Kexin Zhao, Ken Forbus

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Word sense disambiguation, Word sense, fundamental challenge, natural language understanding, Word

Comments: 16 pages

Abstract:Word sense disambiguation is a fundamental challenge in natural language understanding. Current methods are primarily aimed at coarse-grained representations (e.g. WordNet synsets or FrameNet frames) and require hand-annotated training data to construct. This makes it difficult to automatically disambiguate richer representations (e.g. built on OpenCyc) that are needed for sophisticated inference. We propose a method that uses statistical language models as oracles for disambiguation that does not require any hand-annotation of training data. Instead, the multiple candidate meanings generated by a symbolic NLU system are converted into distinguishable natural language alternatives, which are used to query an LLM to select appropriate interpretations given the linguistic context. The selected meanings are propagated back to the symbolic NLU system. We evaluate our method against human-annotated gold answers to demonstrate its effectiveness.

8. [2511.16544] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

Link: https://arxiv.org/abs/2511.16544

Authors: Zachary Ellis, Jared Joselowitz, Yash Deo, Yajie He, Anna Kalygina, Aisling Higham, Mana Rahimzadeh, Yan Jia, Ibrahim Habli, Ernest Lim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Automatic Speech Recognition, Word Error Rate, Speech Recognition, Automatic Speech, Error Rate

Abstract:As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's $\kappa$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
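To make the two quantities in this abstract concrete, the following Python sketch computes word-level WER with a standard edit-distance recurrence and Cohen's kappa for two raters. The sentences and labels are hypothetical examples chosen for illustration and do not come from the paper's datasets; the function names are also assumptions.

```python
from collections import Counter

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two equally long label lists."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical utterances: both errors cost one substitution out of four words.
reference = "patient denies chest pain"
print(word_error_rate(reference, "patient denies chess pain"))
print(word_error_rate(reference, "patient has chest pain"))

# Hypothetical impact labels from a clinician and an automated judge.
clinician = ["No", "No", "Minimal", "Significant"]
judge = ["No", "Minimal", "Minimal", "Significant"]
print(cohens_kappa(clinician, judge))
```

Both hypothetical transcripts receive the same WER, yet only the second reverses the clinical meaning, which is the kind of gap the abstract describes.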

9. [2511.16543] The Oracle and The Prism: A Decoupled and Efficient Framework for Generative Recommendation Explanation

Link: https://arxiv.org/abs/2511.16543

Authors: Jiaheng Zhang, Daqiang Zhang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, integration of Large, Large Language, suboptimal compromises, systems often leads

Comments: 11 pages, 3 figures

Abstract:The integration of Large Language Models (LLMs) into explainable recommendation systems often leads to a performance-efficiency trade-off in end-to-end architectures, where joint optimization of ranking and explanation can result in suboptimal compromises. To resolve this, we propose Prism, a novel decoupled framework that rigorously separates the recommendation process into a dedicated ranking stage and an explanation generation stage. Inspired by knowledge distillation, Prism leverages a powerful teacher LLM (e.g., FLAN-T5-XXL) as an Oracle to produce high-fidelity explanatory knowledge. A compact, fine-tuned student model (e.g., BART-Base), the Prism, then specializes in synthesizing this knowledge into personalized explanations. This decomposition ensures that each component is optimized for its specific objective, eliminating inherent conflicts in coupled models. Extensive experiments on benchmark datasets demonstrate that our 140M-parameter Prism model significantly outperforms its 11B-parameter teacher in human evaluations of faithfulness and personalization, while achieving a 24 times speedup and a 10 times reduction in memory consumption during inference. These results validate that decoupling, coupled with targeted distillation, provides an efficient and effective pathway to high-quality explainable recommendation.

10. [2511.16540] Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

Link: https://arxiv.org/abs/2511.16540

Authors: Éloïse Benito-Rodriguez, Einar Urdshals, Jasmina Nasufi, Nicky Pochinkov

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Understanding Large Language, Large Language Models, Understanding Large, Large Language, Language Models

Comments: 13 pages, 5 figures

Abstract:Understanding Large Language Models (LLMs) is key to ensuring their safe and beneficial deployment. This task is complicated by the difficulty of interpreting LLM structures and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework, where the genre of a text used to prompt an LLM is predicted based on its activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.

11. [2511.16528] TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

Link: https://arxiv.org/abs/2511.16528

Authors: Özay Ezerceli, Mahmoud El Hussieni, Selva Taş, Reyhan Bayraktar, Fatma Betül Terzioğlu, Yusuf Çelebi, Yağız Asker

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Neural information retrieval, Neural information, lower-resource languages, information retrieval systems, retrieval systems excel

Abstract:Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models -- which retain token-level representations for fine-grained matching -- have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600$\times$ smaller than the 600M turkish-e5-large dense encoder while preserving over 71\% of its average mAP. Late-interaction models that are 3--5$\times$ smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8\% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33$\times$ faster than PLAID and offers +1.7\% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets ($\leq$50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
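The abstract contrasts dense bi-encoders, which compare one vector per text, with late-interaction models that keep token-level representations. The sketch below illustrates the ColBERT-style MaxSim scoring rule on toy vectors in plain Python. It is a simplified illustration under stated assumptions: the 2-dimensional embeddings are made up, the function names are hypothetical, and it does not use PyLate or reproduce the benchmark's pipeline.

```python
import math

def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: each query token keeps its best-matching
    document token (by cosine similarity), and the maxima are summed."""
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)

def dense_score(query_vec, doc_vec):
    """Dense bi-encoder: one cosine similarity between pooled vectors."""
    return dot(normalize(query_vec), normalize(doc_vec))

# Toy token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_with_both_terms = [[2.0, 0.0], [0.0, 3.0]]
doc_blended = [[1.0, 1.0]]

print(maxsim_score(query, doc_with_both_terms))
print(maxsim_score(query, doc_blended))
print(dense_score([1.0, 1.0], [1.0, 1.0]))
```

Because every query token is matched separately, a document that contains a close match for each token scores higher than one whose single vector only blends them, which is the fine-grained matching the abstract refers to.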

12. [2511.16518] MiMo-Embodied: X-Embodied Foundation Model Technical Report

Link: https://arxiv.org/abs/2511.16518

Authors: Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, Xianhui Meng, Yuchen Zhang, Jing Wu, Jinghui Lu, Chenxu Dang, Jiayi Guan, Jianhua Wu, Zhiyi Hou, Hanbing Li, Shumeng Xia, Mingliang Zhou, Yinan Zheng, Zihao Yue, Shuhao Gu, Hao Tian, Yuannan Shen, Jianwei Cui, Wen Zhang, Shaoqing Xu, Bing Wang, Haiyang Sun, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Chaofan Zhang, Wenbo Ding, Kun Ma, Guang Chen, Rui Cai, Diyun Xiang, Heng Qu, Fuli Luo, Hangjun Ye, Long Chen

Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving benchmarks, Autonomous Driving, cross-embodied foundation model, Driving Planning, integrate and achieve

Comments: Code: [this https URL](https://github.com/XiaomiMiMo/MiMo-Embodied) Model: [this https URL](https://huggingface.co/XiaomiMiMo/MiMo-Embodied-7B)

Abstract:We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at this https URL.

13. [2511.16478] Music Recommendation with Large Language Models: Challenges, Opportunities, and Evaluation

Link: https://arxiv.org/abs/2511.16478

Authors: Elena V. Epure, Yashar Deldjoo, Bruno Sguerra, Markus Schedl, Manuel Moussallam

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: information-retrieval framing, retrieval-oriented subtasks, long relied, progress is measured, Music Recommender Systems

Comments: Under review with the ACM Transactions on Recommender Systems (TORS)

Abstract:Music Recommender Systems (MRS) have long relied on an information-retrieval framing, where progress is measured mainly through accuracy on retrieval-oriented subtasks. While effective, this reductionist paradigm struggles to address the deeper question of what makes a good recommendation, and attempts to broaden evaluation, through user studies or fairness analyses, have had limited impact. The emergence of Large Language Models (LLMs) disrupts this framework: LLMs are generative rather than ranking-based, making standard accuracy metrics questionable. They also introduce challenges such as hallucinations, knowledge cutoffs, non-determinism, and opaque training data, rendering traditional train/test protocols difficult to interpret. At the same time, LLMs create new opportunities, enabling natural-language interaction and even allowing models to act as evaluators. This work argues that the shift toward LLM-driven MRS requires rethinking evaluation. We first review how LLMs reshape user modeling, item modeling, and natural-language recommendation in music. We then examine evaluation practices from NLP, highlighting methodologies and open challenges relevant to MRS. Finally, we synthesize insights, focusing on how LLM prompting applies to MRS, to outline a structured set of success and risk dimensions. Our goal is to provide the MRS community with an updated, pedagogical, and cross-disciplinary perspective on evaluation.

14. [2511.16470] Arctic-Extract Technical Report

Link: https://arxiv.org/abs/2511.16470

Authors: Mateusz Chiliński, Julita Ołtusek, Wojciech Jaśkowski

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: extracting structural data, question answering, entities and tables, structural data, digital-born business documents

Abstract:Arctic-Extract is a state-of-the-art model designed for extracting structural data (question answering, entities and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model is deployable on resource-constrained hardware, weighing only 6.6 GiB, making it suitable for deployment on devices with limited resources, such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making it suitable for long document processing. This paper highlights Arctic-Extract's training protocols and evaluation results, demonstrating its strong performance in document understanding.

15. [2511.16467] Anatomy of an Idiom: Tracing Non-Compositionality in Language Models

Link: https://arxiv.org/abs/2511.16467

Authors: Andrew Gomes

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: transformer-based language models, discovery and analysis, idiomatic expressions, expressions in transformer-based, set of techniques

Abstract:We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate "Idiom Heads," attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term "augmented reception." We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.

16. [2511.16438] ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports

Link: https://arxiv.org/abs/2511.16438

Authors: Sherine George, Nithish Saji

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: corporate sustainability reports, assess explainable ESG, ESG question answering, explainable ESG question, evaluation framework designed

Comments: Workshop paper accepted at AI4DF 2025 (part of ACM ICAIF 2025). 3 pages including tables and figures

Abstract:We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.

17. [2511.16423] TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models

Link: https://arxiv.org/abs/2511.16423

Authors: Li Zhang, Zhongxuan Han, XiaoHua Feng, Jiaming Zhang, Yuyuan Li, Linbo Jiang, Jianan Lin, Chaochao Chen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: rapidly emerging research, emerging research topic, downstream tasks, tasks through collaborative, collaborative interactions

Comments: Accepted by AAAI 2026

Abstract:Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning. Existing adaptation algorithms are typically trained iteratively, which incurs significant communication costs and increases the susceptibility to potential attacks. Motivated by the one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive. However, current one-shot approaches face certain challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) lack of specialized adaptation strategies to systematically handle the severe data heterogeneity; and (3) the need for additional training resources on the client or server side. To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations. In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity. Our method is training-free, not relying on additional training resources on either the client or server side. Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.

18. [2511.16416] Classification of worldwide news articles by perceived quality, 2018-2024

Link: https://arxiv.org/abs/2511.16416

Authors: Connor McElroy, Thiago E. A. de Oliveira, Chris Brogly

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: deep learning models, machine learning classifiers, deep learning, context length, distinguish perceived lower-quality

Abstract:This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. 3 machine learning classifiers and 3 deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low and high-quality classes of about 706,000 articles each, with 194 linguistic features per website-level labelled article. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy; 0.9593 ROC-AUC; 0.8739 F1), followed by DistilBERT-base (512 context length) at 0.8685 accuracy and 0.9554 ROC-AUC. DistilBERT-base (256 context length) reached 0.8478 accuracy and 0.9407 ROC-AUC, while ModernBERT-base (256 context length) attained 0.8569 accuracy and 0.9470 ROC-AUC. These results suggest that the perceived quality of worldwide news articles can be effectively differentiated by traditional CPU-based machine learning classifiers and deep learning classifiers.
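The labelling step described above, splitting source websites at the median expert rating and passing each website's label down to its articles, can be sketched in a few lines of Python. The domain names and ratings below are hypothetical, and the handling of sources that sit exactly on the median is an assumption, since the abstract does not say how ties are treated.

```python
from statistics import median

def label_articles_by_source(source_ratings, article_sources):
    """Assign each article the quality class of its source website.

    source_ratings: dict mapping a website to its consensus rating.
    article_sources: list giving the website of each article.
    Assumption: sources rated exactly at the median fall into the "low" class.
    """
    cutoff = median(source_ratings.values())
    source_label = {
        site: ("high" if rating > cutoff else "low")
        for site, rating in source_ratings.items()
    }
    return [source_label[site] for site in article_sources]

# Hypothetical ratings for four source websites.
ratings = {"a.example": 0.2, "b.example": 0.4, "c.example": 0.6, "d.example": 0.9}
articles = ["a.example", "d.example", "c.example", "a.example"]
print(label_articles_by_source(ratings, articles))
```

Note that this yields website-level labels: every article from a given source shares one class, regardless of the article's own content.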

19. 【2511.16397】AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

Link: https://arxiv.org/abs/2511.16397

Authors: Ren Ma, Jiantao Qiu, Chao Xu, Pei Chu, Kaiwen Liu, Pengli Ren, Yuan Qu, Jiahui Peng, Linfeng Hou, Mengjie Liu, Lindong Lu, Wenchang Ning, Jia Yu, Rui Min, Jin Shi, Haojiong Chen, Peng Zhang, Wenjian Zhang, Qian Jiang, Zengjie Hu, Guoqiang Yang, Zhenxiang Li, Fukai Shang, Zhongying Tu, Wentao Zhang, Dahua Lin, Conghui He

Categories: Computation and Language (cs.CL)

Keywords: fixed pre-processing step, curation efforts focus, pre-processing step, crucial for large, curation efforts

Abstract:While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, codes, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp, providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
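For readers unfamiliar with the metric quoted above, ROUGE-N F1 is the harmonic mean of n-gram precision and recall between extracted and reference text. The sketch below is a generic version, assuming whitespace tokenization and a default of n=2; the tokenizer and n-gram order used on MainWebBench are not given in the abstract.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(reference, candidate, n=2):
    """ROUGE-N F1 between two strings (whitespace tokenization is an assumption)."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

perfect = rouge_n_f1("a b c d", "a b c d")
partial = rouge_n_f1("a b c d", "a b x y")
disjoint = rouge_n_f1("a b", "c d")
```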

20. 【2511.16353】Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies

Link: https://arxiv.org/abs/2511.16353

Authors: Jonathan Kamp, Lisa Beinborn, Antske Fokkens

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Human explanations, natural language, form a tool, dataset-specific shortcuts, explanations of natural

Comments: Long paper accepted to the main conference of AACL 2025. Please cite the conference proceedings when available

Abstract:Human explanations of natural language, rationales, form a tool to assess whether models learn a label for the right reasons or rely on dataset-specific shortcuts. Sufficiency is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability of improving model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.
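Sufficiency is commonly computed as the change in the predicted-class probability when the model sees only the rationale tokens (the ERASER-style definition). The sketch below uses that definition with a hypothetical toy classifier; whether the paper uses exactly this variant is an assumption.

```python
def sufficiency(predict_proba, tokens, rationale_mask):
    """p(label | full input) minus p(label | rationale tokens only)."""
    rationale_only = [t for t, keep in zip(tokens, rationale_mask) if keep]
    return predict_proba(tokens) - predict_proba(rationale_only)

# Hypothetical toy classifier: the share of tokens found in a positive lexicon.
POSITIVE = {"great", "love"}

def toy_proba(tokens):
    if not tokens:
        return 0.5
    return sum(t in POSITIVE for t in tokens) / len(tokens)

tokens = ["i", "love", "this", "great", "film"]
mask = [0, 1, 0, 1, 0]  # 1 marks a token that annotators included in the rationale
score = sufficiency(toy_proba, tokens, mask)
```

A value near or below zero means the rationale alone supports the prediction at least as well as the full input. As the abstract notes, the score therefore also reflects how much the non-rationale context contributes.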

21. 【2511.16345】NLP Datasets for Idiom and Figurative Language Tasks

Link: https://arxiv.org/abs/2511.16345

Authors: Blake Matheny, Phuong Minh Nguyen, Minh Le Nguyen, Stephanie Reynolds

Categories: Computation and Language (cs.CL)

Keywords: figurative language, figurative language form, Natural Language Processing, language, figurative language expressions

Comments: 32 pages, 10 figures

Abstract:Idiomatic and figurative language form a large portion of colloquial speech and writing. With social media, this informal language has become more easily observable to people and trainers of large language models (LLMs) alike. While the advantage of large corpora seems like the solution to all machine learning and Natural Language Processing (NLP) problems, idioms and figurative language continue to elude LLMs. Finetuning approaches are proving to be optimal, but better and larger datasets can help narrow this gap even further. The datasets presented in this paper provide one answer, while offering a diverse set of categories on which to build new models and develop new approaches. A selection of recent idiom and figurative language datasets were used to acquire a combined idiom list, which was used to retrieve context sequences from a large corpus. One large-scale dataset of potential idiomatic and figurative language expressions and two additional human-annotated datasets of definite idiomatic and figurative language expressions were created to evaluate the baseline ability of pre-trained language models in handling figurative meaning through idiom recognition (detection) tasks. The resulting datasets were post-processed for model agnostic training compatibility, utilized in training, and evaluated on slot labeling and sequence tagging.

22. 【2511.16334】OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

Link: https://arxiv.org/abs/2511.16334

Authors: Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, Lidong Bing

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: fueled growing interest, Recent advancements, large reasoning models, advancements in large, models have fueled

Abstract:Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our codes, pipeline, and data at this https URL.

23. 【2511.16331】Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement

Link: https://arxiv.org/abs/2511.16331

Authors: Jiashu Yao, Heyan Huang, Shuang Zeng, Chuwei Luo, WangJie You, Jie Tang, Qingsong Liu, Yuhang Guo, Yangyang Kang

Categories: Computation and Language (cs.CL)

Keywords: scaled inference computation, demonstrated substantial success, outcome correctness rewards, internal reasoning quality, reinforcement learning

Comments: Accepted to AAAI 2026

Abstract:Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) with scaled inference computation have demonstrated substantial success on complex reasoning tasks. However, the one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over internal reasoning process. This deficiency leads to suboptimal internal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce self-rewriting framework, where a model rewrites its own reasoning texts, and subsequently learns from the rewritten reasoning to improve the internal thought process quality. For algorithm design, we propose a selective rewriting approach wherein only "simple" samples, defined by the model's consistent correctness, are rewritten, thereby preserving all original reward signals of GRPO. For practical implementation, we compile rewriting and vanilla generation within one single batch, maintaining the scalability of the RL algorithm and introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions in rewriting prompts to reduce reasoning length, outperforming existing strong baselines. In terms of internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.
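The selective-rewriting rule can be illustrated with a small sketch. It assumes binary correctness rewards and the usual GRPO group-normalized advantage; the function names and the zero-variance handling are illustrative, not taken from the paper. In a group where every response is correct, all advantages are zero, so rewriting those samples does not displace any non-zero reward signal.

```python
def select_for_rewriting(groups):
    """Return ids of queries whose sampled responses were all correct ("simple" samples)."""
    return [qid for qid, rewards in groups.items()
            if rewards and all(r == 1 for r in rewards)]

def group_advantages(rewards):
    """GRPO-style advantage: reward standardized within its group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # Assumption: a zero-variance group gets zero advantage everywhere.
    return [0.0 if std == 0 else (r - mean) / std for r in rewards]

# Hypothetical correctness rewards for three queries, four rollouts each.
groups = {"q1": [1, 1, 1, 1], "q2": [1, 0, 1, 0], "q3": [0, 0, 0, 0]}
simple = select_for_rewriting(groups)
adv_simple = group_advantages(groups["q1"])
adv_mixed = group_advantages(groups["q2"])
```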

24. 【2511.16324】SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

Link: https://arxiv.org/abs/2511.16324

Authors: Wei Xia, Zhi-Hong Deng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, increasingly widespread, rapid advancement, advancement of large, large language

Abstract:With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions, helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.

25. 【2511.16275】SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs

Link: https://arxiv.org/abs/2511.16275

Authors: Xingtao Zhao, Hao Peng, Dingli Su, Xianghua Zeng, Chunyang Liu, Jinzhi Liao, Philip S. Yu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Reliable uncertainty quantification, avoiding hallucinating falsehoods, deploying large language, large language models, semantic structural information

Comments: 14 pages of main text and 10 pages of appendices

Abstract:Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinating falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation -- where existing methods often rely on heuristic sample-and-count techniques -- we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recently proposed KLE.

26. 【2511.16221】Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions

Link: https://arxiv.org/abs/2511.16221

Authors: Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu, Yoichi Sato

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Interactive Deception Assessment, Large Language Models, Multimodal Interactive Deception, complex social interactions

Abstract:Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to 'read the room' and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.

27. 【2511.16209】PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization

Link: https://arxiv.org/abs/2511.16209

Authors: Huseein Jawad, Nicolas Brunel

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, behavior of Large, Language Models, sensitive information

Abstract:System prompts are critical for guiding the behavior of Large Language Models (LLMs), yet they often contain proprietary logic or sensitive information, making them a prime target for extraction attacks. Adversarial queries can successfully elicit these hidden instructions, posing significant security and privacy risks. Existing defense mechanisms frequently rely on heuristics, incur substantial computational overhead, or are inapplicable to models accessed via black-box APIs. This paper introduces a novel framework for hardening system prompts through shield appending, a lightweight approach that adds a protective textual layer to the original prompt. Our core contribution is the formalization of prompt hardening as a utility-constrained optimization problem. We leverage an LLM-as-optimizer to search the space of possible SHIELDs, seeking to minimize a leakage metric derived from a suite of adversarial attacks, while simultaneously preserving task utility above a specified threshold, measured by semantic fidelity to baseline outputs. This black-box, optimization-driven methodology is lightweight and practical, requiring only API access to the target and optimizer LLMs. We demonstrate empirically that our optimized SHIELDs significantly reduce prompt leakage against a comprehensive set of extraction attacks, outperforming established baseline defenses without compromising the model's intended functionality. Our work presents a paradigm for developing robust, utility-aware defenses in the escalating landscape of LLM security. The code is made public on the following link: this https URL
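The utility-constrained objective can be sketched as a selection step: among candidate shields, keep those whose utility stays above the threshold and pick the one with the lowest leakage. This covers only the selection rule. In the paper an LLM-as-optimizer proposes the candidates, and the leakage and utility values below are hypothetical numbers rather than outputs of real attack or fidelity evaluations.

```python
def pick_shield(candidates, leakage, utility, min_utility):
    """Return the feasible candidate with the lowest leakage, or None if none is feasible."""
    feasible = [c for c in candidates if utility(c) >= min_utility]
    return min(feasible, key=leakage) if feasible else None

# Hypothetical scores for three candidate shields.
scores = {
    "shield-a": {"leakage": 0.10, "utility": 0.70},
    "shield-b": {"leakage": 0.25, "utility": 0.92},
    "shield-c": {"leakage": 0.40, "utility": 0.95},
}

best = pick_shield(
    scores,
    leakage=lambda c: scores[c]["leakage"],
    utility=lambda c: scores[c]["utility"],
    min_utility=0.9,
)
none_feasible = pick_shield(
    scores,
    leakage=lambda c: scores[c]["leakage"],
    utility=lambda c: scores[c]["utility"],
    min_utility=0.99,
)
```

Here `shield-a` leaks least but is rejected because its utility falls below the threshold, which is the trade-off the constraint is meant to enforce.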

28. 【2511.16198】SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning

Link: https://arxiv.org/abs/2511.16198

Authors: Sebastian Haan

Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)

Keywords: Effective scientific communication, scientific communication depends, Effective scientific, supporting evidence, scientific communication

Comments: 21 pages, 4 figures

Abstract:Effective scientific communication depends on accurate citations that validate sources and guide readers to supporting evidence. Yet academic literature faces mounting challenges: semantic citation errors that misrepresent sources, AI-generated hallucinated references, and traditional citation formats that point to entire papers without indicating which sections substantiate specific claims. We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis while providing rich contextual information via detailed reasoning and relevant text snippets. Our approach combines multiple retrieval methods with a four-class classification system (Supported, Partially Supported, Unsupported, Uncertain) that captures nuanced claim-source relationships and enables appropriate remedial actions for different error types. Our experiments show that fine-tuned lightweight language models achieve performance comparable to large commercial systems with significantly lower computational requirements, making large-scale citation verification practically feasible. The system provides transparent, evidence-based explanations that support user understanding and trust. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata across eight disciplines, alongside fine-tuned models and the complete verification framework as open-source software. SemanticCite addresses critical challenges in research integrity through scalable citation verification, streamlined peer review, and quality control for AI-generated content, providing an open-source foundation for maintaining citation accuracy at scale.

29. 【2511.16147】TS-PEFT: Token-Selective Parameter-Efficient Fine-Tuning with Learnable Threshold Gating

Link: https://arxiv.org/abs/2511.16147

Authors: Dabiao Ma, Ziming Dai, Zhimin Xin, Shu Wang, Ye Wang, Haojun Fei

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: natural language processing, pretrained weights fixed, language processing, computer vision, weights fixed

Comments: 11 pages, 3 figures

Abstract:In the field of large models (LMs) for natural language processing (NLP) and computer vision (CV), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a resource-efficient method that modifies a limited number of parameters while keeping the pretrained weights fixed. This paper investigates the traditional PEFT approach, which applies modifications to all position indices, and questions its necessity. We introduce a new paradigm called Token-Selective PEFT (TS-PEFT), in which a function S selectively applies PEFT modifications to a subset of position indices, potentially enhancing performance on downstream tasks. Our experimental results reveal that the indiscriminate application of PEFT to all indices is not only superfluous, but may also be counterproductive. This study offers a fresh perspective on PEFT, advocating for a more targeted approach to modifications and providing a framework for future research to optimize the fine-tuning process for large models.
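The idea of applying a PEFT update at only some positions can be sketched as below. The gating criterion (the norm of the per-token update compared with a fixed threshold `tau`) is purely an assumption for illustration. The abstract says only that a function S selects a subset of position indices, and the title indicates that the threshold is learned rather than fixed.

```python
def token_selective_update(base, delta, tau):
    """Add the PEFT update only at positions where the gate is open.

    base  : per-token outputs of the frozen model (list of vectors)
    delta : per-token PEFT modifications (same shape as base)
    tau   : gate threshold (fixed here; treated as learnable in the paper)
    """
    out = []
    for h, d in zip(base, delta):
        norm = sum(x * x for x in d) ** 0.5
        gate = 1.0 if norm > tau else 0.0  # hypothetical selection function S
        out.append([hi + gate * di for hi, di in zip(h, d)])
    return out

base = [[1.0, 1.0], [2.0, 2.0]]
delta = [[0.1, 0.0], [3.0, 4.0]]  # norms 0.1 and 5.0
updated = token_selective_update(base, delta, tau=1.0)
```

Setting `tau` below every update norm recovers conventional PEFT, where every position is modified.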

30. 【2511.16122】ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models

Link: https://arxiv.org/abs/2511.16122

Authors: Qing Zhang, Bing Xu, Xudong Zhang, Yifan Shi, Yang Li, Chen Zhang, Yik Chung Wu, Ngai Wong, Yijie Chen, Hong Dai, Xiansen Chen, Mian Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, highly relies, Prompt Optimization

Abstract:The remarkable performance of Large Language Models (LLMs) relies heavily on crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for practical application of LLMs. This phenomenon has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which has developed rapidly in recent years. Existing APO methods, such as those based on evolutionary algorithms or trial-and-error approaches, realize efficient and accurate prompt optimization to some extent. However, these studies focus on a single model or algorithm for the generation strategy and optimization process, which limits their performance when handling complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO adopts a voting mechanism and introduces shared generation strategies along with different search methods for searching superior prompts. Moreover, ELPO creatively presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving F1 score by 7.6 on the ArSarcasm dataset.
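As a generic illustration of ensemble voting, the sketch below takes the majority label across the predictions obtained with several candidate prompts. This is one plausible reading of a voting mechanism, not ELPO's actual procedure, which the abstract does not detail. The labels are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent prediction; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs for one input, obtained with three different prompts.
per_prompt = ["sarcastic", "not_sarcastic", "sarcastic"]
final = majority_vote(per_prompt)
```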

31. 【2511.16072】Early science acceleration experiments with GPT-5

Link: https://arxiv.org/abs/2511.16072

Authors: Sébastien Bubeck, Christian Coester, Ronen Eldan, Timothy Gowers, Yin Tat Lee, Alexandru Lupsasca, Mehtaab Sawhney, Robert Scherrer, Mark Sellke, Brian K. Spears, Derya Unutmaz, Kevin Weil, Steven Yin, Nikita Zhivotovskiy

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: increasingly valuable tool, tool for scientists, increasingly valuable, valuable tool, remain unaware

Comments: 89 pages

Abstract:AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.

32. 【2511.16054】Learning Tractable Distributions Of Language Model Continuations

Link: https://arxiv.org/abs/2511.16054

Authors: Gwen Yidou-Weng, Ian Li, Anji Liu, Oliver Broadrick, Guy Van den Broeck, Benjie Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Controlled language generation, generation conditions text, language generation conditions, Controlled language, generation conditions

Abstract:Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model's next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate's latent state prior on the LM's hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
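The HMM step that scores all next-token candidates at once can be sketched in plain Python. This is a generic HMM-guided decoding computation under stated assumptions, not the authors' implementation. `prior` stands for the latent-state distribution (which LTLA conditions on the LM's hidden representation), `future[k]` is a precomputed probability that the constraint is eventually satisfied from state `k`, and the loop over the vocabulary stands in for what would be one batched matrix operation.

```python
def candidate_scores(prior, emission, transition, future):
    """For every candidate token v, return P(constraint satisfied | prefix, v).

    prior[z]         : P(latent state z | prefix)
    emission[z][v]   : P(token v | state z), fixed surrogate decoder
    transition[z][k] : P(next state k | state z)
    future[k]        : P(constraint eventually satisfied | next state k)
    """
    num_states = len(prior)
    vocab_size = len(emission[0])
    scores = []
    for v in range(vocab_size):
        joint = [prior[z] * emission[z][v] for z in range(num_states)]
        total = sum(joint)
        if total == 0.0:
            scores.append(0.0)  # token v is impossible under the surrogate
            continue
        posterior = [j / total for j in joint]  # P(z | prefix, v)
        nxt = [sum(posterior[z] * transition[z][k] for z in range(num_states))
               for k in range(num_states)]
        scores.append(sum(nxt[k] * future[k] for k in range(num_states)))
    return scores

# Toy two-state, two-token HMM with hypothetical parameters.
prior = [0.5, 0.5]
emission = [[1.0, 0.0], [0.0, 1.0]]
transition = [[1.0, 0.0], [0.0, 1.0]]
future = [0.9, 0.1]
scores = candidate_scores(prior, emission, transition, future)
```

Because `emission`, `transition`, and `future` do not depend on the prefix, they can be computed once and reused; only `prior` changes from step to step, which mirrors the reuse argument in the abstract.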

33. 【2511.16035】Liars' Bench: Evaluating Lie Detectors for Language Models

Link: https://arxiv.org/abs/2511.16035

Authors: Kieron Kretschmar (1), Walter Laurito (1 and 2), Sharan Maiya (1 and 3), Samuel Marks (4) ((1) Cadenza Labs, (2) FZI, (3) University of Cambridge, (4) Anthropic)

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, generating statements, work has introduced, detecting when large, large language

Comments: *Kieron Kretschmar and Walter Laurito contributed equally to this work. 10 pages, 2 figures; plus appendix. Code at [this https URL](https://github.com/Cadenza-Labs/liars-bench) and datasets at [this https URL](https://huggingface.co/datasets/Cadenza-Labs/liars-bench)

Abstract:Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generating statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it's not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.

34. 【2511.16018】SpellForger: Prompting Custom Spell Properties In-Game using BERT supervised-trained model

Link: https://arxiv.org/abs/2511.16018

Authors: Emanuel C. Silva, Emily S. M. Salum, Gabriel M. Arantes, Matheus P. Pereira, Vinicius F. Oliveira, Alessandro L. Bicho

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Artificial Intelligence, application of Artificial, dynamic content generation, evolved significantly, allowing for dynamic

Comments: Published in Anais Estendidos do XXIV Simpósio Brasileiro de Jogos e Entretenimento Digital (SBGames 2025)

Abstract:Introduction: The application of Artificial Intelligence in games has evolved significantly, allowing for dynamic content generation. However, its use as a core gameplay co-creation tool remains underexplored. Objective: This paper proposes SpellForger, a game where players create custom spells by writing natural language prompts, aiming to provide a unique experience of personalization and creativity. Methodology: The system uses a supervised-trained BERT model to interpret player prompts. This model maps textual descriptions to one of many spell prefabs and balances their parameters (damage, cost, effects) to ensure competitive integrity. The game is developed in the Unity Game Engine, and the AI backend is in Python. Expected Results: We expect to deliver a functional prototype that demonstrates the generation of spells in real time, applied to an engaging gameplay loop, where player creativity is central to the experience, validating the use of AI as a direct gameplay mechanic.

35. 【2511.15996】QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation

Link: https://arxiv.org/abs/2511.15996

Authors: Amin Bigdeli, Radin Hamidi Rad, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke, Ebrahim Bagheri

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: large language model, based query reformulation, supports large language, language model, large language

Comments: 4 pages

Abstract:We present QueryGym, a lightweight, extensible Python toolkit that supports large language model (LLM)-based query reformulation. This is an important tool development since recent work on LLM-based query reformulation has shown a notable increase in retrieval effectiveness. However, while different authors have sporadically shared the implementation of their methods, there is no unified toolkit that provides a consistent implementation of such methods, which hinders fair comparison, rapid experimentation, consistent benchmarking and reliable deployment. QueryGym addresses this gap by providing a unified framework for implementing, executing, and comparing LLM-based reformulation methods. The toolkit offers: (1) a Python API for applying diverse LLM-based methods, (2) a retrieval-agnostic interface supporting integration with backends such as Pyserini and PyTerrier, (3) a centralized prompt management system with versioning and metadata tracking, (4) built-in support for benchmarks like BEIR and MS MARCO, and (5) a completely open-source extensible implementation available to all researchers. QueryGym is publicly available at this https URL.

36. 【2511.15994】CARE-RAG - Clinical Assessment and Reasoning in RAG

Link: https://arxiv.org/abs/2511.15994

Authors: Deepthi Potluri, Aby Mammen Mathew, Jeffrey B DeWitt, Alexander L. Rasgon, Yide Hao, Junyuan Hong, Ying Ding

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, Written Exposure Therapy, guarantee that large, large language, Access

Comments: The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance

Abstract:Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.

37. 【2511.15976】TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues

Link: https://arxiv.org/abs/2511.15976

Authors: Sarik Ghazarian, Abhinav Gullapalli, Swair Shah, Anurag Beniwal, Nanyun Peng, Narayanan Sadagopan, Zhou Yu

Categories: Computation and Language (cs.CL)

Keywords: real-world task-oriented dialogue, task-oriented dialogue, agents are required, real-world task-oriented, required to strictly

Comments:

Abstract:In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions while conducting multi-turn conversations with customers. These instructions are typically presented in natural language format and include general guidelines and step-by-step procedures with complex constraints. Existing TOD benchmarks often oversimplify the complex nature of these instructions by reducing them to simple schemas composed of intents, slots, and API call configurations. To address this gap and systematically benchmark LLMs' instruction-following capabilities, we propose TOD-ProcBench, a challenging benchmark featuring complex process instructions with intricate, fine-grained constraints that evaluates various LLMs' abilities to understand and follow instructions in multi-turn TODs. Our benchmark dataset comprises instruction documents derived from the high-quality ABCD dataset with corresponding conversations under human quality control. We formulate fine-grained constraints and action procedures as multi-level condition-action instruction statements. We design three tasks to comprehensively benchmark LLMs' complex instruction-following capabilities in multi-turn TODs. Task 1 evaluates how LLMs retrieve the most relevant statement from a complex instruction and predict the corresponding next action. In Task 2, we synthesize instruction-violating responses by injecting inconsistencies and manipulating the original instructions, and then we analyze how effectively LLMs can identify instruction-violating responses. Task 3 investigates LLMs' abilities in conditional generation of instruction-following responses based on the original complex instructions. Additionally, we conduct studies on the impact of multilingual settings and different instruction text formats on compliance performance. We release our benchmark under the Llama 3.3 Community License Agreement.

38. 【2511.15958】JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation

Link: https://arxiv.org/abs/2511.15958

Authors: Zhenyu Bi, Gaurav Srivastava, Yang Li, Meng Lu, Swastik Roy, Morteza Ziyadi, Xuan Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: remains unclear compared, answers remains unclear, small language models, large language models, small language

Comments: 23 pages, 4 figures

Abstract:While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM-as-a-judge frameworks typically relies on comparing candidate answers against ground-truth labels or other candidate answers using predefined metrics like entailment. However, this approach is inherently indirect and difficult to fully automate, offering limited support for fine-grained and scalable evaluation of reasoning outputs. In this work, we propose JudgeBoard, a novel evaluation pipeline that directly queries models to assess the correctness of candidate answers without requiring extra answer comparisons. We focus on two core reasoning domains: mathematical reasoning and science/commonsense reasoning, and construct task-specific evaluation leaderboards using both accuracy-based ranking and an Elo-based rating system across five benchmark datasets, enabling consistent model comparison as judges rather than comparators. To improve judgment performance in lightweight models, we propose MAJ (Multi-Agent Judging), a novel multi-agent evaluation framework that leverages multiple interacting SLMs with distinct reasoning profiles to approximate LLM-level judgment accuracy through collaborative deliberation. Experimental results reveal a significant performance gap between SLMs and LLMs in isolated judging tasks. However, our MAJ framework substantially improves the reliability and consistency of SLMs. On the MATH dataset, MAJ using smaller-sized models as backbones performs comparably to, or even better than, its larger-sized counterparts. Our findings highlight that multi-agent SLM systems can potentially match or exceed LLM performance in judgment tasks, with implications for scalable and efficient assessment.
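
The Elo-based rating mentioned in this abstract can be illustrated with the standard Elo update. The sketch below is only an illustration and is not taken from JudgeBoard: the K-factor of 32, the starting rating of 1000, the model names, and the convention that a judge "wins" a pairwise match when it judges an item correctly and its opponent does not are all assumptions, and `expected_score` and `update_elo` are hypothetical helper names.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A scores against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one match.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    The K-factor is an assumed value, not one reported by the paper.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical judge models, both starting from an assumed rating of 1000.
ratings = {"judge_small": 1000.0, "judge_large": 1000.0}

# Assumed match outcomes from judge_small's point of view
# (1.0 = only judge_small was right, 0.0 = only judge_large was right, 0.5 = tie).
for outcome in [0.0, 0.0, 1.0, 0.5]:
    ratings["judge_small"], ratings["judge_large"] = update_elo(
        ratings["judge_small"], ratings["judge_large"], outcome
    )

print(ratings)
```

Because each update is zero-sum, the two ratings always add up to the same total, so the ranking reflects relative judging strength rather than absolute accuracy.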

39. 【2511.15915】AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Link: https://arxiv.org/abs/2511.15915

Authors: Genghan Zhang, Shaowei Zhu, Anjiang Wei, Zhenyu Song, Allen Nie, Zhen Jia, Nandita Vijaykumar, Yida Wang, Kunle Olukotun

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: self-improving large language, hardware-specific optimization knowledge, expert-provided hardware-specific optimization, large language model, autonomously optimizes kernels

Comments:

Abstract:We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper.

40. 【2511.15887】Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language

Link: https://arxiv.org/abs/2511.15887

Authors: Seungbeen Lee, Jinhong Jeong, Donghyun Kim, Yejin Son, Youngjae Yu

Categories: Computation and Language (cs.CL)

Keywords: interpret others' mental, others' mental states, social cohesion, ability to interpret, interpret others'

Comments:

Abstract:Our ability to interpret others' mental states through nonverbal cues (NVCs) is fundamental to our survival and social cohesion. While existing Theory of Mind (ToM) benchmarks have primarily focused on false-belief tasks and reasoning with asymmetric information, they overlook other mental states beyond belief and the rich tapestry of human nonverbal communication. We present Motion2Mind, a framework for evaluating the ToM capabilities of machines in interpreting NVCs. Leveraging an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal cue annotations paired with manually verified psychological interpretations. It encompasses 222 types of nonverbal cues and 397 mind states. Our evaluation reveals that current AI systems struggle significantly with NVC interpretation, exhibiting not only a substantial performance gap in Detection but also patterns of over-interpretation in Explanation compared to human annotators.

41. 【2511.15886】What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning

Link: https://arxiv.org/abs/2511.15886

Authors: Jeremias Ferrao, Ezgi Basar, Khondoker Ittehadul Islam, Mahrokh Hassani

Categories: Computation and Language (cs.CL)

Keywords: attribution patterns underlying, patterns underlying, study investigates, CoT prompting, multilingual LLMs

Comments: Received the Best Student Project Award at RuG's Advanced-NLP course

Abstract:This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior works demonstrate the role of CoT prompting in improving task performance, there are concerns regarding the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods--ContextCite for step-level attribution and Inseq for token-level attribution--to the Qwen2.5 1.5B-Instruct model using the MGSM benchmark. Our experimental results highlight key findings such as: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.

42. 【2511.15862】The Subtle Art of Defection: Understanding Uncooperative Behaviors in LLM based Multi-Agent Systems

Link: https://arxiv.org/abs/2511.15862

Authors: Devang Kulshreshtha, Wanyu Du, Raghav Jain, Srikanth Doss, Hang Su, Sandesh Swamy, Yanjun Qi

Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)

Keywords: paper introduces, simulating and analyzing, uncooperative behaviors, uncooperative, LLM-based multi-agent systems

Comments:

Abstract:This paper introduces a novel framework for simulating and analyzing how uncooperative behaviors can destabilize or collapse LLM-based multi-agent systems. Our framework includes two key components: (1) a game theory-based taxonomy of uncooperative agent behaviors, addressing a notable gap in the existing literature; and (2) a structured, multi-stage simulation pipeline that dynamically generates and refines uncooperative behaviors as agents' states evolve. We evaluate the framework via a collaborative resource management setting, measuring system stability using metrics such as survival time and resource overuse rate. Empirically, our framework achieves 96.7% accuracy in generating realistic uncooperative behaviors, validated by human evaluations. Our results reveal a striking contrast: cooperative agents maintain perfect system stability (100% survival over 12 rounds with 0% resource overuse), while any uncooperative behavior can trigger rapid system collapse within 1 to 7 rounds. These findings demonstrate that uncooperative agents can significantly degrade collective outcomes, highlighting the need for designing more resilient multi-agent systems.

43. 【2511.15848】Step-Audio-R1 Technical Report

Link: https://arxiv.org/abs/2511.15848

Authors: Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: demonstrated remarkable success, Recent advances, reasoning, demonstrated remarkable, remarkable success

Comments: 15 pages, 5 figures. Technical Report

Abstract:Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.

44. 【2511.15719】Chain of Summaries: Summarization Through Iterative Questioning

Link: https://arxiv.org/abs/2511.15719

Authors: William Brach, Lukas Galke Poech

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, external web content, increasingly using external

Comments:

Abstract:Large Language Models (LLMs) are increasingly using external web content. However, much of this content is not easily digestible by LLMs due to LLM-unfriendly formats and limitations of context length. To address this issue, we propose a method for generating general-purpose, information-dense summaries that act as plain-text repositories of web content. Inspired by Hegel's dialectical method, our approach, denoted as Chain of Summaries (CoS), iteratively refines an initial summary (thesis) by identifying its limitations through questioning (antithesis), leading to a general-purpose summary (synthesis) that can satisfy current and anticipate future information needs. Experiments on the TriviaQA, TruthfulQA, and SQuAD datasets demonstrate that CoS outperforms zero-shot LLM baselines by up to 66% and specialized summarization methods such as BRIO and PEGASUS by up to 27%. CoS-generated summaries yield higher QA performance compared to the source content, while requiring substantially fewer tokens and being agnostic to the specific downstream LLM. CoS thus represents an appealing option for website maintainers to make their content more accessible for LLMs, while retaining possibilities for human oversight.

45. 【2511.16639】Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs

Link: https://arxiv.org/abs/2511.16639

Authors: Wei-Cheng Tseng, David Harwath

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

Keywords: Recent advancements, speech synthesis techniques, superior audio compression, enabled superior audio, synthesis techniques

Comments: To be presented at ASRU 2025

Abstract:Recent advancements in neural audio codecs have not only enabled superior audio compression but also enhanced speech synthesis techniques. Researchers are now exploring their potential as universal acoustic feature extractors for a broader range of speech processing tasks. Building on this trend, we introduce Codec2Vec, the first speech representation learning framework that relies exclusively on discrete audio codec units. This approach offers several advantages, including improved data storage and transmission efficiency, faster training, and enhanced data privacy. We explore masked prediction with various training target derivation strategies to thoroughly understand the effectiveness of this framework. Evaluated on the SUPERB benchmark, Codec2Vec achieves competitive performance compared to continuous-input models while reducing storage requirements by up to 16.5x and training time by 2.3x, showcasing its scalability and efficiency.

Information Retrieval

1. 【2511.16576】PolyMinHash: Efficient Area-Based MinHashing of Polygons for Approximate Nearest Neighbor Search

Link: https://arxiv.org/abs/2511.16576

Authors: Alima Subedi, Sankalpa Pokharel, Satish Puri

Categories: Information Retrieval (cs.IR)

Keywords: critical task, nearest neighbor searches, data mining, neighbor searches quickly, nearest neighbor

Comments:

Abstract:Similarity searches are a critical task in data mining. As data sets grow larger, exact nearest neighbor searches quickly become infeasible, leading to the adoption of approximate nearest neighbor (ANN) searches. ANN has been studied for text data, images, and trajectories. However, there has been little effort to develop ANN systems for polygons in spatial database systems and geographic information systems. We present PolyMinHash, a system for approximate polygon similarity search that adapts MinHashing into a novel 2D polygon-hashing scheme to generate short, similarity-preserving signatures of input polygons. Each MinHash value is generated by counting the number of randomly sampled points needed before the sampled point lands within the polygon's interior area, yielding hash values that preserve area-based Jaccard similarity. We present the tradeoff between search accuracy and runtime of our PolyMinHash system. Our hashing mechanism reduces the number of candidates to be processed in the query refinement phase by up to 98% compared to the number of candidates processed by the brute-force algorithm.
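
The sampling-based hash described in this abstract can be sketched as follows. This is a minimal illustration of the general idea, not the PolyMinHash system itself: the ray-casting containment test, the fixed bounding box, the draw budget, the use of one seeded generator per hash function, and all function names are assumptions made for the example.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting containment test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_minhash(polygon, bbox, seed, max_draws=10_000):
    """Count the random points drawn until one falls inside the polygon."""
    xmin, ymin, xmax, ymax = bbox
    rng = random.Random(seed)  # same seed gives every polygon the same point sequence
    for draws in range(1, max_draws + 1):
        x = rng.uniform(xmin, xmax)
        y = rng.uniform(ymin, ymax)
        if point_in_polygon(x, y, polygon):
            return draws
    return max_draws + 1  # sentinel: no hit within the assumed draw budget


def signature(polygon, bbox, num_hashes=200):
    """One hash value per seed; the seeds play the role of hash functions."""
    return [polygon_minhash(polygon, bbox, seed) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions at which the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Two overlapping rectangles whose exact area-based Jaccard similarity is 2 / 6.
bbox = (0.0, 0.0, 4.0, 4.0)
rect_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
rect_b = [(1, 0), (3, 0), (3, 2), (1, 2)]
estimate = estimate_jaccard(signature(rect_a, bbox), signature(rect_b, bbox))
print(f"estimated Jaccard: {estimate:.2f}, exact: {2 / 6:.2f}")
```

Two polygons receive the same hash value exactly when the first shared sample that lands in their union also lands in their intersection, which happens with probability area(A ∩ B) / area(A ∪ B). That is why the fraction of matching signature positions estimates the area-based Jaccard similarity.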

2. 【2511.16543】The Oracle and The Prism: A Decoupled and Efficient Framework for Generative Recommendation Explanation

Link: https://arxiv.org/abs/2511.16543

Authors: Jiaheng Zhang, Daqiang Zhang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, integration of Large, Large Language, suboptimal compromises, systems often leads

Comments: 11 pages, 3 figures

3. 【2511.16528】TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

Link: https://arxiv.org/abs/2511.16528

Authors: Özay Ezerceli, Mahmoud El Hussieni, Selva Taş, Reyhan Bayraktar, Fatma Betül Terzioğlu, Yusuf Çelebi, Yağız Asker

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Neural information retrieval, Neural information, lower-resource languages, information retrieval systems, retrieval systems excel

Comments:

4. 【2511.16478】Music Recommendation with Large Language Models: Challenges, Opportunities, and Evaluation

Link: https://arxiv.org/abs/2511.16478

Authors: Elena V. Epure, Yashar Deldjoo, Bruno Sguerra, Markus Schedl, Manuel Moussallam

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: information-retrieval framing, retrieval-oriented subtasks, long relied, progress is measured, Music Recommender Systems

Comments: Under review with the ACM Transactions on Recommender Systems (TORS)

5. 【2511.16438】ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports

Link: https://arxiv.org/abs/2511.16438

Authors: Sherine George, Nithish Saji

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: corporate sustainability reports, assess explainable ESG, ESG question answering, explainable ESG question, evaluation framework designed

Comments: Workshop paper accepted at AI4DF 2025 (part of ACM ICAIF 2025). 3 pages including tables and figures

6. 【2511.16414】An Efficient LLM-based Evolutional Recommendation with Locate-Forget-Update Paradigm

Link: https://arxiv.org/abs/2511.16414

Authors: Hao Liu, Le Wu, Min Hou, Han Wu, Kun Zhang, Xin Li, Si Wei

Categories: Information Retrieval (cs.IR)

Keywords: LLM-based recommender systems, Large Language Models, Large Language, LLM-based recommender, shown exceptional performance

Comments:

Abstract:Nowadays, Large Language Models (LLMs) have shown exceptional performance in sequential recommendations, and the adoption of LLM-based recommender systems (LLMRec) is becoming increasingly widespread in existing e-commerce platforms. Despite the impressive performance, the constant high volume of new user-item interactions makes it difficult to adapt to the evolution of user preference over time, especially for LLM-based recommender systems. The challenge arises from the large number of parameters in LLMs, which makes traditional evolution methods (i.e., Re-training or Fine-tuning) impractical. Specifically, Re-training with all interactions results in prohibitively high computational costs. On the other hand, fine-tuning with only new interactions leads to preference forgetting among inactive users, ultimately compromising overall performance. To tackle this problem, we propose EvoRec, an efficient Locate-Forget-Update framework designed for LLM-based recommender systems to model the evolution of user preferences. EvoRec identifies a small set of parameters associated with preference changes and updates them precisely, thereby saving computational resources while maintaining strong recommendation performance. Notably, the modified parameters account for only 30\% of LoRA adapter parameters, with no additional parameters introduced. Extensive experiments on two real-world datasets demonstrate that, compared to existing methods, EvoRec not only efficiently evolves LLMRec to adapt to the preferences of active users, but also preserves the interests of inactive users from being disturbed during evolution.

7. 【2511.16326】ARK: Answer-Centric Retriever Tuning via KG-augmented Curriculum Learning

Link: https://arxiv.org/abs/2511.16326

Authors: Jiawei Zhou, Hang Ding, Haiyun Jiang

Categories: Information Retrieval (cs.IR)

Keywords: Retrieval-Augmented Generation, knowledge-intensive tasks, crucial evidence, sparse yet crucial, Generation

Comments: Under Review in ARR

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for knowledge-intensive tasks, yet its effectiveness in long-context scenarios is often bottlenecked by the retriever's inability to distinguish sparse yet crucial evidence. Standard retrievers, optimized for query-document similarity, frequently fail to align with the downstream goal of generating a precise answer. To bridge this gap, we propose a novel fine-tuning framework that optimizes the retriever for Answer Alignment. Specifically, we first identify high-quality positive chunks by evaluating their sufficiency to generate the correct answer. We then employ a curriculum-based contrastive learning scheme to fine-tune the retriever. This curriculum leverages LLM-constructed Knowledge Graphs (KGs) to generate augmented queries, which in turn mine progressively challenging hard negatives. This process trains the retriever to distinguish the answer-sufficient positive chunks from these nuanced distractors, enhancing its generalization. Extensive experiments on 10 datasets from the Ultradomain and LongBench benchmarks demonstrate that our fine-tuned retriever achieves state-of-the-art performance, improving 14.5% over the base model without substantial architectural modifications and maintaining strong efficiency for long-context RAG. Our work presents a robust and effective methodology for building truly answer-centric retrievers.

8. 【2511.16106】Incorporating Token Importance in Multi-Vector Retrieval

Link: https://arxiv.org/abs/2511.16106

Authors: Archish S, Ankit Garg, Kirankumar Shiragur, Neeraj Kayal

Categories: Information Retrieval (cs.IR)

Keywords: independently encodes queries, token-level vector representations, independently encodes, encodes queries, computes similarity

Comments:

Abstract:ColBERT introduced a late interaction mechanism that independently encodes queries and documents using BERT, and computes similarity via fine-grained interactions over token-level vector representations. This design enables expressive matching while allowing efficient computation of scores, as the multi-vector document representations could be pre-computed offline. ColBERT models distance using a Chamfer-style function: for each query token, it selects the closest document token and sums these distances across all query tokens. In our work, we explore enhancements to the Chamfer distance function by computing a weighted sum over query token contributions, where weights reflect the token importance. Empirically, we show that this simple extension, requiring only token-weight training while keeping the multi-vector representations fixed, further enhances the expressiveness of late interaction multi-vector mechanism. In particular, on the BEIR benchmark, our method achieves an average improvement of 1.28\% in Recall@10 in the zero-shot setting using IDF-based weights, and 3.66\% through few-shot fine-tuning.
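
The weighted late-interaction score described in this abstract can be sketched in a few lines. This is a minimal illustration that uses dot-product similarity (the MaxSim form of ColBERT scoring) instead of an explicit distance; the smoothed IDF formula, the toy vectors, and the function names `weighted_maxsim` and `idf_weights` are assumptions made for the example and are not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_maxsim(query_vecs, doc_vecs, weights=None):
    """Sum over query tokens of weight * best similarity to any document token.

    With weights=None every query token counts equally, which reduces to the
    unweighted Chamfer-style (MaxSim) score.
    """
    if weights is None:
        weights = [1.0] * len(query_vecs)
    score = 0.0
    for q, w in zip(query_vecs, weights):
        score += w * max(dot(q, d) for d in doc_vecs)
    return score


def idf_weights(query_tokens, doc_freq, num_docs):
    """Assumed smoothed IDF weighting: rarer tokens get larger weights."""
    return [
        math.log((num_docs + 1) / (doc_freq.get(tok, 0) + 1)) + 1.0
        for tok in query_tokens
    ]


# Toy 2-dimensional token embeddings (hypothetical values).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[1.0, 0.0], [0.5, 0.5]]

print(weighted_maxsim(query_vecs, doc_vecs))              # 1.5
print(weighted_maxsim(query_vecs, doc_vecs, [2.0, 1.0]))  # 2.5
```

Since only the per-token weights change, the document token vectors can still be pre-computed offline, in line with the fixed multi-vector representations the abstract describes.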

9. 【2511.15996】QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation

Link: https://arxiv.org/abs/2511.15996

Authors: Amin Bigdeli, Radin Hamidi Rad, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke, Ebrahim Bagheri

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: large language model, based query reformulation, supports large language, language model, large language

Comments: 4 pages

Computer Vision

1. 【2511.16674】Dataset Distillation for Pre-Trained Self-Supervised Vision Models

Link: https://arxiv.org/abs/2511.16674

Authors: George Cazenavette, Antonio Torralba, Vincent Sitzmann

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: dataset distillation aims, aims to find, find a small, small set, reproduces the performance

Comments: Accepted at NeurIPS 2025. Project page: [this https URL](https://linear-gradient-matching.github.io/) Code: [this https URL](https://github.com/GeorgeCazenavette/linear-gradient-matching)

Abstract:The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models' embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.

2. 【2511.16673】NoPo-Avatar: Generalizable and Animatable Avatars from Sparse Inputs without Human Poses

Link: https://arxiv.org/abs/2511.16673

Authors: Jing Wen, Alexander G. Schwing, Shenlong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recovering an animatable, set of images, sparse set, images, human

Comments: NeurIPS'25; project page: [this https URL](https://wenj.github.io/NoPo-Avatar/)

Abstract:We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate "ground-truth" camera poses and human poses as input to guide reconstruction at test-time. We show that pose-dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable. Experiments on challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).

3. 【2511.16672】EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Link: https://arxiv.org/abs/2511.16672

Authors: Omkat Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: externally verified reward, enabled impressive reasoning, Recent advances, existing training pipelines, perception abilities

Comments: 9 Pages, 6 Figures, 4 Tables

Abstract:Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to $\sim$3\% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at this https URL.

4. 【2511.16671】Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

Link: https://arxiv.org/abs/2511.16671

Authors: Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, Pheng-Ann Heng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Recent advances, increasingly explored, explored the integration, textual reasoning, generation process

Notes: Project Page: [this https URL](https://think-while-gen.github.io) Code: [this https URL](https://github.com/ZiyuGuo99/Thinking-while-Generating)

5. 【2511.16670】Learning to Think Fast and Slow for Visual Language Models

Link: https://arxiv.org/abs/2511.16670

Authors: Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, Kaiyang Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex problems, confronted with complex, thinking, Abstract, conversely

Notes:

Abstract:When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
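The first stage described above, labeling data by the base model's output length, can be sketched as follows. The median cut-off and the name `label_thinking_modes` are hypothetical choices for illustration; the abstract does not state how the length threshold is chosen.

```python
from statistics import median

def label_thinking_modes(output_lengths):
    """Tag each sample 'fast' or 'slow' from the base model's output length.

    The median threshold is an assumed rule, not taken from the paper.
    """
    threshold = median(output_lengths)
    return ["slow" if n > threshold else "fast" for n in output_lengths]

# Token counts of the pre-trained model's answers to six questions.
lengths = [12, 340, 25, 410, 18, 95]
labels = label_thinking_modes(lengths)
```

In the second stage these labels would accompany the samples during GRPO training; that part is not sketched here.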

6. 【2511.16669】Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

Link: https://arxiv.org/abs/2511.16669

Authors: Junhao Cheng, Liang Hou, Xin Tao, Jing Liao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains largely confined, real-world applications, confined to entertainment, largely confined, video

Notes: Project page: [this https URL](https://video-as-answer.github.io/)

Abstract:While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective output, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at this https URL.

7. 【2511.16668】V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

Link: https://arxiv.org/abs/2511.16668

Authors: Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, Yang You

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: shown surprising zero-shot, Recent progress, zero-shot reasoning abilities, surprising zero-shot reasoning, creating a growing

Notes: Project Page: [this https URL](https://oahzxl.github.io/VReasonBench)

Abstract:Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.

8. 【2511.16666】SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation

Link: https://arxiv.org/abs/2511.16666

Authors: Zhenyuan Qin, Xincheng Shuai, Henghui Ding

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: attracted increasing attention, manipulate visual content, Controllable image generation, enabling users, identity and style

Notes: NeurIPS 2025 (Spotlight), Project Page: [this https URL](https://henghuiding.com/SceneDesigner/)

Abstract:Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network to the pre-trained base model and leverages a new representation, CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at this https URL.

9. 【2511.16662】TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing

Link: https://arxiv.org/abs/2511.16662

Authors: Eddie Pokming Sheung, Qihao Liu, Wufei Ma, Prakhar Kaushik, Jianwen Xie, Alan Yuille

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: textual descriptions remains, increasing demand, textual descriptions, descriptions remains, remains a significant

Notes: 8 pages, 10 figures, Under review at a conference

Abstract:With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation. Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.

10. 【2511.16659】PartUV: Part-Based UV Unwrapping of 3D Meshes

Link: https://arxiv.org/abs/2511.16659

Authors: Zhaoning Wang, Xinyue Wei, Ruoxi Shi, Xiaoshuai Zhang, Hao Su, Minghua Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Graphics (cs.GR)

Keywords: complex surface, requiring the complex, decomposed into multiple, unwrapping flattens, unwrapping methods frequently

Notes: project page: [this https URL](https://www.zhaoningwang.com/PartUV)

Abstract:UV unwrapping flattens 3D surfaces to 2D with minimal distortion, often requiring the complex surface to be decomposed into multiple charts. Although extensively studied, existing UV unwrapping methods frequently struggle with AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. These methods often produce highly fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. We introduce PartUV, a part-based UV unwrapping pipeline that generates significantly fewer, part-aligned charts while maintaining low distortion. Built on top of a recent learning-based part decomposition method PartField, PartUV combines high-level semantic part decomposition with novel geometric heuristics in a top-down recursive framework. It ensures each chart's distortion remains below a user-specified threshold while minimizing the total number of charts. The pipeline integrates and extends parameterization and packing algorithms, incorporates dedicated handling of non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Evaluated across four diverse datasets, including man-made, CAD, AI-generated, and Common Shapes, PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing. Our project page is at this https URL.

11. 【2511.16655】Solving Spatial Supersensing Without Spatial Supersensing

Link: https://arxiv.org/abs/2511.16655

Authors: Vishaal Udandarao, Shyamgopal Karthik, Surabhi S. Nath, Andreas Hochlehnert, Matthias Bethge, Ameya Prabhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: bespoke predictive sensing, predictive sensing inference, sensing inference strategies, bespoke predictive, steps towards improving

Notes: Tech Report

Abstract:Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: We concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, the Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark that rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than from robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: this https URL.
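The logic of the VSC-Repeat sanity check can be illustrated with a toy model. Everything below is a hypothetical simplification: a "video" is a list of rooms, each room a set of object IDs, and `count_assuming_no_revisits` is an invented stand-in for any counter that relies on rooms never being revisited. It is not the Cambrian-S inference algorithm.

```python
def repeat_video(rooms, extra_copies):
    """Concatenate a toy video with itself `extra_copies` more times."""
    return rooms * (extra_copies + 1)

def count_unique_objects(rooms):
    """Scene-aware count: objects seen again on a revisit are not recounted."""
    seen = set()
    for objects in rooms:
        seen |= objects
    return len(seen)

def count_assuming_no_revisits(rooms):
    """Hypothetical shortcut: treat every room as new and sum its objects."""
    return sum(len(objects) for objects in rooms)

video = [{"chair_1", "chair_2"}, {"table_1"}]
repeated = repeat_video(video, extra_copies=2)

true_count = count_unique_objects(repeated)
shortcut_count = count_assuming_no_revisits(repeated)
```

On the original video both counters agree; after repetition only the shortcut counter inflates, which is the kind of failure the perturbation is designed to expose.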

12. 【2511.16653】Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation

Link: https://arxiv.org/abs/2511.16653

Authors: Md. Samiul Alim, Sharjil Khan, Amrijit Biswas, Fuad Rahman, Shafin Rahman, Nabeel Mohammed

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: deep neural networks, significant computational overhead, compressing deep neural, Unstructured pruning remains, neural networks

Notes: Accepted at 2025 IEEE International Conference on Big Data (IEEE BigData 2025)

Abstract:Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.
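A minimal sketch of one-shot global pruning follows, assuming per-weight importance scores are already available. How the paper derives those scores from teacher-informed gradients is not reproduced here; the scores, layer names, and helper names are placeholders.

```python
def global_prune_masks(scores, sparsity):
    """Build 0/1 masks that drop the globally lowest-scoring weights.

    `scores` maps a layer name to a list of importance scores. A single
    threshold is shared across layers, so sparse layers emerge where scores
    are low. Ties at the threshold are pruned too, which can slightly
    exceed the requested sparsity.
    """
    flat = sorted(s for layer in scores.values() for s in layer)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return {name: [1] * len(layer) for name, layer in scores.items()}
    threshold = flat[n_prune - 1]
    return {name: [0 if s <= threshold else 1 for s in layer]
            for name, layer in scores.items()}

def apply_masks(weights, masks):
    """Re-apply masks after an update so pruned weights stay at zero."""
    return {name: [w * m for w, m in zip(weights[name], masks[name])]
            for name in weights}

scores = {"conv1": [0.9, 0.1, 0.5], "fc": [0.05, 0.7, 0.3]}
masks = global_prune_masks(scores, sparsity=0.5)
```

Calling `apply_masks` after every retraining step is one simple way to keep pruned connections from being reactivated.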

13. 【2511.16650】Late-decoupled 3D Hierarchical Semantic Segmentation with Semantic Prototype Discrimination based Bi-branch Supervision

Link: https://arxiv.org/abs/2511.16650

Authors: Shuyu Cao, Chongshou Li, Jie Xu, Tianrui Li, Na Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: embodied intelligence applications, crucial for embodied, embodied intelligence, intelligence applications, applications that demand

Notes:

Abstract:3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand a multi-grained and multi-hierarchy understanding of 3D scenes. Despite the progress, previous 3DHS methods have overlooked the following two challenges: I) multi-label learning with a parameter-sharing model can lead to multi-hierarchy conflicts in cross-hierarchy optimization, and II) the class imbalance issue is inevitable across multiple hierarchies of 3D scenes, which causes model performance to be dominated by major classes. To address these issues, we propose a novel framework with a primary 3DHS branch and an auxiliary discrimination branch. Specifically, to alleviate the multi-hierarchy conflicts, we propose a late-decoupled 3DHS framework which employs multiple decoders with the coarse-to-fine hierarchical guidance and consistency. The late-decoupled architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and can also constrain the class imbalance problem in each individual hierarchy. Moreover, we introduce a 3DHS-oriented semantic prototype based bi-branch supervision mechanism, which additionally learns class-wise discriminative point cloud features and performs mutual supervision between the auxiliary and 3DHS branches, to enhance the class-imbalance segmentation. Extensive experiments on multiple datasets and backbones demonstrate that our approach achieves state-of-the-art 3DHS performance, and its core components can also be used as a plug-and-play enhancement to improve previous methods.

14. 【2511.16642】TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming

Link: https://arxiv.org/abs/2511.16642

Authors: Zeyuan Yin, Xiaoming Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian diffusion models, post-denoising processing due, Recent advances, diffusion models suffer, Gaussian diffusion

Notes: NeurIPS 2025

Abstract:Recent advances in 3D Gaussian diffusion models suffer from time-intensive denoising and post-denoising processing due to the massive number of Gaussian primitives, resulting in slow generation and limited scalability along sampling trajectories. To improve the efficiency of 3D diffusion models, we propose TRIM (Trajectory Reduction and Instance Mask denoising), a post-training approach that incorporates both temporal and spatial trimming strategies, to accelerate inference without compromising output quality while supporting the inference-time scaling for Gaussian diffusion models. Instead of scaling denoising trajectories in a costly end-to-end manner, we develop a lightweight selector model to evaluate latent Gaussian primitives derived from multiple sampled noises, enabling early trajectory reduction by selecting candidates with high-quality potential. Furthermore, we introduce instance mask denoising to prune learnable Gaussian primitives by filtering out redundant background regions, reducing inference computation at each denoising step. Extensive experiments and analysis demonstrate that TRIM significantly improves both the efficiency and quality of 3D generation. Source code is available at this https URL.

15. 【2511.16635】SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction

Link: https://arxiv.org/abs/2511.16635

Authors: Guolin Huang, Wenting Chen, Jiaqi Yang, Xinheng Lyu, Xiaoling Luo, Sen Yang, Xiaohan Xing, Linlin Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: existing methods lack, treatment planning, clinical adoption, critical for cancer, cancer prognosis

Notes: 20 pages

16. 【2511.16624】SAM 3D: 3Dfy Anything in Images

Link: https://arxiv.org/abs/2511.16624

Authors: SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, Jitendra Malik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: present SAM, predicting geometry, single image, visually grounded, providing visually grounded

Notes: Website: [this https URL](https://ai.meta.com/sam3d/)

Abstract:We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

17. 【2511.16623】Adaptive Guided Upsampling for Low-light Image Enhancement

Link: https://arxiv.org/abs/2511.16623

Authors: Angela Vivian Dcosta, Chunbo Song, Rafael Radkowski

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: Adaptive Guided Upsampling, introduce Adaptive Guided, introduce Adaptive, Guided Upsampling, Adaptive Guided

Notes: 18 pages, 12 figures

Abstract:We introduce Adaptive Guided Upsampling (AGU), an efficient method for upscaling low-light images capable of optimizing multiple image quality characteristics at the same time, such as reducing noise and increasing sharpness. It is based on a guided image method, which transfers image characteristics from a guidance image to the target image. Using state-of-the-art guided methods, low-light images lack sufficient characteristics for this purpose due to their high noise level and low brightness, rendering suboptimal or not significantly improved images in the process. We solve this problem with multi-parameter optimization, learning the association between multiple low-light and bright image characteristics. Our proposed machine learning method learns these characteristics from a few sample image pairs. AGU can render high-quality images in real time using low-quality, low-resolution input; our experiments demonstrate that it is superior to state-of-the-art methods in the addressed low-light use case.

18. 【2511.16619】Improving Long-Tailed Object Detection with Balanced Group Softmax and Metric Learning

Link: https://arxiv.org/abs/2511.16619

Authors: Satyam Gaba

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: widely explored, explored for class-balanced, Object detection, Balanced Group Softmax, COCO

Notes: 8 pages, 7 figures, International Conference on Semantic Computing

Abstract:Object detection has been widely explored for class-balanced datasets such as COCO. However, real-world scenarios introduce the challenge of long-tailed distributions, where numerous categories contain only a few instances. This inherent class imbalance biases detection models towards the more frequent classes, degrading performance on rare categories. In this paper, we tackle the problem of long-tailed 2D object detection using the LVISv1 dataset, which consists of 1,203 categories and 164,000 images. We employ a two-stage Faster R-CNN architecture and propose enhancements to the Balanced Group Softmax (BAGS) framework to mitigate class imbalance. Our approach achieves a new state-of-the-art performance with a mean Average Precision (mAP) of 24.5%, surpassing the previous benchmark of 24.0%. Additionally, we hypothesize that tail class features may form smaller, denser clusters within the feature space of head classes, making classification challenging for regression-based classifiers. To address this issue, we explore metric learning to produce feature embeddings that are both well-separated across classes and tightly clustered within each class. For inference, we utilize a k-Nearest Neighbors (k-NN) approach to improve classification performance, particularly for rare classes. Our results demonstrate the effectiveness of these methods in advancing long-tailed object detection.
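The k-NN inference step over learned embeddings can be sketched as below. The 2-D embeddings, the Euclidean distance, and the majority vote are illustrative assumptions; the abstract does not give the embedding dimension, distance metric, or value of k used in the paper.

```python
import math
from collections import Counter

def knn_predict(query, gallery, labels, k=3):
    """Classify `query` by majority vote among its k nearest gallery embeddings."""
    ranked = sorted(
        (math.dist(query, emb), label) for emb, label in zip(gallery, labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy embeddings: a frequent ("head") class and a rare ("tail") class.
gallery = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
labels = ["head", "head", "head", "tail", "tail"]

rare_prediction = knn_predict((4.9, 5.0), gallery, labels, k=3)
common_prediction = knn_predict((0.05, 0.05), gallery, labels, k=3)
```

Because the vote depends only on local neighbours, a tight tail-class cluster can win even when the head class has more gallery samples overall, provided k is small enough.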

19. 【2511.16618】SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

Link: https://arxiv.org/abs/2511.16618

Authors: Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Tissues and Organs (q-bio.TO)

Keywords: Video Object Segmentation, Surgical video segmentation, enabling precise localization, Interactive Video Object, computer-assisted surgery

Notes: 11 pages, 4 figures

Abstract:Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}\&\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}\&\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at this https URL.

20. 【2511.16617】Generative AI for Enhanced Wildfire Detection: Bridging the Synthetic-Real Domain Gap

Link: https://arxiv.org/abs/2511.16617

Authors: Satyam Gaba

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: critical environmental challenge, mitigating large-scale damage, environmental challenge, large-scale damage, critical environmental

Notes: 8 pages, 16 figures

Abstract:The early detection of wildfires is a critical environmental challenge, with timely identification of smoke plumes being key to mitigating large-scale damage. While deep neural networks have proven highly effective for localization tasks, the scarcity of large, annotated datasets for smoke detection limits their potential. In response, we leverage generative AI techniques to address this data limitation by synthesizing a comprehensive, annotated smoke dataset. We then explore unsupervised domain adaptation methods for smoke plume segmentation, analyzing their effectiveness in closing the gap between synthetic and real-world data. To further refine performance, we integrate advanced generative approaches such as style transfer, Generative Adversarial Networks (GANs), and image matting. These methods aim to enhance the realism of synthetic data and bridge the domain disparity, paving the way for more accurate and scalable wildfire detection models.

21. 【2511.16595】TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

Link: https://arxiv.org/abs/2511.16595

Authors: Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: vision-language model designed, designed to tackle, tackle challenges, Processing long videos, long videos demands

Notes: Project page: [this https URL](https://xuboshen.github.io/TimeViper)

22. 【2511.16593】Green Resilience of Cyber-Physical Systems: Doctoral Dissertation

Link: https://arxiv.org/abs/2511.16593

Authors: Diaeddin Rimawi

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Cyber-physical systems, combine computational, physical components, computational and physical, Cyber-physical

Notes:

Abstract:Cyber-physical systems (CPS) combine computational and physical components. An Online Collaborative AI System (OL-CAIS) is a type of CPS that learns online in collaboration with humans to achieve a common goal, which makes it vulnerable to disruptive events that degrade performance. Decision-makers must therefore restore performance while limiting energy impact, creating a trade-off between resilience and greenness. This research addresses how to balance these two properties in OL-CAIS. It aims to model resilience for automatic state detection, develop agent-based policies that optimize the greenness-resilience trade-off, and understand catastrophic forgetting to maintain performance consistency. We model OL-CAIS behavior through three operational states: steady, disruptive, and final. To support recovery during disruptions, we introduce the GResilience framework, which provides recovery strategies through multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent). We also design a measurement framework to quantify resilience and greenness. Empirical evaluation uses real and simulated experiments with a collaborative robot learning object classification from human demonstrations. Results show that the resilience model captures performance transitions during disruptions, and that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency. RL-agent policies achieve the strongest results, although with a marginal increase in CO2 emissions. We also observe catastrophic forgetting after repeated disruptions, while our policies help maintain steadiness. A comparison with containerized execution shows that containerization cuts CO2 emissions by half. Overall, this research provides models, metrics, and policies that ensure the green recovery of OL-CAIS.

23. 【2511.16574】Erase to Retain: Low Rank Adaptation Guided Selective Unlearning in Medical Segmentation Networks

Link: https://arxiv.org/abs/2511.16574

Authors: Nirjhor Datta,Md. Golam Rabiul Alam

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: selectively remove knowledge, continual dataset revision, ethical deployment, privacy compliance, ability to selectively

Comments:

Click to view abstract

Abstract:The ability to selectively remove knowledge from medical segmentation networks is increasingly important for privacy compliance, ethical deployment, and continual dataset revision. We introduce Erase to Retain, a controllable unlearning framework for medical image segmentation that achieves targeted forgetting without full retraining. Our method uses a teacher-student distillation paradigm with Low-Rank Adaptation (LoRA) constrained subspace updates, enabling the student network to erase lesion-specific or class-specific representations in low-rank decoder spaces while preserving global anatomical understanding. During the strong unlearning phase, LoRA modules are adversarially optimized to contradict the teacher's confident predictions on a designated forget subset, enforcing semantic removal. This is followed by a gentle restoration phase that recovers generalization on retained data through head-only supervised refinement. For ISIC segmentation, the student reduces forget-set IoU from 0.875 to 0.509 while maintaining competitive performance on the retain and validation splits (0.647 to 0.677 IoU). On the cross-domain CHASE dataset, Erase to Retain consistently lowers forget-set IoU while preserving utility on retain and validation sets. For ISIC classification, our method decreases accuracy on the forget subset from 87.0 percent to 64.1 percent while improving retain accuracy from 83.9 percent to 90.6 percent. These results demonstrate that LoRA-based subspace unlearning provides a practical pathway toward responsible, controllable, and reversible unlearning in medical image analysis, enabling models to forget sensitive samples or structures while preserving performance where it matters most.

24. 【2511.16567】POMA-3D: The Point Map Way to 3D Scene Understanding

Link: https://arxiv.org/abs/2511.16567

Authors: Ye Mao,Weixun Luo,Ranran Huang,Junpeng Jing,Krystian Mikolajczyk

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: representation model learned, point maps, point map, point, model learned

Comments: 11 pages, 6 tables, 5 figures

Click to view abstract

Abstract:In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: this https URL
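As a rough illustration of what a point map is, the sketch below back-projects a depth image into a grid of 3D coordinates using a pinhole camera model. The function name and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for this example; the abstract does not say how POMA-3D or ScenePoint actually build their point maps.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def depth_to_point_map(depth: List[List[float]], fx: float, fy: float,
                       cx: float, cy: float) -> List[List[Point]]:
    """Back-project an H x W depth image into an H x W grid of (X, Y, Z) points.

    Assumes a pinhole camera; this is an illustrative choice, not the paper's.
    """
    point_map = []
    for v, row in enumerate(depth):        # v: pixel row index
        out_row = []
        for u, z in enumerate(row):        # u: pixel column index
            x = (u - cx) * z / fx          # camera-space X
            y = (v - cy) * z / fy          # camera-space Y
            out_row.append((x, y, z))
        point_map.append(out_row)
    return point_map


# Toy 2 x 2 depth image, every pixel 2 units from the camera.
toy_map = depth_to_point_map([[2.0, 2.0], [2.0, 2.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

The result keeps the 2D grid layout of the image while each cell stores a 3D coordinate, which is the property the abstract relies on for compatibility with 2D foundation models.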

25. 【2511.16566】NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening

Link: https://arxiv.org/abs/2511.16566

Authors: Misaal Khan,Mayank Vatsa,Kuldeep Singh,Richa Singh

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Child malnutrition remains, existing screening methods, hindering early intervention, Child malnutrition, global crisis

Comments: Accepted in AAAI 2026 Special Track on AI for Social Impact

Click to view abstract

Abstract:Child malnutrition remains a global crisis, yet existing screening methods are laborious and poorly scalable, hindering early intervention. In this work, we present NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to enable robust malnutrition detection and anthropometric prediction from children's images, simultaneously addressing generalizability and class imbalance. In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency, confirming its deployment readiness in low-resource settings. Trained and tested on 2,141 children from AnthroVision and additionally evaluated on diverse cross-continent populations, including ARAN and an in-house collected CampusPose dataset, it achieves 0.79 recall, 0.82 AUC, and significantly lower anthropometric RMSEs, demonstrating reliable measurement in unconstrained pediatric settings. Cross-dataset results show up to 25% recall gain and up to 3.5 cm RMSE reduction using demographically matched knowledge bases. NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments.

26. 【2511.16555】Lite Any Stereo: Efficient Zero-Shot Stereo Matching

Link: https://arxiv.org/abs/2511.16555

Authors: Junpeng Jing,Weixun Luo,Ye Mao,Krystian Mikolajczyk

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, increased model size, significantly increased model, significantly increased, Recent

Comments:

Click to view abstract

Abstract:Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.

27. 【2511.16546】Progressive Supernet Training for Efficient Visual Autoregressive Modeling

Link: https://arxiv.org/abs/2511.16546

Authors: Xiaoyue Chen,Yuling Shi,Kaiyuan Li,Huandong Wang,Yong Li,Xiaodong Gu,Xinlei Chen,Mingbao Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significantly reduce inference, reduce inference steps, Visual Auto-Regressive, models significantly reduce, prediction paradigm

Comments: Submitted to CVPR 2025. 10 pages, 7 figures

Click to view abstract

Abstract:Visual Auto-Regressive (VAR) models significantly reduce inference steps through the "next-scale" prediction paradigm. However, progressive multi-scale generation incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment. We observe a scale-depth asymmetric dependency in VAR: early scales exhibit extreme sensitivity to network depth, while later scales remain robust to depth reduction. Inspired by this, we propose VARiant: by equidistant sampling, we select multiple subnets ranging from 16 to 2 layers from the original 30-layer VAR-d30 network. Early scales are processed by the full network, while later scales use a subnet. The subnets and the full network share weights, enabling flexible depth adjustment within a single model. However, weight sharing between the subnets and the full network can lead to optimization conflicts. To address this, we propose a progressive training strategy that breaks through the Pareto frontier of generation quality for both subnets and the full network under fixed-ratio training, achieving joint optimality. Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30 (FID 1.95), VARiant-d16 and VARiant-d8 achieve nearly equivalent quality (FID 2.05/2.12) while reducing memory consumption by 40-65%. VARiant-d2 achieves 3.5 times speedup and 80% memory reduction at moderate quality cost (FID 2.97). In terms of deployment, VARiant's single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency, catering to diverse application scenarios.
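To make "equidistant sampling" concrete, here is a minimal sketch that picks evenly spaced layer indices from a 30-layer stack. The abstract does not state which layers VARiant keeps, so the rounding rule and the choice to always retain the first and last layer are assumptions, and `equidistant_layers` is a hypothetical helper name.

```python
from typing import List


def equidistant_layers(total_layers: int, subnet_depth: int) -> List[int]:
    """Pick `subnet_depth` layer indices spread evenly over `total_layers` layers."""
    if not 1 <= subnet_depth <= total_layers:
        raise ValueError("subnet_depth must be between 1 and total_layers")
    if subnet_depth == 1:
        return [0]
    last = total_layers - 1
    steps = subnet_depth - 1
    # Integer form of round-half-up(i * last / steps); keeps layers 0 and `last`.
    return [(2 * i * last + steps) // (2 * steps) for i in range(subnet_depth)]


# Subnet depths named in the abstract (d16, d8, d2) drawn from a 30-layer network.
for depth in (16, 8, 2):
    print(depth, equidistant_layers(30, depth))
```

Because every subnet is only a list of indices into the same layer stack, switching depth at runtime amounts to choosing a different index list, which matches the weight-sharing idea described above.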

Submission history: v1 submitted by Xiaoyue Chen on Thu, 20 Nov 2025 16:59:24 UTC (16,764 KB)

28. 【2511.16542】EOGS++: Earth Observation Gaussian Splatting with Internal Camera Refinement and Direct Panchromatic Rendering

Link: https://arxiv.org/abs/2511.16542

Authors: Pierrick Bournez,Luca Savant Aira,Thibaud Ehret,Gabriele Facciolo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Earth Observation Gaussian, Observation Gaussian Splatting, Gaussian Splatting, Earth observation, reduced training times

Comments: 8 pages, ISPRS

Click to view abstract

Abstract:Recently, 3D Gaussian Splatting has been introduced as a compelling alternative to NeRF for Earth observation, offering competitive reconstruction quality with significantly reduced training times. In this work, we extend the Earth Observation Gaussian Splatting (EOGS) framework to propose EOGS++, a novel method tailored for satellite imagery that directly operates on raw high-resolution panchromatic data without requiring external preprocessing. Furthermore, leveraging optical flow techniques we embed bundle adjustment directly within the training process, avoiding reliance on external optimization tools while improving camera pose estimation. We also introduce several improvements to the original implementation, including early stopping and TSDF post-processing, all contributing to sharper reconstructions and better geometric accuracy. Experiments on the IARPA 2016 and DFC2019 datasets demonstrate that EOGS++ achieves state-of-the-art performance in terms of reconstruction quality and efficiency, outperforming the original EOGS method and other NeRF-based methods while maintaining the computational advantages of Gaussian Splatting. Our model demonstrates an improvement from 1.33 to 1.19 mean MAE errors on buildings compared to the original EOGS models.

29. 【2511.16541】Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution

Link: https://arxiv.org/abs/2511.16541

Authors: Jaime Álvarez Urueña,David Camacho,Javier Huertas Tato

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: digital media integrity, posing significant challenges, authentic content, media integrity, enabled the creation

Comments: 17 pages, 6 figures, 6 tables

Click to view abstract

Abstract:The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously ablate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3%, representing a 5.2 percentage point improvement over existing approaches. For the source attribution task, the proposed approach obtains improvements of 14.70% and 4.27% in AUC and OSCR, respectively, in an open-set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
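The second stage described above can be sketched as a plain k-NN vote over precomputed embeddings. This is only an illustration: the cosine similarity, the value of `k`, the toy 2-D embeddings, and the labels are all assumptions, since the abstract specifies none of them.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query: Sequence[float],
                support: List[Tuple[Sequence[float], str]],
                k: int = 3) -> str:
    """Label a query embedding by majority vote among its k most similar neighbours."""
    ranked = sorted(support,
                    key=lambda item: cosine_similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical few-shot support set: (embedding, label) pairs from a frozen encoder.
support_set = [
    ([1.0, 0.0], "real"),
    ([0.9, 0.1], "real"),
    ([0.0, 1.0], "synthetic"),
    ([0.1, 0.9], "synthetic"),
]
prediction = knn_predict([0.8, 0.2], support_set, k=3)
```

Adding a newly released generator only means appending its few labelled embeddings to `support_set`; the encoder is not retrained, which is the motivation given in the abstract.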

ACM classes: I.2.10; I.4.10

30. 【2511.16535】Investigating Optical Flow Computation: From Local Methods to a Multiresolution Horn-Schunck Implementation with Bilinear Interpolation

Link: https://arxiv.org/abs/2511.16535

Authors: Haytham Ziani

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: optical flow computation, flow computation, paper presents, presents an applied, applied analysis

Comments:

Click to view abstract

Abstract:This paper presents an applied analysis of local and global methods, with a focus on the Horn-Schunck algorithm for optical flow computation. We explore the theoretical and practical aspects of local approaches, such as the Lucas-Kanade method, and global techniques such as Horn-Schunck. Additionally, we implement a multiresolution version of the Horn-Schunck algorithm, using bilinear interpolation and prolongation to improve accuracy and convergence. The study investigates the effectiveness of these combined strategies in estimating motion between frames, particularly under varying image conditions.
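For reference, the global method named in the abstract is usually written as the energy and fixed-point update below. This is the textbook Horn-Schunck form, not a formula quoted from the paper. The notation is assumed: $I_x, I_y, I_t$ are image derivatives, $\alpha$ is the smoothness weight, and $\bar{u}^{k}, \bar{v}^{k}$ are local averages of the current flow estimate.

```latex
% Horn-Schunck energy: brightness-constancy term plus smoothness term
E(u, v) = \iint \Big[ (I_x u + I_y v + I_t)^2
          + \alpha^2 \big( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \big) \Big] \, dx \, dy

% Iterative update at each pixel
u^{k+1} = \bar{u}^{k} - \frac{I_x \big( I_x \bar{u}^{k} + I_y \bar{v}^{k} + I_t \big)}{\alpha^2 + I_x^2 + I_y^2},
\qquad
v^{k+1} = \bar{v}^{k} - \frac{I_y \big( I_x \bar{u}^{k} + I_y \bar{v}^{k} + I_t \big)}{\alpha^2 + I_x^2 + I_y^2}
```

In a multiresolution scheme, the flow estimated at a coarse level is typically upsampled (here by bilinear interpolation) and rescaled by the resolution ratio before it initializes the next finer level. Whether the paper follows exactly this convention is an assumption.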

31. 【2511.16532】Enhancing Multi-Camera Gymnast Tracking Through Domain Knowledge Integration

Link: https://arxiv.org/abs/2511.16532

Authors: Fan Yang,Shigeyuki Odashima,Shoichi Masui,Ikuo Kusajima,Sosuke Yamao,Shan Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: robust multi-camera gymnast, gymnastics, multi-camera gymnast, robust multi-camera, international gymnastics

Comments:

Click to view abstract

Abstract:We present a robust multi-camera gymnast tracking method, which has been applied at international gymnastics championships for gymnastics judging. Despite considerable progress in multi-camera tracking algorithms, tracking gymnasts presents unique challenges: (i) due to space restrictions, only a limited number of cameras can be installed in the gymnastics stadium; and (ii) due to variations in lighting, background, uniforms, and occlusions, multi-camera gymnast detection may fail in certain views and only provide valid detections from two opposing views. These factors complicate the accurate determination of a gymnast's 3D trajectory using conventional multi-camera triangulation. To alleviate this issue, we incorporate gymnastics domain knowledge into our tracking solution. Given that a gymnast's 3D center typically lies within a predefined vertical plane during much of their performance, we can apply a ray-plane intersection to generate coplanar 3D trajectory candidates for opposing-view detections. More specifically, we propose a novel cascaded data association (DA) paradigm that employs triangulation to generate 3D trajectory candidates when cross-view detections are sufficient, and resorts to the ray-plane intersection when they are insufficient. Consequently, coplanar candidates are used to compensate for uncertain trajectories, thereby minimizing tracking failures. The robustness of our method is validated through extensive experimentation, demonstrating its superiority over existing methods in challenging scenarios. Furthermore, our gymnastics judging system, equipped with this tracking method, has been successfully applied to recent Gymnastics World Championships, earning significant recognition from the International Gymnastics Federation.
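The ray-plane intersection mentioned above is standard geometry and can be sketched in a few lines. The function and argument names are hypothetical, and the camera ray and vertical plane in the usage lines are made-up values, not taken from the paper.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def ray_plane_intersection(origin: Sequence[float], direction: Sequence[float],
                           plane_point: Sequence[float], plane_normal: Sequence[float],
                           eps: float = 1e-9) -> Optional[Vec3]:
    """Return the point where the ray origin + t * direction (t >= 0) hits the plane.

    Returns None if the ray is parallel to the plane or the plane lies behind the origin.
    """
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:
        return None                          # ray runs parallel to the plane
    offset = [p - o for p, o in zip(plane_point, origin)]
    t = dot(offset, plane_normal) / denom
    if t < 0:
        return None                          # intersection is behind the camera
    x, y, z = (o + t * d for o, d in zip(origin, direction))
    return (x, y, z)


# Camera at the origin looking along (1, 0, 1); vertical plane x = 2.
hit = ray_plane_intersection((0.0, 0.0, 0.0), (1.0, 0.0, 1.0),
                             (2.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

A single detection back-projected as a ray therefore yields a 3D candidate without a second view, which is what makes the fallback usable when triangulation lacks sufficient cross-view detections.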

32. 【2511.16527】Contrastive vision-language learning with paraphrasing and negation

Link: https://arxiv.org/abs/2511.16527

Authors: Kwun Ho Ngan,Saman Sadeghi Afgeh,Joe Townsend,Artur d'Avila Garcez

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Contrastive Language-Image Pre-training, CLIP, vision-language models continue, Contrastive, text retrieval

Comments:

Click to view abstract

Abstract:Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP's performance while increasing considerably the distances to negated captions. On the CC-Neg benchmark using an original over negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP's performance is generally better than the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.

33. 【2511.16524】BoxingVI: A Multi-Modal Benchmark for Boxing Action Recognition and Localization

Link: https://arxiv.org/abs/2511.16524

Authors: Rahul Kumar,Vipul Baghel,Sudhanshu Singh,Bikash Kumar Badatya,Shivam Yadav,Babji Srinivasan,Ravi Hegde

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: major bottleneck due, robust datasets remains, recent years, unstructured nature, combat sports

Comments:

Click to view abstract

Abstract:Accurate analysis of combat sports using computer vision has gained traction in recent years, yet the development of robust datasets remains a major bottleneck due to the dynamic, unstructured nature of actions and variations in recording environments. In this work, we present a comprehensive, well-annotated video dataset tailored for punch detection and classification in boxing. The dataset comprises 6,915 high-quality punch clips categorized into six distinct punch types, extracted from 20 publicly available YouTube sparring sessions and involving 18 different athletes. Each clip is manually segmented and labeled to ensure precise temporal boundaries and class consistency, capturing a wide range of motion styles, camera angles, and athlete physiques. This dataset is specifically curated to support research in real-time vision-based action recognition, especially in low-resource and unconstrained environments. By providing a rich benchmark with diverse punch examples, this contribution aims to accelerate progress in movement analysis, automated coaching, and performance assessment within boxing and related domains.

34. 【2511.16521】YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras

Link: https://arxiv.org/abs/2511.16521

Authors: Fan Yang,Sosuke Yamao,Ikuo Kusajima,Atsunori Moteki,Shoichi Masui,Shan Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual capturing opens, scene layout, capturing opens, wide range, indoor visual capturing

Comments:

Click to view abstract

Abstract:Using ceiling-mounted cameras (CMCs) for indoor visual capturing opens up a wide range of applications. However, registering CMCs to the target scene layout presents a challenging task. While manual registration with specialized tools is inefficient and costly, automatic registration with visual localization may yield poor results when visual ambiguity exists. To alleviate these issues, we propose a novel solution for jointly mapping an indoor scene and registering CMCs to the scene layout. Our approach involves equipping a mobile agent with a head-mounted RGB-D camera to traverse the entire scene once and synchronize CMCs to capture this mobile agent. The egocentric videos generate world-coordinate agent trajectories and the scene layout, while the videos of CMCs provide pseudo-scale agent trajectories and CMC relative poses. By correlating all the trajectories with their corresponding timestamps, the CMC relative poses can be aligned to the world-coordinate scene layout. Based on this initialization, a factor graph is customized to enable the joint optimization of ego-camera poses, scene layout, and CMC poses. We also develop a new dataset, setting the first benchmark for collaborative scene mapping and CMC registration (this https URL). Experimental results indicate that our method not only effectively accomplishes two tasks within a unified framework, but also jointly enhances their performance. We thus provide a reliable tool to facilitate downstream position-aware applications.

35. 【2511.16518】MiMo-Embodied: X-Embodied Foundation Model Technical Report

Link: https://arxiv.org/abs/2511.16518

Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving benchmarks, Autonomous Driving, cross-embodied foundation model, Driving Planning, integrate and achieve

Comments: Code: [this https URL](https://github.com/XiaomiMiMo/MiMo-Embodied) Model: [this https URL](https://huggingface.co/XiaomiMiMo/MiMo-Embodied-7B)

Click to view abstract

36. 【2511.16498】Acquisition Time-Informed Breast Tumor Segmentation from Dynamic Contrast-Enhanced MRI

Link: https://arxiv.org/abs/2511.16498

Authors: Rui Wang,Yuexi Du,John Lewin,R. Todd Constable,Nicha C. Dvornek

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Dynamic contrast-enhanced magnetic, contrast-enhanced magnetic resonance, breast cancer screening, magnetic resonance imaging, plays an important

Comments: 5 pages, 3 figures

Click to view abstract

Abstract:Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays an important role in breast cancer screening, tumor assessment, and treatment planning and monitoring. The dynamic changes in contrast in different tissues help to highlight the tumor in post-contrast images. However, varying acquisition protocols and individual factors result in large variation in the appearance of tissues, even for images acquired in the same phase (e.g., first post-contrast phase), making automated tumor segmentation challenging. Here, we propose a tumor segmentation method that leverages knowledge of the image acquisition time to modulate model features according to the specific acquisition sequence. We incorporate the acquisition times using feature-wise linear modulation (FiLM) layers, a lightweight method for incorporating temporal information that also allows for capitalizing on the full, variable number of images acquired per imaging study. We trained baseline and different configurations for the time-modulated models with varying backbone architectures on a large public multisite breast DCE-MRI dataset. Evaluation on in-domain images and a public out-of-domain dataset showed that incorporating knowledge of phase acquisition time improved tumor segmentation performance and model generalization.
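FiLM applies a per-channel scale and shift whose values depend on a conditioning input, here the acquisition time. The sketch below shows that operation in plain Python. As a stand-in for a learned network, it derives the scale `gamma` and shift `beta` from the time `t` with a simple linear map; that map, its weights, and the function names are assumptions, not details from the paper.

```python
from typing import List, Sequence, Tuple


def film_parameters(t: float,
                    gamma_w: Sequence[float], gamma_b: Sequence[float],
                    beta_w: Sequence[float], beta_b: Sequence[float]
                    ) -> Tuple[List[float], List[float]]:
    """Map a scalar acquisition time to one (gamma, beta) pair per feature channel.

    A linear map is used purely for illustration; a real model would learn this.
    """
    gammas = [w * t + b for w, b in zip(gamma_w, gamma_b)]
    betas = [w * t + b for w, b in zip(beta_w, beta_b)]
    return gammas, betas


def film(features: List[List[float]],
         gammas: Sequence[float], betas: Sequence[float]) -> List[List[float]]:
    """Apply FiLM: scale and shift every value of channel c by gammas[c], betas[c]."""
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gammas, betas)]


# Two feature channels, conditioned on a hypothetical acquisition time of 2.0.
gammas, betas = film_parameters(2.0,
                                gamma_w=[0.5, 0.0], gamma_b=[0.0, 1.0],
                                beta_w=[0.0, 1.0], beta_b=[1.0, 0.0])
modulated = film([[1.0, 2.0], [3.0, 4.0]], gammas, betas)
```

Because the modulation is computed per image from its own acquisition time, the same layer can be applied to however many post-contrast images a study contains.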

37. 【2511.16494】Physics-Informed Machine Learning for Efficient Sim-to-Real Data Augmentation in Micro-Object Pose Estimation

Link: https://arxiv.org/abs/2511.16494

Authors: Zongcai Tan,Lan Wei,Dandan Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: autonomous biological studies, high-precision object tracking, Precise pose estimation, enabling high-precision object, Precise pose

Comments:

Click to view abstract

Abstract:Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent this http URL work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame). The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data. Furthermore, our framework generalises to unseen poses, enabling data augmentation and robust pose estimation for novel microrobot configurations without additional training data.

38. 【2511.16484】Flow and Depth Assisted Video Prediction with Latent Transformer

Link: https://arxiv.org/abs/2511.16484

Authors: Eliyas Suleyman,Paul Henderson,Eksan Firkat,Nicolas Pugeault

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video prediction, video prediction models, downstream applications, including robotics, world modeling

Comments:

Click to view abstract

Abstract:Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.

39. 【2511.16471】FastSurfer-CC: A robust, accurate, and comprehensive framework for corpus callosum morphometry

Link: https://arxiv.org/abs/2511.16471

Authors: Clemens Pollak,Kersten Diers,Santiago Estrada,David Kügler,Martin Reuter

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: largest commissural structure, corpus callosum, largest commissural, commissural structure, central focus

Comments:

Click to view abstract

Abstract:The corpus callosum, the largest commissural structure in the human brain, is a central focus in research on aging and neurological diseases. It is also a critical target for interventions such as deep brain stimulation and serves as an important biomarker in clinical trials, including those investigating remyelination therapies. Despite extensive research on corpus callosum segmentation, few publicly available tools provide a comprehensive and automated analysis pipeline. To address this gap, we present FastSurfer-CC, an efficient and fully automated framework for corpus callosum morphometry. FastSurfer-CC automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. We demonstrate that FastSurfer-CC outperforms existing specialized tools across the individual tasks. Moreover, our method reveals statistically significant differences between Huntington's disease patients and healthy controls that are not detected by the current state-of-the-art.

40. 【2511.16470】Arctic-Extract Technical Report

Link: https://arxiv.org/abs/2511.16470

Authors: Mateusz Chiliński,Julita Ołtusek,Wojciech Jaśkowski

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: extracting structural data, question answering, entities and tables, structural data, digital-born business documents

Comments:

Click to view abstract

41. 【2511.16454】LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs

Link: https://arxiv.org/abs/2511.16454

Authors: Doriand Petit,Steve Bourgeois,Vincent Gay-Bellile,Florian Chabot,Loïc Barthe

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains challenging due, Developing a multi-modal, scenes remains challenging, multi-modal language model, language model capable

Comments: Accepted at AAAI'26

Click to view abstract

Abstract:Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLM). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLM using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.

42. 【2511.16449】VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference

Link: https://arxiv.org/abs/2511.16449

Authors: Ziyan Liu,Yeqiu Chen,Hongyi Cai,Tao Lin,Shuo Yang,Zheng Liu,Bo Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: shown great promise, heavy computational cost, streams severely limits, processing continuous visual, continuous visual streams

Comments:

Click to view abstract

Abstract:Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
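
The dual-level criterion described in this abstract can be pictured with a small sketch. The abstract does not specify the smoothing rule or the selection rule, so both are assumptions here: an exponential moving average stands in for "temporal smoothing", and a plain union of two top-k sets stands in for the selection strategy. All function names and parameters (`ema_update`, `select_tokens`, `alpha`, `k_semantic`, `k_action`) are hypothetical and not taken from the paper.

```python
def ema_update(previous, current, alpha=0.5):
    """Smooth per-token action attention over time (assumed: exponential moving average)."""
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(previous, current)]


def top_k_indices(scores, k):
    """Return the indices of the k largest scores as a set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def select_tokens(semantic_scores, action_scores, k_semantic, k_action):
    """Keep tokens that rank highly on either criterion (assumed: union of two top-k sets)."""
    keep = top_k_indices(semantic_scores, k_semantic) | top_k_indices(action_scores, k_action)
    return sorted(keep)


# Toy example with five visual tokens.
semantic = [0.9, 0.1, 0.05, 0.8, 0.2]          # e.g. prefill attention per token
previous_action = [0.0, 0.0, 1.0, 0.0, 0.0]    # smoothed action attention from earlier steps
current_action = [0.0, 0.6, 0.8, 0.0, 0.0]     # action attention at the latest step
smoothed_action = ema_update(previous_action, current_action)
kept = select_tokens(semantic, smoothed_action, k_semantic=2, k_action=1)
```

In this toy case token 2 has a low semantic score but is retained because of its action-level score, which is the kind of token a purely semantic criterion would drop.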

43. 【2511.16440】StreetView-Waste: A Multi-Task Dataset for Urban Waste Management

Link: https://arxiv.org/abs/2511.16440

Authors: Diogo J. Paulo,João Martins,Hugo Proença,João C. Neves

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: smart cities, waste, remains a critical, critical challenge, development of smart

Comments: Accepted at WACV 2026

Click to view abstract

Abstract:Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%. Similarly, while segmenting amorphous litter is challenging, our geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.

44. 【2511.16435】Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation

Link: https://arxiv.org/abs/2511.16435

Authors: Jin Wang,Bingfeng Zhang,Jian Pang,Mengyu Liu,Honglong Chen,Weifeng Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Few-shot segmentation, limited support samples, meta-learning paradigm, support images, support

Comments:

Click to view abstract

Abstract:Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples through a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance to segment untrained classes. In this paper, we argue that the references from support images may not be essential; the key role of the support is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture that utilizes inherent language descriptions of target properties to build a robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs), and then builds refined visual-text prior guidance utilizing multi-modal matching. Meanwhile, because the text-vision modal shift makes it difficult for attribute text to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) module to achieve cross-modal interaction between attribute texts and visual features. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves new state-of-the-art performance. The code will be released.

45. 【2511.16430】Graph Neural Networks for Surgical Scene Segmentation

Link: https://arxiv.org/abs/2511.16430

Authors: Yihan Li,Nikhil Churamani,Maria Robu,Imanol Luengo,Danail Stoyanov

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: preventing surgical complications, Accurate identification, Graph Neural Networks, hepatocystic anatomy, Graph Convolutional Network

Comments: 12 pages, 4 figures, 3 tables

Click to view abstract

Abstract:Purpose: Accurate identification of hepatocystic anatomy is critical to preventing surgical complications during laparoscopic cholecystectomy. Deep learning models often struggle with occlusions, long-range dependencies, and capturing the fine-scale geometry of rare structures. This work addresses these challenges by introducing graph-based segmentation approaches that enhance spatial and semantic understanding in surgical scene analyses. Methods: We propose two segmentation models integrating Vision Transformer (ViT) feature encoders with Graph Neural Networks (GNNs) to explicitly model spatial relationships between anatomical regions. (1) A static k-Nearest Neighbours (k-NN) graph with a Graph Convolutional Network with Initial Residual and Identity Mapping (GCNII) enables stable long-range information propagation. (2) A dynamic Differentiable Graph Generator (DGG) with a Graph Attention Network (GAT) supports adaptive topology learning. Both models are evaluated on the Endoscapes-Seg50 and CholecSeg8k benchmarks. Results: The proposed approaches achieve up to 7-8% improvement in Mean Intersection over Union (mIoU) and 6% improvement in Mean Dice (mDice) scores over state-of-the-art baselines. They produce anatomically coherent predictions, particularly on thin, rare and safety-critical structures. Conclusion: The proposed graph-based segmentation methods enhance both performance and anatomical consistency in surgical scene segmentation. By combining ViT-based global context with graph-based relational reasoning, the models improve interpretability and reliability, paving the way for safer laparoscopic and robot-assisted surgery through a precise identification of critical anatomical features.

46. 【2511.16428】CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

Link: https://arxiv.org/abs/2511.16428

Authors: Samer Abualhanud,Christian Grannemann,Max Mehltretter

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Self-supervised surround-view depth, Self-supervised surround-view, multiple minimally overlapping, estimation enables dense, surround-view depth estimation

Comments:

Click to view abstract

Abstract:Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent between overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, a first depth map is predicted per image and the so-derived 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, where each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves the consistency of depth estimates across images and the overall depth compared to state-of-the-art methods.
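
To make the cylinder projection step concrete, here is a minimal sketch of mapping a 3D point in the shared rig frame onto a unit cylinder and measuring distances on its surface. The abstract does not state the axis convention or the distance measure, so both are assumptions: the cylinder axis is taken to be the y axis through the rig origin, and the distance is the Euclidean distance on the unrolled cylinder with wrap-around in azimuth. The function names are hypothetical.

```python
import math


def project_to_unit_cylinder(point):
    """Project a 3D point along its ray from the origin onto the unit cylinder x^2 + z^2 = 1.

    Returns (azimuth in radians, height along the assumed y axis).
    """
    x, y, z = point
    radius = math.hypot(x, z)
    if radius == 0.0:
        raise ValueError("point lies on the cylinder axis and has no projection")
    azimuth = math.atan2(x, z)
    height = y / radius
    return azimuth, height


def cylinder_distance(a, b):
    """Distance between two cylinder positions, wrapping the azimuth at 2*pi."""
    d_azimuth = abs(a[0] - b[0])
    d_azimuth = min(d_azimuth, 2.0 * math.pi - d_azimuth)
    return math.hypot(d_azimuth, a[1] - b[1])


# Two points seen by different cameras that land close together on the cylinder.
p_front = project_to_unit_cylinder((0.0, 1.0, 2.0))
p_side = project_to_unit_cylinder((2.0, 0.0, 0.0))
```

Because every camera's points end up in the same (azimuth, height) coordinates, pixels from different images can be compared directly, which is what allows a distance-based attention across views.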

47. 【2511.16418】End-to-End Motion Capture from Rigid Body Markers with Geodesic Loss

Link: https://arxiv.org/abs/2511.16418

Authors: Hai Lan,Zongyan Li,Jianmin Hu,Jialing Yang,Houde Dai

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: Marker-based optical motion, marker identification ambiguity, dense marker configurations, faces practical challenges, optical motion capture

Comments: The source code is available at [this https URL](https://github.com/wer010/GLRBM-Mocap)

Click to view abstract

Abstract:Marker-based optical motion capture (MoCap), while long regarded as the gold standard for accuracy, faces practical challenges, such as time-consuming preparation and marker identification ambiguity, due to its reliance on dense marker configurations, which fundamentally limit its scalability. To address this, we introduce a novel fundamental unit for MoCap, the Rigid Body Marker (RBM), which provides unambiguous 6-DoF data and drastically simplifies setup. Leveraging this new data modality, we develop a deep-learning-based regression model that directly estimates SMPL parameters under a geodesic loss. This end-to-end approach matches the performance of optimization-based methods while requiring over an order of magnitude less computation. Trained on synthesized data from the AMASS dataset, our end-to-end model achieves state-of-the-art accuracy in body pose estimation. Real-world data captured using a Vicon optical tracking system further demonstrates the practical viability of our approach. Overall, the results show that combining sparse 6-DoF RBM with a manifold-aware geodesic loss yields a practical and high-fidelity solution for real-time MoCap in graphics, virtual reality, and biomechanics.
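
For readers unfamiliar with geodesic losses on rotations, the sketch below computes the standard geodesic distance on SO(3), i.e. the angle of the relative rotation between two rotation matrices: angle = arccos((trace(R1^T R2) - 1) / 2). The abstract does not give the exact loss used in the paper, so this is the textbook formulation rather than the authors' implementation; `rot_z` and `geodesic_distance` are illustrative helper names.

```python
import math


def rot_z(theta):
    """3x3 rotation matrix for a rotation of theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def geodesic_distance(r_pred, r_true):
    """Angle (radians) of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_pred^T R_true) equals the sum of element-wise products.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    cos_angle = (trace - 1.0) / 2.0
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle)


angle = geodesic_distance(rot_z(0.3), rot_z(0.5))  # approximately 0.2 rad
```

Unlike an element-wise difference between rotation matrices or Euler angles, this distance depends only on the relative rotation, which is why it is described as manifold-aware.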

48. 【2511.16378】CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement

Link: https://arxiv.org/abs/2511.16378

Authors: Pan Yang,Cheng Deng,Jing Yang,Han Zhao,Yun Liu,Yuling Chen,Xiaoli Ruan,Yanping Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Compositional zero-shot learning, Compositional zero-shot, Contrastive Language-Image Pre-training, zero-shot learning, based CZSL methods

Comments:

Click to view abstract

Abstract:Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects in seen compositions and to recognize their unseen compositions. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods focus on disentangling attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited representational capacity and does not allow for complete disentanglement of the two. To this end, we propose CAMS, which aims to extract semantic features from visual features and perform semantic disentanglement in multidimensional spaces, thereby improving generalization over unseen attribute-object compositions. Specifically, CAMS designs a Gated Cross-Attention that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. Subsequently, it conducts Multi-Space Disentanglement to achieve disentanglement of attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at this https URL.

49. 【2511.16364】DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration

Link: https://arxiv.org/abs/2511.16364

Authors: Meng-Cheng Shih,Tsai-Ling Huang,Yu-Heng Shih,Hong-Han Shuai,Hsuan-Tung Liu,Yi-Ren Yeh,Ching-Chun Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: frequently utilized technology, technology in forensics, frequently utilized, utilized technology, OSV

Comments:

Click to view abstract

Abstract:Offline signature verification (OSV) is a frequently utilized technology in forensics. This paper proposes a new model, DetailSemNet, for OSV. Unlike previous methods that rely on holistic features for pair comparisons, our approach underscores the significance of fine-grained differences for robust OSV. We propose to match local structures between two signature images, significantly boosting verification accuracy. Furthermore, we observe that without specific architectural modifications, transformer-based backbones might naturally obscure local details, adversely impacting OSV performance. To address this, we introduce a Detail Semantics Integrator, leveraging feature disentanglement and re-entanglement. This integrator is specifically designed to enhance intricate details while simultaneously expanding discriminative semantics, thereby augmenting the efficacy of local structural matching. We evaluate our method against leading benchmarks in offline signature verification. Our model consistently outperforms recent methods, achieving state-of-the-art results with clear margins. The emphasis on local structure matching not only improves performance but also enhances the model's interpretability, supporting our findings. Additionally, our model demonstrates remarkable generalization capabilities in cross-dataset testing scenarios. The combination of generalizability and interpretability significantly bolsters the potential of DetailSemNet for real-world applications.

50. 【2511.16361】Multi-Order Matching Network for Alignment-Free Depth Super-Resolution

Link: https://arxiv.org/abs/2511.16361

Authors: Zhengxue Wang,Zhiqiang Yan,Yuan Wu,Guangwei Gao,Xiang Li,Jian Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent guided depth, Recent guided, strictly spatial alignment, achieving high-quality depth, high-quality depth reconstruction

Comments:

Click to view abstract

Abstract:Recent guided depth super-resolution methods are premised on the assumption of strict spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves state-of-the-art performance and exhibits outstanding robustness.

51. 【2511.16349】CRISTAL: Real-time Camera Registration in Static LiDAR Scans using Neural Rendering

Link: https://arxiv.org/abs/2511.16349

Authors: Joni Vanherck,Steven Moonen,Brent Zoomers,Kobe Werner,Jeroen Put,Lode Jorissen,Nick Michiels

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabling reliable navigation, Extended Reality, robotics and Extended, Accurate camera localization, enabling reliable

Comments:

Click to view abstract

Abstract:Accurate camera localization is crucial for robotics and Extended Reality (XR), enabling reliable navigation and alignment of virtual and real content. Existing visual methods often suffer from drift and scale ambiguity, and depend on fiducials or loop closure. This work introduces a real-time method for localizing a camera within a pre-captured, highly accurate colored LiDAR point cloud. By rendering synthetic views from this cloud, 2D-3D correspondences are established between live frames and the point cloud. A neural rendering technique narrows the domain gap between synthetic and real images, reducing occlusion and background artifacts to improve feature matching. The result is drift-free camera tracking with correct metric scale in the global LiDAR coordinate system. Two real-time variants are presented: Online Render and Match, and Prebuild and Localize. We demonstrate improved results on the ScanNet++ dataset and outperform existing SLAM pipelines.

52. 【2511.16343】Aerial View River Landform Video segmentation: A Weakly Supervised Context-aware Temporal Consistency Distillation Approach

Link: https://arxiv.org/abs/2511.16343

Authors: Chi-Han Chen,Chieh-Ming Chen,Wen-Huang Cheng,Ching-Chun Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: UAV remote sensing, remote sensing diverges, sensing diverges significantly, ground vehicle patrol, vehicle patrol tasks

Comments:

Click to view abstract

Abstract:The study of terrain and landform classification through UAV remote sensing diverges significantly from ground vehicle patrol tasks. Besides grappling with the complexity of data annotation and ensuring temporal consistency, it also confronts the scarcity of relevant data and the limitations imposed by the effective range of many technologies. This research substantiates that, in aerial positioning tasks, both the mean Intersection over Union (mIoU) and temporal consistency (TC) metrics are of paramount importance. It is demonstrated that fully labeled data is not the optimal choice, as selecting only key data lacks the enhancement in TC, leading to failures. Hence, a teacher-student architecture, coupled with key frame selection and key frame updating algorithms, is proposed. This framework successfully performs weakly supervised learning and TC knowledge distillation, overcoming the deficiencies of traditional TC training in aerial tasks. The experimental results reveal that our method, utilizing merely 30% of the labeled data, concurrently improves mIoU and temporal consistency, ensuring stable localization of terrain objects. Result demo: this https URL

53. 【2511.16341】Arbitrary-Resolution and Arbitrary-Scale Face Super-Resolution with Implicit Representation Networks

Link: https://arxiv.org/abs/2511.16341

Authors: Yi Ting Tsai,Yu Wei Chen,Hong-Han Shuai,Ching-Chun Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enhancing low-resolution facial, Face super-resolution, face-related tasks, critical technique, technique for enhancing

Comments:

Click to view abstract

Abstract:Face super-resolution (FSR) is a critical technique for enhancing low-resolution facial images and has significant implications for face-related tasks. However, existing FSR methods are limited by fixed up-sampling scales and sensitivity to input size variations. To address these limitations, this paper introduces an Arbitrary-Resolution and Arbitrary-Scale FSR method with implicit representation networks (ARASFSR), featuring three novel designs. First, ARASFSR employs 2D deep features, local relative coordinates, and up-sampling scale ratios to predict RGB values for each target pixel, allowing super-resolution at any up-sampling scale. Second, a local frequency estimation module captures high-frequency facial texture information to reduce the spectral bias effect. Lastly, a global coordinate modulation module guides FSR to leverage prior facial structure knowledge and achieve resolution adaptation effectively. Quantitative and qualitative evaluations demonstrate the robustness of ARASFSR over existing state-of-the-art methods while super-resolving facial images across various input sizes and up-sampling scales.

54. 【2511.16322】ChangeDINO: DINOv3-Driven Building Change Detection in Optical Remote Sensing Imagery

Link: https://arxiv.org/abs/2511.16322

Authors: Ching-Heng Cheng,Chih-Chung Hsu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: co-registered bi-temporal images, Remote sensing change, Remote sensing, identify surface changes, sensing change detection

Comments:

Click to view abstract

Abstract:Remote sensing change detection (RSCD) aims to identify surface changes from co-registered bi-temporal images. However, many deep learning-based RSCD methods rely solely on change-map annotations and underuse the semantic information in non-changing regions, which limits robustness under illumination variation, off-nadir views, and scarce labels. This article introduces ChangeDINO, an end-to-end multiscale Siamese framework for optical building change detection. The model fuses a lightweight backbone stream with features transferred from a frozen DINOv3, yielding semantic- and context-rich pyramids even on small datasets. A spatial-spectral differential transformer decoder then exploits multi-scale absolute differences as change priors to highlight true building changes and suppress irrelevant responses. Finally, a learnable morphology module refines the upsampled logits to recover clean boundaries. Experiments on four public benchmarks show that ChangeDINO consistently outperforms recent state-of-the-art methods in IoU and F1, and ablation studies confirm the effectiveness of each component. The source code is available at this https URL.

55. 【2511.16321】WWE-UIE: A Wavelet White Balance Efficient Network for Underwater Image Enhancement

Link: https://arxiv.org/abs/2511.16321

Authors: Ching-Heng Cheng,Jen-Wei Lee,Chia-Ming Lee,Chih-Chung Hsu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Underwater Image Enhancement, correct color distortions, color distortions caused, aims to restore, Underwater Image

Comments:

Click to view abstract

Abstract:Underwater Image Enhancement (UIE) aims to restore visibility and correct color distortions caused by wavelength-dependent absorption and scattering. Recent hybrid approaches, which couple domain priors with modern deep neural architectures, have achieved strong performance but incur high computational cost, limiting their practicality in real-time scenarios. In this work, we propose WWE-UIE, a compact and efficient enhancement network that integrates three interpretable priors. First, adaptive white balance alleviates the strong wavelength-dependent color attenuation, particularly the dominance of blue-green tones. Second, a wavelet-based enhancement block (WEB) performs multi-band decomposition, enabling the network to capture both global structures and fine textures, which are critical for underwater restoration. Third, a gradient-aware module (SGFB) leverages Sobel operators with learnable gating to explicitly preserve edge structures degraded by scattering. Extensive experiments on benchmark datasets demonstrate that WWE-UIE achieves competitive restoration quality with substantially fewer parameters and FLOPs, enabling real-time inference on resource-limited platforms. Ablation studies and visualizations further validate the contribution of each component. The source code is available at this https URL.
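
As background for the gradient-aware module, the sketch below applies the classic 3x3 Sobel operators to a grayscale image and returns the gradient magnitude. It only illustrates the fixed Sobel filtering step; the learnable gating mentioned in the abstract is not modelled, and `sobel_magnitude` is a hypothetical helper name rather than anything from the paper's code.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Gradient magnitude of a 2D grayscale image (list of rows); border pixels are left at 0."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = sum(SOBEL_X[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = math.hypot(gx, gy)
    return out


# A vertical step edge: dark on the left, bright on the right.
step_edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edges = sobel_magnitude(step_edge)
```

The response is large at the interior pixels next to the step and zero on a flat image, which is the edge signal a gated module could then weight.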

56. 【2511.16317】NaTex: Seamless Texture Generation as Latent Color Diffusion

Link: https://arxiv.org/abs/2511.16317

Authors: Zeqiang Lai,Yunfei Zhao,Zibo Zhao,Xin Yang,Xin Huang,Jingwei Huang,Xiangyu Yue,Chunchao Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: framework that predicts, color, texture, predicts texture color, texture color directly

Comments: Technical Report

Click to view abstract

Abstract:We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-aware color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE-DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment. Moreover, NaTex also exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.

57. 【2511.16315】BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks

Link: https://arxiv.org/abs/2511.16315

Authors: Samuel Stevens

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual representation quality, longer predicts performance, linear-probe transfer accuracy, transfer accuracy remains, linear-probe transfer

Comments: Accepted at the 3rd Imageomics Workshop at NeurIPS 2025

Click to view abstract

Abstract:ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at this https URL and results at this https URL.
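
The class-balanced macro-F1 that the benchmark reports can be illustrated with a short sketch. This is the generic definition (per-class F1 averaged with equal weight per class), not BioBench's own API, whose function names are not given in the abstract; `macro_f1` is a hypothetical helper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denominator = 2 * tp + fp + fn
        # A class with no true positives, false positives or false negatives scores 0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Imbalanced toy data: a model that always predicts the majority class.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10
score = macro_f1(y_true, y_pred)
```

Here plain accuracy is 0.8, while macro-F1 is 4/9 (about 0.44) because the rare class contributes an F1 of 0. This is why a class-balanced metric suits long-tailed ecological data.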

58. 【2511.16309】Sparse Autoencoders are Topic Models

Link: https://arxiv.org/abs/2511.16309

Authors: Leander Girrbach,Zeynep Akata

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Sparse autoencoders, Latent Dirichlet Allocation, role and practical, Sparse, extend Latent Dirichlet

Comments:

Click to view abstract

Abstract:Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.

59. 【2511.16301】Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling

Link: https://arxiv.org/abs/2511.16301

Authors: Minseok Seo,Mark Hamilton,Changick Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Foundation Models, lightweight test-time optimization, restores low-resolution features, Foundation Models demonstrate, framework that restores

Comments: 15 pages, 12 figures

Click to view abstract

Abstract:We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only $\approx 0.419$ s per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
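
For context on the "spatial and range cues" mentioned above, the sketch below implements classic Joint Bilateral Upsampling in one dimension with fixed, isotropic Gaussian weights. It is not the paper's method, which learns an anisotropic kernel per image; it only shows how a high-resolution guide signal keeps edges sharp while low-resolution values are upsampled. The function name and the parameters `sigma_s`, `sigma_r` and `radius` are illustrative assumptions.

```python
import math


def joint_bilateral_upsample_1d(low, guide, scale, sigma_s=1.0, sigma_r=0.1, radius=2):
    """Upsample `low` to len(guide) samples; assumes len(guide) == len(low) * scale."""
    out = []
    for p in range(len(guide)):
        center = p / scale                      # position of p in low-resolution coordinates
        nearest = int(round(center))
        numerator = denominator = 0.0
        for q in range(max(0, nearest - radius), min(len(low), nearest + radius + 1)):
            guide_q = guide[min(len(guide) - 1, q * scale)]   # guide value at low-res sample q
            w_spatial = math.exp(-((center - q) ** 2) / (2.0 * sigma_s ** 2))
            w_range = math.exp(-((guide[p] - guide_q) ** 2) / (2.0 * sigma_r ** 2))
            numerator += w_spatial * w_range * low[q]
            denominator += w_spatial * w_range
        # Fall back to the nearest low-res sample if all weights underflow to zero.
        fallback = low[min(len(low) - 1, nearest)]
        out.append(numerator / denominator if denominator > 0.0 else fallback)
    return out


# The guide has a sharp edge between index 3 and 4; the low-res signal changes at the same place.
guide = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
low = [10.0, 10.0, 20.0, 20.0]
upsampled = joint_bilateral_upsample_1d(low, guide, scale=2)
```

With these inputs the output stays close to 10 on the dark side of the edge and close to 20 on the bright side, whereas plain linear interpolation would blend the two values across the boundary.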

60. 【2511.16298】Optimizing 3D Gaussian Splattering for Mobile GPUs

Link: https://arxiv.org/abs/2511.16298

Authors: Md Musfiqur Rahman Sanim,Zhihao Shu,Bahram Afsharmanesh,AmirAli Mirian,Jiexiong Guan,Wei Niu,Bin Ren,Gagan Agrawal

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: transforms multi-view images, surrounding environment, modern applications, transforms multi-view, multi-view images

Abstract:Image-based 3D scene reconstruction, which transforms multi-view images into a structured 3D representation of the surrounding environment, is a common task across many modern applications. 3D Gaussian Splatting (3DGS) is a new paradigm to address this problem and offers considerable efficiency compared to previous methods. Motivated by this, and considering various benefits of mobile device deployment (data privacy, operating without internet connectivity, and potentially faster responses), this paper develops Texture3dgs, an optimized mapping of 3DGS for a mobile GPU. A critical challenge in this area turns out to be optimizing for the two-dimensional (2D) texture cache, which needs to be exploited for faster executions on mobile GPUs. As a sorting method dominates the computations in 3DGS on mobile platforms, the core of Texture3dgs is a novel sorting algorithm where the processing, data movement, and placement are highly optimized for 2D memory. The properties of this algorithm are analyzed in view of a cost model for the texture cache. In addition, we accelerate other steps of the 3DGS algorithm through improved variable layout design and other optimizations. End-to-end evaluation shows that Texture3dgs delivers up to 4.1× and 1.7× speedup for the sorting and overall 3D scene reconstruction, respectively, while also reducing memory usage by up to 1.6×, demonstrating the effectiveness of our design for efficient mobile 3D scene reconstruction.

61. 【2511.16294】Explainable AI for Diabetic Retinopathy Detection Using Deep Learning with Attention Mechanisms and Fuzzy Logic-Based Interpretability

Link: https://arxiv.org/abs/2511.16294

Authors: Abishek Karthik,Pandiyaraju V,Sreya Mynampati

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: agriculture crop management, accurate species identification, Convolutional Neural Networks, Graph Neural Networks, sustainable agriculture crop

Abstract:The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and fits into sustainable agriculture crop management. This paper proposes a hybrid deep learning framework for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method is applied to balance class distributions and better generalize the model. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experiments yield 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment on edge devices for automated weed detection, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.

62. 【2511.16282】Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM

Link: https://arxiv.org/abs/2511.16282

Authors: Gergely Dinya,Péter Halász,András Lőrincz,Kristóf Karacs,Anna Gelencsér-Horváth

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gated Generative Transformers, Vision Gated Generative, Generative Transformers, Vision Gated, Gated Generative

Abstract:We present a fast, spatio-temporal scene understanding framework based on Vision Gated Generative Transformers (VGGT). The proposed pipeline is designed to enable efficient, close to real-time performance, supporting applications including assistive navigation. To achieve continuous updates of the 3D scene representation, we process the image flow with a sliding window, aligning submaps, thereby overcoming VGGT's high memory demands. We exploit the VGGT tracking head to aggregate 2D semantic instance masks into 3D objects. To allow for temporal consistency and richer contextual reasoning, the system stores timestamps and instance-level identities, thereby enabling the detection of changes in the environment. We evaluate the approach on well-known benchmarks and custom datasets specifically designed for assistive navigation scenarios. The results demonstrate the applicability of the framework to real-world scenarios.

63. 【2511.16273】TetraSDF: Precise Mesh Extraction with Multi-resolution Tetrahedral Grid

Link: https://arxiv.org/abs/2511.16273

Authors: Seonghun Oh,Youngjung Uh,Jin-Hwa Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: signed distance functions, neural signed distance, Extracting meshes, distance functions, match the zero-level

Abstract:Extracting meshes that exactly match the zero-level set of neural signed distance functions (SDFs) remains challenging. Sampling-based methods introduce discretization error, while continuous piecewise affine (CPWA) analytic approaches apply only to plain ReLU MLPs. We present TetraSDF, a precise analytic meshing framework for SDFs represented by a ReLU MLP composed with a multi-resolution tetrahedral positional encoder. The encoder's barycentric interpolation preserves global CPWA structure, enabling us to track ReLU linear regions within an encoder-induced polyhedral complex. A fixed analytic input preconditioner derived from the encoder's metric further reduces directional bias and stabilizes training. Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy, and its analytic extractor produces highly self-consistent meshes that remain faithful to the learned isosurfaces, all with practical runtime and memory efficiency.
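One reason piecewise affine structure matters for exact extraction: when a function is affine along an edge (as barycentric interpolation inside a single tetrahedron is), the zero crossing on that edge is found exactly by linear interpolation, with no sampling error. The sketch below shows only this elementary step. The function name is hypothetical, and it does not implement the paper's tracking of ReLU linear regions.

```python
import numpy as np

def edge_zero_crossing(p_a, p_b, f_a, f_b):
    """Point on segment p_a -> p_b where an affine function with endpoint values f_a, f_b is zero.

    Exact only if the function is affine along the edge (an assumption of this sketch).
    """
    if f_a * f_b > 0 or f_a == f_b:
        raise ValueError("endpoint values must have opposite signs")
    t = f_a / (f_a - f_b)  # solve f_a + t * (f_b - f_a) = 0
    return np.asarray(p_a, dtype=float) + t * (np.asarray(p_b, dtype=float) - np.asarray(p_a, dtype=float))
```

For a function that is only approximately affine along the edge, the same formula gives an approximation, which is the discretization error the abstract attributes to sampling-based methods.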

64. 【2511.16264】Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs

Link: https://arxiv.org/abs/2511.16264

Authors: Sinan Mutlu,Georgios F. Angelis,Savas Ozkan,Paul Wisbey,Anastasios Drosou,Mete Ozay

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Realistic and smooth, Head Mounted Devices, smooth full-body tracking, tracking is crucial, crucial for immersive

Abstract:Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track head and hands via Head Mounted Devices (HMDs) and controllers, making the 3D full-body reconstruction incomplete. One potential approach is to generate the full-body motions from sparse inputs collected from limited sensors using a Neural Network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone that is enhanced with residual connections and a novel NN-component called Memory-Block. In particular, Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve the temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing our MLP-backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines by substantially reducing prediction errors. Moreover, it achieves 72 FPS on mobile HMDs, which ultimately improves the trade-off between accuracy and running time.

65. 【2511.16262】How Robot Dogs See the Unseeable

Link: https://arxiv.org/abs/2511.16262

Authors: Oliver Bimber,Karl Dietrich von Ellenrieder,Michael Haller,Rakesh John Amala Arokia Nathan,Gianni Lunardi,Marco Camurri,Mohamed Youssef,Santos Miguel Orozco Soto,Jeremy E. Niven

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: powerful bio-inspired strategy, offers a powerful, estimate distance, powerful bio-inspired, bio-inspired strategy

Abstract:Peering, a side-to-side motion used by animals to estimate distance through motion parallax, offers a powerful bio-inspired strategy to overcome a fundamental limitation in robotic vision: partial occlusion. Conventional robot cameras, with their small apertures and large depth of field, render both foreground obstacles and background objects in sharp focus, causing occluders to obscure critical scene information. This work establishes a formal connection between animal peering and synthetic aperture (SA) sensing from optical imaging. By having a robot execute a peering motion, its camera describes a wide synthetic aperture. Computational integration of the captured images synthesizes an image with an extremely shallow depth of field, effectively blurring out occluding elements while bringing the background into sharp focus. This efficient, wavelength-independent technique enables real-time, high-resolution perception across various spectral bands. We demonstrate that this approach not only restores basic scene understanding but also empowers advanced visual reasoning in large multimodal models, which fail with conventionally occluded imagery. Unlike feature-dependent multi-view 3D vision methods or active sensors like LiDAR, SA sensing via peering is robust to occlusion, computationally efficient, and immediately deployable on any mobile robot. This research bridges animal behavior and robotics, suggesting that peering motions for synthetic aperture sensing are a key to advanced scene understanding in complex, cluttered environments.
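The "computational integration" step can be pictured as shift-and-add: each view is shifted so that the chosen focal plane lines up, then all views are averaged, which keeps that plane sharp and smears anything at a different depth. The following is a minimal sketch under strong assumptions (purely horizontal camera motion, integer pixel disparities, wrap-around borders). The function name and the `offsets_px` parameter are hypothetical, not taken from the paper.

```python
import numpy as np

def synthetic_aperture_focus(images, offsets_px):
    """Average views after undoing each view's horizontal disparity for the focal plane.

    images: list of (H, W) arrays; offsets_px: integer disparity of the focal plane per view.
    """
    acc = np.zeros_like(images[0], dtype=float)
    for img, dx in zip(images, offsets_px):
        # Shift the view so that content on the focal plane aligns across views.
        acc += np.roll(img, -dx, axis=1)
    return acc / len(images)
```

With N views and an occluder whose parallax differs from the background's, the occluder covers any given background pixel in only a few views, so its contribution to the average shrinks roughly as 1/N.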

66. 【2511.16227】SwiTrack: Tri-State Switch for Cross-Modal Object Tracking

Link: https://arxiv.org/abs/2511.16227

Authors: Boyue Xu,Ruichao Hou,Tongwei Ren,Dongming Zhou,Gangshan Wu,Jinde Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: RGB-Near Infrared, video stream switches, emerging task, task that maintains, focusing on RGB-Near

Abstract:Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities, with only one modality available in each frame, mostly focusing on RGB-Near Infrared (RGB-NIR) tracking. Existing methods typically connect parallel RGB and NIR branches to a shared backbone, which limits the comprehensive extraction of distinctive modality-specific features and fails to address the issue of object drift, especially in the presence of unreliable inputs. In this paper, we propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams. Specifically, RGB frames are processed by the visual encoder, while NIR frames undergo refinement via a NIR gated adapter coupled with the visual encoder to progressively calibrate shared latent space features, thereby yielding more robust cross-modal representations. For invalid modalities, a consistency trajectory prediction module leverages spatio-temporal cues to estimate target movement, ensuring robust tracking and mitigating drift. Additionally, we incorporate dynamic template reconstruction to iteratively update template features and employ a similarity alignment loss to reinforce feature consistency. Experimental results on the latest benchmarks demonstrate that our tracker achieves state-of-the-art performance, improving precision rate and success rate by 7.2% and 4.3%, respectively, while maintaining real-time tracking at 65 frames per second. Code and models are available at this https URL.

67. 【2511.16221】Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions

Link: https://arxiv.org/abs/2511.16221

Authors: Caixin Kang,Yifei Huang,Liangyang Ouyang,Mingfang Zhang,Ruicong Liu,Yoichi Sato

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Interactive Deception Assessment, Large Language Models, Multimodal Interactive Deception, complex social interactions

68. 【2511.16213】Unsupervised Image Classification with Adaptive Nearest Neighbor Selection and Cluster Ensembles

Link: https://arxiv.org/abs/2511.16213

Authors: Melih Baydar,Emre Akbas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantically meaningful categories, group unlabeled images, clustering, aims to group, meaningful categories

Abstract:Unsupervised image classification, or image clustering, aims to group unlabeled images into semantically meaningful categories. Early methods integrated representation learning and clustering within an iterative framework. However, the rise of foundational models has recently shifted focus solely to clustering, bypassing the representation learning step. In this work, we build upon a recent multi-head clustering approach by introducing adaptive nearest neighbor selection and cluster ensembling strategies to improve clustering performance. Our method, "Image Clustering through Cluster Ensembles" (ICCE), begins with a clustering stage, where we train multiple clustering heads on a frozen backbone, producing diverse image clusterings. We then employ a cluster ensembling technique to consolidate these potentially conflicting results into a unified consensus clustering. Finally, we train an image classifier using the consensus clustering result as pseudo-labels. ICCE achieves state-of-the-art performance on ten image classification benchmarks, achieving 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet datasets, narrowing the performance gap with supervised methods. To the best of our knowledge, ICCE is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet.
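The abstract does not say which ensembling technique ICCE uses. As a generic illustration of how conflicting clusterings can be merged into one consensus, the sketch below uses a classic co-association approach: count how often each pair of samples lands in the same cluster, then group samples that agree in more than a chosen fraction of runs. The function names and the `threshold` parameter are assumptions for this example.

```python
import numpy as np

def co_association(labelings):
    """Fraction of clusterings in which each pair of samples shares a cluster."""
    L = np.asarray(labelings)  # shape (n_clusterings, n_samples)
    return (L[:, :, None] == L[:, None, :]).mean(axis=0)

def consensus_labels(labelings, threshold=0.5):
    """Connected components of the graph linking pairs co-clustered in more than `threshold` of runs."""
    C = co_association(labelings)
    n = C.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        labels[i] = current
        stack = [i]
        while stack:
            u = stack.pop()
            # Unlabeled samples that agree with u often enough join u's consensus cluster.
            for v in np.nonzero((C[u] > threshold) & (labels < 0))[0]:
                labels[v] = current
                stack.append(v)
        current += 1
    return labels
```

Cluster IDs need not match across runs, since only pairwise agreement is used. For example, `consensus_labels([[0, 0, 1, 1, 2], [1, 1, 0, 0, 0], [0, 0, 1, 1, 1]])` groups the first two samples together and the last three together.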

69. 【2511.16203】When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Link: https://arxiv.org/abs/2511.16203

Authors: Yuping Yan,Yuhan Xie,Yinxin Zhang,Lingjuan Lyu,Yaochu Jin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: recently demonstrated remarkable, demonstrated remarkable progress, unified multimodal understanding, enabling robots, robots to perceive

Abstract:Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.

70. 【2511.16186】PrIntMesh: Precise Intersection Surfaces for 3D Organ Mesh Reconstruction

Link: https://arxiv.org/abs/2511.16186

Authors: Deniz Sayin Mercadier,Hieu Le,Yihong Chen,Jiancheng Yang,Udaranga Wickramasinghe,Pascal Fua

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: spatial relationships constrain, Human organs, composed of interconnected, geometry and spatial, spatial relationships

Comments: 12 pages, 9 figures

Abstract:Human organs are composed of interconnected substructures whose geometry and spatial relationships constrain one another. Yet, most deep-learning approaches treat these parts independently, producing anatomically implausible reconstructions. We introduce PrIntMesh, a template-based, topology-preserving framework that reconstructs organs as unified systems. Starting from a connected template, PrIntMesh jointly deforms all substructures to match patient-specific anatomy, while explicitly preserving internal boundaries and enforcing smooth, artifact-free surfaces. We demonstrate its effectiveness on the heart, hippocampus, and lungs, achieving high geometric accuracy, correct topology, and robust performance even with limited or noisy training data. Compared to voxel- and surface-based methods, PrIntMesh better reconstructs shared interfaces, maintains structural consistency, and provides a data-efficient solution suitable for clinical use.

71. 【2511.16184】Domain-Shared Learning and Gradual Alignment for Unsupervised Domain Adaptation Visible-Infrared Person Re-Identification

Link: https://arxiv.org/abs/2511.16184

Authors: Nianchang Huang,Yi Xu,Ruida Xi,Qiang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable performance, Visible-Infrared person Re-Identification, achieved remarkable, remarkable performance, person Re-Identification

Abstract:Recently, Visible-Infrared person Re-Identification (VI-ReID) has achieved remarkable performance on public datasets. However, due to the discrepancies between public datasets and real-world data, most existing VI-ReID algorithms struggle in real-life applications. To address this, we take the initiative to investigate Unsupervised Domain Adaptation Visible-Infrared person Re-Identification (UDA-VI-ReID), aiming to transfer the knowledge learned from the public data to real-world data without compromising accuracy or requiring the annotation of new samples. Specifically, we first analyze two basic challenges in UDA-VI-ReID, i.e., inter-domain modality discrepancies and intra-domain modality discrepancies. Then, we design a novel two-stage model, i.e., Domain-Shared Learning and Gradual Alignment (DSLGA), to handle these discrepancies. In the first pre-training stage, DSLGA introduces a Domain-Shared Learning Strategy (DSLS) to mitigate ineffective pre-training caused by inter-domain modality discrepancies by exploiting shared information between the source and target domains. In the second fine-tuning stage, DSLGA designs a Gradual Alignment Strategy (GAS) to handle the cross-modality alignment challenges between visible and infrared data caused by the large intra-domain modality discrepancies through cluster-to-holistic alignment. Finally, a new UDA-VI-ReID testing method, i.e., CMDA-XD, is constructed for training and testing different UDA-VI-ReID models. Extensive experiments demonstrate that our method significantly outperforms existing domain adaptation methods for VI-ReID and even some supervised methods under various settings.

72. 【2511.16183】FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos

Link: https://arxiv.org/abs/2511.16183

Authors: Jeremie Ochin(CAOR),Raphael Chekroun,Bogdan Stanciulescu(CAOR),Sotiris Manitsaris(CAOR)

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: spatiotemporal action detection, temporal action localization, Soccer video understanding, video understanding, understanding has motivated

Abstract:Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multiobject tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.

73. 【2511.16175】Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Link: https://arxiv.org/abs/2511.16175

Authors: Yi Yang,Xueqi Li,Yiyang Chen,Jin Song,Yihan Wang,Zipeng Xiao,Jiadi Su,You Qiaoben,Pengfei Liu,Zhijie Deng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: effectively complement sparse, Recent advances, complement sparse action, sparse action supervisions, effectively complement

Abstract:Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on the LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $\pi_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.

74. 【2511.16170】Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective

Link: https://arxiv.org/abs/2511.16170

Authors: Jiahao Li,Yang Lu,Yachao Zhang,Yong Xie,Fangyong Wang,Yuan Xie,Yanyun Qu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Open-vocabulary semantic segmentation, associate category-related prompts, Open-vocabulary semantic, employs pixel-level vision-language, semantic segmentation

Comments: Accepted by AAAI 2026

Abstract:Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP's vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from an interpretability mechanisms perspective. In this work, we systematically investigate CLIP's internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP's dense prediction performance. Consequently, we propose ReFocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP's multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.

75. 【2511.16166】EvoVLA: Self-Evolving Vision-Language-Action Model

Link: https://arxiv.org/abs/2511.16166

Authors: Zeting Liu,Zida Yang,Zeyu Zhang,Hao Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: robotic manipulation remains, manipulation remains challenging, Current VLA models, remains challenging, VLA models suffer

Abstract:Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: this https URL. Website: this https URL.
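The Stage-Aligned Reward is described as using triplet contrastive learning with hard negatives. The abstract gives no formula, so the following is only the standard margin-based triplet loss on embedding vectors, shown to illustrate why hard negatives matter. The function name, the Euclidean distance, and the `margin` value are assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard margin-based triplet loss over a batch of embeddings of shape (batch, dim)."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # distance to the matching example
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # distance to the non-matching example
    # Penalize only when the negative is not at least `margin` farther away than the positive.
    return np.maximum(0.0, d_pos - d_neg + margin).mean()
```

An easy negative that is already far from the anchor contributes zero loss, so it provides no training signal. A hard negative close to the anchor keeps the loss positive, which is why visually similar but incorrect states are useful for discouraging shortcuts.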

76. 【2511.16163】An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

Link: https://arxiv.org/abs/2511.16163

Authors: Zhi Luo,Zenghui Yuan,Wenqi Wei,Daizong Liu,Pan Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-Language Models, multimodal tasks

Abstract:With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation this http URL studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and this http URL address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) to inject imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings for optimizing and maximizing the output token of the perturbed this http URL, we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image's visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.

77. 【2511.16162】Layer-wise Noise Guided Selective Wavelet Reconstruction for Robust Medical Image Segmentation

Link: https://arxiv.org/abs/2511.16162

Authors: Yuting Lu,Ziliang Wang,Weixin Xu,Wei Zhang,Yongqiang Zhao,Yang Yu,Xiaohong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Clinical deployment requires, Clinical deployment, deployment requires segmentation, requires segmentation models, Selective Wavelet Reconstruction

Abstract:Clinical deployment requires segmentation models to stay stable under distribution shifts and perturbations. The mainstream solution is adversarial training (AT) to improve robustness; however, AT often brings a clean-robustness trade-off and high training/tuning cost, which limits scalability and maintainability in medical imaging. We propose Layer-wise Noise-Guided Selective Wavelet Reconstruction (LNG-SWR). During training, we inject small, zero-mean noise at multiple layers to learn a frequency-bias prior that steers representations away from noise-sensitive directions. We then apply prior-guided selective wavelet reconstruction on the input/feature branch to achieve frequency adaptation: suppress noise-sensitive bands, enhance directional structures and shape cues, and stabilize boundary responses while maintaining spectral consistency. The framework is backbone-agnostic and adds low additional inference overhead. It can serve as a plug-in enhancement to AT and also improves robustness without AT. On CT and ultrasound datasets, under a unified protocol with PGD-$L_{\infty}/L_{2}$ and SSAH, LNG-SWR delivers consistent gains on clean Dice/IoU and significantly reduces the performance drop under strong attacks; combining LNG-SWR with AT yields additive gains. When combined with adversarial training, robustness improves further without sacrificing clean accuracy, indicating an engineering-friendly and scalable path to robust segmentation. These results indicate that LNG-SWR provides a simple, effective, and engineering-friendly path to robust medical image segmentation in both adversarial and standard training regimes.

78. 【2511.16161】Simba: Towards High-Fidelity and Geometrically-Consistent Point Cloud Completion via Transformation Diffusion

Link: https://arxiv.org/abs/2511.16161

Authors: Lirui Zhang,Zhengkai Zhao,Zhi Zuo,Pan Gao,Jie Qin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Point cloud completion, Point cloud, cloud completion, fundamental task, Point

Comments: Accepted for publication at the 40th AAAI Conference on Artificial Intelligence (AAAI-26)

Abstract:Point cloud completion is a fundamental task in 3D vision. A persistent challenge in this field is simultaneously preserving fine-grained details present in the input while ensuring the global structural integrity of the completed shape. While recent works leveraging local symmetry transformations via direct regression have significantly improved the preservation of geometric structure details, these methods suffer from two major limitations: (1) These regression-based methods are prone to overfitting and tend to memorize instance-specific transformations instead of learning a generalizable geometric prior. (2) Their reliance on point-wise transformation regression leads to high sensitivity to input noise, severely degrading their robustness and generalization. To address these challenges, we introduce Simba, a novel framework that reformulates point-wise transformation regression as a distribution learning problem. Our approach integrates symmetry priors with the powerful generative capabilities of diffusion models, avoiding instance-specific memorization while capturing robust geometric structures. Additionally, we introduce a hierarchical Mamba-based architecture to achieve high-fidelity upsampling. Extensive experiments across the PCN, ShapeNet, and KITTI benchmarks validate our method's state-of-the-art (SOTA) performance.

79. 【2511.16160】Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

Link: https://arxiv.org/abs/2511.16160

Authors: Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, Conghui Zhu, Tiejun Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Multimodal Large, Large Language Models, frontier for Multimodal, Large Language

Notes:

Abstract:Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model's ability in fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object size. This empowers the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity when describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in the supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model's real-world generalization capabilities. To systematically evaluate the correlation between cognitive map accuracy and image quantity, as well as how the quantity of image inputs affects spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the relevant mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B, achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at this https URL.

80. 【2511.16156】Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

Link: https://arxiv.org/abs/2511.16156

Authors: Jian Ma, Qirong Peng, Xujie Zhu, Peixing Xie, Chen Chen, Haonan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high computational costs, shown exceptional performance, incur high computational, counts incur high, computational costs

Notes: [this https URL](https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning)

Abstract:Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments. The open-source code and checkpoints for PPCL can be found at the following link: this https URL.

81. 【2511.16150】Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval

Link: https://arxiv.org/abs/2511.16150

Authors: Chunxu Liu, Jiyuan Yang, Ruopeng Gao, Yuhan Zhu, Feng Zhu, Rui Zhao, Limin Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: shared representation space, Multimodal Large Language, Large Language Models, enabling alignment, downstream tasks

Notes:

Abstract:Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs) can serve as strong embedding extractors, existing approaches treat embedding extraction as a direct encoding step, overlooking the fact that MLLMs possess the generative capability for reasoning that could be leveraged to enhance representation quality. In this work, we explore how to explicitly incorporate reasoning into the embedding process. To this end, we propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of MLLMs and couples it with contrastive training. Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded. This simple design enhances the context-conditional inference signals within the embedding, leading to improved multimodal representation quality. Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline, confirming that explicit reasoning can effectively enhance embedding quality.

82. 【2511.16144】LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM

Link: https://arxiv.org/abs/2511.16144

Authors: Sibaek Lee, Seongbo Ha, Kyeongsu Kang, Joonyeol Choi, Seungjun Tak, Hyeonwoo Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: enabled Simultaneous Localization, Simultaneous Localization, Recent advances, enabled Simultaneous, build photorealistic maps

Notes: 18 pages

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled Simultaneous Localization and Mapping (SLAM) systems to build photorealistic maps. However, these maps lack the open-vocabulary semantic understanding required for advanced robotic interaction. Integrating language features into SLAM remains a significant challenge, as storing high-dimensional features demands excessive memory and rendering overhead, while existing methods with static models lack adaptability for novel environments. To address these limitations, we propose LEGO-SLAM (Language-Embedded Gaussian Optimization SLAM), the first framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. This design reduces the memory per Gaussian and accelerates rendering, enabling real-time performance. Unlike static approaches, our encoder adapts online to unseen scenes. These compact features also enable a language-guided pruning strategy that identifies semantic redundancy, reducing the map's Gaussian count by over 60% while maintaining rendering quality. Furthermore, we introduce a language-based loop detection approach that reuses these mapping features, eliminating the need for a separate detection model. Extensive experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy, all while providing open-vocabulary capabilities at 15 FPS.

83. 【2511.16143】A Spatial Semantics and Continuity Perception Attention for Remote Sensing Water Body Change Detection

Link: https://arxiv.org/abs/2511.16143

Authors: Quanqing Ma, Jiaen Chen, Peng Wang, Yao Zheng, Qingzhan Zhao, Yuchen Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Remote sensing Water, Water Body, Remote sensing, body surface changes, Body Change Detection

Notes:

Abstract:Remote sensing Water Body Change Detection (WBCD) aims to detect water body surface changes from bi-temporal images of the same geographic area. The scarcity of high spatial resolution datasets for WBCD currently restricts its application in urban and rural regions, which require more accurate positioning. Meanwhile, previous deep learning-based methods fail to comprehensively exploit the spatial semantic and structural information in deep features in the change detection networks. To resolve these concerns, we first propose a new dataset, HSRW-CD, with a spatial resolution higher than 3 meters for WBCD. Specifically, it contains a large number of image pairs, widely covering various water body types. In addition, a Spatial Semantics and Continuity Perception (SSCP) attention module is designed to fully leverage both the spatial semantics and structure of deep features in the WBCD networks, significantly improving the discrimination capability for water bodies. The proposed SSCP has three components: the Multi-Semantic spatial Attention (MSA), the Structural Relation-aware Global Attention (SRGA), and the Channel-wise Self-Attention (CSA). The MSA enhances the spatial semantics of water body features and provides precise spatial semantic priors for the CSA. Then, the SRGA further extracts spatial structure to learn the spatial continuity of the water body. Finally, the CSA utilizes the spatial semantic and structural priors from the MSA and SRGA to compute the similarity across channels. Specifically designed as a plug-and-play module for water body deep features, the proposed SSCP allows integration into existing WBCD models. Numerous experiments conducted on the proposed HSRW-CD and Water-CD datasets validate the effectiveness and generalization of the SSCP. The code of this work and the HSRW-CD dataset will be accessible at this https URL.

84. 【2511.16140】Real-Time 3D Object Detection with Inference-Aligned Learning

Link: https://arxiv.org/abs/2511.16140

Authors: Chenyu Zhao, Xianwei Zheng, Zimin Xia, Linwei Yue, Nan Xue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dynamic scene understanding, object detection, robotics and navigation, point clouds, augmented reality

Notes: Accepted by AAAI 2026

Abstract:Real-time 3D object detection from point clouds is essential for dynamic scene understanding in applications such as augmented reality, robotics, and navigation. We introduce a novel Spatial-prioritized and Rank-aware 3D object detection (SR3D) framework for indoor point clouds, to bridge the gap between how detectors are trained and how they are evaluated. This gap stems from the lack of spatial reliability and ranking awareness during training, which conflicts with the ranking-based prediction selection used at inference. Such a training-inference gap hampers the model's ability to learn representations aligned with inference-time behavior. To address this limitation, SR3D consists of two components tailored to the spatial nature of point clouds during training: a novel spatial-prioritized optimal transport assignment that dynamically emphasizes well-located and spatially reliable samples, and a rank-aware adaptive self-distillation scheme that adaptively injects ranking perception via a self-distillation paradigm. Extensive experiments on ScanNet V2 and SUN RGB-D show that SR3D effectively bridges the training-inference gap and significantly outperforms prior methods in accuracy while maintaining real-time speed.

85. 【2511.16137】Degradation-Aware Hierarchical Termination for Blind Quality Enhancement of Compressed Video

Link: https://arxiv.org/abs/2511.16137

Authors: Li Yu, Yingbo Zhao, Shiyu Wu, Siyue Yu, Moncef Gabbouj, Qingshan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: employing distinct enhancement, Quantization Parameters, Quality Enhancement, distinct enhancement models, Existing studies

Notes:

Abstract:Existing studies on Quality Enhancement for Compressed Video (QECV) predominantly rely on known Quantization Parameters (QPs), employing distinct enhancement models per QP setting, termed non-blind methods. However, in real-world scenarios involving transcoding or transmission, QPs may be partially or entirely unknown, limiting the applicability of such approaches and motivating the development of blind QECV techniques. Current blind methods generate degradation vectors via classification models with cross-entropy loss, using them as channel attention to guide artifact removal. However, these vectors capture only global degradation information and lack spatial details, hindering adaptation to varying artifact patterns at different spatial positions. To address these limitations, we propose a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations from video content to guide the artifact removal. Additionally, both blind and non-blind methods typically employ uniform architectures across QPs, hence, overlooking the varying computational demands inherent to different compression levels. We thus introduce a hierarchical termination mechanism that dynamically adjusts the number of artifact reduction stages based on the compression level. Experimental results demonstrate that the proposed approach significantly enhances performance, achieving a PSNR improvement of 110% (from 0.31 dB to 0.65 dB) over a competing state-of-the-art blind method at QP = 22. Furthermore, the proposed hierarchical termination mechanism reduces the average inference time at QP = 22 by half compared to QP = 42.
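
The "110%" figure in the abstract above is a relative increase of one PSNR gain over another (0.31 dB to 0.65 dB), not a 110% increase in absolute PSNR. A minimal arithmetic check, where the helper name `relative_gain` is ours and not from the paper:

```python
def relative_gain(baseline_db: float, improved_db: float) -> float:
    """Relative increase of a PSNR gain, as a fraction of the baseline gain."""
    return (improved_db - baseline_db) / baseline_db


# Values quoted in the abstract: competing blind method vs. proposed method at QP = 22.
gain = relative_gain(0.31, 0.65)
print(f"{gain:.1%}")  # about 110% after rounding to the nearest percent
```

The absolute difference between the two gains is 0.34 dB; the percentage only describes how much larger the proposed method's gain is than the competitor's.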

86. 【2511.16136】How Noise Benefits AI-generated Image Detection

Link: https://arxiv.org/abs/2511.16136

Authors: Jiazhen Yan, Ziqiang Li, Fan Wang, Kai Zeng, Zhangjie Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: images increasingly indistinguishable, increasingly indistinguishable, synthetic images increasingly, rapid advancement, made real

Notes:

Abstract:The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.

87. 【2511.16124】VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation

Link: https://arxiv.org/abs/2511.16124

Authors: Chenyang Wu, Jiayi Fu, Chun-Le Guo, Shuhao Han, Chongyi Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Due to large, high computational cost, Video Frame Interpolation, large pixel movement, computational cost

Notes: Accepted by AAAI 2026

Abstract:Due to large pixel movement and high computational cost, estimating the motion of high-resolution frames is challenging. Thus, most flow-based Video Frame Interpolation (VFI) methods first predict bidirectional flows at low resolution and then use high-magnification upsampling (e.g., bilinear) to obtain the high-resolution ones. However, this kind of upsampling strategy may cause blur or mosaic at the flows' edges. Additionally, the motion of fine pixels at high resolution cannot be adequately captured in motion estimation at low resolution, which leads to the misalignment of task-oriented flows. With such inaccurate flows, input frames are warped and combined pixel-by-pixel, resulting in ghosting and discontinuities in the interpolated frame. In this study, we propose a novel VFI pipeline, VTinker, which consists of two core components: guided flow upsampling (GFU) and Texture Mapping. After motion estimation at low resolution, GFU introduces input frames as guidance to alleviate the blurring details in bilinear upsampling flows, which makes flows' edges clearer. Subsequently, to avoid pixel-level ghosting and discontinuities, Texture Mapping generates an initial interpolated frame, referred to as the intermediate proxy. The proxy serves as a cue for selecting clear texture blocks from the input frames, which are then mapped onto the proxy to facilitate producing the final interpolated frame via a reconstruction module. Extensive experiments demonstrate that VTinker achieves state-of-the-art performance in VFI. Codes are available at: this https URL.

88. 【2511.16117】Decoupling Complexity from Scale in Latent Diffusion Model

Link: https://arxiv.org/abs/2511.16117

Authors: Tianxiong Zhong, Xingye Tian, Xuebo Wang, Boyuan Jiang, Xin Tao, Pengfei Wan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing latent diffusion, represent higher-resolution images, higher-frame rate videos, typically couple scale, diffusion models typically

Notes: 15 pages, 16 figures

Abstract:Existing latent diffusion models typically couple scale with content complexity, using more latent tokens to represent higher-resolution images or higher-frame rate videos. However, the latent capacity required to represent visual data primarily depends on content complexity, with scale serving only as an upper bound. Motivated by this observation, we propose DCS-LDM, a novel paradigm for visual generation that decouples information complexity from scale. DCS-LDM constructs a hierarchical, scale-independent latent space that models sample complexity through multi-level tokens and supports decoding to arbitrary resolutions and frame rates within a fixed latent representation. This latent space enables DCS-LDM to achieve a flexible computation-quality tradeoff. Furthermore, by decomposing structural and detailed information across levels, DCS-LDM supports a progressive coarse-to-fine generation paradigm. Experimental results show that DCS-LDM delivers performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities.

89. 【2511.16112】Clustered Error Correction with Grouped 4D Gaussian Splatting

Link: https://arxiv.org/abs/2511.16112

Authors: Taeho Kang, Jaeyeon Park, Kyungjin Lee, Youngki Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: resolve ambiguous pixel, ambiguous pixel correspondences, Gaussian Splatting, reconstruct dynamic scenes, accurately reconstruct dynamic

Notes: 16 pages, 8 figures, SIGGRAPH Asia Conference Papers 2025

Abstract:Existing 4D Gaussian Splatting (4DGS) methods struggle to accurately reconstruct dynamic scenes, often failing to resolve ambiguous pixel correspondences and suffering from inadequate densification in dynamic regions. We address these issues by introducing a novel method composed of two key components: (1) Elliptical Error Clustering and Error Correcting Splat Addition, which pinpoints dynamic areas to improve and initializes fitting splats, and (2) Grouped 4D Gaussian Splatting, which improves the consistency of the mapping between splats and the dynamic objects they represent. Specifically, we classify rendering errors into missing-color and occlusion types, then apply targeted corrections via backprojection or foreground splitting guided by cross-view color consistency. Evaluations on the Neural 3D Video and Technicolor datasets demonstrate that our approach significantly improves temporal consistency and achieves state-of-the-art perceptual rendering quality, improving PSNR by 0.39 dB on the Technicolor Light Field dataset. Our visualization shows improved alignment between splats and dynamic objects, and the error correction method's capability to identify errors and properly initialize new splats. Our implementation details and source code are available at this https URL.

90. 【2511.16107】T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs

Link: https://arxiv.org/abs/2511.16107

Authors: Shao-Jun Xia, Huixin Zhang, Zhengzhong Tu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: small demonstrations provided, large language models, in-context learning, visual in-context learning, cross-task VICL

Notes:

Abstract:In large language models (LLMs), in-context learning (ICL) refers to performing new tasks by conditioning on a small number of demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks with unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In this paper, we propose a fully collaborative pipeline, i.e., T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.

91. 【2511.16091】Rad-GS: Radar-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments

Link: https://arxiv.org/abs/2511.16091

Authors: Renxiang Xiao, Wei Liu, Yuanfan Zhang, Yushuai Chen, Jinming Chen, Zilu Wang, Liang Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: radar-camera SLAM system, SLAM system designed, radar-camera SLAM, differentiable spatial representation, SLAM system

Notes:

Abstract:We present Rad-GS, a 4D radar-camera SLAM system designed for kilometer-scale outdoor environments, utilizing 3D Gaussian as a differentiable spatial representation. Rad-GS combines the advantages of raw radar point cloud with Doppler information and geometrically enhanced point cloud to guide dynamic object masking in synchronized images, thereby alleviating rendering artifacts and improving localization accuracy. Additionally, unsynchronized image frames are leveraged to globally refine the 3D Gaussian representation, enhancing texture consistency and novel view synthesis fidelity. Furthermore, the global octree structure coupled with a targeted Gaussian primitive management strategy further suppresses noise and significantly reduces memory consumption in large-scale environments. Extensive experiments and ablation studies demonstrate that Rad-GS achieves performance comparable to traditional 3D Gaussian methods based on camera or LiDAR inputs, highlighting the feasibility of robust outdoor mapping using 4D mmWave radar. Real-world reconstruction at kilometer scale validates the potential of Rad-GS for large-scale scene reconstruction.

92. 【2511.16084】SpectralTrain: A Universal Framework for Hyperspectral Image Classification

Link: https://arxiv.org/abs/2511.16084

Authors: Meihua Zhou, Liping Yu, Jiawei Cai, Wai Kin Fung, Ruiguo Hu, Jiarui Zhao, Wenzhuo Liu, Nan Wan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: typically involves large-scale, involves large-scale data, Hyperspectral image, classification typically involves, computationally intensive training

Notes:

Abstract:Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral-spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets (Indian Pines, Salinas-A, and the newly introduced CloudPatch-7) demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent training speedups of 2-7x, with small-to-moderate accuracy deltas depending on the backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at this https URL.
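
To make the idea of PCA-based spectral downsampling under a curriculum concrete, here is a minimal NumPy sketch. It is not the SpectralTrain implementation: the linear schedule, the function names `pca_downsample` and `band_schedule`, the component counts, and the random cube are all assumptions chosen for illustration.

```python
import numpy as np


def pca_downsample(cube: np.ndarray, n_components: int) -> np.ndarray:
    """Project an (H, W, B) hyperspectral cube onto its top spectral principal components."""
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b).astype(np.float64)  # astype returns a copy
    pixels -= pixels.mean(axis=0)                    # center each spectral band
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(pixels, full_matrices=False)
    return (pixels @ vt[:n_components].T).reshape(h, w, n_components)


def band_schedule(n_epochs: int, start: int, end: int) -> list:
    """Hypothetical linear curriculum: number of spectral components used at each epoch."""
    return np.linspace(start, end, n_epochs).round().astype(int).tolist()


rng = np.random.default_rng(0)
cube = rng.normal(size=(8, 8, 32))   # stand-in for a real HSI patch (assumed shape)
schedule = band_schedule(n_epochs=5, start=4, end=16)
stages = [pca_downsample(cube, k) for k in schedule]
print(schedule)                      # [4, 7, 10, 13, 16]
print([s.shape for s in stages])
```

Early epochs see only a few components, so each training step touches less data; later epochs restore more spectral detail. How the actual framework schedules complexity is not stated in the abstract.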

93. 【2511.16077】VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

Link: https://arxiv.org/abs/2511.16077

Authors: Zishan Xu, Yifu Guo, Yuquan Lu, Fengyu Yang, Junxin Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Traditional video reasoning, segmentation methods rely, Traditional video, supervised fine-tuning, scenarios and lacks

Notes:

Abstract:Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) a hierarchical text-guided frame sampler to emulate human attention; (2) a reasoning model that produces spatial cues along with explicit reasoning chains; and (3) a segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at this https URL.

94. 【2511.16049】LiSTAR: Ray-Centric World Models for 4D LiDAR Sequences in Autonomous Driving

Link: https://arxiv.org/abs/2511.16049

Authors: Pei Liu, Songtao Wang, Lang Zhang, Xingyue Peng, Yuandong Lyu, Jiaxin Deng, Songxin Lu, Weiliang Ma, Xueyang Zhang, Yifei Zhan, XianPeng Lang, Jun Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Synthesizing high-fidelity, scalable simulation environments, Masked Generative START, Synthesizing, creating scalable simulation

Notes:

Abstract:Synthesizing high-fidelity and controllable 4D LiDAR data is crucial for creating scalable simulation environments for autonomous driving. This task is inherently challenging due to the sensor's unique spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. To address these challenges, we present LiSTAR, a novel generative world model that operates directly on the sensor's native geometry. LiSTAR introduces a Hybrid-Cylindrical-Spherical (HCS) representation to preserve data fidelity by mitigating quantization artifacts common in Cartesian grids. To capture complex dynamics from sparse temporal data, it utilizes a Spatio-Temporal Attention with Ray-Centric Transformer (START) that explicitly models feature evolution along individual sensor rays for robust temporal coherence. Furthermore, for controllable synthesis, we propose a novel 4D point cloud-aligned voxel layout for conditioning and a corresponding discrete Masked Generative START (MaskSTART) framework, which learns a compact, tokenized representation of the scene, enabling efficient, high-resolution, and layout-guided compositional generation. Comprehensive experiments validate LiSTAR's state-of-the-art performance across 4D LiDAR reconstruction, prediction, and conditional generation, with substantial quantitative gains: reducing generation MMD by a massive 76%, improving reconstruction IoU by 32%, and lowering prediction L1 Med by 50%. This level of performance provides a powerful new foundation for creating realistic and controllable autonomous systems simulations. Project link: this https URL.

95. 【2511.16047】AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers

Link: https://arxiv.org/abs/2511.16047

Authors: Boxun Xu, Yu Wang, Zihu Wang, Peng Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual autoregressive modeling, Visual autoregressive, autoregressive modeling, image generation paradigm, next-scale prediction

Notes:

Abstract:Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale based VAR transformers remains largely unexplored. A major bottleneck is the excessive KV memory growth with the increasing number of scales, which severely limits scalability. Our systematic investigation reveals that: (1) attending to tokens from local scales significantly contributes to generation quality; (2) allocating a small amount of memory for the coarsest scales, termed condensed scales, stabilizes multi-scale image generation; and (3) strong KV similarity across finer scales is predominantly observed in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on these observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency by identifying cache-demanding layers through inter-scale similarity analysis. Compared to the vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.
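
The retention rule described above (keep KVs from the coarsest "condensed" scales and from the most recent "local" scales) can be illustrated with a toy token count. This is only a sketch of that one idea: the function `retained_scales`, the parameter values, and the scale side lengths are hypothetical, and the per-layer adaptivity the abstract mentions is not modeled.

```python
def retained_scales(current: int, n_condensed: int, n_local: int) -> list[int]:
    """Indices of earlier scales whose KVs stay cached while generating scale `current`."""
    previous = range(current)
    condensed = [s for s in previous if s < n_condensed]       # coarsest scales
    local = [s for s in previous if s >= current - n_local]    # most recent scales
    return sorted(set(condensed) | set(local))


# Illustrative side lengths of the token map at each scale (an assumption, not from the abstract).
patch_sides = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]
tokens = [p * p for p in patch_sides]

current = len(patch_sides) - 1                 # generating the finest scale
kept = retained_scales(current, n_condensed=2, n_local=2)
full = sum(tokens[:current])                   # tokens cached without any eviction
used = sum(tokens[s] for s in kept)            # tokens cached under the toy policy
print(kept, used, full)                        # [0, 1, 7, 8] 274 424
```

In this toy setting the cache holds 274 of 424 earlier tokens. The savings reported in the abstract come from the authors' full policy and should not be read off this example.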

96. 【2511.16037】LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets

Link: https://arxiv.org/abs/2511.16037

Authors: Qing Wang, Chong-Wah Ngo, Ee-Peng Lim, Qianru Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: training samples, free-living environment, typically crawled, pictures captured, captured by users

Notes:

Abstract:Training a model for food recognition is challenging because the training samples, which are typically crawled from the Internet, are visually different from the pictures captured by users in the free-living environment. In addition to this domain-shift problem, the real-world food datasets tend to be long-tailed distributed and some dishes of different categories exhibit subtle variations that are difficult to distinguish visually. In this paper, we present a framework empowered with large language models (LLMs) to address these challenges in food recognition. We first leverage LLMs to parse food images to generate food titles and ingredients. Then, we project the generated texts and food images from different domains to a shared embedding space to maximize the pair similarities. Finally, we take the aligned features of both modalities for recognition. With this simple framework, we show that our proposed approach can outperform the existing approaches tailored for long-tailed data distribution, domain adaptation, and fine-grained classification, respectively, on two food datasets.

97. 【2511.16031】Crossmodal learning for Crop Canopy Trait Estimation

Link: https://arxiv.org/abs/2511.16031

Authors: Timilehin T. Ayanlade,Anirudha Powadi,Talukder Z. Jubery,Baskar Ganapathysubramanian,Soumik Sarkar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: driven widespread adoption, Unmanned Aerial Vehicles, Recent advances, multi sensor platforms, canopy reflectance data

Comments: 18 pages, 7 figures

Click to view abstract

Abstract:Recent advances in plant phenotyping have driven widespread adoption of multi-sensor platforms for collecting crop canopy reflectance data. This includes the collection of heterogeneous data across multiple platforms, with Unmanned Aerial Vehicles (UAVs) seeing significant usage due to their high performance in crop monitoring, forecasting, and prediction tasks. Similarly, satellite missions have been shown to be effective for agriculturally relevant tasks. In contrast to UAVs, such missions are limited by their spatial resolution, which hinders their effectiveness for modern farming systems focused on micro-plot management. In this work, we propose a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level visual detail for crop canopy trait estimation. Using a dataset of approximately co-registered satellite-UAV image pairs collected from replicated plots of 84 hybrid maize varieties across five distinct locations in the U.S. Corn Belt, we train a model that learns fine-grained spectral-spatial correspondences between sensing modalities. Results show that the generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction, demonstrating the potential of cross-modal correspondence learning to bridge the gap between satellite and UAV sensing in agricultural monitoring.

98. 【2511.16030】CuriGS: Curriculum-Guided Gaussian Splatting for Sparse View Synthesis

Link: https://arxiv.org/abs/2511.16030

Authors: Zijian Wu,Mingfeng Jiang,Zidian Lin,Ying Song,Hanjie Ma,Qun Wu,Dongping Zhang,Guiyang Pu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, high-fidelity representation, recently emerged, representation for real-time, real-time scene reconstruction

Comments:

Click to view abstract

Abstract:3D Gaussian Splatting (3DGS) has recently emerged as an efficient, high-fidelity representation for real-time scene reconstruction and rendering. However, extending 3DGS to sparse-view settings remains challenging because of supervision scarcity and overfitting caused by limited viewpoint coverage. In this paper, we present CuriGS, a curriculum-guided framework for sparse-view 3D reconstruction using 3DGS. CuriGS addresses the core challenge of sparse-view synthesis by introducing student views: pseudo-views sampled around ground-truth poses (teachers). For each teacher, we generate multiple groups of student views with different perturbation levels. During training, we follow a curriculum schedule that gradually unlocks higher perturbation levels, randomly sampling candidate students from the active level to assist training. Each sampled student is regularized via depth-correlation and co-regularization, and evaluated using a multi-signal metric that combines SSIM, LPIPS, and an image-quality measure. For every teacher and perturbation level, we periodically retain the best-performing students and promote those that satisfy a predefined quality threshold to the training set, resulting in a stable augmentation of sparse training views. Experimental results show that CuriGS outperforms state-of-the-art baselines in both rendering fidelity and geometric consistency across various synthetic and real sparse-view scenes. Project page: this https URL
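
The curriculum described in this abstract (unlock higher perturbation levels over time, then sample students from the active level) can be pictured with a minimal sketch. This is not the authors' code: the linear unlock rule, the `unlock_every` parameter, and the function names `active_level` and `sample_students` are all assumptions made for illustration.

```python
import random


def active_level(step, unlock_every, num_levels):
    """Index of the highest perturbation level unlocked at this training step.

    Assumption: one new level is unlocked every `unlock_every` steps.
    """
    return min(step // unlock_every, num_levels - 1)


def sample_students(students_by_level, step, unlock_every, k, rng):
    """Randomly pick up to k student views from the currently active level."""
    level = active_level(step, unlock_every, len(students_by_level))
    pool = students_by_level[level]
    return level, rng.sample(pool, min(k, len(pool)))


# Hypothetical student views grouped by perturbation level (0 = smallest).
students = [["s0a", "s0b"], ["s1a", "s1b", "s1c"], ["s2a"]]
level, picked = sample_students(students, step=150, unlock_every=100, k=2,
                                rng=random.Random(0))
```

The regularization and quality-based promotion steps from the abstract are left out here; only the scheduling and sampling logic is shown.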

99. 【2511.16026】Towards a Safer and Sustainable Manufacturing Process: Material classification in Laser Cutting Using Deep Learning

Link: https://arxiv.org/abs/2511.16026

Authors: Mohamed Abdallah Salem,Hamdy Ahmed Ashur,Ahmed Elshinnawy

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: widely adopted technology, Laser cutting, amount of dust, aerosols during operation, posing a risk

Comments:

Click to view abstract

Abstract:Laser cutting is a widely adopted technology in material processing across various industries, but it generates a significant amount of dust, smoke, and aerosols during operation, posing a risk to both the environment and workers' health. Speckle sensing has emerged as a promising method to monitor the cutting process and identify material types in real-time. This paper proposes a material classification technique using a speckle pattern of the material's surface based on deep learning to monitor and control the laser cutting process. The proposed method involves training a convolutional neural network (CNN) on a dataset of laser speckle patterns to recognize distinct material types for safe and efficient cutting. Previous methods for material classification using speckle sensing may face issues when the color of the laser used to produce the speckle pattern is changed. Experiments conducted in this study demonstrate that the proposed method achieves high accuracy in material classification, even when the laser color is changed. The model achieved an accuracy of 98.30% on the training set and 96.88% on the validation set. Furthermore, the model was evaluated on a set of 3000 new images for 30 different materials, achieving an F1-score of 0.9643. The proposed method provides a robust and accurate solution for material-aware laser cutting using speckle sensing.

100. 【2511.16024】Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution

Link: https://arxiv.org/abs/2511.16024

Authors: Xiao He,Zhijun Tu,Kun Cheng,Mingrui Zhu,Jie Hu,Nannan Wang,Xinbo Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: DeepSeek and Grok, success of sparsely-gated, diverse domains, demonstrated success, motivated researchers

Comments: 16 pages, Accepted by AAAI 2026

Click to view abstract

Abstract:The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through Low-Rank Adaptation (LoRA) module to reconstruct high-resolution (HR) images. However, these dense Real-ISR models are limited in their ability to adaptively capture the heterogeneous characteristics of complex real-world degraded samples or enable knowledge sharing between inputs under equivalent computational budgets. To address this, we investigate the integration of sparse MoE into Real-ISR and propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert. This design enables flexible knowledge recombination while isolating fixed-position ranks as shared experts to preserve common-sense features and minimize routing redundancy. Furthermore, we develop a degradation estimation module leveraging CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores, dynamically guiding expert activation. To better accommodate varying sample complexities, we incorporate zero-expert slots and propose a degradation-aware load-balancing loss, which dynamically adjusts the number of active experts based on degradation severity, ensuring optimal computational resource allocation. Comprehensive experiments validate our framework's effectiveness and state-of-the-art performance.

101. 【2511.16020】Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion

Link: https://arxiv.org/abs/2511.16020

Authors: Dingkun Zhou,Patrick P. K. Chan,Hengxu Wu,Shikang Zheng,Ruiqi Huang,Yuanjie Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Deep neural networks, real surveillance environments, Deep neural, creating safety, surveillance environments

Comments:

Click to view abstract

Abstract:Deep neural networks used for human detection are highly vulnerable to adversarial manipulation, creating safety and privacy risks in real surveillance environments. Wearable attacks offer a realistic threat model, yet existing approaches usually optimize textures frame by frame and therefore fail to maintain concealment across long video sequences with motion, pose changes, and garment deformation. In this work, a sequence-level optimization framework is introduced to generate natural, printable adversarial textures for shirts, trousers, and hats that remain effective throughout entire walking videos in both digital and physical settings. Product images are first mapped to UV space and converted into a compact palette and control-point parameterization, with ICC locking to keep all colors printable. A physically based human-garment pipeline is then employed to simulate motion, multi-angle camera viewpoints, cloth dynamics, and illumination variation. An expectation-over-transformation objective with temporal weighting is used to optimize the control points so that detection confidence is minimized across whole sequences. Extensive experiments demonstrate strong and stable concealment, high robustness to viewpoint changes, and superior cross-model transferability. Physical garments produced with sublimation printing achieve reliable suppression under indoor and outdoor recordings, confirming real-world feasibility.

102. 【2511.16015】Exploiting Inter-Sample Information for Long-tailed Out-of-Distribution Detection

Link: https://arxiv.org/abs/2511.16015

Authors: Nimeshika Udayangani,Hadi M. Dolatabadi,Sarah Erfani,Christopher Leckie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep neural networks, essential for safe, safe deployment, deployment of deep, deep neural

Comments:

Click to view abstract

Abstract:Detecting out-of-distribution (OOD) data is essential for safe deployment of deep neural networks (DNNs). This problem becomes particularly challenging in the presence of long-tailed in-distribution (ID) datasets, often leading to high false positive rates (FPR) and low tail-class ID classification accuracy. In this paper, we demonstrate that exploiting inter-sample relationships using a graph-based representation can significantly improve OOD detection in long-tailed recognition of vision datasets. To this end, we use the feature space of a pre-trained model to initialize our graph structure. We account for the differences between the activation layer distribution of the pre-training vs. training data, and actively introduce Gaussianization to alleviate any deviations from a standard normal distribution in the activation layers of the pre-trained model. We then refine this initial graph representation using graph convolutional networks (GCNs) to arrive at a feature space suitable for long-tailed OOD detection. This leads us to address the inferior performance observed in ID tail-classes within existing OOD detection methods. Experiments over three benchmarks CIFAR10-LT, CIFAR100-LT, and ImageNet-LT demonstrate that our method outperforms the state-of-the-art approaches by a large margin in terms of FPR and tail-class ID classification accuracy.

103. 【2511.15986】Fairness in Multi-modal Medical Diagnosis with Demonstration Selection

Link: https://arxiv.org/abs/2511.15986

Authors: Dawei Li,Zijian Gu,Peng Wang,Chuhan Song,Zhen Tan,Mohan Zhang,Tianlong Chen,Yu Tian,Song Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)

Keywords: Multimodal large language, Multimodal large, large language models, demographic groups remains, major concern

Comments: 10 pages (including 2 pages of references), 4 figures. This work explores fairness in multi-modal medical image reasoning using in-context learning

Click to view abstract

Abstract:Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
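
To make the idea of demographically balanced yet relevant demonstrations concrete, here is a minimal sketch. It is not FADS itself: the abstract describes clustering-based sampling, whereas this illustration swaps in a simpler round-robin over demographic groups, taking the most similar candidate from each group in turn. The function name `balanced_demonstrations` and the `(example_id, group, similarity)` tuple layout are assumptions.

```python
from collections import defaultdict


def balanced_demonstrations(candidates, k):
    """Pick up to k example ids, cycling over groups, most similar first.

    candidates: list of (example_id, group, similarity) tuples, where
    similarity is a precomputed relevance score to the query (assumed given).
    """
    by_group = defaultdict(list)
    for example_id, group, similarity in candidates:
        by_group[group].append((similarity, example_id))
    for items in by_group.values():
        items.sort(reverse=True)  # highest similarity first within each group

    groups = sorted(by_group)  # fixed order so the result is deterministic
    selected = []
    while len(selected) < k and any(by_group[g] for g in groups):
        for g in groups:
            if by_group[g] and len(selected) < k:
                selected.append(by_group[g].pop(0)[1])
    return selected


# Hypothetical pool: a plain top-4 by similarity would pick three "F" and one "M".
pool = [("a", "F", 0.9), ("b", "F", 0.8), ("e", "F", 0.7),
        ("c", "M", 0.5), ("d", "M", 0.4)]
demos = balanced_demonstrations(pool, 4)  # two examples from each group
```

The trade-off is visible in the example: balance across groups is bought by accepting some lower-similarity demonstrations.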

104. 【2511.15984】UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition

Link: https://arxiv.org/abs/2511.15984

Authors: Xinyu Nan,Lingtao Mao,Huangyu Dai,Zexin Zheng,Xinyu Sun,Zihan Liang,Ben Chen,Yuqing Ding,Chenyi Lei,Wenwu Ou,Han Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: handles object detection, visual semantic understanding, semantic understanding requires, simultaneously handles object, Achieving visual semantic

Comments:

Click to view abstract

Abstract:Achieving visual semantic understanding requires a unified framework that simultaneously handles object detection, category prediction, and attribute recognition. However, current advanced approaches rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. To overcome these challenges, we introduce a detection-guided generative framework that predicts hierarchical category and attribute tokens. For each detected object, we extract refined ROI-level features and employ a BART-based generator to produce semantic tokens in a coarse-to-fine sequence covering category hierarchies and property-value pairs, with support for property-conditioned attribute recognition. Experiments on both large-scale proprietary e-commerce datasets and open-source datasets demonstrate that our approach significantly outperforms existing similarity-based pipelines and multi-stage classification systems, achieving stronger fine-grained recognition and more coherent unified inference.

105. 【2511.15968】Externally Validated Multi-Task Learning via Consistency Regularization Using Differentiable BI-RADS Features for Breast Ultrasound Tumor Segmentation

Link: https://arxiv.org/abs/2511.15968

Authors: Jingru Zhang,Saed Moradi,Ashirbani Saha

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: jointly trained models, trained models underperform, models underperform single-task, underperform single-task baselines, Multi-task learning

Comments:

Click to view abstract

Abstract:Multi-task learning can suffer from destructive task interference, where jointly trained models underperform single-task baselines and limit generalization. To improve generalization performance in breast ultrasound-based tumor segmentation via multi-task learning, we propose a novel consistency regularization approach that mitigates destructive interference between segmentation and classification. The consistency regularization approach is composed of differentiable BI-RADS-inspired morphological features. We validated this approach by training all models on the BrEaST dataset (Poland) and evaluating them on three external datasets: UDIAT (Spain), BUSI (Egypt), and BUS-UCLM (Spain). Our comprehensive analysis demonstrates statistically significant (p<0.001) improvements in generalization for the segmentation task with the proposed multi-task approach vs. the baseline on UDIAT, BUSI, and BUS-UCLM (Dice coefficient = 0.81 vs. 0.59, 0.66 vs. 0.56, and 0.69 vs. 0.49, respectively). The proposed approach also achieves state-of-the-art segmentation performance under rigorous external validation on the UDIAT dataset.
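
For readers unfamiliar with the metric reported here, the Dice coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flat 0/1 sequences; the function name `dice_coefficient` and the convention for two empty masks are assumptions, and the paper's own evaluation code may differ.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# One overlapping pixel, two predicted and one true: 2 * 1 / (2 + 1).
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```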

106. 【2511.15967】InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer

Link: https://arxiv.org/abs/2511.15967

Authors: Muyao Yuan,Yuanhong Zhang,Weizhan Zhang,Lan Ma,Yuan Gao,Jiangyong Ying,Yudeng Xin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: strong generalization ability, pretrained CLIP, arbitrary text, CLIP, strong generalization

Comments: Accepted by AAAI 2026

Click to view abstract

Abstract:Recently, the strong generalization ability of CLIP has facilitated open-vocabulary semantic segmentation, which labels pixels using arbitrary text. However, existing methods that fine-tune CLIP for segmentation on limited seen categories often lead to overfitting and degrade the pretrained vision-language alignment. To stabilize modality alignment during fine-tuning, we propose InfoCLIP, which leverages an information-theoretic perspective to transfer alignment knowledge from pretrained CLIP to the segmentation task. Specifically, this transfer is guided by two novel objectives grounded in mutual information. First, we compress the pixel-text modality alignment from pretrained CLIP to reduce noise arising from its coarse-grained local semantic representations learned under image-text supervision. Second, we maximize the mutual information between the alignment knowledge of pretrained CLIP and the fine-tuned model to transfer compact local semantic relations suited for the segmentation task. Extensive evaluations across various benchmarks validate the effectiveness of InfoCLIP in enhancing CLIP fine-tuning for open-vocabulary semantic segmentation, demonstrating its adaptability and superiority in asymmetric transfer.

107. 【2511.15948】Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

Link: https://arxiv.org/abs/2511.15948

Authors: Raphael Ruschel,Hardikkumar Prajapati,Awsafur Rahman,B.S. Manjunath

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Scene Graph Generation, systems provide structured, Video Scene Graph, Graph Generation, provide structured visual

Comments:

Click to view abstract

Abstract:State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts (subject, object, predicate) triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.

108. 【2511.15946】Automated Interpretable 2D Video Extraction from 3D Echocardiography

Link: https://arxiv.org/abs/2511.15946

Authors: Milos Vukadinovic,Hirotaka Ieki,Yuki Sahasi,David Ouyang,Bryan He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: conventional medical imaging, individual cardiac structures, showing individual cardiac, videos showing individual, complex three-dimensional

Comments: 12 pages, 5 figures

Click to view abstract

Abstract:Although the heart has complex three-dimensional (3D) anatomy, conventional medical imaging with cardiac ultrasound relies on a series of 2D videos showing individual cardiac structures. 3D echocardiography is a developing modality that now offers adequate image quality for clinical use, with potential to streamline acquisition and improve assessment of off-axis features. We propose an automated method to select standard 2D views from 3D cardiac ultrasound volumes, allowing physicians to interpret the data in their usual format while benefiting from the speed and usability of 3D scanning. Applying a deep learning view classifier and downstream heuristics based on anatomical landmarks together with heuristics provided by cardiologists, we reconstruct standard echocardiography views. This approach was validated by three cardiologists in blinded evaluation (96% accuracy in 1,600 videos from 2 hospitals). The downstream 2D videos were also validated in their ability to detect cardiac abnormalities using AI echocardiography models (EchoPrime and PanEcho) as well as ability to generate clinical-grade measurements of cardiac anatomy (EchoNet-Measurement). We demonstrated that the extracted 2D videos preserve spatial calibration and diagnostic features, allowing clinicians to obtain accurate real-world interpretations from 3D volumes. We release the code and a dataset of 29 3D echocardiography videos at this https URL.

109. 【2511.15943】Boosting Medical Visual Understanding From Multi-Granular Language Learning

Link: https://arxiv.org/abs/2511.15943

Authors: Zihan Li,Yiqing Wang,Sina Farsiu,Paul Kinahan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significantly enhanced visual, enhanced visual understanding, Recent advances, enhanced visual, visual understanding

Comments: Preprint. 40 pages

Click to view abstract

Abstract:Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at this https URL.
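
As background for the KL-based consistency term mentioned above, the KL divergence between discrete distributions p and q is the sum over i of p_i log(p_i / q_i). The abstract does not spell out what "smooth" means, so the sketch below assumes one common choice, adding a small epsilon to every entry and renormalizing; the name `smoothed_kl` and the `eps` parameter are illustrative only and not taken from the paper.

```python
import math


def smoothed_kl(p, q, eps=1e-8):
    """KL(p || q) for discrete distributions, with additive eps smoothing.

    Assumption: smoothing adds eps to each entry and renormalizes, which
    keeps the result finite when an entry is exactly zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    p_sum, q_sum = sum(p_s), sum(q_s)
    return sum(
        (a / p_sum) * math.log((a / p_sum) / (b / q_sum))
        for a, b in zip(p_s, q_s)
    )


# Identical distributions give a divergence of (numerically) zero;
# a one-hot p against a uniform q over two labels gives roughly log(2).
same = smoothed_kl([0.5, 0.5], [0.5, 0.5])
peaked = smoothed_kl([1.0, 0.0], [0.5, 0.5])
```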

110. 【2511.15923】RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

Link: https://arxiv.org/abs/2511.15923

Authors: Meilong Xu,Di Fu,Jiaxing Zhang,Gong Yu,Jiayu Zheng,Xiaoling Hu,Dongdi Zhao,Feiyang Li,Chao Chen,Yong Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Language Models, Vision Language, Language Models, multimedia understanding, increasingly integral

Comments: 11 pages, 2 figures

Click to view abstract

Abstract:Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical rationale gap, where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLMs to generate detailed textual rationales for each video, compelling them to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness as a result of the model's pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationale as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.

111. 【2511.15884】Box6D: Zero-shot Category-level 6D Pose Estimation of Warehouse Boxes

Link: https://arxiv.org/abs/2511.15884

Authors: Yintao Ma,Sajjad Pakdamansavoji,Amir Rasouli,Tongtong Cao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Accurate and efficient, bin picking, e-commerce fulfillment, clutter and occlusion, occlusion is critical

Comments:

Click to view abstract

Abstract:Accurate and efficient 6D pose estimation of novel objects under clutter and occlusion is critical for robotic manipulation across warehouse automation, bin picking, logistics, and e-commerce fulfillment. There are three main approaches in this domain: model-based methods assume an exact CAD model at inference but require high-resolution meshes and transfer poorly to new environments; model-free methods, which rely on a few reference images or videos, are more flexible but often fail under challenging conditions; and category-level approaches aim to balance flexibility and accuracy, but many are overly general and ignore environment and object priors, limiting their practicality in industrial settings. To this end, we propose Box6D, a category-level 6D pose estimation method tailored for storage boxes in the warehouse context. From a single RGB-D observation, Box6D infers the dimensions of the boxes via a fast binary search and estimates poses using a category CAD template rather than instance-specific models. Using a depth-based plausibility filter and early-stopping strategy, Box6D then rejects implausible hypotheses, lowering computational cost. We conduct evaluations on real-world storage scenarios and public benchmarks, and show that our approach delivers competitive or superior 6D pose precision while reducing inference time by approximately 76%.

112. 【2511.15875】Automatic Uncertainty-Aware Synthetic Data Bootstrapping for Historical Map Segmentation

Link: https://arxiv.org/abs/2511.15875

Authors: Lukas Arzoumanidis,Julius Knechtel,Jan-Henrik Haunert,Youness Dehbi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision applications, vision applications, automated analysis, drastically benefited, benefited from advances

Comments:

Click to view abstract

Abstract:The automated analysis of historical documents, particularly maps, has drastically benefited from advances in deep learning and its success across various computer vision applications. However, most deep learning-based methods heavily rely on large amounts of annotated training data, which are typically unavailable for historical maps, especially for those belonging to specific, homogeneous cartographic domains, also known as corpora. Creating high-quality training data suitable for machine learning often takes a significant amount of time and involves extensive manual effort. While synthetic training data can alleviate the scarcity of real-world samples, it often lacks the affinity (realism) and diversity (variation) necessary for effective learning. By transferring the cartographic style of an original historical map corpus onto vector data, we bootstrap an effectively unlimited number of synthetic historical maps suitable for tasks such as land-cover interpretation of a homogeneous historical map corpus. We propose an automatic deep generative approach and an alternative manual stochastic degradation technique to emulate the visual uncertainty and noise, also known as data-dependent uncertainty, commonly observed in historical map scans. To quantitatively evaluate the effectiveness and applicability of our approach, the generated training datasets were employed for domain-adaptive semantic segmentation on a homogeneous map corpus using a Self-Constructing Graph Convolutional Network, enabling a comprehensive assessment of the impact of our data bootstrapping methods.

113. 【2511.15874】WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion

Link: https://arxiv.org/abs/2511.15874

Authors: Sajjad Pakdamansavoji,Yintao Ma,Amir Rasouli,Tongtong Cao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: augmented reality, vital for robotics, scene understanding, Accurate, object pose estimation

Comments:

Click to view abstract

Abstract:Accurate 6D object pose estimation is vital for robotics, augmented reality, and scene understanding. For seen objects, high accuracy is often attainable via per-object fine-tuning, but generalizing to unseen objects remains a challenge. To address this problem, prior work assumes access to CAD models at test time and typically follows a multi-stage pipeline to estimate poses: detect and segment the object, propose an initial pose, and then refine it. Under occlusion, however, the early stages of such pipelines are prone to errors, which can propagate through the sequential processing and consequently degrade performance. To remedy this shortcoming, we propose four novel extensions to model-based 6D pose estimation methods: (i) a dynamic non-uniform dense sampling strategy that focuses computation on visible regions, reducing occlusion-induced errors; (ii) a multi-hypothesis inference mechanism that retains several confidence-ranked pose candidates, mitigating brittle single-path failures; (iii) iterative refinement to progressively improve pose accuracy; and (iv) a series of occlusion-focused training augmentations that strengthen robustness and generalization. Furthermore, we propose a new visibility-weighted metric for evaluation under occlusion to minimize the bias in the existing protocols. Via extensive empirical evaluations, we show that our proposed approach achieves more than 5% improvement in accuracy on ICBIN and more than 2% on BOP dataset benchmarks, while achieving approximately 3 times faster inference.

114. 【2511.15833】EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3

Link: https://arxiv.org/abs/2511.15833

Authors: Chengxi Zeng,Yuxuan Jiang,Aaron Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: advances visual understanding, Promptable Concept Segmentation, shared vision backbone, Progressive Hierarchical Distillation, Temporal Memory Distillation

Comments: GitHub: [this https URL](https://github.com/SimonZeng7108/efficientsam3)

Click to view abstract

Abstract:The Segment Anything Model 3 (SAM3) advances visual understanding with Promptable Concept Segmentation (PCS) across images and videos, but its unified architecture (shared vision backbone, DETR-style detector, dense-memory tracker) remains prohibitive for on-device use. We present EfficientSAM3, a family of efficient models built on Progressive Hierarchical Distillation (PHD) that transfers capability from SAM3 to lightweight students in three stages: (1) Encoder Distillation aligns image features via prompt-in-the-loop training on SA-1B; (2) Temporal Memory Distillation replaces dense memory with a compact Perceiver-based module trained on SA-V to compress and retrieve spatiotemporal features efficiently; and (3) End-to-End Fine-Tuning refines the full pipeline on the official SAM3 PCS data to preserve concept-level performance. PHD yields a spectrum of student variants using RepViT, TinyViT, and EfficientViT backbones, enabling on-device concept segmentation and tracking while maintaining high fidelity to teacher behavior. We benchmark on popular VOS datasets and compare with a variety of related work, achieving strong performance-efficiency trade-offs.

115. 【2511.15831】UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment

Link: https://arxiv.org/abs/2511.15831

Authors: Wei Zhang, Yeying Jin, Xin Li, Yan Zhang, Xiaofeng Cong, Cong Wang, Fengcai Qiao, Zhichao Lian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Image-based virtual try-on, Image-based virtual, universal VTON framework, synthesize photorealistic images, aims to synthesize

Notes: Accepted to AAAI-2026

Abstract:Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) a semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit semantic guidance for the generative process, thereby reducing the semantic gap. Moreover, by devising a two-stage progressive training strategy with a self-synthesis pipeline, UniFit is able to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance. The source code and pretrained models are available at this https URL.

116. 【2511.15717】How Modality Shapes Perception and Reasoning: A Study of Error Propagation in ARC-AGI

Link: https://arxiv.org/abs/2511.15717

Authors: Bo Wen, Chen Wang, Erhan Bilal

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Keywords: prize competitions make, competitions make progress, harder held-out tasks, small color-quantized grids, systematic generalization

Abstract:ARC-AGI and ARC-AGI-2 measure generalization-through-composition on small color-quantized grids, and their prize competitions make progress on these harder held-out tasks a meaningful proxy for systematic generalization. Recent instruction-first systems translate grids into concise natural-language or DSL rules executed in generate-execute-select loops, yet we lack a principled account of how encodings shape model perception and how to separate instruction errors from execution errors. We hypothesize that modality imposes perceptual bottlenecks -- text flattens 2D structure into 1D tokens while images preserve layout but can introduce patch-size aliasing -- thereby shaping which grid features are reliably perceived. To test this, we isolate perception from reasoning across nine text and image modalities using a weighted set-disagreement metric and a two-stage reasoning pipeline, finding that structured text yields precise coordinates on sparse features, images capture 2D shapes yet are resolution-sensitive, and combining them improves execution (about 8 perception points; about 0.20 median similarity). Overall, aligning representations with transformer inductive biases and enabling cross-validation between text and image yields more accurate instructions and more reliable execution without changing the underlying model.
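
The abstract mentions a weighted set-disagreement metric without defining it. One plausible reading, sketched here purely as an assumption, is a weighted Jaccard-style distance between the set of grid features a model reports and a reference set. The feature tuples, the weights, and the function name are all hypothetical.

```python
from typing import Dict, Hashable, Set


def weighted_set_disagreement(
    predicted: Set[Hashable],
    reference: Set[Hashable],
    weights: Dict[Hashable, float],
    default_weight: float = 1.0,
) -> float:
    """Weight of features found in only one set, divided by weight of all features.

    Returns 0.0 when the sets agree exactly and 1.0 when they share nothing.
    This definition is an illustrative assumption, not the paper's formula.
    """
    union = predicted | reference
    if not union:
        return 0.0
    only_one_side = predicted ^ reference  # symmetric difference
    weight = lambda feature: weights.get(feature, default_weight)
    return sum(weight(f) for f in only_one_side) / sum(weight(f) for f in union)


# Toy features: colored cells at (row, col) and one shape-level feature.
reference = {("cell", 0, 0, "red"), ("cell", 1, 2, "blue"), ("shape", "square")}
predicted = {("cell", 0, 0, "red"), ("cell", 2, 2, "green"), ("shape", "square")}

# Assumed weighting: shape-level features count twice as much as single cells.
weights = {("shape", "square"): 2.0}

disagreement = weighted_set_disagreement(predicted, reference, weights)
print(disagreement)
```

Here one reference cell is missed and one spurious cell is reported, so two units of weight disagree out of five in total.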

117. 【2406.10219】PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting

Link: https://arxiv.org/abs/2406.10219

Authors: Alex Hanson, Allen Tu, Vasu Singla, Mayuka Jayawardhana, Matthias Zwicker, Tom Goldstein

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: Recent advances, enabled real-time rendering, synthesis have enabled, enabled real-time, Gaussians

Notes: CVPR 2025, Project Page: [this https URL](https://pup3dgs.github.io/)

Abstract:Recent advances in novel view synthesis have enabled real-time rendering speeds with high reconstruction accuracy. 3D Gaussian Splatting (3D-GS), a foundational point-based parametric 3D scene representation, models scenes as large sets of 3D Gaussians. However, complex scenes can consist of millions of Gaussians, resulting in high storage and memory requirements that limit the viability of 3D-GS on devices with limited resources. Current techniques for compressing these pretrained models by pruning Gaussians rely on combining heuristics to determine which Gaussians to remove. At high compression ratios, these pruned scenes suffer from heavy degradation of visual fidelity and loss of foreground details. In this paper, we propose a principled sensitivity pruning score that preserves visual fidelity and foreground details at significantly higher compression ratios than existing approaches. It is computed as a second-order approximation of the reconstruction error on the training views with respect to the spatial parameters of each Gaussian. Additionally, we propose a multi-round prune-refine pipeline that can be applied to any pretrained 3D-GS model without changing its training pipeline. After pruning 90% of Gaussians, a substantially higher percentage than previous methods, our PUP 3D-GS pipeline increases average rendering speed by 3.56$\times$ while retaining more salient foreground information and achieving higher image quality metrics than existing techniques on scenes from the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets.
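
As a rough illustration of score-based pruning, the sketch below ranks Gaussians by a simplified second-order sensitivity and drops the least sensitive ones. It is not the paper's score: it assumes a squared-error loss, approximates the Hessian block of each Gaussian by the Gauss-Newton sum of gradient outer products over views, and reduces that block to its trace. The gradient values and names are made up for the example.

```python
from typing import Dict, List, Set


def sensitivity_score(view_gradients: List[List[float]]) -> float:
    """Trace of a Gauss-Newton Hessian block for one Gaussian.

    view_gradients holds one gradient vector per training view, taken with
    respect to that Gaussian's spatial parameters. The trace of the sum of
    outer products g g^T equals the sum of squared gradient entries.
    Using the trace is a simplification chosen for this sketch.
    """
    return sum(g_k * g_k for g in view_gradients for g_k in g)


def prune_round(scores: Dict[str, float], fraction: float) -> Set[str]:
    """Drop the given fraction of lowest-scoring Gaussians, return the kept ids."""
    n_remove = int(len(scores) * fraction)
    least_to_most_sensitive = sorted(scores, key=scores.get)
    return set(least_to_most_sensitive[n_remove:])


# Toy per-view gradients for four Gaussians with two spatial parameters each.
gradients = {
    "g0": [[0.1, 0.0], [0.0, 0.1]],
    "g1": [[2.0, 1.0]],
    "g2": [[0.0, 0.0]],
    "g3": [[1.0, 1.0], [1.0, -1.0]],
}

scores = {name: sensitivity_score(g) for name, g in gradients.items()}
kept = prune_round(scores, fraction=0.5)
print(sorted(kept))
```

In a multi-round prune-refine setting, one would fine-tune the kept Gaussians and recompute the scores before calling `prune_round` again. That loop is omitted here because it needs a renderer.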

118. 【2511.16268】Weakly Supervised Segmentation and Classification of Alpha-Synuclein Aggregates in Brightfield Midbrain Images

Link: https://arxiv.org/abs/2511.16268

Authors: Erwan Dereure, Robin Louiset, Laura Parkkinen, David A Menassa, David Holcman

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Keywords: forming Lewy bodies, Parkinson disease, Lewy Body Disease, pathology diagnostics, misfolded alpha-synuclein aggregates

Abstract:Parkinson's disease (PD) is a neurodegenerative disorder associated with the accumulation of misfolded alpha-synuclein aggregates, which form Lewy bodies and neuritic shapes used for pathology diagnostics. Automatic analysis of immunohistochemistry histopathological images with deep learning provides a promising tool for better understanding the spatial organization of these aggregates. In this study, we develop an automated image processing pipeline to segment and classify these aggregates in whole-slide images (WSIs) of midbrain tissue from PD and incidental Lewy Body Disease (iLBD) cases, based on weakly supervised segmentation that is robust to immunohistochemical labelling variability, combined with a ResNet50 classifier. Our approach differentiates between major aggregate morphologies, including Lewy bodies and neurites, with a balanced accuracy of $80\%$. This framework paves the way for large-scale characterization of the spatial distribution and heterogeneity of alpha-synuclein aggregates in brightfield immunohistochemical tissue, and for investigating their poorly understood relationships with surrounding cells such as microglia and astrocytes.
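
The reported figure is a balanced accuracy, which is the mean of the per-class recalls and therefore does not reward a classifier for favoring the majority class. A minimal sketch follows; the class labels and the toy predictions are invented for illustration and are not data from the paper.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    totals = defaultdict(int)
    hits = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == prediction:
            hits[truth] += 1
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example: 8 Lewy bodies and 2 neurites (hypothetical labels).
y_true = ["lewy_body"] * 8 + ["neurite"] * 2
y_pred = ["lewy_body"] * 8 + ["lewy_body", "neurite"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, balanced)
```

On this toy input the plain accuracy is 0.9 while the balanced accuracy is 0.75, because half of the minority class is misclassified.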

119. 【2511.15771】UniUltra: Interactive Parameter-Efficient SAM2 for Universal Ultrasound Segmentation

Link: https://arxiv.org/abs/2511.15771

Authors: Yue Li, Qing Xu, Yixuan Zhang, Xiangjian He, Qian Zhang, Yuan Yao, Fiseha B. Tesem, Xin Chen, Ruili Wang, Zhen Chen, Chang Wen Chen

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: remarkable universal segmentation, demonstrates remarkable universal, natural images, Segment, Segment Anything Model

Abstract:The Segment Anything Model 2 (SAM2) demonstrates remarkable universal segmentation capabilities on natural images. However, its performance on ultrasound images is significantly degraded due to domain disparities. This limitation raises two critical challenges: how to efficiently adapt SAM2 to ultrasound imaging while maintaining parameter efficiency, and how to deploy the adapted model effectively in resource-constrained clinical environments. To address these issues, we propose UniUltra for universal ultrasound segmentation. Specifically, we first introduce a novel context-edge hybrid adapter (CH-Adapter) that enhances fine-grained perception across diverse ultrasound imaging modalities while achieving parameter-efficient fine-tuning. To further improve clinical applicability, we develop a deep-supervised knowledge distillation (DSKD) technique that transfers knowledge from the large image encoder of the fine-tuned SAM2 to a super lightweight encoder, substantially reducing computational requirements without compromising performance. Extensive experiments demonstrate that UniUltra outperforms state-of-the-art methods with superior generalization capabilities. Notably, our framework achieves competitive performance using only 8.91% of SAM2's parameters during fine-tuning, and the final compressed model reduces the parameter count by 94.08% compared to the original SAM2, making it highly suitable for practical clinical deployment. The source code is available at this https URL.

120. 【2506.22568】Maximum Dispersion, Maximum Concentration: Enhancing the Quality of MOP Solutions

Link: https://arxiv.org/abs/2506.22568

Authors: Gladston Moreira, Ivan Meneghini, Elizabeth Wanner

Categories: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

Keywords: decision space, objective space, space, Multi-objective optimization problems, decision

Notes: 11 pages

Abstract:Multi-objective optimization problems (MOPs) often require a trade-off between conflicting objectives, maximizing diversity and convergence in the objective space. This study presents an approach to improve the quality of MOP solutions by optimizing the dispersion in the decision space and the convergence in a specific region of the objective space. Our approach defines a Region of Interest (ROI) based on a cone representing the decision maker's preferences in the objective space, while enhancing the dispersion of solutions in the decision space using a uniformity measure. Combining solution concentration in the objective space with dispersion in the decision space intensifies the search for Pareto-optimal solutions while increasing solution diversity. When combined, these characteristics improve the quality of solutions and avoid the bias caused by clustering solutions in a specific region of the decision space. Preliminary experiments suggest that this method enhances multi-objective optimization by generating solutions that effectively balance dispersion and concentration, thereby mitigating bias in the decision space.
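
A cone-shaped Region of Interest can be tested with a simple angle check: a point in objective space is inside the cone when the angle between the cone axis and the vector from the apex to the point is at most the half-angle. The parametrization below (apex, axis, half-angle in degrees) is an assumption made for this sketch, and `in_cone` is a hypothetical helper, not the authors' definition.

```python
import math
from typing import Sequence


def in_cone(
    point: Sequence[float],
    apex: Sequence[float],
    axis: Sequence[float],
    half_angle_deg: float,
) -> bool:
    """Return True if `point` lies inside the cone given by apex, axis, half-angle."""
    offset = [p - a for p, a in zip(point, apex)]
    offset_norm = math.sqrt(sum(x * x for x in offset))
    axis_norm = math.sqrt(sum(x * x for x in axis))
    if axis_norm == 0.0:
        raise ValueError("axis must be a non-zero vector")
    if offset_norm == 0.0:
        return True  # the apex itself counts as inside
    cos_angle = sum(o * a for o, a in zip(offset, axis)) / (offset_norm * axis_norm)
    # A smaller angle means a larger cosine, so compare against cos(half-angle).
    return cos_angle >= math.cos(math.radians(half_angle_deg))


# Toy bi-objective example: preference direction (1, 1), 15 degree half-angle.
apex, axis = (0.0, 0.0), (1.0, 1.0)
balanced_point = (1.0, 1.2)   # roughly 5 degrees off the axis
lopsided_point = (1.0, 0.2)   # roughly 34 degrees off the axis

print(in_cone(balanced_point, apex, axis, 15.0))
print(in_cone(lopsided_point, apex, axis, 15.0))
```

Solutions that pass this test would be the ones concentrated in the ROI. The dispersion measure in decision space is a separate component and is not sketched here.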