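This blog post presents the list of the latest papers fetched daily from arXiv, organized into categories such as natural language processing, information retrieval, and computer vision.

Statistics

517 papers were updated today, including:

Natural Language Processing: 59 papers
Information Retrieval: 15 papers
Computer Vision: 125 papers

Natural Language Processing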

1. 【2602.21202】Multi-Vector Index Compression in Any Modality

Authors: Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz, Benjamin Van Durme

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: study efficient multi-vector, efficient multi-vector retrieval, late interaction, study efficient, efficient multi-vector

Comments: 12 pages, 4 figures

Abstract:We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: this http URL.
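As a rough illustration of why late-interaction cost grows with document length, the sketch below implements the generic MaxSim scoring rule (each query vector is matched to its best document vector, and the matches are summed) together with a naive chunk-averaging compressor. The function names and the chunk-averaging step are assumptions made for illustration only; they are not the paper's sequence resizing, memory tokens, hierarchical pooling, or AGC methods.

```python
# Illustrative sketch only: generic late-interaction (MaxSim) scoring plus a
# naive fixed-budget compressor. Not the methods evaluated in the paper.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query vectors, of the best-matching document vector score."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def mean_pool_to_budget(doc_vecs, budget):
    """Hypothetical baseline: average contiguous chunks down to `budget` vectors."""
    n = len(doc_vecs)
    if n <= budget:
        return [list(v) for v in doc_vecs]
    pooled = []
    for i in range(budget):
        start, end = i * n // budget, (i + 1) * n // budget
        chunk = doc_vecs[start:end]
        dim = len(chunk[0])
        pooled.append([sum(v[k] for v in chunk) / len(chunk) for k in range(dim)])
    return pooled


query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]

full = maxsim_score(query, doc)                            # compares against 4 vectors
small = maxsim_score(query, mean_pool_to_budget(doc, 2))   # compares against 2 vectors
print(full, small)
```

Scoring against the full index costs one inner product per (query vector, document vector) pair, so halving the number of stored document vectors halves both storage and scoring work, at the price of a coarser match in this toy case.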

2. 【2602.21201】Aletheia tackles FirstProof autonomously

Link: https://arxiv.org/abs/2602.21201

Authors: Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V. Le, Thang Luong

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: mathematics research agent, research agent powered, powered by Gemini, inaugural FirstProof challenge, report the performance

Comments: 34 pages. Project page: [this https URL](https://github.com/google-deepmind/superhuman/tree/main/aletheia)

Abstract:We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 problems (2, 5, 7, 8, 9, 10) out of 10 according to majority expert assessments; we note that experts were not unanimous on Problem 8 (only). For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as our evaluation. Raw prompts and outputs are available at this https URL.

3. 【2602.21198】Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Link: https://arxiv.org/abs/2602.21198

Authors: Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Embodied LLMs endow, high-level task reasoning, LLMs endow robots, Embodied LLMs, Reflective Test-Time Planning

Abstract:Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.

4. 【2602.21193】On Data Engineering for Scaling LLM Terminal Capabilities

Link: https://arxiv.org/abs/2602.21193

Authors: Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping

Categories: Computation and Language (cs.CL)

Keywords: remain largely undisclosed, rapid recent progress, agents remain largely, terminal agents remain, large language models

Abstract:Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at this https URL.

5. 【2602.21165】PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

Link: https://arxiv.org/abs/2602.21165

Authors: Samah Fodeh, Linhai Ma, Yan Wang, Srivani Talakokkul, Ganesh Puthiaraju, Afshan Khan, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: reflecting communicative behaviors, Patient-generated text, reflecting communicative, interviews contains rich, rich expressions

Abstract:Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH). Traditional qualitative coding frameworks are labor intensive and do not scale to large volumes of patient-authored messages across health systems. Existing machine learning (ML) and natural language processing (NLP) approaches provide partial solutions but often treat patient-centered communication (PCC) and SDoH as separate tasks or rely on models not well suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code, Subcode, and Combo-level labels. Topic representations are incorporated during fine-tuning and inference to enrich semantic inputs. PVminer achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, achieving F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request for research use.

6. 【2602.21158】SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

Link: https://arxiv.org/abs/2602.21158

Authors: Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large language models, Large language, multi-step decision-making agents, Evolving LLM Agent, effective reward design

Abstract:Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate, providing dense confidence-aligned supervision, and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently improves success rates over strong baselines. Ablation studies further demonstrate how uncertainty signals enhance exploration and robustness.
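The three uncertainty families named in the abstract (entropy, least confidence, margin) have standard textbook definitions over a next-token probability distribution, sketched below. How SELAUR actually weights or combines them is not stated in the abstract; the unweighted mean in `combined_uncertainty`, the entropy normalization, and all function names here are assumptions for illustration.

```python
import math

# Illustrative sketch only: textbook uncertainty measures over one token's
# probability distribution. The combination rule is an assumption, not SELAUR's.

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def least_confidence(probs):
    """One minus the probability of the most likely token."""
    return 1.0 - max(probs)


def margin_uncertainty(probs):
    """One minus the gap between the two most likely tokens (needs >= 2 entries)."""
    top = sorted(probs, reverse=True)
    return 1.0 - (top[0] - top[1])


def combined_uncertainty(probs):
    """Assumed combination: unweighted mean, with entropy scaled to [0, 1]."""
    norm_entropy = entropy(probs) / math.log(len(probs))
    return (norm_entropy + least_confidence(probs) + margin_uncertainty(probs)) / 3.0


confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
print(combined_uncertainty(confident), combined_uncertainty(uniform))
```

All three measures rise as the distribution flattens, so the uniform distribution scores higher than the peaked one under any reasonable weighting.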

7. 【2602.21143】A Benchmark for Deep Information Synthesis

Link: https://arxiv.org/abs/2602.21143

Authors: Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger, Aysim Toker, Roy Miles, Andreea-Maria Oncescu, Jasivan Alex Sivakumar, Philipp Borchert, Ismail Elezi, Meiru Zhang, Ka Yiu Lee, Guchun Zhang, Jun Wang, Gerasimos Lampouras

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Large language model, complex tasks involving, tasks involving tool, code execution, solve complex tasks

Comments: Accepted at ICLR 2026

Abstract:Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights. DEEPSYNTH contains 120 tasks collected across 7 domains and data sources covering 67 countries. DEEPSYNTH is constructed using a multi-stage data collection pipeline that requires annotators to collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers. When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces, highlighting DEEPSYNTH as a crucial benchmark for guiding future research.

8. 【2602.21103】Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

Link: https://arxiv.org/abs/2602.21103

Authors: Sanket Badhe, Deep Shah

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: test-time inference costs, Advanced reasoning typically, substantial test-time inference, reasoning typically requires, incurs prohibitive latency

Abstract:Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model's System Prompt. Evaluated on the StereoSet and Contract-NLI datasets using Gemma-3 4B, PLD improved Macro F1 scores from 57% to 90.0% and 67% to 83% respectively, enabling this compact model to match frontier performance with negligible latency overhead. These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-volume use cases and edge devices.

9. 【2602.21082】Beyond the Star Rating: A Scalable Framework for Aspect-Based Sentiment Analysis Using LLMs and Text Classification

Link: https://arxiv.org/abs/2602.21082

Authors: Vishal Patil, Shree Vaishnavi Bacha, Revanth Yamani, Yidan Sun, Mayank Kejriwal

Categories: Computation and Language (cs.CL)

Keywords: Customer-provided reviews, important source, source of information, information for business, business owners

Abstract:Customer-provided reviews have become an important source of information for business owners and other customers alike. However, effectively analyzing millions of unstructured reviews remains challenging. While large language models (LLMs) show promise for natural language understanding, their application to large-scale review analysis has been limited by computational costs and scalability concerns. This study proposes a hybrid approach that uses LLMs for aspect identification while employing classic machine-learning methods for sentiment classification at scale. Using ChatGPT to analyze sampled restaurant reviews, we identified key aspects of dining experiences and developed sentiment classifiers using human-labeled reviews, which we subsequently applied to 4.7 million reviews collected over 17 years from a major online platform. Regression analysis reveals that our machine-labeled aspects significantly explain variance in overall restaurant ratings across different aspects of dining experiences, cuisines, and geographical regions. Our findings demonstrate that combining LLMs with traditional machine learning approaches can effectively automate aspect-based sentiment analysis of large-scale customer feedback, suggesting a practical framework for both researchers and practitioners in the hospitality industry and potentially, other service sectors.

10. 【2602.21059】An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems

Link: https://arxiv.org/abs/2602.21059

Authors: Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, reliability remains uncertain, transforming scholarly tasks

Comments: 24 pages, 2 figures. Accepted at ACM CHI conference on Human Factors in Computing Systems, 2026

Abstract:Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain. Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice. We developed and validated a schema for evaluating LLM errors in scholarly question-answering systems that reflects the assessment strategies of practicing scientists. In collaboration with domain experts, we identified 20 error patterns across seven categories through thematic analysis of 68 question-answer pairs. We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues. Domain experts use systematic assessment strategies, including technical precision testing, value-based evaluation, and meta-evaluation of their own practices. We discuss implications for supporting expert evaluation of LLM outputs, including opportunities for personalized, schema-driven tools that adapt to individual evaluation patterns and expertise levels.

11. 【2602.21054】VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation

Link: https://arxiv.org/abs/2602.21054

Authors: Seongheon Park, Changdae Oh, Hyeong Kyu Choi, Xuefeng Du, Sharon Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Vision-Language Models, Large Vision-Language, frequently hallucinate, limiting their safe, real-world applications

Abstract:Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model's ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model's output depends on visual evidence. VAUQ introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to visual input, and an unsupervised core-region masking strategy that amplifies the influence of salient regions. Combining predictive entropy with this core-masked IS yields a training-free scoring function that reliably reflects answer correctness. Comprehensive experiments show that VAUQ consistently outperforms existing self-evaluation methods across multiple datasets.

12. 【2602.21045】PaperTrail: A Claim-Evidence Interface for Grounding Provenance in LLM-based Scholarly QA

Link: https://arxiv.org/abs/2602.21045

Authors: Anna Martin-Boyle, Cara A.C. Leckey, Martha C. Brown, Harmanpreet Kaur

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

Keywords: Large language models, synthesize vast amounts, Large language, researchers synthesize vast, language models

Comments: 25 pages, 3 figures. Accepted at the ACM CHI conference on Human Factors in Computing Systems 2026

Abstract:Large language models (LLMs) are increasingly used in scholarly question-answering (QA) systems to help researchers synthesize vast amounts of literature. However, these systems often produce subtle errors (e.g., unsupported claims, errors of omission), and current provenance mechanisms like source citations are not granular enough for the rigorous verification that scholarly domain requires. To address this, we introduce PaperTrail, a novel interface that decomposes both LLM answers and source documents into discrete claims and evidence, mapping them to reveal supported assertions, unsupported claims, and information omitted from the source texts. We evaluated PaperTrail in a within-subjects study with 26 researchers who performed two scholarly editing tasks using PaperTrail and a baseline interface. Our results show that PaperTrail significantly lowered participants' trust compared to the baseline. However, this increased caution did not translate to behavioral changes, as people continued to rely on LLM-generated scholarly edits to avoid a cognitively burdensome task. We discuss the value of claim-evidence matching for understanding LLM trustworthiness in scholarly settings, and present design implications for cognition-friendly communication of provenance information.

13. 【2602.21009】HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders

Link: https://arxiv.org/abs/2602.21009

Authors: Kun Yuan, Junyu Bi, Daixuan Cheng, Changfa Wu, Shuwen Xiao, Binbin Cao, Jian Wu, Yuning Jiang

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Modern recommender systems, recommender systems leverage, systems leverage ultra-long, leverage ultra-long user, Modern recommender

Abstract:Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints. While summarizing history via interest centers offers a practical alternative, existing methods struggle to (1) identify user-specific centers at appropriate granularity and (2) accurately assign behaviors, leading to quantization errors and loss of long-tail preferences. To alleviate these issues, we propose Hierarchical Sparse Activation Compression (HiSAC), an efficient framework for personalized sequence modeling. HiSAC encodes interactions into multi-level semantic IDs and constructs a global hierarchical codebook. A hierarchical voting mechanism sparsely activates personalized interest-agents as fine-grained preference centers. Guided by these agents, Soft-Routing Attention aggregates historical signals in semantic space, weighting by similarity to minimize quantization error and retain long-tail behaviors. Deployed on Taobao's "Guess What You Like" homepage, HiSAC achieves significant compression and cost reduction, with online A/B tests showing a consistent 1.65% CTR uplift -- demonstrating its scalability and real-world effectiveness.

14. 【2602.20995】Generative Pseudo-Labeling for Pre-Ranking with LLMs

Link: https://arxiv.org/abs/2602.20995

Authors: Junyu Bi, Xinting Niu, Daixuan Cheng, Kun Yuan, Tao Wang, Binbin Cao, Jian Wu, Yuning Jiang

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: efficiently scoring thousands, tasked with efficiently, downstream ranking, critical stage, stage in industrial

Abstract:Pre-ranking is a critical stage in industrial recommendation systems, tasked with efficiently scoring thousands of recalled items for downstream ranking. A key challenge is the train-serving discrepancy: pre-ranking models are trained only on exposed interactions, yet must score all recalled candidates -- including unexposed items -- during online serving. This mismatch not only induces severe sample selection bias but also degrades generalization, especially for long-tail content. Existing debiasing approaches typically rely on heuristics (e.g., negative sampling) or distillation from biased rankers, which either mislabel plausible unexposed items as negatives or propagate exposure bias into pseudo-labels. In this work, we propose Generative Pseudo-Labeling (GPL), a framework that leverages large language models (LLMs) to generate unbiased, content-aware pseudo-labels for unexposed items, explicitly aligning the training distribution with the online serving space. By offline generating user-specific interest anchors and matching them with candidates in a frozen semantic space, GPL provides high-quality supervision without adding online latency. Deployed in a large-scale production system, GPL improves click-through rate by 3.07%, while significantly enhancing recommendation diversity and long-tail item discovery.

15. 【2602.20976】Evaluating Proactive Risk Awareness of Large Language Models

Link: https://arxiv.org/abs/2602.20976

Authors: Xuan Luo, Yubin Chen, Zhiyu Hou, Linpu Yu, Geng Tu, Jing Li, Ruifeng Xu

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: explicit harmful intent, large language models, safety responsibilities extend, everyday decision-making, increasingly embedded

Abstract:As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended but consequential risks. In this work, we introduce a proactive risk awareness evaluation framework that measures whether LLMs can anticipate potential harms and provide warnings before damage occurs. We construct the Butterfly dataset to instantiate this framework in the environmental and ecological domain. It contains 1,094 queries that simulate ordinary solution-seeking activities whose responses may induce latent ecological impact. Through experiments across five widely used LLMs, we analyze the effects of response length, languages, and modality. Experimental results reveal consistent, significant declines in proactive awareness under length-restricted responses, cross-lingual similarities, and persistent blind spots in (multimodal) species protection. These findings highlight a critical gap between current safety alignment and the requirements of real-world ecological responsibility, underscoring the need for proactive safeguards in LLM deployment.

16. 【2602.20973】Linear Reasoning vs. Proof by Cases: Obstacles for Large Language Models in FOL Problem Solving

Link: https://arxiv.org/abs/2602.20973

Authors: Yuliang Ji, Fuchen Shen, Jian Wu, Qiujie Xie, Yue Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, capabilities of Large, introduced abundant mathematical, Large Language, researchers have introduced

Abstract:To comprehensively evaluate the mathematical reasoning capabilities of Large Language Models (LLMs), researchers have introduced abundant mathematical reasoning datasets. However, most existing datasets primarily focus on linear reasoning, neglecting other parts such as proof by contradiction and proof by cases, which are crucial for investigating LLMs' reasoning abilities. To address this limitation, we first introduce a novel first-order logic (FOL) dataset named PC-FOL, annotated by professional mathematicians, focusing on case-based reasoning problems. All instances in this dataset are equipped with a manually written natural language proof, clearly distinguishing it from conventional linear reasoning datasets. Our experimental results over leading LLMs demonstrate a substantial performance gap between linear reasoning and case-based reasoning problems. To further investigate this phenomenon, we provide a theoretical analysis grounded in graphical model, which provides an explanation for the observed disparity between the two types of reasoning problems. We hope this work can reveal the core challenges in the field of automated natural language mathematical proof generation, paving the way for future research.

17. 【2602.20966】Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models

Link: https://arxiv.org/abs/2602.20966

Authors: Paola Merlo, Chunyang Jiang, Giuseppe Samo, Vivi Nastase

Categories: Computation and Language (cs.CL)

Keywords: Blackbird Language Matrices, Language Matrices, Blackbird Language, inspired by intelligence, intelligence tests

Comments: Under review, 46 pages, 5 tables, 28 figures

Abstract:This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems, structured at multiple levels: within each sentence, across the input sequence, within each candidate answer. Because of their rich structure, these curated, but naturalistic datasets are key to answering some core questions about current large language models' abilities: do LLMs detect linguistic objects and their properties? Do they detect and use systematic patterns across sentences? Are they more prone to linguistic or reasoning errors, and how do these interact? We show that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models. We show that their representations contain the grammatical objects and attributes relevant to solve a linguistic task. We also show that these solutions are reached by detecting systematic patterns across sentences. The paper supports the point of view that curated, structured datasets support multi-faceted investigations of properties of language and large language models. Because they present a curated, articulated structure, because they comprise both learning contexts and expected answers, and because they are partly built by hand, BLMs fall in the category of datasets that can support explainability investigations, and be useful to ask why large language models behave the way they do.

18. 【2602.20945】The Art of Efficient Reasoning: Data, Reward, and Optimization

Link: https://arxiv.org/abs/2602.20945

Authors: Taiqiang Wu, Zenan Zu, Bo Zhou, Ngai Wong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, heavy computational overhead, consistently benefit

Comments: Tech Report, Insights on Efficient Reasoning via Reward Shaping

Abstract:Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate for more fine-grained metrics, including length distribution conditioned on correctness and performance across a wide spectrum of token budgets ranging from 2k to 32k. First, we reveal that the training process follows a two-stage paradigm: length adaptation and reasoning refinement. After that, we conduct extensive experiments (about 0.2 million GPU hours) in a unified protocol, deconstructing training prompts and rollouts, reward shaping, and optimization strategies. In particular, a key finding is to train on relatively easier prompts, ensuring the density of positive reward signals and thus avoiding the length collapse. Meanwhile, the learned length bias can be generalized across domains. We distill all findings into valuable insights and practical guidelines, and further validate them across the Qwen3 series, ranging from 0.6B to 30B, demonstrating the robustness and generalization.

19. 【2602.20918】Predicting Sentence Acceptability Judgments in Multimodal Contexts

Link: https://arxiv.org/abs/2602.20918

Authors: Hyewon Jang, Nikolai Ilinykh, Sharid Loáiciga, Jey Han Lau, Shalom Lappin

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: deep neural networks, neural networks, visual contexts, examined the capacity, capacity of deep

Abstract:Previous work has examined the capacity of deep neural networks (DNNs), particularly transformers, to predict human sentence acceptability judgments, both independently of context, and in document contexts. We consider the effect of prior exposure to visual images (i.e., visual context) on these judgments for humans and large language models (LLMs). Our results suggest that, in contrast to textual context, visual images appear to have little if any impact on human acceptability ratings. However, LLMs display the compression effect seen in previous work on human judgments in document contexts. Different sorts of LLMs are able to predict human acceptability judgments to a high degree of accuracy, but in general, their performance is slightly better when visual contexts are removed. Moreover, the distribution of LLM judgments varies among models, with Qwen resembling human patterns, and others diverging from them. LLM-generated predictions on sentence acceptability are highly correlated with their normalised log probabilities in general. However, the correlations decrease when visual contexts are present, suggesting that a higher gap exists between the internal representations of LLMs and their generated predictions in the presence of visual contexts. Our experimental work suggests interesting points of similarity and of difference between human and LLM processing of sentences in multimodal contexts.

20. 【2602.20892】Exa-PSD: a new Persian sentiment analysis dataset on Twitter

链接：https://arxiv.org/abs/2602.20892

作者：Seyed Himan Ghaderi,Saeed Sarbazi Azad,Mohammad Mehdi Jaziriyan,Ahmad Akbari

类目：Computation and Language (cs.CL)

关键词：widely used platforms, platforms for communication, Persian, Sentiment analysis, communication of people

备注： This is the original submitted (preprint) version of a paper published in Language Resources and Evaluation. The final published version is available at Springer via DOI: [this https URL](https://doi.org/10.1007/s10579-025-09886-5)

点击查看摘要

Abstract:Today, social networks such as Twitter are the most widely used platforms for communication. Analyzing their data yields useful information about the opinions people express in tweets. Sentiment analysis, which identifies individuals' opinions about a specific topic, plays a vital role in NLP. Natural language processing in Persian still faces many challenges despite the advent of strong language models. The datasets available in Persian generally cover specific topics such as products, foods, and hotels, whereas users on social media may use irony and colloquial phrases. To overcome these challenges, a dataset for Persian sentiment analysis on Twitter is needed. In this paper, we introduce the Exa sentiment analysis Persian dataset, which is collected from Persian tweets. This dataset contains 12,000 tweets, annotated by 5 native Persian taggers. The data is labeled in 3 classes: positive, neutral and negative. We present the characteristics and statistics of this dataset and use the pre-trained ParsBERT and RoBERTa as base models to evaluate it. Our evaluation reached a Macro F-score of 79.87, which shows that the model and data can be adequately valuable for a sentiment analysis system.

21. 【2602.20859】FinAnchor: Aligned Multi-Model Representations for Financial Prediction

链接：https://arxiv.org/abs/2602.20859

作者：Zirui He,Huopu Zhang,Yanguang Liu,Sirui Wu,Mengnan Du

类目：Computation and Language (cs.CL)

关键词：involves significant challenges, long documents involves, documents involves significant, generating embeddings varies, significant challenges

备注： 11 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Financial prediction from long documents involves significant challenges, as actionable signals are often sparse and obscured by noise, and the optimal LLM for generating embeddings varies across tasks and time periods. In this paper, we propose FinAnchor (Financial Anchored Representations), a lightweight framework that integrates embeddings from multiple LLMs without fine-tuning the underlying models. FinAnchor addresses the incompatibility of feature spaces by selecting an anchor embedding space and learning linear mappings to align representations from other models into this anchor. These aligned features are then aggregated to form a unified representation for downstream prediction. Across multiple financial NLP tasks, FinAnchor consistently outperforms strong single-model baselines and standard ensemble methods, demonstrating the effectiveness of anchoring heterogeneous representations for robust financial prediction.
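As a rough illustration of the alignment idea described in this abstract, the sketch below fits a linear map from one embedding space into an anchor space and then combines the aligned features. The ordinary least-squares fit, the mean aggregation, and every name here (`fit_linear_map`, `align_and_aggregate`, the synthetic embeddings) are assumptions made for illustration; the abstract does not say how the mappings are learned or how the aligned features are aggregated.

```python
import numpy as np

def fit_linear_map(source, anchor):
    """Least-squares W such that source @ W approximates anchor (assumed fitting rule)."""
    W, *_ = np.linalg.lstsq(source, anchor, rcond=None)
    return W

def align_and_aggregate(anchor, others):
    """Map each non-anchor embedding matrix into the anchor space, then average.

    anchor: (n_docs, d_anchor) embeddings from the model chosen as anchor.
    others: list of (n_docs, d_i) embeddings from other models, same documents.
    Averaging is a hypothetical aggregation choice, not taken from the paper.
    """
    aligned = [anchor]
    for emb in others:
        W = fit_linear_map(emb, anchor)
        aligned.append(emb @ W)
    return np.mean(aligned, axis=0)

# Synthetic demo: the second "model" is a linear view of the same documents.
rng = np.random.default_rng(0)
anchor_emb = rng.normal(size=(200, 8))
other_emb = anchor_emb @ rng.normal(size=(8, 12))
fused = align_and_aggregate(anchor_emb, [other_emb])
print(fused.shape)
```

In practice the maps would be fitted on a training split and applied to held-out documents; here the fit and the application use the same rows only to keep the sketch short.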

22. 【2602.20816】Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation

链接：https://arxiv.org/abs/2602.20816

作者：Sayantan Dasgupta,Trevor Cohn,Timothy Baldwin

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：core learning signal, language model distillation, standard Kullback-Leibler, core learning, learning signal

备注：

点击查看摘要

Abstract:The core learning signal used in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be dominated by the next tokens with the highest probabilities, i.e., the teacher's modes, thereby diminishing the influence of less probable yet potentially informative components of the output distribution. We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions, while maintaining the same computational profile as the KL Divergence. Our decoupled approach reduces the impact of the teacher modes and, consequently, increases the contribution of the tail of the distribution. Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models across various datasets. Furthermore, the distillation process is efficient and can be performed with a modest academic budget for large datasets, eliminating the need for industry-scale computing.
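The observation that standard KL is dominated by the teacher's modes can be made concrete with the chain rule for KL divergence, which splits KL(p‖q) into a head/tail mass term, a top-K term weighted by the teacher's head mass, and a tail term weighted by the teacher's (usually small) tail mass. The sketch below only verifies this textbook decomposition numerically; it is not the divergence proposed in the paper, and the function names and random distributions are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def decomposed_kl(p, q, k):
    """Split KL(p || q) by the teacher's top-k tokens.

    Returns (mass_term, weighted_head_term, weighted_tail_term); the three
    values sum to KL(p || q) by the chain rule for KL divergence.
    """
    mask = np.zeros(p.shape, dtype=bool)
    mask[np.argsort(p)[::-1][:k]] = True          # teacher's top-k indices
    p_head, q_head = p[mask].sum(), q[mask].sum()
    p_tail, q_tail = p[~mask].sum(), q[~mask].sum()
    mass = kl(np.array([p_head, p_tail]), np.array([q_head, q_tail]))
    head = kl(p[mask] / p_head, q[mask] / q_head)
    tail = kl(p[~mask] / p_tail, q[~mask] / q_tail)
    return mass, p_head * head, p_tail * tail

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
teacher = softmax(3.0 * rng.normal(size=1000))    # peaked teacher distribution
student = softmax(3.0 * rng.normal(size=1000))
mass, head, tail = decomposed_kl(teacher, student, k=10)
print(mass, head, tail, kl(teacher, student))
```

Because the tail term enters with weight equal to the teacher's tail mass, a loss that reweights the three terms independently is one plausible way to "decouple" head and tail; whether the paper does exactly this is not stated in the abstract.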

23. 【2602.20759】Overton Pluralistic Reinforcement Learning for Large Language Models

链接：https://arxiv.org/abs/2602.20759

作者：Yu Fu,Seongho Son,Ilija Bogunovic

类目：Computation and Language (cs.CL)

关键词：Existing alignment paradigms, alignment paradigms remain, paradigms remain limited, Existing alignment, Overton Pluralism

备注： 28 pages, 8 figures

点击查看摘要

Abstract:Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.

24. 【2602.20751】SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing

链接：https://arxiv.org/abs/2602.20751

作者：Yifei Xu,Guilherme Potje,Shivam Shandilya,Tiancheng Yuan,Leonardo de Oliveira Nunes,Rakshanda Agarwal,Saeid Asgari,Adam Atkinson,Emre Kıcıman,Songwu Lu,Ranveer Chandra,Tusher Chakraborty

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Designing aligned, open-ended generation remains, aligned and robust, generation remains, remains a key

备注：

点击查看摘要

Abstract:Designing aligned and robust rewards for open-ended generation remains a key barrier to RL post-training. Rubrics provide structured, interpretable supervision, but scaling rubric construction is difficult: expert rubrics are costly, prompted rubrics are often superficial or inconsistent, and fixed-pool discriminative rubrics can saturate and drift, enabling reward hacking. We present SibylSense, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items. Memory is updated via verifier-based item rewards measured by reference-candidate answer discriminative gaps from a handful of examples. SibylSense alternates memory tuning with a rubric-adversarial policy update that produces rubric-satisfying candidate answers, shrinking discriminative gaps and driving the rubric generator to capture new quality dimensions. Experiments on two open-ended tasks show that SibylSense yields more discriminative rubrics and improves downstream RL performance over static and non-adaptive baselines.

25. 【2602.20749】Explicit Grammar Semantic Feature Fusion for Robust Text Classification

链接：https://arxiv.org/abs/2602.20749

作者：Azrin Sultana,Firoz Ahmed

类目：Computation and Language (cs.CL)

关键词：Natural Language Processing, Language Processing enables, Processing enables computers, understand human language, Natural Language

备注： 30 pages, 9 figures

点击查看摘要

Abstract:Natural Language Processing enables computers to understand human language by analysing and classifying text efficiently with deep-level grammatical and semantic features. Existing models capture features by learning from large corpora with transformer models, which are computationally intensive and unsuitable for resource-constrained environments. Therefore, our proposed study incorporates comprehensive grammatical rules alongside semantic information to build a robust, lightweight classification model without resorting to fully parameterised transformer models or heavy deep learning architectures. The novelty of our approach lies in its explicit encoding of sentence-level grammatical structure, including syntactic composition, phrase patterns, and complexity indicators, into a compact grammar vector, which is then fused with frozen contextual embeddings. These heterogeneous elements are unified into a single representation that captures both the structural and semantic characteristics of the text. Deep learning models such as Deep Belief Networks (DBNs), Long Short-Term Memory networks (LSTMs), BiLSTMs, and transformer-based BERT and XLNet were used to train and evaluate the model, with the number of epochs varied. Based on experimental results, the unified feature representation model captures both the semantic and structural properties of text, outperforming baseline models by 2%-15%, enabling more effective learning across heterogeneous domains. Unlike prior syntax-aware transformer models that inject grammatical structure through additional attention layers, tree encoders, or full fine-tuning, the proposed framework treats grammar as an explicit inductive bias rather than a learnable module, resulting in a very lightweight model that delivers better performance on edge devices.

26. 【2602.20743】Adaptive Text Anonymization: Learning Privacy-Utility Trade-offs via Prompt Optimization

链接：https://arxiv.org/abs/2602.20743

作者：Gabriel Loiseau,Damien Sileo,Damien Riquet,Maxime Meyer,Marc Tommasi

类目：Computation and Language (cs.CL)

关键词：Anonymizing textual documents, highly context-sensitive problem, Anonymizing textual, utility preservation varies, context-sensitive problem

备注：

点击查看摘要

Abstract:Anonymizing textual documents is a highly context-sensitive problem: the appropriate balance between privacy protection and utility preservation varies with the data domain, privacy objectives, and downstream application. However, existing anonymization methods rely on static, manually designed strategies that lack the flexibility to adjust to diverse requirements and often fail to generalize across domains. We introduce adaptive text anonymization, a new task formulation in which anonymization strategies are automatically adapted to specific privacy-utility requirements. We propose a framework for task-specific prompt optimization that automatically constructs anonymization instructions for language models, enabling adaptation to different privacy goals, domains, and downstream usage patterns. To evaluate our approach, we present a benchmark spanning five datasets with diverse domains, privacy constraints, and utility objectives. Across all evaluated settings, our framework consistently achieves a better privacy-utility trade-off than existing baselines, while remaining computationally efficient and effective on open-source language models, with performance comparable to larger closed-source models. Additionally, we show that our method can discover novel anonymization strategies that explore different points along the privacy-utility trade-off frontier.

27. 【2602.20735】RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition

链接：https://arxiv.org/abs/2602.20735

作者：Kun Ran,Marwah Alaofi,Danula Hettiachchi,Chenglong Ma,Khoi Nguyen Dinh Anh,Khoi Vo Nguyen,Sachin Pathiyan Cherumanal,Lida Rashidi,Falk Scholer,Damiano Spina,Shuoqi Sun,Oleg Zendel

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：

备注： MMU-RAG NeurIPS 2025 winning system

点击查看摘要

Abstract:This paper presents the award-winning RMIT-ADM+S system for the Text-to-Text track of the NeurIPS 2025 MMU-RAG Competition. We introduce Routing-to-RAG (R2RAG), a research-focused retrieval-augmented generation (RAG) architecture composed of lightweight components that dynamically adapt the retrieval strategy based on inferred query complexity and evidence sufficiency. The system uses smaller LLMs, enabling operation on a single consumer-grade GPU while supporting complex research tasks. It builds on the G-RAG system, winner of the ACM SIGIR 2025 LiveRAG Challenge, and extends it with modules informed by qualitative review of outputs. R2RAG won the Best Dynamic Evaluation award in the Open Source category, demonstrating high effectiveness with careful design and efficient use of resources.

Submission history: From Oleg Zendel, [v1] Tue, 24 Feb 2026 09:58:25 UTC (93 KB)


信息检索

1. 【2602.21202】Multi-Vector Index Compression in Any Modality

链接：https://arxiv.org/abs/2602.21202

作者：Hanxiang Qin,Alexander Martin,Rohan Jha,Chunsheng Zuo,Reno Kriz,Benjamin Van Durme

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：study efficient multi-vector, efficient multi-vector retrieval, late interaction, study efficient, efficient multi-vector

备注： 12 pages, 4 figures

点击查看摘要

2. 【2602.21143】A Benchmark for Deep Information Synthesis

链接：https://arxiv.org/abs/2602.21143

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词：Large language model, complex tasks involving, tasks involving tool, code execution, solve complex tasks

备注： Accepted at ICLR 2026

点击查看摘要

3. 【2602.21103】Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

链接：https://arxiv.org/abs/2602.21103

作者：Sanket Badhe,Deep Shah

类目：Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词：test-time inference costs, Advanced reasoning typically, substantial test-time inference, reasoning typically requires, incurs prohibitive latency

备注：

点击查看摘要

4. 【2602.21099】Turning Semantics into Topology: LLM-Driven Attribute Augmentation for Collaborative Filtering

链接：https://arxiv.org/abs/2602.21099

作者：Junjie Meng,Ranxu zhang,Wei Wu,Rui Zhang,Chuan Qin,Qi Zhang,Qi Liu,Hui Xiong,Chao Wang

类目：Information Retrieval (cs.IR)

关键词：Large Language Models, Large Language, shown great potential, enhancing recommender systems, reasoning capabilities

备注：

点击查看摘要

Abstract:Large Language Models (LLMs) have shown great potential for enhancing recommender systems through their extensive world knowledge and reasoning capabilities. However, effectively translating these semantic signals into traditional collaborative embeddings remains an open challenge. Existing approaches typically fall into two extremes: direct inference methods are computationally prohibitive for large-scale retrieval, while embedding-based methods primarily focus on unilateral feature augmentation rather than holistic collaborative signal enhancement. To bridge this gap, we propose Topology-Augmented Graph Collaborative Filtering (TAGCF), a novel framework that transforms semantic knowledge into topological connectivity. Unlike existing approaches that depend on textual features or direct interaction synthesis, TAGCF employs LLMs to infer interaction intents and underlying causal relationships from user-item pairs, representing these insights as intermediate attribute nodes within an enriched User-Attribute-Item (U-A-I) graph. Furthermore, to effectively model the heterogeneous relations in this augmented structure, we propose Adaptive Relation-weighted Graph Convolution (ARGC), which employs relation-specific prediction networks to dynamically estimate the importance of each relation type. Extensive experiments across multiple benchmark datasets and CF backbones demonstrate consistent improvements, with comprehensive evaluations including cold-start scenarios validating the effectiveness and robustness of our framework. All code will be made publicly available. For anonymous review, our code is available at the following anonymous link: this https URL.

5. 【2602.21052】Position-Aware Sequential Attention for Accurate Next Item Recommendations

链接：https://arxiv.org/abs/2602.21052

作者：Timur Nabiev,Evgeny Frolov

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：inject positional information, additive positional embeddings, models usually rely, Sequential self-attention models, positional

备注：

点击查看摘要

Abstract:Sequential self-attention models usually rely on additive positional embeddings, which inject positional information into item representations at the input. In the absence of positional signals, the attention block is permutation-equivariant over sequence positions and thus has no intrinsic notion of temporal order beyond causal masking. We argue that additive positional embeddings make the attention mechanism only superficially sensitive to sequence order: positional information is entangled with item embedding semantics, propagates weakly in deep architectures, and limits the ability to capture rich sequential patterns. To address these limitations, we introduce a kernelized self-attention mechanism, where a learnable positional kernel operates purely in the position space, disentangled from semantic similarity, and directly modulates attention weights. When applied per attention block, this kernel enables adaptive multi-scale sequential modeling. Experiments on standard next-item prediction benchmarks show that our positional kernel attention consistently improves over strong competing baselines.
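To make the idea of a positional kernel that "directly modulates attention weights" more tangible, here is a minimal single-head sketch. It assumes a multiplicative form: causal softmax weights are multiplied by a non-negative kernel indexed by the relative offset i - j and then renormalized. That form, the function names, and the fixed (not learned) kernels used in the demo are all assumptions; the abstract does not give the exact parameterization.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax restricted to positions j <= i."""
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

def positional_kernel_attention(q, k, v, kernel):
    """Causal attention whose weights are rescaled by a kernel over offsets.

    kernel[o] is the (non-negative) weight for relative offset o = i - j;
    kernel[0] must be positive so the first row can be renormalized.
    The kernel depends on positions only, never on item embeddings.
    """
    n, d = q.shape
    weights = causal_softmax(q @ k.T / np.sqrt(d))
    offsets = np.arange(n)[:, None] - np.arange(n)[None, :]
    pos = np.where(offsets >= 0, kernel[np.clip(offsets, 0, n - 1)], 0.0)
    weights = weights * pos
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
recency_kernel = np.exp(-np.arange(n, dtype=float))   # favors recent items
out, w = positional_kernel_attention(q, k, v, recency_kernel)
print(out.shape, w.sum(axis=1))
```

With an all-ones kernel the function reduces to ordinary causal attention, which gives a simple sanity check; using a separate kernel per attention block would correspond to the multi-scale behavior the abstract mentions.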

6. 【2602.21009】HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders

链接：https://arxiv.org/abs/2602.21009

作者：Kun Yuan,Junyu Bi,Daixuan Cheng,Changfa Wu,Shuwen Xiao,Binbin Cao,Jian Wu,Yuning Jiang

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词：Modern recommender systems, recommender systems leverage, systems leverage ultra-long, leverage ultra-long user, Modern recommender

备注：

点击查看摘要

7. 【2602.20995】Generative Pseudo-Labeling for Pre-Ranking with LLMs

链接：https://arxiv.org/abs/2602.20995

作者：Junyu Bi,Xinting Niu,Daixuan Cheng,Kun Yuan,Tao Wang,Binbin Cao,Jian Wu,Yuning Jiang

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词：efficiently scoring thousands, tasked with efficiently, downstream ranking, critical stage, stage in industrial

备注：

点击查看摘要

8. 【2602.20986】Naver Labs Europe @ WSDM CUP | Multilingual Retrieval

链接：https://arxiv.org/abs/2602.20986

作者：Thibault Formal,Maxime Louis,Hervé Déjean,Stéphane Clinchant

类目：Information Retrieval (cs.IR)

关键词：English queries, WSDM Cup, presents our participation, shared task, multilingual document retrieval

备注： Report paper of our submission to the WSDM Cup 2026

点击查看摘要

Abstract:This report presents our participation in the WSDM Cup 2026 shared task on multilingual document retrieval from English queries. The task provides a challenging benchmark for cross-lingual generalization. It also provides a natural testbed for evaluating SPLARE, our recently proposed learned sparse retrieval model, which produces generalizable sparse latent representations and is particularly well suited to multilingual retrieval settings. We evaluate five progressively enhanced runs, starting from a SPLARE-7B model and incorporating lightweight improvements, including reranking with Qwen3-Reranker-4B and simple score fusion strategies. Our results demonstrate the strength of SPLARE compared to state-of-the-art dense baselines such as Qwen3-8B-Embed. More broadly, our submission highlights the continued relevance and competitiveness of learned sparse retrieval models beyond English-centric scenarios.


9. 【2602.20877】E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications

链接：https://arxiv.org/abs/2602.20877

作者：Jiwoo Kang,Yeon-Chang Lee

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词：Multimodal recommender systems, enhance collaborative filtering, leveraging item-side modalities, task-specific objectives limits, Multimodal Knowledge Graph

备注：

点击查看摘要

Abstract:Multimodal recommender systems (MMRSs) enhance collaborative filtering by leveraging item-side modalities, but their reliance on a fixed set of modalities and task-specific objectives limits both modality extensibility and task generalization. We propose E-MMKGR, a framework that constructs an e-commerce-specific Multimodal Knowledge Graph (E-MMKG) and learns unified item representations through GNN-based propagation and KG-oriented optimization. These representations provide a shared semantic foundation applicable to diverse tasks. Experiments on real-world Amazon datasets show improvements of up to 10.18% in Recall@10 for recommendation and up to 21.72% over vector-based retrieval for product search, demonstrating the effectiveness and extensibility of our approach.

10. 【2602.20800】Mitigating Preference Leakage via Strict Estimator Separation for Normative Generative Ranking

链接：https://arxiv.org/abs/2602.20800

作者：Dalia Nahhas,Xiaohao Cai,Imran Razzak,Shoaib Jameel

类目：Information Retrieval (cs.IR)

关键词：Generative Information Retrieval, Information Retrieval, Generative Information, selection of candidates, bottleneck has shifted

备注：

点击查看摘要

Abstract:In Generative Information Retrieval (GenIR), the bottleneck has shifted from generation to the selection of candidates, particularly for normative criteria such as cultural relevance. Current LLM-as-a-Judge evaluations often suffer from circularity and preference leakage, where overlapping supervision and evaluation models inflate performance. We address this by formalising cultural relevance as a within-query ranking task and introducing a leakage-free two-judge framework that strictly separates supervision (Judge B) from evaluation (Judge A). On a new benchmark of 33,052 culturally grounded stories (NGR-33k), we find that while classical baselines yield only modest gains, a dense bi-encoder distilled from a Judge-B-supervised Cross-Encoder is highly effective. Although the Cross-Encoder provides a strong supervision signal for distillation, the distilled BGE-M3 model substantially outperforms it under leakage-free Judge A evaluation. We validate our framework on the human-curated Moral Stories dataset, showing strong alignment with human norms. Our results demonstrate that rigorous evaluator separation is a prerequisite for credible GenIR evaluation, proving that subtle cultural preferences can be distilled into efficient rankers without leakage.

11. 【2602.20735】RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition

链接：https://arxiv.org/abs/2602.20735

作者：Kun Ran,Marwah Alaofi,Danula Hettiachchi,Chenglong Ma,Khoi Nguyen Dinh Anh,Khoi Vo Nguyen,Sachin Pathiyan Cherumanal,Lida Rashidi,Falk Scholer,Damiano Spina,Shuoqi Sun,Oleg Zendel

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：

备注： MMU-RAG NeurIPS 2025 winning system

点击查看摘要


计算机视觉

1. 【2602.21204】Test-Time Training with KV Binding Is Secretly Linear Attention

链接：https://arxiv.org/abs/2602.21204

作者：Junchen Liu,Sven Elflein,Or Litany,Zan Gojcic,Ruilong Li

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：sequence modeling layer, test time, binding as sequence, sequence modeling, modeling layer

备注： Webpage: [this https URL](https://research.nvidia.com/labs/sil/projects/tttla/)

点击查看摘要

Abstract:Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity.
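A well-known special case helps show why a TTT layer can look like linear attention. Assume a linear inner model W, the KV-binding loss 0.5·‖W k − v‖², zero initial weights, and gradients that are all evaluated at those initial weights (batch-style updates). Then each update adds lr·v kᵀ to W, and reading out with a query reproduces unnormalized causal linear attention. The sketch below checks this numerically; it covers only this simplified setting, not the broader class of architectures analyzed in the paper, and the function names are hypothetical.

```python
import numpy as np

def ttt_linear_batch_gd(K, V, Q, lr=1.0):
    """TTT layer with a linear inner model and gradients taken at W0 = 0.

    For each token t: accumulate one gradient step on 0.5 * ||W k_t - v_t||^2,
    with the gradient evaluated at the initial weights, then output W_t @ q_t.
    """
    d_k, d_v = K.shape[1], V.shape[1]
    W0 = np.zeros((d_v, d_k))
    W = W0.copy()
    outputs = []
    for k, v, q in zip(K, V, Q):
        grad = np.outer(W0 @ k - v, k)    # equals -v k^T because W0 is zero
        W = W - lr * grad
        outputs.append(W @ q)
    return np.stack(outputs)

def causal_linear_attention(K, V, Q, lr=1.0):
    """Unnormalized causal linear attention: out_t = lr * sum_{s<=t} (q_t . k_s) v_s."""
    return lr * np.tril(Q @ K.T) @ V

rng = np.random.default_rng(0)
K, V, Q = (rng.normal(size=(8, 4)) for _ in range(3))
diff = np.abs(ttt_linear_batch_gd(K, V, Q) - causal_linear_attention(K, V, Q)).max()
print(diff)
```

If the gradient were instead evaluated at the current weights W, the update would become a delta-rule recurrence rather than plain linear attention, so the equivalence shown here depends on the stated assumptions.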

2. 【2602.21203】Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics

链接：https://arxiv.org/abs/2602.21203

作者：Abdulaziz Almuzairee,Henrik I. Christensen

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Visual reinforcement learning, on-policy methods parallelize, robotics but expensive, sample-efficient yet slow, waste samples

备注： For website and code, see [this https URL](https://aalmuzairee.github.io/squint)

点击查看摘要

Abstract:Visual reinforcement learning is appealing for robotics but expensive -- off-policy methods are sample-efficient yet slow; on-policy methods parallelize well but waste samples. Recent work has shown that off-policy methods can train faster than on-policy methods in wall-clock time for state-based control. Extending this to vision remains challenging, where high-dimensional input images complicate training dynamics and introduce substantial storage and encoding overhead. To address these challenges, we introduce Squint, a visual Soft Actor Critic method that achieves faster wall-clock training than prior visual off-policy and on-policy methods. Squint achieves this via parallel simulation, a distributional critic, resolution squinting, layer normalization, a tuned update-to-data ratio, and an optimized implementation. We evaluate on the SO-101 Task Set, a new suite of eight manipulation tasks in ManiSkill3 with heavy domain randomization, and demonstrate sim-to-real transfer to a real SO-101 robot. We train policies for 15 minutes on a single RTX 3090 GPU, with most tasks converging in under 6 minutes.

3. 【2602.21202】Multi-Vector Index Compression in Any Modality

链接：https://arxiv.org/abs/2602.21202

作者：Hanxiang Qin,Alexander Martin,Rohan Jha,Chunsheng Zuo,Reno Kriz,Benjamin Van Durme

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：study efficient multi-vector, efficient multi-vector retrieval, late interaction, study efficient, efficient multi-vector

备注： 12 pages, 4 figures

点击查看摘要

4. 【2602.21198】Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Link: https://arxiv.org/abs/2602.21198

Authors: Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Embodied LLMs endow, high-level task reasoning, LLMs endow robots, Embodied LLMs, Reflective Test-Time Planning

5. 【2602.21195】Region of Interest Segmentation and Morphological Analysis for Membranes in Cryo-Electron Tomography

Link: https://arxiv.org/abs/2602.21195

Authors: Xingyi Cheng, Julien Maufront, Aurélie Di Cicco, Daniël M. Pelt, Manuela Dezi, Daniel Lévy

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Cryo-electron tomography, enables high resolution, high resolution, three-dimensional reconstruction, reconstruction of biological

Abstract:Cryo-electron tomography (cryo-ET) enables high resolution, three-dimensional reconstruction of biological structures, including membranes and membrane proteins. Identification of regions of interest (ROIs) is central to scientific imaging, as it enables isolation and quantitative analysis of specific structural features within complex datasets. In practice, however, ROIs are typically derived indirectly through full structure segmentation followed by post hoc analysis. This limitation is especially apparent for continuous and geometrically complex structures such as membranes, which are segmented as single entities. Here, we developed TomoROIS-SurfORA, a two step framework for direct, shape-agnostic ROI segmentation and morphological surface analysis. TomoROIS performs deep learning-based ROI segmentation and can be trained from scratch using small annotated datasets, enabling practical application across diverse imaging data. SurfORA processes segmented structures as point clouds and surface meshes to extract quantitative morphological features, including inter-membrane distances, curvature, and surface roughness. It supports both closed and open surfaces, with specific considerations for open surfaces, which are common in cryo-ET due to the missing wedge effect. We demonstrate both tools using in vitro reconstituted membrane systems containing deformable vesicles with complex geometries, enabling automatic quantitative analysis of membrane contact sites and remodeling events such as invagination. While demonstrated here on cryo-ET membrane data, the combined approach is applicable to ROI detection and surface analysis in broader scientific imaging contexts.

6. 【2602.21188】Human Video Generation from a Single Image with 3D Pose and View Control

Link: https://arxiv.org/abs/2602.21188

Authors: Tiantian Wang, Chun-Han Yao, Tao Hu, Mallikarjun Byrasandra Ramalinga Reddy, Ming-Hsuan Yang, Varun Jampani

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made significant progress, visual generation capabilities, powerful visual generation, human video generation, Recent diffusion methods

Abstract:Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment to maintain smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.

7. 【2602.21186】Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning

Link: https://arxiv.org/abs/2602.21186

Authors: Haoyi Jiang, Liu Liu, Xinjie Wang, Yonghao He, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: exhibit exceptional, remains superficial, ability to comprehend, comprehend and reason, Vision-Language Models

Abstract:While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space--a cornerstone of spatial intelligence--remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the proposed Predictive Spatial Field Modeling (PSFM) paradigm, where Spa3R learns to synthesize feature fields for arbitrary unseen views conditioned on a compact latent representation, thereby internalizing a holistic and coherent understanding of the underlying 3D scene. We further integrate the pre-trained Spa3R Encoder into existing VLMs via a lightweight adapter to form Spa3-VLM, effectively grounding language reasoning in a global spatial context. Experiments on the challenging VSI-Bench demonstrate that Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods. These results highlight PSFM as a scalable path toward advancing spatial intelligence. Code is available at this https URL.

8. 【2602.21179】Mask-HybridGNet: Graph-based segmentation with emergent anatomical correspondence from pixel-level supervision

Link: https://arxiv.org/abs/2602.21179

Authors: Nicolás Gaggion, Maria J. Ledesma-Carbayo, Stergios Christodoulidis, Maria Vakalopoulou, Enzo Ferrante

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graph-based medical image, providing fixed-topology landmarks, medical image segmentation, image segmentation represents, inherent population-level correspondences

Abstract:Graph-based medical image segmentation represents anatomical structures using boundary graphs, providing fixed-topology landmarks and inherent population-level correspondences. However, their clinical adoption has been hindered by a major requirement: training datasets with manually annotated landmarks that maintain point-to-point correspondences across patients rarely exist in practice. We introduce Mask-HybridGNet, a framework that trains graph-based models directly using standard pixel-wise masks, eliminating the need for manual landmark annotations. Our approach aligns variable-length ground truth boundaries with fixed-length landmark predictions by combining Chamfer distance supervision and edge-based regularization to ensure local smoothness and regular landmark distribution, further refined via differentiable rasterization. A significant emergent property of this framework is that predicted landmark positions become consistently associated with specific anatomical locations across patients without explicit correspondence supervision. This implicit atlas learning enables temporal tracking, cross-slice reconstruction, and morphological population analyses. Beyond direct segmentation, Mask-HybridGNet can extract correspondences from existing segmentation masks, allowing it to generate stable anatomical atlases from any high-quality pixel-based model. Experiments across chest radiography, cardiac ultrasound, cardiac MRI, and fetal imaging demonstrate that our model achieves competitive results against state-of-the-art pixel-based methods, while ensuring anatomical plausibility by enforcing boundary connectivity through a fixed graph adjacency matrix. This framework leverages the vast availability of standard segmentation masks to build structured models that maintain topological integrity and provide implicit correspondences.
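To make the Chamfer distance supervision mentioned in this abstract more concrete, here is a minimal pure-Python sketch of a symmetric Chamfer distance between a fixed-length set of predicted landmarks and a variable-length mask contour. It is only an illustration and not the authors' implementation: the function name `chamfer_distance`, the toy coordinates, and the use of unsquared Euclidean distance are all assumptions (many implementations use squared distances), and the edge-based regularization and differentiable rasterization described in the abstract are left out.

```python
import math


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two 2D point sets (lists of (x, y))."""

    def one_way(src, dst):
        # Mean distance from each point in src to its nearest neighbour in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(pred, target) + one_way(target, pred)


# Hypothetical toy data: 4 predicted landmarks vs. an 8-point mask contour.
landmarks = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
contour = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.0, 0.5),
           (1.0, 1.0), (0.5, 1.0), (0.0, 1.0), (0.0, 0.5)]

print(chamfer_distance(landmarks, contour))
```

Because each point is matched to its nearest neighbour in the other set, the two sets do not need the same number of points. That is why a loss of this kind can compare a fixed number of landmarks against a boundary of arbitrary length.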

9. 【2602.21178】XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence

Link: https://arxiv.org/abs/2602.21178

Authors: Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: significantly advanced automated, clinical adoption remains, adoption remains limited, advanced automated brain, Deep learning

Comments: Accepted in ICCABS 2026: The 14th International Conference on Computational Advances in Bio and Medical Sciences

Abstract:Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints. Conventional models often act as opaque "black boxes" and fail to quantify the complex, irregular tumor boundaries that characterize malignant growth. To address these challenges, we present XMorph, an explainable and computationally efficient framework for fine-grained classification of three prominent brain tumor types: glioma, meningioma, and pituitary tumors. We propose an Information-Weighted Boundary Normalization (IWBN) mechanism that emphasizes diagnostically relevant boundary regions alongside nonlinear chaotic and clinically validated features, enabling a richer morphological representation of tumor growth. A dual-channel explainable AI module combines GradCAM++ visual cues with LLM-generated textual rationales, translating model reasoning into clinically interpretable insights. The proposed framework achieves a classification accuracy of 96.0%, demonstrating that explainability and high performance can co-exist in AI-based medical imaging systems. The source code and materials for XMorph are all publicly available at: this https URL.

10. 【2602.21175】Seeing Through Words: Controlling Visual Retrieval Quality with Language Models

Link: https://arxiv.org/abs/2602.21175

Authors: Jianglin Lu, Simon Jenni, Kushal Kafle, Jing Shi, Handong Zhao, Yun Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fundamental task, real-world scenarios, queries, vision-language learning, quality

Abstract:Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semantically meaningful but also quality-aware. The resulting system provides three key advantages: 1) flexibility, it is compatible with any pretrained vision-language model (VLMs) without modification; 2) transparency, enriched queries are explicitly interpretable by users; and 3) controllability, enabling retrieval results to be steered toward user-preferred quality levels. Extensive experiments demonstrate that our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries. Our code is available at this https URL.

11. 【2602.21172】NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

Link: https://arxiv.org/abs/2602.21172

Authors: Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: replacing modular pipelines, models are advancing, pipelines with unified, advancing autonomous driving, driving by replacing

Comments: Accepted to CVPR 2026

Abstract:Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NoRD (No Reasoning for Driving). Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on <60% of the data and no reasoning annotations, resulting in 3× fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems.
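The difficulty bias described in this abstract can be illustrated with a small numeric sketch. It assumes the commonly described formulations: GRPO divides group-centred rewards by the group's standard deviation, while Dr. GRPO drops that division (it also removes a response-length normalization, which is not shown here). The function names and reward values are hypothetical, and this is not the NoRD training code.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages with per-group std normalization (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Group-centred advantages without std normalization (Dr. GRPO-style)."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


# Hypothetical rollout rewards for two driving scenarios.
high_variance = [0.0, 1.0, 0.0, 1.0]  # rollouts disagree strongly
low_variance = [0.5, 0.6, 0.5, 0.6]   # rollouts barely differ

for name, group in [("high", high_variance), ("low", low_variance)]:
    print(name, grpo_advantages(group), dr_grpo_advantages(group))
```

With std normalization, both groups end up with advantages of roughly the same magnitude, so the large reward gaps in the high-variance scenario are scaled down relative to the tiny gaps in the low-variance one. Without it, the advantage magnitude follows the raw reward gap.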

12. 【2602.21153】SPRITETOMESH: Automatic Mesh Generation for 2D Skeletal Animation Using Learned Segmentation and Contour-Aware Vertex Placement

Link: https://arxiv.org/abs/2602.21153

Authors: Bastien Gimbert

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: skeletal animation frameworks, triangle meshes compatible, present SPRITETOMESH, fully automatic pipeline, fully automatic

Comments: 11 pages, 17 figures. Code available at [this https URL](https://github.com/BastienGimbert/SpriteToMesh)

Abstract:We present SPRITETOMESH, a fully automatic pipeline for converting 2D game sprite images into triangle meshes compatible with skeletal animation frameworks such as Spine2D. Creating animation-ready meshes is traditionally a tedious manual process requiring artists to carefully place vertices along visual boundaries, a task that typically takes 15-60 minutes per sprite. Our method addresses this through a hybrid learned-algorithmic approach. A segmentation network (EfficientNet-B0 encoder with U-Net decoder) trained on over 100,000 sprite-mask pairs from 172 games achieves an IoU of 0.87, providing accurate binary masks from arbitrary input images. From these masks, we extract exterior contour vertices using Douglas-Peucker simplification with adaptive arc subdivision, and interior vertices along visual boundaries detected via bilateral-filtered multi-channel Canny edge detection with contour-following placement. Delaunay triangulation with mask-based centroid filtering produces the final mesh. Through controlled experiments, we demonstrate that direct vertex position prediction via neural network heatmap regression is fundamentally not viable for this task: the heatmap decoder consistently fails to converge (loss plateau at 0.061) while the segmentation decoder trains normally under identical conditions. We attribute this to the inherently artistic nature of vertex placement - the same sprite can be meshed validly in many different ways. This negative result validates our hybrid design: learned segmentation where ground truth is unambiguous, algorithmic placement where domain heuristics are appropriate. The complete pipeline processes a sprite in under 3 seconds, representing a speedup of 300x-1200x over manual creation. We release our trained model to the game development community.
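For readers unfamiliar with the Douglas-Peucker simplification named in this abstract, the following is a minimal textbook sketch in pure Python. It is an illustration only: it handles an open polyline with a fixed tolerance `epsilon`, does not include the adaptive arc subdivision the paper describes, and the sample points are made up.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto a-b, clamped to stay on the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Simplify an open polyline, dropping points within epsilon of the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior point farthest from the chord first-last.
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        return [first, last]
    # Keep the farthest point and recurse on both halves.
    left = douglas_peucker(points[:index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right


# Hypothetical contour fragment: the small bump at (1, 0.05) is removed.
polyline = [(0, 0), (1, 0.05), (2, 0), (3, 2), (4, 0)]
print(douglas_peucker(polyline, epsilon=0.1))
```

The tolerance controls the trade-off between vertex count and contour fidelity: a larger `epsilon` yields fewer vertices but a coarser outline.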

13. 【2602.21142】LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis

Link: https://arxiv.org/abs/2602.21142

Authors: Zhifan Jiang, Dong Yang, Vishwesh Nath, Abhijeet Parida, Nishad P. Kulkarni, Ziyue Xu, Daguang Xu, Syed Muhammad Anwar, Holger R. Roth, Marius George Linguraru

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Large vision-language models, Large vision-language, clinical domain, evolved from general-purpose, specialized use cases

Comments: Accepted to IEEE International Symposium on Biomedical Imaging (ISBI) 2026

Abstract:Large vision-language models (VLMs) have evolved from general-purpose applications to specialized use cases such as in the clinical domain, demonstrating potential for decision support in radiology. One promising application is assisting radiologists in decision-making by analyzing radiology imaging data such as chest X-rays (CXR) via a visual and natural language question-answering (VQA) interface. When longitudinal imaging is available, radiologists analyze temporal changes, which are essential for accurate diagnosis and prognosis. Manual longitudinal analysis is a time-consuming process, motivating the development of a training framework that can provide prognostic capabilities. We introduce LUMEN, a novel training framework optimized for longitudinal CXR interpretation, leveraging multi-image and multi-task instruction fine-tuning to enhance prognostic and diagnostic performance. We conduct experiments on the publicly available MIMIC-CXR and its associated Medical-Diff-VQA datasets. We further formulate and construct a novel instruction-following dataset incorporating longitudinal studies, enabling the development of a prognostic VQA task. Our method demonstrates significant improvements over baseline models in diagnostic VQA tasks, and more importantly, shows promising potential for prognostic capabilities. These results underscore the value of well-designed, instruction-tuned VLMs in enabling more accurate and clinically meaningful radiological interpretation of longitudinal radiological imaging data.

14. 【2602.21141】SynthRender and IRIS: Open-Source Framework and Dataset for Bidirectional Sim-Real Transfer in Industrial Object Perception

Link: https://arxiv.org/abs/2602.21141

Authors: Jose Moises Araya-Martinez, Thushar Tom, Adrián Sanchis Reig, Pablo Rey Valiente, Jens Lambrecht, Jörg Krüger

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Object perception, robotic material handling, quality inspection, fundamental for tasks, material handling

Abstract:Object perception is fundamental for tasks such as robotic material handling and quality inspection. However, modern supervised deep-learning perception models require large datasets for robust automation under semi-uncontrolled conditions. The cost of acquiring and annotating such data for proprietary parts is a major barrier for widespread deployment. In this context, we release SynthRender, an open source framework for synthetic image generation with Guided Domain Randomization capabilities. Furthermore, we benchmark recent Reality-to-Simulation techniques for 3D asset creation from 2D images of real parts. Combined with Domain Randomization, these synthetic assets provide low-overhead, transferable data even for parts lacking 3D files. We also introduce IRIS, the Industrial Real-Sim Imagery Set, containing 32 categories with diverse textures, intra-class variation, strong inter-class similarities and about 20,000 labels. Ablations on multiple benchmarks outline guidelines for efficient data generation with SynthRender. Our method surpasses existing approaches, achieving 99.1% mAP@50 on a public robotics dataset, 98.3% mAP@50 on an automotive benchmark, and 95.3% mAP@50 on IRIS.

15. 【2602.21137】UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

Link: https://arxiv.org/abs/2602.21137

Authors: Joseph Raj Vishal, Nagasiri Poluri, Katha Naik, Rutuja Patil, Kashyap Hegde Kota, Krishna Vinod, Prithvi Jai Ramesh, Mohammad Farhadi, Yezhou Yang, Bharatesh Chakravarthi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: urban traffic remains, dynamic urban scenes, introduces Urban Dynamics, Urban Dynamics VideoQA, multi-agent dynamics

Abstract:Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event-driven dynamic blur technique to ensure privacy preservation without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging one question per second. Its taxonomy follows a hierarchical reasoning level, spanning basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic evaluation of both visual grounding and causal reasoning. Comprehensive experiments benchmark 10 SOTA VideoLMs on UDVideoQA and 8 models on a complementary video question generation benchmark. Results reveal a persistent perception-reasoning gap, showing that models that excel in abstract inference often fail at fundamental visual grounding. While models like Gemini Pro achieve the highest zero-shot accuracy, fine-tuning the smaller Qwen2.5-VL 7B model on UDVideoQA bridges this gap, achieving performance comparable to proprietary systems. In VideoQGen, Gemini 2.5 Pro and Qwen3 Max generate the most relevant and complex questions, though all models exhibit limited linguistic diversity, underscoring the need for human-centric evaluation. The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy-aware, and real-world multimodal reasoning. UDVideoQA is available at this https URL.

16. 【2602.21105】BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting

Link: https://arxiv.org/abs/2602.21105

Authors: Jiaxing Yu, Dongyang Ren, Hangyu Xu, Zhouyuxiao Yang, Yuanqi Li, Jie Guo, Zhengkang Zhou, Yanwen Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: B-rep Gaussian Splatting, trimmed corners, Recovering B-rep representation, explicit boundaries, propose B-rep Gaussian

Comments: Accepted to CVPR 2026

Abstract:The boundary representation (B-rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering a B-rep representation from unstructured data is a challenging and valuable task in computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images. We employ a Gaussian Splatting renderer with learnable features, followed by a specific fitting strategy. To disentangle geometry reconstruction and feature learning, we introduce a two-stage learning framework that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations. Extensive experiments demonstrate the superior performance of our approach over state-of-the-art methods. We will release our code and datasets upon acceptance.

17. 【2602.21101】Event-Aided Sharp Radiance Field Reconstruction for Fast-Flying Drones

Link: https://arxiv.org/abs/2602.21101

Authors: Rong Zou, Marco Cannici, Davide Scaramuzza

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: promise rapid inspection, limited battery constraints, aerial robots promise, robots promise rapid, Fast-flying aerial robots

Abstract:Fast-flying aerial robots promise rapid inspection under limited battery constraints, with direct applications in infrastructure inspection, terrain exploration, and search and rescue. However, high speeds lead to severe motion blur in images and induce significant drift and noise in pose estimates, making dense 3D reconstruction with Neural Radiance Fields (NeRFs) particularly challenging due to their high sensitivity to such degradations. In this work, we present a unified framework that leverages asynchronous event streams alongside motion-blurred frames to reconstruct high-fidelity radiance fields from agile drone flights. By embedding event-image fusion into NeRF optimization and jointly refining event-based visual-inertial odometry priors using both event and frame modalities, our method recovers sharp radiance fields and accurate camera trajectories without ground-truth supervision. We validate our approach on both synthetic data and real-world sequences captured by a fast-flying drone. Despite highly dynamic drone flights, where RGB frames are severely degraded by motion blur and pose priors become unreliable, our method reconstructs high-fidelity radiance fields and preserves fine scene details, delivering a performance gain of over 50% on real-world data compared to state-of-the-art methods.

18. 【2602.21100】Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction

Link: https://arxiv.org/abs/2602.21100

Authors: Noé Artru, Rukhshanda Hussain, Emeline Got, Alexandre Messier, David B. Lindell, Abdallah Dib

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: face fundamental limitations, existing methods face, head geometry, range of applications, Reconstructing high-fidelity

Comments: 14 pages, 8 figures, to be published in proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Abstract:Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms. Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then leverage these predictions as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details. Our approach outperforms state-of-the-art single-image and multi-view methods, achieving high-fidelity reconstruction on par with dense-view photogrammetry while reducing camera requirements and computational cost. The code and model will be released.

19. 【2602.21098】Optimizing Occupancy Sensor Placement in Smart Environments

Link: https://arxiv.org/abs/2602.21098

Authors: Hao Lu, Richard J. Radke

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: realizing energy savings, Understanding the locations, commercial built environment, delivering lighting, commercial built

Abstract:Understanding the locations of occupants in a commercial built environment is critical for realizing energy savings by delivering lighting, heating, and cooling only where it is needed. The key to achieving this goal is being able to recognize zone occupancy in real time, without impeding occupants' activities or compromising privacy. While low-resolution, privacy-preserving time-of-flight (ToF) sensor networks have demonstrated good performance in zone counting, the performance depends on careful sensor placement. To address this issue, we propose an automatic sensor placement method that determines optimal sensor layouts for a given number of sensors, and can predict the counting accuracy of such a layout. In particular, given the geometric constraints of an office environment, we simulate a large number of occupant trajectories. We then formulate the sensor placement problem as an integer linear programming (ILP) problem and solve it with the branch and bound method. We demonstrate the effectiveness of the proposed method based on simulations of several different office environments.
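A toy version of the sensor placement problem in this abstract can be sketched as follows. The sketch rests on several assumptions: it models placement as a maximum-coverage problem (choose a fixed number of candidate locations so that as many simulated trajectory samples as possible are seen), which may differ from the paper's actual objective, and it uses exhaustive search over combinations as a plain stand-in for the branch and bound ILP solver. The function name `best_layout` and the coverage data are hypothetical.

```python
from itertools import combinations


def best_layout(coverage, num_sensors):
    """Choose num_sensors candidate locations that cover the most samples.

    coverage maps a candidate location name to the set of simulated
    trajectory-sample ids that a sensor placed there would observe.
    """
    best_score, best_combo = -1, ()
    for combo in combinations(sorted(coverage), num_sensors):
        covered = set().union(*(coverage[c] for c in combo))
        # Strict '>' keeps the first (lexicographically smallest) best layout.
        if len(covered) > best_score:
            best_score, best_combo = len(covered), combo
    return best_combo, best_score


# Hypothetical coverage sets for four ceiling locations in a small office.
coverage = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {4, 5, 6},
    "D": {1, 6},
}

print(best_layout(coverage, num_sensors=2))
```

Exhaustive search is only practical for a handful of candidates because the number of layouts grows combinatorially. Branch and bound addresses this by pruning parts of the search space that cannot beat the best layout found so far.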

20. 【2602.21078】ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning

Link: https://arxiv.org/abs/2602.21078

Authors: Duowen Chen, Yan Wang

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Federated Semi-Supervised Learning, Federated Semi-Supervised, Semi-Supervised Learning, leveraging partially-annotated local, aims to collaboratively

Comments: CVPR 2026. Code: [this https URL](https://github.com/DuowenC/FSSLlib)

Abstract:Federated Semi-Supervised Learning (FSSL) aims to collaboratively train a global model across clients by leveraging partially-annotated local data in a privacy-preserving manner. In FSSL, data heterogeneity is a challenging issue that exists both across clients and within clients. External heterogeneity refers to the data distribution discrepancy across different clients, while internal heterogeneity represents the mismatch between labeled and unlabeled data within clients. Most FSSL methods design fixed or dynamic parameter aggregation strategies to collect client knowledge on the server (external) and/or filter out low-confidence unlabeled samples to reduce mistakes on local clients (internal). However, the former struggles to precisely fit the ideal global distribution via direct weights, and the latter leaves fewer samples participating in FL training. To this end, we propose a proxy-guided framework called ProxyFL that focuses on simultaneously mitigating external and internal heterogeneity via a unified proxy. Specifically, we treat the learnable weights of the classifier as a proxy to simulate the category distribution both locally and globally. For external heterogeneity, we explicitly optimize the global proxy against outliers instead of direct weights; for internal heterogeneity, we re-include the discarded samples into training through a positive-negative proxy pool to mitigate the impact of potentially incorrect pseudo-labels. Insightful experiments and theoretical analysis demonstrate the method's strong performance and convergence in FSSL.

21. 【2602.21064】Motivation is Something You Need

Link: https://arxiv.org/abs/2602.21064

Authors: Mehdi Acheli, Walid Gaaloul

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: affective neuroscience, work introduces, paradigm that draws, draws from affective, SEEKING motivational state

Abstract:This work introduces a novel training paradigm that draws from affective neuroscience. Inspired by the interplay of emotions and cognition in the human brain and more specifically the SEEKING motivational state, we design a dual-model framework where a smaller base model is trained continuously, while a larger motivated model is activated intermittently during predefined "motivation conditions". The framework mimics the emotional state of high curiosity and anticipation of reward in which broader brain regions are recruited to enhance cognitive performance. Exploiting scalable architectures where larger models extend smaller ones, our method enables shared weight updates and selective expansion of network capacity during noteworthy training steps. Empirical evaluation on the image classification task demonstrates that, not only does the alternating training scheme efficiently and effectively enhance the base model compared to a traditional scheme, in some cases, the motivational model also surpasses its standalone counterpart despite seeing less data per epoch. This opens the possibility of simultaneously training two models tailored to different deployment constraints with competitive or superior performance while keeping training cost lower than when training the larger model.

22. 【2602.21054】VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation

Link: https://arxiv.org/abs/2602.21054

Authors: Seongheon Park, Changdae Oh, Hyeong Kyu Choi, Xuefeng Du, Sharon Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Vision-Language Models, Large Vision-Language, frequently hallucinate, limiting their safe, real-world applications

Comments:

23. 【2602.21053】OCR-Agent: Agentic OCR with Capability and Memory Reflection

Link: https://arxiv.org/abs/2602.21053

Authors: Shimin Wen, Zeyu Zhang, Xingdou Bian, Hongjie Zhu, Lulu He, Layi Shama, Daji Ergu, Ying Cai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, rectify cognitive biases, demonstrated significant potential, generally lack effective, independently rectify cognitive

Comments:

Abstract:Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in their answers. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on the English subset and +1.2 on the Chinese subset, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5), surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training. Code: this https URL.

24. 【2602.21042】OmniOCR: Generalist OCR for Ethnic Minority Languages

Link: https://arxiv.org/abs/2602.21042

Authors: Bonan Liu, Zeyu Zhang, Bingbing Meng, Han Wang, Hanshuo Zhang, Chengping Wang, Daji Ergu, Ying Cai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Optical character recognition, Latin and Chinese, Optical character, character recognition, advanced rapidly

Comments:

Abstract:Optical character recognition (OCR) has advanced rapidly with deep learning and multimodal models, yet most methods focus on well-resourced scripts such as Latin and Chinese. Ethnic minority languages remain underexplored due to complex writing systems, scarce annotations, and diverse historical and modern forms, making generalization in low-resource or zero-shot settings challenging. To address these challenges, we present OmniOCR, a universal framework for ethnic minority scripts. OmniOCR introduces Dynamic Low-Rank Adaptation (Dynamic LoRA) to allocate model capacity across layers and scripts, enabling effective adaptation while preserving knowledge. A sparsity regularization prunes redundant updates, ensuring compact and efficient adaptation without extra inference cost. Evaluations on TibetanMNIST, Shui, ancient Yi, and Dongba show that OmniOCR outperforms zero-shot foundation models and standard post-training, achieving state-of-the-art accuracy with superior parameter efficiency. Compared with state-of-the-art baseline models, it improves accuracy by 39%-66% on these four datasets. Code: this https URL.
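The abstract above mentions low-rank adaptation "without extra inference cost". As a rough illustration of why that can hold for standard LoRA (not the paper's Dynamic LoRA, whose capacity allocation and sparsity regularization are not reproduced here), the sketch below merges a low-rank update `B @ A` into a frozen weight matrix `W`. The function names `matmul` and `merge_lora` and the `scale` parameter are hypothetical and chosen only for this example.

```python
# Minimal sketch of merging a standard LoRA update into a frozen weight.
# Assumption: plain LoRA, W' = W + scale * (B @ A); this is NOT the paper's
# Dynamic LoRA. All names here are hypothetical, for illustration only.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(w, b, a, scale=1.0):
    """Return W + scale * (B @ A).

    w: d_out x d_in frozen weight
    b: d_out x r low-rank factor
    a: r x d_in low-rank factor
    """
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
    b = [[1.0], [2.0]]             # rank-1 factor, 2x1
    a = [[3.0, 4.0]]               # rank-1 factor, 1x2
    merged = merge_lora(w, b, a)
    x = [[1.0], [1.0]]             # column-vector input
    # The merged weight gives the same output as the two-branch forward pass.
    print(matmul(merged, x))
```

Once merged, inference uses a single matrix of the original shape, so the adapter adds no extra multiplications at test time.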

25. 【2602.21035】Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning

Link: https://arxiv.org/abs/2602.21035

Authors: Junhao Xiao, Zhiyu Wu, Hao Lin, Yi Chen, Yahui Liu, Xiaoran Zhao, Zixu Wang, Zejiang He

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Vision-Language Models, dog images, negatives similarly, struggle to understand, affirmatives and negatives

Comments:

Abstract:Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.

26. 【2602.21033】MIP Candy: A Modular PyTorch Framework for Medical Image Processing

Link: https://arxiv.org/abs/2602.21033

Authors: Tianhao Fu, Yucheng Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

Keywords: heterogeneous file formats, demands specialized software, handles high-dimensional volumetric, processing demands specialized, Medical image processing

Comments:

Abstract:Medical image processing demands specialized software that handles high-dimensional volumetric data, heterogeneous file formats, and domain-specific training procedures. Existing frameworks either provide low-level components that require substantial integration effort or impose rigid, monolithic pipelines that resist modification. We present MIP Candy (MIPCandy), a freely available, PyTorch-based framework designed specifically for medical image processing. MIPCandy provides a complete, modular pipeline spanning data loading, training, inference, and evaluation, allowing researchers to obtain a fully functional process workflow by implementing a single method, build_network, while retaining fine-grained control over every component. Central to the design is LayerT, a deferred configuration mechanism that enables runtime substitution of convolution, normalization, and activation modules without subclassing. The framework further offers built-in k-fold cross-validation, dataset inspection with automatic region-of-interest detection, deep supervision, exponential moving average, multi-frontend experiment tracking (Weights & Biases, Notion, MLflow), training state recovery, and validation score prediction via quotient regression. An extensible bundle ecosystem provides pre-built model implementations that follow a consistent trainer-predictor pattern and integrate with the core framework without modification. MIPCandy is open-source under the Apache-2.0 license and requires Python 3.12 or later. Source code and documentation are available at this https URL.

27. 【2602.21015】From Perception to Action: An Interactive Benchmark for Vision Reasoning

Link: https://arxiv.org/abs/2602.21015

Authors: Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dylan Raharja, Soujanya Poria, Roy Ka-Wei Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: embodied agents, essential for real-world, real-world applications, interactive design, Understanding

Comments: Work in progress. Website: [this https URL](https://social-ai-studio.github.io/CHAIN/)

Abstract:Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans and to robustly translate perceived structure into effective actions. The project is available at this https URL.

28. 【2602.21010】Le-DETR: Revisiting Real-Time Detection Transformer with Efficient Encoder Design

Link: https://arxiv.org/abs/2602.21010

Authors: Jiannan Huang, Aditya Kane, Fengzhe Zhou, Yunchao Wei, Humphrey Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires high accuracy, Real-time object detection, real-time DETR models, real-time DETR

Comments: CVPR Findings

Abstract:Real-time object detection is crucial for real-world applications as it requires high accuracy with low latency. While Detection Transformers (DETR) have demonstrated significant performance improvements, current real-time DETR models are challenging to reproduce from scratch due to excessive pre-training overheads on the backbone, constraining research advancements by hindering the exploration of novel backbone architectures. In this paper, we show that, with generally good design, it is possible to achieve high performance at low pre-training cost. After a thorough study of the backbone architecture, we propose EfficientNAT at various scales, which incorporates modern efficient convolution and local attention mechanisms. Moreover, we re-design the hybrid encoder with local attention, significantly enhancing both performance and inference speed. Based on these advancements, we present Le-DETR (Low-cost and Efficient DEtection TRansformer), which achieves a new SOTA in real-time detection using only the ImageNet1K and COCO2017 training datasets, using about 80% fewer images in the pre-training stage than previous methods. We demonstrate that, with good design, real-time DETR models can achieve strong performance without the need for complex and computationally expensive pre-training. Extensive experiments show that Le-DETR-M/L/X achieves 52.9/54.3/55.1 mAP on COCO Val2017 with 4.45/5.01/6.68 ms latency on an RTX4090. Relative to YOLOv12-L/X, it differs by +0.6/-0.1 mAP at similar speed and with a +20% speedup, respectively. Compared with DEIM-D-FINE, Le-DETR-M achieves +0.2 mAP with slightly faster inference, and surpasses DEIM-D-FINE-L by +0.4 mAP with only 0.4 ms additional latency. Code and weights will be open-sourced.

29. 【2602.20999】VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

Link: https://arxiv.org/abs/2602.20999

Authors: Bowen Zheng, Yongli Xiang, Ziming Hong, Zerong Lin, Chaojian Yu, Tongliang Liu, Xinge You

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: implicit control signals, shown emerging visual, emerging visual instruction-following, condition video generation, visual instruction-following capability

Comments: Project page: [this https URL](https://Zbwwwwwwww.github.io/VII)

Abstract:Image-to-Video (I2V) generation models, which condition video generation on reference images, have shown emerging visual instruction-following capability, allowing certain visual cues in reference images to act as implicit control signals for video generation. However, this capability also introduces a previously overlooked risk: adversaries may exploit visual instructions to inject malicious intent through the image modality. In this work, we uncover this risk by proposing Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that intentionally disguises the malicious intent of unsafe text prompts as benign visual instructions in the safe reference image. Specifically, VII coordinates a Malicious Intent Reprogramming module to distill malicious intent from unsafe text prompts while minimizing their static harmfulness, and a Visual Instruction Grounding module to ground the distilled intent onto a safe input image by rendering visual instructions that preserve semantic consistency with the original unsafe text prompt, thereby inducing harmful content during I2V generation. Empirically, our extensive experiments on four state-of-the-art commercial I2V models (Kling-v2.5-turbo, Gemini Veo-3.1, Seedance-1.5-pro, and PixVerse-V5) demonstrate that VII achieves Attack Success Rates of up to 83.5% while reducing Refusal Rates to near zero, significantly outperforming existing baselines.

30. 【2602.20989】Cycle-Consistent Tuning for Layered Image Decomposition

Link: https://arxiv.org/abs/2602.20989

Authors: Zheng Gu, Min Lu, Zhida Sun, Dani Lischinski, Daniel Cohen-O, Hui Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Disentangling visual layers, Disentangling visual, globally coupled interactions, including shading, vision and graphics

Comments: Accepted to CVPR 2026. Project page: [this https URL](https://vcc.tech/research/2026/ImgDecom)

Abstract:Disentangling visual layers in real-world images is a persistent challenge in vision and graphics, as such layers often involve non-linear and globally coupled interactions, including shading, reflection, and perspective distortion. In this work, we present an in-context image decomposition framework that leverages large diffusion foundation models for layered separation. We focus on the challenging case of logo-object decomposition, where the goal is to disentangle a logo from the surface on which it appears while faithfully preserving both layers. Our method fine-tunes a pretrained diffusion model via lightweight LoRA adaptation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images. This bidirectional supervision substantially enhances robustness in cases where the layers exhibit complex interactions. Furthermore, we introduce a progressive self-improving process, which iteratively augments the training set with high-quality model-generated examples to refine performance. Extensive experiments demonstrate that our approach achieves accurate and coherent decompositions and also generalizes effectively across other decomposition types, suggesting its potential as a unified framework for layered image decomposition.

31. 【2602.20985】EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer

Link: https://arxiv.org/abs/2602.20985

Authors: Munish Monga, Vishal Chudasama, Pankaj Wasnik, C.V. Jawahar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Real-world object detection, accessing prior data, Real-world object, World Object Detection, Evolving World Object

Comments: Accepted at CVPR 2026

Abstract:Real-world object detection must operate in evolving environments where new classes emerge, domains shift, and unseen objects must be identified as "unknown", all without accessing prior data. We introduce Evolving World Object Detection (EWOD), a paradigm coupling incremental learning, domain adaptation, and unknown detection under exemplar-free constraints. To tackle EWOD, we propose the EW-DETR framework, which augments DETR-based detectors with three synergistic modules: Incremental LoRA Adapters for exemplar-free incremental learning under evolving domains; a Query-Norm Objectness Adapter that decouples objectness-aware features from DETR decoder queries; and Entropy-Aware Unknown Mixing for calibrated unknown detection. This framework generalises across DETR-based detectors, enabling the state-of-the-art RF-DETR to operate effectively in evolving-world settings. We also introduce FOGS (Forgetting, Openness, Generalisation Score) to holistically evaluate performance across these dimensions. Extensive experiments on the Pascal Series and Diverse Weather benchmarks show that EW-DETR outperforms other methods, improving FOGS by 57.24%.

32. 【2602.20981】Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Link: https://arxiv.org/abs/2602.20981

Authors: Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: frame-level video information, Scaling multimodal alignment, video information, frame-level video, due to limited

Comments: Accepted to CVPR 2026

Abstract:Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones at test time. To this end, we present a multimodal hierarchical network, MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation, and significantly improves long audio generation for durations of more than 5 minutes. We also show that training on short clips and testing on long ones is possible in video-to-audio generation without training on the longer durations. In our experiments, the proposed method achieves remarkable results on long-video-to-audio benchmarks, outperforming prior work on video-to-audio tasks. Moreover, we showcase our model's ability to generate more than 5 minutes of audio, where prior video-to-audio methods fall short on long durations.

33. 【2602.20980】CrystaL: Spontaneous Emergence of Visual Latents in MLLMs

Link: https://arxiv.org/abs/2602.20980

Authors: Yang Zhang, Danyang Li, Yuxuan Li, Xin Zhang, Tianyu Xie, Mingming Cheng, Xiang Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Multimodal Large Language, integrating powerful language, powerful language backbones, Multimodal Large

Comments:

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable performance by integrating powerful language backbones with large-scale visual encoders. Among these, latent Chain-of-Thought (CoT) methods enable implicit reasoning in continuous hidden states, facilitating seamless vision-language integration and faster inference. However, existing heuristically predefined supervision signals in latent CoT provide limited guidance for preserving critical visual information in intermediate latent states. To address this limitation, we propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images, respectively. By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics, without relying on auxiliary annotations or external modules. Extensive experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines, achieving substantial gains in fine-grained visual understanding while maintaining robust reasoning capabilities.

34. 【2602.20972】Are Multimodal Large Language Models Good Annotators for Image Tagging?

Link: https://arxiv.org/abs/2602.20972

Authors: Ming-Kun Xie, Jia-Hao Xiao, Zhiqiang Kou, Zhongnian Li, Gang Niu, Masashi Sugiyama

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fundamental vision task, train multi-label classifiers, incurs significant labor, Large Language Models, Multimodal Large Language

Comments:

Abstract:Image tagging, a fundamental vision task, traditionally relies on human-annotated datasets to train multi-label classifiers, which incurs significant labor and costs. While Multimodal Large Language Models (MLLMs) offer promising potential to automate annotation, their capability to replace human annotators remains underexplored. This paper aims to analyze the gap between MLLM-generated and human annotations and to propose an effective solution that enables MLLM-based annotation to replace manual labeling. Our analysis of MLLM annotations reveals that, under a conservative estimate, MLLMs can reduce annotation cost to as low as one-thousandth of the human cost, mainly accounting for GPU usage, which is nearly negligible compared to manual efforts. Their annotation quality reaches about 50% to 80% of human performance, while achieving over 90% performance on downstream training. Motivated by these findings, we propose TagLLM, a novel framework for image tagging, which aims to narrow the gap between MLLM-generated and human annotations. TagLLM comprises two components: candidate generation, which employs structured group-wise prompting to efficiently produce a compact candidate set that covers as many true labels as possible while reducing subsequent annotation workload; and label disambiguation, which interactively calibrates the semantic concept of categories in the prompts and effectively refines the candidate labels. Extensive experiments show that TagLLM substantially narrows the gap between MLLM-generated and human annotations, especially in downstream training performance, where it closes about 60% to 80% of the difference.

35. 【2602.20951】See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Link: https://arxiv.org/abs/2602.20951

Authors: Jaehyun Park, Minyoung Ahn, Minkyu Kim, Jonghyun Lee, Jae-Gil Lee, Dongmin Park

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: compromise realism, recent advances, visual artifacts, generated images, artifacts

Comments:

Abstract:Despite recent advances in diffusion models, AI-generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and bigger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a crucial area of study. Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets. In this paper, we propose ArtiAgent, which efficiently creates pairs of real and artifact-injected images. It comprises three agents: a perception agent that recognizes and grounds entities and subentities from real images, a synthesis agent that introduces artifacts via artifact injection tools through novel patch-wise embedding manipulation within a diffusion transformer, and a curation agent that filters the synthesized artifacts and generates both local and global explanations for each instance. Using ArtiAgent, we synthesize 100K images with rich artifact annotations and demonstrate both efficacy and versatility across diverse applications. Code is available at link.

36. 【2602.20947】Estimation of Confidence Bounds in Binary Classification using Wilson Score Kernel Density Estimation

Link: https://arxiv.org/abs/2602.20947

Authors: Thorbjørn Mosekjær Iversen, Zebin Duan, Frederik Hagelskjær

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Wilson Score Kernel, Score Kernel Density, deep learning-based binary, recent years, learning-based binary classifiers

Comments:

Abstract:The performance and ease of use of deep learning-based binary classifiers have improved significantly in recent years. This has opened up the potential for automating critical inspection tasks, which have traditionally only been trusted to be done manually. However, the application of binary classifiers in critical operations depends on the estimation of reliable confidence bounds such that system performance can be ensured up to a given statistical significance. We present Wilson Score Kernel Density Classification, which is a novel kernel-based method for estimating confidence bounds in binary classification. The core of our method is the Wilson Score Kernel Density Estimator, which is a function estimator for estimating confidence bounds in Binomial experiments with conditionally varying success probabilities. Our method is evaluated in the context of selective classification on four different datasets, illustrating its use as a classification head of any feature extractor, including vision foundation models. Our proposed method shows similar performance to Gaussian Process Classification, but at a lower computational complexity.
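For readers unfamiliar with the Wilson score that the method's name refers to, here is a minimal sketch of the classical Wilson score interval for a single binomial proportion. It is only the textbook interval, not the paper's Wilson Score Kernel Density Estimator, which handles conditionally varying success probabilities. The function name `wilson_interval` and the default `z = 1.96` (roughly a 95% two-sided level) are choices made for this example.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Classical Wilson score interval for a binomial proportion.

    Textbook formula only; NOT the kernel-based estimator from the paper.
    z is the standard-normal quantile (1.96 is roughly a 95% level).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2.0 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials * trials)
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Example: 90 correct decisions out of 100 inspected samples.
    lower, upper = wilson_interval(90, 100)
    print(f"lower={lower:.4f}, upper={upper:.4f}")
```

In a selective-classification setting, a lower bound of this kind could serve as a conservative estimate of the success rate before a prediction is accepted; how the paper actually constructs and uses its bounds is described in the paper itself.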

37. 【2602.20943】UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling

Link: https://arxiv.org/abs/2602.20943

Authors: Kaiyuan Tan, Yingying Shen, Mingfei Tu, Haohui Zhu, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving simulation, closed-loop learning, critical for autonomous, simulation and closed-loop, feed-forward methods

Comments:

Abstract:Dynamic driving scene reconstruction is critical for autonomous driving simulation and closed-loop learning. While recent feed-forward methods have shown promise for 3D reconstruction, they struggle with long-range driving sequences due to quadratic complexity in sequence length and challenges in modeling dynamic objects over extended durations. We propose UFO, a novel recurrent paradigm that combines the benefits of optimization-based and feed-forward methods for efficient long-range 4D reconstruction. Our approach maintains a 4D scene representation that is iteratively refined as new observations arrive, using a visibility-based filtering mechanism to select informative scene tokens and enable efficient processing of long sequences. For dynamic objects, we introduce an object pose-guided modeling approach that supports accurate long-range motion capture. Experiments on the Waymo Open Dataset demonstrate that our method significantly outperforms both per-scene optimization and existing feed-forward methods across various sequence lengths. Notably, our approach can reconstruct 16-second driving logs within 0.5 seconds while maintaining superior visual quality and geometric accuracy.

38. 【2602.20933】Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting

Link: https://arxiv.org/abs/2602.20933

Authors: Shuangkang Fang, I-Chao Shen, Xuanyang Zhang, Zesheng Wang, Yufeng Wang, Wenrui Ding, Gang Yu, Takeo Igarashi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: nullifying Gaussian opacities, Gaussian Splatting, randomly nullifying Gaussian, Gaussian opacities, sparse-view conditions

Comments: Accepted by CVPR 2026

Abstract:Recent 3D Gaussian Splatting (3DGS) Dropout methods address overfitting under sparse-view conditions by randomly nullifying Gaussian opacities. However, we identify a neighbor compensation effect in these approaches: dropped Gaussians are often compensated by their neighbors, weakening the intended regularization. Moreover, these methods overlook the contribution of high-degree spherical harmonic (SH) coefficients to overfitting. To address these issues, we propose DropAnSH-GS, a novel anchor-based Dropout strategy. Rather than dropping Gaussians independently, our method randomly selects certain Gaussians as anchors and simultaneously removes their spatial neighbors. This effectively disrupts local redundancies near anchors and encourages the model to learn more robust, globally informed representations. Furthermore, we extend the Dropout to color attributes by randomly dropping higher-degree SH to concentrate appearance information in lower-degree SH. This strategy further mitigates overfitting and enables flexible post-training model compression via SH truncation. Experimental results demonstrate that DropAnSH-GS substantially outperforms existing Dropout methods with negligible computational overhead, and can be readily integrated into various 3DGS variants to enhance their performance. Project Website: this https URL
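The anchor-based dropout idea described above (pick random anchors, then drop their spatial neighbors together) can be sketched roughly as follows. This is an illustrative guess at the mechanism, not the authors' implementation: the fixed `radius` neighborhood, the `anchor_rate` parameter, dropping the anchor itself, and the function names are all assumptions made for this example, and the paper may define neighbors differently (for instance via k-nearest neighbors).

```python
import math
import random


def drop_neighbors(positions, anchors, radius):
    """Return a keep-mask that is False for every point within `radius`
    of any anchor (the anchor itself included, by assumption)."""
    keep = [True] * len(positions)
    for anchor_index in anchors:
        anchor_pos = positions[anchor_index]
        for j, pos in enumerate(positions):
            if math.dist(anchor_pos, pos) <= radius:
                keep[j] = False
    return keep


def anchor_dropout_mask(positions, anchor_rate=0.1, radius=0.5, rng=None):
    """Randomly choose anchors, then drop each anchor's neighborhood.

    anchor_rate and radius are hypothetical hyperparameters for this sketch.
    """
    rng = rng or random.Random()
    anchors = [i for i in range(len(positions)) if rng.random() < anchor_rate]
    return drop_neighbors(positions, anchors, radius)


if __name__ == "__main__":
    # Two nearby Gaussian centers and one far away.
    centers = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
    # Using the first center as the only anchor drops the nearby pair.
    print(drop_neighbors(centers, anchors=[0], radius=0.5))
```

Dropping a whole neighborhood at once means nearby Gaussians cannot compensate for a removed one, which is the "neighbor compensation effect" the abstract sets out to counter. The quadratic loop is for clarity only; a practical version would use a spatial index.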

39. 【2602.20930】Computing a Characteristic Orientation for Rotation-Independent Image Analysis

Link: https://arxiv.org/abs/2602.20930

Authors: Cristian Valero-Abundio, Emilio Sansano-Sansano, Raúl Montoliu, Marina Martínez García

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Handling geometric transformations, Handling geometric, geometric transformations, computer vision, challenge in deep

Comments: Accepted for publication at the 21st International Conference on Computer Vision Theory and Applications (VISAPP 2026). 8 pages

Abstract:Handling geometric transformations, particularly rotations, remains a challenge in deep learning for computer vision. Standard neural networks lack inherent rotation invariance and typically rely on data augmentation or architectural modifications to improve robustness. Although effective, these approaches increase computational demands, require specialised implementations, or alter network structures, limiting their applicability. This paper introduces General Intensity Direction (GID), a preprocessing method that improves rotation robustness without modifying the network architecture. The method estimates a global orientation for each image and aligns it to a canonical reference frame, allowing standard models to process inputs more consistently across different rotations. Unlike moment-based approaches that extract invariant descriptors, this method directly transforms the image while preserving spatial structure, making it compatible with convolutional networks. Experimental evaluation on the rotated MNIST dataset shows that the proposed method achieves higher accuracy than state-of-the-art rotation-invariant architectures. Additional experiments on the CIFAR-10 dataset confirm that the method remains effective under more complex conditions.

40. 【2602.20925】LST-SLAM: A Stereo Thermal SLAM System for Kilometer-Scale Dynamic Environments

Link: https://arxiv.org/abs/2602.20925

Authors: Zeyu Jiang, Kuan Xu, Changhao Chen

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Thermal cameras offer, offer strong potential, cameras offer strong, thermal Simultaneous Localization, weather conditions

Comments: ICRA 2026

Abstract:Thermal cameras offer strong potential for robot perception under challenging illumination and weather conditions. However, thermal Simultaneous Localization and Mapping (SLAM) remains difficult due to unreliable feature extraction, unstable motion tracking, and inconsistent global pose and map construction, particularly in dynamic large-scale outdoor environments. To address these challenges, we propose LST-SLAM, a novel large-scale stereo thermal SLAM system that achieves robust performance in complex, dynamic scenes. Our approach combines self-supervised thermal feature learning, stereo dual-level motion tracking, and geometric pose optimization. We also introduce a semantic-geometric hybrid constraint that suppresses potentially dynamic features lacking strong inter-frame geometric consistency. Furthermore, we develop an online incremental bag-of-words model for loop closure detection, coupled with global pose optimization to mitigate accumulated drift. Extensive experiments on kilometer-scale dynamic thermal datasets show that LST-SLAM significantly outperforms recent representative SLAM systems, including AirSLAM and DROID-SLAM, in both robustness and accuracy.

41. 【2602.20913】LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

Link: https://arxiv.org/abs/2602.20913

Authors: Jihao Qiu, Lingxi Xie, Xinyue Huo, Qi Tian, Qixiang Ye

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low computational budgets, computational budgets, paper addresses, addresses the critical, critical and underexplored

Comments: 17 pages, 9 figures, 8 tables, accepted to CVPR 2026

Abstract:This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of LongVideo-R1, which achieves a superior trade-off between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available. Code and data are available at: this https URL

42. 【2602.20911】From Isolation to Integration: Building an Adaptive Expert Forest for Pre-Trained Model-based Class-Incremental Learning

Link: https://arxiv.org/abs/2602.20911

Authors: Ruiqi Liu, Boyu Diao, Hangda Liu, Zhulin An, Fei Wang, Yongjun Xu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Class-Incremental Learning, learn new classes, requires models, CIL, Learning

Abstract:Class-Incremental Learning (CIL) requires models to learn new classes without forgetting old ones. A common method is to freeze a pre-trained model and train a new, lightweight adapter for each task. While this prevents forgetting, it treats the learned knowledge as a simple, unstructured collection and fails to use the relationships between tasks. To this end, we propose the Semantic-guided Adaptive Expert Forest (SAEF), a new method that organizes adapters into a structured hierarchy for better knowledge sharing. SAEF first groups tasks into conceptual clusters based on their semantic relationships. Then, within each cluster, it builds a balanced expert tree by creating new adapters from merging the adapters of similar tasks. At inference time, SAEF finds and activates a set of relevant experts from the forest for any given input. The final prediction is made by combining the outputs of these activated experts, weighted by how confident each expert is. Experiments on several benchmark datasets show that SAEF achieves SOTA performance.
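
The confidence-weighted combination described in the last step of this abstract can be illustrated with a minimal sketch. It assumes that each activated expert outputs a class-probability vector and that "confidence" is that expert's top probability; both are assumptions made for illustration, and the function name `combine_experts` is hypothetical rather than taken from the paper.

```python
def combine_experts(expert_probs):
    """Average class-probability vectors, weighting each expert by its top probability.

    Using the top probability as the confidence score is an assumption of this sketch.
    """
    weights = [max(p) for p in expert_probs]
    total = sum(weights)
    n_classes = len(expert_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, expert_probs)) / total
        for c in range(n_classes)
    ]


# Two hypothetical experts voting over two classes.
experts = [[0.9, 0.1], [0.4, 0.6]]
combined = combine_experts(experts)
```

Here the more confident first expert dominates, so the combined prediction leans toward class 0 even though the second expert prefers class 1.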

43. 【2602.20903】TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering

Link: https://arxiv.org/abs/2602.20903

Authors: Hanshen Zhu, Yuliang Liu, Xuecheng Wu, An-Lan Wang, Hao Feng, Dingkang Yang, Chao Feng, Can Huang, Jingqun Tang, Xiang Bai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advanced models frequently, models frequently produce, frequently produce text, frequently produce, specialist OCR models

Comments: Code: [this https URL](https://github.com/CIawevy/TextPecker)

Abstract:Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. As a result, even state-of-the-art generators (e.g., SeedDream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand structural-error coverage. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it yields significant average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state-of-the-art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step towards reliable and structurally faithful visual text generation.

44. 【2602.20901】SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models

Link: https://arxiv.org/abs/2602.20901

Authors: Yuechen Xie, Xiaoyan Zhang, Yicheng Shan, Hao Zhu, Rui Tang, Rong Wei, Mingli Song, Yuanyu Wan, Jie Song

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: spatial logical reasoning, spatial logical, real-world scenarios due, logical reasoning, logical

Comments: Accepted by CVPR 2026

Abstract:Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects in complex scenes, but also the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs, outperforming all previous methods. Code and dataset are available at this https URL.

45. 【2602.20880】When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance

Link: https://arxiv.org/abs/2602.20880

Authors: Yongli Xiang, Ziming Hong, Zhaoqing Wang, Xiangyu Zhao, Bo Han, Tongliang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generating high-quality images, demonstrated significant advancements, raising potential safety, potential safety concerns, high-quality images

Comments: CVPR 2026; Code is released at [this https URL](https://github.com/tmllab/2026_CVPR_CASG)

Abstract:Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to "harmful conflicts" where mitigating one type of harm may inadvertently amplify another, thus increasing overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model's evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.
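
The two-step idea in this abstract (identify the single most aligned category, then steer only along it) can be sketched as below. This is a toy illustration under stated assumptions: alignment is measured by cosine similarity, steering is a plain vector subtraction, and the category names, vectors, and the function `steer_single_category` are all hypothetical. It is not the paper's CaCI or CrGA implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


def steer_single_category(state, category_dirs, scale=1.0):
    """Pick the category direction most aligned with the state and push away from it only."""
    best = max(category_dirs, key=lambda name: cosine(state, category_dirs[name]))
    direction = category_dirs[best]
    steered = [s - scale * d for s, d in zip(state, direction)]
    return best, steered


# Hypothetical 2-D "generative state" and per-category directions.
category_dirs = {"category_a": [1.0, 0.0], "category_b": [0.0, 1.0]}
state = [0.9, 0.1]
best, steered = steer_single_category(state, category_dirs, scale=0.5)
```

Because only the selected direction is subtracted, the component along the other category is left untouched, which is the interference the abstract says averaged guidance can cause.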

46. 【2602.20873】MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification

Link: https://arxiv.org/abs/2602.20873

Authors: Jiahao Xu, Sheng Huang, Xin Zhang, Zhixiong Nan, Jiajun Dong, Nankun Mu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: slide image classification, slide image, expert-labeled slides, image classification, classification is primarily

Comments: Accepted by CVPR 2026

Abstract:In computational pathology, few-shot whole slide image classification is primarily driven by the extreme scarcity of expert-labeled slides. Recent vision-language methods incorporate textual semantics generated by large language models, but treat these descriptions as static class-level priors that are shared across all samples and lack sample-wise refinement. This limits both the diversity and precision of visual-semantic alignment, hindering generalization under limited supervision. To overcome this, we propose the stochastic MUlti-view Semantic Enhancement (MUSE), a framework that first refines semantic precision via sample-wise adaptation and then enhances semantic richness through retrieval-augmented multi-view generation. Specifically, MUSE introduces Sample-wise Fine-grained Semantic Enhancement (SFSE), which yields a fine-grained semantic prior for each sample through MoE-based adaptive visual-semantic interaction. Guided by this prior, Stochastic Multi-view Model Optimization (SMMO) constructs an LLM-generated knowledge base of diverse pathological descriptions per class, then retrieves and stochastically integrates multiple matched textual views during training. These dynamically selected texts serve as enriched semantic supervisions to stochastically optimize the vision-language model, promoting robustness and mitigating overfitting. Experiments on three benchmark WSI datasets show that MUSE consistently outperforms existing vision-language baselines in few-shot settings, demonstrating that effective few-shot pathology learning requires not only richer semantic sources but also their active and sample-aware semantic optimization. Our code is available at: this https URL.

47. 【2602.20860】DA-Cal: Towards Cross-Domain Calibration in Semantic Segmentation

Link: https://arxiv.org/abs/2602.20860

Authors: Wangkai Li, Rui Sun, Zhaoyang Li, Yujia Chen, Tianzhu Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: methods greatly enhance, unsupervised domain adaptation, greatly enhance target, methods greatly, resulting in misalignment

Abstract:While existing unsupervised domain adaptation (UDA) methods greatly enhance target domain performance in semantic segmentation, they often neglect network calibration quality, resulting in misalignment between prediction confidence and actual accuracy -- a significant risk in safety-critical applications. Our key insight emerges from observing that performance degrades substantially when soft pseudo-labels replace hard pseudo-labels in cross-domain scenarios due to poor calibration, despite the theoretical equivalence of perfectly calibrated soft pseudo-labels to hard pseudo-labels. Based on this finding, we propose DA-Cal, a dedicated cross-domain calibration framework that transforms target domain calibration into soft pseudo-label optimization. DA-Cal introduces a Meta Temperature Network to generate pixel-level calibration parameters and employs bi-level optimization to establish the relationship between soft pseudo-labels and UDA supervision, while utilizing complementary domain-mixing strategies to prevent overfitting and reduce domain discrepancies. Experiments demonstrate that DA-Cal seamlessly integrates with existing self-training frameworks across multiple UDA segmentation benchmarks, significantly improving target domain calibration while delivering performance gains without inference overhead. The code will be released.
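
As a rough illustration of what a pixel-level calibration parameter could look like, the sketch below applies standard temperature scaling to the logits of one pixel: dividing by a temperature above 1 softens an over-confident soft pseudo-label without changing the predicted class. The function name `soft_pseudo_label` and the example logits are hypothetical, and the paper's Meta Temperature Network and bi-level optimization are not reproduced here.

```python
import math


def soft_pseudo_label(logits, temperature):
    """Softmax of logits / temperature for a single pixel (illustrative only)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one pixel over three classes.
pixel_logits = [4.0, 1.0, 0.0]
sharp = soft_pseudo_label(pixel_logits, temperature=1.0)
soft = soft_pseudo_label(pixel_logits, temperature=2.0)
```

In a per-pixel scheme, each pixel would receive its own temperature instead of the single global value used in classic temperature scaling.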

48. 【2602.20853】On the Explainability of Vision-Language Models in Art History

Link: https://arxiv.org/abs/2602.20853

Authors: Stefanie Schneider

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: shared embedding space, Vision-Language Models, embedding space, Explainable Artificial Intelligence, textual data

Abstract:Vision-Language Models (VLMs) transfer visual and textual data into a shared embedding space. In so doing, they enable a wide range of multimodal tasks, while also raising critical questions about the nature of machine 'understanding.' In this paper, we examine how Explainable Artificial Intelligence (XAI) methods can render the visual reasoning of a VLM - namely, CLIP - legible in art-historical contexts. To this end, we evaluate seven methods, combining zero-shot localization experiments with human interpretability studies. Our results indicate that, while these methods capture some aspects of human interpretation, their effectiveness hinges on the conceptual stability and representational availability of the examined categories.

49. 【2602.20851】Hybrid Fusion: One-Minute Efficient Training for Zero-Shot Cross-Domain Image Fusion

Link: https://arxiv.org/abs/2602.20851

Authors: Ran Zhang, Xuanhua He, Liu Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Image fusion seeks, superior image, integrate complementary information, seeks to integrate, integrate complementary

Abstract:Image fusion seeks to integrate complementary information from multiple sources into a single, superior image. While traditional methods are fast, they lack adaptability and performance. Conversely, deep learning approaches achieve state-of-the-art (SOTA) results but suffer from critical inefficiencies: their reliance on slow, resource-intensive, patch-based training introduces a significant gap with full-resolution inference. We propose a novel hybrid framework that resolves this trade-off. Our method utilizes a learnable U-Net to generate a dynamic guidance map that directs a classic, fixed Laplacian pyramid fusion kernel. This decoupling of policy learning from pixel synthesis enables remarkably efficient full-resolution training, eliminating the train-inference gap. Consequently, our model achieves SOTA-comparable performance in about one minute on an RTX 4090 or two minutes on a consumer laptop GPU from scratch without any external model and demonstrates powerful zero-shot generalization across diverse tasks, from infrared-visible to medical imaging. By design, the fused output is linearly constructed solely from source information, ensuring high faithfulness for critical applications. The codes are available at this https URL

50. 【2602.20845】FLIM Networks with Bag of Feature Points

Link: https://arxiv.org/abs/2602.20845

Authors: João Deltregia Martinelli, Marcelo Luis Rodrigues Filho, Felipe Crispim da Rocha Salvagnini, Gilson Junior Soares, Jefersson A. dos Santos, Alexandre X. Falcão

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Convolutional networks require, extensive image annotation, require extensive image, networks require extensive, Convolutional networks

Comments: Accepted at the 28th Iberoamerican Congress on Pattern Recognition (CIARP 2025). To appear in Lecture Notes in Computer Science (LNCS), Springer

Abstract:Convolutional networks require extensive image annotation, which can be costly and time-consuming. Feature Learning from Image Markers (FLIM) tackles this challenge by estimating encoder filters (i.e., kernel weights) from user-drawn markers on discriminative regions of a few representative images without traditional optimization. Such an encoder combined with an adaptive decoder comprises a FLIM network fully trained without backpropagation. Prior research has demonstrated their effectiveness in Salient Object Detection (SOD), being significantly lighter than existing lightweight models. This study revisits FLIM SOD and introduces FLIM-Bag of Feature Points (FLIM-BoFP), a considerably faster filter estimation method. The previous approach, FLIM-Cluster, derives filters through patch clustering at each encoder's block, leading to computational overhead and reduced control over filter locations. FLIM-BoFP streamlines this process by performing a single clustering at the input block, creating a bag of feature points, and defining filters directly from mapped feature points across all blocks. The paper evaluates the benefits in efficiency, effectiveness, and generalization of FLIM-BoFP compared to FLIM-Cluster and other state-of-the-art baselines for parasite detection in optical microscopy images.

51. 【2602.20839】Training-Free Multi-Concept Image Editing

Link: https://arxiv.org/abs/2602.20839

Authors: Niki Foteinopoulou, Ignas Budvytis, Stephan Liwicki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: training remains challenging, remains challenging, training remains, Abstract, unifies Optimised DDS

Comments: 17 pages, 13 figures

Abstract:Editing images with diffusion models without training remains challenging. While recent optimisation-based methods achieve strong zero-shot edits from text, they struggle to preserve identity or capture details that language alone cannot express. Many visual concepts such as facial structure, material texture, or object geometry are impossible to express purely through text prompts alone. To address this gap, we introduce a training-free framework for concept-based image editing, which unifies Optimised DDS with LoRA-driven concept composition, where the training data of the LoRA represent the concept. Our approach enables combining and controlling multiple visual concepts directly within the diffusion process, integrating semantic guidance from text with low-level cues from pretrained concept adapters. We further refine DDS for stability and controllability through ordered timesteps, regularisation, and negative-prompt guidance. Quantitative and qualitative results demonstrate consistent improvements over existing training-free diffusion editing methods on InstructPix2Pix and ComposLoRA benchmarks. Code will be made publicly available.

52. 【2602.20818】GatedCLIP: Gated Multimodal Fusion for Hateful Memes Detection

Link: https://arxiv.org/abs/2602.20818

Authors: Yingying Guo, Ke Zhang, Zirong Zeng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Detecting hateful content, presents unique challenges, memes presents unique, multimodal memes presents, Detecting hateful

Comments: Preprint

Abstract:Detecting hateful content in multimodal memes presents unique challenges, as harmful messages often emerge from the complex interplay between benign images and text. We propose GatedCLIP, a Vision-Language model that enhances CLIP's multimodal capabilities with specialized architectural improvements for hateful memes detection. Our approach introduces learned projection heads that map CLIP embeddings to a task-optimized semantic space, a dynamic gated fusion mechanism that adaptively weights visual and textual features, and a contrastive learning objective that maintains cross-modal semantic alignment. Experiments on the Hateful Memes dataset demonstrate that GatedCLIP achieves an AUROC of 0.66, substantially outperforming the CLIP baseline (AUROC 0.49) while maintaining computational efficiency with only 350K trainable parameters.
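
A gated fusion of the kind mentioned in this abstract is commonly written as a sigmoid gate computed from both inputs, followed by a per-dimension convex blend. The sketch below assumes that common form; the abstract does not give the exact formulation, so the blend rule, the function name `gated_fusion`, and the toy weights are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, textual, gate_weights, gate_bias):
    """Blend two equal-length feature vectors with a per-dimension gate.

    gate_weights[i] is a weight row over the concatenated [visual; textual] vector.
    Output dimension i is g_i * visual[i] + (1 - g_i) * textual[i].
    """
    concat = visual + textual
    fused = []
    for i, (v, t) in enumerate(zip(visual, textual)):
        pre = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(pre)
        fused.append(g * v + (1.0 - g) * t)
    return fused


# Toy 2-D features; all-zero gate weights give a neutral gate of 0.5.
visual = [1.0, 0.0]
textual = [0.0, 1.0]
zero_w = [[0.0] * 4, [0.0] * 4]
balanced = gated_fusion(visual, textual, zero_w, [0.0, 0.0])
# A large positive bias saturates the gate, so the output follows the visual features.
visual_heavy = gated_fusion(visual, textual, zero_w, [50.0, 50.0])
```

In a trained model the gate weights would be learned, letting the network lean on the image or the text depending on the input.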

53. 【2602.20807】RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction

Link: https://arxiv.org/abs/2602.20807

Authors: Yangfan Zhao, Hanwei Zhang, Ke Huang, Qiufeng Wang, Zhenzhou Shao, Dengyu Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Simultaneous Localization, Gaussian Splatting SLAM, Gaussian splatting, enables continuous, gained popularity

Abstract:Combining 3D Gaussian splatting with Simultaneous Localization and Mapping (SLAM) has gained popularity as it enables continuous 3D environment reconstruction during motion. However, existing methods struggle in dynamic environments; in particular, moving objects complicate 3D reconstruction and, in turn, hinder reliable tracking. The emergence of 4D reconstruction, especially 4D Gaussian splatting, offers a promising direction for addressing these challenges, yet its potential for 4D-aware SLAM remains largely underexplored. Along this direction, we propose a robust and efficient framework, namely Reweighting Uncertainty in Gaussian Splatting SLAM (RU4D-SLAM) for 4D scene reconstruction, that introduces temporal factors into spatial 3D representation while incorporating uncertainty-aware perception of scene changes, blurred image synthesis, and dynamic scene reconstruction. We enhance dynamic scene representation by integrating motion blur rendering, and improve uncertainty-aware tracking by extending per-pixel uncertainty modeling, which is originally designed for static scenarios, to handle blurred images. Furthermore, we propose a semantic-guided reweighting mechanism for per-pixel uncertainty estimation in dynamic scenes, and introduce a learnable opacity weight to support adaptive 4D mapping. Extensive experiments on standard benchmarks demonstrate that our method substantially outperforms state-of-the-art approaches in both trajectory accuracy and 4D scene reconstruction, particularly in dynamic environments with moving objects and low-quality inputs. Code available: this https URL

54. 【2602.20794】VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving

Link: https://arxiv.org/abs/2602.20794

Authors: Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, Long Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: geometric modeling capabilities, autonomous driving, existing Vision-Language Models, cross-view geometric grounding, inherently lack

Comments: CVPR 2026

Abstract:The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing QA data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM's 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks like cross-view risk perception, motion prediction, and trajectory planning. It's our belief that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.

55. 【2602.20792】SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking

Link: https://arxiv.org/abs/2602.20792

Authors: Muhammad Saif Ullah Khan, Didier Stricker

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex multi-joint kinematics, spine complex multi-joint, lack of large-scale, fundamental to understanding, remains underexplored

Comments: Accepted at CVPR 2026

Abstract:Modeling spinal motion is fundamental to understanding human biomechanics, yet remains underexplored in computer vision due to the spine's complex multi-joint kinematics and the lack of large-scale 3D annotations. We present a biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. Using this framework, we create the first open dataset, named SIMSPINE, which provides sparse vertebra-level 3D spinal annotations for natural full-body motions in indoor multi-camera capture without external restraints. With 2.14 million frames, this enables data-driven learning of vertebral kinematics from subtle posture variations and bridges the gap between musculoskeletal simulation and computer vision. In addition, we release pretrained baselines covering fine-tuned 2D detectors, monocular 3D pose lifting models, and multi-view reconstruction pipelines, establishing a unified benchmark for biomechanically valid spine motion estimation. Specifically, our 2D spine baselines improve the state-of-the-art from 0.63 to 0.80 AUC in controlled environments, and from 0.91 to 0.93 AP for in-the-wild spine tracking. Together, the simulation framework and SIMSPINE dataset advance research in vision-based biomechanics, motion analysis, and digital human modeling by enabling reproducible, anatomically grounded 3D spine estimation under natural conditions.

56. 【2602.20790】Real-time Motion Segmentation with Event-based Normal Flow

Link: https://arxiv.org/abs/2602.20790

Authors: Sheng Zhong, Zhongyang Ren, Xiya Zhu, Dehao Yuan, Cornelia Fermuller, Yi Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: handle visual tasks, microsecond resolution, offering the potential, challenging scenarios, cameras are bio-inspired

Abstract:Event-based cameras are bio-inspired sensors with pixels that independently and asynchronously respond to brightness changes at microsecond resolution, offering the potential to handle visual tasks in challenging scenarios. However, due to the sparse information content in individual events, directly processing the raw event data to solve vision tasks is highly inefficient, which severely limits the applicability of state-of-the-art methods in real-time tasks, such as motion segmentation, a fundamental task for dynamic scene understanding. Incorporating normal flow as an intermediate representation to compress motion information from event clusters within a localized region provides a more effective solution. In this work, we propose a normal flow-based motion segmentation framework for event-based vision. Leveraging the dense normal flow directly learned from event neighborhoods as input, we formulate the motion segmentation task as an energy minimization problem solved via graph cuts, and optimize it iteratively with normal flow clustering and motion model fitting. By using a normal flow-based motion model initialization and fitting method, the proposed system is able to efficiently estimate the motion models of independently moving objects with only a limited number of candidate models, which significantly reduces the computational complexity and ensures real-time performance, achieving nearly an 800x speedup in comparison to the open-source state-of-the-art method. Extensive evaluations on multiple public datasets fully demonstrate the accuracy and efficiency of our framework.

57. 【2602.20773】Federated Learning for Cross-Modality Medical Image Segmentation via Augmentation-Driven Generalization

Link: https://arxiv.org/abs/2602.20773

Authors: Sachin Dudda Nagaraju, Ashkan Moradi, Bendik Skarre Abrahamsen, Mattijs Elschot

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical image analysis, remains difficult due, Artificial intelligence, models remains difficult, privacy-constrained imaging data

Comments: Submitted to IEEE JBHI

Abstract:Artificial intelligence has emerged as a transformative tool in medical image analysis, yet developing robust and generalizable segmentation models remains difficult due to fragmented, privacy-constrained imaging data siloed across institutions. While federated learning (FL) enables collaborative model training without centralizing data, cross-modality domain shifts pose a critical challenge, particularly when models trained on one modality fail to generalize to another. Many existing solutions require paired multimodal data per patient or rely on complex architectures, both of which are impractical in real clinical settings. In this work, we consider a realistic FL scenario where each client holds single-modality data (CT or MRI), and systematically investigate augmentation strategies for cross-modality generalization. Using abdominal organ segmentation and whole-heart segmentation as representative multi-class and binary segmentation benchmarks, we evaluate convolution-based spatial augmentation, frequency-domain manipulation, domain-specific normalization, and global intensity nonlinear (GIN) augmentation. Our results show that GIN consistently outperforms alternatives in both centralized and federated settings by simulating cross-modality appearance variations while preserving anatomical structure. For the pancreas, Dice score improved from 0.073 to 0.437, a 498% gain. Our federated approach achieves 93-98% of centralized training accuracy, demonstrating strong cross-modality generalization without compromising data privacy, pointing toward feasible federated AI deployment across diverse healthcare systems.
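
For readers less familiar with the metric, the sketch below computes a Dice score on a toy pair of binary masks and re-derives the relative gain quoted for the pancreas. The helper names are hypothetical, and the masks are made up for illustration. With the rounded scores 0.073 and 0.437 the gain comes out at roughly 498.6%, which is consistent with the reported 498%.

```python
def dice_score(pred, target):
    """Dice coefficient 2*|A and B| / (|A| + |B|) for binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 if denom == 0 else 2.0 * intersection / denom


def relative_gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Toy masks: one overlapping pixel, two foreground pixels in each mask.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
toy_dice = dice_score(pred, target)

# Pancreas Dice scores quoted in the abstract.
gain = relative_gain_percent(0.073, 0.437)
```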

58. 【2602.20752】OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation

Link: https://arxiv.org/abs/2602.20752

Authors: Tian Lan, Lei Xu, Zimu Yuan, Shanggui Liu, Jiajun Liu, Jiaxin Liu, Weilai Xiang, Hongyu Yang, Dong Jiang, Jianxin Yin, Dingyu Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: global health burden, Musculoskeletal disorders represent, significant global health, musculoskeletal MRI interpretation, disability worldwide

Abstract:Musculoskeletal disorders represent a significant global health burden and are a leading cause of disability worldwide. While MRI is essential for accurate diagnosis, its interpretation remains exceptionally challenging. Radiologists must identify multiple potential abnormalities within complex anatomical structures across different imaging planes, a process that requires significant expertise and is prone to variability. We developed OrthoDiffusion, a unified diffusion-based foundation model designed for multi-task musculoskeletal MRI interpretation. The framework utilizes three orientation-specific 3D diffusion models, pre-trained in a self-supervised manner on 15,948 unlabeled knee MRI scans, to learn robust anatomical features from sagittal, coronal, and axial views. These view-specific representations are integrated to support diverse clinical tasks, including anatomical segmentation and multi-label diagnosis. Our evaluation demonstrates that OrthoDiffusion achieves excellent performance in the segmentation of 11 knee structures and the detection of 8 knee abnormalities. The model exhibited remarkable robustness across different clinical centers and MRI field strengths, consistently outperforming traditional supervised models. Notably, in settings where labeled data was scarce, OrthoDiffusion maintained high diagnostic precision using only 10% of training labels. Furthermore, the anatomical representations learned from knee imaging proved highly transferable to other joints, achieving strong diagnostic performance across 11 diseases of the ankle and shoulder. These findings suggest that diffusion-based foundation models can serve as a unified platform for multi-disease diagnosis and anatomical segmentation, potentially improving the efficiency and accuracy of musculoskeletal MRI interpretation in real-world clinical workflows.

59. 【2602.20739】PyVision-RL: Forging Open Agentic Vision Models via RL

Link: https://arxiv.org/abs/2602.20739

Authors: Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, Chen Wei

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: agentic multimodal models, Reinforcement learning, reinforcement learning framework, agentic behavior, limiting the benefits

Comments: preprint

Abstract:Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.

60. 【2602.20731】Communication-Inspired Tokenization for Structured Image Representations

Link: https://arxiv.org/abs/2602.20731

Authors: Aram Davtyan, Yusuf Sahin, Yasaman Haghighi, Sebastian Stapf, Pablo Acuaviva, Alexandre Alahi, Paolo Favaro

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: multimodal systems, transformer-based architectures, Discrete image tokenizers, tokenizers have emerged, key component

Comments: Project website: [this https URL](https://araachie.github.io/comit/)

Abstract:Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.

61. 【2602.20725】Bridging Physically Based Rendering and Diffusion Models with Stochastic Differential Equation

Link: https://arxiv.org/abs/2602.20725

Authors: Junwei Shu, Wenjie Liu, Changgu Chen, Hantang Liu, Yang Li, Changbo Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: producing realistic content, limited explicit control, image generators excel, Diffusion-based image generators, generators excel

Comments: preprint

Abstract:Diffusion-based image generators excel at producing realistic content from text or image conditions, but they offer only limited explicit control over low-level, physically grounded shading and material properties. In contrast, physically based rendering (PBR) offers fine-grained physical control but lacks prompt-driven flexibility. Although these two paradigms originate from distinct communities, both share a common evolution -- from noisy observations to clean images. In this paper, we propose a unified stochastic formulation that bridges Monte Carlo rendering and diffusion-based generative modeling. First, a general stochastic differential equation (SDE) formulation for Monte Carlo integration under the Central Limit Theorem is modeled. Through instantiation via physically based path tracing, we convert it into a physically grounded SDE representation. Moreover, we provide a systematic analysis of how the physical characteristics of path tracing can be extended to existing diffusion models from the perspective of noise variance. Extensive experiments across multiple tasks show that our method can exert physically grounded control over diffusion-generated results, covering tasks such as rendering and material editing.
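
The bridge described in this abstract rests on a standard property of Monte Carlo estimators. The derivation below is a sketch in generic notation chosen for this note (the symbols $f$, $p$, $N$, $\sigma$ are assumptions and are not taken from the paper): a pixel estimated from $N$ samples behaves, for large $N$, like the converged value plus Gaussian noise whose standard deviation shrinks as $1/\sqrt{N}$.

```latex
\hat{I}_N = \frac{1}{N}\sum_{i=1}^{N} \frac{f(x_i)}{p(x_i)}, \qquad x_i \sim p,
\qquad \mathbb{E}\big[\hat{I}_N\big] = I = \int f(x)\,\mathrm{d}x

\sigma^2 = \operatorname{Var}_{x \sim p}\!\left[\frac{f(x)}{p(x)}\right], \qquad
\sqrt{N}\,\big(\hat{I}_N - I\big) \xrightarrow{\;d\;} \mathcal{N}(0, \sigma^2)
\;\;\Longrightarrow\;\;
\hat{I}_N \approx I + \frac{\sigma}{\sqrt{N}}\,\varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1)
```

Read this way, a low-sample render has the same "clean signal plus Gaussian noise" form as a noised sample in a diffusion process, with the sample count playing the role of the noise level. How the paper turns this into a specific SDE is not reproduced here.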

62. 【2602.20721】CleanStyle: Plug-and-Play Style Conditioning Purification for Text-to-Image Stylization

Link: https://arxiv.org/abs/2602.20721

Authors: Xiaoman Feng, Mingkun Lei, Yang Wang, Dingwen Fu, Chi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Style, style image undesirably, Style transfer, style embedding, tail components

Comments: 26 pages

Abstract:Style transfer in diffusion models enables controllable visual generation by injecting the style of a reference image. However, recent encoder-based methods, while efficient and tuning-free, often suffer from content leakage, where semantic elements from the style image undesirably appear in the output, impairing prompt fidelity and stylistic consistency. In this work, we introduce CleanStyle, a plug-and-play framework that filters out content-related noise from the style embedding without retraining. Motivated by empirical analysis, we observe that such leakage predominantly stems from the tail components of the style embedding, which are isolated via Singular Value Decomposition (SVD). To address this, we propose CleanStyleSVD (CS-SVD), which dynamically suppresses tail components using a time-aware exponential schedule, providing clean, style-preserving conditional embeddings throughout the denoising process. Furthermore, we present Style-Specific Classifier-Free Guidance (SS-CFG), which reuses the suppressed tail components to construct style-aware unconditional inputs. Unlike conventional methods that use generic negative embeddings (e.g., zero vectors), SS-CFG introduces targeted negative signals that reflect style-specific but prompt-irrelevant visual elements. This enables the model to effectively suppress these distracting patterns during generation, thereby improving prompt fidelity and enhancing the overall visual quality of stylized outputs. Our approach is lightweight, interpretable, and can be seamlessly integrated into existing encoder-based diffusion models without retraining. Extensive experiments demonstrate that CleanStyle substantially reduces content leakage, improves stylization quality and improves prompt alignment across a wide range of style references and prompts.
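
The tail-suppression idea can be written compactly. The formulation below is a sketch under assumptions made for this note: the cutoff index $k$, the rate $\lambda$, and the non-negative progress variable $s(t)$ are hypothetical, and the abstract does not say in which direction the schedule runs over the denoising steps.

```latex
E = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i\, u_i v_i^{\top},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0

\tilde{E}(t) = \sum_{i \le k} \sigma_i\, u_i v_i^{\top}
             \;+\; w(t) \sum_{i > k} \sigma_i\, u_i v_i^{\top},
\qquad w(t) = e^{-\lambda\, s(t)}
```

Here $E$ is the style embedding, the first sum keeps the head components untouched, and $w(t) \in (0, 1]$ scales the tail. The part that is removed, $(1 - w(t)) \sum_{i > k} \sigma_i u_i v_i^{\top}$, is the kind of signal the abstract describes reusing for the style-aware unconditional input; the exact construction is not given in the abstract.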

63. 【2602.20718】Monocular Endoscopic Tissue 3D Reconstruction with Multi-Level Geometry Regularization

Link: https://arxiv.org/abs/2602.20718

Authors: Yangsen Chen, Hao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieving robot-assisted surgery, Reconstructing deformable endoscopic, Gaussian Splatting, Reconstructing deformable, robot-assisted surgery

Comments: IJCNN 2025

Abstract:Reconstructing deformable endoscopic tissues is crucial for achieving robot-assisted surgery. However, 3D Gaussian Splatting-based approaches encounter challenges in achieving consistent tissue surface reconstruction, while existing NeRF-based methods lack real-time rendering capabilities. In pursuit of both smooth deformable surfaces and real-time rendering, we introduce a novel approach based on 3D Gaussian Splatting. Specifically, we introduce surface-aware reconstruction, initially employing a Signed Distance Field-based method to construct a mesh, subsequently utilizing this mesh to constrain the Gaussian Splatting reconstruction process. Furthermore, to ensure the generation of physically plausible deformations, we incorporate local rigidity and global non-rigidity restrictions to guide Gaussian deformation, tailored for the highly deformable nature of soft endoscopic tissue. Based on 3D Gaussian Splatting, our proposed method delivers a fast rendering process and smooth surface appearances. Quantitative and qualitative analysis against alternative methodologies shows that our approach achieves solid reconstruction quality in both textures and geometries.

64. 【2602.20709】Onboard-Targeted Segmentation of Straylight in Space Camera Sensors

Link: https://arxiv.org/abs/2602.20709

Authors: Riccardo Gallon, Fabian Schiemenz, Alessandra Menicucci, Eberhard Gill

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: space camera faults, artificial intelligence, study details, details an artificial, camera faults

Comments: Submitted to Aerospace Science and Technology

Abstract:This study details an artificial intelligence (AI)-based methodology for the semantic segmentation of space camera faults. Specifically, we address the segmentation of straylight effects induced by solar presence around the camera's Field of View (FoV). Anomalous images are sourced from our published dataset. Our approach emphasizes generalization across diverse flare textures, leveraging pre-training on a public dataset (Flare7k++) including flares in various non-space contexts to mitigate the scarcity of realistic space-specific data. A DeepLabV3 model with MobileNetV3 backbone performs the segmentation task. The model design targets deployment in spacecraft resource-constrained hardware. Finally, based on a proposed interface between our model and the onboard navigation pipeline, we develop custom metrics to assess the model's performance in the system-level context.

65. 【2602.20700】NGL-Prompter: Training-Free Sewing Pattern Estimation from a Single Image

Link: https://arxiv.org/abs/2602.20700

Authors: Anna Badalyan, Pratheba Selvaraju, Giorgio Becherini, Omid Taheri, Victoria Fernandez Abrevaya, Michael Black

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Estimating sewing patterns, Estimating sewing, creating high-quality, language, Estimating

Comments: 10 pages, 7 figures

Abstract:Estimating sewing patterns from images is a practical approach for creating high-quality 3D garments. Due to the lack of real-world pattern-image paired data, prior approaches fine-tune large vision language models (VLMs) on synthetic garment datasets generated by randomly sampling from a parametric garment model, GarmentCode. However, these methods often struggle to generalize to in-the-wild images, fail to capture real-world correlations between garment parts, and are typically restricted to single-layer outfits. In contrast, we observe that VLMs are effective at describing garments in natural language, yet perform poorly when asked to directly regress GarmentCode parameters from images. To bridge this gap, we propose NGL (Natural Garment Language), a novel intermediate language that restructures GarmentCode into a representation more understandable to language models. Leveraging this language, we introduce NGL-Prompter, a training-free pipeline that queries large VLMs to extract structured garment parameters, which are then deterministically mapped to valid GarmentCode. We evaluate our method on Dress4D, CloSe, and a newly collected dataset of approximately 5,000 in-the-wild fashion images. Our approach achieves state-of-the-art performance on standard geometry metrics and is strongly preferred in both human and GPT-based perceptual evaluations compared to existing baselines. Furthermore, NGL-Prompter can recover multi-layer outfits whereas competing methods focus mostly on single-layer garments, highlighting its strong generalization to real-world images even with occluded parts. These results demonstrate that accurate sewing pattern reconstruction is possible without costly model training. Our code and data will be released for research use.

66. 【2602.20689】MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision

Link: https://arxiv.org/abs/2602.20689

Authors: Bedrettin Cetinkaya, Sinan Kalkan, Emre Akbas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: edge maps remains, Generating crisp, edge detection, affecting both traditional, maps remains

Comments: Accepted to CVPR 2026

Abstract:Generating crisp, i.e., one-pixel-wide, edge maps remains one of the fundamental challenges in edge detection, affecting both traditional and learning-based methods. To obtain crisp edges, most existing approaches rely on two hand-crafted post-processing algorithms, Non-Maximum Suppression (NMS) and skeleton-based thinning, which are non-differentiable and hinder end-to-end optimization. Moreover, all existing crisp edge detection methods still depend on such post-processing to achieve satisfactory results. To address this limitation, we propose MatchED, a lightweight (only ~21K additional parameters), plug-and-play matching-based supervision module that can be appended to any edge detection model for joint end-to-end learning of crisp edges. At each training iteration, MatchED performs one-to-one matching between predicted and ground-truth edges based on spatial distance and confidence, ensuring consistency between training and testing protocols. Extensive experiments on four popular datasets demonstrate that integrating MatchED substantially improves the performance of existing edge detection models. In particular, MatchED increases the Average Crispness (AC) metric by up to 2-4× compared to baseline models. Under the crispness-emphasized evaluation (CEval), MatchED further boosts baseline performance by up to 20-35% in ODS and achieves similar gains in OIS and AP, achieving SOTA performance that matches or surpasses standard post-processing for the first time. Code is available at this https URL.
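
To make "one-to-one matching based on spatial distance and confidence" concrete, here is a minimal sketch. It is not the paper's module: it swaps in a simple greedy assignment, and the function name `match_edges`, the cost `dist + alpha * (1 - conf)`, and the parameters `max_dist` and `alpha` are all hypothetical choices made for illustration.

```python
import math


def match_edges(pred, gt, max_dist=2.0, alpha=1.0):
    """Greedy one-to-one matching of predicted to ground-truth edge pixels.

    pred: list of (row, col, confidence) tuples.
    gt:   list of (row, col) tuples.
    Returns a list of (pred_index, gt_index) pairs; each index appears at most once.
    """
    candidates = []
    for i, (pr, pc, conf) in enumerate(pred):
        for j, (gr, gc) in enumerate(gt):
            dist = math.hypot(pr - gr, pc - gc)
            if dist <= max_dist:
                # Hypothetical cost: closer and more confident predictions win.
                cost = dist + alpha * (1.0 - conf)
                candidates.append((cost, i, j))

    candidates.sort()  # cheapest candidate pairs first
    used_pred, used_gt, pairs = set(), set(), []
    for _cost, i, j in candidates:
        if i not in used_pred and j not in used_gt:
            pairs.append((i, j))
            used_pred.add(i)
            used_gt.add(j)
    return pairs
```

With two predictions next to the same ground-truth pixel, only the cheaper one is matched and the other stays unmatched, which is the behavior that penalizes thick, multi-pixel edge responses during training.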

67. 【2602.20685】RAYNOVA: 3D-Geometry-Free Auto-Regressive Driving World Modeling with Unified Spatio-Temporal Representation

Link: https://arxiv.org/abs/2602.20685

Authors: Yichen Xie, Chensheng Peng, Mazen Abdelfattah, Yihan Hu, Jiezhi Yang, Eric Higgins, Ryan Brigden, Masayoshi Tomizuka, Wei Zhan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: physically plausible behavior, foundation models aim, World foundation models, plausible behavior, multiview world model

Comments: Accepted by CVPR 2026; Project website: [this http URL](http://yichen928.github.io/raynova)

Abstract:World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-agnostic multiview world model for driving scenarios that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RAYNOVA achieves state-of-the-art multi-view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without explicit 3D scene representation. Our code will be released at this https URL.
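
For readers unfamiliar with Plücker rays, the sketch below computes the usual 6-D Plücker coordinates of a camera ray: the unit direction `d` and the moment `m = o × d`. It only illustrates the underlying ray representation; the function name `plucker_ray` is made up for this note, and the paper's relative positional encoding built on top of such rays is not reproduced here.

```python
import math


def plucker_ray(origin, direction):
    """Return the 6-D Plücker coordinates (d, m) of a ray, with m = origin x d."""
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    d = tuple(c / norm for c in direction)  # unit direction
    ox, oy, oz = origin
    dx, dy, dz = d
    # Cross product origin x d: the moment of the ray about the world origin.
    m = (oy * dz - oz * dy, oz * dx - ox * dz, ox * dy - oy * dx)
    return d + m
```

A useful property is that the result does not depend on which point along the ray is used as `origin`, so the encoding describes the ray itself and not a particular camera center or image grid position.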

68. 【2602.20673】GA-Drive: Geometry-Appearance Decoupled Modeling for Free-viewpoint Driving Scene Generation

Link: https://arxiv.org/abs/2602.20673

Authors: Hao Zhang, Lue Fan, Qitai Wang, Wenbo Li, Zehuan Wu, Lewei Lu, Zhaoxiang Zhang, Hongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving systems, high-fidelity driving simulator, autonomous driving, driving systems, training and evaluating

Abstract:A free-viewpoint, editable, and high-fidelity driving simulator is crucial for training and evaluating end-to-end autonomous driving systems. In this paper, we present GA-Drive, a novel simulation framework capable of generating camera views along user-specified novel trajectories through Geometry-Appearance Decoupling and Diffusion-Based Generation. Given a set of images captured along a recorded trajectory and the corresponding scene geometry, GA-Drive synthesizes novel pseudo-views using geometry information. These pseudo-views are then transformed into photorealistic views using a trained video diffusion model. In this way, we decouple the geometry and appearance of scenes. An advantage of such decoupling is its support for appearance editing via state-of-the-art video-to-video editing techniques, while preserving the underlying geometry, enabling consistent edits across both original and novel trajectories. Extensive experiments demonstrate that GA-Drive substantially outperforms existing methods in terms of NTA-IoU, NTL-IoU, and FID scores.

69. 【2602.20672】BBQ-to-Image: Numeric Bounding Box and Qolor Control in Large-Scale Text-to-Image Models

Link: https://arxiv.org/abs/2602.20672

Authors: Eliran Kachlon, Alexander Visheratin, Nimrod Sarid, Tal Hacham, Eyal Gutflaish, Saar Huberman, Hezi Zisman, David Ruppin, Ron Mokady

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: approaches leveraging long, recent approaches leveraging, support fine-grained generation, realism and controllability, leveraging long

Abstract:Text-to-image models have rapidly advanced in realism and controllability, with recent approaches leveraging long, detailed captions to support fine-grained generation. However, a fundamental parametric gap remains: existing models rely on descriptive language, whereas professional workflows require precise numeric control over object location, size, and color. In this work, we introduce BBQ, a large-scale text-to-image model that directly conditions on numeric bounding boxes and RGB triplets within a unified structured-text framework. We obtain precise spatial and chromatic control by training on captions enriched with parametric annotations, without architectural modifications or inference-time optimization. This also enables intuitive user interfaces such as object dragging and color pickers, replacing ambiguous iterative prompting with precise, familiar controls. Across comprehensive evaluations, BBQ achieves strong box alignment and improves RGB color fidelity over state-of-the-art baselines. More broadly, our results support a new paradigm in which user intent is translated into an intermediate structured language, consumed by a flow-based transformer acting as a renderer and naturally accommodating numeric parameters.

70. 【2602.20666】BoxSplitGen: A Generative Model for 3D Part Bounding Boxes in Varying Granularity

Link: https://arxiv.org/abs/2602.20666

Authors: Juil Koo, Wei-Tung Lin, Chanho Park, Chanhyeok Park, Minhyuk Sung

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: model, bounding boxes, generative model, abstract ideas, Human creativity

Comments: Project page: [this https URL](https://boxsplitgen.github.io)

Abstract:Human creativity follows a perceptual process, moving from abstract ideas to finer details during creation. While 3D generative models have advanced dramatically, models specifically designed to assist human imagination in 3D creation -- particularly for detailing abstractions from coarse to fine -- have not been explored. We propose a framework that enables intuitive and interactive 3D shape generation by iteratively splitting bounding boxes to refine the set of bounding boxes. The main technical components of our framework are two generative models: the box-splitting generative model and the box-to-shape generative model. The first model, named BoxSplitGen, generates a collection of 3D part bounding boxes with varying granularity by iteratively splitting coarse bounding boxes. It utilizes part bounding boxes created through agglomerative merging and learns the reverse of the merging process -- the splitting sequences. The model consists of two main components: the first learns the categorical distribution of the box to be split, and the second learns the distribution of the two new boxes, given the set of boxes and the indication of which box to split. The second model, the box-to-shape generative model, is trained by leveraging the 3D shape priors learned by an existing 3D diffusion model while adapting the model to incorporate bounding box conditioning. In our experiments, we demonstrate that the box-splitting generative model outperforms token prediction models and the inpainting approach with an unconditional diffusion model. Also, we show that our box-to-shape model, based on a state-of-the-art 3D diffusion model, provides superior results compared to a previous model.
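
The training data described above, splitting sequences obtained by reversing an agglomerative merge, can be illustrated with a small sketch. Everything specific here is an assumption made for this note: boxes are axis-aligned `(min_xyz, max_xyz)` tuples, and the merge criterion (smallest union volume) is a stand-in, since the abstract does not state which criterion the authors use.

```python
from itertools import combinations


def union(a, b):
    """Smallest axis-aligned box containing boxes a and b."""
    lo = tuple(min(x, y) for x, y in zip(a[0], b[0]))
    hi = tuple(max(x, y) for x, y in zip(a[1], b[1]))
    return (lo, hi)


def volume(box):
    (x0, y0, z0), (x1, y1, z1) = box
    return (x1 - x0) * (y1 - y0) * (z1 - z0)


def split_sequence(boxes):
    """Merge boxes bottom-up, then return the merges reversed as splits.

    Each split is (parent_box, child_a, child_b), ordered coarse to fine.
    """
    active = list(boxes)
    merges = []
    while len(active) > 1:
        # Hypothetical criterion: merge the pair with the smallest union volume.
        a, b = min(combinations(active, 2), key=lambda p: volume(union(*p)))
        parent = union(a, b)
        active.remove(a)
        active.remove(b)
        active.append(parent)
        merges.append((parent, a, b))
    return merges[::-1]
```

Reading the returned list front to back starts from a single coarse box and refines it one split at a time, which is the direction a box-splitting generative model would be trained to follow.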

71. 【2602.20664】AnimeAgent: Is the Multi-Agent via Image-to-Video models a Good Disney Storytelling Artist?

Link: https://arxiv.org/abs/2602.20664

Authors: Hailong Yan, Shice Liu, Tao Wang, Xiangtao Zhang, Yijie Zhong, Jinwei Chen, Le Zhang, Bo Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Custom Storyboard Generation, Custom Storyboard, Storyboard Generation, multi-character consistent storytelling, aims to produce

Comments: Tech Report

Abstract:Custom Storyboard Generation (CSG) aims to produce high-quality, multi-character consistent storytelling. Current approaches based on static diffusion models, whether used in a one-shot manner or within multi-agent frameworks, face three key limitations: (1) Static models lack dynamic expressiveness and often resort to a "copy-paste" pattern. (2) One-shot inference cannot iteratively correct missing attributes or poor prompt adherence. (3) Multi-agent frameworks rely on non-robust evaluators that are ill-suited for assessing stylized, non-realistic animation. To address these, we propose AnimeAgent, the first Image-to-Video (I2V)-based multi-agent framework for CSG. Inspired by Disney's "Combination of Straight Ahead and Pose to Pose" workflow, AnimeAgent leverages I2V's implicit motion prior to enhance consistency and expressiveness, while a mixed subjective-objective reviewer enables reliable iterative refinement. We also collect a human-annotated CSG benchmark with ground truth. Experiments show AnimeAgent achieves SOTA performance in consistency, prompt fidelity, and stylization.

72. 【2602.20658】Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video

Link: https://arxiv.org/abs/2602.20658

Authors: Mohammad Sadra Rajabi, Aanuoluwapo Ojelade, Sunwook Kim, Maury A. Nussbaum

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: ergonomic risk assessment, work-related musculoskeletal disorders, quantifying physical exposure, informing ergonomic interventions, NIOSH Lifting Equation

Abstract:Manual lifting tasks are a major contributor to work-related musculoskeletal disorders, and effective ergonomic risk assessment is essential for quantifying physical exposure and informing ergonomic interventions. The Revised NIOSH Lifting Equation (RNLE) is a widely used ergonomic risk assessment tool for lifting tasks that relies on six task variables, including horizontal (H) and vertical (V) hand distances; such distances are typically obtained through manual measurement or specialized sensing systems and are difficult to use in real-world environments. We evaluated the feasibility of using innovative vision-language models (VLMs) to non-invasively estimate H and V from RGB video streams. Two multi-stage VLM-based pipelines were developed: a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. Both pipelines used text-guided localization of task-relevant regions of interest, visual feature extraction from those regions, and transformer-based temporal regression to estimate H and V at the start and end of a lift. For a range of lifting tasks, estimation performance was evaluated using leave-one-subject-out validation across the two pipelines and seven camera view conditions. Results varied significantly across pipelines and camera view conditions, with the segmentation-based, multi-view pipeline consistently yielding the smallest errors, achieving mean absolute errors of approximately 6-8 cm when estimating H and 5-8 cm when estimating V. Across pipelines and camera view configurations, pixel-level segmentation reduced estimation error by approximately 20-30% for H and 35-40% for V relative to the detection-only pipeline. These findings support the feasibility of VLM-based pipelines for video-based estimation of RNLE distance parameters.
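
The reported errors in H and V matter because both distances enter the RNLE through multipliers. The sketch below implements the metric-unit horizontal and vertical multipliers as published for the RNLE (HM = 25/H and VM = 1 - 0.003|V - 75|, with H and V in centimeters); the function names are this note's own, and the other four task variables are left out.

```python
def horizontal_multiplier(h_cm):
    """RNLE horizontal multiplier, metric form: HM = 25 / H."""
    if h_cm > 63:
        return 0.0  # beyond the maximum horizontal reach considered by the RNLE
    # H is floored at 25 cm, so HM never exceeds 1.
    return 25.0 / max(h_cm, 25.0)


def vertical_multiplier(v_cm):
    """RNLE vertical multiplier, metric form: VM = 1 - 0.003 * |V - 75|."""
    if v_cm < 0 or v_cm > 175:
        return 0.0  # outside the vertical range considered by the RNLE
    return 1.0 - 0.003 * abs(v_cm - 75.0)


def sensitivity(h_true, h_est, v_true, v_est):
    """Relative change in HM * VM caused by estimation error in H and V."""
    true_val = horizontal_multiplier(h_true) * vertical_multiplier(v_true)
    est_val = horizontal_multiplier(h_est) * vertical_multiplier(v_est)
    return (est_val - true_val) / true_val
```

For example, overestimating H by 7 cm at a true H of 40 cm lowers HM from 0.625 to about 0.53, so errors of the size reported in the abstract shift the recommended weight limit by a noticeable margin, and the horizontal distance is the more sensitive of the two.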

73. 【2602.20653】SD4R: Sparse-to-Dense Learning for 3D Object Detection with 4D Radar

Link: https://arxiv.org/abs/2602.20653

Authors: Xiaokai Bai, Jiahao Cheng, Songkai Wang, Yixuan Luo, Lianqing Zheng, Xiaohan Zhang, Si-Yuan Cao, Hui-Liang Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: radar measurements offer, radar point clouds, point clouds, measurements offer, offer an affordable

Comments: 7 pages, 5 figures, 4 tables

Abstract:4D radar measurements offer an affordable and weather-robust solution for 3D perception. However, the inherent sparsity and noise of radar point clouds present significant challenges for accurate 3D object detection, underscoring the need for effective and robust point cloud densification. Despite recent progress, existing densification methods often fail to address the extreme sparsity of 4D radar point clouds and exhibit limited robustness when processing scenes with a small number of points. In this paper, we propose SD4R, a novel framework that transforms sparse radar point clouds into dense representations. SD4R begins by utilizing a foreground point generator (FPG) to mitigate noise propagation and produce densified point clouds. Subsequently, a logit-query encoder (LQE) enhances conventional pillarization, resulting in robust feature representations. Through these innovations, SD4R demonstrates strong capability in both noise reduction and foreground point densification. Extensive experiments conducted on the publicly available View-of-Delft dataset demonstrate that SD4R achieves state-of-the-art performance. Source code is available at this https URL.
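
"Conventional pillarization", which the logit-query encoder is said to enhance, groups points into vertical columns on an x-y grid before feature extraction. The sketch below shows only that grouping step; the function name `pillarize` and the 0.5 m default cell size are assumptions for illustration, and the paper's encoder is not reproduced.

```python
from collections import defaultdict


def pillarize(points, pillar_size=0.5):
    """Group (x, y, z, ...) points into vertical pillars on an x-y grid.

    Returns a dict mapping (ix, iy) grid indices to the points in that pillar.
    """
    pillars = defaultdict(list)
    for p in points:
        # Floor division gives the grid cell; height (z) is ignored for grouping.
        key = (int(p[0] // pillar_size), int(p[1] // pillar_size))
        pillars[key].append(p)
    return dict(pillars)
```

With sparse radar returns, most occupied pillars end up holding one or two points, which is one way to see why densifying foreground points before this step can help.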

74. 【2602.20650】Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression

Link: https://arxiv.org/abs/2602.20650

Authors: Chenyue Yu, Lingao Xiao, Jinhong Deng, Ivor W. Tsang, Yang He

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: demands pose challenges, Large-scale image datasets, Large-scale image, high storage demands, storage demands pose

Comments: Accepted by ICLR 2026

Abstract:Large-scale image datasets are fundamental to deep learning, but their high storage demands pose challenges for deployment in resource-constrained environments. While existing approaches reduce dataset size by discarding samples, they often ignore the significant redundancy within each image -- particularly in the color space. To address this, we propose Dataset Color Quantization (DCQ), a unified framework that compresses visual datasets by reducing color-space redundancy while preserving information crucial for model training. DCQ achieves this by enforcing consistent palette representations across similar images, selectively retaining semantically important colors guided by model perception, and maintaining structural details necessary for effective feature learning. Extensive experiments across CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that DCQ significantly improves training performance under aggressive compression, offering a scalable and robust solution for dataset-level storage reduction. Code is available at this https URL.
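
As background for "consistent palette representations across similar images", the sketch below shows plain palette quantization with one palette shared by several images. It is a generic illustration, not DCQ itself: the function names are invented for this note, the palette is supplied by hand, and the model-guided color selection described in the abstract is not modeled.

```python
def nearest_color(pixel, palette):
    """Return the palette color closest to an RGB pixel (squared distance)."""
    return min(palette, key=lambda c: sum((a - b) ** 2 for a, b in zip(pixel, c)))


def quantize(image, palette):
    """Replace every pixel of an image (rows of RGB tuples) with a palette color."""
    return [[nearest_color(px, palette) for px in row] for row in image]


def quantize_group(images, palette):
    """Quantize a group of similar images with one shared palette."""
    return [quantize(img, palette) for img in images]
```

Because every image in the group draws from the same small palette, each pixel can be stored as a short palette index and the palette itself is stored once per group, which is where the storage saving comes from.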

75. 【2602.20636】SurgAtt-Tracker: Online Surgical Attention Tracking via Temporal Proposal Reranking and Motion-Aware Refinement

Link: https://arxiv.org/abs/2602.20636

Authors: Rulin Zhou, Guankun Wang, An Wang, Yujie Ma, Lixin Ouyang, Bolin Cui, Junyan Li, Chaowei Zhu, Mingyang Li, Ming Chen, Xiaopin Zhong, Peng Lu, Jiankun Wang, Xianming Liu, Hongliang Ren

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: minimally invasive surgery, efficient minimally invasive, direct object-centric assumptions, Accurate and stable, conflate visual attention

Abstract:Accurate and stable field-of-view (FoV) guidance is critical for safe and efficient minimally invasive surgery, yet existing approaches often conflate visual attention estimation with downstream camera control or rely on direct object-centric assumptions. In this work, we formulate surgical attention tracking as a spatio-temporal learning problem and model surgeon focus as a dense attention heatmap, enabling continuous and interpretable frame-wise FoV guidance. We propose SurgAtt-Tracker, a holistic framework that robustly tracks surgical attention by exploiting temporal coherence through proposal-level reranking and motion-aware refinement, rather than direct regression. To support systematic training and evaluation, we introduce SurgAtt-1.16M, a large-scale benchmark with a clinically grounded annotation protocol that enables comprehensive heatmap-based attention analysis across procedures and institutions. Extensive experiments on multiple surgical datasets demonstrate that SurgAtt-Tracker consistently achieves state-of-the-art performance and strong robustness under occlusion, multi-instrument interference, and cross-domain settings. Beyond attention tracking, our approach provides a frame-wise FoV guidance signal that can directly support downstream robotic FoV planning and automatic camera control.

76. 【2602.20632】Boosting Instance Awareness via Cross-View Correlation with 4D Radar and Camera for 3D Object Detection

Link: https://arxiv.org/abs/2602.20632

Authors: Xiaokai Bai, Lianqing Zheng, Si-Yuan Cao, Xiaohan Zhang, Zhe Wu, Beinan Yu, Fang Wang, Jie Bai, Hui-Liang Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: promising sensing modality, autonomous driving due, robustness and affordability, promising sensing, sensing modality

Comments: 14 pages, 10 figures, 13 tables

Abstract:4D millimeter-wave radar has emerged as a promising sensing modality for autonomous driving due to its robustness and affordability. However, its sparse and weak geometric cues make reliable instance activation difficult, limiting the effectiveness of existing radar-camera fusion paradigms. BEV-level fusion offers global scene understanding but suffers from weak instance focus, while perspective-level fusion captures instance details but lacks holistic context. To address these limitations, we propose SIFormer, a scene-instance aware transformer for 3D object detection using 4D radar and camera. SIFormer first suppresses background noise during view transformation through segmentation- and depth-guided localization. It then introduces a cross-view activation mechanism that injects 2D instance cues into BEV space, enabling reliable instance awareness under weak radar geometry. Finally, a transformer-based fusion module aggregates complementary image semantics and radar geometry for robust perception. As a result, with the aim of enhancing instance awareness, SIFormer bridges the gap between the two paradigms, combining their complementary strengths to address the inherently sparse nature of radar and improve detection accuracy. Experiments demonstrate that SIFormer achieves state-of-the-art performance on the View-of-Delft, TJ4DRadSet, and nuScenes datasets. Source code is available at this http URL.

77. 【2602.20630】From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

Link: https://arxiv.org/abs/2602.20630

Authors: Yepeng Liu,Hao Li,Liwen Yang,Fangzhen Li,Xudi Ge,Yuliang Gu,kuang Gao,Bing Wang,Guang Chen,Hangjun Ye,Yongchao Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision systems, component of modern, Keypoint-based matching, fundamental component, Keypoint-based

Comments: Accepted by CVPR 2026

Abstract:Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art (SOTA) keypoint detection and description methods.

78. 【2602.20627】Object-Scene-Camera Decomposition and Recomposition for Data-Efficient Monocular 3D Object Detection

Link: https://arxiv.org/abs/2602.20627

Authors: Zhaonian Kuang,Rui Ding,Meng Yang,Xinhu Zheng,Gang Hua

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: deep learning based, strong human bias, high-performance deep learning, complicated visual variation

Comments: IJCV

Abstract:Monocular 3D object detection (M3OD) is intrinsically ill-posed, hence training a high-performance deep learning based M3OD model requires a humongous amount of labeled data with complicated visual variation from diverse scenes, a variety of objects, and camera poses. However, we observe that, due to strong human bias, the three independent entities, i.e., object, scene, and camera pose, are always tightly entangled when an image is captured to construct training data. More specifically, specific 3D objects are always captured in particular scenes with fixed camera poses, and hence the training data lacks the necessary diversity. Such tight entanglement induces the challenging issues of insufficient utilization and overfitting to uniform training data. To mitigate this, we propose an online object-scene-camera decomposition and recomposition data manipulation scheme to more efficiently exploit the training data. We first fully decompose training images into textured 3D object point models and background scenes in an efficient computation and storage manner. We then continuously recompose new training images in each epoch by inserting the 3D objects into the freespace of the background scenes, and rendering them with perturbed camera poses from textured 3D point representation. In this way, the refreshed training data in all epochs can cover the full spectrum of independent object, scene, and camera pose combinations. This scheme can serve as a plug-and-play component to boost M3OD models, working flexibly with both fully and sparsely supervised settings. In the sparsely-supervised setting, objects closest to the ego-camera for all instances are sparsely annotated. We can then flexibly increase the annotated objects to control annotation cost. For validation, our method is applied to five representative M3OD models and evaluated on both the KITTI and the more complicated Waymo datasets.

79. 【2602.20618】RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces

Link: https://arxiv.org/abs/2602.20618

Authors: Haonan An,Xiaohui Ye,Guang Hua,Yihang Tao,Hangcheng Cao,Xiangyu Yu,Yuguang Fang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: posing unprecedented challenges, severely undermining visual, undermining visual integrity, facilitated sophisticated face, severely undermining

Comments: Accepted by CVPR 2026

Abstract:The proliferation of AI-generated content has facilitated sophisticated face manipulation, severely undermining visual integrity and posing unprecedented challenges to intellectual property. In response, a common proactive defense leverages fragile watermarks to detect, localize, or even recover manipulated regions. However, these methods always assume an adversary unaware of the embedded watermark, overlooking their inherent vulnerability to watermark removal attacks. Furthermore, this fragility is exacerbated in the commonly used dual-watermark strategy that adds a robust watermark for image ownership verification, where mutual interference and limited embedding capacity reduce the fragile watermark's effectiveness. To address this gap, we propose RecoverMark, a watermarking framework that achieves robust manipulation localization, content recovery, and ownership verification simultaneously. Our key insight is twofold. First, we exploit a critical real-world constraint: an adversary must preserve the background's semantic consistency to avoid visual detection, even if they apply global, imperceptible watermark removal attacks. Second, using the image's own content (face, in this paper) as the watermark enhances extraction robustness. Based on these insights, RecoverMark treats the protected face content itself as the watermark and embeds it into the surrounding background. By designing a robust two-stage training paradigm with carefully crafted distortion layers that simulate comprehensive potential attacks and a progressive training strategy, RecoverMark achieves robust watermark embedding in a non-fragile manner for image manipulation localization, recovery, and image IP protection simultaneously. Extensive experiments demonstrate the proposed RecoverMark's robustness against both seen and unseen attacks and its generalizability to in-distribution and out-of-distribution data.

80. 【2602.20616】Knowing the Unknown: Interpretable Open-World Object Detection via Concept Decomposition Model

Link: https://arxiv.org/abs/2602.20616

Authors: Xueqiang Lv,Shizhou Zhang,Yinghui Xing,Di Xu,Peng Wang,Yanning Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Open-world object detection, requires incrementally detecting, reliably identifying unknown, Open-world object, requires incrementally

Abstract:Open-world object detection (OWOD) requires incrementally detecting known categories while reliably identifying unknown objects. Existing methods primarily focus on improving unknown recall, yet overlook interpretability, often leading to known-unknown confusion and reduced prediction reliability. This paper aims to make the entire OWOD framework interpretable, enabling the detector to truly "know the unknown". To this end, we propose a concept-driven InterPretable OWOD framework (IPOW) by introducing a Concept Decomposition Model (CDM) for OWOD, which explicitly decomposes the coupled RoI features in Faster R-CNN into discriminative, shared, and background concepts. Discriminative concepts identify the most discriminative features to enlarge the distances between known categories, while shared and background concepts, due to their strong generalization ability, can be readily transferred to detect unknown categories. Leveraging the interpretable framework, we identify that known-unknown confusion arises when unknown objects fall into the discriminative space of known classes. To address this, we propose Concept-Guided Rectification (CGR) to further resolve such confusion. Extensive experiments show that IPOW significantly improves unknown recall while mitigating confusion, and provides concept-level interpretability for both known and unknown predictions.

81. 【2602.20608】VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos

Link: https://arxiv.org/abs/2602.20608

Authors: Aihua Mao,Kaihang Huang,Yong-Jin Liu,Chee Seng Chan,Ying He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: embodied visual reasoning, aims to identify, capability essential, essential to embodied, visual reasoning

Abstract:3D object affordance grounding aims to identify regions on 3D objects that support human-object interaction (HOI), a capability essential to embodied visual reasoning. However, most existing approaches rely on static visual or textual cues, neglecting that affordances are inherently defined by dynamic actions. As a result, they often struggle to localize the true contact regions involved in real interactions. We take a different perspective. Humans learn how to use objects by observing and imitating actions, not just by examining shapes. Motivated by this intuition, we introduce video-guided 3D affordance grounding, which leverages dynamic interaction sequences to provide functional supervision. To achieve this, we propose VAGNet, a framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities that static cues cannot address. To support this new setting, we introduce PVAD, the first HOI video-3D pairing affordance dataset, providing functional supervision unavailable in prior works. Extensive experiments on PVAD show that VAGNet achieves state-of-the-art performance, significantly outperforming static-based baselines. The code and dataset will be made publicly available.

82. 【2602.20597】Interaction-aware Representation Modeling with Co-occurrence Consistency for Egocentric Hand-Object Parsing

Link: https://arxiv.org/abs/2602.20597

Authors: Yuejiao Su,Yi Wang,Lei Yao,Yawen Cui,Lap-Pui Chau

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: next-generation embodied agents, developing next-generation embodied, egocentric human-environment interactions, embodied agents, fine-grained understanding

Abstract:A fine-grained understanding of egocentric human-environment interactions is crucial for developing next-generation embodied agents. One fundamental challenge in this area involves accurately parsing hands and active objects. While transformer-based architectures have demonstrated considerable potential for such tasks, several key limitations remain unaddressed: 1) existing query initialization mechanisms rely primarily on semantic cues or learnable parameters, demonstrating limited adaptability to changing active objects across varying input scenes; 2) previous transformer-based methods utilize pixel-level semantic features to iteratively refine queries during mask generation, which may introduce interaction-irrelevant content into the final embeddings; and 3) prevailing models are susceptible to "interaction illusion", producing physically inconsistent predictions. To address these issues, we propose an end-to-end Interaction-aware Transformer (InterFormer), which integrates three key components, i.e., a Dynamic Query Generator (DQG), a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss. The DQG explicitly grounds query initialization in the spatial dynamics of hand-object contact, enabling targeted generation of interaction-aware queries for hands and various active objects. The DFS fuses coarse interactive cues with semantic features, thereby suppressing interaction-irrelevant noise and emphasizing the learning of interactive relationships. The CoCo loss incorporates hand-object relationship constraints to enhance physical consistency in prediction. Our model achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets, demonstrating its effectiveness and strong generalization ability. Code and models are publicly available at this https URL.

83. 【2602.20584】Long-Term Multi-Session 3D Reconstruction Under Substantial Appearance Change

Link: https://arxiv.org/abs/2602.20584

Authors: Beverley Gorry,Tobias Fischer,Michael Milford,Alejandro Fontan

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: repeated site visits, site visits separated, environmental monitoring requires, Long-term environmental monitoring, reconstruct and align

Abstract:Long-term environmental monitoring requires the ability to reconstruct and align 3D models across repeated site visits separated by months or years. However, existing Structure-from-Motion (SfM) pipelines implicitly assume near-simultaneous image capture and limited appearance change, and therefore fail when applied to long-term monitoring scenarios such as coral reef surveys, where substantial visual and structural change is common. In this paper, we show that the primary limitation of current approaches lies in their reliance on post-hoc alignment of independently reconstructed sessions, which is insufficient under large temporal appearance change. We address this limitation by enforcing cross-session correspondences directly within a joint SfM reconstruction. Our approach combines complementary handcrafted and learned visual features to robustly establish correspondences across large temporal gaps, enabling the reconstruction of a single coherent 3D model from imagery captured years apart, where standard independent and joint SfM pipelines break down. We evaluate our method on long-term coral reef datasets exhibiting significant real-world change, and demonstrate consistent joint reconstruction across sessions in cases where existing methods fail to produce coherent reconstructions. To ensure scalability to large datasets, we further restrict expensive learned feature matching to a small set of likely cross-session image pairs identified via visual place recognition, which reduces computational cost and improves alignment robustness.
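The pair-selection step mentioned at the end of this abstract can be illustrated with a minimal sketch. It assumes each image already has a global place-recognition descriptor and that cosine similarity is used for retrieval; the function names, the `top_k` parameter, and the toy descriptors are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def shortlist_cross_session_pairs(
    session_a: Dict[str, List[float]],
    session_b: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, str, float]]:
    """For each image in session A, keep its top_k most similar images in session B.

    Only the returned pairs would be handed to the expensive learned matcher.
    """
    pairs = []
    for name_a, desc_a in session_a.items():
        scored = [(cosine(desc_a, desc_b), name_b) for name_b, desc_b in session_b.items()]
        scored.sort(reverse=True)  # highest similarity first
        for score, name_b in scored[:top_k]:
            pairs.append((name_a, name_b, score))
    return pairs


# Toy global descriptors for two site visits (hypothetical values).
earlier_visit = {"a0": [1.0, 0.0, 0.0], "a1": [0.0, 1.0, 0.0]}
later_visit = {"b0": [0.9, 0.1, 0.0], "b1": [0.1, 0.9, 0.0], "b2": [0.0, 0.0, 1.0]}
candidate_pairs = shortlist_cross_session_pairs(earlier_visit, later_visit, top_k=1)
```

With `top_k=1`, each earlier image is paired only with its single most similar later image, so the number of expensive matches grows linearly with the session size instead of quadratically.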

84. 【2602.20583】PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models

Link: https://arxiv.org/abs/2602.20583

Authors: Wonyong Seo,Jaeho Moon,Jaehyup Lee,Soo Ye Kim,Munchurl Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: precise user control, Propagation-based video editing, single edited frame, enables precise user, Propagation-based video

Comments: The first two authors contributed equally to this work (equal contribution)

Abstract:Propagation-based video editing enables precise user control by propagating a single edited frame into following frames while maintaining the original context such as motion and structures. However, training such models requires large-scale, paired (source and edited) video datasets, which are costly and complex to acquire. Hence, we propose PropFly, a training pipeline for Propagation-based video editing, relying on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets. Specifically, PropFly leverages one-step clean latent estimations from intermediate noised latents with varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of 'source' (low-CFG) and 'edited' (high-CFG) latents on-the-fly. The source latent serves as structural information of the video, while the edited latent provides the target transformation for learning propagation. Our pipeline enables an additional adapter attached to the pre-trained VDM to learn to propagate edits via Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation. Our on-the-fly supervision ensures that the model learns temporally consistent and dynamic transformations. Extensive experiments demonstrate that PropFly significantly outperforms the state-of-the-art methods on various video editing tasks, producing high-quality editing results.
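For readers unfamiliar with the two ingredients named in this abstract, the standard textbook forms are sketched below. The notation is an assumption: a noise-prediction network $\epsilon_\theta$ with a DDPM-style schedule $\bar{\alpha}_t$ is used for illustration, whereas the paper's model may use a different (for example, flow-matching) parameterization.

$$
\hat{\epsilon}_w(x_t) = \epsilon_\theta(x_t, \varnothing) + w\,\bigl[\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr],
\qquad
\hat{x}_0(w) = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\;\hat{\epsilon}_w(x_t)}{\sqrt{\bar{\alpha}_t}}
$$

Under this reading, a 'source' and 'edited' latent pair from the same noised latent $x_t$ would correspond to $\hat{x}_0(w_{\text{low}})$ and $\hat{x}_0(w_{\text{high}})$ with $w_{\text{low}} < w_{\text{high}}$.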

85. 【2602.20577】Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion

Link: https://arxiv.org/abs/2602.20577

Authors: Jiaru Zhang,Manav Gagvani,Can Cui,Juntong Peng,Ruqi Zhang,Ziran Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Large Language, emerged as promising, promising candidates, Vision-Language Models

Abstract:Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged as promising candidates for end-to-end autonomous driving. However, these models typically face challenges in inference latency, action precision, and explainability. Existing autoregressive approaches struggle with slow token-by-token generation, while prior diffusion-based planners often rely on verbose, general-purpose language tokens that lack explicit geometric structure. In this work, we propose Masked Vision-Language-Action Diffusion for Autonomous Driving (MVLAD-AD), a novel framework designed to bridge the gap between efficient planning and semantic explainability via a masked vision-language-action diffusion model. Unlike methods that force actions into the language space, we introduce a discrete action tokenization strategy that constructs a compact codebook of kinematically feasible waypoints from real-world driving distributions. Moreover, we propose geometry-aware embedding learning to ensure that embeddings in the latent space approximate physical geometric metrics. Finally, an action-priority decoding strategy is introduced to prioritize trajectory generation. Extensive experiments on nuScenes and derived benchmarks demonstrate that MVLAD-AD achieves superior efficiency and outperforms state-of-the-art autoregressive and diffusion baselines in planning precision, while providing high-fidelity and explainable reasoning.

86. 【2602.20575】An interactive enhanced driving dataset for autonomous driving

Link: https://arxiv.org/abs/2602.20575

Authors: Haojie Feng,Peizhi Zhang,Mengjie Tian,Xinrui Zhang,Zhuoren Li,Junpeng Huang,Xiurong Wang,Junfan Zhu,Jianzhou Wang,Dongxiao Yin,Lu Xiong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: full automation demands, automation demands robust, inadequate multimodal alignment, demands robust interactive, Interactive Enhanced Driving

Abstract:The evolution of autonomous driving towards full automation demands robust interactive capabilities; however, the development of Vision-Language-Action (VLA) models is constrained by the sparsity of interactive scenarios and inadequate multimodal alignment in existing data. To this end, this paper proposes the Interactive Enhanced Driving Dataset (IEDD). We develop a scalable pipeline to mine million-level interactive segments from naturalistic driving data based on interactive trajectories, and design metrics to quantify the interaction processes. Furthermore, the IEDD-VQA dataset is constructed by generating synthetic Bird's Eye View (BEV) videos where semantic actions are strictly aligned with structured language. Benchmark results evaluating ten mainstream Vision Language Models (VLMs) are provided to demonstrate the dataset's reuse value in assessing and fine-tuning the reasoning capabilities of autonomous driving models.

87. 【2602.20569】AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents

Link: https://arxiv.org/abs/2602.20569

Authors: Jiaqi Wu,Yuchen Zhou,Muduo Xu,Zisheng Liang,Simiao Ren,Jiayu Xue,Meige Yang,Siying Chen,Jingheng Huan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: benchmark targeting exclusively, dedicated benchmark targeting, targeting exclusively, Adobe Photoshop, AUC

Comments: 17 pages, 10 figures

Abstract:We present AIForge-Doc, the first dedicated benchmark targeting exclusively diffusion-model-based inpainting in financial and form documents with pixel-level annotation. Existing document forgery datasets rely on traditional digital editing tools (e.g., Adobe Photoshop, GIMP), creating a critical gap: state-of-the-art detectors are blind to the rapidly growing threat of AI-forged document fraud. AIForge-Doc addresses this gap by systematically forging numeric fields in real-world receipt and form images using two AI inpainting APIs -- Gemini 2.5 Flash Image and Ideogram v2 Edit -- yielding 4,061 forged images from four public document datasets (CORD, WildReceipt, SROIE, XFUND) across nine languages, annotated with pixel-precise tampered-region masks in DocTamper-compatible format. We benchmark three representative detectors -- TruFor, DocTamper, and a zero-shot GPT-4o judge -- and find that all existing methods degrade substantially: TruFor achieves AUC=0.751 (zero-shot, out-of-distribution) vs. AUC=0.96 on NIST16; DocTamper achieves AUC=0.563 vs. AUC=0.98 in-distribution, with pixel-level IoU=0.020; GPT-4o achieves only 0.509 -- essentially at chance -- confirming that AI-forged values are indistinguishable to automated detectors and VLMs. These results demonstrate that AIForge-Doc represents a qualitatively new and unsolved challenge for document forensics.

88. 【2602.20566】BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model

Link: https://arxiv.org/abs/2602.20566

Authors: Haosheng Li,Weixin Mao,Zihan Lan,Hongwei Xiong,Hongan Wang,Chenyang Si,Ziwei Liu,Xiaoming Deng,Hua Chen

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision Language, leveraging Large Vision, Vision Language Models, Large Vision, Vision Language

Comments: 9 pages, 10 figures

Abstract:Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision Language Models (VLMs) to jointly interpret instructions and visual inputs. However, the substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. Existing acceleration techniques for VLMs, such as token pruning, often result in degraded performance when directly applied to VLA models, as they overlook the relationships between different views and fail to account for the dynamic and task-specific characteristics of robotic operation. To address this, we propose BFA++, a dynamic token pruning framework designed specifically for VLA models. BFA++ introduces a hierarchical pruning strategy guided by two-level importance predictors: an intra-view predictor highlights task-relevant regions within each image to suppress spatial noise, while an inter-view predictor identifies critical camera views throughout different manipulation phases to reduce cross-view redundancy. This design enables efficient token selection while preserving essential visual cues, resulting in improved computational efficiency and higher manipulation success rates. Evaluations on the RoboTwin benchmark and real-world robotic tasks demonstrate that BFA++ consistently outperforms existing methods. BFA++ improves the success rate by about 10% on both the $\pi_0$ and RDT models, achieving speedups of 1.8x and 1.5x, respectively. Our results highlight that context-sensitive and task-aware token pruning serves as a more effective strategy than full visual processing, enabling faster inference and improved manipulation accuracy in real-world robotic systems.
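The two-level idea described in this abstract can be sketched roughly as follows. This is an illustration only, not the BFA++ implementation: it assumes the importance scores are already given as plain numbers, that low-scoring views are dropped entirely, and that a fixed number of tokens is kept per view. All names and values are hypothetical.

```python
from typing import Dict, List


def prune_tokens(
    view_scores: Dict[str, float],
    token_scores: Dict[str, List[float]],
    keep_views: int,
    keep_tokens_per_view: int,
) -> Dict[str, List[int]]:
    """Two-level pruning: select views first, then top-scoring tokens inside each kept view.

    Returns the indices of retained tokens per retained view, in original token order.
    """
    # Inter-view level: keep the camera views with the highest importance.
    kept_views = sorted(view_scores, key=view_scores.get, reverse=True)[:keep_views]
    kept: Dict[str, List[int]] = {}
    for view in kept_views:
        scores = token_scores[view]
        # Intra-view level: rank token indices of this view by importance.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        kept[view] = sorted(ranked[:keep_tokens_per_view])
    return kept


# Hypothetical scores for three cameras with four visual tokens each.
view_scores = {"head": 0.2, "left_wrist": 0.7, "right_wrist": 0.1}
token_scores = {
    "head": [0.1, 0.8, 0.3, 0.5],
    "left_wrist": [0.9, 0.2, 0.6, 0.4],
    "right_wrist": [0.3, 0.3, 0.2, 0.1],
}
kept = prune_tokens(view_scores, token_scores, keep_views=2, keep_tokens_per_view=2)
```

In this toy call, 12 visual tokens are reduced to 4: the lowest-scoring view is removed and each remaining view keeps its two most important tokens.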

89. 【2602.20556】WildGHand: Learning Anti-Perturbation Gaussian Hand Avatars from Monocular In-the-Wild Videos

Link: https://arxiv.org/abs/2602.20556

Authors: Hanhui Li,Xuan Huang,Wanquan Liu,Yuhao Cheng,Long Chen,Yiqiang Yan,Xiaodan Liang,Chenqiang Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing methods rely, extreme poses, hand-object interactions, motion blur, recent progress

Abstract:Despite recent progress in 3D hand reconstruction from monocular videos, most existing methods rely on data captured in well-controlled environments and therefore degrade in real-world settings with severe perturbations, such as hand-object interactions, extreme poses, illumination changes, and motion blur. To tackle these issues, we introduce WildGHand, an optimization-based framework that enables self-adaptive 3D Gaussian splatting on in-the-wild videos and produces high-fidelity hand avatars. WildGHand incorporates two key components: (i) a dynamic perturbation disentanglement module that explicitly represents perturbations as time-varying biases on 3D Gaussian attributes during optimization, and (ii) a perturbation-aware optimization strategy that generates per-frame anisotropic weighted masks to guide optimization. Together, these components allow the framework to identify and suppress perturbations across both spatial and temporal dimensions. We further curate a dataset of monocular hand videos captured under diverse perturbations to benchmark in-the-wild hand avatar reconstruction. Extensive experiments on this dataset and two public datasets demonstrate that WildGHand achieves state-of-the-art performance and substantially improves over its base model across multiple metrics (e.g., up to a $15.8\%$ relative gain in PSNR and a $23.1\%$ relative reduction in LPIPS). Our implementation and dataset are available at this https URL.

90. 【2602.20551】CAD-Prompted SAM3: Geometry-Conditioned Instance Segmentation for Industrial Objects

Link: https://arxiv.org/abs/2602.20551

Authors: Zhenran Tang,Rohan Nagabhirava,Changliu Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: scenarios frequently encountered, printing environments, struggles with uncommon, scenarios frequently, Verbal-prompted segmentation

Abstract:Verbal-prompted segmentation is inherently limited by the expressiveness of natural language and struggles with uncommon, instance-specific, or difficult-to-describe objects: scenarios frequently encountered in manufacturing and 3D printing environments. While image exemplars provide an alternative, they primarily encode appearance cues such as color and texture, which are often unrelated to a part's geometric identity. In industrial settings, a single component may be produced in different materials, finishes, or colors, making appearance-based prompting unreliable. In contrast, such objects are typically defined by precise CAD models that capture their canonical geometry. We propose a CAD-prompted segmentation framework built on SAM3 that uses canonical multi-view renderings of a CAD model as prompt input. The rendered views provide geometry-based conditioning independent of surface appearance. The model is trained using synthetic data generated from mesh renderings in simulation under diverse viewpoints and scene contexts. Our approach enables single-stage, CAD-prompted mask prediction, extending promptable segmentation to objects that cannot be robustly described by language or appearance alone.

91. 【2602.20550】The Finite Primitive Basis Theorem for Computational Imaging: Formal Foundations of the OperatorGraph Representation

Link: https://arxiv.org/abs/2602.20550

Authors: Chengshuai Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Computational imaging forward, coded aperture spectral, aperture spectral cameras, MRI scanners, cameras to MRI

Abstract:Computational imaging forward models, from coded aperture spectral cameras to MRI scanners, are traditionally implemented as monolithic, modality-specific codes. We prove that every forward model in a broad, precisely defined operator class Cimg (encompassing clinical, scientific, and industrial imaging modalities, both linear and nonlinear) admits an epsilon-approximate representation as a typed directed acyclic graph (DAG) whose nodes are drawn from a library of exactly 11 canonical primitives: Propagate, Modulate, Project, Encode, Convolve, Accumulate, Detect, Sample, Disperse, Scatter, and Transform. We call this the Finite Primitive Basis Theorem. The proof is constructive: we provide an algorithm that, given any H in Cimg, produces a DAG G with relative operator error at most epsilon and graph complexity within prescribed bounds. We further prove that the library is minimal: removing any single primitive causes at least one modality to lose its epsilon-approximate representation. A systematic analysis of nonlinearities in imaging physics shows they fall into two structural categories: pointwise scalar functions (handled by Transform) and self-consistent iterations (unrolled into existing linear primitives). Empirical validation on 31 linear modalities confirms eimg below 0.01 with at most 5 nodes and depth 5, and we provide constructive DAG decompositions for 9 additional nonlinear modalities. These results establish mathematical foundations for the Physics World Model (PWM) framework.
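As a rough illustration of the representation described in this abstract, a forward model can be held as a small graph whose nodes are tagged with one of the 11 primitive names listed there. The class design below is hypothetical (it does not model edge types or the approximation error), the example chain is made up, and depth is assumed to count the nodes on the longest path.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set


class Primitive(Enum):
    """The 11 primitive names listed in the abstract."""
    PROPAGATE = "Propagate"
    MODULATE = "Modulate"
    PROJECT = "Project"
    ENCODE = "Encode"
    CONVOLVE = "Convolve"
    ACCUMULATE = "Accumulate"
    DETECT = "Detect"
    SAMPLE = "Sample"
    DISPERSE = "Disperse"
    SCATTER = "Scatter"
    TRANSFORM = "Transform"


@dataclass
class OperatorGraph:
    nodes: Dict[str, Primitive] = field(default_factory=dict)
    edges: Dict[str, List[str]] = field(default_factory=dict)  # node -> successors

    def add_node(self, name: str, primitive: Primitive) -> None:
        self.nodes[name] = primitive
        self.edges.setdefault(name, [])

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[src].append(dst)

    def depth(self) -> int:
        """Number of nodes on the longest path; raises ValueError if a cycle exists."""
        memo: Dict[str, int] = {}
        visiting: Set[str] = set()

        def longest(node: str) -> int:
            if node in memo:
                return memo[node]
            if node in visiting:
                raise ValueError("cycle detected: graph is not a DAG")
            visiting.add(node)
            best = 1 + max((longest(nxt) for nxt in self.edges[node]), default=0)
            visiting.discard(node)
            memo[node] = best
            return best

        return max((longest(n) for n in self.nodes), default=0)


# Made-up three-node chain: mask modulation, blur, then detection.
graph = OperatorGraph()
graph.add_node("mask", Primitive.MODULATE)
graph.add_node("blur", Primitive.CONVOLVE)
graph.add_node("sensor", Primitive.DETECT)
graph.add_edge("mask", "blur")
graph.add_edge("blur", "sensor")
```

Here `graph.depth()` gives the node count of the longest chain, which is the kind of quantity the abstract bounds when it reports "at most 5 nodes and depth 5".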

92. 【2602.20549】Sample-efficient evidence estimation of score based priors for model selection

Link: https://arxiv.org/abs/2602.20549

Authors: Frederic Wang,Katherine L. Bouman

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)

Keywords: avoid severe bias, model evidence, Bayesian inverse problems, inverse problems, solving inverse problems

Comments: ICLR 2026

Abstract:The choice of prior is central to solving ill-posed imaging inverse problems, making it essential to select one consistent with the measurements $y$ to avoid severe bias. In Bayesian inverse problems, this could be achieved by evaluating the model evidence $p(y \mid M)$ under different models $M$ that specify the prior and then selecting the one with the highest value. Diffusion models are the state-of-the-art approach to solving inverse problems with a data-driven prior; however, directly computing the model evidence with respect to a diffusion prior is intractable. Furthermore, most existing model evidence estimators require either many pointwise evaluations of the unnormalized prior density or an accurate clean prior score. We propose \method, an estimator of the model evidence of a diffusion prior by integrating over the time-marginals of posterior sampling methods. Our method leverages the large amount of intermediate samples naturally obtained during the reverse diffusion sampling process to obtain an accurate estimation of the model evidence using only a handful of posterior samples (e.g., 20). We also demonstrate how to implement our estimator in tandem with recent diffusion posterior sampling methods. Empirically, our estimator matches the model evidence when it can be computed analytically, and it is able to both select the correct diffusion model prior and diagnose prior misfit under different highly ill-conditioned, non-linear inverse problems, including a real-world black hole imaging problem.
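The selection rule described in this abstract can be written out explicitly. The notation below is the standard Bayesian form and assumes, as the abstract suggests, that each model $M$ differs only in its prior $p(x \mid M)$ while the measurement likelihood $p(y \mid x)$ is shared.

$$
p(y \mid M) = \int p(y \mid x)\, p(x \mid M)\, \mathrm{d}x,
\qquad
\hat{M} = \arg\max_{M}\; p(y \mid M)
$$

The integral runs over all candidate images $x$, which is what makes direct evaluation intractable for a diffusion prior and motivates a sample-based estimator.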

93. 【2602.20548】Robust Spiking Neural Networks Against Adversarial Attacks

Link: https://arxiv.org/abs/2602.20548

Authors: Shuai Wang,Malu Zhang,Yulin Jiang,Dehao Zhang,Ammar Belatreche,Yu Liang,Yimeng Shan,Zijian Zhou,Yang Yang,Haizhou Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Spiking Neural Networks, Neural Networks, Spiking Neural, represent a promising, spike-driven characteristics

Comments: Published as a conference paper at ICLR 2026

Abstract:Spiking Neural Networks (SNNs) represent a promising paradigm for energy-efficient neuromorphic computing due to their bio-plausible and spike-driven characteristics. However, the robustness of SNNs in complex adversarial environments remains significantly constrained. In this study, we theoretically demonstrate that those threshold-neighboring spiking neurons are the key factors limiting the robustness of directly trained SNNs. We find that these neurons set the upper limits for the maximum potential strength of adversarial attacks and are prone to state-flipping under minor disturbances. To address this challenge, we propose a Threshold Guarding Optimization (TGO) method, which comprises two key aspects. First, we incorporate additional constraints into the loss function to move neurons' membrane potentials away from their thresholds. It increases SNNs' gradient sparsity, thereby reducing the theoretical upper bound of adversarial attacks. Second, we introduce noisy spiking neurons to transition the neuronal firing mechanism from deterministic to probabilistic, decreasing their state-flipping probability due to minor disturbances. Extensive experiments conducted in standard adversarial scenarios prove that our method significantly enhances the robustness of directly trained SNNs. These findings pave the way for advancing more reliable and secure neuromorphic computing in real-world applications.

94. 【2602.20543】Beyond Human Performance: A Vision-Language Multi-Agent Approach for Quality Control in Pharmaceutical Manufacturing

Link: https://arxiv.org/abs/2602.20543

Authors: Subhra Jyoti Mandal, Lara Rachidi, Puneet Jain, Matthieu Duvinage, Sander W. Timmer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Environmental Monitoring programs, Environmental Monitoring, Colony-forming unit, stringent quality standards, component of Environmental

Comments:

Abstract:Colony-forming unit (CFU) detection is critical in pharmaceutical manufacturing, serving as a key component of Environmental Monitoring programs and ensuring compliance with stringent quality standards. Manual counting is labor-intensive and error-prone, while deep learning (DL) approaches, though accurate, remain vulnerable to sample quality variations and artifacts. Building on our earlier CNN-based framework (Beznik et al., 2020), we evaluated YOLOv5, YOLOv7, and YOLOv8 for CFU detection; however, these achieved only 97.08 percent accuracy, insufficient for pharmaceutical-grade requirements. A custom Detectron2 model trained on GSK's dataset of over 50,000 Petri dish images achieved 99 percent detection rate with 2 percent false positives and 0.6 percent false negatives. Despite high validation accuracy, Detectron2 performance degrades on outlier cases including contaminated plates, plastic artifacts, or poor optical clarity. To address this, we developed a multi-agent framework combining DL with vision-language models (VLMs). The VLM agent first classifies plates as valid or invalid. For valid samples, both DL and VLM agents independently estimate colony counts. When predictions align within 5 percent, results are automatically recorded in Postgres and SAP; otherwise, samples are routed for expert review. Expert feedback enables continuous retraining and self-improvement. Initial DL-based automation reduced human verification by 50 percent across vaccine manufacturing sites. With VLM integration, this increased to 85 percent, delivering significant operational savings. The proposed system provides a scalable, auditable, and regulation-ready solution for microbiological quality control, advancing automation in biopharmaceutical production.
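
The routing rule described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' system: the abstract does not say how the 5 percent agreement is measured (here it is taken relative to the larger count) or what happens to invalid plates (here they get their own label). The function names and return labels are hypothetical.

```python
def relative_difference(dl_count, vlm_count):
    """Difference between the two counts, relative to the larger one (assumed definition)."""
    larger = max(dl_count, vlm_count)
    if larger == 0:
        return 0.0  # both agents report zero colonies
    return abs(dl_count - vlm_count) / larger


def route_plate(plate_is_valid, dl_count, vlm_count, tolerance=0.05):
    """Decide what to do with one plate, given the validity check and both count estimates."""
    if not plate_is_valid:
        return "invalid_plate"  # assumed handling; not specified in the abstract
    if relative_difference(dl_count, vlm_count) <= tolerance:
        return "auto_record"  # counts agree: record automatically
    return "expert_review"  # counts disagree: send to a human expert


decisions = [
    route_plate(True, 100, 103),
    route_plate(True, 100, 120),
    route_plate(False, 0, 0),
]
```

Keeping the tolerance as a parameter makes the trade-off explicit: a tighter tolerance sends more plates to expert review, a looser one automates more of them.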

95. 【2602.20537】PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning

Link: https://arxiv.org/abs/2602.20537

Authors: Xinyong Cai, Changbin Sun, Yong Wang, Hongyu Yang, Yuankai Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: forecast future frames, Spatiotemporal predictive learning, predictive learning, aims to forecast, range of applications

Comments: Accepted to CVPR 2026

Abstract:Spatiotemporal predictive learning (STPL) aims to forecast future frames from past observations and is essential across a wide range of applications. Compared with recurrent or hybrid architectures, pure convolutional models offer superior efficiency and full parallelism, yet their fixed receptive fields limit their ability to adaptively capture spatially varying motion patterns. Inspired by biological center-surround organization and frequency-selective signal processing, we propose PFGNet, a fully convolutional framework that dynamically modulates receptive fields through pixel-wise frequency-guided gating. The core Peripheral Frequency Gating (PFG) block extracts localized spectral cues and adaptively fuses multi-scale large-kernel peripheral responses with learnable center suppression, effectively forming spatially adaptive band-pass filters. To maintain efficiency, all large kernels are decomposed into separable 1D convolutions ($1 \times k$ followed by $k \times 1$), reducing per-channel computational cost from $O(k^2)$ to $O(2k)$. PFGNet enables structure-aware spatiotemporal modeling without recurrence or attention. Experiments on Moving MNIST, TaxiBJ, Human3.6M, and KTH show that PFGNet delivers SOTA or near-SOTA forecasting performance with substantially fewer parameters and FLOPs. Our code is available at this https URL.
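
The cost reduction from $O(k^2)$ to $O(2k)$ can be checked on a toy example. The sketch below is not PFGNet code. It only shows that a $1 \times k$ pass followed by a $k \times 1$ pass reproduces a full $k \times k$ convolution when the 2D kernel is an outer product of two 1D kernels, which is the assumption made here. The helper names are hypothetical.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation, written out for clarity rather than speed."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def separable_conv2d_valid(image, row_kernel, col_kernel):
    """Apply a 1 x k kernel, then a k x 1 kernel."""
    horizontal = conv2d_valid(image, row_kernel[np.newaxis, :])
    return conv2d_valid(horizontal, col_kernel[:, np.newaxis])


def macs_per_output(k):
    """Multiply-accumulates per output pixel and channel for each variant."""
    return {"full": k * k, "separable": 2 * k}


rng = np.random.default_rng(0)
k = 7
image = rng.standard_normal((16, 16))
row_kernel = rng.standard_normal(k)
col_kernel = rng.standard_normal(k)

full_out = conv2d_valid(image, np.outer(col_kernel, row_kernel))
sep_out = separable_conv2d_valid(image, row_kernel, col_kernel)
```

For `k = 7` the full kernel needs 49 multiply-accumulates per output value and the separable pair needs 14. A general $k \times k$ kernel is not an outer product, so in a trained network the separable form restricts the set of filters the layer can represent rather than reproducing an arbitrary kernel.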

96. 【2602.20531】A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata

Link: https://arxiv.org/abs/2602.20531

Authors: Azrin Sultana, Firoz Ahmed

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: app rating prediction, existing app rating, app rating, significant indicators, rating prediction

Comments: 24 pages, 10 figures

Abstract:App ratings are among the most significant indicators of the quality, usability, and overall user satisfaction of mobile applications. However, existing app rating prediction models are largely limited to textual data or user interface (UI) features, overlooking the importance of jointly leveraging UI and semantic information. To address these limitations, this study proposes a lightweight vision-language framework that integrates both mobile UI and semantic information for app rating prediction. The framework combines MobileNetV3 to extract visual features from UI layouts and DistilBERT to extract textual features. These multimodal features are fused through a gated fusion module with Swish activations, followed by a multilayer perceptron (MLP) regression head. The proposed model is evaluated using mean absolute error (MAE), root mean square error (RMSE), mean squared error (MSE), coefficient of determination (R²), and Pearson correlation. After training for 20 epochs, the model achieves an MAE of 0.1060, an RMSE of 0.1433, an MSE of 0.0205, an R² of 0.8529, and a Pearson correlation of 0.9251. Extensive ablation studies further demonstrate the effectiveness of different combinations of visual and textual encoders. Overall, the proposed lightweight framework provides valuable insights for developers and end users, supports sustainable app development, and enables efficient deployment on edge devices.
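
For readers who want to see how the five reported metrics relate to each other, here is a minimal sketch that computes them with NumPy. The ratings in `toy_true` and `toy_pred` are made-up values for illustration only and have nothing to do with the paper's data. `regression_metrics` is a hypothetical helper name.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, R^2, and Pearson correlation for paired values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    mse = float(np.mean(errors ** 2))
    ss_res = float(np.sum(errors ** 2))                    # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {
        "mae": float(np.mean(np.abs(errors))),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "r2": 1.0 - ss_res / ss_tot,
        "pearson": float(np.corrcoef(y_true, y_pred)[0, 1]),
    }


# Made-up ratings, used only to exercise the function.
toy_true = [4.0, 3.5, 4.5, 2.0]
toy_pred = [3.9, 3.7, 4.4, 2.3]
metrics = regression_metrics(toy_true, toy_pred)
```

Since RMSE is the square root of MSE, the two reported values can be cross-checked directly: 0.1433 squared is about 0.0205, which matches the MSE in the abstract.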

97. 【2602.20520】How Do Inpainting Artifacts Propagate to Language?

Link: https://arxiv.org/abs/2602.20520

Authors: Pratham Yashwante, Davit Abrahamyan, Shresth Grover, Sukruth Rao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: diffusion-based inpainting affect, introduced by diffusion-based, affect language generation, inpainting affect language, visual artifacts introduced

Comments:

Abstract:We study how visual artifacts introduced by diffusion-based inpainting affect language generation in vision-language models. We use a two-stage diagnostic setup in which masked image regions are reconstructed and then provided to captioning models, enabling controlled comparisons between captions generated from original and reconstructed inputs. Across multiple datasets, we analyze the relationship between reconstruction fidelity and downstream caption quality. We observe consistent associations between pixel-level and perceptual reconstruction metrics and both lexical and semantic captioning performance. Additional analysis of intermediate visual representations and attention patterns shows that inpainting artifacts lead to systematic, layer-dependent changes in model behavior. Together, these results provide a practical diagnostic framework for examining how visual reconstruction quality influences language generation in multimodal systems.

98. 【2602.20511】Leveraging Causal Reasoning Method for Explaining Medical Image Segmentation Models

Link: https://arxiv.org/abs/2602.20511

Authors: Limai Jiang, Ruitao Xie, Bokai Yang, Huazhen Huang, Juan He, Yufu Huo, Zikai Wang, Yang Wei, Yunpeng Cai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabling precise localization, image segmentation plays, clinical decision-making, enabling precise, guiding interventions

Comments: Preprint

Abstract:Medical image segmentation plays a vital role in clinical decision-making, enabling precise localization of lesions and guiding interventions. Despite significant advances in segmentation accuracy, the black-box nature of most deep models has raised growing concerns about their trustworthiness in high-stakes medical scenarios. Current explanation techniques have primarily focused on classification tasks, leaving the segmentation domain relatively underexplored. We introduced an explanation model for segmentation tasks that employs a causal inference framework and backpropagates the average treatment effect (ATE) into a quantification metric to determine the influence of input regions, as well as network components, on target segmentation areas. Through comparison with recent segmentation explainability techniques on two representative medical imaging datasets, we demonstrated that our approach provides more faithful explanations than existing approaches. Furthermore, we carried out a systematic causal analysis of multiple foundational segmentation models using our method, which reveals significant heterogeneity in perceptual strategies across different models, and even between different inputs for the same model, suggesting that our method can provide notable insights for optimizing segmentation models. Our code can be found at this https URL.

99. 【2602.20501】Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models

Link: https://arxiv.org/abs/2602.20501

Authors: Qing Zhang, Xuesong Li, Jing Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Foundation Models, visual system, Visual Foundation, models, interaction

Comments: 11 pages, 12 figures, Accepted to CVPR 2026

Abstract:What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent's actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures, while generative models like Flux contain rich, verb-conditioned spatial attention maps that serve as implicit interaction priors. Crucially, we demonstrate that these two dimensions are not merely correlated but are composable elements of affordance. By simply fusing DINO's geometric prototypes with Flux's interaction maps in a training-free and zero-shot manner, we achieve affordance estimation competitive with weakly-supervised methods. This final fusion experiment confirms that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.

100. 【2602.20500】Strategy-Supervised Autonomous Laparoscopic Camera Control via Event-Driven Graph Mining

Link: https://arxiv.org/abs/2602.20500

Authors: Keyu Zhou, Peisen Xu, Yahao Wu, Jiming Chen, Gaofeng Li, Shunlei Li

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autonomous laparoscopic camera, rapid tool-tissue interactions, laparoscopic camera control, Autonomous laparoscopic, safe surgical view

Comments: Submitted to IEEE Transactions on Robotics (T-RO). 19 pages, 9 figures

Abstract:Autonomous laparoscopic camera control must maintain a stable and safe surgical view under rapid tool-tissue interactions while remaining interpretable to surgeons. We present a strategy-grounded framework that couples high-level vision-language inference with low-level closed-loop control. Offline, raw surgical videos are parsed into camera-relevant temporal events (e.g., interaction, working-distance deviation, and view-quality degradation) and structured as attributed event graphs. Mining these graphs yields a compact set of reusable camera-handling strategy primitives, which provide structured supervision for learning. Online, a fine-tuned Vision-Language Model (VLM) processes the live laparoscopic view to predict the dominant strategy and discrete image-based motion commands, executed by an IBVS-RCM controller under strict safety constraints; optional speech input enables intuitive human-in-the-loop conditioning. On a surgeon-annotated dataset, event parsing achieves reliable temporal localization (F1-score 0.86), and the mined strategies show strong semantic alignment with expert interpretation (cluster purity 0.81). Extensive ex vivo experiments on silicone phantoms and porcine tissues demonstrate that the proposed system outperforms junior surgeons in standardized camera-handling evaluations, reducing field-of-view centering error by 35.26% and image shaking by 62.33%, while preserving smooth motion and stable working-distance regulation.

101. 【2602.20497】LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

Link: https://arxiv.org/abs/2602.20497

Authors: Peiliang Cai, Jiacheng Liu, Haowen Xu, Xinyu Wang, Chang Zou, Linfeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: achieved remarkable success, video generation tasks, achieved remarkable, remarkable success, success in image

Comments:

Abstract:Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.

102. 【2602.20496】Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching

Link: https://arxiv.org/abs/2602.20496

Authors: Jintu Zheng, Qizhe Liu, HuangXin Xu, Zhuojie Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recurrent Neural Networks, Neural Networks, Recurrent Neural, hinders edge deployment, dependence on Recurrent

Comments: Accepted to CVPR 2026 (3D vision track)

Abstract:While iterative stereo matching achieves high accuracy, its dependence on Recurrent Neural Networks (RNNs) hinders edge deployment, a challenge underexplored in existing research. We analyze iterative refinement and reveal that disparity updates are spatially sparse and temporally redundant. First, we introduce a progressive iteration pruning strategy that suppresses redundant update steps, effectively collapsing the recursive computation into a near-single-pass inference. Second, we propose a collaborative monocular prior transfer framework that implicitly embeds depth priors without requiring a dedicated monocular encoder, thereby eliminating its associated computational burden. Third, we develop FlashGRU, a hardware-aware RNN operator leveraging structured sparsity and I/O-conscious design, achieving a 7.28× speedup, a 76.6% reduction in peak memory, and an 80.9% reduction in global memory requests over native ConvGRUs at 2K resolution. Our PipStereo enables real-time, high-fidelity stereo matching on edge hardware: it processes 320×640 frames in just 75 ms on an NVIDIA Jetson Orin NX (FP16) and 19 ms on an RTX 4090, matching the accuracy of large iterative models, and its generalization ability and accuracy far exceed those of existing real-time methods. Our embedded AI projects will be updated at: this https URL.

103. 【2602.20479】Path-Decoupled Hyperbolic Flow Matching for Few-Shot Adaptation

Link: https://arxiv.org/abs/2602.20479

Authors: Lin Li, Ziqi Jiang, Gefan Ye, Zhenqi He, Jiahui Li, Jun Xiao, Kwang-Ting Cheng, Long Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cross-modal few-shot adaptation, few-shot adaptation treat, adaptation treat visual-semantic, Recent advances, Hyperbolic Flow Matching

Comments:

Abstract:Recent advances in cross-modal few-shot adaptation treat visual-semantic alignment as a continuous feature transport problem via Flow Matching (FM). However, we argue that Euclidean-based FM overlooks fundamental limitations of flat geometry, where polynomial volume growth fails to accommodate diverse feature distributions, leading to severe path entanglement. To this end, we propose path-decoupled Hyperbolic Flow Matching (HFM), leveraging the Lorentz manifold's exponential expansion for trajectory decoupling. HFM structures the transport via two key designs: 1) Centripetal hyperbolic alignment: It constructs a centripetal hierarchy by anchoring textual roots, which pushes visual leaves to the boundary to initialize orderly flows. 2) Path-decoupled objective: It acts as a "semantic guardrail" rigidly confining trajectories within isolated class-specific geodesic corridors via step-wise supervision. Furthermore, we devise an adaptive diameter-based stopping to prevent over-transportation into the crowded origin based on the intrinsic semantic scale. Extensive ablations on 11 benchmarks have shown that HFM establishes a new state-of-the-art, consistently outperforming its Euclidean counterparts. Our code and models will be released.

104. 【2602.20476】SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

Link: https://arxiv.org/abs/2602.20476

Authors: Anindita Ghosh, Vladislav Golyanik, Taku Komura, Philipp Slusallek, Christian Theobalt, Rishabh Dabral

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Synthesizing text-driven, realistic scenes requires, scenes requires learning, avoiding collisions, requires learning

Comments: 13 pages, 6 figures, 4 tables

Abstract:Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent ("walk to the couch") and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues and relying on (1) a text-conditioned autoregressive global motion planner that operates on a bird's-eye-view (BEV) image rendered from an elevated corner of the scene, encoded with DINOv2 features, as the scene representation, and (2) a geometry-grounded motion tokenizer trained via a conditional VQ-VAE, that uses 2D local scene heightmap, thus embedding surface physics directly into a discrete vocabulary. This 2D factorization reaches an efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.

105. 【2602.20423】MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation

Link: https://arxiv.org/abs/2602.20423

Authors: Taha Koleilat, Hojat Asgariandehkordi, Omid Nejati Manzari, Berardino Barile, Yiming Xiao, Hassan Rivaz

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: ambiguous anatomical features, Medical image segmentation, remains challenging due, image segmentation remains, Medical image

Comments: CVPR 2026; Project Page: [this https URL](https://tahakoleilat.github.io/MedCLIPSeg)

106. 【2602.20417】gQIR: Generative Quanta Image Reconstruction

Link: https://arxiv.org/abs/2602.20417

Authors: Aryan Garg, Sizhuo Ma, Mohit Gupta

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Capturing high-quality images, Capturing high-quality, fundamental challenge, challenge in computational, Capturing

Comments: CVPR 2026

Abstract:Capturing high-quality images from only a few detected photons is a fundamental challenge in computational imaging. Single-photon avalanche diode (SPAD) sensors promise high-quality imaging in regimes where conventional cameras fail, but raw quanta frames contain only sparse, noisy, binary photon detections. Recovering a coherent image from a burst of such frames requires handling alignment, denoising, and demosaicing (for color) under noise statistics far outside those assumed by standard restoration pipelines or modern generative models. We present an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. Our method leverages the structural and semantic priors of internet-scale diffusion models while introducing mechanisms to handle Bernoulli photon statistics. By integrating latent-space restoration with burst-level spatio-temporal reasoning, our approach produces reconstructions that are both photometrically faithful and perceptually pleasing, even under high-speed motion. We evaluate the method on synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging Deforming (XD) video benchmark. Across all settings, the approach substantially improves perceptual quality over classical and modern learning-based baselines, demonstrating the promise of adapting large generative priors to extreme photon-limited sensing. Code at this https URL.

107. 【2602.20412】SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images

Link: https://arxiv.org/abs/2602.20412

Authors: Aayush Dhakal, Subash Khanal, Srikumar Sastry, Jacob Arndt, Philipe Ambrozio Dias, Dalton Lunga, Nathan Jacobs

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fake image detection, research and society, rapid advancement, advancement of generative, critical challenge

Comments: Accepted to CVPR 2026

Abstract:The rapid advancement of generative models has made the detection of AI-generated images a critical challenge for both research and society. Recent works have shown that most state-of-the-art fake image detection methods overfit to their training data and catastrophically fail when evaluated on curated hard test sets with strong distribution shifts. In this work, we argue that it is more principled to learn a tight decision boundary around the real image distribution and treat the fake category as a sink class. To this end, we propose SimLBR, a simple and efficient framework for fake image detection using Latent Blending Regularization (LBR). Our method significantly improves cross-generator generalization, achieving up to +24.85% accuracy and +69.62% recall on the challenging Chameleon benchmark. SimLBR is also highly efficient, training orders of magnitude faster than existing approaches. Furthermore, we emphasize the need for reliability-oriented evaluation in fake image detection, introducing risk-adjusted metrics and worst-case estimates to better assess model robustness. All code and models will be released on HuggingFace and GitHub.

108. 【2602.20409】CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation

Link: https://arxiv.org/abs/2602.20409

Authors: Mainak Singha, Sarthak Mehrotra, Paolo Casari, Subhasis Chaudhuri, Elisa Ricci, Biplab Banerjee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent vision-language models, impressive cross-modal reasoning, demonstrate impressive cross-modal, Recent vision-language, CLIP demonstrate impressive

Comments: Accepted in CVPR 2026

Abstract:Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Codes are available at this https URL.

109. 【2602.20363】Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field

Link: https://arxiv.org/abs/2602.20363

Authors: Sheyang Tang, Armin Shafiee Sarvestani, Jialu Xu, Xiaoyu Xu, Zhou Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: scene depends strongly, depends strongly, aesthetic, scene depends, limited camera adjustments

Comments: 14 pages, 10 figures

Abstract:The aesthetic quality of a scene depends strongly on camera viewpoint. Existing approaches for aesthetic viewpoint suggestion are either single-view adjustments, predicting limited camera adjustments from a single image without understanding scene geometry, or 3D exploration approaches, which rely on dense captures or prebuilt 3D environments coupled with costly reinforcement learning (RL) searches. In this work, we introduce the notion of 3D aesthetic field that enables geometry-grounded aesthetic reasoning in 3D with sparse captures, allowing efficient viewpoint suggestions in contrast to costly RL searches. We opt to learn this 3D aesthetic field using a feedforward 3D Gaussian Splatting network that distills high-level aesthetic knowledge from a pretrained 2D aesthetic model into 3D space, enabling aesthetic prediction for novel viewpoints from only sparse input views. Building on this field, we propose a two-stage search pipeline that combines coarse viewpoint sampling with gradient-based refinement, efficiently identifying aesthetically appealing viewpoints without dense captures or RL exploration. Extensive experiments show that our method consistently suggests viewpoints with superior framing and composition compared to existing approaches, establishing a new direction toward 3D-aware aesthetic modeling.

110. 【2602.20360】Momentum Guidance: Plug-and-Play Guidance for Flow Models

Link: https://arxiv.org/abs/2602.20360

Authors: Runlong Liao, Jian Yu, Baiyu Su, Chi Zhang, Lizhang Chen, Qiang Liu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: high-quality generative modeling, vanilla conditional form, lack fine-grained detail, fine-grained detail due, generative modeling

Comments:

Abstract:Flow-based generative models have become a strong framework for high-quality generative modeling, yet pretrained models are rarely used in their vanilla conditional form: conditional samples without guidance often appear diffuse and lack fine-grained detail due to the smoothing effects of neural networks. Existing guidance techniques such as classifier-free guidance (CFG) improve fidelity but double the inference cost and typically reduce sample diversity. We introduce Momentum Guidance (MG), a new dimension of guidance that leverages the ODE trajectory itself. MG extrapolates the current velocity using an exponential moving average of past velocities and preserves the standard one-evaluation-per-step cost. It matches the effect of standard guidance without extra computation and can further improve quality when combined with CFG. Experiments demonstrate MG's effectiveness across benchmarks. Specifically, on ImageNet-256, MG achieves average improvements in FID of 36.68% without CFG and 25.52% with CFG across various sampling settings, attaining an FID of 1.597 at 64 sampling steps. Evaluations on large flow-based models like Stable Diffusion 3 and FLUX.1-dev further confirm consistent quality enhancements across standard metrics.
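
The abstract does not state the update rule, so the sketch below is only one plausible reading of "extrapolating the current velocity using an exponential moving average of past velocities", not the authors' method. The extrapolation formula `v + w * (v - ema)`, the parameters `guidance_weight` and `ema_decay`, the Euler integrator, and the toy velocity field are all assumptions made for illustration.

```python
import numpy as np


def euler_sample_with_momentum(velocity_fn, x0, num_steps=64,
                               guidance_weight=0.5, ema_decay=0.9):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with a hypothetical momentum extrapolation."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    ema = None
    num_evals = 0
    for step in range(num_steps):
        t = step * dt
        v = np.array(velocity_fn(x, t), dtype=float)
        num_evals += 1  # exactly one velocity evaluation per step
        if ema is None:
            guided = v  # no history yet on the first step
            ema = v.copy()
        else:
            # Assumed rule: push the velocity away from its running average.
            guided = v + guidance_weight * (v - ema)
            ema = ema_decay * ema + (1.0 - ema_decay) * v
        x = x + dt * guided
    return x, num_evals


# Toy velocity field that pulls samples toward a fixed target point.
target = np.array([1.0, -2.0])
toy_velocity = lambda x, t: target - x

x_guided, evals_guided = euler_sample_with_momentum(toy_velocity, np.zeros(2))
x_plain, evals_plain = euler_sample_with_momentum(toy_velocity, np.zeros(2),
                                                  guidance_weight=0.0)
```

The property the sketch is meant to make concrete is the cost: both runs call the velocity function once per step, while classifier-free guidance needs a conditional and an unconditional evaluation per step.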

111. 【2602.20354】3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism

Link: https://arxiv.org/abs/2602.20354

Authors: Bhavik Chandna, Kelsey R. Allen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: evolving rapidly, generation is evolving, video, video generation, realism

Comments:

Abstract:AI video generation is evolving rapidly. For video generators to be useful for applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process -- requiring human annotation or bespoke evaluation datasets which have restricted scope. Here we develop an automated evaluation framework for video realism which captures both semantics and coherent 3D structure and which does not require access to a reference video. Our method, 3DSPA, is a 3D spatiotemporal point autoencoder which integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation for video evaluation. 3DSPA models how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show that 3DSPA reliably identifies videos which violate physical laws, is more sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism across multiple datasets. Our results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models, and implicitly captures physical rule violations. The code and pretrained model weights will be available at this https URL.

112. 【2602.20351】BiRQA: Bidirectional Robust Quality Assessment for Images

Link: https://arxiv.org/abs/2602.20351

Authors: Aleksandr Gushchin, Dmitriy S. Vatolin, Anastasia Antsiferova

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Full-Reference image quality, image quality assessment, current neural metrics, neural metrics remain, metrics remain slow

Comments:

Abstract:Full-Reference image quality assessment (FR IQA) is important for image compression, restoration and generative modeling, yet current neural metrics remain slow and vulnerable to adversarial perturbations. We present BiRQA, a compact FR IQA metric model that processes four fast complementary features within a bidirectional multiscale pyramid. A bottom-up attention module injects fine-scale cues into coarse levels through an uncertainty-aware gate, while a top-down cross-gating block routes semantic context back to high resolution. To enhance robustness, we introduce Anchored Adversarial Training, a theoretically grounded strategy that uses clean "anchor" samples and a ranking loss to bound pointwise prediction error under attacks. On five public FR IQA benchmarks BiRQA outperforms or matches the previous state of the art (SOTA) while running ~3x faster than previous SOTA models. Under unseen white-box attacks it lifts SROCC from 0.30-0.57 to 0.60-0.84 on KADID-10k, demonstrating substantial robustness gains. To our knowledge, BiRQA is the only FR IQA model combining competitive accuracy with real-time throughput and strong adversarial resilience.

113. 【2602.20342】Large-scale Photorealistic Outdoor 3D Scene Reconstruction from UAV Imagery Using Gaussian Splatting Techniques

Link: https://arxiv.org/abs/2602.20342

Authors: Christos Maikos, Georgios Angelidis, Georgios Th. Papadopoulos

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: converting drone-captured video, drone-captured video streams, pipeline capable, capable of converting, converting drone-captured

Comments: 7 pages, 2 figures

Abstract:In this study, we present an end-to-end pipeline capable of converting drone-captured video streams into high-fidelity 3D reconstructions with minimal latency. Unmanned aerial vehicles (UAVs) are extensively used in aerial real-time perception applications. Moreover, recent advances in 3D Gaussian Splatting (3DGS) have demonstrated significant potential for real-time neural rendering. However, their integration into end-to-end UAV-based reconstruction and visualization systems remains underexplored. Our goal is to propose an efficient architecture that combines live video acquisition via RTMP streaming, synchronized sensor fusion, camera pose estimation, and 3DGS optimization, achieving continuous model updates and low-latency deployment within interactive visualization environments that support immersive augmented and virtual reality (AR/VR) applications. Experimental results demonstrate that the proposed method achieves competitive visual fidelity, while delivering significantly higher rendering performance and substantially reduced end-to-end latency, compared to NeRF-based approaches. Reconstruction quality remains within 4-7% of high-fidelity offline references, confirming the suitability of the proposed system for real-time, scalable augmented perception from aerial platforms.

114. 【2602.20330】Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking

Link: https://arxiv.org/abs/2602.20330

Authors: Jingcheng Yang, Tianhu Xiong, Shengyi Qian, Klara Nahrstedt, Mingyuan Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: opaque black boxes, remain opaque black, Vision-language models, black boxes, powerful but remain

Comments: To appear in the Findings of CVPR 2026

Abstract:Vision-language models (VLMs) are powerful but remain opaque black boxes. We introduce the first framework for transparent circuit tracing in VLMs to systematically analyze multimodal reasoning. By utilizing transcoders, attribution graphs, and attention-based methods, we uncover how VLMs hierarchically integrate visual and semantic concepts. We reveal that distinct visual feature circuits can handle mathematical reasoning and support cross-modal associations. Validated through feature steering and circuit patching, our framework proves these circuits are causal and controllable, laying the groundwork for more explainable and reliable VLMs.

115. 【2602.20328】GSNR: Graph Smooth Null-Space Representation for Inverse Problems

Link: https://arxiv.org/abs/2602.20328

Authors: Romario Gualdrón-Hurtado, Roman Jacome, Rafael S. Suarez, Henry Arguello

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Optimization and Control (math.OC)

Keywords: imaging are ill-posed, leading to infinitely, priors promote solutions, null-space, non-trivial null-space

Comments: 23 pages, 24 figures, Accepted to The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026

Abstract:Inverse problems in imaging are ill-posed, leading to infinitely many solutions consistent with the measurements due to the non-trivial null-space of the sensing matrix. Common image priors promote solutions on the general image manifold, such as sparsity, smoothness, or score function. However, as these priors do not constrain the null-space component, they can bias the reconstruction. Thus, we aim to incorporate meaningful null-space information in the reconstruction framework. Inspired by smooth image representation on graphs, we propose Graph-Smooth Null-Space Representation (GSNR), a mechanism that imposes structure only into the invisible component. Particularly, given a graph Laplacian, we construct a null-restricted Laplacian that encodes similarity between neighboring pixels in the null-space signal, and we design a low-dimensional projection matrix from the $p$-smoothest spectral graph modes (lowest graph frequencies). This approach has strong theoretical and practical implications: i) improved convergence via a null-only graph regularizer, ii) better coverage, how much null-space variance is captured by $p$ modes, and iii) high predictability, how well these modes can be inferred from the measurements. GSNR is incorporated into well-known inverse problem solvers, e.g., PnP, DIP, and diffusion solvers, in four scenarios: image deblurring, compressed sensing, demosaicing, and image super-resolution, providing consistent improvement of up to 4.3 dB over baseline formulations and up to 1 dB compared with end-to-end learned models in terms of PSNR.

116. 【2602.20312】N4MC: Neural 4D Mesh Compression

Link: https://arxiv.org/abs/2602.20312

Authors: Guodong Chen, Huanshuo Dong, Mallesham Dasari

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: efficiently compress time-varying, compress time-varying mesh, neural compression framework, time-varying mesh sequences, framework to efficiently

Comments:

Abstract:We present N4MC, the first 4D neural compression framework to efficiently compress time-varying mesh sequences by exploiting their temporal redundancy. Unlike prior neural mesh compression methods that treat each mesh frame independently, N4MC takes inspiration from inter-frame compression in 2D video codecs, and learns motion compensation in long mesh sequences. Specifically, N4MC converts consecutive irregular mesh frames into regular 4D tensors to provide a uniform and compact representation. These tensors are then condensed using an auto-decoder, which captures both spatial and temporal correlations for redundancy removal. To enhance temporal coherence, we introduce a transformer-based interpolation model that predicts intermediate mesh frames conditioned on latent embeddings derived from tracked volume centers, eliminating motion ambiguities. Extensive evaluations show that N4MC outperforms state-of-the-art in rate-distortion performance, while enabling real-time decoding of 4D mesh sequences. The implementation of our method is available at: this https URL.

117. 【2602.20291】De-rendering, Reasoning, and Repairing Charts with Vision-Language Models

Link: https://arxiv.org/abs/2602.20291

Authors: Valentin Bonas, Martin Sinnona, Viviana Siless, Emmanuel Iarussi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Data visualizations, scientific communication, everyday decision-making, mislead audiences, central to scientific

Comments:

Abstract:Data visualizations are central to scientific communication, journalism, and everyday decision-making, yet they are frequently prone to errors that can distort interpretation or mislead audiences. Rule-based visualization linters can flag violations, but they miss context and do not suggest meaningful design changes. Directly querying general-purpose LLMs about visualization quality is unreliable: lacking training to follow visualization design principles, they often produce inconsistent or incorrect feedback. In this work, we introduce a framework that combines chart de-rendering, automated analysis, and iterative improvement to deliver actionable, interpretable feedback on visualization design. Our system reconstructs the structure of a chart from an image, identifies design flaws using vision-language reasoning, and proposes concrete modifications supported by established principles in visualization research. Users can selectively apply these improvements and re-render updated figures, creating a feedback loop that promotes both higher-quality visualizations and the development of visualization literacy. In our evaluation on 1,000 charts from the Chart2Code benchmark, the system generated 10,452 design recommendations, which clustered into 10 coherent categories (e.g., axis formatting, color accessibility, legend consistency). These results highlight the promise of LLM-driven recommendation systems for delivering structured, principle-based feedback on visualization design, opening the door to more intelligent and accessible authoring tools.

118. 【2602.20231】UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

Link: https://arxiv.org/abs/2602.20231

Authors: Manish Kumar Govind, Dominick Reilly, Pu Wang, Srijan Das

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: robot action supervision, Latent action, Latent action representations, unified latent action, explicit robot action

Comments: [this https URL](https://manishgovind.github.io/unilact-vla/)

Abstract:Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.

119. 【2602.20205】OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport

Link: https://arxiv.org/abs/2602.20205

Authors: Xiwen Chen, Wenhui Zhu, Gen Li, Xuanzhao Dong, Yujian Xiong, Hao Wang, Peijie Qiu, Qingquan Song, Zhipeng Wang, Shao Tang, Yalin Wang, Abolfazl Razi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-modal large language, large language models, strong visual-language reasoning, Multi-modal large, redundant visual tokens

Comments: Accepted by CVPR2026 (Findings). arXiv admin note: text overlap with [arXiv:2503.02175](https://arxiv.org/abs/2503.02175) by other authors

Abstract:Multi-modal large language models (MLLMs) achieve strong visual-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work explores visual token pruning to accelerate inference, but existing pruning methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, OTPrune preserves both local diversity and global representativeness while reducing inference cost. Moreover, we derive a tractable submodular objective that enables efficient optimization, and theoretically prove its monotonicity and submodularity, providing a principled foundation for stable and efficient pruning. We further provide a comprehensive analysis that explains how distributional alignment contributes to stable and semantically faithful pruning. Comprehensive experiments on wider benchmarks demonstrate that OTPrune achieves superior performance-efficiency tradeoffs compared to state-of-the-art methods. The code is available at this https URL.

120. 【2602.20200】Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

Link: https://arxiv.org/abs/2602.20200

Authors: Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang, Pengwei Xie, Jianye Hao, Liqiang Nie

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: robotic manipulation, dominant paradigm, paradigm for robotic, Hierarchical, action generation

Comments: 17 pages, 8 figures

Abstract:Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. They typically comprise a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency: a pronounced distributional gap between isotropic noise priors and target action distributions increases the number of denoising steps and the incidence of infeasible samples. (ii) Poor robustness: existing policies condition solely on the current observation, neglecting the constraint of the history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the number of function evaluations (NFE). LCM dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and trajectory smoothness. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 Hard. In real-world evaluation, OptimusVLA ranks best on the Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering 2.9x inference speedup.

121. 【2602.20165】VISION-ICE: Video-based Interpretation and Spatial Identification of Arrhythmia Origins via Neural Networks in Intracardiac Echocardiography

Link: https://arxiv.org/abs/2602.20165

Authors: Dorsa EPMoghaddam, Feng Gao, Drew Bernard, Kavya Sinha, Mehdi Razavi, Behnaam Aazhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Contemporary high-density mapping, MRI remain time, high-density mapping techniques, Contemporary high-density, MRI remain

Comments: 8 pages, 3 figures, 3 tables

Abstract:Contemporary high-density mapping techniques and preoperative CT/MRI remain time and resource intensive in localizing arrhythmias. AI has been validated as a clinical decision aid in providing accurate, rapid real-time analysis of echocardiographic images. Building on this, we propose an AI-enabled framework that leverages intracardiac echocardiography (ICE), a routine part of electrophysiology procedures, to guide clinicians toward areas of arrhythmogenesis and potentially reduce procedural time. Arrhythmia source localization is formulated as a three-class classification task, distinguishing normal sinus rhythm, left-sided, and right-sided arrhythmias, based on ICE video data. We developed a 3D Convolutional Neural Network trained to discriminate among the three aforementioned classes. In ten-fold cross-validation, the model achieved a mean accuracy of 66.2% when evaluated on four previously unseen patients (substantially outperforming the 33.3% random baseline). These results demonstrate the feasibility and clinical promise of using ICE videos combined with deep learning for automated arrhythmia localization. Leveraging ICE imaging could enable faster, more targeted electrophysiological interventions and reduce the procedural burden of cardiac ablation. Future work will focus on expanding the dataset to improve model robustness and generalizability across diverse patient populations.

122. 【2310.15741】Interpretable Medical Image Classification using Prototype Learning and Privileged Information

Link: https://arxiv.org/abs/2310.15741

Authors: Luisa Gallee, Meinrad Beer, Michael Goetz

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: medical imaging, essential requirement, requirement in medical, Advanced deep learning, Abstract

Comments: MICCAI 2023 Medical Image Computing and Computer Assisted Intervention

Abstract:Interpretability is often an essential requirement in medical imaging. Advanced deep learning methods are required to address this need for explainability and high performance. In this work, we investigate whether additional information available during the training process can be used to create an understandable and powerful model. We propose an innovative solution called Proto-Caps that leverages the benefits of capsule networks, prototype learning and the use of privileged information. Evaluating the proposed solution on the LIDC-IDRI dataset shows that it combines increased interpretability with above state-of-the-art prediction performance. Compared to the explainable baseline model, our method achieves more than 6 % higher accuracy in predicting both malignancy (93.0 %) and mean characteristic features of lung nodules. Simultaneously, the model provides case-based reasoning with prototype representations that allow visual validation of radiologist-defined attributes.

123. 【2602.20994】Multimodal MRI Report Findings Supervised Brain Lesion Segmentation with Substructures

Link: https://arxiv.org/abs/2602.20994

Authors: Yubin Ge, Yongsong Huang, Xiaofeng Liu

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: dense tumor voxel, tumor voxel labels, learning seeks, seeks to alleviate, voxel labels

Comments: IEEE International Symposium on Biomedical Imaging (ISBI) 2026

124. 【2602.20539】Progressive Per-Branch Depth Optimization for DEFOM-Stereo and SAM3 Joint Analysis in UAV Forestry Applications

Link: https://arxiv.org/abs/2602.20539

Authors: Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex forest canopies, individual branch analysis, dense disparity maps, autonomous UAV-based tree, modern stereo matchers

Comments:

Abstract:Accurate per-branch 3D reconstruction is a prerequisite for autonomous UAV-based tree pruning; however, dense disparity maps from modern stereo matchers often remain too noisy for individual branch analysis in complex forest canopies. This paper introduces a progressive pipeline integrating DEFOM-Stereo foundation-model disparity estimation, SAM3 instance segmentation, and multi-stage depth optimization to deliver robust per-branch point clouds. Starting from a naive baseline, we systematically identify and resolve three error families through successive refinements. Mask boundary contamination is first addressed through morphological erosion and subsequently refined via a skeleton-preserving variant to safeguard thin-branch topology. Segmentation inaccuracy is then mitigated using LAB-space Mahalanobis color validation coupled with cross-branch overlap arbitration. Finally, depth noise - the most persistent error source - is initially reduced by outlier removal and median filtering, before being superseded by a robust five-stage scheme comprising MAD global detection, spatial density consensus, local MAD filtering, RGB-guided filtering, and adaptive bilateral filtering. Evaluated on 1920x1080 stereo imagery of Radiata pine (Pinus radiata) acquired with a ZED Mini camera (63 mm baseline) from a UAV in Canterbury, New Zealand, the proposed pipeline reduces the average per-branch depth standard deviation by 82% while retaining edge fidelity. The result is geometrically coherent 3D point clouds suitable for autonomous pruning tool positioning. All code and processed data are publicly released to facilitate further UAV forestry research.
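
The "MAD global detection" stage named above can be illustrated with a generic median-absolute-deviation outlier test. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the 3.5 cutoff, and the sample depths are all hypothetical, and the 1.4826 factor is the usual constant that rescales the MAD to a standard deviation for Gaussian data.

```python
from statistics import median

def mad_inlier_mask(depths, threshold=3.5):
    """Flag depths whose robust z-score stays within `threshold` (assumed cutoff)."""
    med = median(depths)
    mad = median(abs(d - med) for d in depths)
    if mad == 0:
        # A zero MAD means at least half the samples equal the median; keep only those.
        return [d == med for d in depths]
    # 1.4826 rescales the MAD to the standard deviation under a Gaussian assumption.
    scale = 1.4826 * mad
    return [abs(d - med) / scale <= threshold for d in depths]

# Hypothetical depths (metres) sampled inside one branch mask; 9.0 mimics background leakage.
branch_depths = [2.00, 2.10, 1.90, 2.05, 1.95, 9.00]
mask = mad_inlier_mask(branch_depths)
filtered = [d for d, keep in zip(branch_depths, mask) if keep]
print(filtered)
```

The median and MAD are used instead of the mean and standard deviation because a single background pixel leaking into a thin-branch mask would otherwise inflate the spread enough to hide itself.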

125. 【2602.20316】Inspectorch: Efficient rare event exploration in solar observations

Link: https://arxiv.org/abs/2602.20316

Authors: C. J. Díaz Baso, I. J. Soler Poquet, C. Kuckein, M. van Noort, N. Poirier

Categories: Solar and Stellar Astrophysics (astro-ph.SR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: small spatiotemporal scales, Sun is observed, unprecedented detail, enabling studies, spatiotemporal scales

Comments: 12+1 pages, 11+2 figures, submitted to AA

Abstract:The Sun is observed in unprecedented detail, enabling studies of its activity on very small spatiotemporal scales. However, the large volume of data collected by our telescopes cannot be fully analyzed with conventional methods. Popular machine learning methods identify general trends from observations, but tend to overlook unusual events due to their low frequency of occurrence. We study the applicability of unsupervised probabilistic methods to efficiently identify rare events in multidimensional solar observations and to optimize the use of our computational resources for the study of these extreme phenomena. We introduce Inspectorch, an open-source framework that utilizes flow-based models: flexible density estimators capable of learning the multidimensional distribution of solar observations. Once optimized, it assigns a probability to each sample, allowing us to identify unusual events. We apply this approach to observations from the Hinode Spectro-Polarimeter, the Interface Region Imaging Spectrograph, the Microlensed Hyperspectral Imager at the Swedish 1-m Solar Telescope, the Atmospheric Imaging Assembly on board the Solar Dynamics Observatory and the Extreme Ultraviolet Imager on board Solar Orbiter. We find that the algorithm assigns consistently lower probabilities to spectra that exhibit unusual features. For example, it identifies profiles with very strong Doppler shifts, uncommon broadening, and temporal dynamics associated with small-scale reconnection events, among others. As a result, Inspectorch demonstrates that density estimation using flow-based models offers a powerful approach to identifying rare events in large solar datasets. The resulting probabilistic anomaly scores allow computational resources to be focused on the most informative and physically relevant events. We make our Python package publicly available at this https URL.
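
The idea of scoring samples by their probability under a learned density can be sketched without any flow model. The example below deliberately swaps the flow-based estimator for a one-dimensional Gaussian fit, so it is a toy stand-in and does not use the Inspectorch API; the velocity values and function names are hypothetical.

```python
from math import log, pi
from statistics import fmean, pvariance

def fit_gaussian(samples):
    """Fit a 1D Gaussian; a toy stand-in for a trained flow-based density model."""
    return fmean(samples), pvariance(samples)

def log_prob(x, mean, var):
    """Log-density of x under the fitted Gaussian."""
    return -0.5 * (log(2 * pi * var) + (x - mean) ** 2 / var)

# Hypothetical Doppler velocities (km/s); the last value mimics a rare strong shift.
velocities = [0.2, -0.1, 0.4, 0.0, -0.3, 0.1, -0.2, 0.3, 6.5]
mean, var = fit_gaussian(velocities)
scores = [log_prob(v, mean, var) for v in velocities]

# Rank samples from least to most probable; the head of the list holds rare-event candidates.
ranked = sorted(range(len(velocities)), key=lambda i: scores[i])
print(velocities[ranked[0]])
```

A flow-based model plays the same role as `log_prob` here but can represent multimodal, high-dimensional distributions, which is what makes the ranking meaningful for full spectra rather than single scalars.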