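This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

995 papers were updated today, including:

  • Natural Language Processing: 101
  • Information Retrieval: 27
  • Computer Vision: 251

Natural Language Processing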

1. 【2602.20135】KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Link: https://arxiv.org/abs/2602.20135

Authors: Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi, Behnam Bahrak

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: large language models, rise of large, large language, instrumental in applications, RAG

Comments: Accepted at the Third Conference on Parsimony and Learning (CPAL 2026). 36 pages, 12 figures. (Equal contribution: Yasaman Amou Jafari and Mahdi Noori.)

Abstract:With the rise of large language models (LLMs), they have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.

2. 【2602.20133】AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization

Link: https://arxiv.org/abs/2602.20133

Authors: Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, Ion Stoica

Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, automated program generation, Large Language, semantic mutation operators

Comments:

Abstract:The paradigm of automated program generation is shifting from one-shot generation to inference-time search, where Large Language Models (LLMs) function as semantic mutation operators within evolutionary loops. While effective, these systems are currently governed by static schedules that fail to account for the non-stationary dynamics of the search process. This rigidity results in substantial computational waste, as resources are indiscriminately allocated to stagnating populations while promising frontiers remain under-exploited. We introduce AdaEvolve, a framework that reformulates LLM-driven evolution as a hierarchical adaptive optimization problem. AdaEvolve uses an "accumulated improvement signal" to unify decisions across three levels: Local Adaptation, which dynamically modulates the exploration intensity within a population of solution candidates; Global Adaptation, which routes the global resource budget via bandit-based scheduling across different solution candidate populations; and Meta-Guidance which generates novel solution tactics based on the previously generated solutions and their corresponding improvements when the progress stalls. We demonstrate that AdaEvolve consistently outperforms the open-sourced baselines across 185 different open-ended optimization problems including combinatorial, systems optimization and algorithm design problems.
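
The abstract only says that the global budget is routed by "bandit-based scheduling" and does not name the bandit algorithm. As a rough illustration of the idea, the sketch below uses the classic UCB1 rule to decide which population receives the next unit of budget. The choice of UCB1, the function name `select_population`, and the made-up payoff values are all assumptions for illustration, not AdaEvolve's actual method.

```python
import math
from typing import List

def select_population(counts: List[int], mean_improvement: List[float], c: float = 1.0) -> int:
    """Pick the population to spend the next unit of budget on (UCB1 rule).

    counts[i] is how often population i has been chosen so far;
    mean_improvement[i] is its average observed improvement (hypothetical signal).
    """
    # Try every population once before comparing them.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    # Mean payoff plus an exploration bonus that shrinks as a population is visited.
    scores = [m + c * math.sqrt(2.0 * math.log(total) / n)
              for m, n in zip(mean_improvement, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy run with made-up, deterministic payoffs per population.
counts = [0, 0, 0]
means = [0.0, 0.0, 0.0]
toy_improvement = [0.05, 0.40, 0.10]
for _ in range(30):
    i = select_population(counts, means)
    counts[i] += 1
    means[i] += (toy_improvement[i] - means[i]) / counts[i]  # running mean
```

In this toy run the population with the largest payoff ends up with at least as many visits as any other, while stagnating populations still receive occasional exploration.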

3. 【2602.20130】To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering

Link: https://arxiv.org/abs/2602.20130

Authors: Zaifu Zhan, Min Zeng, Shuang Zhou, Yiran Song, Xiaoyi Chen, Yu Hou, Yifan Wu, Yang Ruan, Rui Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Toggle, Selective CoT, large language models, Selective, avoiding unnecessary reasoning

Comments:

Abstract:Objective: To improve the efficiency of medical question answering (MedQA) with large language models (LLMs) by avoiding unnecessary reasoning while maintaining accuracy. Methods: We propose Selective Chain-of-Thought (Selective CoT), an inference-time strategy that first predicts whether a question requires reasoning and generates a rationale only when needed. Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks: HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA. Metrics included accuracy, total generated tokens, and inference time. Results: Selective CoT reduced inference time by 13-45% and token usage by 8-47% with minimal accuracy loss (≤4%). In some model-task pairs, it achieved both higher accuracy and greater efficiency than standard CoT. Compared with fixed-length CoT, Selective CoT reached similar or superior accuracy at substantially lower computational cost. Discussion: Selective CoT dynamically balances reasoning depth and efficiency by invoking explicit reasoning only when beneficial, reducing redundancy on recall-type questions while preserving interpretability. Conclusion: Selective CoT provides a simple, model-agnostic, and cost-effective approach for medical QA, aligning reasoning effort with question complexity to enhance real-world deployability of LLM-based clinical systems.
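
The two-step control flow described in the abstract (first decide whether a question needs reasoning, then generate a rationale only if it does) can be sketched as a simple gate. Everything concrete here is hypothetical: the cue-word heuristic `needs_reasoning`, the stand-in answer functions, and the return format are placeholders. The abstract does not say how the paper's predictor works, and in practice the gate and both answer paths would be LLM calls.

```python
from typing import Callable, Tuple

# Hypothetical cue phrases; a stand-in for the paper's (unspecified) predictor.
REASONING_CUES = ("why", "calculate", "most likely", "next step")

def needs_reasoning(question: str) -> bool:
    """Toy gate: flag questions that look like multi-step reasoning."""
    q = question.lower()
    return any(cue in q for cue in REASONING_CUES)

def selective_cot(
    question: str,
    gate: Callable[[str], bool],
    answer_direct: Callable[[str], str],
    answer_with_cot: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (mode, answer), generating a rationale only when the gate says so."""
    if gate(question):
        return "cot", answer_with_cot(question)
    return "direct", answer_direct(question)

# Placeholder answerers so the sketch runs without a model.
def fake_direct(question: str) -> str:
    return "answer only"

def fake_cot(question: str) -> str:
    return "step-by-step rationale, then answer"

recall_mode, _ = selective_cot(
    "Which vitamin deficiency causes scurvy?",
    needs_reasoning, fake_direct, fake_cot)
reasoning_mode, _ = selective_cot(
    "A 54-year-old presents with chest pain; what is the most likely diagnosis?",
    needs_reasoning, fake_direct, fake_cot)
```

The token and latency savings reported in the abstract would come from the `direct` branch, which skips rationale generation for recall-type questions.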

Submission history: v1 submitted by Zaifu Zhan on Mon, 23 Feb 2026 18:42:50 UTC (1,233 KB)

4. 【2602.20122】NanoKnow: How to Know What Your Language Model Knows

Link: https://arxiv.org/abs/2602.20122

Authors: Lingwei Gu, Nour Jedidi, Jimmy Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: large language models, large language, pre-training data, pre-training, language models

Comments:

Abstract:How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box" -- unknown or inaccessible. The recent release of nanochat -- a family of small LLMs with fully open pre-training data -- addresses this as it provides a transparent view into where a model's parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre-training corpus. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. To demonstrate NanoKnow's utility, we conduct experiments using eight nanochat checkpoints. Our findings show: (1) closed-book accuracy is strongly influenced by answer frequency in the pre-training data, (2) providing external evidence can mitigate this frequency dependence, (3) even with external evidence, models are more accurate when answers were seen during pre-training, demonstrating that parametric and external knowledge are complementary, and (4) non-relevant information is harmful, with accuracy decreasing based on both the position and the number of non-relevant contexts. We release all NanoKnow artifacts at this https URL.

5. 【2602.20092】BabyLM Turns 4: Call for Papers for the 2026 BabyLM Workshop

Link: https://arxiv.org/abs/2602.20092

Authors: Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Jaap Jumelet, Tal Linzen, Aaron Mueller, Suchir Salhan, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox

Categories: Computation and Language (cs.CL)

Keywords: cognitive modeling, aims to dissolve, dissolve the boundaries, boundaries between cognitive, BabyLM aims

Comments: 8 pages, 1 table. arXiv admin note: substantial text overlap with [arXiv:2502.10645](https://arxiv.org/abs/2502.10645)

Abstract:BabyLM aims to dissolve the boundaries between cognitive modeling and language modeling. We call for both workshop papers and for researchers to join the 4th BabyLM competition. As in previous years, we call for participants in the data-efficient pretraining challenge in the general track. This year, we also offer a new track: Multilingual. We also call for papers outside the competition in any relevant areas. These include training efficiency, cognitively plausible research, weak model evaluation, and more.

6. 【2602.20091】How Retrieved Context Shapes Internal Representations in RAG

Link: https://arxiv.org/abs/2602.20091

Authors: Samuel Yeh, Sharon Li

Categories: Computation and Language (cs.CL)

Keywords: large language models, enhances large language, Retrieval-augmented generation, enhances large, language models

Comments:

Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by conditioning generation on retrieved external documents, but the effect of retrieved context is often non-trivial. In realistic retrieval settings, the retrieved document set often contains a mixture of documents that vary in relevance and usefulness. While prior work has largely examined these phenomena through output behavior, little is known about how retrieved context shapes the internal representations that mediate information integration in RAG. In this work, we study RAG through the lens of latent representations. We systematically analyze how different types of retrieved documents affect the hidden states of LLMs, and how these internal representation shifts relate to downstream generation behavior. Across four question-answering datasets and three LLMs, we analyze internal representations under controlled single- and multi-document settings. Our results reveal how context relevancy and layer-wise processing influence internal representations, providing explanations of LLMs' output behaviors and insights for RAG system design.

7. 【2602.20065】Multilingual Large Language Models do not comprehend all natural languages to equal degrees

Link: https://arxiv.org/abs/2602.20065

Authors: Natalia Moskvina, Raquel Montero, Masaya Yoshida, Ferdy Hubers, Paolo Morosi, Walid Irhaymi, Jin Yan, Tamara Serrano, Elena Pagliarini, Fritz Günther, Evelina Leivada

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, humans access information, Large Language, play a critical, access information

Comments: 36 pages, 3 figures, 2 tables, 4 supplementary tables

Abstract:Large Language Models (LLMs) play a critical role in how humans access information. While their core use relies on comprehending written requests, our understanding of this ability is currently limited, because most benchmarks evaluate LLMs in high-resource languages predominantly spoken by Western, Educated, Industrialised, Rich, and Democratic (WEIRD) communities. The default assumption is that English is the best-performing language for LLMs, while smaller, low-resource languages are linked to less reliable outputs, even in multilingual, state-of-the-art models. To track variation in the comprehension abilities of LLMs, we prompt 3 popular models on a language comprehension task across 12 languages, representing the Indo-European, Afro-Asiatic, Turkic, Sino-Tibetan, and Japonic language families. Our results suggest that the models exhibit remarkable linguistic accuracy across typologically diverse languages, yet they fall behind human baselines in all of them, albeit to different degrees. Contrary to what was expected, English is not the best-performing language, as it was systematically outperformed by several Romance languages, even lower-resource ones. We frame the results by discussing the role of several factors that drive LLM performance, such as tokenization, language distance from Spanish and English, size of training data, and data origin in high- vs. low-resource languages and WEIRD vs. non-WEIRD communities.

8. 【2602.20052】Entropy in Large Language Models

Link: https://arxiv.org/abs/2602.20052

Authors: Marco Scharringhausen

Categories: Computation and Language (cs.CL)

Keywords: American National Corpus, Open American National, finite alphabet, information source generating, generating an unlimited

Comments: 7 pages, 2 figures, 3 tables

Abstract:In this study, the output of large language models (LLMs) is considered an information source generating an unlimited sequence of symbols drawn from a finite alphabet. Given the probabilistic nature of modern LLMs, we assume a probabilistic model for these LLMs, following a constant random distribution and the source itself thus being stationary. We compare this source entropy (per word) to that of natural language (written or spoken) as represented by the Open American National Corpus (OANC). Our results indicate that the word entropy of such LLMs is lower than the word entropy of natural speech in both written and spoken form. The long-term goal of such studies is to formalize the intuitions of information and uncertainty in large language training to assess the impact of training an LLM from LLM generated training data. This refers to texts from the world wide web in particular.
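
For readers unfamiliar with the quantity being compared, per-word entropy of a stationary source can be estimated from word frequencies as H = -Σ p(w) log2 p(w). The sketch below computes this unigram estimate for a piece of text. It is a simplified illustration only: the whitespace tokenization and the function name `word_entropy` are assumptions, and the paper's exact estimator is not given in the abstract.

```python
import math
from collections import Counter

def word_entropy(text: str) -> float:
    """Shannon entropy in bits per word of the empirical unigram distribution."""
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Four equally likely words carry 2 bits each; a skewed distribution carries less.
uniform = word_entropy("a b c d")
skewed = word_entropy("a a a b")
```

A lower value means the word choice is more predictable, which is the direction the abstract reports for LLM output relative to the OANC.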

9. 【2602.20042】Position: General Alignment Has Hit a Ceiling; Edge Alignment Must Be Taken Seriously

Link: https://arxiv.org/abs/2602.20042

Authors: Han Bao, Yue Huang, Xiaoda Wang, Zheyuan Zhang, Yujun Zhou, Carl Yang, Xiangliang Zhang, Yanfang Ye

Categories: Computation and Language (cs.CL)

Keywords: Large language models, current alignment practice, Large language, complex socio-technical systems, language models

Comments: 26 pages, 5 figures

Abstract:Large language models are being deployed in complex socio-technical systems, which exposes limits in current alignment practice. We take the position that the dominant paradigm of General Alignment, which compresses diverse human values into a single scalar reward, reaches a structural ceiling in settings with conflicting values, plural stakeholders, and irreducible uncertainty. These failures follow from the mathematics and incentives of scalarization and lead to structural value flattening, normative representation loss, and cognitive uncertainty blindness. We introduce Edge Alignment as a distinct approach in which systems preserve multi-dimensional value structure, support plural and democratic representation, and incorporate epistemic mechanisms for interaction and clarification. To make this approach practical, we propose seven interdependent pillars organized into three phases. We identify key challenges in data collection, training objectives, and evaluation, outlining complementary technical and governance directions. Taken together, these measures reframe alignment as a lifecycle problem of dynamic normative governance rather than as a single-instance optimization task.

10. 【2602.20040】AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization

Link: https://arxiv.org/abs/2602.20040

Authors: Fahmida Liza Piya, Rahmatollah Beheshti

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, maintaining factual consistency, factual consistency remains, consistency remains challenging, remains challenging due

Comments:

Abstract:Large language models (LLMs) offer substantial promise for automating clinical text summarization, yet maintaining factual consistency remains challenging due to the length, noise, and heterogeneity of clinical documentation. We present AgenticSum, an inference-time, agentic framework that separates context selection, generation, verification, and targeted correction to reduce hallucinated content. The framework decomposes summarization into coordinated stages that compress task-relevant context, generate an initial draft, identify weakly supported spans using internal attention grounding signals, and selectively revise flagged content under supervisory control. We evaluate AgenticSum on two public datasets, using reference-based metrics, LLM-as-a-judge assessment, and human evaluation. Across various measures, AgenticSum demonstrates consistent improvements compared to vanilla LLMs and other strong baselines. Our results indicate that structured, agentic design with targeted correction offers an effective inference time solution to improve clinical note summarization using LLMs.

11. 【2602.20020】gencat: Generative computerized adaptive testing

Link: https://arxiv.org/abs/2602.20020

Authors: Wanyong Feng, Andrew Lan

Categories: Computation and Language (cs.CL)

Keywords: Existing computerized Adaptive, computerized Adaptive Testing, computerized Adaptive, Adaptive Testing, typically built

Comments: 19 pages, 2 figures

Abstract:Existing Computerized Adaptive Testing (CAT) frameworks are typically built on predicting the correctness of a student response to a question. Although effective, this approach fails to leverage textual information in questions and responses, especially for open-ended questions. In this work, we propose GENCAT (GENerative CAT), a novel CAT framework that leverages Large Language Models for knowledge estimation and question selection. First, we develop a Generative Item Response Theory (GIRT) model that enables us to estimate student knowledge from their open-ended responses and predict responses to unseen questions. We train the model in a two-step process, first via Supervised Fine-Tuning and then via preference optimization for knowledge-response alignment. Second, we introduce three question selection algorithms that leverage the generative capabilities of the GIRT model, based on the uncertainty, linguistic diversity, and information of sampled student responses. Third, we conduct experiments on two real-world programming datasets and demonstrate that GENCAT outperforms existing CAT baselines, achieving an AUC improvement of up to 4.32% in the key early testing stages.

12. 【2602.20017】QUIETT: Query-Independent Table Transformation for Robust Reasoning

Link: https://arxiv.org/abs/2602.20017

Authors: Gaurav Najpande, Tampu Ravi Kumar, Manan Roy Choudhury, Neha Valeti, Yanjie Fu, Vivek Gupta

Categories: Computation and Language (cs.CL)

Keywords: implicit relational structure, exhibit irregular schemas, heterogeneous value formats, Real-world tables, relational structure

Comments:

Abstract:Real-world tables often exhibit irregular schemas, heterogeneous value formats, and implicit relational structure, which degrade the reliability of downstream table reasoning and question answering. Most existing approaches address these issues in a query-dependent manner, entangling table cleanup with reasoning and thus limiting generalization. We introduce QuIeTT, a query-independent table transformation framework that preprocesses raw tables into a single SQL-ready canonical representation before any test-time queries are observed. QuIeTT performs lossless schema and value normalization, exposes implicit relations, and preserves full provenance via raw table snapshots. By decoupling table transformation from reasoning, QuIeTT enables cleaner, more reliable, and highly efficient querying without modifying downstream models. Experiments on four benchmarks, WikiTQ, HiTab, NQ-Table, and SequentialQA show consistent gains across models and reasoning paradigms, with particularly strong improvements on a challenge set of structurally diverse, unseen questions.

13. 【2602.19991】Cross-lingual Matryoshka Representation Learning across Speech and Text

Link: https://arxiv.org/abs/2602.19991

Authors: Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina

Categories: Computation and Language (cs.CL)

Keywords: under-represented languages face, Speakers of under-represented, language barrier, primarily oral, under-represented languages

Comments: Preprint, under review

Abstract:Speakers of under-represented languages face both a language barrier, as most online knowledge is in a few dominant languages, and a modality barrier, since information is largely text-based while many languages are primarily oral. We address this for French-Wolof by training the first bilingual speech-text Matryoshka embedding model, enabling efficient retrieval of French text from Wolof speech queries without relying on costly ASR-translation pipelines. We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best. Although trained only for retrieval, the model generalizes well to other tasks, such as speech intent detection, indicating the learning of general semantic representations. Finally, we analyze cost-accuracy trade-offs across Matryoshka dimensions and ranks, showing that information is concentrated only in a few components, suggesting potential for efficiency improvements.
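
The cost-accuracy trade-off across Matryoshka dimensions rests on a general property of Matryoshka embeddings: a prefix of the vector is itself a usable embedding, so retrieval can run on fewer dimensions. A minimal sketch of that mechanism follows. The vectors are made-up numbers rather than model outputs, and the helper names are hypothetical.

```python
import math
from typing import List

def truncate_and_normalize(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale the prefix to unit length."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in prefix]

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

# Made-up 4-dimensional embeddings standing in for a speech query and a text passage.
query = [0.9, 0.1, 0.3, -0.2]
passage = [0.8, 0.2, -0.4, 0.5]

# Score the same pair at a cheap 2-dimensional prefix and at full size.
sim_small = cosine(truncate_and_normalize(query, 2), truncate_and_normalize(passage, 2))
sim_full = cosine(truncate_and_normalize(query, 4), truncate_and_normalize(passage, 4))
```

Smaller prefixes shrink index size and dot-product cost. How much ranking quality they retain depends on the trained model, which is the question the paper's analysis addresses.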

14. 【2602.19969】ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting

Link: https://arxiv.org/abs/2602.19969

Authors: Yuxing Tian, Fengran Mo, Weixu Zhang, Yiyan Qi, Jian-Yun Nie

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, recent Large Language, Language Models, Large Language, zero-shot re-ranking task

Comments: Accepted by EACL2026

Abstract:The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for the zero-shot re-ranking task. Attention-based re-ranking methods, which derive relevance scores directly from attention weights, offer an efficient and interpretable alternative to generation-based re-ranking methods. However, they still face two major limitations. First, attention signals are highly concentrated on a small subset of tokens within a few documents, making others indistinguishable. Second, attention often overemphasizes phrases lexically similar to the query, yielding biased rankings in which irrelevant documents with mere lexical resemblance are regarded as relevant. In this paper, we propose ReAttn, a post-hoc re-weighting strategy for attention-based re-ranking methods. It first computes the cross-document IDF weighting to down-weight attention on query-overlapping tokens that frequently appear across the candidate documents, reducing lexical bias and emphasizing distinctive terms. It then employs entropy-based regularization to mitigate over-concentrated attention, encouraging a more balanced distribution across informative tokens. Both adjustments operate directly on existing attention weights without additional training or supervision. Extensive experiments demonstrate the effectiveness of our method.
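
The first of the two adjustments, cross-document IDF weighting, can be illustrated with a small sketch. The abstract gives no formulas, so the smoothed IDF definition, the normalization by the maximum IDF, and the function names below are assumptions chosen for illustration. The entropy-based regularization step is not shown, and the attention values are made up.

```python
import math
from typing import Dict, List, Set

def cross_document_idf(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed IDF over the candidate set: tokens found in many candidates score low."""
    n = len(docs)
    df: Dict[str, int] = {}
    for tokens in docs:
        for tok in set(tokens):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log((n + 1) / (count + 1)) + 1.0 for tok, count in df.items()}

def reweighted_scores(docs: List[List[str]], attention: List[List[float]],
                      query_tokens: Set[str]) -> List[float]:
    """Sum attention per document, shrinking it on query-overlapping common tokens."""
    idf = cross_document_idf(docs)
    max_idf = max(idf.values())
    scores = []
    for tokens, weights in zip(docs, attention):
        score = 0.0
        for tok, w in zip(tokens, weights):
            # Only tokens that also occur in the query are down-weighted.
            factor = idf[tok] / max_idf if tok in query_tokens else 1.0
            score += w * factor
        scores.append(score)
    return scores

# Toy candidates: every document mentions "jaguar", only one mentions "habitat".
docs = [["jaguar", "car", "speed"],
        ["jaguar", "habitat", "rainforest"],
        ["jaguar", "logo"]]
attention = [[0.9, 0.05, 0.05], [0.5, 0.4, 0.1], [0.9, 0.1]]  # made-up weights
scores = reweighted_scores(docs, attention, {"jaguar", "habitat"})
```

Before re-weighting, all three toy documents have the same total attention. Afterwards, the document whose attention falls on the distinctive query term "habitat" scores highest, because attention on the ubiquitous term "jaguar" is discounted.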

15. 【2602.19961】Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval

Link: https://arxiv.org/abs/2602.19961

Authors: Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Yuanhuiyi Lyu, Yu Huang, Jungang Li, Kening Zheng, Xu Zheng, Philip S. Yu, James Kwok, Xuming Hu

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: precise information acquisition, unstructured visually rich, visually rich data, Visual Document Retrieval, visual documents exhibit

Comments: Under review

Abstract:With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and Agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising future directions, aiming to provide a clear roadmap for future multimodal document intelligence.

16. 【2602.19948】Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming

Link: https://arxiv.org/abs/2602.19948

Authors: Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)

Keywords: Large Language Models, current safety benchmarks, longitudinal risks inherent, mental health support, Large Language

Comments: This paper is a condensed version of the first author's Ph.D. dissertation submitted to Northeastern University

Abstract:Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue. We introduce an evaluation framework that pairs AI psychotherapists with simulated patient agents equipped with dynamic cognitive-affective models and assesses therapy session simulations against a comprehensive quality of care and risk ontology. We apply this framework to a high-impact test case, Alcohol Use Disorder, evaluating six AI agents (including ChatGPT, Gemini, and this http URL) against a clinically-validated cohort of 15 patient personas representing diverse clinical phenotypes. Our large-scale simulation (N=369 sessions) reveals critical safety gaps in the use of AI for mental health support. We identify specific iatrogenic risks, including the validation of patient delusions ("AI Psychosis") and failure to de-escalate suicide risk. Finally, we validate an interactive data visualization dashboard with diverse stakeholders, including AI engineers and red teamers, mental health professionals, and policy experts (N=9), demonstrating that this framework effectively enables stakeholders to audit the "black box" of AI psychotherapy. These findings underscore the critical safety risks of AI-provided mental health support and the necessity of simulation-based clinical red teaming before deployment.

17. 【2602.19919】Janus-Q: End-to-End Event-Driven Trading via Hierarchical-Gated Reward Modeling

Link: https://arxiv.org/abs/2602.19919

Authors: Xiang Li, Zikai Wei, Yiyan Qi, Wanyun Zhou, Xiang Liu, Penglei Sun, Yongqi Zhang, Xiaowen Chu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: purely numerical prediction, numerical prediction objectives, impacts are heterogeneous, driven by discrete, purely numerical

Comments:

Abstract:Financial market movements are often driven by discrete financial events conveyed through news, whose impacts are heterogeneous, abrupt, and difficult to capture under purely numerical prediction objectives. These limitations have motivated growing interest in using textual information as the primary source of trading signals in learning-based systems. Two key challenges hinder existing approaches: (1) the absence of large-scale, event-centric datasets that jointly model news semantics and statistically grounded market reactions, and (2) the misalignment between language model reasoning and financially valid trading behavior under dynamic market conditions. To address these challenges, we propose Janus-Q, an end-to-end event-driven trading framework that elevates financial news events from auxiliary signals to primary decision units. Janus-Q unifies event-centric data construction and model optimization under a two-stage paradigm. Stage I focuses on event-centric data construction, building a large-scale financial news event dataset comprising 62,400 articles annotated with 10 fine-grained event types, associated stocks, sentiment labels, and event-driven cumulative abnormal return (CAR). Stage II performs decision-oriented fine-tuning, combining supervised learning with reinforcement learning guided by a Hierarchical Gated Reward Model (HGRM), which explicitly captures trade-offs among multiple trading objectives. Extensive experiments demonstrate that Janus-Q achieves more consistent, interpretable, and profitable trading decisions than market indices and LLM baselines, improving the Sharpe Ratio by up to 102.0% while increasing direction accuracy by over 17.5% compared to the strongest competing strategies.

18. 【2602.19895】DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

链接https://arxiv.org/abs/2602.19895

作者:Zhongwei Wan,Yun Shen,Zhihao Dou,Donghao Zhou,Yu Zhang,Xin Wang,Hui Shen,Jing Xiong,Chaofan Tao,Zixuan Zhong,Peizhou Huang,Mi Zhang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:large language model, improving large language, language model, central paradigm, paradigm for improving

备注

点击查看摘要

Abstract:Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at this https URL.
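The local term described above can be pictured with a small sketch. It assumes that "length-invariant" means averaging the per-token Shannon entropy over each trajectory, and that the regularizer only looks at trajectories marked correct; the abstract does not give the exact formula, so both choices and all function names are assumptions.

```python
import math
from typing import List, Sequence

Distribution = Sequence[float]  # next-token probabilities at one step


def token_entropy(dist: Distribution) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def mean_token_entropy(trajectory: List[Distribution]) -> float:
    """Average entropy per token, so long and short trajectories are comparable."""
    return sum(token_entropy(d) for d in trajectory) / len(trajectory)


def correct_only_entropy(trajectories: List[List[Distribution]],
                         correct: List[bool]) -> float:
    """Mean per-token entropy, averaged over correct, non-empty trajectories."""
    values = [mean_token_entropy(t)
              for t, ok in zip(trajectories, correct) if ok and t]
    return sum(values) / len(values) if values else 0.0


short = [[0.5, 0.5]] * 2   # two tokens, uniform over two choices
long_ = [[0.5, 0.5]] * 8   # eight tokens, same per-token entropy
wrong = [[1.0, 0.0]]       # incorrect trajectory, ignored
bonus = correct_only_entropy([short, long_, wrong], [True, True, False])
```

Because the entropy is averaged per token, the two correct trajectories contribute equally despite their different lengths.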

19. 【2602.19883】Denotational Semantics for ODRL: Knowledge-Based Constraint Conflict Detection

链接https://arxiv.org/abs/2602.19883

作者:Daham Mustafa,Diego Collarana,Yixin Peng,Rafiqul Haque,Christoph Lange-Bever,Christoph Quix,Stephan Decker

类目:Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

关键词:specification leaves unspecified, set-based operators, depend on external, specification leaves, leaves unspecified

备注: 17 pages, 6 tables. Working draft. Supplementary material (154 TPTP/SMT-LIB benchmarks, Isabelle/HOL theory file) will be made available at [this https URL](https://github.com/Daham-Mustaf/odrl-benchmark) upon publication

点击查看摘要

Abstract:ODRL's six set-based operators -- isA, isPartOf, hasPart, isAnyOf, isAllOf, isNoneOf -- depend on external domain knowledge that the W3C specification leaves unspecified. Without it, every cross-dataspace policy comparison defaults to Unknown. We present a denotational semantics that maps each ODRL constraint to the set of knowledge-base concepts satisfying it. Conflict detection reduces to denotation intersection under a three-valued verdict -- Conflict, Compatible, or Unknown -- that is sound under incomplete knowledge. The framework covers all three ODRL composition modes (and, or, xone) and all three semantic domains arising in practice: taxonomic (class subsumption), mereological (part-whole containment), and nominal (identity). For cross-dataspace interoperability, we define order-preserving alignments between knowledge bases and prove two guarantees: conflicts are preserved across different KB standards, and unmapped concepts degrade gracefully to Unknown -- never to false conflicts. A runtime soundness theorem ensures that design-time verdicts hold for all execution contexts. The encoding stays within the decidable EPR fragment of first-order logic. We validate it with 154 benchmarks across six knowledge base families (GeoNames, ISO 3166, W3C DPV, a GDPR-derived taxonomy, BCP 47, and ISO 639-3) and four structural KBs targeting adversarial edge cases. Both the Vampire theorem prover and the Z3 SMT solver agree on all 154 verdicts. A key finding is that exclusive composition (xone) requires strictly stronger KB axioms than conjunction or disjunction: open-world semantics blocks exclusivity even when positive evidence appears to satisfy exactly one branch.
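A minimal sketch of conflict detection as denotation intersection, assuming a toy part-whole knowledge base and treating the knowledge base as complete for every concept it mentions. The place names, the `PARENT` table, and the function names are hypothetical, and the paper's actual encoding (first-order logic checked with Vampire and Z3) is not reproduced here.

```python
from enum import Enum
from typing import Dict, FrozenSet, Optional


class Verdict(Enum):
    CONFLICT = "Conflict"
    COMPATIBLE = "Compatible"
    UNKNOWN = "Unknown"


# Hypothetical mereological knowledge base: part -> whole.
PARENT: Dict[str, str] = {
    "Berlin": "Germany", "Munich": "Germany", "Paris": "France",
    "Germany": "Europe", "France": "Europe",
}


def is_part_of(concept: str) -> Optional[FrozenSet[str]]:
    """Denotation of `isPartOf concept`: the concept and everything below it.

    Returns None when the concept is not in the knowledge base at all.
    """
    known = set(PARENT) | set(PARENT.values())
    if concept not in known:
        return None
    members = set()
    for candidate in known:
        node: Optional[str] = candidate
        while node is not None:          # walk upwards towards the root
            if node == concept:
                members.add(candidate)
                break
            node = PARENT.get(node)
    return frozenset(members)


def detect(a: Optional[FrozenSet[str]], b: Optional[FrozenSet[str]]) -> Verdict:
    """Three-valued verdict: unmapped concepts yield Unknown, never Conflict."""
    if a is None or b is None:
        return Verdict.UNKNOWN
    return Verdict.COMPATIBLE if a & b else Verdict.CONFLICT
```

With this toy data, `isPartOf Germany` and `isPartOf France` have disjoint denotations and yield Conflict, while any constraint over a concept missing from the knowledge base yields Unknown.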

20. 【2602.19878】Axis Decomposition for ODRL: Resolving Dimensional Ambiguity in Policy Constraints through Interval Semantics

链接https://arxiv.org/abs/2602.19878

作者:Daham Mustafa,Diego Collarana,Yixin Peng,Rafiqul Haque,Christoph Lange-Bever,Christoph Quix,Stephan Decker

类目:Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

关键词:ODRL

备注: 16 pages, 5 tables. Preprint

点击查看摘要

Abstract:Every ODRL 2.2 constraint compares a single scalar value: (leftOperand, operator, rightOperand). Five of ODRL's approximately 34 left operands, however, denote multi-dimensional quantities--image dimensions, canvas positions, geographic coordinates--whose specification text explicitly references multiple axes. For these operands, a single scalar constraint admits one interpretation per axis, making policy evaluation non-deterministic. We classify ODRL's left operands by value-domain structure (scalar, dimensional, concept-valued), grounded in the ODRL 2.2 specification text, and show that dimensional ambiguity is intrinsic to the constraint syntax. We present an axis-decomposition framework that refines each dimensional operand into axis-specific scalar operands and prove four properties: deterministic interpretation, AABB completeness, sound over-approximation under projection, and conservative extension. Conflict detection operates in two layers: per-axis verdicts are always decidable; box-level verdicts compose through Strong Kleene conjunction into a three-valued logic (Conflict, Compatible, Unknown). For ODRL's disjunctive (odrl:or) and exclusive-or (odrl:xone) logical constraints, where per-axis decomposition does not apply, the framework encodes coupled multi-axis conjectures directly. We instantiate the framework as the ODRL Spatial Axis Profile--15 axis-specific left operands for the five affected base terms--and evaluate it on 117 benchmark problems spanning nine categories across both TPTP FOF (Vampire) and SMT-LIB (Z3) encodings, achieving full concordance between provers. Benchmark scenarios are inspired by constraints arising in cultural heritage dataspaces such as Datenraum Kultur. All meta-theorems are mechanically verified in Isabelle/HOL.
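The two-layer idea (per-axis interval checks composed with Strong Kleene conjunction) can be sketched as follows. The mapping Compatible = true, Conflict = false, and the use of Unknown for an axis that one side leaves unresolved, are assumptions made for illustration; axis names and helper functions are hypothetical.

```python
from enum import Enum
from typing import Dict, Iterable, Optional, Tuple


class Verdict(Enum):
    CONFLICT = "Conflict"
    COMPATIBLE = "Compatible"
    UNKNOWN = "Unknown"


Interval = Tuple[float, float]  # closed interval [low, high]


def axis_verdict(a: Optional[Interval], b: Optional[Interval]) -> Verdict:
    """Two closed intervals are compatible exactly when they overlap."""
    if a is None or b is None:
        return Verdict.UNKNOWN
    overlaps = max(a[0], b[0]) <= min(a[1], b[1])
    return Verdict.COMPATIBLE if overlaps else Verdict.CONFLICT


def kleene_and(verdicts: Iterable[Verdict]) -> Verdict:
    """Strong Kleene conjunction: false dominates, then unknown, then true."""
    seen = list(verdicts)
    if any(v is Verdict.CONFLICT for v in seen):
        return Verdict.CONFLICT
    if all(v is Verdict.COMPATIBLE for v in seen):
        return Verdict.COMPATIBLE
    return Verdict.UNKNOWN


def box_verdict(a: Dict[str, Interval], b: Dict[str, Interval]) -> Verdict:
    """Compose per-axis verdicts of two axis-aligned boxes."""
    axes = sorted(set(a) | set(b))
    return kleene_and(axis_verdict(a.get(ax), b.get(ax)) for ax in axes)


image = {"width": (0, 800), "height": (0, 600)}
too_wide = {"width": (1000, 1200)}
overlapping = {"width": (400, 1000)}
```

A single conflicting axis settles the box-level verdict even when another axis is unresolved, which is the behaviour that distinguishes Strong Kleene conjunction from simply propagating Unknown.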

21. 【2602.19855】SHIELD: Semantic Heterogeneity Integrated Embedding for Latent Discovery in Clinical Trial Safety Signals

链接https://arxiv.org/abs/2602.19855

作者:Francois Vandenhende,Anna Georgiou,Theodoros Psaras,Ellie Karekla

类目:Computation and Language (cs.CL)

关键词:methodology for automated, automated and integrated, present SHIELD, SHIELD combines disproportionality, SHIELD

备注: 3 figures, 1 table

点击查看摘要

Abstract:We present SHIELD, a novel methodology for automated and integrated safety signal detection in clinical trials. SHIELD combines disproportionality analysis with semantic clustering of adverse event (AE) terms applied to MedDRA term embeddings. For each AE, the pipeline computes an information-theoretic disproportionality measure (Information Component) with effect size derived via empirical Bayesian shrinkage. A utility matrix is constructed by weighting semantic term-term similarities by signal magnitude, followed by spectral embedding and clustering to identify groups of related AEs. Resulting clusters are annotated with syndrome-level summary labels using large language models, yielding a coherent, data-driven representation of treatment-associated safety profiles in the form of a network graph and hierarchical tree. We implement the SHIELD framework in the context of a single-arm incidence summary, to compare two treatment arms or for the detection of any treatment effect in a multi-arm trial. We illustrate its ability to recover known safety signals and generate interpretable, cluster-based summaries in a real clinical trial example. This work bridges statistical signal detection with modern natural language processing to enhance safety assessment and causal interpretation in clinical trials.
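As a rough illustration of the disproportionality step, the Information Component compares the observed count of an adverse event with the count expected under independence. The sketch below uses the additive 0.5 shrinkage that is common in the pharmacovigilance literature; the abstract does not state SHIELD's exact estimator, so the shrinkage constant, argument names, and example counts are assumptions.

```python
import math


def information_component(observed: int, n_arm: int, n_event: int,
                          n_total: int, shrink: float = 0.5) -> float:
    """Shrunk log2 ratio of observed to expected adverse-event counts.

    expected = n_arm * n_event / n_total is the count under independence
    between treatment arm and event.
    """
    expected = n_arm * n_event / n_total
    return math.log2((observed + shrink) / (expected + shrink))


# Made-up counts: 100 of 200 patients are in the arm, 10 events overall.
ic_signal = information_component(observed=8, n_arm=100, n_event=10, n_total=200)
ic_neutral = information_component(observed=5, n_arm=100, n_event=10, n_total=200)
```

Values above zero indicate that the event occurs more often in the arm than expected; the shrinkage term pulls estimates based on small counts towards zero.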

22. 【2602.19840】SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation

链接https://arxiv.org/abs/2602.19840

作者:Jingzhuo Wu,Jiajun Zhang,Keyan Jin,Dehua Ma,Junbo Wang

类目:Computation and Language (cs.CL)

关键词:Modern large language, large language models, Modern large, language models, excel at generating

备注

点击查看摘要

Abstract:Modern large language models (LLMs) excel at generating fluent and faithful translations. However, they struggle to preserve an author's unique literary style, often producing semantically correct but generic outputs. This limitation stems from the inability of current single-model and static multi-agent systems to perceive and adapt to stylistic variations. To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task. Specifically, our method quantifies literary style into a Stylistic Feature Spectrum (SFS) using the wavelet packet transform. This SFS serves as a control signal to dynamically assemble a tailored workflow of specialized translation agents based on the source text's structural patterns. Extensive experiments on translation benchmarks show that SAMAS achieves competitive semantic accuracy against strong baselines, primarily by leveraging its statistically significant advantage in style fidelity.

23. 【2602.19815】Keyboards for the Endangered Idu Mishmi Language

链接https://arxiv.org/abs/2602.19815

作者:Akhilesh Kakolu Ramarao

类目:Computation and Language (cs.CL)

关键词:Arunachal Pradesh, people in Arunachal, Trans-Himalayan language spoken, Idu Mishmi, desktop keyboard suite

备注

点击查看摘要

Abstract:We present a mobile and desktop keyboard suite for Idu Mishmi, an endangered Trans-Himalayan language spoken by approximately 11,000 people in Arunachal Pradesh, India. Although a Latin-based orthography was developed in 2018, no digital input tools existed to use it, forcing speakers into ad-hoc romanizations that cannot represent the full writing system. Our keyboards comprise two tools: (1) an Android mobile keyboard, published on the Google Play Store and actively used in teacher training programs, and (2) a Windows desktop keyboard currently undergoing community testing. Both tools support the complete Idu Mishmi character inventory, including schwa, retracted schwa, nasalized vowels, and accented forms. Both operate fully offline with zero network permissions, addressing connectivity constraints and data sovereignty concerns. We describe the design, implementation, and deployment as a replicable model for other endangered language communities.

24. 【2602.19743】NILE: Formalizing Natural-Language Descriptions of Formal Languages

链接https://arxiv.org/abs/2602.19743

作者:Tristan Kneisel,Marko Schmellenkamp,Fabian Vehlken,Thomas Zeume

类目:Formal Languages and Automata Theory (cs.FL); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

关键词:formal languages, natural-language descriptions, formal, languages, formal language

备注

点击查看摘要

Abstract:This paper explores how natural-language descriptions of formal languages can be compared to their formal representations and how semantic differences can be explained. This is motivated by educational scenarios where learners describe a formal language (presented, e.g., by a finite state automaton, regular expression, pushdown automaton, context-free grammar or in set notation) in natural language, and an educational support system has to (1) judge whether the natural-language description accurately describes the formal language, and (2) provide explanations why descriptions are not accurate. To address this question, we introduce a representation language for formal languages, Nile, which is designed so that Nile expressions can mirror the syntactic structure of natural-language descriptions of formal languages. Nile is sufficiently expressive to cover a broad variety of formal languages, including all regular languages and fragments of context-free languages typically used in educational contexts. Generating Nile expressions that are syntactically close to natural-language descriptions then makes it possible to provide explanations for inaccuracies in the descriptions algorithmically. In experiments on an educational data set, we show that LLMs can translate natural-language descriptions into equivalent, syntactically close Nile expressions with high accuracy, which makes it possible to algorithmically provide explanations for incorrect natural-language descriptions. Our experiments also show that while natural-language descriptions can also be translated into regular expressions (but not context-free grammars), the expressions are often not syntactically close and thus not suitable for providing explanations.

25. 【2602.19643】KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

链接https://arxiv.org/abs/2602.19643

作者:Alex Robertson,Huizhi Liang,Mahbub Gani,Rohit Kumar,Srijith Rajamohan

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, intelligible language, possess a remarkable, remarkable capacity

备注: EACL 2026 Findings

点击查看摘要

Abstract:Large Language Models (LLMs) possess a remarkable capacity to generate persuasive and intelligible language. However, coherence does not equate to truthfulness, as the responses often contain subtle hallucinations. Existing benchmarks are limited by static and narrow questions, leading to limited coverage and misleading evaluations. We present KGHaluBench, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and more comprehensive insight into LLM truthfulness. Our framework utilises the KG to dynamically construct challenging, multifaceted questions, whose difficulty is then statistically estimated to address popularity bias. Our automated verification pipeline detects abstentions and verifies the LLM's response at both conceptual and correctness levels to identify different types of hallucinations. We evaluate 25 frontier models, using novel accuracy and hallucination metrics. The results provide a more interpretable insight into the knowledge factors that cause hallucinations across different model sizes. KGHaluBench is publicly available to support future developments in hallucination mitigation.

26. 【2602.19626】Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding

链接https://arxiv.org/abs/2602.19626

作者:Roberto Tacconelli

类目:Information Theory (cs.IT); Computation and Language (cs.CL)

关键词:lightweight online predictors, lossless compression system, arithmetic coder, transformer language model, natural language text

备注: 10 pages

点击查看摘要

Abstract:We present Nacrith, a lossless compression system that combines a 135M-parameter transformer language model (SmolLM2-135M) with an ensemble of lightweight online predictors and a 32-bit arithmetic coder, achieving the best compression results among the systems evaluated in this study on natural language text. Beyond the base LLM-plus-arithmetic-coding paradigm, Nacrith introduces several contributions: (1) a CDF precision upgrade from 2^16 to 2^24 that eliminates ~75% of quantization overhead caused by minimum-probability floors in large vocabularies; (2) a token-level N-gram model for fast local predictions; (3) an adaptive log-space bias head correcting per-document LLM errors via online gradient descent; (4) confidence-based LLM skip for accelerating highly predictable tokens; (5) a hybrid binary format (NC06) extending neural compression to arbitrary binary files--to our knowledge a first among LLM-based compressors; (6) a llama.cpp inference backend achieving ~7x faster single-token decode than PyTorch; (7) parallel multi-GPU compression across up to 8 workers; and (8) native KV cache sliding window reducing per-slide cost by ~37x. The system requires only ~500 MB of GGUF weights and ~1.2 GB VRAM per worker, running on consumer GPUs. On alice29 (Canterbury Corpus, 152 KB), Nacrith achieves 0.918 bits per byte (bpb)--outperforming gzip by 3.1x, bzip2 by 2.5x, CMIX v21 by 44%, and ts_zip by 20%, while compressing below the 0th-, 1st-, and 2nd-order byte-level Shannon entropy bounds. On enwik8 (100 MB), Nacrith achieves 0.9389 bpb (11.74%), surpassing ts_zip (~1.11 bpb) by 15% and FineZip (1.024 bpb) by 8% despite using a 60x smaller model with no fine-tuning. An out-of-distribution (OOD) evaluation on a document published after the model's training cutoff confirms these gains are not memorization artifacts, achieving 0.723 bpb on unseen text.
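Two pieces of arithmetic in the abstract can be checked with a short sketch: the conversion from bits per byte to a compressed-size percentage, and the share of an integer CDF range consumed when every vocabulary entry is guaranteed at least one count. The vocabulary size of 49,152 is an assumption chosen for illustration (the abstract does not state it), and the function names are hypothetical.

```python
def bpb_to_ratio(bits_per_byte: float) -> float:
    """Compressed size as a fraction of the original (8 bits per input byte)."""
    return bits_per_byte / 8.0


def floor_fraction(vocab_size: int, cdf_bits: int) -> float:
    """Share of the CDF range reserved by a one-count floor per symbol."""
    return vocab_size / float(2 ** cdf_bits)


ASSUMED_VOCAB = 49_152  # assumption for illustration only

enwik8_percent = round(bpb_to_ratio(0.9389) * 100, 2)
reserved_16 = floor_fraction(ASSUMED_VOCAB, 16)
reserved_24 = floor_fraction(ASSUMED_VOCAB, 24)
```

Under this assumed vocabulary size, a 16-bit CDF would spend most of its range on floors, while a 24-bit CDF spends well under one percent, which gives some intuition for why higher CDF precision matters for large vocabularies.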

27. 【2602.19612】Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning

链接https://arxiv.org/abs/2602.19612

作者:Borisiuk Anna,Andrey Savchenko,Alexander Panchecko,Elena Tutubalina

类目:Computation and Language (cs.CL)

关键词:enables Large Language, Large Language Models, Large Language, enables Large, Machine Unlearning

备注

点击查看摘要

Abstract:Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.

28. 【2602.19598】Eye-Tracking-while-Reading: A Living Survey of Datasets with Open Library Support

链接https://arxiv.org/abs/2602.19598

作者:Deborah N. Jakobi,David R. Reich,Paul Prasse,Jana M. Hofmann,Lena S. Bolliger,Lena A. Jäger

类目:Computation and Language (cs.CL)

关键词:valuable resource, datasets, existing datasets, processes underlying reading, cases range

备注

点击查看摘要

Abstract:Eye-tracking-while-reading corpora are a valuable resource for many different disciplines and use cases. Use cases range from studying the cognitive processes underlying reading to machine-learning-based applications, such as gaze-based assessments of reading comprehension. The past decades have seen an increase in the number and size of eye-tracking-while-reading datasets as well as increasing diversity with regard to the stimulus languages covered, the linguistic background of the participants, or accompanying psychometric or demographic data. The spread of data across different disciplines and the lack of data sharing standards across the communities lead to many existing datasets that cannot be easily reused due to a lack of interoperability. In this work, we aim at creating more transparency and clarity with regards to existing datasets and their features across different disciplines by i) presenting an extensive overview of existing datasets, ii) simplifying the sharing of newly created datasets by publishing a living overview online, this https URL, presenting over 45 features for each dataset, and iii) integrating all publicly available datasets into the Python package pymovements which offers an eye-tracking datasets library. By doing so, we aim to strengthen the FAIR principles in eye-tracking-while-reading research and promote good scientific practices, such as reproducing and replicating studies.

29. 【2602.19583】DEEP: Docker-based Execution and Evaluation Platform

链接https://arxiv.org/abs/2602.19583

作者:Sergio Gómez González,Miguel Domingo,Francisco Casacuberta

类目:Computation and Language (cs.CL)

关键词:Comparative evaluation, recurrent task, Comparative, DEEP, model

备注

点击查看摘要

Abstract:Comparative evaluation of several systems is a recurrent task in research. It is a key step before deciding which system to use for our work, or, once our research has been conducted, to demonstrate the potential of the resulting model. Furthermore, it is the main task in evaluating competitive public challenges. Our proposed software (DEEP) automates both the execution and scoring of machine translation and optical character recognition models. Furthermore, it is easily extensible to other tasks. DEEP is prepared to receive dockerized systems, run them (extracting information at the same time), and assess hypotheses against references. With this approach, evaluators can achieve a better understanding of the performance of each model. Moreover, the software uses a clustering algorithm based on a statistical analysis of the significance of the results yielded by each model, according to the evaluation metrics. As a result, evaluators are able to identify clusters of performance among the swarm of proposals and have a better understanding of the significance of their differences. Additionally, we offer a visualization web-app to ensure that the results can be adequately understood and interpreted. Finally, we present an example use case of DEEP.

30. 【2602.19569】Temporal-Aware Heterogeneous Graph Reasoning with Multi-View Fusion for Temporal Question Answering

链接https://arxiv.org/abs/2602.19569

作者:Wuzhenghong Wen,Bowen Zhou,Jinwen Huang,Xianjie Wu,Yuwei Sun,Su Pan,Liang Li,Jianting Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:handling time-sensitive queries, attracted growing interest, Question Answering, time-sensitive queries, attracted growing

备注: 6 pages

点击查看摘要

Abstract:Question Answering over Temporal Knowledge Graphs (TKGQA) has attracted growing interest for handling time-sensitive queries. However, existing methods still struggle with: 1) weak incorporation of temporal constraints in question representation, causing biased reasoning; 2) limited ability to perform explicit multi-hop reasoning; and 3) suboptimal fusion of language and graph representations. We propose a novel framework with temporal-aware question encoding, multi-hop graph reasoning, and multi-view heterogeneous information fusion. Specifically, our approach introduces: 1) a constraint-aware question representation that combines semantic cues from language models with temporal entity dynamics; 2) a temporal-aware graph neural network for explicit multi-hop reasoning via time-aware message passing; and 3) a multi-view attention mechanism for more effective fusion of question context and temporal graph knowledge. Experiments on multiple TKGQA benchmarks demonstrate consistent improvements over multiple baselines.

31. 【2602.19549】Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework

链接https://arxiv.org/abs/2602.19549

作者:Yibo Yan,Mingdong Ou,Yi Cao,Xin Zou,Jiahao Huo,Shuliang Liu,James Kwok,Xuming Hu

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:Visual Document Retrieval, multimodal retrieval applications, Visual Document, current multimodal retrieval, retrieve relevant pages

备注: Under review

点击查看摘要

Abstract:Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead, a problem that current efficiency methods like pruning and merging address imperfectly, creating a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce Prune-then-Merge, a novel two-stage framework that synergizes these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, creating a refined, high-signal set of embeddings. Subsequently, a hierarchical merging stage compresses this pre-filtered set, effectively summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. Extensive experiments on 29 VDR datasets demonstrate that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and providing robust performance at high compression ratios.

32. 【2602.19548】Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining

链接https://arxiv.org/abs/2602.19548

作者:Jeffrey Li,Josh Gardner,Doug Kang,Fangping Shi,Karanjeet Singh,Chun-Liang Li,Herumb Shandilya,David Hall,Oncel Tuzel,Percy Liang,Ludwig Schmidt,Hadi Pour Ansari,Fartash Faghri

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:web-scale LLM pretraining, constructing web-scale LLM, LLM pretraining datasets, involves extracting text, text from HTML

备注

点击查看摘要

Abstract:One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. We first show that while different extractors may lead to similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially. This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance. We further show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance, with differences of up to 10 percentage points (p.p.) on WikiTQ and 3 p.p. on HumanEval.
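The Union intervention can be pictured as: keep a page if the output of any extractor survives the filter. The sketch below uses two toy regex-based extractors and a word-count filter as stand-ins; real extractors and the DCLM filtering pipeline are far more involved, and the rule of keeping the first passing extraction is an assumption, not something the abstract specifies.

```python
import re
from typing import Callable, Dict, List

Extractor = Callable[[str], str]


def strip_tags(html: str) -> str:
    """Toy extractor: drop every tag and keep all remaining text."""
    return re.sub(r"<[^>]+>", " ", html).strip()


def paragraphs_only(html: str) -> str:
    """Toy extractor: keep only text inside <p> elements."""
    return " ".join(re.findall(r"<p>(.*?)</p>", html, flags=re.S)).strip()


def long_enough(text: str) -> bool:
    """Toy quality filter: at least three whitespace-separated words."""
    return len(text.split()) >= 3


def union_extract(pages: Dict[str, str], extractors: List[Extractor],
                  keep: Callable[[str], bool]) -> Dict[str, str]:
    """Keep a page when any extractor's output passes the filter."""
    kept: Dict[str, str] = {}
    for url, html in pages.items():
        for extract in extractors:
            text = extract(html)
            if keep(text):
                kept[url] = text   # first passing extraction wins (assumption)
                break
    return kept


pages = {
    "a": "<p>one two three four</p>",
    "b": "<div>alpha beta gamma</div>",
    "c": "<p>hi</p>",
}
single = union_extract(pages, [paragraphs_only], long_enough)
union = union_extract(pages, [paragraphs_only, strip_tags], long_enough)
```

In this toy run the single extractor keeps one page while the union keeps two, mirroring how differing extractor outputs can change which pages survive a fixed filter.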

33. 【2602.19543】Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation

链接https://arxiv.org/abs/2602.19543

作者:Rizhuo Huang,Yifan Feng,Rundong Xue,Shihui Ying,Jun-Hai Yong,Chuan Shi,Shaoyi Du,Yue Gao

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:ary atomic facts, hypergraphs surpass traditional, surpass traditional binary, Knowledge hypergraphs surpass, ary atomic

备注

点击查看摘要

Abstract:Knowledge hypergraphs surpass traditional binary knowledge graphs by encapsulating complex n-ary atomic facts, providing a more comprehensive paradigm for semantic representation. However, constructing high-quality hypergraphs remains challenging due to the scenario gap: generic extractors struggle to generalize across diverse domains with specific jargon, while existing methods often fail to balance structural skeletons with fine-grained details. To bridge this gap, we propose Hyper-KGGen, a skill-driven framework that reformulates extraction as a dynamic skill-evolving process. First, Hyper-KGGen employs a coarse-to-fine mechanism to systematically decompose documents, ensuring full-dimensional coverage from binary links to complex hyperedges. Crucially, it incorporates an adaptive skill acquisition module that actively distills domain expertise into a Global Skill Library. This is achieved via a stability-based feedback loop, where extraction stability serves as a relative reward signal to induce high-quality skills from unstable traces and missed predictions. Additionally, we present HyperDocRED, a rigorously annotated benchmark for document-level knowledge hypergraph extraction. Experiments demonstrate that Hyper-KGGen significantly outperforms strong baselines, validating that evolved skills provide substantially richer guidance than static few-shot examples in multi-scenario settings.

34. 【2602.19526】How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1

Link: https://arxiv.org/abs/2602.19526

Authors: Yinuo Xu, Shuo Lu, Jianjie Cheng, Meng Wang, Qianlong Xie, Xingxing Wang, Ran He, Jian Liang

Categories: Computation and Language (cs.CL)

Keywords: agents tackle knowledge-intensive, tackle knowledge-intensive tasks, Research agents tackle, Deep Research agents, decision-oriented generation

Abstract:Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the EM due to training collapse driven by answer avoidance; this can be mitigated by incorporating action-level penalties, ultimately surpassing EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among policy optimization methods. Building on these insights, we then introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings can pave the way for more principled and reliable RL training strategies in Deep Research systems.
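
The reward comparison in this abstract can be made concrete with the standard question-answering definitions of exact match (EM) and token-level F1. The sketch below is not the Search-R1 code: the normalization rules follow common SQuAD-style practice, and `penalized_f1` with its `penalty` value is a hypothetical illustration of an action-level penalty, whose actual form the abstract does not specify.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def penalized_f1(pred: str, gold: str, answered: bool, penalty: float = 0.5) -> float:
    """Hypothetical reward: fixed negative score when no answer action was taken."""
    if not answered:
        return -penalty
    return token_f1(pred, gold)
```

With a plain F1 reward, a rollout that never answers scores 0.0, the same as a wrong answer; the assumed penalty makes avoidance strictly worse than attempting an answer.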

35. 【2602.19517】Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Link: https://arxiv.org/abs/2602.19517

Authors: Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen

Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: STEM domains, CFE, large language models, multimodal benchmark

Abstract:We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE is curated from repeatedly used, authentic university homework and exam problems, together with reference solutions provided by course instructors. CFE presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically have more reasoning steps than those provided by the instructor, indicating suboptimal step efficiency and a higher risk of error accumulation. The data and code are available at this https URL.

36. 【2602.19509】Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Link: https://arxiv.org/abs/2602.19509

Authors: Arindam Khaled

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, Language Models, face a persistent, reasoning capability

Comments: 6 pages, 4 figures, 1 table

Abstract:Large Language Models (LLMs) face a persistent trade-off between inference cost and reasoning capability. While "Oracle" models (e.g., Llama-3-70B) achieve state-of-the-art accuracy, they are prohibitively expensive for high-volume deployment. Smaller models (e.g., 8B parameters) are cost-effective but struggle with complex tasks. In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary. By leveraging semantic agreement and confidence calibration among an ensemble of small models, our Router identifies "hard" problems with high precision. On the GSM8K benchmark, our system achieves 93.0% accuracy, effectively matching the Oracle baseline (98.0%) while reducing compute costs by 61%. We demonstrate that the system introduces negligible latency overhead (+0.82s) and allows for a tunable trade-off between performance and budget.
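
The escalation idea described here can be sketched as a simple agreement check over an ensemble of small models. This is a minimal illustration, not the paper's Router: it uses exact-string majority voting as a stand-in for the semantic agreement and confidence calibration mentioned in the abstract, and the function name `route`, the `min_agreement` threshold, and the callable model interface are all assumptions.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]  # assumed interface: question in, answer out


def route(
    question: str,
    small_models: Sequence[Model],
    oracle: Model,
    min_agreement: float = 0.75,
) -> Tuple[str, bool]:
    """Return (answer, escalated).

    The small models vote; if the most common answer reaches the
    agreement threshold it is returned directly, otherwise the
    question is escalated to the expensive oracle model.
    """
    answers = [model(question).strip() for model in small_models]
    top_answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) >= min_agreement:
        return top_answer, False
    return oracle(question), True
```

Raising `min_agreement` sends more questions to the oracle, which is one way to expose the performance and budget trade-off the abstract refers to.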

37. 【2602.19467】Can Large Language Models Replace Human Coders? Introducing ContentBench

Link: https://arxiv.org/abs/2602.19467

Authors: Michael Haman

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: empirical content analysis, low-cost large language, interpretive coding work, large language models, interpretive coding

Comments: Project website: https://contentbench.github.io

Abstract:Can low-cost large language models (LLMs) take over the interpretive coding work that still anchors much of empirical content analysis? This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks. The suite uses versioned tracks that invite researchers to contribute new benchmark datasets. I report results from the first track, ContentBench-ResearchTalk v1.0: 1,000 synthetic, social-media-style posts about academic research labeled into five categories spanning praise, critique, sarcasm, questions, and procedural remarks. Reference labels are assigned only when three state-of-the-art reasoning models (GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1) agree unanimously, and all final labels are checked by the author as a quality-control audit. Among the 59 evaluated models, the best low-cost LLMs reach roughly 97-99% agreement with these jury labels, far above GPT-3.5 Turbo, the model behind early ChatGPT and the initial wave of LLM-based text annotation. Several top models can code 50,000 posts for only a few dollars, pushing large-scale interpretive coding from a labor bottleneck toward questions of validation, reporting, and governance. At the same time, small open-weight models that run locally still struggle on sarcasm-heavy items (for example, Llama 3.2 3B reaches only 4% agreement on hard-sarcasm). ContentBench is released with data, documentation, and an interactive quiz at this http URL to support comparable evaluations over time and to invite community extensions.

38. 【2602.19463】PuppetChat: Fostering Intimate Communication through Bidirectional Actions and Micronarratives

Link: https://arxiv.org/abs/2602.19463

Authors: Emma Jiren Wang, Siying Hu, Zhicong Lu

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: modern intimate relationships, sustaining modern intimate, facilitates frequent connection, instant messaging facilitates, messaging facilitates frequent

Comments: 19 pages, 8 figures; Accepted by ACM CHI 2026. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI'24)

Abstract:As a primary channel for sustaining modern intimate relationships, instant messaging facilitates frequent connection across distances. However, today's tools often dilute care: they favor single-tap reactions and vague emojis that do not support two-way action responses, do not preserve the feeling that the exchange keeps going without breaking, and are weakly tied to who we are and what we share. To address this challenge, we present PuppetChat, a dyadic messaging prototype that restores this expressive depth through embodied interaction. PuppetChat uses a reciprocity-aware recommender to encourage responsive actions and generates personalized micronarratives from user stories to ground interactions in personal history. Our 10-day field study with 11 dyads of close partners or friends revealed that this approach enhanced social presence, supported more expressive self-disclosure, and sustained continuity and shared memories.

39. 【2602.19455】SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning

Link: https://arxiv.org/abs/2602.19455

Authors: Zelin He, Boran Han, Xiyuan Zhang, Shuai Zhang, Haotian Lin, Qi Zhu, Haoyang Fang, Danielle C. Maddix, Abdul Fatir Ansari, Akash Chandrayan, Abhinav Pradhan, Bernie Wang, Matthew Reimherr

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

Keywords: large language models, existing solutions face, general reasoning large, reasoning large language, understand complex time-series

Comments: Accepted by the 29th International Conference on Artificial Intelligence and Statistics (AISTATS 2026)

Abstract:Time-series diagnostic reasoning is essential for many applications, yet existing solutions face a persistent gap: general reasoning large language models (GRLMs) possess strong reasoning skills but lack the domain-specific knowledge to understand complex time-series patterns. Conversely, fine-tuned time-series LLMs (TSLMs) understand these patterns but lack the capacity to generalize reasoning for more complicated questions. To bridge this gap, we propose a hybrid knowledge-injection framework that injects TSLM-generated insights directly into GRLM's reasoning trace, thereby achieving strong time-series reasoning with in-domain knowledge. As collecting data for knowledge injection fine-tuning is costly, we further leverage a reinforcement learning-based approach with verifiable rewards (RLVR) to elicit knowledge-rich traces without human supervision, then transfer such an in-domain thinking trace into GRLM for efficient knowledge injection. We further release SenTSR-Bench, a multivariate time-series-based diagnostic reasoning benchmark collected from real-world industrial operations. Across SenTSR-Bench and other public datasets, our method consistently surpasses TSLMs by 9.1%-26.1% and GRLMs by 7.9%-22.4%, delivering robust, context-aware time-series diagnostic insights.

40. 【2602.19403】Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins

Link: https://arxiv.org/abs/2602.19403

Authors: Jasmin Han (1), Janardan Devkota (1), Joseph Waring (1), Amanda Luken (2), Felix Naughton (3), Roger Vilardaga (4), Jonathan Bricker (5 and 6), Carl Latkin (7), Meghan Moran (7), Yiqun Chen (8 and 9), Johannes Thrul (1 and 10 and 11) ((1) Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA, (2) Department of Health Sciences, Towson University, Towson, USA, (3) Addiction Research Group, University of East Anglia, Norwich, UK, (4) Department of Implementation Science, Wake Forest University School of Medicine, Winston-Salem, USA, (5) Fred Hutchinson Cancer Center, Seattle, USA, (6) Department of Psychology, University of Washington, Seattle, USA, (7) Department of Health, Behavior and Society, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA, (8) Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA, (9) Department of Computer Science, Johns Hopkins Whiting School of Engineering, Baltimore, USA, (10) Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, Baltimore, USA, (11) Centre for Alcohol Policy Research, La Trobe University, Melbourne, Australia)

Categories: Computation and Language (cs.CL); Applications (stat.AP)

Keywords: Perceived message effectiveness, Perceived message, smoking cessation, optimizing personalized smoking, LLM-based digital twins

Comments: 31 pages, 5 figures, submitted to Journal of the American Medical Informatics Association (JAMIA). Drs. Chen and Thrul share last authorship

Abstract:Perceived message effectiveness (PME) by potential intervention end-users is important for selecting and optimizing personalized smoking cessation intervention messages for mobile health (mHealth) platform delivery. This study evaluates whether large language models (LLMs) can accurately predict PME for smoking cessation messages. We evaluated multiple models for predicting PME across three domains: content quality, coping support, and quitting support. The dataset comprised 3010 message ratings (5-point Likert scale) from 301 young adult smokers. We compared (1) supervised learning models trained on labeled data, (2) zero- and few-shot LLMs prompted without task-specific fine-tuning, and (3) LLM-based digital twins that incorporate individual characteristics and prior PME histories to generate personalized predictions. Model performance was assessed on three held-out messages per participant using accuracy, Cohen's kappa, and F1. LLM-based digital twins outperformed zero- and few-shot LLMs (12 percentage points on average) and supervised baselines (13 percentage points), achieving accuracies of 0.49 (content), 0.45 (coping), and 0.49 (quitting), with directional accuracies of 0.75, 0.66, and 0.70 on a simplified 3-point scale. Digital twin predictions showed greater dispersion across rating categories, indicating improved sensitivity to individual differences. Integrating personal profiles with LLMs captures person-specific differences in PME and outperforms supervised, zero-shot, and few-shot approaches. Improved PME prediction may enable more tailored intervention content in mHealth. LLM-based digital twins show potential for supporting personalization of mobile smoking cessation and other health behavior change interventions.
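
Since the evaluation above reports Cohen's kappa alongside accuracy, a short sketch of the statistic may help: kappa compares observed agreement p_o with the agreement p_e expected by chance from the two label distributions, as (p_o - p_e) / (1 - p_e). The function below is a generic illustration, not the authors' evaluation code; returning 0.0 in the degenerate case where chance agreement is already 1 is a convention chosen here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return 0.0  # degenerate case; convention chosen for this sketch
    return (p_o - p_e) / (1.0 - p_e)
```

On a 5-point rating scale, an accuracy near 0.49 can still correspond to meaningful above-chance agreement, which is why kappa is reported next to raw accuracy.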

41. 【2602.19385】Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition

Link: https://arxiv.org/abs/2602.19385

Authors: Minxue Tang, Yangyang Yu, Aolin Ding, Maziyar Baran Pouyan, Taha Belkhouja, Yujia Bao

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Recognizing implicit visual, visual and textual, real-world applications, applications of modern, Recognizing implicit

Abstract:Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI. However, tackling long-tail pattern recognition tasks remains challenging for current pre-trained foundation models such as LLMs and VLMs. While finetuning pre-trained models can improve accuracy in recognizing implicit patterns, it is usually infeasible due to a lack of training data and high computational overhead. In this paper, we propose ADAMAB, an efficient embedding calibration framework for few-shot pattern recognition. To maximally reduce the computational costs, ADAMAB trains embedder-agnostic light-weight calibrators on top of fixed embedding models without accessing their parameters. To mitigate the need for large-scale training data, we introduce an adaptive data augmentation strategy based on the Multi-Armed Bandit (MAB) mechanism. With a modified upper confidence bound algorithm, ADAMAB diminishes the gradient shifting and offers theoretically guaranteed convergence in few-shot training. Our multi-modal experiments justify the superior performance of ADAMAB, with up to 40% accuracy improvement when training with less than 5 initial data samples of each class.
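
To illustrate how a bandit can choose among augmentation strategies, here is a sketch of the textbook UCB1 rule. It is not ADAMAB's modified upper confidence bound algorithm, whose details the abstract does not give. The arm names and the fixed reward values are made up for the demonstration; in practice the reward would be some measured training signal.

```python
import math


class UCB1:
    """Textbook UCB1 bandit (not the modified variant used by ADAMAB)."""

    def __init__(self, n_arms: int) -> None:
        self.counts = [0] * n_arms    # number of pulls per arm
        self.totals = [0.0] * n_arms  # summed reward per arm

    def select(self) -> int:
        # Play every arm once before using the confidence bound.
        for arm, pulls in enumerate(self.counts):
            if pulls == 0:
                return arm
        t = sum(self.counts)
        scores = [
            total / pulls + math.sqrt(2.0 * math.log(t) / pulls)
            for total, pulls in zip(self.totals, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.totals[arm] += reward


# Hypothetical augmentation strategies with made-up, fixed rewards.
ARMS = ["synonym_swap", "paraphrase", "crop_and_flip"]
MEAN_REWARD = [0.2, 0.8, 0.5]

bandit = UCB1(len(ARMS))
for _ in range(200):
    arm = bandit.select()
    bandit.update(arm, MEAN_REWARD[arm])
```

After the loop, `bandit.counts` shows most pulls going to the highest-reward arm while the other arms are still sampled occasionally, which is the exploration and exploitation balance the confidence term provides.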

42. 【2602.19333】PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification

Link: https://arxiv.org/abs/2602.19333

Authors: Isun Chehreh, Ebrahim Ansari

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Keywords: well-balanced Persian social, Persian social media, specifically designed, Persian social, designed to address

Comments: 10 pages, including 1 figure

Abstract:This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance, we employed undersampling with semantic redundancy removal and advanced data augmentation strategies integrating lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa (with LoRA and AdaLoRA adaptations), FaBERT, SBERT-based architectures, and the Persian-specific TookaBERT (Base and Large). Experimental results show that transformer-based models consistently outperform traditional neural networks, with TookaBERT-Large achieving the best performance (Precision: 0.9622, Recall: 0.9621, F1-score: 0.9621). Class-wise evaluation further confirms robust performance across all categories, though social and political texts exhibited slightly lower scores due to inherent ambiguity. This research presents a new high-quality dataset and provides comprehensive evaluations of cutting-edge models, establishing a solid foundation for further developments in Persian NLP, including trend analysis, social behavior modeling, and user classification. The dataset is publicly available to support future research endeavors.

43. 【2602.19320】Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Link: https://arxiv.org/abs/2602.19320

Authors: Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao, Dingyi Kang, Xu Hu, Feng Chen, Qiannan Li, Bingzhe Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: supporting long-horizon reasoning, fixed context windows, enable large language, large language model, systems enable large

Abstract:Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.

44. 【2602.19317】Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Link: https://arxiv.org/abs/2602.19317

Authors: Maryam Amirizaniani, Alireza Salemi, Hamed Zamani

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Question Answering, requires answers, users' background, accurate and aligned, aligned with users'

Abstract:Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context. Existing state-of-the-art methods primarily rely on retrieval-augmented generation (RAG) solutions that construct personal context by retrieving relevant items from the user's profile. Existing methods use the user's query directly to retrieve personal documents, and such strategies often lead to surface-level personalization. We propose PR2 (Personalized Retrieval-Augmented Reasoning), a reinforcement learning framework that integrates reasoning and retrieval from personal context for personalization. PR2 learns adaptive retrieval-reasoning policies, determining when to retrieve, what evidence to retrieve from user profiles, and how to incorporate it into intermediate reasoning steps. By optimizing multi-turn reasoning trajectories under a personalized reward function, the framework reinforces reasoning paths that better align with user-specific preferences and contextual signals reflected by the reward model. Extensive experiments on the LaMP-QA benchmark using three LLMs show that PR2 consistently outperforms strong baselines, achieving an average relative improvement of 8.8%-12% in personalized QA.

45. 【2602.19212】Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection

Link: https://arxiv.org/abs/2602.19212

Authors: Raihan Tanvir, Md. Golam Rabiul Alam

Categories: Computation and Language (cs.CL)

Keywords: convey harmful narratives, social media increasingly, Bengali Hateful Memes, harmful narratives, social media

Abstract:Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving both class balance and semantic diversity. We propose the Enhanced Dual Co-attention Framework (xDORA), integrating vision encoders (CLIP, DINOv2) and multilingual text encoders (XGLM, XLM-R) via weighted attention pooling to learn robust cross-modal representations. Building on these embeddings, we develop a FAISS-based k-nearest neighbor classifier for non-parametric inference and introduce RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning. We further evaluate LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings. Experiments on the extended dataset show that xDORA (CLIP + XLM-R) achieves macro-average F1-scores of 0.78 for hateful meme identification and 0.71 for target entity detection, while RAG-Fused DORA improves performance to 0.79 and 0.74, yielding gains over the DORA baseline. The FAISS-based classifier performs competitively and demonstrates robustness for rare classes through semantic similarity modeling. In contrast, LLaVA exhibits limited effectiveness in few-shot settings, with only modest improvements under retrieval augmentation, highlighting constraints of pretrained vision-language models for code-mixed Bengali content without fine-tuning. These findings demonstrate the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.
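
The non-parametric inference step mentioned above, a k-nearest-neighbor classifier over embeddings, can be sketched without any index library. The code below swaps FAISS for a brute-force cosine-similarity search in pure Python, so it shows the voting logic only and not the paper's implementation. The two-dimensional vectors and the labels are toy values chosen for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(
    query: Sequence[float],
    embeddings: List[Sequence[float]],
    labels: List[str],
    k: int = 3,
) -> str:
    """Majority label among the k stored embeddings most similar to the query."""
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: cosine(query, embeddings[i]),
        reverse=True,
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy embedding store (illustrative values only).
store = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
store_labels = ["hateful", "hateful", "benign", "benign"]
prediction = knn_predict([1.0, 0.05], store, store_labels, k=3)
```

Because no parameters are trained, adding a newly labeled example only requires appending it to the store, which is one reason such classifiers can help with rare classes.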

46. 【2602.19177】Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content

Link: https://arxiv.org/abs/2602.19177

Authors: Simon Münker, Nils Schwager, Kai Kugler, Michael Heseltine, Achim Rettinger

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Language Models, Large Language, paradigm shift, science research presents

Comments: 8 pages (12 including references), 2 figures and 2 tables

Abstract:The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. While LLMs offer scalability and cost-efficiency, their "naive" application, where they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data, to create a dataset designed to evaluate the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the need for more sophisticated prompting techniques and specialized datasets to ensure that LLM-generated content accurately reflects the complex linguistic patterns of human communication, thereby improving the validity of computational social science studies.

47. 【2602.19174】TurkicNLP: An NLP Toolkit for Turkic Languages

Link: https://arxiv.org/abs/2602.19174

Authors: Sherzod Hakimov

Categories: Computation and Language (cs.CL)

Keywords: Natural language processing, Turkic language family, lacking unified tooling, languages lacking unified, people across Eurasia

Abstract:Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at this https URL.
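
For readers unfamiliar with the CoNLL-U output format mentioned above, each token is one tab-separated line with ten fields. The sketch below parses such a line using only the standard library. It does not call TurkicNLP, whose API is not described in the abstract, and the example line is hand-written for illustration rather than produced by the toolkit.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc")


def parse_token_line(line: str) -> dict:
    """Parse one CoNLL-U token line into a dict keyed by column name."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
    token = dict(zip(FIELDS, cols))
    # FEATS is a '|'-separated list of Name=Value pairs, or '_' when empty.
    feats = token["feats"]
    token["feats"] = (
        {} if feats == "_" else dict(pair.split("=", 1) for pair in feats.split("|"))
    )
    return token


# Hand-written example: Turkish "evler" (houses), lemma "ev".
example = "1\tevler\tev\tNOUN\t_\tCase=Nom|Number=Plur\t0\troot\t_\t_"
token = parse_token_line(example)
```

Here `token["lemma"]` is `"ev"` and `token["feats"]` holds the morphological features as a dictionary; an underscore marks an unspecified field.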

48. 【2602.19160】Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing

Link: https://arxiv.org/abs/2602.19160

Authors: Maciej Świechowski, Adam Żychowski, Jacek Mańdziuk

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Keywords: Large Language Models, Large Language, General Game Playing, rule-governed environments, Language Models

Abstract:This paper examines the reasoning capabilities of Large Language Models (LLMs) from a novel perspective, focusing on their ability to operate within formally specified, rule-governed environments. We evaluate four LLMs (Gemini 2.5 Pro and Flash variants, Llama 3.3 70B and GPT-OSS 120B) on a suite of forward-simulation tasks, including next / multistep state formulation and legal action generation, across a diverse set of reasoning problems illustrated through General Game Playing (GGP) game instances. Beyond reporting instance-level performance, we characterize games based on 40 structural features and analyze correlations between these features and LLM performance. Furthermore, we investigate the effects of various game obfuscations to assess the role of linguistic semantics in game definitions and the impact of potential prior exposure of LLMs to specific games during training. The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degradation observed as the evaluation horizon increases (i.e., with a higher number of game steps). Detailed case-based analysis of the LLM performance provides novel insights into common reasoning errors in the considered logic-based problem formulation, including hallucinated rules, redundant state facts, or syntactic errors. Overall, the paper reports clear progress in formal reasoning capabilities of contemporary models.

49. 【2602.19159】Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM

Link: https://arxiv.org/abs/2602.19159

Authors: Francesca Bianco, Derek Shiller

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: LLMs alter choices, Prior behavioural work, LLMs alter, options are framed, framed as causing

Comments: 24 pages, 8+1 Tables

Abstract:Prior behavioural work suggests that some LLMs alter choices when options are framed as causing pain or pleasure, and that such deviations can scale with stated intensity. To bridge behavioural evidence (what the model does) with mechanistic interpretability (what computations support it), we investigate how valence-related information is represented and where it is causally used inside a transformer. Using Gemma-2-9B-it and a minimalist decision task modelled on prior work, we (i) map representational availability with layer-wise linear probing across streams, (ii) test causal contribution with activation interventions (steering; patching/ablation), and (iii) quantify dose-response effects over an epsilon grid, reading out both the 2-3 logit margin and digit-pair-normalised choice probabilities. We find that (a) valence sign (pain vs. pleasure) is perfectly linearly separable across stream families from very early layers (L0-L1), while a lexical baseline retains substantial signal; (b) graded intensity is strongly decodable, with peaks in mid-to-late layers and especially in attention/MLP outputs, and decision alignment is highest slightly before the final token; (c) additive steering along a data-derived valence direction causally modulates the 2-3 margin at late sites, with the largest effects observed in late-layer attention outputs (attn_out L14); and (d) head-level patching/ablation suggests that these effects are distributed across multiple heads rather than concentrated in a single unit. Together, these results link behavioural sensitivity to identifiable internal representations and intervention-sensitive sites, providing concrete mechanistic targets for more stringent counterfactual tests and broader replication. This work supports a more evidence-driven (a) debate on AI sentience and welfare, and (b) governance when setting policy, auditing standards, and safety safeguards.

50. 【2602.19157】Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs

Link: https://arxiv.org/abs/2602.19157

Authors: Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide

Categories: Computation and Language (cs.CL)

Keywords: Role-Playing Agents, inject persona descriptions, retrieval-augmented generation, supervised fine-tuning, persona-specific corpora

Comments: Accepted in PAKDD 2026 special session on Data Science: Foundation and Applications

Abstract:Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation, or via supervised fine-tuning (SFT) on persona-specific corpora. While SFT can be effective, it requires persona-labeled data and retraining for new roles, limiting flexibility. In contrast, prompt- and RAG-based signals are easy to apply but can be diluted in long dialogues, leading to drifting and sometimes inconsistent persona behavior. To address this, we propose a contrastive Sparse AutoEncoder (SAE) framework that learns facet-level personality control vectors aligned with the Big Five 30-facet model. A new 15,000-sample leakage-controlled corpus is constructed to provide balanced supervision for each facet. The learned vectors are integrated into the model's residual space and dynamically selected by a trait-activated routing module, enabling precise and interpretable personality steering. Experiments on Large Language Models (LLMs) show that the proposed method maintains stable character fidelity and output quality across contextualized settings, outperforming Contrastive Activation Addition (CAA) and prompt-only baselines. The combined SAE+Prompt configuration achieves the best overall performance, confirming that contrastively trained latent vectors can enhance persona control while preserving dialogue coherence.

51. 【2602.19146】VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

链接https://arxiv.org/abs/2602.19146

作者:Diogo Glória-Silva,David Semedo,João Magalhães

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:multi-step instructional video, instructional video action, Instructional Video Dialogues, reason over complex, dialogue model designed

备注: Accepted at EACL 2026 Findings

点击查看摘要

Abstract:We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work which focuses mainly on text-only guidance, or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. Experiments were done on a novel dataset with rich Instructional Video Dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.

52. 【2602.19133】A Dataset for Named Entity Recognition and Relation Extraction from Art-historical Image Descriptions

链接https://arxiv.org/abs/2602.19133

作者:Stefanie Schneider,Miriam Göldl,Julian Stalter,Ricarda Vollmer

类目:Computation and Language (cs.CL)

关键词:Named Entity Recognition, Fine-grained Recognition, Relation Extraction, paper introduces FRAME, art-historical image descriptions

备注

点击查看摘要

Abstract:This paper introduces FRAME (Fine-grained Recognition of Art-historical Metadata and Entities), a manually annotated dataset of art-historical image descriptions for Named Entity Recognition (NER) and Relation Extraction (RE). Descriptions were collected from museum catalogs, auction listings, open-access platforms, and scholarly databases, then filtered to ensure that each text focuses on a single artwork and contains explicit statements about its material, composition, or iconography. FRAME provides stand-off annotations in three layers: a metadata layer for object-level properties, a content layer for depicted subjects and motifs, and a co-reference layer linking repeated mentions. Across layers, entity spans are labeled with 37 types and connected by typed RE links between mentions. Entity types are aligned with Wikidata to support Named Entity Linking (NEL) and downstream knowledge-graph construction. The dataset is released as UIMA XMI Common Analysis Structure (CAS) files with accompanying images and bibliographic metadata, and can be used to benchmark and fine-tune NER and RE systems, including zero- and few-shot setups with Large Language Models (LLMs).

53. 【2602.19127】AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG

链接https://arxiv.org/abs/2602.19127

作者:Qijie You,Wenkai Yu,Wentao Zhang

类目:Computation and Language (cs.CL)

关键词:important research direction, Agentic RAG, recent years, rapid advancement, advancement of agent-based

备注

点击查看摘要

Abstract:With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide only final questions and answers, while lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query. This limitation prevents researchers from analyzing at which step an agent fails and restricts more fine-grained evaluation of model capabilities. Moreover, most current benchmarks are manually constructed, which is both time-consuming and labor-intensive, while also limiting scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG benchmark that is primarily constructed automatically by large language models and designed to support step-by-step validation. Our benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks. Extensive experiments demonstrate that even the best large language models perform poorly on our dataset. For instance, GPT-5 attains merely 22.6% EM accuracy on the hardest portion of our dataset. Hop-aware diagnosis reveals that failures are primarily driven by distorted reasoning chains, either collapsing prematurely or wandering into over-extension. This highlights a critical inability to allocate steps consistent with the task's logical structure, providing a diagnostic dimension missing in traditional evaluations. We believe our work will facilitate research in Agentic RAG and inspire further meaningful progress in this area. Our code and data are available at this https URL.

54. 【2602.19115】How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders

链接https://arxiv.org/abs/2602.19115

作者:Michael McCoubrey,Angelo Salatino,Francesco Osborne,Enrico Motta

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)

关键词:large language models, recent years, language models, large language, assessment and generation

备注: Presented at SESAME 2025: Smarter Extraction of ScholArly MEtadata using Knowledge Graphs and Language Models, @ JCDL 2025

点击查看摘要

Abstract:In recent years, there has been a growing use of generative AI, and large language models (LLMs) in particular, to support both the assessment and generation of scientific work. Although some studies have shown that LLMs can, to a certain extent, evaluate research according to perceived quality, our understanding of the internal mechanisms that enable this capability remains limited. This paper presents the first study that investigates how LLMs encode the concept of scientific quality through relevant monosemantic features extracted using sparse autoencoders. We derive such features under different experimental settings and assess their ability to serve as predictors across three tasks related to research quality: predicting citation count, journal SJR, and journal h-index. The results indicate that LLMs encode features associated with multiple dimensions of scientific quality. In particular, we identify four recurring types of features that capture key aspects of how research quality is represented: 1) features reflecting research methodologies; 2) features related to publication type, with literature reviews typically exhibiting higher impact; 3) features associated with high-impact research fields and technologies; and 4) features corresponding to specific scientific jargons. These findings represent an important step toward understanding how LLMs encapsulate concepts related to research quality.

55. 【2602.19111】Astra: Activation-Space Tail-Eigenvector Low-Rank Adaptation of Large Language Models

链接https://arxiv.org/abs/2602.19111

作者:Kainan Liu,Yong Zhang,Ning Cheng,Yun Zhu,Yanmeng Wang,Shaojun Wang,Jing Xiao

类目:Computation and Language (cs.CL)

关键词:adapting pre-trained models, storage efficiency, adapting pre-trained, computational and storage, tail eigenvectors

备注: 22 pages, 10 figures

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) methods, especially LoRA, are widely used for adapting pre-trained models to downstream tasks due to their computational and storage efficiency. However, in the context of LoRA and its variants, the potential of activation subspaces corresponding to tail eigenvectors remains substantially under-exploited, which may lead to suboptimal fine-tuning performance. In this work, we propose Astra (Activation-Space Tail-Eigenvector Low-Rank Adaptation), a novel PEFT method that leverages the tail eigenvectors of the model output activations, estimated from a small task-specific calibration set, to construct task-adaptive low-rank adapters. By constraining updates to the subspace spanned by these tail eigenvectors, Astra achieves faster convergence and improved downstream performance with a significantly reduced parameter budget. Extensive experiments across natural language understanding (NLU) and natural language generation (NLG) tasks demonstrate that Astra consistently outperforms existing PEFT baselines across 16 benchmarks and even surpasses full fine-tuning (FFT) in certain scenarios.

56. 【2602.19101】Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

链接https://arxiv.org/abs/2602.19101

作者:Seong Hah Cho,Junyi Li,Anna Leshinskaya

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, alignment of Large, Language Models, models' actual

备注

点击查看摘要

Abstract:Value alignment of Large Language Models (LLMs) requires us to empirically measure these models' actual, acquired representation of value. Among the characteristics of value representation in humans is that they distinguish among values of different kinds. We investigate whether LLMs likewise distinguish three different kinds of good: moral, grammatical, and economic. By probing model behavior, embeddings, and residual stream activations, we report pervasive cases of value entanglement: a conflation between these distinct representations of value. Specifically, both grammatical and economic valuation were found to be overly influenced by moral value, relative to human norms. This conflation was repaired by selective ablation of the activation vectors associated with morality.

57. 【2602.19079】TriTopic: Tri-Modal Graph-Based Topic Modeling with Iterative Refinement and Archetypes

链接https://arxiv.org/abs/2602.19079

作者:Roman Egger

类目:Computation and Language (cs.CL)

关键词:large text collections, face critical limitations, single data perspective, modeling extracts latent, extracts latent themes

备注: 11 pages, 7 figures

点击查看摘要

Abstract:Topic modeling extracts latent themes from large text collections, but leading approaches like BERTopic face critical limitations: stochastic instability, loss of lexical precision ("Embedding Blur"), and reliance on a single data perspective. We present TriTopic, a framework that addresses these weaknesses through a tri-modal graph fusing semantic embeddings, TF-IDF, and metadata. Three core innovations drive its performance: hybrid graph construction via Mutual kNN and Shared Nearest Neighbors to eliminate noise and combat the curse of dimensionality; Consensus Leiden Clustering for reproducible, stable partitions; and Iterative Refinement that sharpens embeddings through dynamic centroid-pulling. TriTopic also replaces the "average document" concept with archetype-based topic representations defined by boundary cases rather than centers alone. In benchmarks across 20 Newsgroups, BBC News, AG News, and Arxiv, TriTopic achieves the highest NMI on every dataset (mean NMI 0.575 vs. 0.513 for BERTopic, 0.416 for NMF, 0.299 for LDA), guarantees 100% corpus coverage with 0% outliers, and is available as an open-source PyPI library.

58. 【2602.19058】Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer

链接https://arxiv.org/abs/2602.19058

作者:Chenhang Cui,An Zhang,Yuxin Chen,Gelei Deng,Jingnan Zheng,Zhenkai Liang,Xiang Wang,Tat-Seng Chua

类目:Computation and Language (cs.CL)

关键词:strong text-only large, Large vision-language models, text-only large language, compositional decision-making, rapidly advanced

备注

点击查看摘要

Abstract:Large vision-language models (LVLMs) have rapidly advanced across various domains, yet they still lag behind strong text-only large language models (LLMs) on tasks that require multi-step inference and compositional decision-making. Motivated by their shared transformer architectures, we investigate whether the two model families rely on common internal computation for such inference. At the neuron level, we uncover a surprisingly large overlap: more than half of the top-activated units during multi-step inference are shared between representative LLMs and LVLMs, revealing a modality-invariant inference subspace. Through causal probing via activation amplification, we further show that these shared neurons encode consistent and interpretable concept-level effects, demonstrating their functional contribution to inference. Building on this insight, we propose Shared Neuron Low-Rank Fusion (SNRF), a parameter-efficient framework that transfers mature inference circuitry from LLMs to LVLMs. SNRF profiles cross-model activations to identify shared neurons, computes a low-rank approximation of inter-model weight differences, and injects these updates selectively within the shared-neuron subspace. This mechanism strengthens multimodal inference performance with minimal parameter changes and requires no large-scale multimodal fine-tuning. Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities. Our results demonstrate that shared neurons form an interpretable bridge between LLMs and LVLMs, enabling low-cost transfer of inference ability into multimodal models. Our code is available at this https URL.

59. 【2602.19049】IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

链接https://arxiv.org/abs/2602.19049

作者:Yinhan He,Yaochen Zhu,Mingjia Shi,Wendy Zheng,Lin Su,Xiaoqing Wang,Qi Guo,Jundong Li

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, substantial inference-time costs, language models increasingly, models increasingly rely, Large language

备注

点击查看摘要

Abstract:Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that our IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at this https URL.

60. 【2602.19043】Uncovering Context Reliance in Unstructured Knowledge Editing

链接https://arxiv.org/abs/2602.19043

作者:Zisheng Zhou,Mengqi Zhang,Shiguang Wu,Xiaotian Ye,Chi Zhang,Zhumin Chen,Pengjie Ren

类目:Computation and Language (cs.CL)

关键词:Editing Large language, Large language models, Large language, Context Reliance, internal parametric knowledge

备注: 21 pages, 14 figures

点击查看摘要

Abstract:Editing Large language models (LLMs) with real-world, unstructured knowledge is essential for correcting and updating their internal parametric knowledge. In this work, we revisit the fundamental next-token prediction (NTP) as a candidate paradigm for unstructured editing. We identify Context Reliance as a critical failure mode of NTP-based approaches, where knowledge acquired from edited text becomes highly dependent on its preceding context, leading to recall failures when that context is absent during inference. This hypothesis is supported by our empirical validation that prepending context during inference recovers knowledge recall. We further theoretically demonstrate that Context Reliance is an inherent consequence of gradient-based optimization, which tends to bind acquired knowledge to a specific aggregated contextual representation. To address this, we propose a simple yet effective COntext-INdependent editing framework (COIN), encouraging the model to focus on knowledge within a local scope rather than memorizing contextual patterns. Evaluations show that COIN reduces Context Reliance by 45.2% and outperforms strong baselines by 23.6% in editing success rate, highlighting the vital role of mitigating Context Reliance for robust editing.

61. 【2602.19020】Learning to Detect Language Model Training Data via Active Reconstruction

链接https://arxiv.org/abs/2602.19020

作者:Junjie Oscar Yin,John X. Morris,Vitaly Shmatikov,Sewon Min,Hannaneh Hajishirzi

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Detecting LLM training, LLM training data, Detecting LLM, LLM training, generally framed

备注

点击查看摘要

Abstract:Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce Active Data Reconstruction Attack (ADRA), a family of MIA that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and the difference in their reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To effectively use RL for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.

62. 【2602.19008】Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks

链接https://arxiv.org/abs/2602.19008

作者:Wilson Y. Lee

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:capable of solving, language agents fail, canonical solution path, runs, language agents

备注

点击查看摘要

Abstract:Why do language agents fail on tasks they are capable of solving? We argue that many such failures are reliability failures caused by stochastic drift from a task's latent solution structure, not capability failures. Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across successful runs) and agent success depends critically on whether a trajectory stays within this path's operating envelope. We establish this causally using a natural experiment that holds model capability and task difficulty fixed by construction. We analyze trajectories from the Toolathlon benchmark: 22 frontier models each attempt 108 real-world tool-use tasks across 3 independent runs, yielding 515 model$\times$task units where the same model succeeds on some runs and fails on others due to LLM sampling stochasticity alone. Within these units, successful runs adhere significantly more closely to the canonical solution path than failed runs ($+$0.060 Jaccard, $p<0.0001$, $n=488$ units, 95% CI [+0.043, +0.077]). This result survives six robustness checks including cross-model-family leave-one-out validation. Critically, the causal mechanism is gradual and self-reinforcing: the adherence gap is statistically indistinguishable from zero through the first 50% of the trajectory, ruling out early-branching selection bias, and each off-canonical tool call raises the probability that the next call is also off-canonical by 22.7 percentage points ($\hat{\beta}=+0.227$, $p<0.0001$), more than doubling the baseline rate. These findings imply that agent reliability cannot be improved by capability scaling alone, but offer a highly actionable intervention: a simple monitor that restarts the bottom tercile of runs based on mid-trajectory canonical adherence lifts success rates by $+$8.8 percentage points among intervened runs.
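
As a rough illustration of the adherence measure mentioned in this abstract, the sketch below scores a run by the Jaccard overlap between its tool set and a canonical tool set. It is not the paper's implementation: taking the canonical path to be the intersection of tools across successful runs is one simple reading of "shared across successful runs", and the tool names and the helpers `canonical_path` and `adherence` are hypothetical.

```python
from typing import Iterable, List, Set


def canonical_path(successful_runs: List[Iterable[str]]) -> Set[str]:
    """Tools that appear in every successful run (an assumed definition)."""
    tool_sets = [set(run) for run in successful_runs]
    return set.intersection(*tool_sets) if tool_sets else set()


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def adherence(run: Iterable[str], canonical: Set[str]) -> float:
    """Adherence of one run to the canonical tool set."""
    return jaccard(set(run), canonical)


# Hypothetical tool-call traces for a single task.
successful = [
    ["search", "read_file", "write_file"],
    ["search", "read_file", "write_file", "list_dir"],
]
canon = canonical_path(successful)

on_path = adherence(["search", "read_file", "write_file"], canon)
drifted = adherence(["search", "send_email", "list_dir"], canon)
print(sorted(canon), on_path, drifted)
```

A monitor in the spirit of the abstract could compute this score partway through a trajectory and restart the lowest-scoring runs; the threshold and timing would have to come from the paper itself.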

63. 【2602.18998】Benchmark Test-Time Scaling of General LLM Agents

链接https://arxiv.org/abs/2602.18998

作者:Xiaochuan Li,Ryan Ming,Pranav Setlur,Abhijay Paladugu,Andy Tang,Hao Kang,Shuai Shao,Rong Jin,Chenyan Xiong

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:open-ended user requests, resolving open-ended user, general-purpose systems capable, LLM agents, user requests

备注

点击查看摘要

Abstract:LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling and verification gap in parallel scaling. Code is publicly available at this https URL.

64. 【2602.18966】Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation

链接https://arxiv.org/abs/2602.18966

作者:Yonathan Ron,Shiri Gilboa,Tammuz Dubnov

类目:Computation and Language (cs.CL)

关键词:Domain-specific speech remains, Domain-specific speech, automatic speech recognition, systems like OpenAI, speech remains

备注

点击查看摘要

Abstract:Domain-specific speech remains a persistent challenge for automatic speech recognition (ASR), even for state-of-the-art systems like OpenAI's Whisper. We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining. The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder. Evaluated on 421 NBA basketball commentary segments (a domain characterized by dense proper nouns and technical terminology), our best pipeline achieves a statistically significant 17.0% relative reduction in word error rate (WER; from 0.217 to 0.180, p<0.001). Improvements are observed in 40.1% of segments with degradation in only 7.1%, substantially outperforming direct transcript post-editing. These results demonstrate that prompt-based augmentation can deliver scalable domain adaptation for ASR, offering a practical alternative to costly model fine-tuning.
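
The reported 17.0% figure is a relative reduction, which can be checked from the two WER values quoted in the abstract. The sketch below shows that arithmetic together with a standard word-level edit-distance WER; the example sentences and the function names are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


# Made-up commentary snippet: one substitution plus one deletion over 4 words.
example_wer = word_error_rate("curry hits the three", "carry hits three")

# WER values quoted in the abstract: 0.217 before, 0.180 after.
reduction = relative_reduction(0.217, 0.180)
print(example_wer, round(reduction, 3))
```

The absolute drop is 0.037 WER; dividing by the 0.217 baseline gives roughly 0.1705, which matches the 17.0% relative reduction stated in the abstract.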

65. 【2602.18964】Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

链接https://arxiv.org/abs/2602.18964

作者:Toheeb Aduramomi Jimoh,Tabea De Wille,Nikola S. Nikolov

类目:Computation and Language (cs.CL)

关键词:requiring models, intended meaning, poses a fundamental, models to resolve, resolve disparities

备注

点击查看摘要

Abstract:Sarcasm detection poses a fundamental challenge in computational semantics, requiring models to resolve disparities between literal and intended meaning. The challenge is amplified in low-resource languages where annotated datasets are scarce or nonexistent. We present Yor-Sarc, the first gold-standard dataset for sarcasm detection in Yorùbá, a tonal Niger-Congo language spoken by over 50 million people. The dataset comprises 436 instances annotated by three native speakers from diverse dialectal backgrounds using an annotation protocol specifically designed for Yorùbá sarcasm by taking culture into account. This protocol incorporates context-sensitive interpretation and community-informed guidelines and is accompanied by a comprehensive analysis of inter-annotator agreement to support replication in other African languages. Substantial to almost perfect agreement was achieved (Fleiss' $\kappa = 0.7660$; pairwise Cohen's $\kappa = 0.6732$ to $0.8743$), with 83.3% unanimous consensus. One annotator pair achieved almost perfect agreement ($\kappa = 0.8743$; 93.8% raw agreement), exceeding a number of reported benchmarks for English sarcasm research works. The remaining 16.7% majority-agreement cases are preserved as soft labels for uncertainty-aware modelling. Yor-Sarc (available at this https URL) is expected to facilitate research on semantic interpretation and culturally informed NLP for low-resource African languages.
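
For readers unfamiliar with the agreement statistic cited here, the following is a minimal sketch of Fleiss' kappa computed from per-item category counts, plus one plausible way to turn non-unanimous votes into soft labels. The toy counts are invented for illustration and are not the Yor-Sarc annotations; the soft-label scheme (vote shares) is an assumption, since the abstract does not say how soft labels are encoded.

```python
from typing import List, Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = number of raters putting item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in shares)

    # Undefined (division by zero) if all ratings fall into a single category.
    return (p_bar - p_e) / (1 - p_e)


def soft_label(row: Sequence[int]) -> List[float]:
    """Vote shares per category, e.g. a 2-vs-1 split becomes [2/3, 1/3]."""
    total = sum(row)
    return [c / total for c in row]


# Toy data: 4 items, 3 annotators, categories = (sarcastic, not sarcastic).
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy_counts)
print(kappa, soft_label(toy_counts[2]))
```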

66. 【2602.18922】Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning

链接https://arxiv.org/abs/2602.18922

作者:Abhinaba Basu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:agents incur substantial, repeated LLM calls, Personal AI agents, incur substantial cost, agents incur

备注: 28 pages, 15 figures, 8 tables, 5 appendices

点击查看摘要

Abstract:Personal AI agents incur substantial cost via repeated LLM calls. We show existing caching methods fail: GPTCache achieves 37.9% accuracy on real benchmarks; APC achieves 0-12%. The root cause is optimizing for the wrong property: cache effectiveness requires key consistency and precision, not classification accuracy. We observe cache-key evaluation reduces to clustering evaluation and apply V-measure decomposition to separate these on n=8,682 points across MASSIVE, BANKING77, CLINC150, and NyayaBench v2, our new 8,514-entry multilingual agentic dataset (528 intents, 20 W5H2 classes, 63 languages). We introduce W5H2, a structured intent decomposition framework. Using SetFit with 8 examples per class, W5H2 achieves 91.1% ± 1.7% on MASSIVE in ~2ms, vs 37.9% for GPTCache and 68.8% for a 20B-parameter LLM at 3,447ms. On NyayaBench v2 (20 classes), SetFit achieves 55.3%, with cross-lingual transfer across 30 languages. Our five-tier cascade handles 85% of interactions locally, projecting 97.5% cost reduction. We provide risk-controlled selective prediction guarantees via RCPS with nine bound families.
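
To make the "cache-key evaluation reduces to clustering evaluation" remark concrete, here is a small sketch of the standard V-measure decomposition into homogeneity and completeness, applied to hypothetical cache keys. Reading homogeneity as key precision (a key never mixes intents) and completeness as key consistency (an intent always maps to the same key) is an assumption made for illustration; the abstract does not spell out the mapping, and the intent and key names are invented.

```python
import math
from collections import Counter
from typing import Hashable, Sequence, Tuple


def _entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy H(X) of a label sequence, in nats."""
    total = len(labels)
    return -sum((n / total) * math.log(n / total) for n in Counter(labels).values())


def _conditional_entropy(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """H(X | Y) estimated from paired observations."""
    total = len(xs)
    joint = Counter(zip(xs, ys))
    y_counts = Counter(ys)
    return -sum((n / total) * math.log(n / y_counts[y]) for (_, y), n in joint.items())


def v_measure(intents: Sequence[Hashable], keys: Sequence[Hashable]) -> Tuple[float, float, float]:
    """Return (homogeneity, completeness, V) for true intents vs. assigned cache keys."""
    if not intents or len(intents) != len(keys):
        raise ValueError("need two non-empty sequences of equal length")
    h_intent, h_key = _entropy(intents), _entropy(keys)
    homogeneity = 1.0 if h_intent == 0 else 1.0 - _conditional_entropy(intents, keys) / h_intent
    completeness = 1.0 if h_key == 0 else 1.0 - _conditional_entropy(keys, intents) / h_key
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v


# Hypothetical requests: two intents, three different keying schemes.
intents = ["set_alarm", "set_alarm", "get_weather", "get_weather"]
perfect = v_measure(intents, ["k1", "k1", "k2", "k2"])   # one key per intent
merged = v_measure(intents, ["k1", "k1", "k1", "k1"])    # one key for everything
split = v_measure(intents, ["k1", "k2", "k3", "k4"])     # a new key per request
print(perfect, merged, split)
```

In this toy setup, the merged scheme would return wrong cached answers (low homogeneity), while the split scheme would never get a cache hit (low completeness), which is the kind of distinction a single accuracy number hides.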

67. 【2602.18920】DeepInnovator: Triggering the Innovative Capabilities of LLMs

链接https://arxiv.org/abs/2602.18920

作者:Tianyu Fan,Fengji Zhang,Yuxiang Zheng,Bei Chen,Xinyao Niu,Chengen Huang,Junyang Lin,Chao Huang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, garnered increasing attention, application of Large, accelerating scientific discovery

备注

点击查看摘要

Abstract:The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously generate novel and significant research ideas. Existing approaches predominantly rely on sophisticated prompt engineering and lack a systematic training paradigm. To address this, we propose DeepInnovator, a training framework designed to trigger the innovative capability of LLMs. Our approach comprises two core components. (1) "Standing on the shoulders of giants". We construct an automated data extraction pipeline to extract and organize structured research knowledge from a vast corpus of unlabeled scientific literature. (2) "Conjectures and refutations". We introduce a "Next Idea Prediction" training paradigm, which models the generation of research ideas as an iterative process of continuously predicting, evaluating, and refining plausible and novel next idea. Both automatic and expert evaluations demonstrate that our DeepInnovator-14B significantly outperforms untrained baselines, achieving win rates of 80.53%-93.81%, and attains performance comparable to that of current leading LLMs. This work provides a scalable training pathway toward building research agents with genuine, originative innovative capability, and will open-source the dataset to foster community advancement. Source code and data are available at: this https URL.

68. 【2602.18905】TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning

链接https://arxiv.org/abs/2602.18905

作者:Yujiao Yang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, demonstrated strong capabilities, decision-making processes remain, processes remain difficult, Large language

备注

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong capabilities in complex reasoning tasks, yet their decision-making processes remain difficult to interpret. Existing explanation methods often lack trustworthy structural insight and are limited to single-instance analysis, failing to reveal reasoning stability and systematic failure mechanisms. To address these limitations, we propose the Trustworthy Unified Explanation Framework (TRUE), which integrates executable reasoning verification, feasible-region directed acyclic graph (DAG) modeling, and causal failure mode analysis. At the instance level, we redefine reasoning traces as executable process specifications and introduce blind execution verification to assess operational validity. At the local structural level, we construct feasible-region DAGs via structure-consistent perturbations, enabling explicit characterization of reasoning stability and the executable region in the local input space. At the class level, we introduce a causal failure mode analysis method that identifies recurring structural failure patterns and quantifies their causal influence using Shapley values. Extensive experiments across multiple reasoning benchmarks demonstrate that the proposed framework provides multi-level, verifiable explanations, including executable reasoning structures for individual instances, feasible-region representations for neighboring inputs, and interpretable failure modes with quantified importance at the class level. These results establish a unified and principled paradigm for improving the interpretability and reliability of LLM reasoning systems.

69. 【2602.18823】EvalSense: A Framework for Domain-Specific LLM (Meta-)Evaluation

Link: https://arxiv.org/abs/2602.18823

Authors: Adam Dejl, Jonathan Pearson

Categories: Computation and Language (cs.CL)

Keywords: Robust and comprehensive, identifying effective LLM, effective LLM system, LLM system configurations, large language models

Comments: Accepted to EACL 2026 System Demonstrations

Abstract:Robust and comprehensive evaluation of large language models (LLMs) is essential for identifying effective LLM system configurations and mitigating risks associated with deploying LLMs in sensitive domains. However, traditional statistical metrics are poorly suited to open-ended generation tasks, leading to growing reliance on LLM-based evaluation methods. These methods, while often more flexible, introduce additional complexity: they depend on carefully chosen models, prompts, parameters, and evaluation strategies, making the evaluation process prone to misconfiguration and bias. In this work, we present EvalSense, a flexible, extensible framework for constructing domain-specific evaluation suites for LLMs. EvalSense provides out-of-the-box support for a broad range of model providers and evaluation strategies, and assists users in selecting and deploying suitable evaluation methods for their specific use-cases. This is achieved through two unique components: (1) an interactive guide aiding users in evaluation method selection and (2) automated meta-evaluation tools that assess the reliability of different evaluation approaches using perturbed data. We demonstrate the effectiveness of EvalSense in a case study involving the generation of clinical notes from unstructured doctor-patient dialogues, using a popular open dataset. All code, documentation, and assets associated with EvalSense are open-source and publicly available at this https URL.

70. 【2602.18806】Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models

Link: https://arxiv.org/abs/2602.18806

Authors: Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, errors remains limited, demonstrate strong reasoning

Abstract:Large Language Models (LLMs) demonstrate strong reasoning performance, yet their ability to reliably monitor, diagnose, and correct their own errors remains limited. We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight dual-process MetaController for adaptive effort allocation. Across diverse reasoning and diagnostic benchmarks (GSM8K, CRUXEval, MBPP, AIME, CorrectBench, and TruthfulQA) using Llama-3 and Qwen-3 (8B), explicit regulatory structuring substantially improves error diagnosis and yields a threefold increase in successful self-correction. Blinded human evaluations over 580 query pairs show an 84% aggregate preference for trustworthiness and metacognitive self-awareness over standard and Chain-of-Thought baselines. Grounding LLM reasoning in established cognitive theory offers a principled path toward more transparent and diagnostically robust AI systems.

71. 【2602.18788】BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models

Link: https://arxiv.org/abs/2602.18788

Authors: Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat

Categories: Computation and Language (cs.CL)

Keywords: core NLP competencies, systematically evaluates large, evaluates large language, core NLP, Natural Language Inference

Abstract:We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies, including Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation, several of which were previously unavailable for Burmese. The benchmark is constructed through a rigorous native-speaker-driven process to ensure linguistic naturalness, fluency, and cultural authenticity while minimizing translation-induced artifacts. We conduct a large-scale evaluation of both open-weight and commercial LLMs to examine challenges in Burmese modeling arising from limited pretraining coverage, rich morphology, and syntactic variation. Our results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone. In particular, Southeast Asia regional fine-tuning and newer model generations yield substantial gains. Finally, we release BURMESE-SAN as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages. this https URL

72. 【2602.18782】MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

Link: https://arxiv.org/abs/2602.18782

Authors: Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu, Ruizhe Li, Maheep Chaudhary

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Defending LLMs, jailbreak attacks remains, open challenge, adversarial jailbreak attacks, remains an open

Abstract:Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive while potentially degrading model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets, while preserving model utility on benign inputs.

73. 【2602.18776】ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models

Link: https://arxiv.org/abs/2602.18776

Authors: Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Eastern Arabic-Indic numerals, Western Arabic numerals, evaluating large language, number reading tasks, Eastern Arabic-Indic

Abstract:We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9). We evaluate 71 models from 10 providers using four prompting strategies (zero-shot, zero-shot CoT, few-shot, few-shot CoT) on 210 number reading tasks spanning six contextual categories: pure numerals, addresses, dates, quantities, and prices. Our evaluation comprises 59,010 individual test cases and tracks extraction methods to measure structured output generation. Evaluation reveals substantial performance variation, with accuracy ranging from 14.29% to 99.05% across models and strategies. Few-shot Chain-of-Thought prompting achieves 2.8x higher accuracy than zero-shot approaches (80.06% vs 28.76%). A striking finding emerges: models achieving elite accuracy (98-99%) often produce predominantly unstructured output, with most responses lacking Arabic CoT markers. Only 6 models consistently generate structured output across all test cases, while the majority require fallback extraction methods despite high numerical accuracy. Comprehensive evaluation of 281 model-strategy combinations demonstrates that numerical accuracy and instruction-following represent distinct capabilities, establishing baselines for Arabic number comprehension and providing actionable guidance for model selection in production Arabic NLP systems.
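
To make the two numeral systems concrete, here is a small sketch that maps Eastern Arabic-Indic digits (Unicode U+0660 to U+0669) to Western digits, plus a check of the test-case arithmetic quoted in the abstract. The function names are hypothetical, and this is not the benchmark's own extraction code.

```python
# Eastern Arabic-Indic digits occupy the contiguous range U+0660..U+0669.
EASTERN_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
TO_WESTERN = str.maketrans(EASTERN_DIGITS, "0123456789")


def normalize_digits(text: str) -> str:
    """Replace Eastern Arabic-Indic digits with ASCII digits; leave other text as-is."""
    return text.translate(TO_WESTERN)


def total_test_cases(combinations: int, tasks: int) -> int:
    """Each model-strategy combination is run on every task."""
    return combinations * tasks


# "\u0662\u0660\u0662\u0666" is the Eastern Arabic-Indic spelling of 2026.
YEAR = normalize_digits("\u0662\u0660\u0662\u0666")
CASES = total_test_cases(281, 210)
```

The 281 model-strategy combinations times 210 tasks give the 59,010 test cases reported in the abstract.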

74. 【2602.18764】The Convergence of Schema-Guided Dialogue Systems and the Model Context Protocol

Link: https://arxiv.org/abs/2602.18764

Authors: Andreas Schlapbach

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Model Context Protocol, auditable LLM-agent interaction, Context Protocol, Model Context, Schema-Guided Dialogue

Comments: 18 sections, 4 figures, 7 tables, 38 references. Original research presenting: (1) formal framework mapping Schema-Guided Dialogue principles to Model Context Protocol concepts, (2) five foundational design principles for LLM-native schema authoring, (3) architectural patterns for secure, scalable agent orchestration. Research supported by SBB (Swiss Federal Railways)

Abstract:This paper establishes a fundamental convergence: Schema-Guided Dialogue (SGD) and the Model Context Protocol (MCP) represent two manifestations of a unified paradigm for deterministic, auditable LLM-agent interaction. SGD, designed for dialogue-based API discovery (2019), and MCP, now the de facto standard for LLM-tool integration, share the same core insight -- that schemas can encode not just tool signatures but operational constraints and reasoning guidance. By analyzing this convergence, we extract five foundational principles for schema design: (1) Semantic Completeness over Syntactic Precision, (2) Explicit Action Boundaries, (3) Failure Mode Documentation, (4) Progressive Disclosure Compatibility, and (5) Inter-Tool Relationship Declaration. These principles reveal three novel insights: first, SGD's original design was fundamentally sound and should be inherited by MCP; second, both frameworks leave failure modes and inter-tool relationships unexploited -- gaps we identify and resolve; third, progressive disclosure emerges as a critical production-scaling insight under real-world token constraints. We provide concrete design patterns for each principle. These principles position schema-driven governance as a scalable mechanism for AI system oversight without requiring proprietary system inspection -- central to Software 3.0.

75. 【2602.18734】Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem

Link: https://arxiv.org/abs/2602.18734

Authors: Lichang Song, Ting Long, Yi Chang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: demonstrated strong effectiveness, grounding language generation, external evidence, demonstrated strong, strong effectiveness

Abstract:Retrieval-Augmented Generation (RAG) has demonstrated strong effectiveness in knowledge-intensive tasks by grounding language generation in external evidence. Despite its success, many existing RAG systems are built on a ranking-centric, asymmetric dependency paradigm, where the generation quality of the generator is highly dependent on the reranking results of the reranker. To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-makers rather than being connected through an asymmetric dependency pipeline. By jointly optimizing their behaviors toward a shared task objective, the reranker and generator are encouraged to cooperate, ensuring that document reranking and generation work in concert to improve the final response. Experimental results demonstrate good generalization and improved generation stability of CoRAG, even when the model is trained on only around 10K PopQA samples. Our model is released at this https URL

76. 【2602.18721】ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models

Link: https://arxiv.org/abs/2602.18721

Authors: Zefang Liu, Chenyang Zhu, Sangwoo Cho, Shi-Xiong Zhang

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: Semi-supervised learning, automatic speech recognition, error accumulation due, typically relies, noisy supervision

Abstract:Semi-supervised learning in automatic speech recognition (ASR) typically relies on pseudo-labeling, which often suffers from confirmation bias and error accumulation due to noisy supervision. To address this limitation, we propose ReHear, a framework for iterative pseudo-label refinement that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop. Unlike conventional text-based correctors, our approach conditions the LLM on both the ASR hypothesis and the source audio, allowing it to recover phonetically accurate transcripts even from severe recognition errors. These refined pseudo-labels serve as high-fidelity targets for fine-tuning the ASR model in an iterative cycle. Experimental results across diverse benchmarks demonstrate that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.

77. 【2602.18700】Watermarking LLM Agent Trajectories

Link: https://arxiv.org/abs/2602.18700

Authors: Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li, Zheng Liu, Zhou Yang, Chengkun Wei, Wenzhi Chen

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: high-capacity model generation, data requires substantial, substantial task design, agents rely heavily, requires substantial task

Comments: 20 pages, 9 figures

Abstract:LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering. Despite the high cost of creating these datasets, existing literature has overlooked copyright protection for LLM agent trajectories. This gap leaves creators vulnerable to data theft and makes it difficult to trace misuse or enforce ownership rights. This paper introduces ActHook, the first watermarking method tailored for agent trajectory datasets. Inspired by hook mechanisms in software engineering, ActHook embeds hook actions that are activated by a secret input key and do not alter the original task outcome. Like software execution, LLM agents operate sequentially, allowing hook actions to be inserted at decision points without disrupting task flow. When the activation key is present, an LLM agent trained on watermarked trajectories can produce these hook actions at a significantly higher rate, enabling reliable black-box detection. Experiments on mathematical reasoning, web searching, and software engineering agents show that ActHook achieves an average detection AUC of 94.3 on Qwen-2.5-Coder-7B while incurring negligible performance degradation.

78. 【2602.18699】Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift

Link: https://arxiv.org/abs/2602.18699

Authors: Stephen Russell

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: recursive trajectory instability, report multiple signals, distributional divergence, studies report multiple, shared explanatory theory

Abstract:Most semantic drift studies report multiple signals (e.g., embedding displacement, neighbor changes, distributional divergence, and recursive trajectory instability) without a shared explanatory theory that relates them. This paper proposes a formalization of these signals in one time-indexed substrate, $S_t=(X,d_t,P_t)$, combining embedding geometry with local diffusion. Within this substrate, node-level neighborhood drift measures changes in local conditional distributions, coarse Ricci curvature measures local contractivity of semantic diffusion, and recursive drift probes stability of iterated semantic operators. This manuscript specifies the formal model, assumptions, and tests that can refute the model. The paper also introduces bridge mass, a node-level aggregate of incident negative curvature, as a predictor of future neighborhood rewiring. This paper provides the theory and test contracts; empirical performance is deferred to subsequent studies.

79. 【2602.18693】Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM

Link: https://arxiv.org/abs/2602.18693

Authors: Md Badsha Biswas, Ozlem Uzuner

Categories: Computation and Language (cs.CL)

Keywords: significant societal risks, pose significant societal, Claim verification, societal risks, digital platforms

Abstract:The spread of misinformation across digital platforms can pose significant societal risks. Claim verification (a.k.a. fact-checking) systems can help identify potential misinformation. However, their efficacy is limited by the knowledge sources that they rely on. Most automated claim verification systems depend on a single knowledge source and utilize the supporting evidence from that source; they ignore the disagreement of their source with others. This limits their knowledge coverage and transparency. To address these limitations, we present a novel system for open-domain claim verification (ODCV) that leverages large language models (LLMs), multi-perspective evidence retrieval, and cross-source disagreement analysis. Our approach introduces a novel retrieval strategy that collects evidence for both the original and the negated forms of a claim, enabling the system to capture supporting and contradicting information from diverse sources: Wikipedia, PubMed, and Google. These evidence sets are filtered, deduplicated, and aggregated across sources to form a unified and enriched knowledge base that better reflects the complexity of real-world information. This aggregated evidence is then used for claim verification using LLMs. We further enhance interpretability by analyzing model confidence scores to quantify and visualize inter-source disagreement. Through extensive evaluation on four benchmark datasets with five LLMs, we show that knowledge aggregation not only improves claim verification but also reveals differences in source-specific reasoning. Our findings underscore the importance of embracing diversity, contradiction, and aggregation in evidence for building reliable and transparent claim verification systems.

80. 【2602.18692】From Trial by Fire To Sleep Like a Baby: A Lexicon of Anxiety Associations for 20k English Multiword Expressions

Link: https://arxiv.org/abs/2602.18692

Authors: Saif M. Mohammad

Categories: Computation and Language (cs.CL)

Keywords: future negative outcome, negative outcome, future negative, Anxiety, anxiety associations

Abstract:Anxiety is the unease about a possible future negative outcome. In recent years, there has been growing interest in understanding how anxiety relates to our health, well-being, body, mind, and behaviour. This includes work on lexical resources for word-anxiety association. However, there is very little anxiety-related work on larger units of text such as multiword expressions (MWE). Here, we introduce the first large-scale lexicon capturing descriptive norms of anxiety associations for more than 20k English MWEs. We show that the anxiety associations are highly reliable. We use the lexicon to study prevalence of different types of anxiety- and calmness-associated MWEs; and how that varies across two-, three-, and four-word sequences. We also study the extent to which the anxiety association of MWEs is compositional (due to its constituent words). The lexicon enables a wide variety of anxiety-related research in psychology, NLP, public health, and social sciences. The lexicon is freely available: this https URL

81. 【2602.18671】Spilled Energy in Large Language Models

Link: https://arxiv.org/abs/2602.18671

Authors: Adrian Robert Minut, Hazem Dewidar, Iacopo Masi

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Model, final Large Language, multiple interacting EBMs, Language Model, Large Language

Abstract:We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference. This principled approach allows us to track "energy spills" during decoding, which we empirically show correlate with factual errors, biases, and failures. Similar to Orgad et al. (2025), our method localizes the exact answer token and subsequently tests for hallucinations. Crucially, however, we achieve this without requiring trained probe classifiers or activation ablations. Instead, we introduce two completely training-free metrics derived directly from output logits: spilled energy, which captures the discrepancy between energy values across consecutive generation steps that should theoretically match, and marginalized energy, which is measurable at a single step. Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalization. Notably, these results hold for both pretrained and instruction-tuned variants without introducing any training overhead.

82. 【2602.18652】PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation

Link: https://arxiv.org/abs/2602.18652

Authors: Nina Hosseini-Kivanani

Categories: Computation and Language (cs.CL)

Keywords: idiomatic expressions due, non-compositional meanings, struggle with idiomatic, idiomatic expressions, expressions due

Comments: Accepted at AdMIRe 2 shared task (Advancing Multimodal Idiomaticity Representation) colocated with 22nd Workshop on Multiword Expressions (MWE 2026) @EACL2026

Abstract:Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings. We introduce PolyFrame, our system for the MWE-2026 AdMIRe 2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants retain frozen CLIP-style vision-language encoders and the multilingual BGE M3 encoder, training only lightweight modules: a logistic regression and LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion. Starting from a CLIP baseline (26.7% Top-1 on English dev, 6.7% on English test), adding idiom-aware paraphrasing and explicit sentence-type classification increased performance to 60.0% Top-1 on English and 60.0% Top-1 (0.822 NDCG@5) in zero-shot transfer to Portuguese. On the multilingual blind test, our systems achieved average Top-1/NDCG scores of 0.35/0.73 for Subtask A and 0.32/0.71 for Subtask B across 15 languages. Ablation results highlight idiom-aware rewriting as the main contributor to performance, while sentence-type prediction and multimodal fusion enhance robustness. These findings suggest that effective idiom disambiguation is feasible without fine-tuning large multimodal encoders.
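
Borda rank fusion, one of the lightweight modules listed above, combines several rankings by awarding points by position. A minimal sketch of the standard Borda count follows, assuming every ranker orders the same candidate set; the function name and the alphabetical tie-break are illustrative choices, not details from the paper.

```python
from collections import defaultdict


def borda_fuse(rankings):
    """Fuse best-first rankings of the same candidates with a Borda count.

    A candidate at position i in a ranking of n items earns n - 1 - i points.
    Ties in total score are broken alphabetically (an arbitrary choice).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return sorted(scores, key=lambda c: (-scores[c], c))


# Three hypothetical rankers ordering the same three candidate images.
FUSED = borda_fuse([
    ["img_a", "img_b", "img_c"],
    ["img_b", "img_a", "img_c"],
    ["img_b", "img_c", "img_a"],
])
```

Here `img_b` collects 5 points, `img_a` 3, and `img_c` 1, so the fused order is `img_b`, `img_a`, `img_c`.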

83. 【2602.18633】DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning

Link: https://arxiv.org/abs/2602.18633

Authors: Fangyuan Xu, Sihao Chen, Zinan Lin, Taiwei Shi, Sydney Graham, Pei Zhou, Mengting Wan, Alex Stein, Virginia Estellers, Charles Chen, Morris Sharp, Richard Speyer, Tadas Baltrusaitis, Jennifer Neville, Eunsol Choi, Longqi Yang

Categories: Computation and Language (cs.CL)

Keywords: developing large language, synthetic data, data, large language models, private

Abstract:Differentially private (DP) synthetic data generation plays a pivotal role in developing large language models (LLMs) on private data, where data owners cannot provide eyes-on access to individual examples. Generating DP synthetic data typically involves a difficult trade-off. On one hand, DP finetuning methods train an LLM as a synthetic data generator with formal privacy guarantees, yet they still require the raw content of private examples for model training. On the other hand, methods that avoid direct exposure to private data are bounded by an off-the-shelf, un-finetuned model, whose outputs often lack domain fidelity. Can we train an LLM to generate high-quality synthetic text without eyes-on access to individual private examples? In this work, we introduce Differentially Private Reinforcement Fine-Tuning (DP-RFT), an online reinforcement learning algorithm for synthetic data generation with LLMs. DP-RFT leverages DP-protected nearest-neighbor votes from an eyes-off private corpus as a reward signal for on-policy synthetic samples generated by an LLM. The LLM iteratively learns to generate synthetic data to maximize the expected DP votes through Proximal Policy Optimization (PPO). We evaluate DP-RFT for long-form and domain-specific synthetic data generation, such as news articles, meeting transcripts, and medical article abstracts. Our experiments show that DP-RFT closes the gap between private evolution and DP finetuning methods in terms of the fidelity and downstream utility of the generated synthetic data, while respecting the private data boundary.

84. 【2602.18613】Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools

Link: https://arxiv.org/abs/2602.18613

Authors: Baris Arat, Emre Sefer

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Standard reranking evaluations, reranker orders candidates, orders candidates returned, reranking evaluations study, Standard reranking

Abstract:Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever. This setup couples ranking behavior with retrieval quality, so differences in output cannot be attributed to the ranking policy alone. We introduce a controlled diagnostic that isolates reranking by using Multi-News clusters as fixed evidence pools. We limit each pool to exactly eight documents and pass identical inputs to all rankers. Within this setup, BM25 and MMR serve as interpretable reference points for lexical matching and diversity optimization. Across 345 clusters, we find that redundancy patterns vary by model: one LLM implicitly diversifies at larger selection budgets, while another increases redundancy. In contrast, LLMs underperform on lexical coverage at small selection budgets. As a result, LLM rankings diverge substantially from both baselines rather than consistently approximating either strategy. By eliminating retrieval variance, we can attribute these differences directly to the ranking policy. This diagnostic is model-agnostic and applicable to any ranker, including open source systems and proprietary APIs.
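
MMR (maximal marginal relevance), one of the two reference rankers named above, greedily picks documents that are relevant to the query but not redundant with those already chosen. The sketch below assumes Jaccard token overlap as both the relevance and the similarity measure and a trade-off weight `lam`; the paper's actual similarity function and parameters are not given in the abstract, and the example documents are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_select(query, docs, k, lam=0.5):
    """Return indices of up to k documents chosen by greedy MMR.

    Score = lam * relevance(doc, query) - (1 - lam) * max similarity
    to any already-selected document.
    """
    tokens = [d.lower().split() for d in docs]
    q = query.lower().split()
    relevance = [jaccard(q, t) for t in tokens]
    selected = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max(
                (jaccard(tokens[i], tokens[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)  # first index wins on ties
        selected.append(best)
        remaining.remove(best)
    return selected


POOL = ["storm hits coast", "storm hits coast", "relief funds approved"]
DIVERSE = mmr_select("storm coast relief", POOL, k=2, lam=0.5)
RELEVANCE_ONLY = mmr_select("storm coast relief", POOL, k=2, lam=1.0)
```

With `lam=0.5` the duplicate second document is skipped in favor of the relief story, while `lam=1.0` ignores redundancy and selects both copies of the storm document.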

85. 【2602.18583】Luna-2: Scalable Single-Token Evaluation with Small Language Models

Link: https://arxiv.org/abs/2602.18583

Authors: Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel, Shuai Shao, Yash Sheth

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Real-time guardrails require, operationally non-deterministic due, guardrails require evaluation, Real-time guardrails, task-specific LLMAJ metrics

Abstract:Real-time guardrails require evaluation that is accurate, cheap, and fast, yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation. We present Luna-2, a novel architecture that turns decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g., toxicity, hallucination, and tool selection quality) at an accuracy on par with or higher than LLMAJ using frontier LLMs while drastically reducing the cost and latency of computation. Each metric is implemented as a lightweight LoRA/PEFT head on top of a shared SLM backbone, enabling hundreds of specialized metrics to run concurrently on a single GPU, deployable locally next to AI systems in a privacy-preserving and latency-optimizing manner. Across content safety and hallucination benchmarks, Luna-2 matches the accuracy of state-of-the-art LLM-based evaluators while reducing inference cost by over 80x and latency by over 20x. In this paper, we outline the model architecture and training methodology, and report real-world empirical results on accuracy, latency, and throughput. In production, Luna-2 is protecting 100M+ AI sessions and processing over 100B tokens per month for our customers with eval cost savings of over $30M annually.

86. 【2602.18582】Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications

Link: https://arxiv.org/abs/2602.18582

Authors: Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: training artificial intelligence, artificial intelligence, training artificial, Reward design, Reward

Comments: Extended version of an identically-titled paper accepted at AAMAS 2026

Abstract:When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed. As AI agents tackle increasingly complex tasks, aligning their behavior with human-provided specifications becomes critical for responsible AI deployment. Reward design provides a direct channel for such alignment by translating human expectations into reward functions that guide reinforcement learning (RL). However, existing methods are often too limited to capture nuanced human preferences that arise in long-horizon tasks. Hence, we introduce Hierarchical Reward Design from Language (HRDL): a problem formulation that extends classical reward design to encode richer behavioral specifications for hierarchical RL agents. We further propose Language to Hierarchical Rewards (L2HR) as a solution to HRDL. Experiments show that AI agents trained with rewards designed via L2HR not only complete tasks effectively but also better adhere to human specifications. Together, HRDL and L2HR advance the research on human-aligned AI agents.

87. 【2602.18492】Vibe Coding on Trial: Operating Characteristics of Unanimous LLM Juries

Link: https://arxiv.org/abs/2602.18492

Authors: Muhammad Aziz Ullah, Abdul Serwadda

Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: Large Language Models, Large Language, workflow increasingly built, GitHub Copilot, plain language

Comments: Submitted to IEEE International Conference on Semantic Computing 2026

Abstract:Large Language Models (LLMs) are now good enough at coding that developers can describe intent in plain language and let the tool produce the first code draft, a workflow increasingly built into tools like GitHub Copilot, Cursor, and Replit. What is missing is a reliable way to tell which model-written queries are safe to accept without sending everything to a human. We study the application of an LLM jury to run this review step. We first benchmark 15 open models on 82 MySQL text-to-SQL tasks using an execution-grounded protocol to get a clean baseline of which models are strong. From the six best models we build unanimous committees of sizes 1 through 6 that see the prompt, schema, and candidate SQL and accept it only when every member says it is correct. This rule matches safety-first deployments where false accepts are more costly than false rejects. We measure true positive rate, false positive rate, and Youden's J, and we also look at committees per generator. Our results show that single-model judges are uneven, that small unanimous committees of strong models can cut false accepts while still passing many good queries, and that the exact committee composition matters significantly.
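
The unanimous acceptance rule and the three reported metrics can be sketched as follows. This is an illustration under assumed inputs (boolean votes per committee member and a boolean ground-truth label per candidate query); the function names and toy data are hypothetical and do not come from the paper.

```python
def unanimous_accept(votes):
    """Accept a candidate only if every committee member votes 'correct'.

    Note: all([]) is True, so an empty committee would accept everything.
    """
    return all(votes)


def jury_metrics(vote_matrix, labels):
    """Compute TPR, FPR, and Youden's J (TPR - FPR) for a unanimous jury.

    vote_matrix[i] holds the members' votes on candidate i;
    labels[i] is True when candidate i is actually correct.
    """
    tp = fp = fn = tn = 0
    for votes, is_correct in zip(vote_matrix, labels):
        accepted = unanimous_accept(votes)
        if accepted and is_correct:
            tp += 1
        elif accepted and not is_correct:
            fp += 1  # false accept: the costly error in safety-first settings
        elif is_correct:
            fn += 1  # false reject
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"tpr": tpr, "fpr": fpr, "youden_j": tpr - fpr}


# Toy data: two judges, five candidate queries (three correct, two wrong).
VOTES = [[True, True], [True, True], [True, False], [True, False], [False, False]]
LABELS = [True, True, True, False, False]

PAIR = jury_metrics(VOTES, LABELS)
SINGLE = jury_metrics([[v[0]] for v in VOTES], LABELS)  # first judge alone
```

In this toy data, adding the second judge removes the single judge's false accept (FPR falls from 0.5 to 0) at the cost of one false reject (TPR falls from 1 to 2/3), which is the trade-off the abstract describes.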

88. 【2602.18487】The Million-Label NER: Breaking Scale Barriers with GLiNER bi-encoder

Link: https://arxiv.org/abs/2602.18487

Authors: Ihor Stepanov, Mykhailo Shtopko, Dmytro Vodianytskyi, Oleksandr Lukashov

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Named Entity Recognition, harmonizes zero-shot flexibility, Named Entity, industrial-scale efficiency, flexibility with industrial-scale

Comments: 13 pages, 1 figure, 4 tables

Abstract:This paper introduces GLiNER-bi-Encoder, a novel architecture for Named Entity Recognition (NER) that harmonizes zero-shot flexibility with industrial-scale efficiency. While the original GLiNER framework offers strong generalization, its joint-encoding approach suffers from quadratic complexity as the number of entity labels increases. Our proposed bi-encoder design decouples the process into a dedicated label encoder and a context encoder, effectively removing the context-window bottleneck. This architecture enables the simultaneous recognition of thousands, and potentially millions, of entity types with minimal overhead. Experimental results demonstrate state-of-the-art zero-shot performance, achieving 61.5 percent Micro-F1 on the CrossNER benchmark. Crucially, by leveraging pre-computed label embeddings, GLiNER-bi-Encoder achieves up to a 130 times throughput improvement at 1024 labels compared to its uni-encoder predecessors. Furthermore, we introduce GLiNKER, a modular framework that leverages this architecture for high-performance entity linking across massive knowledge bases such as Wikidata.
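
The general bi-encoder idea of encoding labels separately, caching the embeddings, and matching them against text by similarity can be sketched as follows. This is not the GLiNER API: `toy_embed` is a hypothetical stand-in that counts characters instead of running a trained model, and the label set is made up.

```python
import math
from typing import Dict, List


def toy_embed(text: str, dim: int = 16) -> List[float]:
    """Stand-in encoder: L2-normalised bag of character codes (NOT a real model)."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Label embeddings are computed once and reused for every input text,
# so adding labels does not lengthen the text encoder's input.
labels = ["person", "organization", "location"]
label_index: Dict[str, List[float]] = {name: toy_embed(name) for name in labels}


def best_label(span: str) -> str:
    """Score one text span against every cached label embedding."""
    span_vec = toy_embed(span)
    return max(label_index, key=lambda name: dot(span_vec, label_index[name]))


print(best_label("person"))
```

In this sketch the per-input cost is one span encoding plus one dot product per label. That is why a cached label index can grow large without re-running the label encoder.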

89. 【2602.18483】Red Teaming LLMs as Socio-Technical Practice: From Exploration and Data Creation to Evaluation

Link: https://arxiv.org/abs/2602.18483

Authors: Adriana Alvarado Garcia, Ruyuan Wan, Ozioma C. Oguine, Karla Badillo-Urquiola

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, key evaluative approach, red teaming datasets

Abstract:Recently, red teaming, with roots in security, has become a key evaluative approach to ensure the safety and reliability of Generative Artificial Intelligence. However, most existing work emphasizes technical benchmarks and attack success rates, leaving the socio-technical practices of how red teaming datasets are defined, created, and evaluated under-examined. Drawing on 22 interviews with practitioners who design and evaluate red teaming datasets, we examine the data practices and standards that underpin this work. Because adversarial datasets determine the scope and accuracy of model evaluations, they are critical artifacts for assessing potential harms from large language models. Our contributions are first, empirical evidence of practitioners conceptualizing red teaming and developing and evaluating red teaming datasets. Second, we reflect on how practitioners' conceptualization of risk leads to overlooking the context, interaction type, and user specificity. We conclude with three opportunities for HCI researchers to expand the conceptualization and data practices for red-teaming.

90. 【2602.18468】The Algorithmic Unconscious: Structural Mechanisms and Implicit Biases in Large Language Models

Link: https://arxiv.org/abs/2602.18468

Authors: Philippe Boisnard

Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)

Keywords: large language models, article introduces, introduces the concept, algorithmic unconscious, unconscious to designate

Comments: 18 pages, 5 figures. Extended version of a paper presented at the international conference 'Artificial Intelligence and Transformations of Information' (LOGOS/FLSH, Hassan II University of Casablanca, Morocco, December 2025), accepted for publication in LOGOS after double-blind peer review

Abstract:This article introduces the concept of the algorithmic unconscious to designate the set of structural determinations that operate within large language models (LLMs) without being accessible either to the model's own reflexivity or to that of its users. In contrast to approaches that reduce AI bias solely to dataset composition or to the projection of human intentionality, we argue that a significant class of biases emerges directly from the technical mechanisms of the models themselves: tokenization, attention, statistical optimization, and alignment procedures. By framing bias as an infrastructural phenomenon, this approach resolves a central theoretical ambiguity surrounding responsibility, neutrality, and correction in contemporary LLMs. Based on a comparative analysis of tokenization across a corpus of parallel sentences, we show that Arabic languages (Modern Standard Arabic and Maghrebi dialects) undergo a systematic inflation in token count relative to English, with ratios ranging from 1.6x to nearly 4x depending on the infrastructure (OpenAI, Anthropic, SentencePiece/Mistral). This over-segmentation constitutes a measurable infrastructural bias that mechanically increases inference costs, constrains access to contextual space, and alters attentional weighting within model representations. We relate these empirical findings to three additional structural mechanisms: causal bias (correlation vs causation), the erasure of minoritized features through dimensional collapse, and normative biases induced by safety alignment. Finally, we propose a framework for a technical clinic of models, grounded in the audit of tokenization regimes, latent space topology, and alignment systems, as a necessary condition for the critical appropriation of AI infrastructures.

91. 【2602.18464】How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?

Link: https://arxiv.org/abs/2602.18464

Authors: Yuxuan Li, Leyang Li, Hao-Ping (Hank) Lee, Sauvik Das

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: people form attitudes, large language model, growing body, body of research, research assumes

Abstract:A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward and behave in response to security and privacy (SP) threats. If correct, these simulations could offer a scalable way to forecast SP risks in products prior to deployment. We interrogate this assumption using SP-ABCBench, a new benchmark of 30 tests derived from validated SP human-subject studies, which measures alignment between simulations and human-subjects studies on a 0-100 ascending scale, where higher scores indicate better alignment across three dimensions: Attitude, Behavior, and Coherence. Evaluating twelve LLMs, four persona construction strategies, and two prompting methods, we found that there remains substantial room for improvement: all models score between 50 and 64 on average. Newer, bigger, and smarter models do not reliably do better and sometimes do worse. Some simulation configurations, however, do yield high alignment: e.g., with scores above 95 for some behavior tests when agents are prompted to apply bounded rationality and weigh privacy costs against perceived benefits. We release SP-ABCBench to enable reproducible evaluation as methods improve.

92. 【2602.18458】The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research

Link: https://arxiv.org/abs/2602.18458

Authors: Xiaoyan Bai, Alexander Baumgartner, Haojia Sun, Ari Holtzman, Chenhao Tan

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: paper-centric review system, crises across sciences, sciences highlight, highlight the limitations, system in assessing

Comments: Our code is available at https://github.com/ChicagoHAI/MechEvalAgent/

Abstract:Reproducibility crises across sciences highlight the limitations of the paper-centric review system in assessing the rigor and reproducibility of research. AI agents that autonomously design and generate large volumes of research outputs exacerbate these challenges. In this work, we address the growing challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We propose the first execution-grounded evaluation framework that verifies research beyond narrative review by examining code and data alongside the paper. We use mechanistic interpretability research as a testbed, build standardized research output, and develop MechEvalAgent, an automated evaluation framework that assesses the coherence of the experimental process, the reproducibility of results, and the generalizability of findings. We show that our framework achieves above 80% agreement with human judges, identifies substantial methodological problems, and surfaces 51 additional issues that human reviewers miss. Our work demonstrates the potential of AI agents to transform research evaluation and pave the way for rigorous scientific practices.

93. 【2602.18454】Exploring the Ethical Concerns in User Reviews of Mental Health Apps using Topic Modeling and Sentiment Analysis

Link: https://arxiv.org/abs/2602.18454

Authors: Mohammad Masudur Rahman, Beenish Moalla Chaudhry

Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Google Play Store, Apple App Store, health mobile apps, rapid growth, growth of AI-driven

Comments: 22 pages, journal-ready version

Abstract:The rapid growth of AI-driven mental health mobile apps has raised concerns about their ethical considerations and user trust. This study proposed a natural language processing (NLP)-based framework to evaluate ethical aspects from user-generated reviews from the Google Play Store and Apple App Store. After gathering and cleaning the data, topic modeling was applied to identify latent themes in the context of ethics using topic words and then map them to well-recognized existing ethical principles described in different ethical frameworks; in addition to that, a bottom-up approach is applied to find any new and emergent ethics from the reviews using a transformer-based zero-shot classification model. Sentiment analysis was then used to capture how users feel about each ethical aspect. The obtained results reveal that well-known ethical considerations are not enough for the modern AI-based technologies and are missing emerging ethical challenges, showing how these apps either uphold or overlook key moral values. This work contributes to developing an ongoing evaluation system that can enhance the fairness, transparency, and trustworthiness of AI-powered mental health chatbots.

94. 【2602.18450】Asymptotic Semantic Collapse in Hierarchical Optimization

Link: https://arxiv.org/abs/2602.18450

Authors: Faruk Alpay, Bugra Kilictas

Categories: Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)

Keywords: Multi-agent language systems, yielding near-uniform behavior, progressively absorbs individual, absorbs individual semantics, Multi-agent language

Comments: 23 pages, 2 figures. Includes a dataset-free benchmark with full metric reporting

Abstract:Multi-agent language systems can exhibit a failure mode where a shared dominant context progressively absorbs individual semantics, yielding near-uniform behavior across agents. We study this effect under the name Asymptotic Semantic Collapse in Hierarchical Optimization. In a closed linguistic setting with a Dominant Anchor Node whose semantic state has effectively infinite inertia, we show that repeated interactions with Peripheral Agent Nodes drive an asymptotic alignment that minimizes a global loss. We model semantic states as points on a Riemannian manifold and analyze the induced projection dynamics. Two consequences follow. First, the limiting semantic configuration is insensitive to the optimization history: both smooth gradient-style updates and stochastic noisy updates converge to the same topological endpoint, establishing path independence at convergence. Second, the degree of context dependence controls information content: moving from atomic (independent) representations to fully entangled (context-bound) representations forces the node entropy, interpreted as available degrees of freedom, to vanish in the limit. The theory connects information-theoretic quantities with differential-geometric structure and suggests an interpretation as an immutable consensus rule that constrains agents to a shared semantic grammar. A lightweight dataset-free benchmark on an RWKV-7 13B GGUF checkpoint complements the analysis, reporting zero hash collisions, mean compliance of 0.50 under greedy decoding and 0.531 under stochastic decoding, and final Jaccard-to-anchor similarity values of 0.295 and 0.224, respectively.

95. 【2602.18449】Prompt Optimization Via Diffusion Language Models

Link: https://arxiv.org/abs/2602.18449

Authors: Shiyu Wang, Haolin Chen, Liangwei Yang, Jielin Qiu, Rithesh Murthy, Ming Zhu, Zixiang Chen, Silvio Savarese, Caiming Xiong, Shelby Heinecke, Huan Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: iteratively refine system, leverages Diffusion Language, Diffusion Language Models, downstream language model, refine system prompts

Abstract:We propose a diffusion-based framework for prompt optimization that leverages Diffusion Language Models (DLMs) to iteratively refine system prompts through masked denoising. By conditioning on interaction traces, including user queries, model responses, and optional feedback, our method enables flexible, span-level prompt updates without requiring gradient access or modifying the downstream language model. Across diverse benchmarks (e.g., $\tau$-bench, SST-2, SST-5), DLM-optimized prompts consistently improve the performance of a frozen target LLM (e.g., GPT-4o-mini). We further show that moderate diffusion step counts provide the best balance between refinement quality and stability. These results highlight diffusion-based prompt optimization as a general, model-agnostic, and scalable approach for enhancing LLM performance through iterative prompt refinement.

96. 【2602.18448】INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection

Link: https://arxiv.org/abs/2602.18448

Authors: Shubham Kulkarni, Alexander Lyzhov, Preetam Joshi, Shiva Chaitanya

Categories: Computation and Language (cs.CL)

Keywords: trillion USD annually, Administrative phone tasks, million insurance-benefit verification, trillion USD, Administrative phone

Comments: Accepted to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)

Abstract:Administrative phone tasks drain roughly 1 trillion USD annually from U.S. healthcare, with over 500 million insurance-benefit verification calls manually handled in 2024. We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification. The corpus includes 50 de-identified, AI-initiated calls with live insurance representatives (mean 71 turns/call) and 1,000 synthetically generated calls that mirror the same workflow. All calls are annotated with a phase-structured JSON schema covering IVR navigation, patient identification, coverage status, medication checks (up to two drugs), and agent identification (CRN), and each phase is labeled for Information and Procedural compliance under explicit ask/answer logic. We define two novel evaluation tasks: (1) Phase Boundary Detection (span segmentation under phase-specific acceptance rules) and (2) Compliance Verification (IC/PC decisions given fixed spans). Per-phase scores are strong across small, low-latency baselines, but end-to-end reliability is constrained by span-boundary errors. On real calls, full-call exact segmentation is low, showing a gap between conversational fluency and audit-grade evidence.

97. 【2602.18447】ConfSpec: Efficient Step-Level Speculative Reasoning via Confidence-Gated Verification

Link: https://arxiv.org/abs/2602.18447

Authors: Siran Liu, Cyril Y. He

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: high inference latency, inference latency due, incurs high inference, long generation traces, reasoning significantly improves

Abstract:Chain-of-Thought reasoning significantly improves the performance of large language models on complex tasks, but incurs high inference latency due to long generation traces. Step-level speculative reasoning aims to mitigate this cost, yet existing approaches face a long-standing trade-off among accuracy, inference speed, and resource efficiency. We propose ConfSpec, a confidence-gated cascaded verification framework that resolves this trade-off. Our key insight is an asymmetry between generation and verification: while generating a correct reasoning step requires substantial model capacity, step-level verification is a constrained discriminative task for which small draft models are well-calibrated within their competence range, enabling high-confidence draft decisions to be accepted directly while selectively escalating uncertain cases to the large target model. Evaluation across diverse workloads shows that ConfSpec achieves up to 2.24$\times$ end-to-end speedups while matching target-model accuracy. Our method requires no external judge models and is orthogonal to token-level speculative decoding, enabling further multiplicative acceleration.
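
The gating logic described in the abstract, where a confident draft verdict is accepted directly and an uncertain one is escalated to the target model, can be sketched as below. The 0.9 threshold, the function names, and both toy verifiers are assumptions made for illustration. They are not the paper's implementation.

```python
from typing import Callable, Tuple

# A verifier returns (is_step_valid, confidence in [0, 1]).
Verifier = Callable[[str], Tuple[bool, float]]


def gated_verify(step: str, draft: Verifier, target: Verifier,
                 threshold: float = 0.9) -> Tuple[bool, str]:
    """Trust the draft verdict when it is confident; otherwise escalate."""
    verdict, confidence = draft(step)
    if confidence >= threshold:
        return verdict, "draft"
    verdict, _ = target(step)
    return verdict, "target"


def toy_draft(step: str) -> Tuple[bool, float]:
    # Hypothetical small model: confident only on short steps.
    return (True, 0.95) if len(step) < 20 else (True, 0.55)


def toy_target(step: str) -> Tuple[bool, float]:
    # Hypothetical large model: accepts steps that contain an equation.
    return ("=" in step, 1.0)


easy = gated_verify("2 + 2 = 4", toy_draft, toy_target)
hard = gated_verify("therefore the integral diverges", toy_draft, toy_target)
print(easy, hard)
```

In this sketch the target model runs only for steps where the draft is below the threshold, so the saving depends on how often the draft is both confident and right.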

98. 【2602.18446】ReportLogic: Evaluating Logical Quality in Deep Research Reports

Link: https://arxiv.org/abs/2602.18446

Authors: Jujia Zhao, Zhaoxin Huan, Zihan Wang, Xiaolu Zhang, Jun Zhou, Suzan Verberne, Zhaochun Ren

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Users increasingly rely, Language Models, Deep Research, Large Language

Abstract: Users increasingly rely on Large Language Models (LLMs) for Deep Research, using them to synthesize diverse sources into structured reports that support understanding and action. In this context, the practical reliability of such reports hinges on logical quality: whether the report's claims and arguments are explicitly supported and can be trusted as a basis for downstream use, rather than merely appearing fluent or informative. However, current evaluation frameworks largely overlook this requirement. To bridge this gap, we introduce ReportLogic, a benchmark that quantifies report-level logical quality through a reader-centric lens of auditability. Specifically, ReportLogic adopts a hierarchical taxonomy that evaluates whether readers can (1) trace an on-topic report structure with a unified analytical arc (Macro-Logic), (2) understand the progression with necessary context (Expositional-Logic), and (3) verify conclusions via explicit claim-support (Structural-Logic). Based on this taxonomy, we construct a human-annotated rubric-guided dataset and train an open-source LogicJudge for scalable evaluation. We further evaluate judge robustness via adversarial attacks, showing that off-the-shelf LLM judges are frequently influenced by superficial cues (e.g., verbosity), and reasoning modes can mask broken support relations. Overall, our results provide actionable guidance for building more robust logic evaluators and improving the logical reliability of LLM-generated reports.

99. 【2602.18443】From "Help" to Helpful: A Hierarchical Assessment of LLMs in Mental e-Health Applications

Link: https://arxiv.org/abs/2602.18443

Authors: Philipp Steigerwald, Jens Albrecht

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: efficient case prioritisation, Psychosocial online counselling, frequently encounters generic, impede efficient case, encounters generic subject

Abstract:Psychosocial online counselling frequently encounters generic subject lines that impede efficient case prioritisation. This study evaluates eleven large language models generating six-word subject lines for German counselling emails through hierarchical assessment - first categorising outputs, then ranking within categories to enable manageable evaluation. Nine assessors (counselling professionals and AI systems) enable analysis via Krippendorff's $\alpha$, Spearman's $\rho$, Pearson's $r$ and Kendall's $\tau$. Results reveal performance trade-offs between proprietary services and privacy-preserving open-source alternatives, with German fine-tuning consistently improving performance. The study addresses critical ethical considerations for mental health AI deployment including privacy, bias and accountability.

100. 【2602.18915】AAVGen: Precision Engineering of Adeno-associated Viral Capsids for Renal Selective Targeting

Link: https://arxiv.org/abs/2602.18915

Authors: Mohammadreza Ghaffarzadeh-Esfahani, Yousof Gheisari

Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: native serotypes face, serotypes face limitations, Adeno-associated viruses, immune evasion, gene therapy

Comments: 22 pages, 6 figures, and 5 supplementary files. Corresponding author: ygheisari@med.mui.ac.ir. Kaggle notebook is available at https://www.kaggle.com/code/mohammadgh009/aavgen

Abstract:Adeno-associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency. Engineering capsids to overcome these hurdles is challenging due to the vast sequence space and the difficulty of simultaneously optimizing multiple functional properties. The complexity also adds when it comes to the kidney, which presents unique anatomical barriers and cellular targets that require precise and efficient vector engineering. Here, we present AAVGen, a generative artificial intelligence framework for de novo design of AAV capsids with enhanced multi-trait profiles. AAVGen integrates a protein language model (PLM) with supervised fine-tuning (SFT) and a reinforcement learning technique termed Group Sequence Policy Optimization (GSPO). The model is guided by a composite reward signal derived from three ESM-2-based regression predictors, each trained to predict a key property: production fitness, kidney tropism, and thermostability. Our results demonstrate that AAVGen produces a diverse library of novel VP1 protein sequences. In silico validations revealed that the majority of the generated variants have superior performance across all three employed indices, indicating successful multi-objective optimization. Furthermore, structural analysis via AlphaFold3 confirms that the generated sequences preserve the canonical capsid folding despite sequence diversification. AAVGen establishes a foundation for data-driven viral vector engineering, accelerating the development of next-generation AAV vectors with tailored functional characteristics.

101. 【2602.18899】[b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic

Link: https://arxiv.org/abs/2602.18899

Authors: Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David Harwath, David R. Mortensen

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: structured remains underexplored, rich phonetic information, Self-supervised speech models, encode rich phonetic, Self-supervised speech

Comments: Submitted to ACL, code planned to release after acceptance

Abstract: Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular attention to phonological vectors. We first show that there exist linear directions within the model's representation space that correspond to phonological features. We further demonstrate that the scale of these phonological vectors correlates with the degree of acoustic realization of their corresponding phonological features in a continuous manner. For example, the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing. Together, these findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, demonstrating phonological vector arithmetic. All code and interactive demos are available at this https URL.

Information Retrieval

1. 【2602.20135】KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Link: https://arxiv.org/abs/2602.20135

Authors: Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi, Behnam Bahrak

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: large language models, rise of large, large language, instrumental in applications, RAG

Comments: Accepted at the Third Conference on Parsimony and Learning (CPAL 2026). 36 pages, 12 figures. (Equal contribution: Yasaman Amou Jafari and Mahdi Noori.)

Abstract:With the rise of large language models (LLMs), they have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.

2. 【2602.20122】NanoKnow: How to Know What Your Language Model Knows

Link: https://arxiv.org/abs/2602.20122

Authors: Lingwei Gu, Nour Jedidi, Jimmy Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: large language models, large language, pre-training data, pre-training, language models

Abstract:How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box" -- unknown or inaccessible. The recent release of nanochat -- a family of small LLMs with fully open pre-training data -- addresses this as it provides a transparent view into where a model's parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre-training corpus. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. To demonstrate NanoKnow's utility, we conduct experiments using eight nanochat checkpoints. Our findings show: (1) closed-book accuracy is strongly influenced by answer frequency in the pre-training data, (2) providing external evidence can mitigate this frequency dependence, (3) even with external evidence, models are more accurate when answers were seen during pre-training, demonstrating that parametric and external knowledge are complementary, and (4) non-relevant information is harmful, with accuracy decreasing based on both the position and the number of non-relevant contexts. We release all NanoKnow artifacts at this https URL.

3. 【2602.20093】ManCAR: Manifold-Constrained Latent Reasoning with Adaptive Test-Time Computation for Sequential Recommendation

Link: https://arxiv.org/abs/2602.20093

Authors: Kun Yang, Yuxuan Zhu, Yazhe Chen, Siyao Zheng, Bangyang Hong, Kangle Wu, Yabo Ni, Anxiang Zeng, Cong Fu, Hui Li

Categories: Information Retrieval (cs.IR)

Keywords: Sequential recommendation increasingly, recommendation increasingly employs, increasingly employs latent, employs latent multi-step, enhance test-time computation

Comments: 15 pages, 7 figures

Abstract:Sequential recommendation increasingly employs latent multi-step reasoning to enhance test-time computation. Despite empirical gains, existing approaches largely drive intermediate reasoning states via target-dominant objectives without imposing explicit feasibility constraints. This results in latent drift, where reasoning trajectories deviate into implausible regions. We argue that effective recommendation reasoning should instead be viewed as navigation on a collaborative manifold rather than free-form latent refinement. To this end, we propose ManCAR (Manifold-Constrained Adaptive Reasoning), a principled framework that grounds reasoning within the topology of a global interaction graph. ManCAR constructs a local intent prior from the collaborative neighborhood of a user's recent actions, represented as a distribution over the item simplex. During training, the model progressively aligns its latent predictive distribution with this prior, forcing the reasoning trajectory to remain within the valid manifold. At test time, reasoning proceeds adaptively until the predictive distribution stabilizes, avoiding over-refinement. We provide a variational interpretation of ManCAR to theoretically validate its drift-prevention and adaptive test-time stopping mechanisms. Experiments on seven benchmarks demonstrate that ManCAR consistently outperforms state-of-the-art baselines, achieving up to a 46.88% relative improvement w.r.t. NDCG@10. Our code is available at this https URL.
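
The adaptive test-time stopping idea, refining until the predictive distribution stabilizes, can be sketched as follows. The KL-divergence criterion, the tolerance, and the toy `refine` step that moves halfway toward a fixed prior are all assumptions made for illustration. The paper's actual update and stopping rule may differ.

```python
import math
from typing import Callable, List, Tuple

Dist = List[float]


def kl(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def reason_until_stable(initial: Dist, refine: Callable[[Dist], Dist],
                        tol: float = 1e-3, max_steps: int = 20) -> Tuple[Dist, int]:
    """Apply refinement steps until successive distributions barely change."""
    current = initial
    for step in range(1, max_steps + 1):
        nxt = refine(current)
        if kl(nxt, current) < tol:
            return nxt, step  # stabilised: stop early
        current = nxt
    return current, max_steps


# Hypothetical intent prior over three items, and a toy refinement step
# that moves the prediction halfway toward that prior on each call.
prior = [0.7, 0.2, 0.1]


def toy_refine(dist: Dist) -> Dist:
    return [0.5 * d + 0.5 * p for d, p in zip(dist, prior)]


final, steps = reason_until_stable([1 / 3, 1 / 3, 1 / 3], toy_refine)
print(steps, [round(x, 3) for x in final])
```

Because the loop exits as soon as one more step changes the distribution by less than `tol`, inputs that settle quickly use fewer refinement steps than the `max_steps` budget.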

4. 【2602.20001】FairFS: Addressing Deep Feature Selection Biases for Recommender System

Link: https://arxiv.org/abs/2602.20001

Authors: Xianquan Wang, Zhaocheng Du, Jieming Zhu, Qinglin Jia, Zhenhua Dong, Kai Zhang

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Large-scale online marketplaces, recommender systems serve, critical technological support, Large-scale online, recommender systems

Comments: Accepted by The Web Conference 2026

Abstract:Large-scale online marketplaces and recommender systems serve as critical technological support for e-commerce development. In industrial recommender systems, features play vital roles as they carry information for downstream models. Accurate feature importance estimation is critical because it helps identify the most useful feature subsets from thousands of feature candidates for online services. Such selection enables improved online performance while reducing computational cost. To address feature selection problems in deep learning, trainable gate-based and sensitivity-based methods have been proposed and proven effective in industrial practice. However, through the analysis of real-world cases, we identified three bias issues that cause feature importance estimation to rely on partial model layers, samples, or gradients, ultimately leading to inaccurate importance estimation. We refer to these as layer bias, baseline bias, and approximation bias. To mitigate these issues, we propose FairFS, a fair and accurate feature selection algorithm. FairFS regularizes feature importance estimated across all nonlinear transformation layers to address layer bias. It also introduces a smooth baseline feature close to the classifier decision boundary and adopts an aggregated approximation method to alleviate baseline and approximation biases. Extensive experiments demonstrate that FairFS effectively mitigates these biases and achieves state-of-the-art feature selection performance.

5. 【2602.19990】A Context-Aware Knowledge Graph Platform for Stream Processing in Industrial IoT

Link: https://arxiv.org/abs/2602.19990

Authors: Monica Marconi Sciarroni, Emanuele Storti

Categories: Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)

Keywords: Industrial IoT ecosystems, IoT ecosystems bring, smart devices operating, devices operating collaboratively, bring together sensors

Comments

Click to view abstract

Abstract:Industrial IoT ecosystems bring together sensors, machines and smart devices operating collaboratively across industrial environments. These systems generate large volumes of heterogeneous, high-velocity data streams that require interoperable, secure and contextually aware management. Most of the current stream management architectures, however, still rely on syntactic integration mechanisms, which result in limited flexibility, maintainability and interpretability in complex Industry 5.0 scenarios. This work proposes a context-aware semantic platform for data stream management that unifies heterogeneous IoT/IoE data sources through a Knowledge Graph enabling formal representation of devices, streams, agents, transformation pipelines, roles and rights. The model supports flexible data gathering, composable stream processing pipelines, and dynamic role-based data access based on agents' contexts, relying on Apache Kafka and Apache Flink for real-time processing, while SPARQL and SWRL-based reasoning provide context-dependent stream discovery. Experimental evaluations demonstrate the effectiveness of combining semantic models, context-aware reasoning and distributed stream processing to enable interoperable data workflows for Industry 5.0 environments.

6. 【2602.19987】Counterfactual Understanding via Retrieval-aware Multimodal Modeling for Time-to-Event Survival Prediction

Link: https://arxiv.org/abs/2602.19987

Authors: Ha-Anh Hoang Nguyen, Tri-Duc Phan Le, Duc-Hoang Pham, Huy-Son Nguyen, Cam-Van Thi Nguyen, Duc-Trong Le, Hoang-Quynh Le

Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Keywords: optimize individualized survival, individualized survival outcomes, counterfactual survival prediction, aiming to optimize, censored data

Comments

Click to view abstract

Abstract:This paper tackles the problem of time-to-event counterfactual survival prediction, aiming to optimize individualized survival outcomes in the presence of heterogeneity and censored data. We propose CURE, a framework that advances counterfactual survival modeling via comprehensive multimodal embedding and latent subgroup retrieval. CURE integrates clinical, paraclinical, demographic, and multi-omics information, which are aligned and fused through cross-attention mechanisms. Complex multi-omics signals can be adaptively refined using a mixture-of-experts architecture, emphasizing the most informative omics components. Building upon this representation, CURE implicitly retrieves patient-specific latent subgroups that capture both baseline survival dynamics and treatment-dependent variations. Experimental results on the METABRIC and TCGA-LUAD datasets demonstrate that the proposed CURE model consistently outperforms strong baselines in survival analysis, evaluated using the Time-dependent Concordance Index ($C^{td}$) and Integrated Brier Score (IBS). These findings highlight the potential of CURE to enhance multimodal understanding and serve as a foundation for future treatment recommendation models. All code and related resources are publicly available to facilitate reproducibility at this https URL.

7. 【2602.19961】Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval

Link: https://arxiv.org/abs/2602.19961

Authors: Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Yuanhuiyi Lyu, Yu Huang, Jungang Li, Kening Zheng, Xu Zheng, Philip S. Yu, James Kwok, Xuming Hu

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: precise information acquisition, unstructured visually rich, visually rich data, Visual Document Retrieval, visual documents exhibit

Comments: Under review

Click to view abstract

Abstract:With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and Agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising future directions, aiming to provide a clear roadmap for future multimodal document intelligence.

8. 【2602.19778】Enhancing Automatic Chord Recognition via Pseudo-Labeling and Knowledge Distillation

Link: https://arxiv.org/abs/2602.19778

Authors: Nghia Phan, Rong Jin, Gang Liu, Xiao Dong

Categories: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: Automatic Chord Recognition, Automatic Chord, Chord Recognition, costly to acquire, scarcity of aligned

Comments: 9 pages, 6 figures, 3 tables

Click to view abstract

Abstract:Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, as well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we present a two-stage training pipeline that leverages pre-trained models together with unlabeled audio. In the first stage, we use a pre-trained BTC model as a teacher to generate pseudo-labels for over 1,000 hours of diverse unlabeled audio and train a student model solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available, with selective knowledge distillation (KD) from the teacher applied as a regularizer to prevent catastrophic forgetting of the representations learned in the first stage. In our experiments, two models (BTC, 2E1D) were used as students. In stage 1, using only pseudo-labels, the BTC student achieves over 98% of the teacher's performance, while the 2E1D model achieves about 96% across seven standard mir_eval metrics. After a single training run for both students in stage 2, the resulting BTC student model surpasses the traditional supervised learning baseline by 2.5% and the original pre-trained teacher model by 1.55% on average across all metrics. The resulting 2E1D student model improves on the traditional supervised learning baseline by 3.79% on average and achieves almost the same performance as the teacher. Both cases show large gains on rare chord qualities.

9. 【2602.19728】GrIT: Group Informed Transformer for Sequential Recommendation

Link: https://arxiv.org/abs/2602.19728

Authors: Adamya Shyam, Venkateswara Rao Kagita, Bharti Rana, Vikas Kumar

Categories: Information Retrieval (cs.IR)

Keywords: recommender systems aim, extracting temporal patterns, Sequential recommender systems, user future interests, recommender systems

Comments

Click to view abstract

Abstract:Sequential recommender systems aim to predict a user's future interests by extracting temporal patterns from their behavioral history. Existing approaches typically employ transformer-based architectures to process long sequences of user interactions, capturing preference shifts by modeling temporal relationships between items. However, these methods often overlook the influence of group-level features that capture the collective behavior of similar users. We hypothesize that explicitly modeling temporally evolving group features alongside individual user histories can significantly enhance next-item recommendation. Our approach introduces latent group representations, where each user's affiliation to these groups is modeled through learnable, time-varying membership weights. The membership weights at each timestep are computed by modeling shifts in user preferences through their interaction history, where we incorporate both short-term and long-term user preferences. We extract a set of statistical features that capture the dynamics of user behavior and further refine them through a series of transformations to produce the final drift-aware membership weights. A group-based representation is derived by weighting latent group embeddings with the learned membership scores. This representation is integrated with the user's sequential representation within the transformer block to jointly capture personal and group-level temporal dynamics, producing richer embeddings that lead to more accurate, context-aware recommendations. We validate the effectiveness of our approach through extensive experiments on five benchmark datasets, where it consistently outperforms state-of-the-art sequential recommendation methods.
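
The step in which latent group embeddings are weighted by membership scores can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the membership weights are normalized, so the softmax here is an assumption, and the function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def group_representation(membership_scores, group_embeddings):
    # Weighted sum of group embeddings, one weight per latent group.
    # Softmax normalization is an assumption, not taken from the paper.
    weights = softmax(membership_scores)
    dim = len(group_embeddings[0])
    return [
        sum(w * g[d] for w, g in zip(weights, group_embeddings))
        for d in range(dim)
    ]

# Two hypothetical latent groups with 2-dimensional embeddings.
groups = [[1.0, 0.0], [0.0, 1.0]]
balanced = group_representation([0.0, 0.0], groups)
skewed = group_representation([10.0, 0.0], groups)
```

With equal scores the result is the plain average of the group embeddings; as one score grows, the representation moves toward that group's embedding. In a time-varying setting, the scores would be recomputed at each timestep.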

10. 【2602.19711】A Three-stage Neuro-symbolic Recommendation Pipeline for Cultural Heritage Knowledge Graphs

Link: https://arxiv.org/abs/2602.19711

Authors: Krzysztof Kutt, Elżbieta Sroka, Oleksandra Ishchuk, Luiz do Valle Miranda

Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC)

Keywords: digital cultural heritage, cultural heritage resources, heritage resources highlights, interpreting semantic relationships, advanced recommendation methods

Comments: 15 pages, 1 figure; submitted to ICCS 2026 conference

Click to view abstract

Abstract:The growing volume of digital cultural heritage resources highlights the need for advanced recommendation methods capable of interpreting semantic relationships between heterogeneous data entities. This paper presents a complete methodology for implementing a hybrid recommendation pipeline integrating knowledge-graph embeddings, approximate nearest-neighbour search, and SPARQL-driven semantic filtering. The work is evaluated on the JUHMP (Jagiellonian University Heritage Metadata Portal) knowledge graph developed within the CHExRISH project, which at the time of experimentation contained approximately 3.2M RDF triples describing people, events, objects, and historical relations affiliated with the Jagiellonian University (Kraków, PL). We evaluate four embedding families (TransE, ComplEx, ConvE, CompGCN) and perform hyperparameter selection for ComplEx and HNSW. Then, we present and evaluate the final three-stage neuro-symbolic recommender. Despite sparse and heterogeneous metadata, the approach produces useful and explainable recommendations, as confirmed by expert evaluation.

11. 【2602.19702】DReX: An Explainable Deep Learning-based Multimodal Recommendation Framework

Link: https://arxiv.org/abs/2602.19702

Authors: Adamya Shyam, Venkateswara Rao Kagita, Bharti Rana, Vikas Kumar

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: recommender systems leverage, systems leverage diverse, diverse data sources, Multimodal recommender systems, leverage diverse data

Comments

Click to view abstract

Abstract:Multimodal recommender systems leverage diverse data sources, such as user interactions, content features, and contextual information, to address challenges like cold-start and data sparsity. However, existing methods often suffer from one or more key limitations: processing different modalities in isolation, requiring complete multimodal data for each interaction during training, or learning user and item representations independently. These factors contribute to increased complexity and potential misalignment between user and item embeddings. To address these challenges, we propose DReX, a unified multimodal recommendation framework that incrementally refines user and item representations by leveraging interaction-level features from multimodal feedback. Our model employs gated recurrent units to selectively integrate these fine-grained features into global representations. This incremental update mechanism provides three key advantages: (1) simultaneous modeling of both nuanced interaction details and broader preference patterns, (2) elimination of separate user and item feature extraction processes, leading to better alignment of their learned representations, and (3) inherent robustness to varying or missing modalities. We evaluate the performance of the proposed approach on three real-world datasets containing reviews and ratings as interaction modalities. By considering review text as a modality, our approach automatically generates interpretable keyword profiles for both users and items, which supplement the recommendation process with interpretable preference indicators. Experimental results demonstrate that our approach outperforms state-of-the-art methods across all evaluated datasets.

12. 【2602.19698】Iconographic Classification and Content-Based Recommendation for Digitized Artworks

Link: https://arxiv.org/abs/2602.19698

Authors: Krzysztof Kutt, Maciej Baczyński

Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: selected artificial intelligence, automates iconographic classification, artificial intelligence methods, system that automates, automates iconographic

Comments: 14 pages, 7 figures; submitted to ICCS 2026 conference

Click to view abstract

Abstract:We present a proof-of-concept system that automates iconographic classification and content-based recommendation of digitized artworks using the Iconclass vocabulary and selected artificial intelligence methods. The prototype implements a four-stage workflow for classification and recommendation, which integrates YOLOv8 object detection with algorithmic mappings to Iconclass codes, rule-based inference for abstract meanings, and three complementary recommenders (hierarchical proximity, IDF-weighted overlap, and Jaccard similarity). Although more engineering is still needed, the evaluation demonstrates the potential of this solution: Iconclass-aware computer vision and recommendation methods can accelerate cataloging and enhance navigation in large heritage repositories. The key insight is to let computer vision propose visible elements and to use symbolic structures (Iconclass hierarchy) to reach meaning.
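
Two of the recommenders named above, Jaccard similarity and IDF-weighted overlap, can be sketched on sets of labels per artwork. This is a toy illustration rather than the prototype's code: the catalog, the plain-word labels standing in for Iconclass codes, and the function names are all hypothetical, and the IDF formula `log(N / df)` is one common choice assumed here.

```python
import math
from collections import Counter

def jaccard(a, b):
    # Size of the intersection divided by size of the union.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def idf_weights(catalog):
    # Rare labels receive higher weight: log(N / document frequency).
    n = len(catalog)
    df = Counter(code for codes in catalog.values() for code in codes)
    return {code: math.log(n / count) for code, count in df.items()}

def idf_overlap(a, b, idf):
    # Sum of IDF weights over the labels shared by both artworks.
    return sum(idf.get(code, 0.0) for code in a & b)

def recommend(query_id, catalog, score, k=3):
    # Rank all other artworks by score (descending), ties broken by id.
    query = catalog[query_id]
    scored = [
        (score(query, codes), other_id)
        for other_id, codes in catalog.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scored[:k]]

# Placeholder labels; a real system would use Iconclass notations.
catalog = {
    "painting_a": {"lion", "saint", "book"},
    "painting_b": {"lion", "saint", "skull"},
    "painting_c": {"lion", "landscape"},
    "painting_d": {"book", "landscape"},
}
idf = idf_weights(catalog)
by_jaccard = recommend("painting_a", catalog, jaccard)
by_idf = recommend("painting_a", catalog, lambda a, b: idf_overlap(a, b, idf))
```

In this toy catalog, Jaccard scores `painting_c` and `painting_d` equally against `painting_a`, while the IDF-weighted score prefers `painting_d` because the label it shares ("book") is rarer than the one `painting_c` shares ("lion"). That difference suggests why the two measures can be complementary.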

13. 【2602.19549】Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework

Link: https://arxiv.org/abs/2602.19549

Authors: Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Jiahao Huo, Shuliang Liu, James Kwok, Xuming Hu

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: Visual Document Retrieval, multimodal retrieval applications, Visual Document, current multimodal retrieval, retrieve relevant pages

Comments: Under review

Click to view abstract

Abstract:Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead, a problem that current efficiency methods like pruning and merging address imperfectly, creating a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce Prune-then-Merge, a novel two-stage framework that synergizes these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, creating a refined, high-signal set of embeddings. Subsequently, a hierarchical merging stage compresses this pre-filtered set, effectively summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. Extensive experiments on 29 VDR datasets demonstrate that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and providing robust performance at high compression ratios.

14. 【2602.19543】Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation

Link: https://arxiv.org/abs/2602.19543

Authors: Rizhuo Huang, Yifan Feng, Rundong Xue, Shihui Ying, Jun-Hai Yong, Chuan Shi, Shaoyi Du, Yue Gao

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: n-ary atomic facts, hypergraphs surpass traditional, surpass traditional binary, Knowledge hypergraphs surpass, n-ary atomic

Comments

Click to view abstract

Abstract:Knowledge hypergraphs surpass traditional binary knowledge graphs by encapsulating complex $n$-ary atomic facts, providing a more comprehensive paradigm for semantic representation. However, constructing high-quality hypergraphs remains challenging due to the scenario gap: generic extractors struggle to generalize across diverse domains with specific jargon, while existing methods often fail to balance structural skeletons with fine-grained details. To bridge this gap, we propose Hyper-KGGen, a skill-driven framework that reformulates extraction as a dynamic skill-evolving process. First, Hyper-KGGen employs a coarse-to-fine mechanism to systematically decompose documents, ensuring full-dimensional coverage from binary links to complex hyperedges. Crucially, it incorporates an adaptive skill acquisition module that actively distills domain expertise into a Global Skill Library. This is achieved via a stability-based feedback loop, where extraction stability serves as a relative reward signal to induce high-quality skills from unstable traces and missed predictions. Additionally, we present HyperDocRED, a rigorously annotated benchmark for document-level knowledge hypergraph extraction. Experiments demonstrate that Hyper-KGGen significantly outperforms strong baselines, validating that evolved skills provide substantially richer guidance than static few-shot examples in multi-scenario settings.

15. 【2602.19339】SplitLight: An Exploratory Toolkit for Recommender Systems Datasets and Splits

Link: https://arxiv.org/abs/2602.19339

Authors: Anna Volodkevich, Dmitry Anikin, Danil Gusak, Anton Klenitskiy, Evgeny Frolov, Alexey Vasilev

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: affected by hidden, under-documented choices, data preparation, choices in data, Offline evaluation

Comments

Click to view abstract

Abstract:Offline evaluation of recommender systems is often affected by hidden, under-documented choices in data preparation. Seemingly minor decisions in filtering, handling repeats, cold-start treatment, and splitting strategy design can substantially reorder model rankings and undermine reproducibility and cross-paper comparability. In this paper, we introduce SplitLight, an open-source exploratory toolkit that enables researchers and practitioners designing preprocessing and splitting pipelines or reviewing external artifacts to make these decisions measurable, comparable, and reportable. Given an interaction log and derived split subsets, SplitLight analyzes core and temporal dataset statistics, characterizes repeat consumption patterns and timestamp anomalies, and diagnoses split validity, including temporal leakage, cold-user/item exposure, and distribution shifts. SplitLight further allows side-by-side comparison of alternative splitting strategies through comprehensive aggregated summaries and interactive visualizations. Delivered as both a Python toolkit and an interactive no-code interface, SplitLight produces audit summaries that justify evaluation protocols and support transparent, reliable, and comparable experimentation in recommender systems research and industry.
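
The kind of split diagnosis described above (temporal leakage, cold-user and cold-item exposure) can be sketched in a few lines. This does not use SplitLight's actual API, which is not shown in the abstract. The `diagnose_split` function, its tuple layout, and the definition of leakage as "training rows at or after the earliest test timestamp" are assumptions for illustration.

```python
def diagnose_split(train, test):
    """Summarize leakage and cold-start exposure for one train/test split.

    Both arguments are lists of (user, item, timestamp) tuples.
    """
    test_start = min(ts for _, _, ts in test)
    train_users = {u for u, _, _ in train}
    train_items = {i for _, i, _ in train}
    test_users = {u for u, _, _ in test}
    test_items = {i for _, i, _ in test}
    return {
        # Training rows that occur at or after the first test interaction.
        "leaked_train_rows": sum(1 for _, _, ts in train if ts >= test_start),
        # Test users and items never seen during training.
        "cold_users": sorted(test_users - train_users),
        "cold_items": sorted(test_items - train_items),
    }

# Toy interaction log with integer timestamps.
train = [("u1", "i1", 10), ("u1", "i2", 20), ("u2", "i1", 35)]
test = [("u1", "i3", 30), ("u3", "i1", 40)]
report = diagnose_split(train, test)
```

Here the training row at timestamp 35 falls after the earliest test interaction (timestamp 30), so it is counted as leaked, and `u3` and `i3` appear only in the test set.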

16. 【2602.19333】PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification

Link: https://arxiv.org/abs/2602.19333

Authors: Isun Chehreh, Ebrahim Ansari

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Keywords: well-balanced Persian social, Persian social media, specifically designed, Persian social, designed to address

Comments: 10 pages, including 1 figure

Click to view abstract

Abstract:This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance, we employed undersampling with semantic redundancy removal and advanced data augmentation strategies integrating lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa (with LoRA and AdaLoRA adaptations), FaBERT, SBERT-based architectures, and the Persian-specific TookaBERT (Base and Large). Experimental results show that transformer-based models consistently outperform traditional neural networks, with TookaBERT-Large achieving the best performance (Precision: 0.9622, Recall: 0.9621, F1-score: 0.9621). Class-wise evaluation further confirms robust performance across all categories, though social and political texts exhibited slightly lower scores due to inherent ambiguity. This research presents a new high-quality dataset and provides comprehensive evaluations of cutting-edge models, establishing a solid foundation for further developments in Persian NLP, including trend analysis, social behavior modeling, and user classification. The dataset is publicly available to support future research endeavors.

17. 【2602.19317】Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Link: https://arxiv.org/abs/2602.19317

Authors: Maryam Amirizaniani, Alireza Salemi, Hamed Zamani

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Question Answering, requires answers, users' background, accurate and aligned, aligned with users'

Comments

Click to view abstract

Abstract:Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context. Existing state-of-the-art methods primarily rely on retrieval-augmented generation (RAG) solutions that construct personal context by retrieving relevant items from the user's profile. Existing methods use the user's query directly to retrieve personal documents, and such strategies often lead to surface-level personalization. We propose PR2 (Personalized Retrieval-Augmented Reasoning), a reinforcement learning framework that integrates reasoning and retrieval from personal context for personalization. PR2 learns adaptive retrieval-reasoning policies, determining when to retrieve, what evidence to retrieve from user profiles, and how to incorporate it into intermediate reasoning steps. By optimizing multi-turn reasoning trajectories under a personalized reward function, the framework reinforces reasoning paths that better align with user-specific preferences and contextual signals reflected by the reward model. Extensive experiments on the LaMP-QA benchmark using three LLMs show that PR2 consistently outperforms strong baselines, achieving an average relative improvement of 8.8%-12% in personalized QA.

18. 【2602.19183】SIDEKICK: A Semantically Integrated Resource for Drug Effects, Indications, and Contraindications

Link: https://arxiv.org/abs/2602.19183

Authors: Mohammad Ashhad, Olga Mashkova, Ricardo Henao, Robert Hoehndorf

Categories: Information Retrieval (cs.IR)

Keywords: guide medical practice, clinical decision support, decision support systems, support systems utilize, systems utilize structured

Comments

Click to view abstract

Abstract:Pharmacovigilance and clinical decision support systems utilize structured drug safety data to guide medical practice. However, existing datasets frequently depend on terminologies such as MedDRA, which limits their semantic reasoning capabilities and their interoperability with Semantic Web ontologies and knowledge graphs. To address this gap, we developed SIDEKICK, a knowledge graph that standardizes drug indications, contraindications, and adverse reactions from FDA Structured Product Labels. We developed and used a workflow based on Large Language Model (LLM) extraction and Graph-Retrieval Augmented Generation (Graph RAG) for ontology mapping. We processed over 50,000 drug labels and mapped terms to the Human Phenotype Ontology (HPO), the MONDO Disease Ontology, and RxNorm. Our semantically integrated resource outperforms the SIDER and ONSIDES databases when applied to the task of drug repurposing by side effect similarity. We serialized the dataset as a Resource Description Framework (RDF) graph and employed the Semanticscience Integrated Ontology (SIO) as upper level ontology to further improve interoperability. Consequently, SIDEKICK enables automated safety surveillance and phenotype-based similarity analysis for drug repurposing.

19. 【2602.19040】Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval

Link: https://arxiv.org/abs/2602.19040

Authors: Jiaxin Wu, Xiao-Yong Wei, Qing Li

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: large language models, short-form video platforms, multimodal large language, language models, rise of short-form

Comments

Click to view abstract

Abstract:The rise of short-form video platforms and the emergence of multimodal large language models (MLLMs) have amplified the need for scalable, effective, zero-shot text-to-video retrieval systems. While recent advances in large-scale pretraining have improved zero-shot cross-modal alignment, existing methods still struggle with query-dependent temporal reasoning, limiting their effectiveness on complex queries involving temporal, logical, or causal relationships. To address these limitations, we propose an adaptive multi-agent retrieval framework that dynamically orchestrates specialized agents over multiple reasoning iterations based on the demands of each query. The framework includes: (1) a retrieval agent for scalable retrieval over large video corpora, (2) a reasoning agent for zero-shot contextual temporal reasoning, and (3) a query reformulation agent for refining ambiguous queries and recovering performance for those that degrade over iterations. These agents are dynamically coordinated by an orchestration agent, which leverages intermediate feedback and reasoning outcomes to guide execution. We also introduce a novel communication mechanism that incorporates retrieval-performance memory and historical reasoning traces to improve coordination and decision-making. Experiments on three TRECVid benchmarks spanning eight years show that our framework achieves a twofold improvement over CLIP4Clip and outperforms state-of-the-art methods by a large margin.

20. 【2602.18962】NeuroWise: A Multi-Agent LLM "Glass-Box" System for Practicing Double-Empathy Communication with Autistic Partners

Link: https://arxiv.org/abs/2602.18962

Authors: Albert Tang, Yifan Mo, Jie Li, Yue Su, Mengyuan Zhang, Sander L. Koole, Koen Hindriks, Jiahuan Pei

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)

Keywords: double empathy problem, empathy problem frames, problem frames communication, frames communication difficulties, double empathy

Comments: Accepted to ACM CHI 2026

Click to view abstract

Abstract:The double empathy problem frames communication difficulties between neurodivergent and neurotypical individuals as arising from mutual misunderstanding, yet most interventions focus on autistic individuals. We present NeuroWise, a multi-agent LLM-based coaching system that supports neurotypical users through stress visualization, interpretation of internal experiences, and contextual guidance. In a between-subjects study (N=30), NeuroWise was rated as helpful by all participants and showed a significant condition-time effect on deficit-based attributions (p=0.02): NeuroWise users reduced deficit framing, while baseline users shifted toward blaming autistic "deficits" after difficult interactions. NeuroWise users also completed conversations more efficiently (37% fewer turns, p=0.03). These findings suggest that AI-based interpretation can support attributional change by helping users recognize communication challenges as mutual.

21. 【2602.18929】Give Users the Wheel: Towards Promptable Recommendation Paradigm

Link: https://arxiv.org/abs/2602.18929

Authors: Fuyuan Lyu, Chenglin Luo, Qiyuan Zhang, Yupeng Hou, Haolun Wu, Xing Tang, Xue Liu, Jin L.C. Guo, Xiuqiang He

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: implicit behavioral patterns, achieved remarkable success, mining implicit behavioral, Large Language Models, sequential recommendation models

Comments

Click to view abstract

Abstract:Conventional sequential recommendation models have achieved remarkable success in mining implicit behavioral patterns. However, these architectures remain structurally blind to explicit user intent: they struggle to adapt when a user's immediate goal (e.g., expressed via a natural language prompt) deviates from their historical habits. While Large Language Models (LLMs) offer the semantic reasoning to interpret such intent, existing integration paradigms force a dilemma: the LLM-as-a-recommender paradigm sacrifices the efficiency and collaborative precision of ID-based retrieval, while reranking methods are inherently bottlenecked by the recall capabilities of the underlying model. In this paper, we propose Decoupled Promptable Sequential Recommendation (DPR), a model-agnostic framework that empowers conventional sequential backbones to natively support Promptable Recommendation, the ability to dynamically steer the retrieval process using natural language without abandoning collaborative signals. DPR modulates the latent user representation directly within the retrieval space. To achieve this, we introduce a Fusion module to align the collaborative and semantic signals, a Mixture-of-Experts (MoE) architecture that disentangles the conflicting gradients from positive and negative steering, and a three-stage training strategy that progressively aligns the semantic space of prompts with the collaborative space. Extensive experiments on real-world datasets demonstrate that DPR significantly outperforms state-of-the-art baselines in prompt-guided tasks while maintaining competitive performance in standard sequential recommendation scenarios.

22. 【2602.18786】CaliCausalRank: Calibrated Multi-Objective Ad Ranking with Robust Counterfactual Utility Optimization

Link: https://arxiv.org/abs/2602.18786

Authors: Xikai Yang, Sebastian Sun, Yilin Li, Yue Xing, Ming Wang, Yang Wang

Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Keywords: including click-through rate, simultaneously optimize multiple, user experience metrics, optimize multiple objectives, multiple objectives including

Comments

Click to view abstract

Abstract:Ad ranking systems must simultaneously optimize multiple objectives including click-through rate (CTR), conversion rate (CVR), revenue, and user experience metrics. However, production systems face critical challenges: score scale inconsistency across traffic segments undermines threshold transferability, and position bias in click logs causes offline-online metric discrepancies. We propose CaliCausalRank, a unified framework that integrates training-time scale calibration, constraint-based multi-objective optimization, and robust counterfactual utility estimation. Our approach treats score calibration as a first-class training objective rather than post-hoc processing, employs Lagrangian relaxation for constraint satisfaction, and utilizes variance-reduced counterfactual estimators for reliable offline evaluation. Experiments on the Criteo and Avazu datasets demonstrate that CaliCausalRank achieves 1.1% relative AUC improvement, 31.6% calibration error reduction, and 3.2% utility gain compared to the best baseline (PairRank) while maintaining consistent performance across different traffic segments.

23. 【2602.18759】Towards Reliable Negative Sampling for Recommendation with Implicit Feedback via In-Community Popularity

Link: https://arxiv.org/abs/2602.18759

Authors: Chen Chen, Haobo Lin, Yuanbo Xu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: explicit negative signals, modern recommender systems, negative sampling, signals are unavailable, implicit feedback

Comments: 12 pages, 9 figures

Abstract:Learning from implicit feedback is a fundamental problem in modern recommender systems, where only positive interactions are observed and explicit negative signals are unavailable. In such settings, negative sampling plays a critical role in model training by constructing negative items that enable effective preference learning and ranking optimization. However, designing reliable negative sampling strategies remains challenging, as they must simultaneously ensure realness, hardness, and interpretability. To this end, we propose ICPNS (In-Community Popularity Negative Sampling), a novel framework that leverages user community structure to identify reliable and informative negative samples. Our approach is grounded in the insight that item exposure is driven by latent user communities. By identifying these communities and utilizing in-community popularity, ICPNS effectively approximates the probability of item exposure. Consequently, items that are popular within a user's community but remain unclicked are identified as more reliable true negatives. Extensive experiments on four benchmark datasets demonstrate that ICPNS yields consistent improvements on graph-based recommenders and competitive performance on MF-based models, outperforming representative negative sampling strategies under a unified evaluation protocol.
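The core idea stated in the abstract, that items popular within a user's community but never clicked by that user make more reliable negatives, can be sketched roughly as follows. This is not the authors' implementation: the community assignment is assumed to be given, sampling proportional to raw in-community counts is an assumption, and all names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def in_community_popularity(interactions, community_of):
    """Count, per community, how many of its users interacted with each item."""
    popularity = defaultdict(Counter)
    for user, items in interactions.items():
        popularity[community_of[user]].update(items)
    return popularity


def sample_negatives(user, interactions, community_of, popularity, n, rng):
    """Sample n negatives for `user`, weighted by in-community popularity.

    Only items seen by someone in the user's community (a proxy for exposure)
    and not clicked by the user are eligible.
    """
    seen = interactions[user]
    counts = popularity[community_of[user]]
    candidates = [item for item in counts if item not in seen]
    if not candidates:
        return []
    weights = [counts[item] for item in candidates]
    return rng.choices(candidates, weights=weights, k=n)


# Hypothetical toy data: user -> set of clicked items, user -> community id.
interactions = {"u1": {"a", "b"}, "u2": {"a", "c"}, "u3": {"c", "d"}, "u4": {"e"}}
community_of = {"u1": 0, "u2": 0, "u3": 0, "u4": 1}

popularity = in_community_popularity(interactions, community_of)
negatives = sample_negatives("u1", interactions, community_of, popularity,
                             n=20, rng=random.Random(0))
```

Here item "e" is never sampled for "u1" because nobody in community 0 interacted with it, while "c" is drawn more often than "d" on average because two community members clicked it.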

24. 【2602.18650】NutriOrion: A Hierarchical Multi-Agent Framework for Personalized Nutrition Intervention Grounded in Clinical Guidelines

Link: https://arxiv.org/abs/2602.18650

Authors: Junwei Wu, Runze Yan, Hanqi Luo, Darren Liu, Minxiao Wang, Kimberly L. Townsend, Lydia S. Hartwig, Derek Milketinas, Xiao Hu, Carl Yang

Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Personalized nutrition intervention, improving health outcomes, heterogeneous clinical conditions, Personalized nutrition, health outcomes

Comments: none

Abstract:Personalized nutrition intervention for patients with multimorbidity is critical for improving health outcomes, yet remains challenging because it requires the simultaneous integration of heterogeneous clinical conditions, medications, and dietary guidelines. Single-agent large language models (LLMs) often suffer from context overload and attention dilution when processing such high-dimensional patient profiles. We introduce NutriOrion, a hierarchical multi-agent framework with a parallel-then-sequential reasoning topology. NutriOrion decomposes nutrition planning into specialized domain agents with isolated contexts to mitigate anchoring bias, followed by a conditional refinement stage. The framework includes a multi-objective prioritization algorithm to resolve conflicting dietary requirements and a safety constraint mechanism that injects pharmacological contraindications as hard negative constraints during synthesis, ensuring clinical validity by construction rather than post-hoc filtering. For clinical interoperability, NutriOrion maps synthesized insights into the ADIME standard and FHIR R4 resources. Evaluated on 330 stroke patients with multimorbidity, NutriOrion outperforms multiple baselines, including GPT-4.1 and alternative multi-agent architectures. It achieves a 12.1 percent drug-food interaction violation rate, demonstrates strong personalization with negative correlations (-0.26 to -0.35) between patient biomarkers and recommended risk nutrients, and yields clinically meaningful dietary improvements, including a 167 percent increase in fiber and a 27 percent increase in potassium, alongside reductions in sodium (9 percent) and sugars (12 percent).

25. 【2602.18613】Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools

Link: https://arxiv.org/abs/2602.18613

Authors: Baris Arat, Emre Sefer

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Standard reranking evaluations, reranker orders candidates, orders candidates returned, reranking evaluations study, Standard reranking

Comments: none

Abstract:Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever. This setup couples ranking behavior with retrieval quality, so differences in output cannot be attributed to the ranking policy alone. We introduce a controlled diagnostic that isolates reranking by using Multi-News clusters as fixed evidence pools. We limit each pool to exactly eight documents and pass identical inputs to all rankers. Within this setup, BM25 and MMR serve as interpretable reference points for lexical matching and diversity optimization. Across 345 clusters, we find that redundancy patterns vary by model: one LLM implicitly diversifies at larger selection budgets, while another increases redundancy. In contrast, LLMs underperform on lexical coverage at small selection budgets. As a result, LLM rankings diverge substantially from both baselines rather than consistently approximating either strategy. By eliminating retrieval variance, we can attribute these differences directly to the ranking policy. This diagnostic is model-agnostic and applicable to any ranker, including open source systems and proprietary APIs.
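The abstract uses MMR (maximal marginal relevance) as its diversity reference point. A minimal sketch of MMR selection over a fixed document pool is shown below, assuming bag-of-words cosine similarity and a trade-off parameter `lam`; the paper's actual similarity function and settings are not stated here, and the toy pool is invented for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, pool, k, lam=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    q = Counter(query.lower().split())
    docs = [Counter(d.lower().split()) for d in pool]
    selected = []
    remaining = list(range(len(pool)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(q, docs[i])
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical fixed evidence pool shared by every ranker.
pool = [
    "solar power plant opens",
    "solar power plant opens today",
    "wind farm expands offshore",
]
balanced = mmr_select("solar power wind", pool, k=2, lam=0.5)
relevance_only = mmr_select("solar power wind", pool, k=2, lam=1.0)
```

With `lam=0.5` the near-duplicate second document is skipped in favor of the wind-farm article, whereas `lam=1.0` reduces to pure relevance ranking and keeps both solar articles.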

26. 【2602.18588】Altar: Structuring Sharable Experimental Data from Early Exploration to Publication

Link: https://arxiv.org/abs/2602.18588

Authors: William Gaultier, Andrea Lodetti, Ian Coghill, David Colliaux, Maximilian Fleck, Alienor Lahlou

Categories: Information Retrieval (cs.IR); Databases (cs.DB)

Keywords: active development phase, Data Management Plans, Management Plans included, significant challenge, active development

Comments: none

Abstract:Managing the data and metadata during the active development phase of an experimental project presents a significant challenge, particularly in collaborative research. This phase is frequently overlooked in Data Management Plans included in project proposals, despite its important role in ensuring reproducibility and preventing the need for retroactive reconstruction at the time of publication. Here we present Altar, a lightweight, domain-agnostic framework for structuring experimental data from the onset of a project without imposing rigid data models. Altar is built around the Sacred experiment-tracking model and captures experimental (meta)data and structures them. Parameters, metadata, curves and small files are stored in a flexible NoSQL database, while large raw data are maintained in dedicated storage and linked through unique identifiers, ensuring efficiency and traceability. This integration is composable with existing workflows, allowing integration with minimal disruption of work habits. We document different pathways to use Altar based on users' skill sets (PhD students, Post-docs, Principal Investigators, Laboratory administrators, System administrators). While getting started with Altar does not require a specialized infrastructure, the framework can be easily deployed on a server and made publicly accessible when scaling up or preparing data for publication. By addressing the dynamic phase of research, Altar provides a practical bridge between exploratory experimentation and FAIR-aligned data sharing.

27. 【2602.18437】FineRef: Fine-Grained Error Reflection and Correction for Long-Form Generation with Citations

Link: https://arxiv.org/abs/2602.18437

Authors: Yixing Peng, Licheng Zhang, Shancheng Fang, Yi Liu, Peijian Gu, Quan Wang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: trustworthy Large Language, Large Language Models, Large Language, trustworthy Large, advanced LLMs

Comments: 9 pages, 4 figures, AAAI 2026

Abstract:Generating with citations is crucial for trustworthy Large Language Models (LLMs), yet even advanced LLMs often produce mismatched or irrelevant citations. Existing methods over-optimize citation fidelity while overlooking relevance to the user query, which degrades answer quality and robustness in real-world settings with noisy or irrelevant retrieved content. Moreover, the prevailing single-pass paradigm struggles to deliver optimal answers in long-form generation that requires multiple citations. To address these limitations, we propose FineRef, a framework based on Fine-grained error Reflection, which explicitly teaches the model to self-identify and correct two key citation errors, mismatch and irrelevance, on a per-citation basis. FineRef follows a two-stage training strategy. The first stage instills an "attempt-reflect-correct" behavioral pattern via supervised fine-tuning, using fine-grained and controllable reflection data constructed by specialized lightweight models. An online self-reflective bootstrapping strategy is designed to improve generalization by iteratively enriching training data with verified, self-improving examples. To further enhance the self-reflection and correction capability, the second stage applies process-level reinforcement learning with a multi-dimensional reward scheme that promotes reflection accuracy, answer quality, and correction gain. Experiments on the ALCE benchmark demonstrate that FineRef significantly improves both citation performance and answer accuracy. Our 7B model outperforms GPT-4 by up to 18% in Citation F1 and 4% in EM Recall, while also surpassing the state-of-the-art model across key evaluation metrics. FineRef also exhibits strong generalization and robustness in domain transfer settings and noisy retrieval scenarios.

Computer Vision

1. 【2602.20161】Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Link: https://arxiv.org/abs/2602.20161

Authors: Abdelrahman Shaker, Ahmed Heakl, Jaseel Muhammad, Ritesh Thawkar, Omkar Thawakar, Senmao Li, Hisham Cholakkal, Ian Reid, Eric P. Xing, Salman Khan, Fahad Shahbaz Khan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generate visual content, single architecture, Unified multimodal, understand and generate, Mobile Conditioning Projector

Comments: Project page: [this https URL](https://amshaker.github.io/Mobile-O/)

Abstract:Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at this https URL

2. 【2602.20160】tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

Link: https://arxiv.org/abs/2602.20160

Authors: Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, Yiwei Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: linear computational complexity, Test-Time Training, propose tttLRM, enable long-context, computational complexity

Comments: Accepted by CVPR 2026. Project Page: [this https URL](https://cwchenwang.github.io/tttLRM)

Abstract:We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model's capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforward 3D Gaussian reconstruction compared to state-of-the-art approaches on both objects and scenes.

3. 【2602.20159】A Very Big Video Reasoning Suite

Link: https://arxiv.org/abs/2602.20159

Authors: Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Jiachen Li, Hanwen Xing, Tianqi Zhao, Fengyuan Yu, Weihang Xiao, Yizheng Jiao, Jianheng Hou, Danyang Zhang, Pengcheng Xu, Boyang Zhong, Zehong Zhao, Gaoyun Fang, John Kitaoka, Yile Xu, Hua Xu, Kenton Blacutt, Tin Nguyen, Siyuan Song, Haoran Sun, Shaoyue Wen, Linyang He, Runming Wang, Yanzhi Wang, Mengyue Yang, Ziqiao Ma, Raphaël Millière, Freda Shi, Nuno Vasconcelos, Daniel Khashabi, Alan Yuille, Yilun Du, Ziming Liu, Bo Li, Dahua Lin, Ziwei Liu, Vikash Kumar, Yijiang Li, Lei Yang, Zhongang Cai, Hokin Deng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)

Keywords: Rapid progress, Video reasoning, reasoning, reasoning capabilities underexplored, video

Comments: Homepage: [this https URL](https://video-reason.com/)

Abstract:Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy and over one million video clips, approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, benchmark toolkit, and models are publicly available at this https URL .

4. 【2602.20157】Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning

Link: https://arxiv.org/abs/2602.20157

Authors: Zhongxiao Cong, Qitao Zhao, Minsik Jeon, Shubham Tulsiani

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reconstruction systems rely, Current feed-forward, reconstruction systems, expensive to obtain, systems rely

Comments: CVPR 2026. Project website: [this https URL](https://flow3r-project.github.io/)

Abstract:Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision, which is expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences ("flow") as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with ~800K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.

5. 【2602.20150】Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization

Link: https://arxiv.org/abs/2602.20150

Authors: Wei-Cheng Huang, Jiaheng Han, Xiaohan Ye, Zherong Pan, Kris Hauser

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: policy learning tasks, Estimating simulation-ready scenes, learning tasks, Estimating simulation-ready, real-world observations

Comments: 15 pages, 13 figures, in submission

Abstract:Estimating simulation-ready scenes from real-world observations is crucial for downstream planning and policy learning tasks. Regretfully, existing methods struggle in cluttered environments, often exhibiting prohibitive computational cost, poor robustness, and restricted generality when scaling to multiple interacting objects. We propose a unified optimization-based formulation for real-to-sim scene estimation that jointly recovers the shapes and poses of multiple rigid objects under physical constraints. Our method is built on two key technical innovations. First, we leverage the recently introduced shape-differentiable contact model, whose global differentiability permits joint optimization over object geometry and pose while modeling inter-object contacts. Second, we exploit the structured sparsity of the augmented Lagrangian Hessian to derive an efficient linear system solver whose computational cost scales favorably with scene complexity. Building on this formulation, we develop an end-to-end real-to-sim scene estimation pipeline that integrates learning-based object initialization, physics-constrained joint shape-pose optimization, and differentiable texture refinement. Experiments on cluttered scenes with up to 5 objects and 22 convex hulls demonstrate that our approach robustly reconstructs physically valid, simulation-ready object shapes and poses.

6. 【2602.20137】Do Large Language Models Understand Data Visualization Rules?

Link: https://arxiv.org/abs/2602.20137

Authors: Martin Sinnona, Valentin Bonas, Emmanuel Iarussi, Viviana Siless

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Data visualization rules-derived, trustworthy chart communication, perception-ensure trustworthy chart, Data visualization, rules-derived from decades

Comments: none

Abstract:Data visualization rules, derived from decades of research in design and perception, ensure trustworthy chart communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they can reason about and enforce visualization rules directly. Constraint-based systems such as Draco encode these rules as logical constraints for precise automated checks, but maintaining symbolic encodings requires expert effort, motivating the use of LLMs as flexible rule validators. In this paper, we present the first systematic evaluation of LLMs against visualization rules using hard-verification ground truth derived from Answer Set Programming (ASP). We translated a subset of Draco's constraints into natural-language statements and generated a controlled dataset of 2,000 Vega-Lite specifications annotated with explicit rule violations. LLMs were evaluated on both accuracy in detecting violations and prompt adherence, which measures whether outputs follow the required structured format. Results show that frontier models achieve high adherence (Gemma 3 4B / 27B: 100%, GPT-oss 20B: 98%) and reliably detect common violations (F1 up to 0.82), yet performance drops for subtler perceptual rules (F1 0.15 for some categories) and for outputs generated from technical ASP this http URL constraints into natural language improved performance by up to 150% for smaller models. These findings demonstrate the potential of LLMs as flexible, language-driven validators while highlighting their current limitations compared to symbolic solvers.

7. 【2602.20119】NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

Link: https://arxiv.org/abs/2602.20119

Authors: Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li, Jianing Qian, Jennifer L. Barry, Kris Kitani, George Konidaris

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrate high-level semantic, high-level semantic reasoning, Solving long-horizon tasks, low-level physical interaction, Solving long-horizon

Comments: 25 pages, 15 figures. Project webpage: [this https URL](https://nova-plan.github.io/)

Abstract:Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: this https URL

8. 【2602.20114】Benchmarking Unlearning for Vision Transformers

Link: https://arxiv.org/abs/2602.20114

Authors: Kairan Zhao, Iurie Luca, Peter Triantafillou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: gained strong momentum, widely regarded, critical capability, capability for building, building safe

Comments: none

Abstract:Research in machine unlearning (MU) has gained strong momentum: MU is now widely regarded as a critical capability for building safe and fair AI. In parallel, research into transformer architectures for computer vision tasks has been highly successful: Increasingly, Vision Transformers (VTs) emerge as strong alternatives to CNNs. Yet, MU research for vision tasks has largely centered on CNNs, not VTs. While benchmarking MU efforts have addressed LLMs, diffusion models, and CNNs, none exist for VTs. This work is the first to attempt this, benchmarking MU algorithm performance in different VT families (ViT and Swin-T) and at different capacities. The work employs (i) different datasets, selected to assess the impacts of dataset scale and complexity; (ii) different MU algorithms, selected to represent fundamentally different approaches for MU; and (iii) both single-shot and continual unlearning protocols. Additionally, it focuses on benchmarking MU algorithms that leverage training data memorization, since leveraging memorization has been recently discovered to significantly improve the performance of previously SOTA algorithms. En route, the work characterizes how VTs memorize training data relative to CNNs, and assesses the impact of different memorization proxies on performance. The benchmark uses unified evaluation metrics that capture two complementary notions of forget quality along with accuracy on unseen (test) data and on retained data. Overall, this work offers a benchmarking basis, enabling reproducible, fair, and comprehensive comparisons of existing (and future) MU algorithms on VTs. And, for the first time, it sheds light on how well existing algorithms work in VT settings, establishing a promising reference performance baseline.

9. 【2602.20100】Transcending the Annotation Bottleneck: AI-Powered Discovery in Biology and Medicine

Link: https://arxiv.org/abs/2602.20100

Authors: Soumick Chatterjee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: primary rate-limiting step, intelligence to biomedicine, dependence on expert, expert annotation, annotation has long

Comments: none

Abstract:The dependence on expert annotation has long constituted the primary rate-limiting step in the application of artificial intelligence to biomedicine. While supervised learning drove the initial wave of clinical algorithms, a paradigm shift towards unsupervised and self-supervised learning (SSL) is currently unlocking the latent potential of biobank-scale datasets. By learning directly from the intrinsic structure of data - whether pixels in a magnetic resonance image (MRI), voxels in a volumetric scan, or tokens in a genomic sequence - these methods facilitate the discovery of novel phenotypes, the linkage of morphology to genetics, and the detection of anomalies without human bias. This article synthesises seminal and recent advances in "learning without labels," highlighting how unsupervised frameworks can derive heritable cardiac traits, predict spatial gene expression in histology, and detect pathologies with performance that rivals or exceeds supervised counterparts.

10. 【2602.20089】StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues

Link: https://arxiv.org/abs/2602.20089

Authors: Zanxi Ruan, Qiuyu Kong, Songqun Gao, Yiming Wang, Marco Cristani

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: early vision research, central today, Edge-based representations, rooted in early, early vision

Comments: Accepted by CVPR 2026

Abstract:Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval in both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at: this https URL.

11. 【2602.20084】Do Large Language Models Understand Data Visualization Principles?

Link: https://arxiv.org/abs/2602.20084

Authors: Martin Sinnona, Valentin Bonas, Viviana Siless, Emmanuel Iarussi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Data visualization principles, ensure proper visual, proper visual communication, Data visualization, ensure proper

Comments: none

Abstract:Data visualization principles, derived from decades of research in design and perception, ensure proper visual communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they and their vision-language counterparts (VLMs) can reason about and enforce visualization principles directly. Constraint-based systems encode these principles as logical rules for precise automated checks, but translating them into formal specifications demands expert knowledge. This motivates leveraging LLMs and VLMs as principle checkers that can reason about visual design directly, bypassing the need for symbolic rule specification. In this paper, we present the first systematic evaluation of both LLMs and VLMs on their ability to reason about visualization principles, using hard verification ground truth derived from Answer Set Programming (ASP). We compiled a set of visualization principles expressed as natural-language statements and generated a controlled dataset of approximately 2,000 Vega-Lite specifications annotated with explicit principle violations, complemented by over 300 real-world Vega-Lite charts. We evaluated both checking and fixing tasks, assessing how well models detect principle violations and correct flawed chart specifications. Our results highlight both the promise of large (vision-)language models as flexible validators and editors of visualization designs and the persistent gap with symbolic solvers on more nuanced aspects of visual perception. They also reveal an interesting asymmetry: frontier models tend to be more effective at correcting violations than at detecting them reliably.

12. 【2602.20079】SemanticNVS: Improving Semantic Scene Understanding in Generative Novel View Synthesis

Link: https://arxiv.org/abs/2602.20079

Authors: Xinya Chen, Christopher Wewer, Jiahao Xie, Xinting Hu, Jan Eric Lenssen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: camera-conditioned multi-view diffusion, multi-view diffusion model, improves generation quality, present SemanticNVS, Existing NVS methods

Comments: none

Abstract:We present SemanticNVS, a camera-conditioned multi-view diffusion model for novel view synthesis (NVS), which improves generation quality and consistency by integrating pre-trained semantic feature extractors. Existing NVS methods perform well for views near the input view; however, they tend to generate semantically implausible and distorted images under long-range camera motion, revealing severe degradation. We speculate that this degradation is due to current models failing to fully understand their conditioning or intermediate generated scene content. Here, we propose to integrate pre-trained semantic feature extractors to incorporate stronger scene semantics as conditioning to achieve high-quality generation even at distant viewpoints. We investigate two different strategies: (1) warped semantic features and (2) an alternating scheme of understanding and generation at each denoising step. Experimental results on multiple datasets demonstrate clear qualitative and quantitative (4.69%-15.26% in FID) improvements over state-of-the-art alternatives.

13. 【2602.20068】The Invisible Gorilla Effect in Out-of-distribution Detection

Link: https://arxiv.org/abs/2602.20068

Authors: Harry Anthony, Ziyun Liang, Hermione Warr, Konstantinos Kamnitsas

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Deep Neural Networks, Neural Networks achieve, Deep Neural, Neural Networks, Networks achieve high

Comments: Accepted at CVPR 2026

Click to view abstract

Abstract:Deep Neural Networks achieve high performance in vision tasks by learning features from regions of interest (ROI) within images, but their performance degrades when deployed on out-of-distribution (OOD) data that differs from training data. This challenge has led to OOD detection methods that aim to identify and reject unreliable predictions. Although prior work shows that OOD detection performance varies by artefact type, the underlying causes remain underexplored. To this end, we identify a previously unreported bias in OOD detection: for hard-to-detect artefacts (near-OOD), detection performance typically improves when the artefact shares visual similarity (e.g. colour) with the model's ROI and drops when it does not - a phenomenon we term the Invisible Gorilla Effect. For example, in a skin lesion classifier with red lesion ROI, we show the method Mahalanobis Score achieves a 31.5% higher AUROC when detecting OOD red ink (similar to ROI) compared to black ink (dissimilar) annotations. We annotated artefacts by colour in 11,355 images from three public datasets (e.g. ISIC) and generated colour-swapped counterfactuals to rule out dataset bias. We then evaluated 40 OOD methods across 7 benchmarks and found significant performance drops for most methods when artefacts differed from the ROI. Our findings highlight an overlooked failure mode in OOD detection and provide guidance for more robust detectors. Code and annotations are available at: this https URL.
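
The AUROC comparison quoted in the abstract can be illustrated with a minimal sketch. The scores below are hypothetical values invented for illustration, not data from the paper, and the pairwise formula is the standard rank-based definition of AUROC rather than the authors' evaluation code.

```python
def auroc(in_dist_scores, ood_scores):
    """Probability that a random OOD sample receives a higher OOD score
    than a random in-distribution sample (ties count as 0.5)."""
    wins = 0.0
    for o in ood_scores:
        for i in in_dist_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(ood_scores) * len(in_dist_scores))


# Hypothetical OOD scores (higher = more likely OOD); not from the paper.
in_dist = [0.10, 0.20, 0.30, 0.40]
ood_similar = [0.35, 0.60, 0.70, 0.90]     # artefact resembles the ROI
ood_dissimilar = [0.05, 0.25, 0.35, 0.50]  # artefact differs from the ROI

auroc_similar = auroc(in_dist, ood_similar)
auroc_dissimilar = auroc(in_dist, ood_dissimilar)
print(auroc_similar, auroc_dissimilar)
```

With these made-up scores the detector separates the ROI-similar artefacts far better than the dissimilar ones, which is the direction of the gap the abstract describes.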

14. 【2602.20066】HeatPrompt: Zero-Shot Vision-Language Modeling of Urban Heat Demand from Satellite Images

Link: https://arxiv.org/abs/2602.20066

Authors: Kundan Thota, Xuanhao Mu, Thorsten Schlachter, Veit Hagenmeyer

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Accurate heat-demand maps, decarbonizing space heating, heat-demand maps play, municipalities lack detailed, lack detailed building-level

Comments: none

Click to view abstract

Abstract:Accurate heat-demand maps play a crucial role in decarbonizing space heating, yet most municipalities lack the detailed building-level data needed to calculate them. We introduce HeatPrompt, a zero-shot vision-language energy modeling framework that estimates annual heat demand using semantic features extracted from satellite images, basic Geographic Information System (GIS), and building-level features. We feed pretrained Large Vision Language Models (VLMs) with a domain-specific prompt to act as an energy planner and extract visual attributes that correspond to the thermal load, such as roof age and building density, from the RGB satellite image. A Multi-Layer Perceptron (MLP) regressor trained on these captions shows an $R^2$ uplift of 93.7% and shrinks the mean absolute error (MAE) by 30% compared to the baseline model. Qualitative analysis shows that high-impact tokens align with high-demand zones, offering lightweight support for heat planning in data-scarce regions.
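
For readers unfamiliar with the two metrics reported above, here is a small sketch of how $R^2$ and MAE are computed and how a relative change between a baseline and an improved model could be expressed. The demand values and predictions are hypothetical, and the abstract does not say whether its $R^2$ uplift is relative or absolute; the sketch assumes a relative change.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_change(new, old):
    """Relative change of `new` with respect to `old` (assumed convention)."""
    return (new - old) / abs(old)


# Hypothetical annual heat demand values and predictions; not from the paper.
demand = [100.0, 200.0, 300.0, 400.0]
baseline_pred = [150.0, 150.0, 350.0, 350.0]
improved_pred = [110.0, 190.0, 310.0, 390.0]

r2_uplift = relative_change(r2_score(demand, improved_pred),
                            r2_score(demand, baseline_pred))
mae_change = relative_change(mae(demand, improved_pred),
                             mae(demand, baseline_pred))
print(r2_uplift, mae_change)
```

A negative `mae_change` means the error shrank; the abstract's "shrinks the MAE by 30%" would correspond to a value of about -0.30 under this convention.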

15. 【2602.20060】MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Autonomous Driving

Link: https://arxiv.org/abs/2602.20060

Authors: Junli Wang, Xueyi Liu, Yinan Zheng, Zebing Xing, Pengfei Li, Guang Li, Kun Ma, Guang Chen, Hangjun Ye, Zhongpu Xia, Long Chen, Qichao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: shown great potential, shown great, great potential, Generative models, Gaussian Mixture Noise

Comments: none

Click to view abstract

Abstract:Generative models have shown great potential in trajectory planning. Recent studies demonstrate that anchor-guided generative models are effective in modeling the uncertainty of driving behaviors and improving overall performance. However, these methods rely on discrete anchor vocabularies that must sufficiently cover the trajectory distribution during testing to ensure robustness, inducing an inherent trade-off between vocabulary size and model performance. To overcome this limitation, we propose MeanFuser, an end-to-end autonomous driving method that enhances both efficiency and robustness through three key designs. (1) We introduce Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space and eliminating the dependency on discrete anchor vocabularies. (2) We adapt "MeanFlow Identity" to end-to-end planning, which models the mean velocity field between GMN and trajectory distribution instead of the instantaneous velocity field used in vanilla flow matching methods, effectively eliminating numerical errors from ODE solvers and significantly accelerating inference. (3) We design a lightweight Adaptive Reconstruction Module (ARM) that enables the model to implicitly select from all sampled proposals or reconstruct a new trajectory when none is satisfactory via attention weights. Experiments on the NAVSIM closed-loop benchmark demonstrate that MeanFuser achieves outstanding performance without the supervision of the PDM Score and exceptional inference efficiency, offering a robust and efficient solution for end-to-end autonomous driving. Our code and model are available at this https URL.
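
The claim in point (2), that modeling a mean velocity field avoids ODE-solver error, can be illustrated on a toy problem. The sketch below is not MeanFuser's implementation: it uses a hypothetical one-dimensional velocity field dz/dt = z whose average velocity is known in closed form, whereas in practice a network would have to learn that average velocity.

```python
import math


def instantaneous_velocity(z, t):
    # Toy field dz/dt = z, chosen only because it has a closed-form solution.
    return z


def mean_velocity(z_t, r, t):
    # Average velocity along the toy trajectory between times r and t.
    # Since z_r = z_t * exp(-(t - r)), we get u = (z_t - z_r) / (t - r).
    return z_t * (1.0 - math.exp(-(t - r))) / (t - r)


def euler_sample(z1, steps):
    """Integrate the instantaneous field from t=1 (noise) back to t=0."""
    z, t, dt = z1, 1.0, 1.0 / steps
    for _ in range(steps):
        z -= dt * instantaneous_velocity(z, t)
        t -= dt
    return z


def one_step_sample(z1):
    """Single jump from t=1 to t=0 using the mean velocity."""
    return z1 - (1.0 - 0.0) * mean_velocity(z1, 0.0, 1.0)


z1 = 2.0                        # hypothetical noise sample
exact = z1 * math.exp(-1.0)     # true value at t=0 for the toy field
euler_error = abs(euler_sample(z1, 4) - exact)
one_step_error = abs(one_step_sample(z1) - exact)
print(euler_error, one_step_error)
```

On this toy field the four-step Euler solver carries a visible discretization error, while the single mean-velocity step lands on the exact value up to floating-point rounding.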

16. 【2602.20055】To Move or Not to Move: Constraint-based Planning Enables Zero-Shot Generalization for Interactive Navigation

Link: https://arxiv.org/abs/2602.20055

Authors: Apoorva Vashisth (1), Manav Kulshrestha (1), Pranav Bakshi (2), Damon Conover (3), Guillaume Sartoretti (4), Aniket Bera (1) ((1) Purdue University, (2) IIT Kharagpur, (3) DEVCOM Army Research Lab, (4) National University of Singapore)

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual navigation typically, navigation typically assumes, Visual navigation, Lifelong Interactive Navigation, start and goal

Comments: none

Click to view abstract

Abstract:Visual navigation typically assumes the existence of at least one obstacle-free path between start and goal, which must be discovered/planned by the robot. However, in real-world scenarios, such as home environments and warehouses, clutter can block all routes. Targeted at such cases, we introduce the Lifelong Interactive Navigation problem, where a mobile robot with manipulation abilities can move clutter to forge its own path to complete sequential object-placement tasks, each involving placing a given object (e.g., alarm clock, pillow) onto a target object (e.g., dining table, desk, bed). To address this lifelong setting, where environment changes accumulate and have long-term effects, we propose an LLM-driven, constraint-based planning framework with active perception. Our framework allows the LLM to reason over a structured scene graph of discovered objects and obstacles, deciding which object to move, where to place it, and where to look next to discover task-relevant information. This coupling of reasoning and active perception allows the agent to explore the regions expected to contribute to task completion rather than exhaustively mapping the environment. A standard motion planner then executes the corresponding navigate-pick-place or detour sequence, ensuring reliable low-level control. Evaluated in the physics-enabled ProcTHOR-10k simulator, our approach outperforms non-learning and learning-based baselines. We further demonstrate our approach qualitatively on real-world hardware.

17. 【2602.20053】Decoupling Defense Strategies for Robust Image Watermarking

Link: https://arxiv.org/abs/2602.20053

Authors: Jiahui Chen, Zehang Deng, Zeyu Zhang, Chaoyang Li, Lianchen Jia, Lifeng Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep learning-based image, Deep learning-based, learning-based image watermarking, remains vulnerable, Deep

Comments: CVPR 2026

Click to view abstract

Abstract:Deep learning-based image watermarking, while robust against conventional distortions, remains vulnerable to advanced adversarial and regeneration attacks. Conventional countermeasures, which jointly optimize the encoder and decoder via a noise layer, face two inevitable challenges: (1) a decrease in clean accuracy due to decoder adversarial training and (2) limited robustness due to simultaneous training on all three advanced attacks. To overcome these issues, we propose AdvMark, a novel two-stage fine-tuning framework that decouples the defense strategies. In stage 1, we address adversarial vulnerability via a tailored adversarial training paradigm that primarily fine-tunes the encoder while only conditionally updating the decoder. This approach learns to move the image into a non-attackable region, rather than modifying the decision boundary, thus preserving clean accuracy. In stage 2, we tackle distortion and regeneration attacks via direct image optimization. To preserve the adversarial robustness gained in stage 1, we formulate a principled, constrained image loss with theoretical guarantees, which balances the deviation from cover and previous encoded images. We also propose a quality-aware early-stop to further guarantee the lower bound of visual quality. Extensive experiments demonstrate that AdvMark outperforms existing methods, achieving the highest image quality and comprehensive robustness, with up to 29%, 33%, and 46% accuracy improvements for distortion, regeneration, and adversarial attacks, respectively.

18. 【2602.20051】SEAL-pose: Enhancing 3D Human Pose Estimation via a Learned Loss for Structural Consistency

Link: https://arxiv.org/abs/2602.20051

Authors: Yeonsung Kim, Junggeun Do, Seunguk Do, Sangmin Kim, Jaesik Park, Jay-Yoon Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: human pose estimation, characterized by intricate, intricate local, local and global, human pose

Comments: 17 pages

Click to view abstract

Abstract:3D human pose estimation (HPE) is characterized by intricate local and global dependencies among joints. Conventional supervised losses are limited in capturing these correlations because they treat each joint independently. Previous studies have attempted to promote structural consistency through manually designed priors or rule-based constraints; however, these approaches typically require manual specification and are often non-differentiable, limiting their use as end-to-end training objectives. We propose SEAL-pose, a data-driven framework in which a learnable loss-net trains a pose-net by evaluating structural plausibility. Rather than relying on hand-crafted priors, our joint-graph-based design enables the loss-net to learn complex structural dependencies directly from data. Extensive experiments on three 3D HPE benchmarks with eight backbones show that SEAL-pose reduces per-joint errors and improves pose plausibility compared with the corresponding backbones across all settings. Beyond improving each backbone, SEAL-pose also outperforms models with explicit structural constraints, despite not enforcing any such constraints. Finally, we analyze the relationship between the loss-net and structural consistency, and evaluate SEAL-pose in cross-dataset and in-the-wild settings.

19. 【2602.20046】Closing the gap in multimodal medical representation alignment

Link: https://arxiv.org/abs/2602.20046

Authors: Eleonora Grassucci, Giordano Cicchetti, Danilo Comminiello

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: CLIP has emerged, shared latent space, bringing semantically similar, similar representations closer, de-facto approach

Comments: Accepted at MLSP 2025

Click to view abstract

Abstract:In multimodal learning, CLIP has emerged as the de-facto approach for mapping different modalities into a shared latent space by bringing semantically similar representations closer while pushing apart dissimilar ones. However, CLIP-based contrastive losses exhibit unintended behaviors that negatively impact true semantic alignment, leading to sparse and fragmented latent spaces. This phenomenon, known as the modality gap, has been partially mitigated for standard text and image pairs but remains unknown and unresolved in more complex multimodal settings, such as the medical domain. In this work, we study this phenomenon in the latter case, revealing that the modality gap is present also in medical alignment, and we propose a modality-agnostic framework that closes this gap, ensuring that semantically related representations are more aligned, regardless of their source modality. Our method enhances alignment between radiology images and clinical text, improving cross-modal retrieval and image captioning.

20. 【2602.20041】EEG-Driven Intention Decoding: Offline Deep Learning Benchmarking on a Robotic Rover

Link: https://arxiv.org/abs/2602.20041

Authors: Ghadah Alosaimi, Maha Alsayyari, Yixin Sun, Stamos Katsigiannis, Amir Atapour-Abarghouei, Toby P. Breckon

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: navigation remains challenging, Brain-computer interfaces, hands-free control modality, real-world navigation remains, provide a hands-free

Comments: none

Click to view abstract

Abstract:Brain-computer interfaces (BCIs) provide a hands-free control modality for mobile robotics, yet decoding user intent during real-world navigation remains challenging. This work presents a brain-robot control framework for offline decoding of driving commands during robotic rover operation. A 4WD Rover Pro platform was remotely operated by 12 participants who navigated a predefined route using a joystick, executing the commands forward, reverse, left, right, and stop. Electroencephalogram (EEG) signals were recorded with a 16-channel OpenBCI cap and aligned with motor actions at $\Delta = 0$ ms and future prediction horizons ($\Delta > 0$ ms). After preprocessing, several deep learning models were benchmarked, including convolutional neural networks, recurrent neural networks, and Transformer architectures. ShallowConvNet achieved the highest performance for both action prediction and intent prediction. By combining real-world robotic control with multi-horizon EEG intention decoding, this study introduces a reproducible benchmark and reveals key design insights for predictive deep learning-based BCI systems.

21. 【2602.20028】Descriptor: Dataset of Parasitoid Wasps and Associated Hymenoptera (DAPWH)

Link: https://arxiv.org/abs/2602.20028

Authors: Joao Manoel Herrera Pinheiro, Gabriela Do Nascimento Herrera, Luciana Bueno Dos Reis Fernandes, Alvaro Doria Dos Santos, Ricardo V. Godoy, Eduardo A. B. Almeida, Helena Carolina Onody, Marcelo Andrade Da Costa Vieira, Angelica Maria Penteado-Dias, Marcelo Becker

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: hyper-diverse superfamily Ichneumonoidea, Accurate taxonomic identification, Accurate taxonomic, superfamily Ichneumonoidea, Ichneumonidae and Braconidae

Comments: none

Click to view abstract

Abstract:Accurate taxonomic identification is the cornerstone of biodiversity monitoring and agricultural management, particularly for the hyper-diverse superfamily Ichneumonoidea. Comprising the families Ichneumonidae and Braconidae, these parasitoid wasps are ecologically critical for regulating insect populations, yet they remain one of the most taxonomically challenging groups due to their cryptic morphology and vast number of undescribed species. To address the scarcity of robust digital resources for these key groups, we present a curated image dataset designed to advance automated identification systems. The dataset contains 3,556 high-resolution images, primarily focused on Neotropical Ichneumonidae and Braconidae, while also including supplementary families such as Andrenidae, Apidae, Bethylidae, Chrysididae, Colletidae, Halictidae, Megachilidae, Pompilidae, and Vespidae to improve model robustness. Crucially, a subset of 1,739 images is annotated in COCO format, featuring multi-class bounding boxes for the full insect body, wing venation, and scale bars. This resource provides a foundation for developing computer vision models capable of identifying these families.

22. 【2602.20008】Token-UNet: A New Case for Transformers Integration in Efficient and Interpretable 3D UNets for Brain Imaging Segmentation

Link: https://arxiv.org/abs/2602.20008

Authors: Louis Fabrice Tshimanga, Andrea Zanola, Federico Del Pup, Manfredo Atzori

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Token-UNet, TokenLearner, TokenFuser, 3D segmentation models, constrained computational environments

Comments: none

Click to view abstract

Abstract:We present Token-UNet, adopting the TokenLearner and TokenFuser modules to encase Transformers into UNets. While Transformers have enabled global interactions among input elements in medical imaging, current computational challenges hinder their deployment on common hardware. Models like (Swin)UNETR adapt the UNet architecture by incorporating (Swin)Transformer encoders, which process tokens that each represent small subvolumes ($8^3$ voxels) of the input. The Transformer attention mechanism scales quadratically with the number of tokens, which is tied to the cubic scaling of 3D input resolution. This work reconsiders the role of convolution and attention, introducing Token-UNets, a family of 3D segmentation models that can operate in constrained computational environments and time frames. To mitigate computational demands, our approach maintains the convolutional encoder of UNet-like models, and applies TokenLearner to 3D feature maps. This module pools a preset number of tokens from local and global structures. Our results show this tokenization effectively encodes task-relevant information, yielding naturally interpretable attention maps. The memory footprint, computation times at inference, and parameter counts of our heaviest model are reduced to 33%, 10%, and 35% of the SwinUNETR values, with better average performance (Dice score of 86.75% $\pm$ 0.19% for SwinUNETR vs. 87.21% $\pm$ 0.35% for ours). This work opens the way to more efficient training in contexts with limited computational resources, such as 3D medical imaging. Easing model optimization, fine-tuning, and transfer-learning in limited hardware settings can accelerate and diversify the development of approaches, for the benefit of the research community.
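
The scaling argument in the abstract (tokens grow cubically with 3D resolution, attention grows quadratically with tokens) can be made concrete with a little arithmetic. The cubic input sizes below are hypothetical examples; only the $8^3$-voxel subvolume size comes from the abstract.

```python
def token_count(side_voxels, patch_side=8):
    """Number of tokens when a cubic volume is split into patch_side^3 subvolumes."""
    per_axis = side_voxels // patch_side
    return per_axis ** 3


def attention_pairs(num_tokens):
    """Pairwise interactions in full self-attention (quadratic in tokens)."""
    return num_tokens ** 2


# Hypothetical cubic input sizes, in voxels per side.
for side in (64, 128, 256):
    n = token_count(side)
    print(f"side={side}: tokens={n}, attention pairs={attention_pairs(n)}")
```

Doubling the side length multiplies the token count by 8 and the number of attention pairs by 64, which is why pooling a fixed, preset number of tokens is attractive on limited hardware.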

23. 【2602.19994】RADE-Net: Robust Attention Network for Radar-Only Object Detection in Adverse Weather

Link: https://arxiv.org/abs/2602.19994

Authors: Christof Leitgeb, Thomas Puchleitner, Max Peter Ronecker, Daniel Watzenig

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automotive perception systems, meet high requirements, Automotive perception, high requirements, systems are obligated

Comments: Accepted to 2026 IEEE Intelligent Vehicles Symposium (IV)

Click to view abstract

Abstract:Automotive perception systems are obligated to meet high requirements. While optical sensors such as Camera and Lidar struggle in adverse weather conditions, Radar provides a more robust perception performance, effectively penetrating fog, rain, and snow. Since full Radar tensors have large data sizes and very few datasets provide them, most Radar-based approaches work with sparse point clouds or 2D projections, which can result in information loss. In addition, deep learning methods show potential to extract richer and denser features from low-level Radar data and thus significantly increase perception performance. We therefore propose a 3D projection method for fast-Fourier-transformed 4D Range-Azimuth-Doppler-Elevation (RADE) tensors. Our method preserves rich Doppler and Elevation features while reducing the required data size for a single frame by 91.9% compared to a full tensor, thus achieving higher training and inference speed as well as lower model complexity. We introduce RADE-Net, a lightweight model tailored to 3D projections of the RADE tensor. The backbone enables exploitation of low-level and high-level cues of Radar tensors with spatial and channel-attention. The decoupled detection heads predict object center-points directly in the Range-Azimuth domain and regress rotated 3D bounding boxes from rich feature maps in the Cartesian scene. We evaluate the model on scenes with multiple different road users and under various weather conditions on the large-scale K-Radar dataset and achieve a 16.7% improvement compared to its baseline, as well as a 6.5% improvement over current Radar-only models. Additionally, we outperform several Lidar approaches in scenarios with adverse weather conditions. The code is available under this https URL.

24. 【2602.19974】RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection

Link: https://arxiv.org/abs/2602.19974

Authors: Tianyu Wang, Zhiyuan Ma, Qian Wang, Xinyi Zhang, Xinwei Long, Bowen Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advancements, producing high-quality images, image generation, achieved impressive results, Reflection-based Image Generation

Comments: none

Click to view abstract

Abstract:Recent advancements in image generation have achieved impressive results in producing high-quality images. However, existing image generation models still generally struggle with a spatial reasoning dilemma, lacking the ability to accurately capture fine-grained spatial relationships from the prompt and correctly generate scenes with structural integrity. To mitigate this dilemma, we propose RL-RIG, a Reinforcement Learning framework for Reflection-based Image Generation. Our architecture comprises four primary components: Diffuser, Checker, Actor, and Inverse Diffuser, following a Generate-Reflect-Edit paradigm to spark the Chain of Thought reasoning ability in image generation for addressing the dilemma. To equip the model with better intuition over generation trajectories, we further develop Reflection-GRPO to train the VLM Actor for edit prompts and the Image Editor for better image quality under a given prompt, respectively. Unlike traditional approaches that solely produce visually stunning yet structurally unreasonable content, our evaluation metrics prioritize spatial accuracy, utilizing Scene Graph IoU and employing a VLM-as-a-Judge strategy to assess the spatial consistency of generated images on the LAION-SG dataset. Experimental results show that RL-RIG outperforms existing state-of-the-art open-source models by up to 11% in terms of controllable and precise spatial reasoning in image generation.
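
The abstract names Scene Graph IoU without defining it. One plausible reading, shown here as an assumption rather than the paper's definition, treats each scene graph as a set of (subject, relation, object) triples and computes intersection over union between the graph extracted from the generated image and the one implied by the prompt. The triples below are made up for illustration.

```python
def scene_graph_iou(predicted, reference):
    """IoU between two scene graphs represented as sets of
    (subject, relation, object) triples. Assumed definition."""
    pred, ref = set(predicted), set(reference)
    union = pred | ref
    if not union:
        return 1.0  # two empty graphs are treated as identical
    return len(pred & ref) / len(union)


# Hypothetical triples; not taken from LAION-SG.
reference = [
    ("cup", "on", "table"),
    ("cat", "under", "table"),
    ("lamp", "left of", "sofa"),
]
predicted = [
    ("cup", "on", "table"),
    ("cat", "on", "table"),       # wrong spatial relation
    ("lamp", "left of", "sofa"),
]

iou = scene_graph_iou(predicted, reference)
print(iou)
```

A single wrong relation removes one triple from the intersection and adds one to the union, so the score penalizes spatial mistakes even when every object is present.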

25. 【2602.19946】When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators

Link: https://arxiv.org/abs/2602.19946

Authors: Krzysztof Adamkiewicz, Brian Moser, Stanislav Frolov, Tobias Christian Nauen, Federico Raue, Andreas Dengel

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: produce visually stunning, visually stunning images, diffusion models produce, demonstrate excellent prompt, models produce visually

Comments: none

Click to view abstract

Abstract:Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following. But do they perform well as synthetic vision data generators? In this work, we revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression. We generate large-scale synthetic datasets using state-of-the-art T2I models released between 2022 and 2025, train standard classifiers solely on this synthetic data, and evaluate them on real test data. Despite observable advances in visual fidelity and prompt adherence, classification accuracy on real test data consistently declines with newer T2I models as training data generators. Our analysis reveals a hidden trend: These models collapse to a narrow, aesthetic-centric distribution that undermines diversity and label-image alignment. Overall, our findings challenge a growing assumption in vision research, namely that progress in generative realism implies progress in data realism. We thus highlight an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.

26. 【2602.19944】Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation

Link: https://arxiv.org/abs/2602.19944

Authors: Yilong Yang, Jianxin Tian, Shengchuan Zhang, Liujuan Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Current zero-shot Camouflaged, zero-shot Camouflaged Object, Current zero-shot, camouflaged object discovery, obtain visual prompts

Comments: Accepted by CVPR 2026 (main conference)

Click to view abstract

Abstract:Current zero-shot Camouflaged Object Segmentation methods typically employ a two-stage pipeline (discover-then-segment): using MLLMs to obtain visual prompts, followed by SAM segmentation. However, relying solely on MLLMs for camouflaged object discovery often leads to inaccurate localization, false positives, and missed detections. To address these issues, we propose the Discover-Segment-Select (DSS) mechanism, a progressive framework designed to refine segmentation step by step. The proposed method contains a Feature-coherent Object Discovery (FOD) module that leverages visual features to generate diverse object proposals, a segmentation module that refines these proposals through SAM segmentation, and a Semantic-driven Mask Selection (SMS) module that employs MLLMs to evaluate and select the optimal segmentation mask from multiple candidates. Without requiring any training or supervision, DSS achieves state-of-the-art performance on multiple COS benchmarks, especially in multiple-instance scenes.

27. 【2602.19937】Learning Positive-Incentive Point Sampling in Neural Implicit Fields for Object Pose Estimation

Link: https://arxiv.org/abs/2602.19937

Authors: Yifei Shi, Boyan Wan, Xin Xu, Kai Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: neural implicit fields, rapidly emerging field, implicit fields, neural implicit, Learning neural implicit

Comments: none

Click to view abstract

Abstract:Learning neural implicit fields of 3D shapes is a rapidly emerging field that enables shape representation at arbitrary resolutions. Due to their flexibility, neural implicit fields have succeeded in many research areas, including shape reconstruction, novel view image synthesis, and more recently, object pose estimation. Neural implicit fields enable learning dense correspondences between the camera space and the object's canonical space, including unobserved regions in camera space, significantly boosting object pose estimation performance in challenging scenarios like highly occluded objects and novel shapes. Despite progress, predicting canonical coordinates for unobserved camera-space regions remains challenging due to the lack of direct observational signals. This necessitates heavy reliance on the model's generalization ability, resulting in high uncertainty. Consequently, densely sampling points across the entire camera space may yield inaccurate estimations that hinder the learning process and compromise performance. To alleviate this problem, we propose a method combining an SO(3)-equivariant convolutional implicit network and a positive-incentive point sampling (PIPS) strategy. The SO(3)-equivariant convolutional implicit network estimates point-level attributes with SO(3)-equivariance at arbitrary query locations, demonstrating superior performance compared to most existing baselines. The PIPS strategy dynamically determines sampling locations based on the input, thereby boosting the network's accuracy and training efficiency. Our method outperforms the state-of-the-art on three pose estimation datasets. Notably, it demonstrates significant improvements in challenging scenarios, such as objects captured with unseen pose, high occlusion, novel geometry, and severe noise.

28. 【2602.19931】Expanding the Role of Diffusion Models for Robust Classifier Training

Link: https://arxiv.org/abs/2602.19931

Authors: Pin-Han Huang, Shang-Tse Chen, Hsuan-Tien Lin

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: robust image classifiers, synthetic data, diffusion-generated synthetic data, shown to substantially, robust classifier training

Comments: none

Click to view abstract

Abstract:Incorporating diffusion-generated synthetic data into adversarial training (AT) has been shown to substantially improve the training of robust image classifiers. In this work, we extend the role of diffusion models beyond merely generating synthetic data, examining whether their internal representations, which encode meaningful features of the data, can provide additional benefits for robust classifier training. Through systematic experiments, we show that diffusion models offer representations that are both diverse and partially robust, and that explicitly incorporating diffusion representations as an auxiliary learning signal during AT consistently improves robustness across settings. Furthermore, our representation analysis indicates that incorporating diffusion models into AT encourages more disentangled features, while diffusion representations and diffusion-generated synthetic data play complementary roles in shaping representations. Experiments on CIFAR-10, CIFAR-100, and ImageNet validate these findings, demonstrating the effectiveness of jointly leveraging diffusion representations and synthetic data within AT.

29. 【2602.19916】Augmented Radiance Field: A General Framework for Enhanced Gaussian Splatting

Link: https://arxiv.org/abs/2602.19916

Authors: Yixin Yang, Bojian Wu, Yang Zhou, Hui Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Gaussian Splatting, radiance field reconstruction, Gaussian, Due, Splatting

Comments: Accepted to ICLR 2026. Project page: https://xiaoxinyyx.github.io/augs

Abstract:Due to the real-time rendering performance, 3D Gaussian Splatting (3DGS) has emerged as the leading method for radiance field reconstruction. However, its reliance on spherical harmonics for color encoding inherently limits its ability to separate diffuse and specular components, making it challenging to accurately represent complex reflections. To address this, we propose a novel enhanced Gaussian kernel that explicitly models specular effects through view-dependent opacity. Meanwhile, we introduce an error-driven compensation strategy to improve rendering quality in existing 3DGS scenes. Our method begins with 2D Gaussian initialization and then adaptively inserts and optimizes enhanced Gaussian kernels, ultimately producing an augmented radiance field. Experiments demonstrate that our method not only surpasses state-of-the-art NeRF methods in rendering performance but also achieves greater parameter efficiency. Project page at: this https URL.

30. 【2602.19910】Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery

Link: https://arxiv.org/abs/2602.19910

Authors: Wei He, Xianghan Meng, Zhiyuan Huang, Xianbiao Qi, Rong Xiao, Chun-Guang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generalized Category Discovery, Generalized Category, Category Discovery, open-set recognition problem, challenging open-set recognition

Comments: 15 pages, accepted by CVPR 2026

Abstract:Generalized Category Discovery (GCD) aims to identify both known and unknown categories, with only partial labels given for the known categories, posing a challenging open-set recognition problem. State-of-the-art approaches for the GCD task are usually built on multi-modality representation learning, which is heavily dependent upon inter-modality alignment. However, few of them impose a proper intra-modality alignment to generate a desired underlying structure of representation distributions. In this paper, we propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction, called SSR$^2$-GCD, to learn cross-modality representations with desired structural properties by emphasizing proper alignment of intra-modality relationships. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modal alignment offered by Vision Language Models. We conduct extensive experiments on generic and fine-grained benchmark datasets demonstrating superior performance of our approach.

31. 【2602.19907】Gradient based Severity Labeling for Biomarker Classification in OCT

Link: https://arxiv.org/abs/2602.19907

Authors: Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib, Stephanie Trejo Corona, Charles Wykoff

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: contrastive learning, selection strategy, supervised contrastive learning, contrastive learning setup, contrastive

Comments: Accepted at International Conference on Image Processing (ICIP) 2022

Abstract:In this paper, we propose a novel selection strategy for contrastive learning for medical images. On natural images, contrastive learning uses augmentations to select positive and negative pairs for the contrastive loss. However, in the medical domain, arbitrary augmentations have the potential to distort small localized regions that contain the biomarkers we are interested in detecting. A more intuitive approach is to select samples with similar disease severity characteristics, since these samples are more likely to have similar structures related to the progression of a disease. To enable this, we introduce a method that generates disease severity labels for unlabeled OCT scans on the basis of gradient responses from an anomaly detection algorithm. These labels are used to train a supervised contrastive learning setup to improve biomarker classification accuracy by as much as 6% above self-supervised baselines for key indicators of Diabetic Retinopathy.

32. 【2602.19900】ExpPortrait: Expressive Portrait Generation via Personalized Representation

Link: https://arxiv.org/abs/2602.19900

Authors: Junyi Wang, Yudong Guo, Boyang Guo, Shengming Yang, Juyong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: shown great potential, controllable cinematic portrait, cinematic portrait videos, portrait videos remains, significant challenge

Comments: Accepted to CVPR 2026

Abstract:While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized details due to their sparse or low-rank representation. Therefore, existing methods based on these models struggle to accurately preserve subject identity and expressions, hindering the generation of highly expressive portrait videos. To overcome these limitations, we propose a high-fidelity personalized head representation that more effectively disentangles expression and identity. This representation captures both static, subject-specific global geometry and dynamic, expression-related details. Furthermore, we introduce an expression transfer module to achieve personalized transfer of head pose and expression details between different identities. We use this sophisticated and highly expressive head model as a conditional signal to train a diffusion transformer (DiT)-based generator to synthesize richly detailed portrait videos. Extensive experiments on self- and cross-reenactment tasks demonstrate that our method outperforms previous models in terms of identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.

33. 【2602.19896】Monocular Mesh Recovery and Body Measurement of Female Saanen Goats

Link: https://arxiv.org/abs/2602.19896

Authors: Bo Jin, Shichao Zhao, Jin Lyu, Bin Zhang, Tao Yu, Liang An, Yebin Liu, Meili Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high milk yield, milk production potential, lack goat-specific authentic, assessing milk production, Saanen dairy goats

Comments: Accepted to AAAI 2026

Abstract:The lactation performance of Saanen dairy goats, renowned for their high milk yield, is intrinsically linked to their body size, making accurate 3D body measurement essential for assessing milk production potential, yet existing reconstruction methods lack goat-specific authentic 3D data. To address this limitation, we establish the FemaleSaanenGoat dataset containing synchronized eight-view RGBD videos of 55 female Saanen goats (6-18 months). Using multi-view DynamicFusion, we fuse noisy, non-rigid point cloud sequences into high-fidelity 3D scans, overcoming challenges from irregular surfaces and rapid movement. Based on these scans, we develop SaanenGoat, a parametric 3D shape model specifically designed for female Saanen goats. This model features a refined template with 41 skeletal joints and enhanced udder representation, registered with our scan data. A comprehensive shape space constructed from 48 goats enables precise representation of diverse individual variations. With the help of the SaanenGoat model, we obtain high-precision 3D reconstruction from single-view RGBD input and achieve automated measurement of six critical body dimensions: body length, height, chest width, chest girth, hip width, and hip height. Experimental results demonstrate the superior accuracy of our method in both 3D reconstruction and body measurement, presenting a novel paradigm for large-scale 3D vision applications in precision livestock farming.

34. 【2602.19881】Make Some Noise: Unsupervised Remote Sensing Change Detection Using Latent Space Perturbations

Link: https://arxiv.org/abs/2602.19881

Authors: Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Unsupervised change detection, remote sensing aims, localise semantic changes, Unsupervised change, remote sensing

Comments: none

Abstract:Unsupervised change detection (UCD) in remote sensing aims to localise semantic changes between two images of the same region without relying on labelled data during training. Most recent approaches rely either on frozen foundation models in a training-free manner or on training with synthetic changes generated in pixel space. Both strategies inherently rely on predefined assumptions about change types, typically introduced through handcrafted rules, external datasets, or auxiliary generative models. Due to these assumptions, such methods fail to generalise beyond a few change types, limiting their real-world usage, especially in rare or complex scenarios. To address this, we propose MaSoN (Make Some Noise), an end-to-end UCD framework that synthesises diverse changes directly in the latent feature space during training. It generates changes that are dynamically estimated using feature statistics of target data, enabling diverse yet data-driven variation aligned with the target domain. It also easily extends to new modalities, such as SAR. MaSoN generalises strongly across diverse change types and achieves state-of-the-art performance on five benchmarks, improving the average F1 score by 14.1 percentage points. Project page: this https URL

35. 【2602.19874】BigMaQ: A Big Macaque Motion and Animation Dataset Bridging Image and 3D Pose Representations

Link: https://arxiv.org/abs/2602.19874

Authors: Lucas Martini, Alexander Lappe, Anna Bognár, Rufin Vogels, Martin A. Giese

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medicine and neuroscience, advancing ethology, fundamental for advancing, pose, animal tracking methods

Comments: none

Abstract:The recognition of dynamic and social behavior in animals is fundamental for advancing ethology, ecology, medicine and neuroscience. Recent progress in deep learning has enabled automated behavior recognition from video, yet an accurate reconstruction of the three-dimensional (3D) pose and shape has not been integrated into this process. Especially for non-human primates, mesh-based tracking efforts lag behind those for other species, leaving pose descriptions restricted to sparse keypoints that are unable to fully capture the richness of action dynamics. To address this gap, we introduce the Big MacaQue 3D Motion and Animation Dataset (BigMaQ), a large-scale dataset comprising more than 750 scenes of interacting rhesus macaques with detailed 3D pose descriptions. Extending previous surface-based animal tracking methods, we construct subject-specific textured avatars by adapting a high-quality macaque template mesh to individual monkeys. This allows us to provide pose descriptions that are more accurate than previous state-of-the-art surface-based animal tracking methods. From the original dataset, we derive BigMaQ500, an action recognition benchmark that links surface-based pose vectors to single frames across multiple individual monkeys. By pairing features extracted from established image and video encoders with and without our pose descriptors, we demonstrate substantial improvements in mean average precision (mAP) when pose information is included. With these contributions, BigMaQ establishes the first dataset that both integrates dynamic 3D pose-shape representations into the learning task of animal action recognition and provides a rich resource to advance the study of visual appearance, posture, and social interaction in non-human primates. The code and data are publicly available at this https URL.

36. 【2602.19872】GOAL: Geometrically Optimal Alignment for Continual Generalized Category Discovery

Link: https://arxiv.org/abs/2602.19872

Authors: Jizhou Han, Chenhao Ding, SongLin Dong, Yuhang He, Shaokun Wang, Qiang Wang, Yihong Gong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Continual Generalized Category, Generalized Category Discovery, Generalized Category, Equiangular Tight Frame, requires identifying

Comments: Accepted by AAAI 2026

Abstract:Continual Generalized Category Discovery (C-GCD) requires identifying novel classes from unlabeled data while retaining knowledge of known classes over time. Existing methods typically update classifier weights dynamically, resulting in forgetting and inconsistent feature alignment. We propose GOAL, a unified framework that introduces a fixed Equiangular Tight Frame (ETF) classifier to impose a consistent geometric structure throughout learning. GOAL conducts supervised alignment for labeled samples and confidence-guided alignment for novel samples, enabling stable integration of new classes without disrupting old ones. Experiments on four benchmarks show that GOAL outperforms the prior method Happy, reducing forgetting by 16.1% and boosting novel class discovery by 3.2%, establishing a strong solution for long-horizon continual discovery.
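
To make the idea of a fixed classifier concrete, here is a minimal Python sketch of a simplex Equiangular Tight Frame, a construction commonly used for fixed ETF classifiers. The abstract does not spell out GOAL's exact construction (feature dimension, rotation, or scaling), so the function name `simplex_etf` and the choice of K prototypes in K dimensions are assumptions made for illustration only.

```python
import math


def simplex_etf(num_classes):
    """Return K unit-norm prototypes in R^K with equal pairwise angles.

    Row i is sqrt(K / (K - 1)) * (e_i - 1/K * ones), the standard
    simplex ETF. This is an illustrative construction, not necessarily
    the one used in the paper.
    """
    k = num_classes
    if k < 2:
        raise ValueError("a simplex ETF needs at least two classes")
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    prototypes = simplex_etf(4)
    # Every prototype has unit norm; every distinct pair shares one cosine.
    print([round(dot(p, p), 6) for p in prototypes])
    print(round(dot(prototypes[0], prototypes[1]), 6))
```

Each prototype has unit norm and every distinct pair has cosine similarity -1/(K-1), so the class directions are maximally and equally separated. Keeping such prototypes frozen is one way to obtain a geometric target that does not drift as learning proceeds.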

37. 【2602.19870】ApET: Approximation-Error Guided Token Compression for Efficient VLMs

Link: https://arxiv.org/abs/2602.19870

Authors: Qiankun Ma, Ziyao Zhang, Haofei Wang, Jie Chen, Zhen Song, Hairong Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent Vision-Language Models, multimodal understanding capabilities, demonstrated remarkable multimodal, remarkable multimodal understanding, incur prohibitive computational

Comments: CVPR 2026

Abstract:Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet the redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically rely on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introducing positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error guided Token compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budgets by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical. Code is available at this https URL.

38. 【2602.19863】Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation

Link: https://arxiv.org/abs/2602.19863

Authors: Filip Wolf, Blaž Rolih, Luka Čehovin Zajc

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: transforming Earth Observation, Earth Observation, transforming Earth, universal model unrealistic, single universal model

Comments: Accepted to CVPR 2026

Abstract:Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student's pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources. Project page: this https URL.

39. 【2602.19857】Contrastive meta-domain adaptation for robust skin lesion classification across clinical and acquisition conditions

Link: https://arxiv.org/abs/2602.19857

Authors: Rodrigo Mota, Kelvin Cunha, Emanoel dos Santos, Fábio Papais, Francisco Filho, Thales Bezerra, Erico Medeiros, Paulo Borba, Tsang Ing Ren

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: analysis remain sensitive, Deep learning models, domain-specific visual characteristics, dermatological image analysis, image analysis remain

Comments: 4 pages, 5 figures, 1 table, ISBI 2026

Abstract:Deep learning models for dermatological image analysis remain sensitive to acquisition variability and domain-specific visual characteristics, leading to performance degradation when deployed in clinical settings. We investigate how visual artifacts and domain shifts affect deep learning-based skin lesion classification. We propose an adaptation strategy, grounded in the idea of visual meta-domains, that transfers visual representations from larger dermoscopic datasets into clinical image domains, thereby improving generalization robustness. Experiments across multiple dermatology datasets show consistent gains in classification performance and reduced gaps between dermoscopic and clinical images. These results emphasize the importance of domain-aware training for deployable systems.

40. 【2602.19848】DerMAE: Improving skin lesion classification through conditioned latent diffusion and MAE distillation

Link: https://arxiv.org/abs/2602.19848

Authors: Francisco Filho, Kelvin Cunha, Fábio Papais, Emanoel dos Santos, Rodrigo Mota, Thales Bezerra, Erico Medeiros, Paulo Borba, Tsang Ing Ren

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: severe class imbalance, cases significantly underrepresented, deep learning training, Skin lesion classification, malignant cases significantly

Comments: 4 pages, 2 figures, 1 table, ISBI 2026

Abstract:Skin lesion classification datasets often suffer from severe class imbalance, with malignant cases significantly underrepresented, leading to biased decision boundaries during deep learning training. We address this challenge using class-conditioned diffusion models to generate synthetic dermatological images, followed by self-supervised MAE pretraining to enable huge ViT models to learn robust, domain-relevant features. To support deployment in practical clinical settings, where lightweight models are required, we apply knowledge distillation to transfer these representations to a smaller ViT student suitable for mobile devices. Our results show that MAE pretraining on synthetic data, combined with distillation, improves classification performance while enabling efficient on-device inference for practical clinical use.

41. 【2602.19832】M3S-Net: Multimodal Feature Fusion Network Based on Multi-scale Data for Ultra-short-term PV Power Forecasting

Link: https://arxiv.org/abs/2602.19832

Authors: Penghui Niu, Taotao Cai, Suqi Zhang, Junhua Gu, Ping Zhang, Qiqi Liu, Jianxin Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: present significant stability, high-penetration photovoltaic grids, significant stability challenges, rapid cloud advection, solar irradiance

Comments: none

Abstract:The inherent intermittency and high-frequency variability of solar irradiance, particularly during rapid cloud advection, present significant stability challenges to high-penetration photovoltaic grids. Although multimodal forecasting has emerged as a viable mitigation strategy, existing architectures predominantly rely on shallow feature concatenation and binary cloud segmentation, thereby failing to capture the fine-grained optical features of clouds and the complex spatiotemporal coupling between visual and meteorological modalities. To bridge this gap, this paper proposes M3S-Net, a novel multimodal feature fusion network based on multi-scale data for ultra-short-term PV power forecasting. First, a multi-scale partial channel selection network leverages partial convolutions to explicitly isolate the boundary features of optically thin clouds, effectively transcending the precision limitations of coarse-grained binary masking. Second, a multi-scale sequence to image analysis network employs Fast Fourier Transform (FFT)-based time-frequency representation to disentangle the complex periodicity of meteorological data across varying time horizons. Crucially, the model incorporates a cross-modal Mamba interaction module featuring a novel dynamic C-matrix swapping mechanism. By exchanging state-space parameters between visual and temporal streams, this design conditions the state evolution of one modality on the context of the other, enabling deep structural coupling with linear computational complexity, thus overcoming the limitations of shallow concatenation. Experimental validation on the newly constructed fine-grained PV power dataset demonstrates that M3S-Net achieves a mean absolute error reduction of 6.2% in 10-minute forecasts compared to state-of-the-art baselines. The dataset and source code will be available at this https URL.

42. 【2602.19828】TextShield-R1: Reinforced Reasoning for Tampered Text Detection

Link: https://arxiv.org/abs/2602.19828

Authors: Chenfan Qu, Yiwu Zhong, Jian Liu, Xuekang Zhu, Bohan Yu, Lianwen Jin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: tampered text detection, security threats, highlighting the urgent, tampered text, tampered images poses

Comments: AAAI 2026

Abstract:The growing prevalence of tampered images poses serious security threats, highlighting the urgent need for reliable detection methods. Multimodal large language models (MLLMs) demonstrate strong potential in analyzing tampered images and generating interpretations. However, they still struggle with identifying micro-level artifacts, exhibit low accuracy in localizing tampered text regions, and heavily rely on expensive annotations for forgery interpretation. To this end, we introduce TextShield-R1, the first reinforcement learning based MLLM solution for tampered text detection and reasoning. Specifically, our approach introduces Forensic Continual Pre-training, an easy-to-hard curriculum that well prepares the MLLM for tampered text detection by harnessing the large-scale cheap data from natural image forensic and OCR tasks. During fine-tuning, we perform Group Relative Policy Optimization with novel reward functions to reduce annotation dependency and improve reasoning capabilities. At inference time, we enhance localization accuracy via OCR Rectification, a method that leverages the MLLM's strong text recognition abilities to refine its predictions. Furthermore, to support rigorous evaluation, we introduce the Text Forensics Reasoning (TFR) benchmark, comprising over 45k real and tampered images across 16 languages, 10 tampering techniques, and diverse domains. Rich reasoning-style annotations are included, allowing for comprehensive assessment. Our TFR benchmark simultaneously addresses seven major limitations of existing benchmarks and enables robust evaluation under cross-style, cross-method, and cross-language conditions. Extensive experiments demonstrate that TextShield-R1 significantly advances the state of the art in interpretable tampered text detection.
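
The group-relative part of Group Relative Policy Optimization can be illustrated with a short Python sketch: several responses are sampled for the same input, and each response's reward is normalized against the mean and standard deviation of its group. The function name, the example rewards, and the use of the population standard deviation are assumptions; the abstract does not describe TextShield-R1's reward functions or training details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for responses sampled from one prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled responses to one image.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above the group average receive a positive advantage and those below receive a negative one, so no separately trained value model is needed to provide the baseline.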

43. 【2602.19823】Open-vocabulary 3D scene perception in industrial environments

Link: https://arxiv.org/abs/2602.19823

Authors: Keno Moenck, Adrian Philip Florea, Julian Koch, Thorsten Schüppstuhl

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autonomous vision applications, manufacturing environments require, Autonomous vision, environments require perception, require perception capabilities

Comments: none

Abstract:Autonomous vision applications in production, intralogistics, or manufacturing environments require perception capabilities beyond a small, fixed set of classes. Recent open-vocabulary methods, leveraging 2D Vision-Language Foundation Models (VLFMs), target this task but often rely on class-agnostic segmentation models pre-trained on non-industrial datasets (e.g., household scenes). In this work, we first demonstrate that such models fail to generalize, performing poorly on common industrial objects. Therefore, we propose a training-free, open-vocabulary 3D perception pipeline that overcomes this limitation. Instead of using a pre-trained model to generate instance proposals, our method simply generates masks by merging pre-computed superpoints based on their semantic features. We then evaluate the domain-adapted VLFM "IndustrialCLIP" on a representative 3D industrial workshop scene for open-vocabulary querying. Our qualitative results demonstrate successful segmentation of industrial objects.

44. 【2602.19822】Efficient endometrial carcinoma screening via cross-modal synthesis and gradient distillation

Link: https://arxiv.org/abs/2602.19822

Authors: Dongjing Shan, Yamei Luo, Jiqing Xuan, Lu Huang, Jin Li, Mengchu Yang, Zeyu Chen, Fajin Lv, Yong Tang, Chunxiang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: prevalent global malignancy, Early detection, primary care settings, resource-constrained primary care, endometrial carcinoma

Comments: none

Abstract:Early detection of myometrial invasion is critical for the staging and life-saving management of endometrial carcinoma (EC), a prevalent global malignancy. Transvaginal ultrasound serves as the primary, accessible screening modality in resource-constrained primary care settings; however, its diagnostic reliability is severely hindered by low tissue contrast, high operator dependence, and a pronounced scarcity of positive pathological samples. Existing artificial intelligence solutions struggle to overcome this severe class imbalance and the subtle imaging features of invasion, particularly under the strict computational limits of primary care clinics. Here we present an automated, highly efficient two-stage deep learning framework that resolves both data and computational bottlenecks in EC screening. To mitigate pathological data scarcity, we develop a structure-guided cross-modal generation network that synthesizes diverse, high-fidelity ultrasound images from unpaired magnetic resonance imaging (MRI) data, strictly preserving clinically essential anatomical junctions. Furthermore, we introduce a lightweight screening network utilizing gradient distillation, which transfers discriminative knowledge from a high-capacity teacher model to dynamically guide sparse attention towards task-critical regions. Evaluated on a large, multicenter cohort of 7,951 participants, our model achieves a sensitivity of 99.5%, a specificity of 97.2%, and an area under the curve of 0.987 at a minimal computational cost (0.289 GFLOPs), substantially outperforming the average diagnostic accuracy of expert sonographers. Our approach demonstrates that combining cross-modal synthetic augmentation with knowledge-driven efficient modeling can democratize expert-level, real-time cancer screening for resource-constrained primary care settings.

45. 【2602.19768】TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding

Link: https://arxiv.org/abs/2602.19768

Authors: Fan Yang, Shurong Zheng, Hongyin Zhao, Yufei Zhan, Xin Li, Yousong Zhu, Chaoyang Zhao, Ming Tang, Jinqiao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent Large Vision-Language, Recent Large, Large Vision-Language Models, natural language generation, demonstrate remarkable capabilities

Comments: none

Abstract:Recent Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in image understanding and natural language generation. However, current approaches focus predominantly on global image understanding, struggling to simulate human visual attention trajectories and explain associations between descriptions and specific regions. We propose TraceVision, a unified vision-language model integrating trajectory-aware spatial understanding in an end-to-end framework. TraceVision employs a Trajectory-aware Visual Perception (TVP) module for bidirectional fusion of visual features and trajectory information. We design geometric simplification to extract semantic keypoints from raw trajectories and propose a three-stage training pipeline where trajectories guide description generation and region localization. We extend TraceVision to trajectory-guided segmentation and video scene understanding, enabling cross-frame tracking and temporal attention analysis. We construct the Reasoning-based Interactive Localized Narratives (RILN) dataset to enhance logical reasoning and interpretability. Extensive experiments on trajectory-guided captioning, text-guided trajectory prediction, understanding, and segmentation demonstrate that TraceVision achieves state-of-the-art performance, establishing a foundation for intuitive spatial interaction and interpretable visual understanding.

46. 【2602.19766】One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image

Link: https://arxiv.org/abs/2602.19766

Authors: Pengfei Wang, Liyi Chen, Zhiyuan Ma, Yanjun Guo, Guowen Zhang, Lei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: highly challenging problem, Generating explorable, highly challenging, Generating, challenging problem

Comments: ICLR 2026

Abstract:Generating explorable 3D scenes from a single image is a highly challenging problem in 3D vision. Existing methods struggle to support free exploration, often producing severe geometric distortions and noisy artifacts when the viewpoint moves far from the original perspective. We introduce One2Scene, an effective framework that decomposes this ill-posed problem into three tractable sub-tasks to enable immersive explorable scene generation. We first use a panorama generator to produce anchor views from a single input image as initialization. Then, we lift these 2D anchors into an explicit 3D geometric scaffold via a generalizable, feed-forward Gaussian Splatting network. Instead of treating the panorama as a single image for reconstruction, we project it into multiple sparse anchor views and reformulate the reconstruction task as multi-view stereo matching, which allows us to leverage robust geometric priors learned from large-scale multi-view datasets. A bidirectional feature fusion module is used to enforce cross-view consistency, yielding an efficient and geometrically reliable scaffold. Finally, the scaffold serves as a strong prior for a novel view generator to produce photorealistic and geometrically accurate views at arbitrary cameras. By explicitly conditioning on a 3D-consistent scaffold to perform reconstruction, One2Scene works stably under large camera motions, supporting immersive scene exploration. Extensive experiments show that One2Scene substantially outperforms state-of-the-art methods in panorama depth estimation, feed-forward 360° reconstruction, and explorable 3D scene generation. Code and models will be released.

47. 【2602.19763】Training Deep Stereo Matching Networks on Tree Branch Imagery: A Benchmark Study for Real-Time UAV Forestry Applications

Link: https://arxiv.org/abs/2602.19763

Authors: Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Autonomous drone-based tree, real-time depth estimation, Autonomous drone-based, drone-based tree pruning, pruning needs accurate

Abstract:Autonomous drone-based tree pruning needs accurate, real-time depth estimation from stereo cameras. Depth is computed from disparity maps using $Z = f B/d$, so even small disparity errors cause noticeable depth mistakes at working distances. Building on our earlier work that identified DEFOM-Stereo as the best reference disparity generator for vegetation scenes, we present the first study to train and test ten deep stereo matching networks on real tree branch images. We use the Canterbury Tree Branches dataset -- 5,313 stereo pairs from a ZED Mini camera at 1080P and 720P -- with DEFOM-generated disparity maps as training targets. The ten methods cover step-by-step refinement, 3D convolution, edge-aware attention, and lightweight designs. Using perceptual metrics (SSIM, LPIPS, ViTScore) and structural metrics (SIFT/ORB feature matching), we find that BANet-3D produces the best overall quality (SSIM = 0.883, LPIPS = 0.157), while RAFT-Stereo scores highest on scene-level understanding (ViTScore = 0.799). Testing on an NVIDIA Jetson Orin Super (16 GB, independently powered) mounted on our drone shows that AnyNet reaches 6.99 FPS at 1080P -- the only near-real-time option -- while BANet-2D gives the best quality-speed balance at 1.21 FPS. We also compare 720P and 1080P processing times to guide resolution choices for forestry drone systems.
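
To make the $Z = f B/d$ relation in the abstract above concrete, here is a minimal Python sketch of how a fixed disparity error turns into a depth error that grows with distance. The focal length and baseline values are illustrative assumptions, not the ZED Mini's calibration and not figures from the paper, and the function names are hypothetical.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Pinhole stereo depth Z = f * B / d, returned in metres."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(focal_px: float, baseline_m: float,
                disparity_px: float, disparity_err_px: float) -> float:
    """Change in depth when the disparity is off by disparity_err_px pixels."""
    true_z = depth_from_disparity(focal_px, baseline_m, disparity_px)
    noisy_z = depth_from_disparity(focal_px, baseline_m, disparity_px + disparity_err_px)
    return noisy_z - true_z


# Illustrative values only (assumptions, not taken from the paper).
FOCAL_PX = 1000.0   # focal length in pixels
BASELINE_M = 0.06   # stereo baseline in metres

for d in (60.0, 30.0, 15.0):
    z = depth_from_disparity(FOCAL_PX, BASELINE_M, d)
    err = depth_error(FOCAL_PX, BASELINE_M, d, 1.0)
    print(f"d={d:5.1f}px  Z={z:.2f}m  depth change for +1px: {err * 100:.1f}cm")
```

Under these assumed values, a one-pixel disparity error shifts the depth by under 2 cm at 1 m but by 25 cm at 4 m, consistent with the first-order rule $|\Delta Z| \approx Z^2 |\Delta d| / (f B)$.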

48. 【2602.19756】Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis

Link: https://arxiv.org/abs/2602.19756

Authors: Junhyeok Choi, Sangwoo Mo, Minwoo Chae

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse vision-language tasks, achieved remarkable success, Recent advances, Dataset distillation, vision-language tasks

Abstract:Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient. Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets. Dataset distillation offers a promising alternative, yet existing multimodal dataset distillation methods require full-dataset training and joint optimization of image pixels and text features, making them architecture-dependent and limiting cross-architecture generalization. To overcome this, we propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures. Our method uses CLIP to extract aligned image-text embeddings, obtains prototypes, and employs an unCLIP decoder to synthesize images, enabling efficient and scalable multimodal dataset distillation. Extensive experiments demonstrate that our approach consistently outperforms optimization-based dataset distillation and subset selection methods, achieving state-of-the-art cross-architecture generalization.

49. 【2602.19753】RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing

Link: https://arxiv.org/abs/2602.19753

Authors: Kaifa Yang, Qi Yang, Yiling Xu, Zhu Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Gaussian Splatting, technology for high-quality, leading technology, Splatting, reconstruction

Comments: Accepted by CVPR 2026

Abstract:3D Gaussian Splatting (3DGS) has emerged as a leading technology for high-quality 3D scene reconstruction. However, the iterative refinement and densification process leads to the generation of a large number of primitives, each contributing to the reconstruction to a substantially different extent. Estimating primitive importance is thus crucial, both for removing redundancy during reconstruction and for enabling efficient compression and transmission. Existing methods typically rely on rendering-based analyses, where each primitive is evaluated through its contribution across multiple camera viewpoints. However, such methods are sensitive to the number and selection of views, rely on specialized differentiable rasterizers, and have long calculation times that grow linearly with view count, making them difficult to integrate as plug-and-play modules and limiting scalability and generalization. To address these issues, we propose RAP, a fast feedforward rendering-free attribute-guided method for efficient importance score prediction in 3DGS. RAP infers primitive significance directly from intrinsic Gaussian attributes and local neighborhood statistics, avoiding rendering-based or visibility-dependent computations. A compact MLP predicts per-primitive importance scores using rendering loss, pruning-aware loss, and significance distribution regularization. After training on a small set of scenes, RAP generalizes effectively to unseen data and can be seamlessly integrated into reconstruction, compression, and transmission pipelines. Our code is publicly available at this https URL.

50. 【2602.19736】InfScene-SR: Spatially Continuous Inference for Arbitrary-Size Image Super-Resolution

Link: https://arxiv.org/abs/2602.19736

Authors: Shoukun Sun, Zhe Wang, Xiang Que, Jiyin Zhang, Xiaogang Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generative Adversarial Networks, Denoising Diffusion Probabilistic, Adversarial Networks, recently shown superior, shown superior performance

Abstract:Image Super-Resolution (SR) aims to recover high-resolution (HR) details from low-resolution (LR) inputs, a task where Denoising Diffusion Probabilistic Models (DDPMs) have recently shown superior performance compared to Generative Adversarial Networks (GANs) based approaches. However, standard diffusion-based SR models, such as SR3, are typically trained on fixed-size patches and struggle to scale to arbitrary-sized images due to memory constraints. Applying these models via independent patch processing leads to visible seams and inconsistent textures across boundaries. In this paper, we propose InfScene-SR, a framework enabling spatially continuous super-resolution for large, arbitrary scenes. We adapt the iterative refinement process of diffusion models with a novel guided and variance-corrected fusion mechanism, allowing for the seamless generation of large-scale high-resolution imagery without retraining. We validate our approach on remote sensing datasets, demonstrating that InfScene-SR not only reconstructs fine details with high perceptual quality but also eliminates boundary artifacts, benefiting downstream tasks such as semantic segmentation.

51. 【2602.19735】VGGT-MPR: VGGT-Enhanced Multimodal Place Recognition in Autonomous Driving Environments

Link: https://arxiv.org/abs/2602.19735

Authors: Jingyi Xu, Zhangshuo Qi, Zhongmiao Yan, Xuyu Gao, Qianyun Jiao, Songpengcheng Xia, Xieyuanli Chen, Ling Pei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: loop closure detection, robust place recognition, multimodal place recognition, place recognition, Geometry Grounded Transformer

Abstract:In autonomous driving, robust place recognition is critical for global localization and loop closure detection. While inter-modality fusion of camera and LiDAR data in multimodal place recognition (MPR) has shown promise in overcoming the limitations of unimodal counterparts, existing MPR methods basically attend to hand-crafted fusion strategies and heavily parameterized backbones that require costly retraining. To address this, we propose VGGT-MPR, a multimodal place recognition framework that adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking. In the global retrieval stage, VGGT extracts geometrically-rich visual embeddings through prior depth-aware and point map supervision, and densifies sparse LiDAR point clouds with predicted depth maps to improve structural representation. This enhances the discriminative ability of fused multimodal features and produces global descriptors for fast retrieval. Beyond global retrieval, we design a training-free re-ranking mechanism that exploits VGGT's cross-view keypoint-tracking capability. By combining mask-guided keypoint extraction with confidence-aware correspondence scoring, our proposed re-ranking mechanism effectively refines retrieval results without additional parameter optimization. Extensive experiments on large-scale autonomous driving benchmarks and our self-collected data demonstrate that VGGT-MPR achieves state-of-the-art performance, exhibiting strong robustness to severe environmental changes, viewpoint shifts, and occlusions. Our code and data will be made publicly available.

52. 【2602.19723】Towards Personalized Multi-Modal MRI Synthesis across Heterogeneous Datasets

Link: https://arxiv.org/abs/2602.19723

Authors: Yue Zhang, Zhizheng Zhuo, Siyao Xu, Shan Lv, Zhaoxi Liu, Jun Qiu, Qiuli Wang, Yaou Liu, S. Kevin Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: magnetic resonance imaging, ensuring diagnostic completeness, multi-modal magnetic resonance, Synthesizing missing modalities, motion artifacts

Comments: 19 pages, 4 figures

Abstract:Synthesizing missing modalities in multi-modal magnetic resonance imaging (MRI) is vital for ensuring diagnostic completeness, particularly when full acquisitions are infeasible due to time constraints, motion artifacts, and patient tolerance. Recent unified synthesis models have enabled flexible synthesis tasks by accommodating various input-output configurations. However, their training and evaluation are typically restricted to a single dataset, limiting their generalizability across diverse clinical datasets and impeding practical deployment. To address this limitation, we propose PMM-Synth, a personalized MRI synthesis framework that not only supports various synthesis tasks but also generalizes effectively across heterogeneous datasets. PMM-Synth is jointly trained on multiple multi-modal MRI datasets that differ in modality coverage, disease types, and intensity distributions. It achieves cross-dataset generalization through three core innovations: a Personalized Feature Modulation module that dynamically adapts feature representations based on dataset identifier to mitigate the impact of distributional shifts; a Modality-Consistent Batch Scheduler that facilitates stable and efficient batch training under inconsistent modality conditions; and a selective supervision loss to ensure effective learning when ground truth modalities are partially missing. Evaluated on four clinical multi-modal MRI datasets, PMM-Synth consistently outperforms state-of-the-art methods in both one-to-one and many-to-one synthesis tasks, achieving superior PSNR and SSIM scores. Qualitative results further demonstrate improved preservation of anatomical structures and pathological details. Additionally, downstream tumor segmentation and radiological reporting studies suggest that PMM-Synth holds potential for supporting reliable diagnosis under real-world modality-missing scenarios.

53. 【2602.19719】Generative 6D Pose Estimation via Conditional Flow Matching

Link: https://arxiv.org/abs/2602.19719

Authors: Amir Hamza, Davide Boscaini, Weihang Li, Benjamin Busam, Fabio Poiesi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: estimation typically rely, Existing methods, typically rely, rely on neural, neural networks

Comments: Project website: [this https URL](https://tev-fbk.github.io/Flose/)

Abstract: Existing methods for instance-level 6D pose estimation typically rely on neural networks that either directly regress the pose in $\mathrm{SE}(3)$ or estimate it indirectly via local feature matching. The former struggle with object symmetries, while the latter fail in the absence of distinctive local features. To overcome these limitations, we propose a novel formulation of 6D pose estimation as a conditional flow matching problem in $\mathbb{R}^3$. We introduce Flose, a generative method that infers object poses via a denoising process conditioned on local features. While prior approaches based on conditional flow matching perform denoising solely based on geometric guidance, Flose integrates appearance-based semantic features to mitigate ambiguities caused by object symmetries. We further incorporate RANSAC-based registration to handle outliers. We validate Flose on five datasets from the established BOP benchmark. Flose outperforms prior methods with an average improvement of +4.5 Average Recall. Project website: https://tev-fbk.github.io/Flose/

54. 【2602.19715】Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision

Link: https://arxiv.org/abs/2602.19715

Authors: Kartik Kuckreja, Parul Gupta, Muhammad Haris Khan, Abhinav Dhall

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generate natural-language explanations, limiting reliability, natural-language explanations, reasoning, generate natural-language

Comments: CVPR 2026. Code is available here: [this https URL](https://github.com/KjAeRsTuIsK/DeepfakeJudge)

Abstract: Deepfake detection models often generate natural-language explanations, yet their reasoning is frequently ungrounded in visual evidence, limiting reliability. Existing evaluations measure classification accuracy but overlook reasoning fidelity. We propose DeepfakeJudge, a framework for scalable reasoning supervision and evaluation that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models that specialize in evaluating reasoning rationales without the need for explicit ground truth reasoning rationales. The Judge is optimized through a bootstrapped generator-evaluator process that scales human feedback into structured reasoning supervision and supports both pointwise and pairwise evaluation. On the proposed meta-evaluation benchmark, our reasoning-bootstrapped model achieves an accuracy of 96.2%, outperforming 30x larger baselines. The reasoning judge attains very high correlation with human ratings and 98.9% pairwise agreement on the human-annotated meta-evaluation subset. These results establish reasoning fidelity as a quantifiable dimension of deepfake detection and demonstrate scalable supervision for interpretable deepfake reasoning. Our user study shows that participants preferred the reasonings generated by our framework 70% of the time, in terms of faithfulness, groundedness, and usefulness, compared to those produced by other models and datasets. All of our datasets, models, and codebase are open-sourced.

55. 【2602.19710】Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

Link: https://arxiv.org/abs/2602.19710

Authors: Haitao Lin, Hanyang Yu, Jingshun Huang, He Zhang, Yonggen Ling, Ping Tan, Xiangyang Xue, Yanwei Fu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: entangle high-level perception, Visual Question Answering, embodiment-specific action supervision, perception with sparse, suffer from feature

Abstract:Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.

56. 【2602.19708】ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets

Link: https://arxiv.org/abs/2602.19708

Authors: Hoyoung Kim, Minwoo Jang, Jabin Koo, Sangdoo Yun, Jungseul Ok

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: specialized domains including, general recognition tasks, domains including privacy-constrained, including privacy-constrained medical, privacy-constrained medical applications

Abstract: Beyond general recognition tasks, specialized domains including privacy-constrained medical applications and fine-grained settings often encounter data scarcity, especially for tail classes. To obtain less biased and more reliable models under such scarcity, practitioners leverage diffusion models to supplement underrepresented regions of real data. Specifically, recent studies fine-tune pretrained diffusion models with LoRA on few-shot real sets to synthesize additional images. While an image-wise LoRA trained on a single image captures fine-grained details yet offers limited diversity, a class-wise LoRA trained over all shots produces diverse images as it encodes class priors yet tends to overlook fine details. To combine both benefits, we separate the adapter into a class-shared LoRA $A$ for class priors and per-image LoRAs $\mathcal{B}$ for image-specific characteristics. To expose coherent class semantics in the shared LoRA $A$, we propose a semantic boosting by preserving class bounding boxes during training. For generation, we compose $A$ with a mixture of $\mathcal{B}$ using coefficients drawn from a Dirichlet distribution. Across diverse datasets, our synthesized images are both diverse and detail-rich while closely aligning with the few-shot real distribution, yielding robust gains in downstream classification accuracy.
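
The generation step described in the abstract above (a shared $A$ composed with a Dirichlet-weighted mixture of per-image matrices from $\mathcal{B}$) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the standard LoRA factorisation $\Delta W = BA$, uses random toy matrices, and a flat Dirichlet prior; the abstract does not give shapes, ranks, or the concentration parameter, and all names here are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalising Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def mix_matrices(matrices, weights):
    """Weighted sum of equally shaped matrices stored as lists of rows."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(cols)]
        for i in range(rows)
    ]


def matmul(left, right):
    """Plain matrix product of two lists-of-rows matrices."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*right)]
        for row in left
    ]


rng = random.Random(0)
d_out, rank, d_in, n_images = 4, 2, 3, 3  # toy sizes, chosen for illustration

# Hypothetical weights: one shared A (rank x d_in) and one B per few-shot image (d_out x rank).
shared_a = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(rank)]
per_image_b = [
    [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(d_out)]
    for _ in range(n_images)
]

weights = sample_dirichlet([1.0] * n_images, rng)  # mixing coefficients, sum to 1
mixed_b = mix_matrices(per_image_b, weights)       # d_out x rank
delta_w = matmul(mixed_b, shared_a)                # d_out x d_in low-rank update
```

Because the mixture is linear, `delta_w` equals the same weighted sum of the individual per-image updates, so each sampled weight vector interpolates between image-specific adapters while the class-shared factor stays fixed.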

57. 【2602.19706】HDR Reconstruction Boosting with Training-Free and Exposure-Consistent Diffusion

Link: https://arxiv.org/abs/2602.19706

Authors: Yo-Tin Lin, Su-Kai Chen, Hou-Ning Hu, Yen-Yu Lin, Yu-Lun Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complete information loss, Single LDR, HDR reconstruction remains, HDR reconstruction, information loss

Comments: WACV 2026. Project page: [this https URL](https://github.com/EusdenLin/HDR-Reconstruction-Boosting)

Abstract: Single LDR to HDR reconstruction remains challenging for over-exposed regions where traditional methods often fail due to complete information loss. We present a training-free approach that enhances existing indirect and direct HDR reconstruction methods through diffusion-based inpainting. Our method combines text-guided diffusion models with SDEdit refinement to generate plausible content in over-exposed areas while maintaining consistency across multi-exposure LDR images. Unlike previous approaches requiring extensive training, our method seamlessly integrates with existing HDR reconstruction techniques through an iterative compensation mechanism that ensures luminance coherence across multiple exposures. We demonstrate significant improvements in both perceptual quality and quantitative metrics on standard HDR datasets and in-the-wild captures. Results show that our method effectively recovers natural details in challenging scenarios while preserving the advantages of existing HDR reconstruction pipelines. Project page: https://github.com/EusdenLin/HDR-Reconstruction-Boosting

58. 【2602.19698】Iconographic Classification and Content-Based Recommendation for Digitized Artworks

Link: https://arxiv.org/abs/2602.19698

Authors: Krzysztof Kutt, Maciej Baczyński

Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: selected artificial intelligence, automates iconographic classification, artificial intelligence methods, system that automates, automates iconographic

Comments: 14 pages, 7 figures; submitted to ICCS 2026 conference

Abstract:We present a proof-of-concept system that automates iconographic classification and content-based recommendation of digitized artworks using the Iconclass vocabulary and selected artificial intelligence methods. The prototype implements a four-stage workflow for classification and recommendation, which integrates YOLOv8 object detection with algorithmic mappings to Iconclass codes, rule-based inference for abstract meanings, and three complementary recommenders (hierarchical proximity, IDF-weighted overlap, and Jaccard similarity). Although more engineering is still needed, the evaluation demonstrates the potential of this solution: Iconclass-aware computer vision and recommendation methods can accelerate cataloging and enhance navigation in large heritage repositories. The key insight is to let computer vision propose visible elements and to use symbolic structures (Iconclass hierarchy) to reach meaning.
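
Two of the three recommenders named in the abstract above, Jaccard similarity and IDF-weighted overlap, can be illustrated with a small Python sketch over sets of codes. This is not the authors' implementation: the artwork ids and code assignments are made up, the Iconclass-style strings are treated as opaque labels, and the `log(N / df)` weighting is one common IDF variant assumed here for illustration.

```python
import math


def jaccard(codes_a, codes_b):
    """|A & B| / |A | B| for two sets of codes; 0.0 when both are empty."""
    union = codes_a | codes_b
    return len(codes_a & codes_b) / len(union) if union else 0.0


def idf_weights(collection):
    """Inverse document frequency log(N / df) for every code in the collection."""
    n = len(collection)
    df = {}
    for codes in collection.values():
        for code in codes:
            df[code] = df.get(code, 0) + 1
    return {code: math.log(n / count) for code, count in df.items()}


def idf_overlap(codes_a, codes_b, idf):
    """Sum of IDF weights over shared codes, so rarer shared codes count more."""
    return sum(idf[code] for code in codes_a & codes_b)


def recommend(query_id, collection, score):
    """Rank all other artworks by score against the query, best first (ties by id)."""
    query = collection[query_id]
    scored = [
        (score(query, codes), art_id)
        for art_id, codes in collection.items()
        if art_id != query_id
    ]
    return [art_id for _, art_id in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Toy collection; ids and code assignments are hypothetical.
collection = {
    "art1": {"25F23(LION)", "11H(JEROME)", "49M32"},
    "art2": {"25F23(LION)", "11H(JEROME)"},
    "art3": {"25F23(LION)", "31A"},
    "art4": {"31A", "49M32"},
}
idf = idf_weights(collection)
by_jaccard = recommend("art1", collection, jaccard)
by_idf = recommend("art1", collection, lambda a, b: idf_overlap(a, b, idf))
print(by_jaccard, by_idf)
```

In this toy collection, Jaccard scores `art3` and `art4` equally against `art1` (one shared code each), whereas IDF-weighted overlap ranks `art4` higher because its shared code occurs in fewer artworks. That difference is one reason the recommenders can be seen as complementary.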

59. 【2602.19697】BayesFusion-SDF: Probabilistic Signed Distance Fusion with View Planning on CPU

Link: https://arxiv.org/abs/2602.19697

Authors: Soumya Mazumdar, Vineet Kumar Rakesh, Tapas Samanta

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

Keywords: Key part, augmented reality, part of robotics, inspection is dense, digital inspection

Abstract: Dense 3D reconstruction from depth observations is a key part of robotics, augmented reality, and digital inspection. Traditional volumetric fusion techniques, including truncated signed distance functions (TSDF), enable efficient and deterministic geometry reconstruction; however, they depend on heuristic weighting and fail to transparently convey uncertainty in a systematic way. Recent neural implicit methods, on the other hand, get very high fidelity but usually need a lot of GPU power for optimization and are hard to interpret for downstream decision-making. This work presents BayesFusion-SDF, a CPU-centric probabilistic signed distance fusion framework that conceptualizes geometry as a sparse Gaussian random field with a defined posterior distribution over voxel distances. First, a rough TSDF reconstruction is used to create an adaptive narrow-band domain. Then, depth observations are combined using a heteroscedastic Bayesian formulation that is solved using sparse linear algebra and preconditioned conjugate gradients. Randomized diagonal estimators are a quick way to get an idea of posterior uncertainty. This makes it possible to extract surfaces and plan the next best view while taking into account uncertainty. Tests on a controlled ablation scene and a CO3D object sequence show that the new method is more accurate geometrically than TSDF baselines and gives useful estimates of uncertainty for active sensing. The proposed formulation provides a clear and easy-to-use alternative to GPU-heavy neural reconstruction methods while still being able to be understood in a probabilistic way and acting in a predictable way. GitHub: this https URL

60. 【2602.19679】TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures

Link: https://arxiv.org/abs/2602.19679

Authors: Hyeongjin Nam, Daniel Sungho Jung, Kyoung Mu Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: active research area, digital content creation, research area, content creation, Joint reconstruction

Comments: Published at CVPR 2026, 20 pages including the supplementary material

Abstract:Joint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human-object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by local geometric proximity, neglecting the human and object appearances that provide global context crucial for understanding holistic interactions. To address these issues, we introduce TeHOR, a framework built upon two core designs. First, beyond contact information, our framework leverages text descriptions of human-object interactions to enforce semantic alignment between the 3D reconstruction and its textual cues, enabling reasoning over a wider spectrum of interactions, including non-contact cases. Second, we incorporate appearance cues of the 3D human and object into the alignment process to capture holistic contextual information, thereby ensuring visually plausible reconstructions. As a result, our framework produces accurate and semantically coherent reconstructions, achieving state-of-the-art performance.

61. 【2602.19668】Personalized Longitudinal Medical Report Generation via Temporally-Aware Federated Adaptation

Link: https://arxiv.org/abs/2602.19668

Authors: He Zhu, Ren Togo, Takahiro Ogawa, Kenji Hirata, Minghui Tang, Takaaki Yoshimura, Hiroyuki Sugimori, Noriko Nishioka, Yukie Shimizu, Kohsuke Kudo, Miki Haseyama

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: remains challenging due, strict privacy constraints, medical report generation, Longitudinal medical report, disease progression

Abstract: Longitudinal medical report generation is clinically important yet remains challenging due to strict privacy constraints and the evolving nature of disease progression. Although federated learning (FL) enables collaborative training without data sharing, existing FL methods largely overlook longitudinal dynamics by assuming stationary client distributions, making them unable to model temporal shifts across visits or patient-specific heterogeneity, ultimately leading to unstable optimization and suboptimal report generation. We introduce Federated Temporal Adaptation (FTA), a federated setting that explicitly accounts for the temporal evolution of client data. Building upon this setting, we propose FedTAR, a framework that integrates demographic-driven personalization with time-aware global aggregation. FedTAR generates lightweight LoRA adapters from demographic embeddings and performs temporal residual aggregation, where updates from different visits are weighted by a meta-learned temporal policy optimized via first-order MAML. Experiments on J-MID (1M exams) and MIMIC-CXR demonstrate consistent improvements in linguistic accuracy, temporal coherence, and cross-site generalization, establishing FedTAR as a robust and privacy-preserving paradigm for federated longitudinal modeling.

62. 【2602.19631】Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection

Link: https://arxiv.org/abs/2602.19631

Authors: Uichan Lee, Jeonghyeon Kim, Sangheum Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: diffusion models, widespread adoption, rapid and widespread, Recent advances, Abstract

Comments: Accepted at ICLR 2026. The first two authors contributed equally

Abstract:Recent advances in text-to-image (T2I) diffusion models have seen rapid and widespread adoption. However, their powerful generative capabilities raise concerns about potential misuse for synthesizing harmful, private, or copyrighted content. To mitigate such risks, concept erasure techniques have emerged as a promising solution. Prior works have primarily focused on fine-tuning the denoising component (e.g., the U-Net backbone). However, recent causal tracing studies suggest that visual attribute information is localized in the early self-attention layers of the text encoder, indicating a potential alternative for concept erasing. Building on this insight, we conduct preliminary experiments and find that directly fine-tuning early layers can suppress target concepts but often degrades the generation quality of non-target concepts. To overcome this limitation, we propose High-Level Representation Misdirection (HiRM), which misdirects high-level semantic representations of target concepts in the text encoder toward designated vectors such as random directions or semantically defined directions (e.g., supercategories), while updating only early layers that contain causal states of visual attributes. Our decoupling strategy enables precise concept removal with minimal impact on unrelated concepts, as demonstrated by strong results on UnlearnCanvas and NSFW benchmarks across diverse targets (e.g., objects, styles, nudity). HiRM also preserves generative utility at low training cost, transfers to state-of-the-art architectures such as Flux without additional training, and shows synergistic effects with denoiser-based concept erasing methods.

63. 【2602.19624】Accurate Planar Tracking With Robust Re-Detection

Link: https://arxiv.org/abs/2602.19624

Authors: Jonas Serych, Jiri Matas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: combine robust long-term, homography pose estimation, robust long-term segmentation, provided by SAM, long-term segmentation tracking

Abstract:We present SAM-H and WOFTSAM, novel planar trackers that combine robust long-term segmentation tracking provided by SAM 2 with 8 degrees-of-freedom homography pose estimation. SAM-H estimates homographies from segmentation mask contours and is thus highly robust to target appearance changes. WOFTSAM significantly improves the current state-of-the-art planar tracker WOFT by exploiting lost target re-detection provided by SAM-H. The proposed methods are evaluated on POT-210 and PlanarTrack tracking benchmarks, setting the new state-of-the-art performance on both. On the latter, they outperform the second best by a large margin, +12.4 and +15.2pp on the p@15 metric. We also present improved ground-truth annotations of initial PlanarTrack poses, enabling more accurate benchmarking in the high-precision p@5 metric. The code and the re-annotations are available at this https URL

64. 【2602.19623】PedaCo-Gen: Scaffolding Pedagogical Agency in Human-AI Collaborative Video Authoring

Link: https://arxiv.org/abs/2602.19623

Authors: Injun Baek, Yearim Kim, Nojun Kwak

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: democratizing content creation, Mayer Cognitive Theory, content creation, current models, offer a promising

Abstract:While advancements in Text-to-Video (T2V) generative AI offer a promising path toward democratizing content creation, current models are often optimized for visual fidelity rather than instructional efficacy. This study introduces PedaCo-Gen, a pedagogically-informed human-AI collaborative video generating system for authoring instructional videos based on Mayer's Cognitive Theory of Multimedia Learning (CTML). Moving away from traditional "one-shot" generation, PedaCo-Gen introduces an Intermediate Representation (IR) phase, enabling educators to interactively review and refine video blueprints (comprising scripts and visual descriptions) with an AI reviewer. Our study with 23 education experts demonstrates that PedaCo-Gen significantly enhances video quality across various topics and CTML principles compared to baselines. Participants perceived the AI-driven guidance not merely as a set of instructions but as a metacognitive scaffold that augmented their instructional design expertise, reporting high production efficiency (M=4.26) and guide validity (M=4.04). These findings highlight the importance of reclaiming pedagogical agency through principled co-creation, providing a foundation for future AI authoring tools that harmonize generative power with human professional expertise.

65. 【2602.19615】Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness

Link: https://arxiv.org/abs/2602.19615

Authors: Xin Hu, Haomiao Ni, Yunbei Zhang, Jihun Hamm, Zechen Li, Zhengming Ding

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable success, broad visual understanding, Vision language models, achieved remarkable, remarkable success

Comments: Accepted by CVPR 2026

Abstract:Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects due to the scarcity of such instances in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods remain computationally intensive when finetuning VLMs and do not fully exploit the original training data. In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs' reasoning over rare objects by refining visual tokens and enriching input text prompts, without finetuning the VLMs. Specifically, we propose to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples. These embeddings refine the visual tokens in VLMs through a lightweight attention-based enhancement module that improves fine-grained object details. In addition, we use the learned embeddings as object-aware detectors to generate informative hints, which are injected into the text prompts to help guide the VLM's attention toward relevant image regions. Experiments on two benchmarks show consistent and substantial gains for pretrained VLMs in rare object recognition and reasoning. Further analysis reveals how our method strengthens the VLM's ability to focus on and reason about rare objects.

66. 【2602.19611】RAID: Retrieval-Augmented Anomaly Detection

Link: https://arxiv.org/abs/2602.19611

Authors: Mingxiu Cai, Zhe Zhang, Gaochang Wu, Tianyou Chai, Xiatian Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unsupervised Anomaly Detection, identify abnormal regions, Unsupervised Anomaly, test images, aims to identify

Abstract:Unsupervised Anomaly Detection (UAD) aims to identify abnormal regions by establishing correspondences between test images and normal templates. Existing methods primarily rely on image reconstruction or template retrieval but face a fundamental challenge: matching between test images and normal templates inevitably introduces noise due to intra-class variations, imperfect correspondences, and limited templates. Observing that Retrieval-Augmented Generation (RAG) leverages retrieved samples directly in the generation process, we reinterpret UAD through this lens and introduce RAID, a retrieval-augmented UAD framework designed for noise-resilient anomaly detection and localization. Unlike standard RAG that enriches context or knowledge, we focus on using retrieved normal samples to guide noise suppression in anomaly map generation. RAID retrieves class-, semantic-, and instance-level representations from a hierarchical vector database, forming a coarse-to-fine pipeline. A matching cost volume correlates the input with retrieved exemplars, followed by a guided Mixture-of-Experts (MoE) network that leverages the retrieved samples to adaptively suppress matching noise and produce fine-grained anomaly maps. RAID achieves state-of-the-art performance across full-shot, few-shot, and multi-dataset settings on MVTec, VisA, MPDD, and BTAD benchmarks.

67. 【2602.19608】Satellite-Based Detection of Looted Archaeological Sites Using Machine Learning

Link: https://arxiv.org/abs/2602.19608

Authors: Girmaw Abebe Tadesse, Titien Bartette, Andrew Hassanali, Allen Kim, Jonathan Chemla, Andrew Zolli, Yves Ubelmann, Caleb Robinson, Inbal Becker-Reshef, Juan Lavista Ferres

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remains operationally difficult, archaeological sites poses, remote locations remains, locations remains operationally, archaeological sites

Abstract:Looting at archaeological sites poses a severe risk to cultural heritage, yet monitoring thousands of remote locations remains operationally difficult. We present a scalable and satellite-based pipeline to detect looted archaeological sites, using PlanetScope monthly mosaics (4.7 m/pixel) and a curated dataset of 1,943 archaeological sites in Afghanistan (898 looted, 1,045 preserved) with multi-year imagery (2016-2023) and site-footprint masks. We compare (i) end-to-end CNN classifiers trained on raw RGB patches and (ii) traditional machine learning (ML) trained on handcrafted spectral/texture features and embeddings from recent remote-sensing foundation models. Results indicate that ImageNet-pretrained CNNs combined with spatial masking reach an F1 score of 0.926, clearly surpassing the strongest traditional ML setup, which attains an F1 score of 0.710 using SatCLIP-V+RF+Mean, i.e., location and vision embeddings fed into a Random Forest with mean-based temporal aggregation. Ablation studies demonstrate that ImageNet pretraining (even in the presence of domain shift) and spatial masking enhance performance. In contrast, geospatial foundation model embeddings perform competitively with handcrafted features, suggesting that looting signatures are extremely localized. The repository is available at this https URL.

68. 【2602.19605】CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning

Link: https://arxiv.org/abs/2602.19605

Authors: Chunlei Meng, Guanhong Huang, Rong Fu, Runmin Jian, Zhongxue Gan, Chun Ouyang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: Multimodal learning aims, Multimodal learning, learning aims, aims to capture, multiple modalities

Comments: This study has been accepted by CVPR 2026

Abstract:Multimodal learning aims to capture both shared and private information from multiple modalities. However, existing methods that project all modalities into a single latent space for fusion often overlook the asynchronous, multi-level semantic structure of multimodal data. This oversight induces semantic misalignment and error propagation, thereby degrading representation quality. To address this issue, we propose Cross-Level Co-Representation (CLCR), which explicitly organizes each modality's features into a three-level semantic hierarchy and specifies level-wise constraints for cross-modal interactions. First, a semantic hierarchy encoder aligns shallow, mid, and deep features across modalities, establishing a common basis for interaction. Then, at each level, an Intra-Level Co-Exchange Domain (IntraCED) factorizes features into shared and private subspaces and restricts cross-modal attention to the shared subspace via a learnable token budget. This design ensures that only shared semantics are exchanged and prevents leakage from private channels. To integrate information across levels, the Inter-Level Co-Aggregation Domain (InterCAD) synchronizes semantic scales using learned anchors, selectively fuses the shared representations, and gates private cues to form a compact task representation. We further introduce regularization terms to enforce separation of shared and private features and to minimize cross-level interference. Experiments on six benchmarks spanning emotion recognition, event localization, sentiment analysis, and action recognition show that CLCR achieves strong performance and generalizes well across tasks.

69. 【2602.19596】Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception

Link: https://arxiv.org/abs/2602.19596

Authors: Yihang Tao, Senkang Hu, Haonan An, Zhengru Fang, Hangcheng Cao, Yuguang Fang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enhance driving safety, enables data sharing, autonomous vehicles, driving safety, sharing among connected

Comments: Accepted by CVPR'26

Abstract:Collaborative perception (CP) enables data sharing among connected and autonomous vehicles (CAVs) to enhance driving safety. However, CP systems are vulnerable to adversarial attacks where malicious agents forge false objects via feature-level perturbations. Current defensive systems use threshold-based consensus verification by comparing collaborative and ego detection results. Yet, these defenses remain vulnerable to more sophisticated attack strategies that could exploit two critical weaknesses: (i) lack of robustness against attacks with systematic timing and target region optimization, and (ii) inadvertent disclosure of vulnerability knowledge through implicit confidence information in shared collaboration data. In this paper, we propose MVIG attack, a novel adaptive adversarial CP framework learning to capture vulnerability knowledge disclosed by different defensive CP systems from a unified mutual view information graph (MVIG) representation. Our approach combines MVIG representation with temporal graph learning to generate evolving fabrication risk maps and employs entropy-aware vulnerability search to optimize attack location, timing and persistence, enabling adaptive attacks with generalizability across various defensive configurations. Extensive evaluations on OPV2V and Adv-OPV2V datasets demonstrate that MVIG attack reduces defense success rates by up to 62% against state-of-the-art defenses while achieving 47% lower detection for persistent attacks at 29.9 FPS, exposing critical security gaps in CP systems. Code will be released at this https URL

70. 【2602.19575】ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization

Link: https://arxiv.org/abs/2602.19575

Authors: Minseo Kim, Minchan Kwon, Dongyeun Lee, Yunho Jeon, Junmo Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: irrelevant residual information, generation suffers, information from reference, Personalized, concept

Comments: Accepted to CVPR 2026

Abstract:Personalized text-to-image generation suffers from concept entanglement, where irrelevant residual information from reference images is captured, leading to a trade-off between concept fidelity and text alignment. Recent disentanglement approaches attempt to solve this utilizing manual guidance, such as linguistic cues or segmentation masks, which limits their applicability and fails to fully articulate the target concept. In this paper, we propose ConceptPrism, a novel framework that automatically disentangles the shared visual concept from image-specific residuals by comparing images within a set. Our method jointly optimizes a target token and image-wise residual tokens using two complementary objectives: a reconstruction loss to ensure fidelity, and a novel exclusion loss that compels residual tokens to discard the shared concept. This process allows the target token to capture the pure concept without direct supervision. Extensive experiments demonstrate that ConceptPrism effectively resolves concept entanglement, achieving a significantly improved trade-off between fidelity and alignment.

71. 【2602.19571】HOCA-Bench: Beyond Semantic Perception to Predictive World Modeling via Hegelian Ontological-Causal Anomalies

Link: https://arxiv.org/abs/2602.19571

Authors: Chang Liu, Yunfan Ye, Qingyang Zhou, Xichen Tan, Mengxuan Luo, Zhenyu Qiu, Wei Peng, Zhiping Cai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: predictive world modeling, physically grounded intelligence, semantic perception, world modeling, grounded intelligence

Abstract:Video-LLMs have improved steadily on semantic perception, but they still fall short on predictive world modeling, which is central to physically grounded intelligence. We introduce HOCA-Bench, a benchmark that frames physical anomalies through a Hegelian lens. HOCA-Bench separates anomalies into two types: ontological anomalies, where an entity violates its own definition or persistence, and causal anomalies, where interactions violate physical relations. Using state-of-the-art generative video models as adversarial simulators, we build a testbed of 1,439 videos (3,470 QA pairs). Evaluations on 17 Video-LLMs show a clear cognitive lag: models often identify static ontological violations (e.g., shape mutations) but struggle with causal mechanisms (e.g., gravity or friction), with performance dropping by more than 20% on causal tasks. System-2 "Thinking" modes improve reasoning, but they do not close the gap, suggesting that current architectures recognize visual patterns more readily than they apply basic physical laws.

72. 【2602.19570】VALD: Multi-Stage Vision Attack Detection for Efficient LVLM Defense

Link: https://arxiv.org/abs/2602.19570

Authors: Nadav Kadvil, Ayellet Tal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, Large Vision-Language, subtly bias, bias their outputs, outputs toward plausible

Abstract:Large Vision-Language Models (LVLMs) can be vulnerable to adversarial images that subtly bias their outputs toward plausible yet incorrect responses. We introduce a general, efficient, and training-free defense that combines image transformations with agentic data consolidation to recover correct model behavior. A key component of our approach is a two-stage detection mechanism that quickly filters out the majority of clean inputs. We first assess image consistency under content-preserving transformations at negligible computational cost. For more challenging cases, we examine discrepancies in a text-embedding space. Only when necessary do we invoke a powerful LLM to resolve attack-induced divergences. A key idea is to consolidate multiple responses, leveraging both their similarities and their differences. We show that our method achieves state-of-the-art accuracy while maintaining notable efficiency: most clean images skip costly processing, and even in the presence of numerous adversarial examples, the overhead remains minimal.

73. 【2602.19565】DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces

Link: https://arxiv.org/abs/2602.19565

Authors: Li Zhang, Mingyu Mei, Ailing Wang, Xianhui Meng, Yan Zhong, Xinyuan Song, Liu Liu, Rujing Wang, Zaixing He, Cewu Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Articulated object pose, pose estimation, Articulated object, Articulation Pose Estimation, core task

Abstract:Articulated object pose estimation is a core task in embodied AI. Existing methods typically regress poses in a continuous space, but often struggle with 1) navigating a large, complex search space and 2) failing to incorporate intrinsic kinematic constraints. In this work, we introduce DICArt (DIsCrete Diffusion for Articulation Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the GT pose. To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a hierarchical kinematic coupling strategy, estimating the pose of each rigid part hierarchically to respect the object's kinematic structure. We validate DICArt on both synthetic and real-world datasets. Experimental results demonstrate its superior performance and robustness. By integrating discrete generative modeling with structural priors, DICArt offers a new paradigm for reliable category-level 6D pose estimation in complex environments.

74. 【2602.19562】A Multimodal Framework for Aligning Human Linguistic Descriptions with Visual Perceptual Data

Link: https://arxiv.org/abs/2602.19562

Authors: Joseph Bingham

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Establishing stable mappings, natural language expressions, artificial intelligence, Establishing stable, natural language

Comments: 19 pages, 6 figures, preprint

Abstract:Establishing stable mappings between natural language expressions and visual percepts is a foundational problem for both cognitive science and artificial intelligence. Humans routinely ground linguistic reference in noisy, ambiguous perceptual contexts, yet the mechanisms supporting such cross-modal alignment remain poorly understood. In this work, we introduce a computational framework designed to model core aspects of human referential interpretation by integrating linguistic utterances with perceptual representations derived from large-scale, crowd-sourced imagery. The system approximates human perceptual categorization by combining scale-invariant feature transform (SIFT) alignment with the Universal Quality Index (UQI) to quantify similarity in a cognitively plausible feature space, while a set of linguistic preprocessing and query-transformation operations captures pragmatic variability in referring expressions. We evaluate the model on the Stanford Repeated Reference Game corpus (15,000 utterances paired with tangram stimuli), a paradigm explicitly developed to probe human-level perceptual ambiguity and coordination. Our framework achieves robust referential grounding. It requires 65% fewer utterances than human interlocutors to reach stable mappings and can correctly identify target objects from single referring expressions 41.66% of the time (versus 20% for humans). These results suggest that relatively simple perceptual-linguistic alignment mechanisms can yield human-competitive behavior on a classic cognitive benchmark, and offer insights into models of grounded communication, perceptual inference, and cross-modal concept formation. Code is available at this https URL.

75. 【2602.19549】Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework

Link: https://arxiv.org/abs/2602.19549

Authors: Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Jiahao Huo, Shuliang Liu, James Kwok, Xuming Hu

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: Visual Document Retrieval, multimodal retrieval applications, Visual Document, current multimodal retrieval, retrieve relevant pages

Comments: Under review

Abstract:Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead, a problem that current efficiency methods like pruning and merging address imperfectly, creating a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce Prune-then-Merge, a novel two-stage framework that synergizes these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, creating a refined, high-signal set of embeddings. Subsequently, a hierarchical merging stage compresses this pre-filtered set, effectively summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. Extensive experiments on 29 VDR datasets demonstrate that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and providing robust performance at high compression ratios.
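The two-stage idea in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function name `prune_then_merge`, the per-patch `scores` input (the abstract does not say how patch informativeness is measured), and the greedy pairwise averaging used as a stand-in for the paper's hierarchical merging stage.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def prune_then_merge(embeddings, scores, keep_ratio, target_count):
    """Hypothetical two-stage compression of patch embeddings.

    Stage 1 keeps the highest-scoring patches; stage 2 greedily merges the
    most similar pair of clusters (count-weighted mean) until only
    target_count vectors remain.
    """
    # Stage 1: prune low-score patches, keeping the original patch order.
    n_keep = max(target_count, round(len(embeddings) * keep_ratio))
    top = sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)[:n_keep]
    clusters = [(list(embeddings[i]), 1) for i in sorted(top)]

    # Stage 2: agglomerative merging on the pre-filtered set.
    while len(clusters) > target_count:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][0], clusters[j][0])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        merged = [(a * ni + b * nj) / (ni + nj) for a, b in zip(ci, cj)]
        clusters[i] = (merged, ni + nj)
        del clusters[j]
    return [centroid for centroid, _ in clusters]


# Toy page with five 2-D "patch embeddings"; the last one has a low score.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
patch_scores = [0.9, 0.8, 0.7, 0.6, 0.1]
compressed = prune_then_merge(patches, patch_scores, keep_ratio=0.8, target_count=2)
```

In this toy run the low-score patch is dropped first, so it never enters a merged centroid, which is the kind of dilution the abstract attributes to single-stage methods.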

76. 【2602.19542】Vinedresser3D: Agentic Text-guided 3D Editing

Link: https://arxiv.org/abs/2602.19542

Authors: Yankuan Chi, Xiang Li, Zixuan Huang, James M. Rehg

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modify existing, natural-language instructions, aims to modify, editing, editing aims

Comments: CVPR 2026. Project website: https://vinedresser3d.github.io/

Abstract:Text-guided 3D editing aims to modify existing 3D assets using natural-language instructions. Current methods struggle to jointly understand complex prompts, automatically localize edits in 3D, and preserve unedited content. We introduce Vinedresser3D, an agentic framework for high-quality text-guided 3D editing that operates directly in the latent space of a native 3D generative model. Given a 3D asset and an editing prompt, Vinedresser3D uses a multimodal large language model to infer rich descriptions of the original asset, identify the edit region and edit type (addition, modification, deletion), and generate decomposed structural and appearance-level text guidance. The agent then selects an informative view and applies an image editing model to obtain visual guidance. Finally, an inversion-based rectified-flow inpainting pipeline with an interleaved sampling module performs editing in the 3D latent space, enforcing prompt alignment while maintaining 3D coherence and unedited regions. Experiments on diverse 3D edits demonstrate that Vinedresser3D outperforms prior baselines in both automatic metrics and human preference studies, while enabling precise, coherent, and mask-free 3D editing.

77. 【2602.19540】A Green Learning Approach to LDCT Image Restoration

Link: https://arxiv.org/abs/2602.19540

Authors: Wei Wang, Yixing Wu, C.-C. Jay Kuo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: approach to restore, LDCT, work proposes, green learning, LDCT images

Comments: Published in IEEE International Conference on Image Processing (ICIP), 2025, pp. 1762-1767. Final version available at IEEE Xplore

Abstract:This work proposes a green learning (GL) approach to restore medical images. Without loss of generality, we use low-dose computed tomography (LDCT) images as examples. LDCT images are susceptible to noise and artifacts, where the imaging process introduces distortion. LDCT image restoration is an important preprocessing step for further medical analysis. Deep learning (DL) methods have been developed to solve this problem. We examine an alternative solution using the Green Learning (GL) methodology. The new restoration method is characterized by mathematical transparency, computational and memory efficiency, and high performance. Experiments show that our GL method offers state-of-the-art restoration performance at a smaller model size and with lower inference complexity.

78. 【2602.19539】Can a Teenager Fool an AI? Evaluating Low-Cost Cosmetic Attacks on Age Estimation Systems

Link: https://arxiv.org/abs/2602.19539

Authors: Xingyu Shen, Tommy Duong, Xiaodong An, Zengqi Zhao, Zebang Hu, Haoyu Hu, Ziyou Wang, Finn Guo, Simiao Ren

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: age-restricted online content, Age estimation systems, online content, systematically evaluated, estimation systems

Comments: 13 pages, 6 figures

Abstract:Age estimation systems are increasingly deployed as gatekeepers for age-restricted online content, yet their robustness to cosmetic modifications has not been systematically evaluated. We investigate whether simple, household-accessible cosmetic changes, including beards, grey hair, makeup, and simulated wrinkles, can cause AI age estimators to classify minors as adults. To study this threat at scale without ethical concerns, we simulate these physical attacks on 329 facial images of individuals aged 10 to 21 using a VLM image editor (Gemini 2.5 Flash Image). We then evaluate eight models from our prior benchmark: five specialized architectures (MiVOLO, Custom-Best, Herosan, MiViaLab, DEX) and three vision-language models (Gemini 3 Flash, Gemini 2.5 Flash, GPT-5-Nano). We introduce the Attack Conversion Rate (ACR), defined as the fraction of images predicted as minor at baseline that flip to adult after attack, a population-agnostic metric that does not depend on the ratio of minors to adults in the test set. Our results reveal that a synthetic beard alone achieves 28 to 69 percent ACR across all eight models; combining all four attacks shifts predicted age by +7.7 years on average across all 329 subjects and reaches up to 83 percent ACR; and vision-language models exhibit lower ACR (59 to 71 percent) than specialized models (63 to 83 percent) under the full attack, although the ACR ranges overlap and the difference is not statistically tested. These findings highlight a critical vulnerability in deployed age-verification pipelines and call for adversarial robustness evaluation as a mandatory criterion for model selection.
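The Attack Conversion Rate defined in this abstract is simple enough to sketch directly. The snippet below is a minimal illustration, not the authors' evaluation code: the function name and the sample predictions are made up, and the adult cutoff of 18 years is an assumption, since the abstract does not state the threshold used.

```python
def attack_conversion_rate(baseline_ages, attacked_ages, adult_threshold=18.0):
    """Fraction of images predicted as minor at baseline that are predicted
    as adult after the attack. Returns None if no image is predicted minor
    at baseline. adult_threshold=18.0 is an assumed cutoff."""
    if len(baseline_ages) != len(attacked_ages):
        raise ValueError("baseline and attacked predictions must be paired")
    # Only images already predicted as minor can be "converted".
    eligible = [
        (before, after)
        for before, after in zip(baseline_ages, attacked_ages)
        if before < adult_threshold
    ]
    if not eligible:
        return None
    flipped = sum(1 for _, after in eligible if after >= adult_threshold)
    return flipped / len(eligible)


# Made-up predicted ages for five images, before and after a cosmetic attack.
baseline = [15.2, 16.8, 19.5, 17.1, 14.0]
attacked = [21.0, 17.5, 24.0, 18.3, 16.2]
acr = attack_conversion_rate(baseline, attacked)  # 2 of 4 eligible images flip
```

Because the denominator counts only images predicted as minor at baseline, the third image (already predicted adult) does not affect the result, which is why the metric does not depend on the minor-to-adult ratio of the test set.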

79. 【2602.19536】Fore-Mamba3D: Mamba-based Foreground-Enhanced Encoding for 3D Object Detection

Link: https://arxiv.org/abs/2602.19536

Authors: Zhiwei Ning, Xuanang Gao, Jiaxi Cao, Runze Yang, Huiying Xu, Xinzhong Zhu, Jie Yang, Wei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Linear modeling methods, previous Mamba-based methods, Mamba-based methods utilize, object detection task, foreground voxels

Abstract:Linear modeling methods like Mamba have emerged as an effective backbone for the 3D object detection task. However, previous Mamba-based methods apply bidirectional encoding to the whole non-empty voxel sequence, which contains abundant useless background information in the scenes. Though directly encoding foreground voxels appears to be a plausible solution, it tends to degrade detection performance. We attribute this to the response attenuation and restricted context representation in the linear modeling for fore-only sequences. To address this problem, we propose a novel backbone, termed Fore-Mamba3D, to focus on the foreground enhancement by modifying the Mamba-based encoder. The foreground voxels are first sampled according to the predicted scores. Considering the response attenuation existing in the interaction of foreground voxels across different instances, we design a regional-to-global slide window (RGSW) to propagate the information from regional split to the entire sequence. Furthermore, a semantic-assisted and state spatial fusion module (SASFMamba) is proposed to enrich contextual representation by enhancing semantic and geometric awareness within the Mamba model. Our method emphasizes foreground-only encoding and alleviates the distance-based and causal dependencies in the linear autoregression model. The superior performance across various benchmarks demonstrates the effectiveness of Fore-Mamba3D in the 3D object detection task.

80. 【2602.19530】ORION: ORthonormal Text Encoding for Universal VLM AdaptatION

Link: https://arxiv.org/abs/2602.19530

Authors: Omprakash Chakraborty, Jose Dolz, Ismail Ben Ayed

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision language models, demonstrated remarkable generalization, Vision language, performance remains constrained, language models

Abstract:Vision language models (VLMs) have demonstrated remarkable generalization across diverse tasks, yet their performance remains constrained by the quality and geometry of the textual prototypes used to represent classes. Standard zero shot classifiers, derived from frozen text encoders and handcrafted prompts, may yield correlated or weakly separated embeddings that limit task specific discriminability. We introduce ORION, a text encoder fine tuning framework that improves pretrained VLMs using only class names. Our method optimizes, via low rank adaptation, a novel loss integrating two terms, one promoting pairwise orthogonality between the textual representations of the classes of a given task and the other penalizing deviations from the initial class prototypes. Furthermore, we provide a probabilistic interpretation of our orthogonality penalty, connecting it to the general maximum likelihood estimation (MLE) principle via Huygens theorem. We report extensive experiments on 11 benchmarks and three large VLM backbones, showing that the refined textual embeddings yield powerful replacements for the standard CLIP prototypes. Added as a plug and play module on top of various state of the art methods, and across different prediction settings (zero shot, few shot and test time adaptation), ORION improves the performance consistently and significantly.
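To make the two-term objective described in this abstract concrete, here is one plausible form of such a loss. It is only a sketch under stated assumptions: the abstract gives neither the exact penalty nor the weighting, so the squared-cosine orthogonality term, the mean squared deviation term, the `anchor_weight` parameter, and the function name `orion_style_loss` are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def orion_style_loss(prototypes, initial_prototypes, anchor_weight=1.0):
    """Assumed two-term loss on class text prototypes.

    Term 1: sum of squared cosine similarities over all class pairs, which
            is zero when the prototypes are pairwise orthogonal.
    Term 2: mean squared distance between each unit-normalized prototype
            and its initial value, which discourages drifting away.
    """
    unit = [normalize(p) for p in prototypes]
    unit_init = [normalize(p) for p in initial_prototypes]
    num_classes = len(unit)

    orthogonality = 0.0
    for i in range(num_classes):
        for j in range(i + 1, num_classes):
            cos_ij = sum(a * b for a, b in zip(unit[i], unit[j]))
            orthogonality += cos_ij ** 2

    deviation = sum(
        sum((a - b) ** 2 for a, b in zip(u, u0))
        for u, u0 in zip(unit, unit_init)
    ) / num_classes

    return orthogonality + anchor_weight * deviation


# Two collinear prototypes are maximally penalized by the first term,
# while orthogonal prototypes that stay at their initial values cost nothing.
loss_collinear = orion_style_loss([[2.0, 0.0], [3.0, 0.0]], [[2.0, 0.0], [3.0, 0.0]])
loss_orthogonal = orion_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In a real setup the prototypes would be text-encoder outputs updated through low-rank adapters; the sketch only evaluates the loss on fixed vectors.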

81. 【2602.19523】OSInsert: Towards High-authenticity and High-fidelity Image Composition

Link: https://arxiv.org/abs/2602.19523

Authors: Jingyuan Wang, Li Niu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generative image composition, image composition aims, realistic composite image, Generative image, image composition

Abstract:Generative image composition aims to regenerate the given foreground object in the background image to produce a realistic composite image. Some high-authenticity methods can adjust the foreground pose/view to be compatible with the background, while some high-fidelity methods can preserve the foreground details accurately. However, existing methods can hardly achieve both goals at the same time. In this work, we propose a two-stage strategy to achieve both goals. In the first stage, we use a high-authenticity method to generate a reasonable foreground shape, which serves as the condition for the high-fidelity method in the second stage. The experiments on the MureCOM dataset verify the effectiveness of our two-stage strategy. The code and model have been released at this https URL.

82. 【2602.19517】Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Link: https://arxiv.org/abs/2602.19517

Authors: Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen

Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: STEM domains, CFE, large language models, multimodal benchmark

Abstract:We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE is curated from repeatedly used, authentic university homework and exam problems, together with reference solutions provided by course instructors. CFE presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically have more reasoning steps than those provided by the instructor, indicating suboptimal step efficiency and a higher risk of error accumulation. The data and code are available at this https URL.

83. 【2602.19512】Variational Trajectory Optimization of Anisotropic Diffusion Schedules

Link: https://arxiv.org/abs/2602.19512

Authors: Pengxi Liu,Zeyu Michael Li,Xiang Cheng

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: noise schedules parameterized, theta, matrix-valued noise schedules, introduce a variational, noise schedules

Abstract:We introduce a variational framework for diffusion models with anisotropic noise schedules parameterized by a matrix-valued path $M_t(\theta)$ that allocates noise across subspaces. Central to our framework is a trajectory-level objective that jointly trains the score network and learns $M_t(\theta)$, which encompasses general parameterization classes of matrix-valued noise schedules. We further derive an estimator for the derivative with respect to $\theta$ of the score that enables efficient optimization of the $M_t(\theta)$ schedule. For inference, we develop an efficiently-implementable reverse-ODE solver that is an anisotropic generalization of the second-order Heun discretization algorithm. Across CIFAR-10, AFHQv2, FFHQ, and ImageNet-64, our method consistently improves upon the baseline EDM model in all NFE regimes. Code is available at this https URL.
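For readers unfamiliar with the second-order Heun discretization mentioned in the abstract above, the following is a minimal sketch of the standard scalar Heun method for a generic ODE. It is not the paper's anisotropic solver; the function name `heun_solve` and the test problem dy/dt = -y are assumptions chosen purely for illustration.

```python
import math


def heun_solve(f, y0, t0, t1, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with Heun's second-order method.

    Illustrative only: a plain scalar solver, not the anisotropic
    reverse-ODE solver described in the paper.
    """
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        k1 = f(t, y)                 # slope at the start of the step
        y_pred = y + h * k1          # Euler predictor
        k2 = f(t + h, y_pred)        # slope at the predicted endpoint
        y = y + 0.5 * h * (k1 + k2)  # trapezoidal corrector
        t += h
    return y


# Hypothetical test problem: dy/dt = -y with y(0) = 1, exact solution exp(-t).
approx = heun_solve(lambda t, y: -y, 1.0, 0.0, 1.0, 100)
exact = math.exp(-1.0)
```

Each step costs two evaluations of `f`, which is why samplers built on this scheme usually report results per number of function evaluations (NFE).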

84. 【2602.19506】Relational Feature Caching for Accelerating Diffusion Transformers

Link: https://arxiv.org/abs/2602.19506

Authors: Byunggwan Son,Jeimin Jeon,Jeongwoo Choi,Bumsub Ham

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: accelerate diffusion transformers, approaches accelerate diffusion, caching approaches accelerate, output features, computationally expensive modules

Comments: Accepted to ICLR 2026

Abstract:Feature caching approaches accelerate diffusion transformers (DiTs) by storing the output features of computationally expensive modules at certain timesteps, and exploiting them for subsequent steps to reduce redundant computations. Recent forecasting-based caching approaches employ temporal extrapolation techniques to approximate the output features with cached ones. Although effective, relying exclusively on temporal extrapolation still suffers from significant prediction errors, leading to performance degradation. Through a detailed analysis, we find that 1) these errors stem from the irregular magnitude of changes in the output features, and 2) an input feature of a module is strongly correlated with the corresponding output. Based on this, we propose relational feature caching (RFC), a novel framework that leverages the input-output relationship to enhance the accuracy of the feature prediction. Specifically, we introduce relational feature estimation (RFE) to estimate the magnitude of changes in the output features from the inputs, enabling more accurate feature predictions. We also present relational cache scheduling (RCS), which estimates the prediction errors using the input features and performs full computations only when the errors are expected to be substantial. Extensive experiments across various DiT models demonstrate that RFC consistently outperforms prior approaches significantly. Project page is available at this https URL

85. 【2602.19505】Test-Time Computing for Referring Multimodal Large Language Models

Link: https://arxiv.org/abs/2602.19505

Authors: Mingrui Wu,Hao Chen,Jiayi Ji,Xiaoshuai Sun,Zhiyuan Liu,Liujuan Cao,Ming-Ming Cheng,Rongrong Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: test-time adaptation framework, frozen multimodal large, enable fine-grained region-based, multimodal large language, injects learnable visual

Comments: arXiv admin note: substantial text overlap with [arXiv:2407.21534](https://arxiv.org/abs/2407.21534)

Abstract:We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention towards user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain generalization and interpretability. The code is available at this https URL.

86. 【2602.19503】A Text-Guided Vision Model for Enhanced Recognition of Small Instances

Link: https://arxiv.org/abs/2602.19503

Authors: Hyun-Ki Jung

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: accurately identify specific, identify specific targets, detection technology continues, object detection technology, continues to evolve

Comments: Accepted for publication in Applied Computer Science (2026)

Abstract:As drone-based object detection technology continues to evolve, the demand is shifting from merely detecting objects to enabling users to accurately identify specific targets. For example, users can input particular targets as prompts to precisely detect desired objects. To address this need, an efficient text-guided object detection model has been developed to enhance the detection of small objects. Specifically, an improved version of the existing YOLO-World model is introduced. The proposed method replaces the C2f layer in the YOLOv8 backbone with a C3k2 layer, enabling more precise representation of local features, particularly for small objects or those with clearly defined boundaries. Additionally, the proposed architecture improves processing speed and efficiency through parallel processing optimization, while also contributing to a more lightweight model design. Comparative experiments on the VisDrone dataset show that the proposed model outperforms the original YOLO-World model, with precision increasing from 40.6% to 41.6%, recall from 30.8% to 31%, F1 score from 35% to 35.5%, and mAP@0.5 from 30.4% to 30.7%, confirming its enhanced accuracy. Furthermore, the model demonstrates superior lightweight performance, with the parameter count reduced from 4 million to 3.8 million and FLOPs decreasing from 15.7 billion to 15.2 billion. These results indicate that the proposed approach provides a practical and effective solution for precise object detection in drone-based applications.
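The F1 scores quoted in the abstract above follow directly from the reported precision and recall, since F1 is their harmonic mean. The short check below reproduces that arithmetic; the helper name `f1_score` is ours, and the inputs are the percentages stated in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall percentages as reported in the abstract.
f1_baseline = f1_score(40.6, 30.8)  # original YOLO-World
f1_proposed = f1_score(41.6, 31.0)  # proposed model
```

Rounded to one decimal place, these give 35.0 and 35.5, matching the reported F1 values of 35% and 35.5%.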

87. 【2602.19497】MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models

Link: https://arxiv.org/abs/2602.19497

Authors: Mingrui Wu,Hang Liu,Jiayi Ji,Xiaoshuai Sun,Rongrong Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unified Multimodal Models, Recent advancements, enabled remarkable image, remarkable image understanding, advancements in Unified

Comments: CVPR2026

Abstract:Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. We further propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, where a multimodal large language model (MLLM) serves as a verifier. Additionally, we present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations. Extensive experiments on various state-of-the-art open-source models demonstrate both the rigor of MICON-Bench in exposing multi-image reasoning challenges and the efficacy of DAR in improving generation quality and cross-image coherence. Github: this https URL.

88. 【2602.19487】Exploiting Label-Independent Regularization from Spatial Dependencies for Whole Slide Image Analysis

Link: https://arxiv.org/abs/2602.19487

Authors: Weiyi Wu,Xinwen Xu,Chongyang Gao,Xingjian Diao,Siting Li,Jiang Gui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: precise disease diagnosis, slide images, tissue samples, disease diagnosis, gigapixel-scale panoramas

Abstract:Whole slide images, with their gigapixel-scale panoramas of tissue samples, are pivotal for precise disease diagnosis. However, their analysis is hindered by immense data size and scarce annotations. Existing MIL methods face challenges due to the fundamental imbalance where a single bag-level label must guide the learning of numerous patch-level features. This sparse supervision makes it difficult to reliably identify discriminative patches during training, leading to unstable optimization and suboptimal solutions. We propose a spatially regularized MIL framework that leverages inherent spatial relationships among patch features as label-independent regularization signals. Our approach learns a shared representation space by jointly optimizing feature-induced spatial reconstruction and label-guided classification objectives, enforcing consistency between intrinsic structural patterns and supervisory signals. Experimental results on multiple public datasets demonstrate significant improvements over state-of-the-art methods, offering a promising direction.

89. 【2602.19474】Structured Bitmap-to-Mesh Triangulation for Geometry-Aware Discretization of Image-Derived Domains

Link: https://arxiv.org/abs/2602.19474

Authors: Wei Feng,Haiyong Zheng

Categories: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: regular triangular grid, template-driven triangulation framework, stable PDE discretization, embeds raster, constrained Delaunay triangulation

Comments: Revised version after peer review; under review at Graphical Models. Earlier version appeared on SSRN

Abstract:We propose a template-driven triangulation framework that embeds raster- or segmentation-derived boundaries into a regular triangular grid for stable PDE discretization on image-derived domains. Unlike constrained Delaunay triangulation (CDT), which may trigger global connectivity updates, our method retriangulates only triangles intersected by the boundary, preserves the base mesh, and supports synchronization-free parallel execution. To ensure determinism and scalability, we classify all local boundary-intersection configurations up to discrete equivalence and triangle symmetries, yielding a finite symbolic lookup table that maps each case to a conflict-free retriangulation template. We prove that the resulting mesh is closed, has bounded angles, and is compatible with cotangent-based discretizations and standard finite element methods. Experiments on elliptic and parabolic PDEs, signal interpolation, and structural metrics show fewer sliver elements, more regular triangles, and improved geometric fidelity near complex boundaries. The framework is well suited for real-time geometric analysis and physically based simulation over image-derived domains.

90. 【2602.19471】Forgetting-Resistant and Lesion-Aware Source-Free Domain Adaptive Fundus Image Analysis with Vision-Language Model

Link: https://arxiv.org/abs/2602.19471

Authors: Zheang Huai,Hui Tang,Hualiang Wang,Xiaomeng Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Source-free domain adaptation, unlabeled target domain, target domain data, Source-free domain, ViL model

Comments: 10 pages

Abstract:Source-free domain adaptation (SFDA) aims to adapt a model trained in the source domain to perform well in the target domain, with only unlabeled target domain data and the source model. Taking into account that conventional SFDA methods are inevitably error-prone under domain shift, recently greater attention has been directed to SFDA assisted with off-the-shelf foundation models, e.g., vision-language (ViL) models. However, existing works of leveraging ViL models for SFDA confront two issues: (i) Although mutual information is exploited to consider the joint distribution between the predictions of ViL model and the target model, we argue that the forgetting of some superior predictions of the target model still occurs, as indicated by the decline of the accuracies of certain classes during adaptation; (ii) Prior research disregards the rich, fine-grained knowledge embedded in the ViL model, which offers detailed grounding for fundus image diagnosis. In this paper, we introduce a novel forgetting-resistant and lesion-aware (FRLA) method for SFDA of fundus image diagnosis with ViL model. Specifically, a forgetting-resistant adaptation module explicitly preserves the confident predictions of the target model, and a lesion-aware adaptation module yields patch-wise predictions from ViL model and employs them to help the target model be aware of the lesion areas and leverage the ViL model's fine-grained knowledge. Extensive experiments show that our method not only significantly outperforms the vision-language model, but also achieves consistent improvements over the state-of-the-art methods. Our code will be released.

91. 【2602.19470】Physics-informed Active Polarimetric 3D Imaging for Specular Surfaces

Link: https://arxiv.org/abs/2602.19470

Authors: Jiazhang Wang,Hyelim Yang,Tianyi Wang,Florian Willomitzer

Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)

Keywords: surfaces remains challenging, real-world scenarios, hand-held scanning, remains challenging, challenging in real-world

Abstract:3D imaging of specular surfaces remains challenging in real-world scenarios, such as in-line inspection or hand-held scanning, requiring fast and accurate measurement of complex geometries. Optical metrology techniques such as deflectometry achieve high accuracy but typically rely on multi-shot acquisition, making them unsuitable for dynamic environments. Fourier-based single-shot approaches alleviate this constraint, yet their performance deteriorates when measuring surfaces with high spatial frequency structure or large curvature. Alternatively, polarimetric 3D imaging in computer vision operates in a single-shot fashion and exhibits robustness to geometric complexity. However, its accuracy is fundamentally limited by the orthographic imaging assumption. In this paper, we propose a physics-informed deep learning framework for single-shot 3D imaging of complex specular surfaces. Polarization cues provide orientation priors that assist in interpreting geometric information encoded by structured illumination. These complementary cues are processed through a dual-encoder architecture with mutual feature modulation, allowing the network to resolve their nonlinear coupling and directly infer surface normals. The proposed method achieves accurate and robust normal estimation in single-shot with fast inference, enabling practical 3D imaging of complex specular surfaces.

92. 【2602.19461】Laplacian Multi-scale Flow Matching for Generative Modeling

Link: https://arxiv.org/abs/2602.19461

Authors: Zelin Zhao,Petr Molodyk,Haotian Xue,Yongxin Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: present Laplacian multiscale, image generative modeling, Laplacian multiscale flow, generative modeling, present Laplacian

Comments: Accepted to appear in ICLR 2026

Abstract:In this paper, we present Laplacian multiscale flow matching (LapFlow), a novel framework that enhances flow matching by leveraging multi-scale representations for image generative modeling. Our approach decomposes images into Laplacian pyramid residuals and processes different scales in parallel through a mixture-of-transformers (MoT) architecture with causal attention mechanisms. Unlike previous cascaded approaches that require explicit renoising between scales, our model generates multi-scale representations in parallel, eliminating the need for bridging processes. The proposed multi-scale architecture not only improves generation quality but also accelerates the sampling process and promotes scaling flow matching methods. Through extensive experimentation on CelebA-HQ and ImageNet, we demonstrate that our method achieves superior sample quality with fewer GFLOPs and faster inference compared to single-scale and multi-scale flow matching baselines. The proposed model scales effectively to high-resolution generation (up to 1024$\times$1024) while maintaining lower computational overhead.
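The Laplacian pyramid decomposition referenced in the abstract above can be illustrated with a small sketch. To stay dependency-free it works on a 1D signal and uses pairwise averaging in place of the usual Gaussian blur; both are simplifying assumptions, and the function names are hypothetical. It shows only the decomposition into residuals and the exact reconstruction, not the paper's flow-matching model.

```python
def downsample(signal):
    """Halve the resolution by averaging adjacent pairs (box filter)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Double the resolution by repeating each sample."""
    out = []
    for value in signal:
        out.extend([value, value])
    return out


def build_pyramid(signal, levels):
    """Return (residuals, coarse); len(signal) must be divisible by 2**levels."""
    residuals = []
    current = list(signal)
    for _ in range(levels):
        coarse = downsample(current)
        # Residual = detail lost when going to the coarser scale.
        residuals.append([a - b for a, b in zip(current, upsample(coarse))])
        current = coarse
    return residuals, current


def reconstruct(residuals, coarse):
    """Invert build_pyramid by adding residuals back from coarse to fine."""
    current = list(coarse)
    for residual in reversed(residuals):
        current = [u + r for u, r in zip(upsample(current), residual)]
    return current


signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
residuals, coarse = build_pyramid(signal, levels=2)
restored = reconstruct(residuals, coarse)
```

Because each residual stores exactly what the coarser level discards, summing the levels back recovers the input, which is what allows the scales to be modeled as separate components.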

93. 【2602.19454】HD-TTA: Hypothesis-Driven Test-Time Adaptation for Safer Brain Tumor Segmentation

Link: https://arxiv.org/abs/2602.19454

Authors: Kartik Jhawar,Lipo Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Standard Test-Time Adaptation, filtered test samples, typically treat inference, Standard Test-Time, methods typically treat

Comments: 11 pages, 3 figures, 2 tables

Abstract:Standard Test-Time Adaptation (TTA) methods typically treat inference as a blind optimization task, applying generic objectives to all or filtered test samples. In safety-critical medical segmentation, this lack of selectivity often causes the tumor mask to spill into healthy brain tissue or degrades predictions that were already correct. We propose Hypothesis-Driven TTA, a novel framework that reformulates adaptation as a dynamic decision process. Rather than forcing a single optimization trajectory, our method generates intuitive competing geometric hypotheses: compaction (is the prediction noisy? trim artifacts) versus inflation (is the valid tumor under-segmented? safely inflate to recover). It then employs a representation-guided selector to autonomously identify the safest outcome based on intrinsic texture consistency. Additionally, a pre-screening Gatekeeper prevents negative transfer by skipping adaptation on confident cases. We validate this proof-of-concept on a cross-domain binary brain tumor segmentation task, applying a source model trained on adult BraTS gliomas to unseen pediatric and more challenging meningioma target domains. HD-TTA improves safety-oriented outcomes (Hausdorff Distance (HD95) and Precision) over several state-of-the-art representative baselines in the challenging safety regime, reducing the HD95 by approximately 6.4 mm and improving Precision by over 4%, while maintaining comparable Dice scores. These results demonstrate that resolving the safety-adaptation trade-off via explicit hypothesis selection is a viable, robust path for safe clinical model deployment. Code will be made publicly available upon acceptance.

94. 【2602.19449】Decoupling Vision and Language: Codebook Anchored Visual Adaptation

Link: https://arxiv.org/abs/2602.19449

Authors: Jason Wu,Tianchen Zhao,Chang Liu,Jiarui Cai,Zheng Zhang,Zhuowei Li,Aaditya Singh,Xiang Xu,Mani Srivastava,Jonathan Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, medical image diagnosis, Large Vision-Language, translate images, medical image

Comments: 17 pages, accepted to CVPR2026 main conference

Abstract:Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.

95. 【2602.19442】UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

Link: https://arxiv.org/abs/2602.19442

Authors: Yecheng Zhang,Rong Zhao,Zhizhou Sha,Yong Li,Lei Wang,Ce Hou,Wen Ji,Hao Huang,Yunshan Wan,Jian Yu,Junhao Xia,Yuru Zhang,Chunlei Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Aligning vision-language model, typically requires fine-tuning, demand labelled data, domain-specific tasks typically, tasks typically requires

Comments: 26 pages

Abstract:Aligning vision-language model (VLM) outputs with human preferences in domain-specific tasks typically requires fine-tuning or reinforcement learning, both of which demand labelled data and GPU compute. We show that for subjective perception tasks, this alignment can be achieved without any model training: VLMs are already strong concept extractors but poor decision calibrators, and the gap can be closed externally. We propose a training-free post-hoc concept-bottleneck pipeline consisting of three tightly coupled stages: concept mining, multi-agent structured scoring, and geometric calibration, unified by an end-to-end dimension optimization loop. Interpretable evaluation dimensions are mined from a handful of human annotations; an Observer-Debater-Judge chain extracts robust continuous concept scores from a frozen VLM; and locally-weighted ridge regression on a hybrid visual-semantic manifold calibrates these scores against human ratings. Applied to urban perception as UrbanAlign, the framework achieves 72.2% accuracy ($\kappa=0.45$) on Place Pulse 2.0 across six categories, outperforming the best supervised baseline by +15.1 pp and uncalibrated VLM scoring by +16.3 pp, with full dimension-level interpretability and zero model-weight modification.
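The calibration stage in the abstract above relies on locally-weighted ridge regression. A minimal sketch of that general technique follows, using a single scalar feature and a Gaussian kernel; the kernel choice, the `bandwidth` and `ridge` parameters, and the function name are all assumptions for illustration, and the paper's hybrid visual-semantic manifold is not modeled here.

```python
import math


def local_ridge_predict(xs, ys, x0, bandwidth=1.0, ridge=1e-3):
    """Predict y at x0 from a kernel-weighted linear fit with an L2 penalty on the slope.

    Toy 1D version: fits y ~ a + b * (x - x0) using weights that decay with
    distance from x0, so the intercept a is the prediction at x0.
    """
    weights = [math.exp(-((x - x0) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    offsets = [x - x0 for x in xs]

    # Weighted sufficient statistics for the 2x2 normal equations.
    sw = sum(weights)
    swu = sum(w * u for w, u in zip(weights, offsets))
    swuu = sum(w * u * u for w, u in zip(weights, offsets)) + ridge
    swy = sum(w * y for w, y in zip(weights, ys))
    swuy = sum(w * u * y for w, u, y in zip(weights, offsets, ys))

    det = sw * swuu - swu * swu
    intercept = (swy * swuu - swu * swuy) / det  # Cramer's rule for a
    return intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
pred = local_ridge_predict(xs, ys, 2.5, bandwidth=1.0, ridge=0.0)
```

With `ridge=0.0` and exactly linear data the fit recovers the line, so `pred` is 6.0 up to floating-point error; a positive `ridge` shrinks the local slope and stabilizes the fit when few neighbors carry weight.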

96. 【2602.19437】FinSight-Net: A Physics-Aware Decoupled Network with Frequency-Domain Compensation for Underwater Fish Detection in Smart Aquaculture

Link: https://arxiv.org/abs/2602.19437

Authors: Jinsong Yang,Zeyuan Hu,Yichen Li,Hong Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Underwater fish detection, marine ecological monitoring, fundamentally limits UFD, Underwater fish, core capability

Abstract:Underwater fish detection (UFD) is a core capability for smart aquaculture and marine ecological monitoring. While recent detectors improve accuracy by stacking feature extractors or introducing heavy attention modules, they often incur substantial computational overhead and, more importantly, neglect the physics that fundamentally limits UFD: wavelength-dependent absorption and turbidity-induced scattering significantly degrade contrast, blur fine structures, and introduce backscattering noise, leading to unreliable localization and recognition. To address these challenges, we propose FinSight-Net, an efficient and physics-aware detection framework tailored for complex aquaculture environments. FinSight-Net introduces a Multi-Scale Decoupled Dual-Stream Processing (MS-DDSP) bottleneck that explicitly targets frequency-specific information loss via heterogeneous convolutional branches, suppressing backscattering artifacts while compensating distorted biological cues through scale-aware and channel-weighted pathways. We further design an Efficient Path Aggregation FPN (EPA-FPN) as a detail-filling mechanism: it restores high-frequency spatial information typically attenuated in deep layers by establishing long-range skip connections and pruning redundant fusion routes, enabling robust detection of non-rigid fish targets under severe blur and turbidity. Extensive experiments on DeepFish, AquaFishSet, and our challenging UW-BlurredFish benchmark demonstrate that FinSight-Net achieves state-of-the-art performance. In particular, on UW-BlurredFish, FinSight-Net reaches 92.8% mAP, outperforming YOLOv11s by 4.8% while reducing parameters by 29.0%, providing a strong and lightweight solution for real-time automated monitoring in smart aquaculture.

97. 【2602.19432】CountEx: Fine-Grained Counting via Exemplars and Exclusion

Link: https://arxiv.org/abs/2602.19432

Authors: Yifeng Huang,Gia Khanh Nguyen,Minh Hoai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visually similar distractors, explicitly exclude visually, exclude visually similar, paper presents CountEx, counting framework designed

Abstract:This paper presents CountEx, a discriminative visual counting framework designed to address a key limitation of existing prompt-based methods: the inability to explicitly exclude visually similar distractors. While current approaches allow users to specify what to count via inclusion prompts, they often struggle in cluttered scenes with confusable object categories, leading to ambiguity and overcounting. CountEx enables users to express both inclusion and exclusion intent, specifying what to count and what to ignore, through multimodal prompts including natural language descriptions and optional visual exemplars. At the core of CountEx is a novel Discriminative Query Refinement module, which jointly reasons over inclusion and exclusion cues by first identifying shared visual features, then isolating exclusion-specific patterns, and finally applying selective suppression to refine the counting query. To support systematic evaluation of fine-grained counting methods, we introduce CoCount, a benchmark comprising 1,780 videos and 10,086 annotated frames across 97 category pairs. Experiments show that CountEx achieves substantial improvements over state-of-the-art methods for counting objects from both known and novel categories. The data and code are available at this https URL.

98. 【2602.19430】TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation

Link: https://arxiv.org/abs/2602.19430

Authors: Dong-Guw Lee,Tai Hyoung Rhee,Hyunsoo Jang,Young-Sik Shin,Ukcheol Shin,Ayoung Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale data collection, TIR-based perception, pseudo TIR data, inherent advantages, collection and annotation

Abstract:Despite the inherent advantages of thermal infrared (TIR) imaging, large-scale data collection and annotation remain a major bottleneck for TIR-based perception. A practical alternative is to synthesize pseudo TIR data via image translation; however, most RGB-to-TIR approaches heavily rely on RGB-centric priors that overlook thermal physics, yielding implausible heat distributions. In this paper, we introduce TherA, a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both scene and object level. TherA couples TherA-VLM with a latent-diffusion-based translator. Given a single RGB image and a user-prompted condition pair, TherA-VLM yields a thermal-aware embedding that encodes scene, object, material, and heat-emission context reflecting the input scene-condition pair. Conditioning the diffusion model on this embedding enables realistic TIR synthesis and fine-grained control across time of day, weather, and object state. Compared to other baselines, TherA achieves state-of-the-art translation performance, improving zero-shot translation performance by up to 33%, averaged across all metrics.

99. 【2602.19424】Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

Link: https://arxiv.org/abs/2602.19424

Authors: Yuxuan Yang,Zhonghao Yan,Yi Zhang,Bo Yun,Muxi Diao,Guowei Zhao,Kongming Liang,Wenbin Li,Zhanyu Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Slide Images, Carcinoma diagnosis relies, gigapixel Whole Slide, Hepatocellular Carcinoma diagnosis, Hepatocellular Carcinoma

Comments: 10 pages, 3 figures

Abstract:Hepatocellular Carcinoma diagnosis relies heavily on the interpretation of gigapixel Whole Slide Images. However, current computational approaches are constrained by fixed-resolution processing mechanisms and inefficient feature aggregation, which inevitably lead to either severe information loss or high feature redundancy. To address these challenges, we propose Hepato-LLaVA, a specialized Multi-modal Large Language Model designed for fine-grained hepatocellular pathology analysis. We introduce a novel Sparse Topo-Pack Attention mechanism that explicitly models 2D tissue topology. This mechanism effectively aggregates local diagnostic evidence into semantic summary tokens while preserving global context. Furthermore, to overcome the lack of multi-scale data, we present HepatoPathoVQA, a clinically grounded dataset comprising 33K hierarchically structured question-answer pairs validated by expert pathologists. Our experiments demonstrate that Hepato-LLaVA achieves state-of-the-art performance on HCC diagnosis and captioning tasks, significantly outperforming existing methods. Our code and implementation details are available at this https URL.

100. 【2602.19423】Prefer-DAS: Learning from Local Preferences and Sparse Prompts for Domain Adaptive Segmentation of Electron Microscopy

Link: https://arxiv.org/abs/2602.19423

Authors: Jiabao Chen,Shan Xiong,Jialin Peng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale electron microscopy, delineating intracellular structures, incurring extensive annotated, extensive annotated data, Domain adaptive segmentation

Abstract:Domain adaptive segmentation (DAS) is a promising paradigm for delineating intracellular structures from various large-scale electron microscopy (EM) without incurring extensive annotated data in each domain. However, the prevalent unsupervised domain adaptation (UDA) strategies often demonstrate limited and biased performance, which hinders their practical applications. In this study, we explore sparse points and local human preferences as weak labels in the target domain, thereby presenting a more realistic yet annotation-efficient setting. Specifically, we develop Prefer-DAS, which pioneers sparse promptable learning and local preference alignment. The Prefer-DAS is a promptable multitask model that integrates self-training and prompt-guided contrastive learning. Unlike SAM-like methods, the Prefer-DAS allows for the use of full, partial, and even no point prompts during both training and inference stages and thus enables interactive segmentation. Instead of using image-level human preference alignment for segmentation, we introduce Local direct Preference Optimization (LPO) and sparse LPO (SLPO), plug-and-play solutions for alignment with spatially varying human feedback or sparse feedback. To address potential missing feedback, we also introduce Unsupervised Preference Optimization (UPO), which leverages self-learned preferences. As a result, the Prefer-DAS model can effectively perform both weakly-supervised and unsupervised DAS, depending on the availability of points and human preferences. Comprehensive experiments on four challenging DAS tasks demonstrate that our model outperforms SAM-like methods as well as unsupervised and weakly-supervised DAS methods in both automatic and interactive segmentation modes, highlighting strong generalizability and flexibility. Additionally, the performance of our model is very close to or even exceeds that of supervised models.

101. 【2602.19418】PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention

Link: https://arxiv.org/abs/2602.19418

Authors: Hefei Mei, Zirui Wang, Chang Xu, Jianyuan Guo, Minjing Dong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, modern multimodal applications, Large Vision-Language, Vision-Language Models, multimodal applications

Abstract:Large Vision-Language Models (LVLMs) are foundational to modern multimodal applications, yet their susceptibility to adversarial attacks remains a critical concern. Prior white-box attacks rarely generalize across tasks, and black-box methods depend on expensive transfer, which limits efficiency. The vision encoder, standardized and often shared across LVLMs, provides a stable gray-box pivot with strong cross-model transfer. Building on this premise, we introduce PA-Attack (Prototype-Anchored Attentive Attack). PA-Attack begins with a prototype-anchored guidance that provides a stable attack direction towards a general and dissimilar prototype, tackling the attribute-restricted issue and limited task generalization of vanilla attacks. Building on this, we propose a two-stage attention enhancement mechanism: (i) leverage token-level attention scores to concentrate perturbations on critical visual tokens, and (ii) adaptively recalibrate attention weights to track the evolving attention during the adversarial process. Extensive experiments across diverse downstream tasks and LVLM architectures show that PA-Attack achieves an average 75.1% score reduction rate (SRR), demonstrating strong attack effectiveness, efficiency, and task generalization in LVLMs. Code is available at this https URL.

102. 【2602.19412】Redefining the Down-Sampling Scheme of U-Net for Precision Biomedical Image Segmentation

Link: https://arxiv.org/abs/2602.19412

Authors: Mingjie Li, Yizheng Chen, Md Tauhidul Islam, Lei Xing

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: advancing biomedical image, biomedical image segmentation, proposed Stair Pooling, Stair Pooling, instrumental in advancing

Comments: AAPM 67th

Abstract:U-Net architectures have been instrumental in advancing biomedical image segmentation (BIS) but often struggle with capturing long-range information. One reason is the conventional down-sampling techniques that prioritize computational efficiency at the expense of information retention. This paper introduces a simple but effective strategy, we call it Stair Pooling, which moderates the pace of down-sampling and reduces information loss by leveraging a sequence of concatenated small and narrow pooling operations in varied orientations. Specifically, our method modifies the reduction in dimensionality within each 2D pooling step from $\frac{1}{4}$ to $\frac{1}{2}$. This approach can also be adapted for 3D pooling to preserve even more information. Such preservation aids the U-Net in more effectively reconstructing spatial details during the up-sampling phase, thereby enhancing its ability to capture long-range information and improving segmentation accuracy. Extensive experiments on three BIS benchmarks demonstrate that the proposed Stair Pooling can increase both 2D and 3D U-Net performance by an average of 3.8\% in Dice scores. Moreover, we leverage the transfer entropy to select the optimal down-sampling paths and quantitatively show how the proposed Stair Pooling reduces the information loss.
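The 1/4 versus 1/2 reduction described in this abstract can be illustrated with a toy example. The sketch below is only an assumption about what "narrow pooling operations in varied orientations" might look like: it uses 1x2 and 2x1 max-pooling windows on a nested list, and the function names are hypothetical. It does not reproduce the paper's layers.

```python
def pool_width(grid):
    """Max-pool with a 1x2 window: halves the number of columns."""
    return [[max(row[j], row[j + 1]) for j in range(0, len(row) - 1, 2)]
            for row in grid]


def pool_height(grid):
    """Max-pool with a 2x1 window: halves the number of rows."""
    return [[max(a, b) for a, b in zip(grid[i], grid[i + 1])]
            for i in range(0, len(grid) - 1, 2)]


def size(grid):
    """Number of elements in a rectangular grid."""
    return len(grid) * len(grid[0])


# Toy 4x4 feature map with values 0..15 (illustrative data only).
image = [[r * 4 + c for c in range(4)] for r in range(4)]

step1 = pool_width(image)    # 4x2: half of the elements remain
step2 = pool_height(step1)   # 2x2: half again

print(size(image), size(step1), size(step2))
```

Each narrow step keeps half of the elements, whereas a single 2x2 pooling step would go from 16 elements to 4 at once. In a network, layers placed between the two narrow steps would see the intermediate 4x2 map.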

103. 【2602.19385】Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition

Link: https://arxiv.org/abs/2602.19385

Authors: Minxue Tang, Yangyang Yu, Aolin Ding, Maziyar Baran Pouyan, Taha Belkhouja, Yujia Bao

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Recognizing implicit visual, visual and textual, real-world applications, applications of modern, Recognizing implicit

Abstract:Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI. However, tackling long-tail pattern recognition tasks remains challenging for current pre-trained foundation models such as LLMs and VLMs. While finetuning pre-trained models can improve accuracy in recognizing implicit patterns, it is usually infeasible due to a lack of training data and high computational overhead. In this paper, we propose ADAMAB, an efficient embedding calibration framework for few-shot pattern recognition. To maximally reduce the computational costs, ADAMAB trains embedder-agnostic light-weight calibrators on top of fixed embedding models without accessing their parameters. To mitigate the need for large-scale training data, we introduce an adaptive data augmentation strategy based on the Multi-Armed Bandit (MAB) mechanism. With a modified upper confidence bound algorithm, ADAMAB diminishes the gradient shifting and offers theoretically guaranteed convergence in few-shot training. Our multi-modal experiments justify the superior performance of ADAMAB, with up to 40% accuracy improvement when training with less than 5 initial data samples of each class.
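For readers unfamiliar with upper confidence bound selection, here is a minimal sketch of the standard UCB1 rule. It is not the modified algorithm the abstract refers to. Treating each arm as one augmentation strategy, and the reward as some validation gain, are assumptions made only for illustration.

```python
import math


def ucb1_select(counts, totals, c=2.0):
    """Pick an arm with the standard UCB1 rule.

    counts[a] is how many times arm a was played and totals[a] is the
    sum of rewards it returned. Both lists are hypothetical bookkeeping.
    """
    # Play every arm once before comparing confidence bounds.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    t = sum(counts)
    scores = [
        totals[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


# Arm 1 has a lower mean reward but was tried only once, so its
# exploration bonus dominates and it is selected next.
print(ucb1_select([100, 1], [60.0, 0.5]))
```

The bonus term shrinks as an arm is played more often, so rarely tried arms keep getting revisited until their estimates are reliable.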

104. 【2602.19380】Detector-in-the-Loop Tracking: Active Memory Rectification for Stable Glottic Opening Localization

Link: https://arxiv.org/abs/2602.19380

Authors: Huayu Wang, Bahaa Alattar, Cheng-Yen Yang, Hsiang-Wei Huang, Jung Heon Kim, Linda Shapiro, Nathan White, Jenq-Neng Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: lacks temporal context, glottic opening localization, opening localization remains, localization remains challenging, remains challenging due

Comments: Accepted to Medical Imaging with Deep Learning (MIDL) 2026

Abstract:Temporal stability in glottic opening localization remains challenging due to the complementary weaknesses of single-frame detectors and foundation-model trackers: the former lacks temporal context, while the latter suffers from memory drift. Specifically, in video laryngoscopy, rapid tissue deformation, occlusions, and visual ambiguities in emergency settings require a robust, temporally aware solution that can prevent progressive tracking errors. We propose Closed-Loop Memory Correction (CL-MC), a detector-in-the-loop framework that supervises Segment Anything Model 2 (SAM2) through confidence-aligned state decisions and active memory rectification. High-confidence detections trigger semantic resets that overwrite corrupted tracker memory, effectively mitigating drift accumulation with a training-free foundation tracker in complex endoscopic scenes. On emergency intubation videos, CL-MC achieves state-of-the-art performance, significantly reducing drift and missing rate compared with the SAM2 variants and open-loop methods. Our results establish memory correction as a crucial component for reliable clinical video tracking. Our code will be available at this https URL.

105. 【2602.19372】Seeing Farther and Smarter: Value-Guided Multi-Path Reflection for VLM Policy Optimization

Link: https://arxiv.org/abs/2602.19372

Authors: Yanting Yang, Shenyuan Gao, Qingwen Bu, Li Chen, Dimitris N. Metaxas

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Solving complex, precise high-level planning, physical interactions, requires a deep, deep understanding

Comments: ICRA 2026

Abstract:Solving complex, long-horizon robotic manipulation tasks requires a deep understanding of physical interactions, reasoning about their long-term consequences, and precise high-level planning. Vision-Language Models (VLMs) offer a general perceive-reason-act framework for this goal. However, previous approaches using reflective planning to guide VLMs in correcting actions encounter significant limitations. These methods rely on inefficient and often inaccurate implicit learning of state-values from noisy foresight predictions, evaluate only a single greedy future, and suffer from substantial inference latency. To address these limitations, we propose a novel test-time computation framework that decouples state evaluation from action generation. This provides a more direct and fine-grained supervisory signal for robust decision-making. Our method explicitly models the advantage of an action plan, quantified by its reduction in distance to the goal, and uses a scalable critic to estimate. To address the stochastic nature of single-trajectory evaluation, we employ beam search to explore multiple future paths and aggregate them during decoding to model their expected long-term returns, leading to more robust action generation. Additionally, we introduce a lightweight, confidence-based trigger that allows for early exit when direct predictions are reliable, invoking reflection only when necessary. Extensive experiments on diverse, unseen multi-stage robotic manipulation tasks demonstrate a 24.6% improvement in success rate over state-of-the-art baselines, while significantly reducing inference time by 56.5%.

106. 【2602.19367】Time Series, Vision, and Language: Exploring the Limits of Alignment in Contrastive Representation Spaces

Link: https://arxiv.org/abs/2602.19367

Authors: Pratham Yashwante, Rose Yu

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Platonic Representation Hypothesis, shared latent structure, Representation Hypothesis posits, time series, Platonic Representation

Comments: 24 Figures, 12 Tables

Abstract:The Platonic Representation Hypothesis posits that learned representations from models trained on different modalities converge to a shared latent structure of the world. However, this hypothesis has largely been examined in vision and language, and it remains unclear whether time series participate in such convergence. We first examine this in a trimodal setting and find that independently pretrained time series, vision, and language encoders exhibit near-orthogonal geometry in the absence of explicit coupling. We then apply post-hoc alignment by training projection heads over frozen encoders using contrastive learning, and analyze the resulting representations with respect to geometry, scaling behavior, and dependence on information density and input modality characteristics. Our investigation reveals that overall alignment in contrastive representation spaces improves with model size, but this alignment is asymmetric: time series align more strongly with visual representations than with text, and images can act as effective intermediaries between time series and language. We further see that richer textual descriptions improve alignment only up to a threshold; training on denser captions does not lead to further improvement. Analogous effects are observed for visual representations. Our findings shed light on considerations for building multimodal systems involving non-conventional data modalities beyond vision and language.

107. 【2602.19358】Referring Layer Decomposition

Link: https://arxiv.org/abs/2602.19358

Authors: Fangyi Chen, Yaojie Shen, Lu Xu, Ye Yuan, Shu Zhang, Yulei Niu, Longyin Wen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: object-aware control, advanced image editing, essential for advanced, Referring Layer Decomposition, editing visual content

Comments: ICLR 2026

Abstract:Precise, object-aware control over visual content is essential for advanced image editing and compositional generation. Yet, most existing approaches operate on entire images holistically, limiting the ability to isolate and manipulate individual scene elements. In contrast, layered representations, where scenes are explicitly separated into objects, environmental context, and visual effects, provide a more intuitive and structured framework for interpreting and editing visual content. To bridge this gap and enable both compositional understanding and controllable editing, we introduce the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image, conditioned on flexible user prompts, such as spatial inputs (e.g., points, boxes, masks), natural language descriptions, or combinations thereof. At the core is the RefLade, a large-scale dataset comprising 1.11M image-layer-prompt triplets produced by our scalable data engine, along with 100K manually curated, high-fidelity layers. Coupled with a perceptually grounded, human-preference-aligned automatic evaluation protocol, RefLade establishes RLD as a well-defined and benchmarkable research task. Building on this foundation, we present RefLayer, a simple baseline designed for prompt-conditioned layer decomposition, achieving high visual fidelity and semantic alignment. Extensive experiments show our approach enables effective training, reliable evaluation, and high-quality image decomposition, while exhibiting strong zero-shot generalization capabilities.

108. 【2602.19357】MentalBlackboard: Evaluating Spatial Visualization via Mathematical Transformations

Link: https://arxiv.org/abs/2602.19357

Authors: Nilay Yilmaz, Maitreya Patel, Naga Sai Abhiram Kusumba, Yixuan He, Yezhou Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: characteristics of objects, Spatial visualization, Hole Punching tests, spatial characteristics, Spatial

Abstract:Spatial visualization is the mental ability to imagine, transform, and manipulate the spatial characteristics of objects and actions. This intelligence is a part of human cognition where actions and perception are connected on a mental level. To explore whether state-of-the-art Vision-Language Models (VLMs) exhibit this ability, we develop MentalBlackboard, an open-ended spatial visualization benchmark for Paper Folding and Hole Punching tests within two core tasks: prediction and planning. Our prediction experiments reveal that models struggle with applying symmetrical transformations, even when they predict the sequence of unfolding steps correctly. Also, rotations introduce a significant challenge to the physical situational awareness for models. The planning task reveals limitations of models in analyzing symmetrical relationships and in implementing the multi-stage symmetry process, with Claude Opus 4.1 achieving the highest planning score at an accuracy of 10\%. The top-performing model, o3, attains a peak performance of 71.6\% on the generalization task, which does not require spatial visualization but transfers spatial data; however, it achieves only 25\% accuracy on text-based prediction tasks.

109. 【2602.19350】PoseCraft: Tokenized 3D Body Landmark and Camera Conditioning for Photorealistic Human Image Synthesis

Link: https://arxiv.org/abs/2602.19350

Authors: Zhilin Guo, Jing Yang, Kyle Fogarty, Jingyi Wan, Boqiao Zhang, Tianhao Wu, Weihao Xia, Chenliang Zhou, Sakar Khattar, Fangcheng Zhong, Cristina Nader Vasconcelos, Cengiz Oztireli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: synthesizing photorealistic avatars, Digitizing humans, avatars with explicit, humans and synthesizing, synthesizing photorealistic

Abstract:Digitizing humans and synthesizing photorealistic avatars with explicit 3D pose and camera controls are central to VR, telepresence, and entertainment. Existing skinning-based workflows require laborious manual rigging or template-based fittings, while neural volumetric methods rely on canonical templates and re-optimization for each unseen pose. We present PoseCraft, a diffusion framework built around tokenized 3D interface: instead of relying only on rasterized geometry as 2D control images, we encode sparse 3D landmarks and camera extrinsics as discrete conditioning tokens and inject them into diffusion via cross-attention. Our approach preserves 3D semantics by avoiding 2D re-projection ambiguity under large pose and viewpoint changes, and produces photorealistic imagery that faithfully captures identity and appearance. To train and evaluate at scale, we also implement GenHumanRF, a data generation workflow that renders diverse supervision from volumetric reconstructions. Our experiments show that PoseCraft achieves significant perceptual quality improvement over diffusion-centric methods, and attains better or comparable metrics to latest volumetric rendering SOTA while better preserving fabric and hair details.

110. 【2602.19349】UP-Fuse: Uncertainty-guided LiDAR-Camera Fusion for 3D Panoptic Segmentation

Link: https://arxiv.org/abs/2602.19349

Authors: Rohit Mohan, Florian Drews, Yakov Miron, Daniele Cattaneo, Abhinav Valada

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: critical failure mode, LiDAR-camera fusion enhances, leveraging camera images, sparse LiDAR scans, complement sparse LiDAR

Abstract:LiDAR-camera fusion enhances 3D panoptic segmentation by leveraging camera images to complement sparse LiDAR scans, but it also introduces a critical failure mode. Under adverse conditions, degradation or failure of the camera sensor can significantly compromise the reliability of the perception system. To address this problem, we introduce UP-Fuse, a novel uncertainty-aware fusion framework in the 2D range-view that remains robust under camera sensor degradation, calibration drift, and sensor failure. Raw LiDAR data is first projected into the range-view and encoded by a LiDAR encoder, while camera features are simultaneously extracted and projected into the same shared space. At its core, UP-Fuse employs an uncertainty-guided fusion module that dynamically modulates cross-modal interaction using predicted uncertainty maps. These maps are learned by quantifying representational divergence under diverse visual degradations, ensuring that only reliable visual cues influence the fused representation. The fused range-view features are decoded by a novel hybrid 2D-3D transformer that mitigates spatial ambiguities inherent to the 2D projection and directly predicts 3D panoptic segmentation masks. Extensive experiments on Panoptic nuScenes, SemanticKITTI, and our introduced Panoptic Waymo benchmark demonstrate the efficacy and robustness of UP-Fuse, which maintains strong performance even under severe visual corruption or misalignment, making it well suited for robotic perception in safety-critical settings.

111. 【2602.19348】MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose

Link: https://arxiv.org/abs/2602.19348

Authors: Sirine Bhouri, Lan Wei, Jian-Qing Zheng, Dandan Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Acquiring aligned visuo-tactile, requiring specialised hardware, Acquiring aligned, large-scale data collection, aligned visuo-tactile datasets

Comments: Accepted by 2026 ICRA

Abstract:Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluating on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the required real data while maintaining competitive performance. MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.

112. 【2602.19324】RetinaVision: XAI-Driven Augmented Regulation for Precise Retinal Disease Classification using deep learning framework

Link: https://arxiv.org/abs/2602.19324

Authors: Mohammad Tahmid Noor, Shayan Abrar, Jannatul Adan Mahi, Md Parvez Mia, Asaduzzaman Hridoy, Samanta Ghosh

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: counter vision loss, retinal disease classification, OCT retinal disease, guiding clinical management, retinal diseases

Comments: 6 pages, 15 figures

Abstract:Early and accurate classification of retinal diseases is critical to counter vision loss and for guiding clinical management of retinal diseases. In this study, we proposed a deep learning method for retinal disease classification utilizing optical coherence tomography (OCT) images from the Retinal OCT Image Classification - C8 dataset (comprising 24,000 labeled images spanning eight conditions). Images were resized to 224x224 px and tested on convolutional neural network (CNN) architectures: Xception and InceptionV3. Data augmentation techniques (CutMix, MixUp) were employed to enhance model generalization. Additionally, we applied GradCAM and LIME for interpretability evaluation. We implemented this in a real-world scenario via our web application named RetinaVision. This study found that Xception was the most accurate network (95.25%), followed closely by InceptionV3 (94.82%). These results suggest that deep learning methods allow effective OCT retinal disease classification and highlight the importance of implementing accuracy and interpretability for clinical applications.

113. 【2602.19323】DefenseSplat: Enhancing the Robustness of 3D Gaussian Splatting via Frequency-Aware Filtering

Link: https://arxiv.org/abs/2602.19323

Authors: Yiran Qiao, Yiren Lu, Yunlai Zhou, Rui Yang, Linlin Hou, Yu Yin, Jing Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-time and high-fidelity, Gaussian Splatting, powerful paradigm, paradigm for real-time, posed images

Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful paradigm for real-time and high-fidelity 3D reconstruction from posed images. However, recent studies reveal its vulnerability to adversarial corruptions in input views, where imperceptible yet consistent perturbations can drastically degrade rendering quality, increase training and rendering time, and inflate memory usage, even leading to server denial-of-service. In our work, to mitigate this issue, we begin by analyzing the distinct behaviors of adversarial perturbations in the low- and high-frequency components of input images using wavelet transforms. Based on this observation, we design a simple yet effective frequency-aware defense strategy that reconstructs training views by filtering high-frequency noise while preserving low-frequency content. This approach effectively suppresses adversarial artifacts while maintaining the authenticity of the original scene. Notably, it does not significantly impair training on clean data, achieving a desirable trade-off between robustness and performance on clean inputs. Through extensive experiments under a wide range of attack intensities on multiple benchmarks, we demonstrate that our method substantially enhances the robustness of 3DGS without access to clean ground-truth supervision. By highlighting and addressing the overlooked vulnerabilities of 3D Gaussian Splatting, our work paves the way for more robust and secure 3D reconstructions.

114. 【2602.19322】US-JEPA: A Joint Embedding Predictive Architecture for Medical Ultrasound

Link: https://arxiv.org/abs/2602.19322

Authors: Ashwath Radhachandran, Vedrana Ivezić, Shreeram Athreya, Ronit Anilkumar, Corey W. Arnold, William Speier

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: imaging poses unique, noisy acquisition process, poses unique challenges, inherently noisy acquisition, imaging poses

Abstract:Ultrasound (US) imaging poses unique challenges for representation learning due to its inherently noisy acquisition process. The low signal-to-noise ratio and stochastic speckle patterns hinder standard self-supervised learning methods relying on a pixel-level reconstruction objective. Joint-Embedding Predictive Architectures (JEPAs) address this drawback by predicting masked latent representations rather than raw pixels. However, standard approaches depend on hyperparameter-brittle and computationally expensive online teachers updated via exponential moving average. We propose US-JEPA, a self-supervised framework that adopts the Static-teacher Asymmetric Latent Training (SALT) objective. By using a frozen, domain-specific teacher to provide stable latent targets, US-JEPA decouples student-teacher optimization and pushes the student to expand upon the semantic priors of the teacher. In addition, we provide the first rigorous comparison of all publicly available state-of-the-art ultrasound foundation models on UltraBench, a public dataset benchmark spanning multiple organs and pathological conditions. Under linear probing for diverse classification tasks, US-JEPA achieves performance competitive with or superior to domain-specific and universal vision foundation model baselines. Our results demonstrate that masked latent prediction provides a stable and efficient path toward robust ultrasound representations.

115. 【2602.19316】Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition

Link: https://arxiv.org/abs/2602.19316

Authors: Alexandros Haliassos, Rodrigo Mira, Stavros Petridis

Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Keywords: Unified Speech Recognition, audiovisual speech recognition, Speech Recognition, Unified Speech, audiovisual speech

Comments: ICLR 2026. Code: [this https URL](https://github.com/ahaliassos/usr2)

Abstract:Unified Speech Recognition (USR) has emerged as a semi-supervised framework for training a single model for audio, visual, and audiovisual speech recognition, achieving state-of-the-art results on in-distribution benchmarks. However, its reliance on autoregressive pseudo-labelling makes training expensive, while its decoupled supervision of CTC and attention branches increases susceptibility to self-reinforcing errors, particularly under distribution shifts involving longer sequences, noise, or unseen domains. We propose CTC-driven teacher forcing, where greedily decoded CTC pseudo-labels are fed into the decoder to generate attention targets in a single forward pass. Although these can be globally incoherent, in the pseudo-labelling setting they enable efficient and effective knowledge transfer. Because CTC and CTC-driven attention pseudo-labels have the same length, the decoder can predict both simultaneously, benefiting from the robustness of CTC and the expressiveness of attention without costly beam search. We further propose mixed sampling to mitigate the exposure bias of the decoder relying solely on CTC inputs. The resulting method, USR 2.0, halves training time, improves robustness to out-of-distribution inputs, and achieves state-of-the-art results on LRS3, LRS2, and WildVSR, surpassing USR and modality-specific self-supervised baselines.

116. 【2602.19314】IPv2: An Improved Image Purification Strategy for Real-World Ultra-Low-Dose Lung CT Denoising

Link: https://arxiv.org/abs/2602.19314

Authors: Guoliang Gong, Man Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: aligned anatomical structures, structural preservation ability, anatomical structures, purification strategy constructs, image purification strategy

Abstract:The image purification strategy constructs an intermediate distribution with aligned anatomical structures, which effectively corrects the spatial misalignment between real-world ultra-low-dose CT and normal-dose CT images and significantly enhances the structural preservation ability of denoising models. However, this strategy exhibits two inherent limitations. First, it suppresses noise only in the chest wall and bone regions while leaving the image background untreated. Second, it lacks a dedicated mechanism for denoising the lung parenchyma. To address these issues, we systematically redesign the original image purification strategy and propose an improved version termed IPv2. The proposed strategy introduces three core modules, namely Remove Background, Add noise, and Remove noise. These modules endow the model with denoising capability in both background and lung tissue regions during training data construction and provide a more reasonable evaluation protocol through refined label construction at the testing stage. Extensive experiments on our previously established real-world patient lung CT dataset acquired at 2% radiation dose demonstrate that IPv2 consistently improves background suppression and lung parenchyma restoration across multiple mainstream denoising models. The code is publicly available at this https URL.

117. 【2602.19308】WildOS: Open-Vocabulary Object Search in the Wild

Link: https://arxiv.org/abs/2602.19308

Authors: Hardik Shah, Erica Tevere, Deegan Atha, Marcel Kaufmann, Shehryar Khattak, Manthan Patel, Marco Hutter, Jonas Frey, Patrick Spieler

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: unstructured outdoor environments, outdoor environments requires, limited depth sensing, environments requires robots, unstructured outdoor

Comments: 28 pages, 16 figures, 2 tables

Abstract:Autonomous navigation in complex, unstructured outdoor environments requires robots to operate over long ranges without prior maps and limited depth sensing. In such settings, relying solely on geometric frontiers for exploration is often insufficient. In such settings, the ability to reason semantically about where to go and what is safe to traverse is crucial for robust, efficient exploration. This work presents WildOS, a unified system for long-range, open-vocabulary object search that combines safe geometric exploration with semantic visual reasoning. WildOS builds a sparse navigation graph to maintain spatial memory, while utilizing a foundation-model-based vision module, ExploRFM, to score frontier nodes of the graph. ExploRFM simultaneously predicts traversability, visual frontiers, and object similarity in image space, enabling real-time, onboard semantic navigation tasks. The resulting vision-scored graph enables the robot to explore semantically meaningful directions while ensuring geometric safety. Furthermore, we introduce a particle-filter-based method for coarse localization of the open-vocabulary target query, that estimates candidate goal positions beyond the robot's immediate depth horizon, enabling effective planning toward distant goals. Extensive closed-loop field experiments across diverse off-road and urban terrains demonstrate that WildOS enables robust navigation, significantly outperforming purely geometric and purely vision-based baselines in both efficiency and autonomy. Our results highlight the potential of vision foundation models to drive open-world robotic behaviors that are both semantically informed and geometrically grounded. Project Page: this https URL

118. 【2602.19285】MRI Contrast Enhancement Kinetics World Model

Link: https://arxiv.org/abs/2602.19285

Authors: Jindi Kong, Yuting He, Cong Xia, Rongjun Ge, Shuo Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Clinical MRI contrast, inefficient information yield, costly acquisition protocol, contrast acquisition suffers, MRI contrast acquisition

Comments: Accepted by CVPR 2026

Abstract:Clinical MRI contrast acquisition suffers from inefficient information yield, which presents as a mismatch between the risky and costly acquisition protocol and the fixed and sparse acquisition sequence. Applying world models to simulate the contrast enhancement kinetics in the human body enables continuous contrast-free dynamics. However, the low temporal resolution in MRI acquisition restricts the training of world models, leading to a sparsely sampled dataset. Directly training a generative model to capture the kinetics leads to two limitations: (a) Due to the absence of data on missing time, the model tends to overfit to irrelevant features, leading to content distortion. (b) Due to the lack of continuous temporal supervision, the model fails to learn the continuous kinetics law over time, causing temporal discontinuities. For the first time, we propose MRI Contrast Enhancement Kinetics World model (MRI CEKWorld) with SpatioTemporal Consistency Learning (STCL). For (a), guided by the spatial law that patient-level structures remain consistent during enhancement, we propose Latent Alignment Learning (LAL) that constructs a patient-specific template to constrain contents to align with this template. For (b), guided by the temporal law that the kinetics follow a consistent smooth trend, we propose Latent Difference Learning (LDL) which extends the unobserved intervals by interpolation and constrains smooth variations in the latent space among interpolated sequences. Extensive experiments on two datasets show our MRI CEKWorld achieves better realistic contents and kinetics. Codes will be available at this https URL.

119. 【2602.19278】A Two-Stage Detection-Tracking Framework for Stable Apple Quality Inspection in Dense Conveyor-Belt Environments

Link: https://arxiv.org/abs/2602.19278

Authors: Keonvin Park, Aditya Pal, Jin Hong Mok

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dense multi-object interactions, existing works evaluate, works evaluate detection, continuous motion, video streams

Abstract:Industrial fruit inspection systems must operate reliably under dense multi-object interactions and continuous motion, yet most existing works evaluate detection or classification at the image level without ensuring temporal stability in video streams. We present a two-stage detection-tracking framework for stable multi-apple quality inspection in conveyor-belt environments. An orchard-trained YOLOv8 model performs apple localization, followed by ByteTrack multi-object tracking to maintain persistent identities. A ResNet18 defect classifier, fine-tuned on a healthy-defective fruit dataset, is applied to cropped apple regions. Track-level aggregation is introduced to enforce temporal consistency and reduce prediction oscillation across frames. We define video-level industrial metrics such as track-level defect ratio and temporal consistency to evaluate system robustness under realistic processing conditions. Results demonstrate improved stability compared to frame-wise inference, suggesting that integrating tracking is essential for practical automated fruit grading systems.
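The track-level aggregation step described in this abstract can be illustrated with a minimal sketch. The abstract does not state the aggregation rule, so the majority vote below is an assumption, and the function names and the `"defective"` label are hypothetical. The input is assumed to be per-frame `(track_id, label)` pairs produced by a tracker and a crop classifier.

```python
from collections import Counter, defaultdict


def aggregate_track_labels(frame_predictions):
    """Assign one label per track by majority vote over its per-frame labels.

    frame_predictions: iterable of (track_id, label) pairs, one per detection.
    Majority voting is an assumed rule; the paper may aggregate differently.
    """
    votes = defaultdict(Counter)
    for track_id, label in frame_predictions:
        votes[track_id][label] += 1
    # most_common(1) returns the label with the highest frame count.
    return {tid: counts.most_common(1)[0][0] for tid, counts in votes.items()}


def track_level_defect_ratio(track_labels, defect_label="defective"):
    """Fraction of tracks whose aggregated label equals defect_label."""
    if not track_labels:
        return 0.0
    defective = sum(1 for label in track_labels.values() if label == defect_label)
    return defective / len(track_labels)


if __name__ == "__main__":
    predictions = [
        (1, "healthy"), (1, "defective"), (1, "healthy"),  # one noisy frame
        (2, "defective"), (2, "defective"),
    ]
    labels = aggregate_track_labels(predictions)
    print(labels, track_level_defect_ratio(labels))
```

Voting per track rather than per frame means a single misclassified frame, as in track 1 above, does not flip the final grade of that apple.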

120. 【2602.19274】DD-CAM: Minimal Sufficient Explanations for Vision Models Using Delta Debugging

Link: https://arxiv.org/abs/2602.19274

Authors: Krishna Khadka, Yu Lei, Raghu N. Kacker, D. Richard Kuhn

Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)

Keywords: activation preserves predictions, joint activation preserves, joint activation, joint activation suffices, introduce a gradient-free

Abstract:We introduce a gradient-free framework for identifying minimal, sufficient, and decision-preserving explanations in vision models by isolating the smallest subset of representational units whose joint activation preserves predictions. Unlike existing approaches that aggregate all units, often leading to cluttered saliency maps, our approach, DD-CAM, identifies a 1-minimal subset whose joint activation suffices to preserve the prediction (i.e., removing any unit from the subset alters the prediction). To efficiently isolate minimal sufficient subsets, we adapt delta debugging, a systematic reduction strategy from software debugging, and configure its search strategy based on unit interactions in the classifier head: testing individual units for models with non-interacting units and testing unit combinations for models in which unit interactions exist. We then generate minimal, prediction-preserving saliency maps that highlight only the most essential features. Our experimental evaluation demonstrates that our approach can produce more faithful explanations and achieve higher localization accuracy than the state-of-the-art CAM-based approaches.
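To make the notion of a 1-minimal sufficient subset concrete, here is a minimal sketch of the generic delta debugging reduction (ddmin) applied to a set of units. It is not the authors' implementation: the `preserves` callback is a hypothetical stand-in for "the model still yields the same prediction when only these units are active", and the sketch assumes the empty set never preserves the prediction.

```python
def split_into_parts(items, n):
    """Split items into n contiguous, non-empty, nearly equal parts (n <= len(items))."""
    size, extra = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def minimal_sufficient_subset(units, preserves):
    """Reduce units to a 1-minimal subset for which preserves(subset) is True."""
    current = list(units)
    if not preserves(current):
        raise ValueError("the full unit set must preserve the prediction")
    n = 2  # current granularity: number of parts to split into
    while len(current) >= 2:
        n = min(n, len(current))
        parts = split_into_parts(current, n)
        reduced = False
        # Step 1: try to reduce to a single part.
        for part in parts:
            if preserves(part):
                current, n, reduced = part, 2, True
                break
        # Step 2: otherwise try to drop a single part (reduce to a complement).
        if not reduced:
            for i in range(len(parts)):
                complement = [u for j, p in enumerate(parts) if j != i for u in p]
                if preserves(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break
        # Step 3: otherwise refine the granularity, or stop at single units.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), 2 * n)
    return current


if __name__ == "__main__":
    # Toy predicate: the "prediction" survives only if units 3 and 7 are both active.
    needed = {3, 7}
    print(minimal_sufficient_subset(range(10), lambda s: needed <= set(s)))
```

When the loop stops at granularity one, every single-unit removal has been tested and rejected, which is the 1-minimality property stated in the abstract.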

121. 【2602.19268】CORVET: A CORDIC-Powered, Resource-Frugal Mixed-Precision Vector Processing Engine for High-Throughput AIoT applications

Link: https://arxiv.org/abs/2602.19268

Authors: Sonu Kumar, Mohd Faisal Khan, Mukul Lokhande, Santosh Kumar Vishvakarma

Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)

Keywords: iterative CORDIC-based MAC, performance-enhanced vector engine, CORDIC-based MAC unit, vector engine featuring, presents a runtime-adaptive

Abstract: This brief presents a runtime-adaptive, performance-enhanced vector engine featuring a low-resource, iterative CORDIC-based MAC unit for edge AI acceleration. The proposed design enables dynamic reconfiguration between approximate and accurate modes, exploiting the latency-accuracy trade-off for a wide range of workloads. Its resource-efficient approach further enables up to 4x throughput improvement within the same hardware resources by leveraging vectorised, time-multiplexed execution and flexible precision scaling. With a time-multiplexed multi-AF block and a lightweight pooling and normalisation unit, the proposed vector engine supports flexible precision (4/8/16-bit) and high MAC density. The ASIC implementation results show that each MAC stage can save up to 33% of time and 21% of power, with a 256-PE configuration that achieves higher compute density (4.83 TOPS/mm²) and energy efficiency (11.67 TOPS/W) than previous state-of-the-art work. A detailed hardware-software co-design methodology for object detection and classification tasks on Pynq-Z2 is discussed to assess the proposed architecture, demonstrating a scalable, energy-efficient solution for edge AI applications.
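The latency-accuracy trade-off of an iterative CORDIC MAC can be illustrated with the textbook linear-mode CORDIC recurrence, which multiplies using only shifts and adds. This is a generic floating-point sketch, not the datapath of the paper; the function names are hypothetical, and the multiplier `z` is assumed to satisfy |z| <= 2 so that the recurrence converges.

```python
def cordic_multiply(x, z, iterations):
    """Approximate x * z with linear-mode CORDIC (shift-and-add only in hardware).

    Each iteration adds or subtracts x * 2**-i and drives the residual z toward 0.
    After n iterations the residual is at most 2**-(n - 1), so the error is
    bounded by |x| * 2**-(n - 1). Assumes |z| <= 2.
    """
    y = 0.0
    for i in range(iterations):
        step = 2.0 ** -i                  # a right shift by i bits in hardware
        direction = 1.0 if z >= 0 else -1.0
        y += direction * x * step
        z -= direction * step
    return y


def cordic_mac(accumulator, x, z, iterations):
    """Multiply-accumulate built on the iterative multiplier above."""
    return accumulator + cordic_multiply(x, z, iterations)


if __name__ == "__main__":
    exact = 0.75 * 0.6
    for n in (4, 8, 16):  # fewer iterations: lower latency, larger error
        approx = cordic_multiply(0.75, 0.6, n)
        print(n, approx, abs(approx - exact))
```

Running fewer iterations corresponds to an approximate mode and more iterations to an accurate mode, which is one plausible reading of the runtime reconfiguration mentioned in the abstract.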

122. 【2602.19254】RegionRoute: Regional Style Transfer with Diffusion Model

Link: https://arxiv.org/abs/2602.19254

Authors: Bowen Chen, Jake Zuena, Alan C. Bovik, Divya Kothandaraman

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Precise spatial control, transfer remains challenging, Precise spatial, style transfer remains, remains challenging

Abstract:Precise spatial control in diffusion-based style transfer remains challenging. This challenge arises because diffusion models treat style as a global feature and lack explicit spatial grounding of style representations, making it difficult to restrict style application to specific objects or regions. To our knowledge, existing diffusion models are unable to perform true localized style transfer, typically relying on handcrafted masks or multi-stage post-processing that introduce boundary artifacts and limit generalization. To address this, we propose an attention-supervised diffusion framework that explicitly teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training. Two complementary objectives, a Focus loss based on KL divergence and a Cover loss using binary cross-entropy, jointly encourage accurate localization and dense coverage. A modular LoRA-MoE design further enables efficient and scalable multi-style adaptation. To evaluate localized stylization, we introduce the Regional Style Editing Score, which measures Regional Style Matching through CLIP-based similarity within the target region and Identity Preservation via masked LPIPS and pixel-level consistency on unedited areas. Experiments show that our method achieves mask-free, single-object style transfer at inference, producing regionally accurate and visually coherent results that outperform existing diffusion-based editing approaches.
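The two attention-supervision objectives named in this abstract can be sketched as follows. The abstract only says that the Focus loss is based on KL divergence and the Cover loss on binary cross-entropy; the exact formulations below, the normalisation choices, and the function names are assumptions for illustration. Attention scores and the object mask are assumed to be flattened lists over spatial positions, with scores in [0, 1] and a binary mask.

```python
import math


def focus_loss(attention, mask, eps=1e-8):
    """KL(mask distribution || attention distribution) over spatial positions.

    Both inputs are normalised to sum to 1, so the loss is small when the
    attention mass is concentrated inside the mask (assumed formulation).
    """
    attn_sum = sum(attention) + eps
    mask_sum = sum(mask) + eps
    loss = 0.0
    for a, m in zip(attention, mask):
        p = m / mask_sum
        q = a / attn_sum
        if p > 0:
            loss += p * math.log(p / (q + eps))
    return loss


def cover_loss(attention, mask, eps=1e-8):
    """Mean per-position binary cross-entropy between attention and mask.

    Penalises masked positions with low attention, encouraging dense coverage
    (assumed formulation).
    """
    total = 0.0
    for a, m in zip(attention, mask):
        a = min(max(a, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(m * math.log(a) + (1 - m) * math.log(1.0 - a))
    return total / len(attention)


if __name__ == "__main__":
    mask = [1, 1, 0, 0]
    inside = [0.9, 0.9, 0.05, 0.05]   # attention concentrated on the object
    outside = [0.05, 0.05, 0.9, 0.9]  # attention concentrated off the object
    print(focus_loss(inside, mask), focus_loss(outside, mask))
    print(cover_loss(inside, mask), cover_loss(outside, mask))
```

Under these assumptions the KL term rewards putting attention mass in the right place, while the cross-entropy term rewards covering every masked position, which matches the "accurate localization and dense coverage" pairing in the abstract.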

123. 【2602.19248】No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection

Link: https://arxiv.org/abs/2602.19248

Authors: Zunkai Dai, Ke Li, Jiajia Liu, Jie Yang, Yuanyuan Qiao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: challenging problem due, video anomaly detection, challenging problem, problem due, rare occurrence

Notes: Accepted by CVPR 2026

Abstract: The collection and detection of video anomaly data has long been a challenging problem due to its rare occurrence and spatio-temporal scarcity. Existing video anomaly detection (VAD) methods underperform in open-world scenarios. Key contributing factors include limited dataset diversity and inadequate understanding of context-dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. ii) LAVIDA employs an Anomaly Exposure Sampler that transforms segmented objects into pseudo-anomalies to enhance model adaptability to unseen anomaly categories. It further integrates a Multimodal Large Language Model (MLLM) to bolster semantic comprehension capabilities. Additionally, iii) we design a token compression approach based on reverse attention to handle the spatio-temporal scarcity of anomalous patterns and decrease computational cost. The training process is conducted solely on pseudo-anomalies without any VAD data. Evaluations across four benchmark VAD datasets demonstrate that LAVIDA achieves SOTA performance in both frame-level and pixel-level anomaly detection under the zero-shot setting. Our code is available at this https URL.

124. 【2602.19224】Knowledge-aware Visual Question Generation for Remote Sensing Images

Link: https://arxiv.org/abs/2602.19224

Authors: Siran Li, Li Mi, Javiera Castillo-Navarro, Devis Tuia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gathering specific information, performing image retrieval, sensing image archives, rapid development, gathering specific

Abstract:With the rapid development of remote sensing image archives, asking questions about images has become an effective way of gathering specific information or performing image retrieval. However, automatically generated image-based questions tend to be simplistic and template-based, which hinders the real deployment of question answering or visual dialogue systems. To enrich and diversify the questions, we propose a knowledge-aware remote sensing visual question generation model, KRSVQG, that incorporates external knowledge related to the image content to improve the quality and contextual understanding of the generated questions. The model takes an image and a related knowledge triplet from external knowledge sources as inputs and leverages image captioning as an intermediary representation to enhance the image grounding of the generated questions. To assess the performance of KRSVQG, we utilized two datasets that we manually annotated: NWPU-300 and TextRS-300. Results on these two datasets demonstrate that KRSVQG outperforms existing methods and leads to knowledge-enriched questions, grounded in both image and domain knowledge.

125. 【2602.19219】Controlled Face Manipulation and Synthesis for Data Augmentation

Link: https://arxiv.org/abs/2602.19219

Authors: Joris Kirchner, Amogh Gudi, Marian Bittner, Chirag Raman

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Deep learning vision, Deep learning, vision models excel, learning vision models, applications face label

Abstract:Deep learning vision models excel with abundant supervision, but many applications face label scarcity and class imbalance. Controllable image editing can augment scarce labeled data, yet edits often introduce artifacts and entangle non-target attributes. We study this in facial expression analysis, targeting Action Unit (AU) manipulation where annotation is costly and AU co-activation drives entanglement. We present a facial manipulation method that operates in the semantic latent space of a pre-trained face generator (Diffusion Autoencoder). Using lightweight linear models, we reduce entanglement of semantic features via (i) dependency-aware conditioning that accounts for AU co-activation, and (ii) orthogonal projection that removes nuisance attribute directions (e.g., glasses), together with an expression neutralization step to enable absolute AU edit. We use these edits to balance AU occurrence by editing labeled faces and to diversify identities/demographics via controlled synthesis. Augmenting AU detector training with the generated data improves accuracy and yields more disentangled predictions with fewer co-activation shortcuts, outperforming alternative data-efficient training strategies and suggesting improvements similar to what would require substantially more labeled data in our learning-curve analysis. Compared to prior methods, our edits are stronger, produce fewer artifacts, and preserve identity better.

126. 【2602.19217】Questions beyond Pixels: Integrating Commonsense Knowledge in Visual Question Generation for Remote Sensing

Link: https://arxiv.org/abs/2602.19217

Authors: Siran Li, Li Mi, Javiera Castillo-Navarro, Devis Tuia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gathering specific information, Remote Sensing Visual, Knowledge-aware Remote Sensing, Sensing Visual Question, semantic image retrieval

Abstract:With the rapid development of remote sensing image archives, asking questions about images has become an effective way of gathering specific information or performing semantic image retrieval. However, current automatically generated questions tend to be simplistic and template-based, which hinders the deployment of question answering or visual dialogue systems for real-world applications. To enrich and diversify the questions with both image content and commonsense knowledge, we propose a Knowledge-aware Remote Sensing Visual Question Generation model (KRSVQG). The proposed model incorporates related knowledge triplets from external knowledge sources to broaden the question content, while employing image captioning as an intermediary representation to ground questions to the corresponding images. Moreover, KRSVQG utilizes a vision-language pre-training and fine-tuning strategy, enabling the model's adaptation to low data regimes. To evaluate the proposed KRSVQG model, we construct two knowledge-aware remote sensing visual question generation datasets: the NWPU-300 dataset and the TextRS-300 dataset. Evaluations, including metrics and human assessment, demonstrate that KRSVQG outperforms existing methods and leads to rich questions, grounded in both image and domain knowledge. As a key practice in vision-language research, knowledge-aware visual question generation advances the understanding of image content beyond pixels, facilitating the development of knowledge-enriched vision-language systems with vision-grounded human commonsense.

127. 【2602.19213】SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation

Link: https://arxiv.org/abs/2602.19213

Authors: Yujie Lu, Jingwen Li, Sibo Ju, Yanzhou Su, he yao, Yisong Liu, Min Zhu, Junlong Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains challenging due, Medical image segmentation, quantitative analysis, diagnosis and quantitative, remains challenging

Abstract: Medical image segmentation is vital for clinical diagnosis and quantitative analysis, yet remains challenging due to the heterogeneity of imaging modalities and the high cost of pixel-level annotations. Although general interactive segmentation models like SAM have achieved remarkable progress, their transfer to medical imaging still faces two key bottlenecks: (i) the lack of adaptive mechanisms for modality- and anatomy-specific tasks, which limits generalization in out-of-distribution medical scenarios; and (ii) current medical adaptation methods fine-tune on large, heterogeneous datasets without selection, leading to noisy supervision, higher cost, and negative transfer. To address these issues, we propose SegMoTE, an efficient and adaptive framework for medical image segmentation. SegMoTE preserves SAM's original prompt interface, efficient inference, and zero-shot generalization while introducing only a small number of learnable parameters to dynamically adapt across modalities and tasks. In addition, we design a progressive prompt tokenization mechanism that enables fully automatic segmentation, significantly reducing annotation dependence. Trained on MedSeg-HQ, a curated dataset less than 1% the size of existing large-scale datasets, SegMoTE achieves SOTA performance across diverse imaging modalities and anatomical tasks. It represents the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost, advancing the practical deployment of foundation vision models in clinical applications.

128. 【2602.19206】GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning

Link: https://arxiv.org/abs/2602.19206

Authors: Zehao Deng, An Liu, Yan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: data privacy concerns, target training data, Synergistic View Representation, View Representation Learning, privacy concerns

Abstract: Zero-shot 3D Anomaly Detection is an emerging task that aims to detect anomalies in a target dataset without any target training data, which is particularly important in scenarios constrained by sample scarcity and data privacy concerns. While current methods adapt CLIP by projecting 3D point clouds into 2D representations, they face challenges. The projection inherently loses some geometric details, and the reliance on a single 2D modality provides an incomplete visual understanding, limiting their ability to detect diverse anomaly types. To address these limitations, we propose the Geometry-Aware Prompt and Synergistic View Representation Learning (GS-CLIP) framework, which enables the model to identify geometric anomalies through a two-stage learning process. In stage 1, we dynamically generate text prompts embedded with 3D geometric priors. These prompts contain global shape context and local defect information distilled by our Geometric Defect Distillation Module (GDDM). In stage 2, we introduce a Synergistic View Representation Learning architecture that processes rendered and depth images in parallel. A Synergistic Refinement Module (SRM) subsequently fuses the features of both streams, capitalizing on their complementary strengths. Comprehensive experimental results on four large-scale public datasets show that GS-CLIP achieves superior performance in detection. Code is available at this https URL.

129. 【2602.19202】UniE2F: A Unified Diffusion Framework for Event-to-Frame Reconstruction with Video Foundation Models

Link: https://arxiv.org/abs/2602.19202

Authors: Gang Xu, Zhiyu Zhu, Junhui Hou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Event cameras excel, scene perception, excel at high-speed, cameras excel, video

Abstract:Event cameras excel at high-speed, low-power, and high-dynamic-range scene perception. However, as they fundamentally record only relative intensity changes rather than absolute intensity, the resulting data streams suffer from a significant loss of spatial information and static texture details. In this paper, we address this limitation by leveraging the generative prior of a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data. Specifically, we first establish a baseline model by directly applying event data as a condition to synthesize videos. Then, based on the physical correlation between the event stream and video frames, we further introduce the event-based inter-frame residual guidance to enhance the accuracy of video frame reconstruction. Furthermore, we extend our method to video frame interpolation and prediction in a zero-shot manner by modulating the reverse diffusion sampling process, thereby creating a unified event-to-frame reconstruction framework. Experimental results on real-world and synthetic datasets demonstrate that our method significantly outperforms previous approaches both quantitatively and qualitatively. We also refer the reviewers to the video demo contained in the supplementary material for video results. The code will be publicly available at this https URL.

130. 【2602.19198】Prompt Tuning for CLIP on the Pretrained Manifold

Link: https://arxiv.org/abs/2602.19198

Authors: Xi Yang, Yuanrong Xu, Weigang Zhang, Guangming Lu, David Zhang, Jie Wen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: learnable prompt vectors, adapt pretrained vision-language, pretrained vision-language models, parameter-efficient manner, Prompt tuning

Abstract:Prompt tuning introduces learnable prompt vectors that adapt pretrained vision-language models to downstream tasks in a parameter-efficient manner. However, under limited supervision, prompt tuning alters pretrained representations and drives downstream features away from the pretrained manifold toward directions that are unfavorable for transfer. This drift degrades generalization. To address this limitation, we propose ManiPT, a framework that performs prompt tuning on the pretrained manifold. ManiPT introduces cosine consistency constraints in both the text and image modalities to confine the learned representations within the pretrained geometric neighborhood. Furthermore, we introduce a structural bias that enforces incremental corrections, guiding the adaptation along transferable directions to mitigate reliance on shortcut learning. From a theoretical perspective, ManiPT alleviates overfitting tendencies under limited data. Our experiments cover four downstream settings: unseen-class generalization, few-shot classification, cross-dataset transfer, and domain generalization. Across these settings, ManiPT achieves higher average performance than baseline methods. Notably, ManiPT provides an explicit perspective on how prompt tuning overfits under limited supervision.
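A cosine consistency constraint of the kind this abstract mentions can be sketched as a penalty on the angle between prompt-tuned features and the corresponding frozen pretrained features. The abstract does not give the loss, so the `1 - cosine` form, the averaging, and the function names below are assumptions; features are plain lists of floats to keep the example dependency-free.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def cosine_consistency_loss(tuned_features, pretrained_features):
    """Mean of (1 - cosine similarity) between tuned and frozen features.

    The loss is 0 when every tuned feature points in the same direction as its
    pretrained counterpart and grows as the tuned features drift away.
    This form is an assumption, not the formulation from the paper.
    """
    pairs = list(zip(tuned_features, pretrained_features))
    return sum(1.0 - cosine_similarity(t, p) for t, p in pairs) / len(pairs)


if __name__ == "__main__":
    frozen = [[1.0, 0.0], [0.0, 1.0]]
    slightly_tuned = [[0.9, 0.1], [0.1, 0.9]]
    drifted = [[0.0, 1.0], [1.0, 0.0]]
    print(cosine_consistency_loss(slightly_tuned, frozen))
    print(cosine_consistency_loss(drifted, frozen))
```

In a training loop such a term would be added to the task loss for both the text and image branches, so that adaptation stays near the pretrained feature geometry.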

131. 【2602.19190】FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

Link: https://arxiv.org/abs/2602.19190

Authors: Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun, Xiaorong Guo, Qingchen Fang, Ruyi Zhang, Xinpeng Zhou, Haipeng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Synthetic Aperture Radar, all-time Synthetic Aperture, Aperture Radar, Synthetic Aperture, all-time Synthetic

Abstract:Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance across several typical remote sensing visual-language benchmark tests, significantly outperforming mainstream baseline models by over 12%.

132. 【2602.19188】PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration

Link: https://arxiv.org/abs/2602.19188

Authors: Chen Duan, Zhentao Guo, Pei Fu, Zining Wang, Kai Zhou, Pengfei Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Large Language Model, Visual Question Answering, Multi-modal Large Language, Large Language

Abstract:In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and exhibit adaptability across varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing, and thus inherently lacks the positional reasoning required for precise visual tasks, such as text spotting and text grounding. Additionally, the extensive parameters of MLLMs necessitate substantial computational resources and large-scale data for effective training. Conversely, text spotting specialists achieve state-of-the-art coordinate predictions but lack semantic reasoning capabilities. This dichotomy motivates our key research question: Can we synergize the efficiency of specialists with the contextual power of LLMs to create a positionally-accurate MLLM? To overcome these challenges, we introduce PositionOCR, a parameter-efficient hybrid architecture that seamlessly integrates a text spotting model's positional strengths with an LLM's contextual reasoning. Comprising 131M trainable parameters, this framework demonstrates outstanding multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting, consistently surpassing traditional MLLMs.

133. 【2602.19180】VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery

Link: https://arxiv.org/abs/2602.19180

Authors: Wenhao Shen, Hao Wang, Wanqi Yin, Fayao Liu, Xulei Yang, Chao Liang, Zhongang Cai, Guosheng Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single RGB image, single RGB, Human mesh recovery, mesh recovery, inherently ambiguous

Notes: Accepted to CVPR 2026

Abstract:Human mesh recovery (HMR) from a single RGB image is inherently ambiguous, as multiple 3D poses can correspond to the same 2D observation. Recent diffusion-based methods tackle this by generating various hypotheses, but often sacrifice accuracy. They yield predictions that are either physically implausible or drift from the input image, especially under occlusion or in cluttered, in-the-wild scenes. To address this, we introduce a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes. These scores distill fine-grained cues about 3D human motion structure, physical feasibility, and alignment with the input image. We use these scores to build a group-wise HMR preference dataset. Leveraging this dataset, we propose a group preference alignment framework for finetuning diffusion-based HMR models. This process injects the rich preference signals into the model, guiding it to generate more physically plausible and image-consistent human meshes. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art approaches.

134. 【2602.19178】EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease

Link: https://arxiv.org/abs/2602.19178

Authors: Qiuhui Chen, Xuancheng Yao, Zhenglei Zhou, Xinyue Hu, Yi Hong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep learning models, Deep learning, explicitly linking decisions, medical image analysis, black boxes

Notes: Accepted by CVPR 2026

Abstract:Deep learning models for medical image analysis often act as black boxes, seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer's disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision-language framework that generates structured AD diagnostic reports in which each claim is explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods. We will release code and grounding annotations to support future research in trustworthy medical vision-language models.

135. 【2602.19170】BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment

Link: https://arxiv.org/abs/2602.19170

Authors: Kanglei Zhou, Chang Li, Qingyi Pan, Liyuan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Action Quality Assessment, human skill evaluation, Quality Assessment, Action Quality, rehabilitation assessment

Notes: Accepted to CVPR 2026

Abstract: Action Quality Assessment (AQA) aims to score how well an action is performed and is widely used in sports analysis, rehabilitation assessment, and human skill evaluation. Multi-modal AQA has recently achieved strong progress by leveraging complementary visual and kinematic cues, yet real-world deployments often suffer from non-stationary modality imbalance, where certain modalities become missing or intermittently available due to sensor failures or annotation gaps. Existing continual AQA methods overlook this issue and assume that all modalities remain complete and stable throughout training, which restricts their practicality. To address this challenge, we introduce Bridged Modality Adaptation (BriMA), an innovative approach to multi-modal continual AQA under modality-missing conditions. BriMA consists of a memory-guided bridging imputation module that reconstructs missing modalities using both task-agnostic and task-specific representations, and a modality-aware replay mechanism that prioritizes informative samples based on modality distortion and distribution drift. Experiments on three representative multi-modal AQA datasets (RG, Fis-V, and FS1000) show that BriMA consistently improves performance under different modality-missing conditions, achieving 6-8% higher correlation and 12-15% lower error on average. These results demonstrate a step toward robust multi-modal AQA systems under real-world deployment constraints.

136. 【2602.19163】JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

Link: https://arxiv.org/abs/2602.19163

Authors: Kai Liu, Yanhao Zheng, Kai Wang, Shengqiong Wu, Rongjunchen Zhang, Jiebo Luo, Dimitrios Hatzinakos, Ziwei Liu, Hao Fei, Tat-Seng Chua

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)

Keywords: high-quality multimodal synthesis, AIGC has rapidly, rapidly expanded, high-quality multimodal, multimodal synthesis

Notes: Accepted by ICLR 2026. Homepage: [this https URL](https://JavisVerse.github.io/JavisDiT2-page)

Abstract:AIGC has rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task that produces synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods still suffer from limitations in generation quality, temporal synchrony, and alignment with human preferences. To bridge the gap, this paper presents JavisDiT++, a concise yet powerful framework for unified modeling and optimization of JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables cross-modal interaction efficacy while enhancing single-modal generation quality. Then, we propose a temporal-aligned RoPE (TA-RoPE) strategy to achieve explicit, frame-level synchronization between audio and video tokens. Besides, we develop an audio-video direct preference optimization (AV-DPO) method to align model outputs with human preference across quality, consistency, and synchrony dimensions. Built upon Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance merely with around 1M public training entries, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies have been conducted to validate the effectiveness of our proposed modules. All the code, model, and dataset are released at this https URL.

137. 【2602.19161】Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation

Link: https://arxiv.org/abs/2602.19161

Authors: Lunjie Zhu, Yushi Huang, Xingtong Ge, Yufei Xue, Zhening Liu, Yumeng Zhang, Zehong Lin, Jun Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high-quality video synthesis, enabled high-quality video, Latent diffusion models, inference remains costly, VAE decoders

Comments: Code will be released at [this https URL](https://github.com/Aoko955/Flash-VAED)

Abstract:Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a 6$\times$ speedup while maintaining the reconstruction performance up to 96.9%. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.

138. 【2602.19156】Artefact-Aware Fungal Detection in Dermatophytosis: A Real-Time Transformer-Based Approach for KOH Microscopy

Link: https://arxiv.org/abs/2602.19156

Authors: Rana Gursoy, Abdurrahim Yilmaz, Baris Kizilyaprak, Esmahan Caglar, Burak Temelkuran, Huseyin Uvet, Ayse Esra Koku Aksu, Gulsum Gencoglan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: heterogeneous keratin clearance, notable inter-observer variability, Dermatophytosis is commonly, potassium hydroxide, heterogeneous keratin

Abstract:Dermatophytosis is commonly assessed using potassium hydroxide (KOH) microscopy, yet accurate recognition of fungal hyphae is hindered by artefacts, heterogeneous keratin clearance, and notable inter-observer variability. This study presents a transformer-based detection framework using the RT-DETR model architecture to achieve precise, query-driven localization of fungal structures in high-resolution KOH images. A dataset of 2,540 routinely acquired microscopy images was manually annotated using a multi-class strategy to explicitly distinguish fungal elements from confounding artefacts. The model was trained with morphology-preserving augmentations to maintain the structural integrity of thin hyphae. Evaluation on an independent test set demonstrated robust object-level performance, with a recall of 0.9737, precision of 0.8043, and an AP@0.50 of 93.56%. When aggregated for image-level diagnosis, the model achieved 100% sensitivity and 98.8% accuracy, correctly identifying all positive cases without missing a single diagnosis. Qualitative outputs confirmed the robust localization of low-contrast hyphae even in artefact-rich fields. These results highlight that an artificial intelligence (AI) system can serve as a highly reliable, automated screening tool, effectively bridging the gap between image-level analysis and clinical decision-making in dermatomycology.
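
The abstract reports object-level precision and recall but no combined score. As a rough aid for comparing against other detectors, the F1 score (harmonic mean of the two) can be derived from the quoted figures. The helper name `f1_score` is hypothetical, and the resulting value is derived arithmetic, not a number reported by the authors.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Object-level precision and recall as quoted in the abstract above.
f1 = f1_score(precision=0.8043, recall=0.9737)
print(f"{f1:.3f}")  # prints 0.881
```

The gap between the high recall and the lower precision means the detector rarely misses hyphae but produces some false alarms, which fits a screening use case.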

139. 【2602.19146】VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Link: https://arxiv.org/abs/2602.19146

Authors: Diogo Glória-Silva, David Semedo, João Magalhães

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: multi-step instructional video, instructional video action, Instructional Video Dialogues, reason over complex, dialogue model designed

Comments: Accepted at EACL 2026 Findings

Abstract:We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work which focuses mainly on text-only guidance, or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. Experiments were done on a novel dataset with rich Instructional Video Dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.

140. 【2602.19140】CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion

Link: https://arxiv.org/abs/2602.19140

Authors: Sijie Mai, Shiqin Han

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: gap significantly restricts, Modality gap significantly, Modality gap, Modality, rectified flow

Comments: Accepted by CVPR 2026

Abstract:Modality gap significantly restricts the effectiveness of multimodal fusion. Previous methods often use techniques such as diffusion models and adversarial learning to reduce the modality gap, but they typically focus on one-to-one alignment without exposing the data points of the source modality to the global distribution information of the target modality. To this end, leveraging the characteristic of rectified flow that can map one distribution to another via a straight trajectory, we extend rectified flow for modality distribution mapping. Specifically, we leverage the 'one-to-many mapping' strategy in rectified flow that allows each data point of the source modality to observe the overall target distribution. This also alleviates the issue of insufficient paired data within each sample, enabling a more robust distribution transformation. Moreover, to achieve more accurate distribution mapping and address the ambiguous flow directions in one-to-many mapping, we design 'adaptive relaxed alignment', enforcing stricter alignment for modality pairs belonging to the same sample, while applying relaxed mapping for pairs not belonging to the same sample or category. Additionally, to prevent information loss during distribution mapping, we introduce 'cyclic rectified flow' to ensure the transferred features can be translated back to the original features, allowing multimodal representations to learn sufficient modality-specific information. After distribution alignment, our approach achieves very competitive results on multiple tasks of multimodal affective computing even with a simple fusion method, and visualizations verify that it can effectively reduce the modality gap.
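
The straight-trajectory idea behind rectified flow can be sketched in a few lines. This is only the generic rectified-flow training target (linear interpolation between a source point and a target point, with the constant velocity `x1 - x0` as the regression target), plus a naive random pairing to hint at the one-to-many strategy. The function names and the random pairing rule are assumptions for illustration. The abstract does not specify how CaReFlow forms pairs or weights them, and the adaptive relaxed alignment and cyclic components are not modeled here.

```python
import random


def rectified_flow_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, and its velocity."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]  # constant along the path
    return xt, velocity


def sample_training_example(source, targets, rng):
    """Pair one source feature with a random target feature (assumed pairing)."""
    x1 = rng.choice(targets)  # any target sample, not only the paired one
    t = rng.random()          # time in [0, 1)
    xt, velocity = rectified_flow_pair(source, x1, t)
    return xt, t, velocity


rng = random.Random(0)
targets = [[1.0, 1.0], [2.0, 4.0]]
xt, t, v = sample_training_example([0.0, 0.0], targets, rng)
```

A velocity network would then be trained to predict `velocity` from `(xt, t)`. Because the source point is paired with many different targets over training, it is exposed to the whole target distribution rather than a single match.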

141. 【2602.19134】Mapping Networks

Link: https://arxiv.org/abs/2602.19134

Authors: Lord Sen, Shyamapada Mukherjee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modern deep learning, deep learning models, learning models pose, escalating parameter counts, counts in modern

Comments: 10 pages

Abstract:The escalating parameter counts in modern deep learning models pose a fundamental challenge to efficient training and to the resolution of overfitting. We address this by introducing Mapping Networks, which replace the high-dimensional weight space with a compact, trainable latent vector, based on the hypothesis that the trained parameters of large networks reside on smooth, low-dimensional manifolds. The Mapping Theorem, enforced by a dedicated Mapping Loss, shows the existence of a mapping from this latent space to the target weight space both theoretically and in practice. Mapping Networks significantly reduce overfitting and achieve comparable or better performance than the target network across complex vision and sequence tasks, including Image Classification, Deepfake Detection, etc., with a 99.5%, i.e., around $500\times$, reduction in trainable parameters.

142. 【2602.19123】StreetTree: A Large-Scale Global Benchmark for Fine-Grained Tree Species Classification

Link: https://arxiv.org/abs/2602.19123

Authors: Jiapeng Li, Yingjing Huang, Fan Zhang, Yu Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: urban ecosystem services, ecosystem services, street trees, crucial task, fine-grained street tree

Abstract:The fine-grained classification of street trees is a crucial task for urban planning, streetscape management, and the assessment of urban ecosystem services. However, progress in this field has been significantly hindered by the lack of large-scale, geographically diverse, and publicly available benchmark datasets specifically designed for street trees. To address this critical gap, we introduce StreetTree, the world's first large-scale benchmark dataset dedicated to fine-grained street tree classification. The dataset contains over 12 million images covering more than 8,300 common street tree species, collected from urban streetscapes across 133 countries spanning five continents, and supplemented with expert-verified observational data. StreetTree poses substantial challenges for pretrained vision models under complex urban environments: high inter-species visual similarity, long-tailed natural distributions, significant intra-class variations caused by seasonal changes, and diverse imaging conditions such as lighting, occlusions from buildings, and varying camera angles. In addition, we provide a hierarchical taxonomy (order-family-genus-species) to support research in hierarchical classification and representation learning. Through extensive experiments with various visual models, we establish strong baselines and reveal the limitations of existing methods in handling such real-world complexities. We believe that StreetTree will serve as a key resource for the refined management and research of urban street trees, while also driving new advancements at the intersection of computer vision and urban science.

143. 【2602.19117】Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models

Link: https://arxiv.org/abs/2602.19117

Authors: Jaeyun Jang, Seunghui Shin, Taeho Park, Hyoseok Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: understanding spatial relationships, involves understanding spatial, specific viewpoints-either egocentric, reasoning involves understanding, Symbolic Projective Layout

Comments: To appear in CVPR 2026

Abstract:Perspective-aware spatial reasoning involves understanding spatial relationships from specific viewpoints, either egocentric (observer-centered) or allocentric (object-centered). While vision-language models (VLMs) perform well in egocentric settings, their performance deteriorates when reasoning from allocentric viewpoints, where spatial relations must be inferred from the perspective of objects within the scene. In this study, we address this underexplored challenge by introducing Symbolic Projective Layout (SymPL), a framework that reformulates allocentric reasoning into symbolic-layout forms that VLMs inherently handle well. By leveraging four key factors (projection, abstraction, bipartition, and localization), SymPL converts allocentric questions into structured symbolic-layout representations. Extensive experiments demonstrate that this reformulation substantially improves performance in both allocentric and egocentric tasks, enhances robustness under visual illusions and multi-view scenarios, and that each component contributes critically to these gains. These results show that SymPL provides an effective and principled approach for addressing complex perspective-aware spatial reasoning.

144. 【2602.19112】Universal 3D Shape Matching via Coarse-to-Fine Language Guidance

Link: https://arxiv.org/abs/2602.19112

Authors: Qinfeng Xiao, Guofeng Mei, Bo Yang, Liying Zhang, Jian Zhang, Kit-lun Yick

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: homogeneous subject types, prior approaches depend, Establishing dense correspondences, Establishing dense, subject types

Comments: Accepted to CVPR 2026

Abstract:Establishing dense correspondences between shapes is a crucial task in computer vision and graphics, while prior approaches depend on near-isometric assumptions and homogeneous subject types (i.e., only operate for human shapes). However, building semantic correspondences for cross-category objects remains challenging and has received relatively little attention. To achieve this, we propose UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes without restricting object categories. The key insight is to lift "coarse" semantic cues into "fine" correspondence, which is achieved through two stages. In the "coarse" stage, we perform class-agnostic 3D segmentation to obtain non-overlapping semantic parts and prompt multimodal large language models (MLLMs) to identify part names. Then, we employ pretrained vision language models (VLMs) to extract text embeddings, enabling the construction of matched semantic parts. In the "fine" stage, we leverage these coarse correspondences to guide the learning of dense correspondences through a dedicated rank-based contrastive scheme. Thanks to class-agnostic segmentation, language guiding, and rank-based contrastive learning, our method is versatile for universal object categories and requires no predefined part proposals, enabling universal matching for inter-class and non-isometric shapes. Extensive experiments demonstrate UniMatch consistently outperforms competing methods in various challenging scenarios.

145. 【2602.19091】CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension

Link: https://arxiv.org/abs/2602.19091

Authors: Lihao Liu, Yan Wang, Biao Yang, Da Li, Jiangxia Cao, Yuxiao Luo, Xiang Chen, Xiangyu Wu, Wei Yuan, Fan Yang, Guiguang Ding, Tingting Gao, Guorui Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Large Language, visual question answering, shown remarkable success

Abstract:Multimodal Large Language Models (MLLMs) have shown remarkable success in comprehension tasks such as visual description and visual question answering. However, their direct application to embedding-based tasks like retrieval remains challenging due to the discrepancy between output formats and optimization objectives. Previous approaches often employ contrastive fine-tuning to adapt MLLMs for retrieval, but at the cost of losing their generative capabilities. We argue that both generative and embedding tasks fundamentally rely on shared cognitive mechanisms, specifically cross-modal representation alignment and contextual comprehension. To this end, we propose CREM (Compression-driven Representation Enhanced Model), with a unified framework that enhances multimodal representations for retrieval while preserving generative ability. Specifically, we introduce a compression-based prompt design with learnable chorus tokens to aggregate multimodal semantics and a compression-driven training strategy that integrates contrastive and generative objectives through compression-aware attention. Extensive experiments demonstrate that CREM achieves state-of-the-art retrieval performance on MMEB while maintaining strong generative performance on multiple comprehension benchmarks. Our findings highlight that generative supervision can further improve the representational quality of MLLMs under the proposed compression-driven paradigm.

146. 【2602.19089】Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling

Link: https://arxiv.org/abs/2602.19089

Authors: Qi Sun, Can Wang, Jiaxiang Shang, Yingchun Liu, Jing Liao

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: lack non-rigid dynamics, kinematics-based approaches lack, approaches lack non-rigid, clothing dynamics, video diffusion priors

Comments: CVPR 2026

Abstract:Current 3D human animation methods struggle to achieve photorealism: kinematics-based approaches lack non-rigid dynamics (e.g., clothing dynamics), while methods that leverage video diffusion priors can synthesize non-rigid motion but suffer from quality artifacts and identity loss. To overcome these limitations, we present Ani3DHuman, a framework that marries kinematics-based animation with video diffusion priors. We first introduce a layered motion representation that disentangles rigid motion from residual non-rigid motion. Rigid motion is generated by a kinematic method, which then produces a coarse rendering to guide the video diffusion model in generating video sequences that restore the residual non-rigid motion. However, this restoration task, based on diffusion sampling, is highly challenging, as the initial renderings are out-of-distribution, causing standard deterministic ODE samplers to fail. Therefore, we propose a novel self-guided stochastic sampling method, which effectively addresses the out-of-distribution problem by combining stochastic sampling (for photorealistic quality) with self-guidance (for identity fidelity). These restored videos provide high-quality supervision, enabling the optimization of the residual non-rigid motion field. Extensive experiments demonstrate that Ani3DHuman can generate photorealistic 3D human animation, outperforming existing methods. Code is available at this https URL.

147. 【2602.19086】Restoration-Guided Kuzushiji Character Recognition Framework under Seal Interference

Link: https://arxiv.org/abs/2602.19086

Authors: Rui-Yang Ju, Kohei Yamashita, Hirotaka Kameko, Shinsuke Mori

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: popular writing styles, pre-modern Japan, Kuzushiji, Kuzushiji character, Kuzushiji character recognition

Abstract:Kuzushiji was one of the most popular writing styles in pre-modern Japan and was widely used in both personal letters and official documents. However, due to its highly cursive forms and extensive glyph variations, most modern Japanese readers cannot directly interpret Kuzushiji characters. Therefore, recent research has focused on developing automated Kuzushiji character recognition methods, which have achieved satisfactory performance on relatively clean Kuzushiji document images. However, existing methods struggle to maintain recognition accuracy under seal interference (e.g., when seals overlap characters), despite the frequent occurrence of seals in pre-modern Japanese documents. To address this challenge, we propose a three-stage restoration-guided Kuzushiji character recognition (RG-KCR) framework specifically designed to mitigate seal interference. We construct datasets for evaluating Kuzushiji character detection (Stage 1) and classification (Stage 3). Experimental results show that the YOLOv12-medium model achieves a precision of 98.0% and a recall of 93.3% on the constructed test set. We quantitatively evaluate the restoration performance of Stage 2 using PSNR and SSIM. In addition, we conduct an ablation study to demonstrate that Stage 2 improves the Top-1 accuracy of Metom, a Vision Transformer (ViT)-based Kuzushiji classifier employed in Stage 3, from 93.45% to 95.33%. The implementation code of this work is available at this https URL.

148. 【2602.19083】ChordEdit: One-Step Low-Energy Transport for Image Editing

Link: https://arxiv.org/abs/2602.19083

Authors: Liangsi Lu, Xuhang Chen, Minzhe Guo, Shichu Li, Jingchao Wang, Yang Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unprecedented synthesis speed, offers unprecedented synthesis, models offers unprecedented, synthesis speed, offers unprecedented

Comments: Accepted by CVPR 2026

Abstract:The advent of one-step text-to-image (T2I) models offers unprecedented synthesis speed. However, their application to text-guided image editing remains severely hampered, as forcing existing training-free editors into a single inference step fails. This failure manifests as severe object distortion and a critical loss of consistency in non-edited regions, resulting from the high-energy, erratic trajectories produced by naive vector arithmetic on the models' structured fields. To address this problem, we introduce ChordEdit, a model-agnostic, training-free, and inversion-free method that facilitates high-fidelity one-step editing. We recast editing as a transport problem between the source and target distributions defined by the source and target text prompts. Leveraging dynamic optimal transport theory, we derive a principled, low-energy control strategy. This strategy yields a smoothed, variance-reduced editing field that is inherently stable, allowing the field to be traversed in a single, large integration step. A theoretically grounded and experimentally validated approach allows ChordEdit to deliver fast, lightweight, and precise edits, finally achieving true real-time editing on these challenging models.

149. 【2602.19064】L3DR: 3D-aware LiDAR Diffusion and Rectification

Link: https://arxiv.org/abs/2602.19064

Authors: Quan Liu, Xiaoqin Zhang, Ling Shao, Shijian Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently made huge, made huge strides, based LiDAR diffusion, LiDAR diffusion, recently made

Comments: In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Abstract:Range-view (RV) based LiDAR diffusion has recently made huge strides towards 2D photo-realism. However, it neglects 3D geometry realism and often generates various RV artifacts such as depth bleeding and wavy surfaces. We design L3DR, a 3D-aware LiDAR Diffusion and Rectification framework that can regress and cancel RV artifacts in 3D space and restore local geometry accurately. Our theoretical and empirical analysis reveals that 3D models are inherently superior to 2D models in generating sharp and authentic boundaries. Leveraging such analysis, we design a 3D residual regression network that rectifies RV artifacts and achieves superb geometry realism by predicting point-level offsets in 3D space. On top of that, we design a Welsch Loss that helps focus on local geometry and ignore anomalous regions effectively. Extensive experiments over multiple benchmarks including KITTI, KITTI360, nuScenes and Waymo show that the proposed L3DR achieves state-of-the-art generation and superior geometry-realism consistently. In addition, L3DR is generally applicable to different LiDAR diffusion models with little computational overhead.
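
To make the role of the Welsch Loss concrete, here is a minimal sketch of the classic Welsch robust loss from the M-estimator literature, rho(r) = (c^2 / 2) * (1 - exp(-(r / c)^2)). The abstract does not give the exact form or scale used in L3DR, so the formula, the function name `welsch_loss`, and the default `c` are assumptions. The point is only the shape: roughly quadratic for small residuals and saturating at c^2 / 2 for large ones, so anomalous regions contribute a bounded penalty.

```python
import math


def welsch_loss(residual: float, c: float = 1.0) -> float:
    """Classic Welsch robust loss: bounded above by c**2 / 2."""
    return (c * c / 2.0) * (1.0 - math.exp(-((residual / c) ** 2)))


# Small residuals behave like r**2 / 2; large ones saturate near c**2 / 2.
for r in (0.01, 0.5, 1.0, 5.0, 100.0):
    print(r, welsch_loss(r))
```

Compared with a squared-error loss, whose gradient grows with the residual, the gradient here vanishes for large residuals, which is one common way to let a regressor ignore outlier points.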

150. 【2602.19063】Direction-aware 3D Large Multimodal Models

Link: https://arxiv.org/abs/2602.19063

Authors: Quan Liu, Weihao Xuan, Junjue Wang, Naoto Yokoya, Ling Shao, Shijian Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large multimodal models, enabling directional question-answering, ego poses, large multimodal, large multimodal modelling

Comments: In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Abstract:3D large multimodal models (3D LMMs) rely heavily on ego poses for enabling directional question-answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, making them inherently ill-posed in 3D large multimodal modelling. In this work, we define a new and rigorous paradigm that enables direction-aware 3D LMMs by identifying and supplementing ego poses into point cloud benchmarks and transforming the corresponding point cloud data according to the identified ego poses. We enable direction-aware 3D LMMs with two novel designs. The first is PoseRecover, a fully automatic pose recovery pipeline that matches questions with ego poses from RGB-D video extrinsics via object-frustum intersection and visibility check with Z-buffers. The second is PoseAlign, which transforms the point cloud data to be aligned with the identified ego poses instead of either injecting ego poses into textual prompts or introducing pose-encoded features in the projection layers. Extensive experiments show that our designs yield consistent improvements across multiple 3D LMM backbones such as LL3DA, LL3DA-SONATA, Chat-Scene, and 3D-LLAVA, improving ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%. In addition, our approach is simple, generic, and training-efficient, requiring only instruction tuning while establishing a strong baseline for direction-aware 3D LMMs.

151. 【2602.19053】TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation

Link: https://arxiv.org/abs/2602.19053

Authors: Qingwen Zhang, Chenhan Jiang, Xiaomeng Zhu, Yunqi Miao, Yushan Zhang, Olov Andersson, Patric Jensfelt

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: offer real-time efficiency, scene flow estimation, flow estimation offer, estimation offer real-time, real-time efficiency

Comments: CVPR 2026; 15 pages, 8 figures

Abstract:Self-supervised feed-forward methods for scene flow estimation offer real-time efficiency, but their supervision from two-frame point correspondences is unreliable and often breaks down under occlusions. Multi-frame supervision has the potential to provide more stable guidance by incorporating motion cues from past frames, yet naive extensions of two-frame objectives are ineffective because point correspondences vary abruptly across frames, producing inconsistent signals. In this paper, we present TeFlow, which enables multi-frame supervision for feed-forward models by mining temporally consistent supervision. TeFlow introduces a temporal ensembling strategy that forms reliable supervisory signals by aggregating the most temporally consistent motion cues from a candidate pool built across multiple frames. Extensive evaluations demonstrate that TeFlow establishes a new state-of-the-art for self-supervised feed-forward methods, achieving performance gains of up to 33% on the challenging Argoverse 2 and nuScenes datasets. Our method performs on par with leading optimization-based methods, yet runs 150 times faster. The code is open-sourced at this https URL along with trained model weights.

152. 【2602.19035】OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness

Link: https://arxiv.org/abs/2602.19035

Authors: Phuc D.A. Nguyen, Anh N. Nhu, Ming C. Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Open-world Visual Odometry, Visual Odometry, Open-world Visual, limited input conditions, input conditions

Comments: Main paper CVPR 2026

Abstract:We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions. OpenVO effectively estimates real-world-scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust trajectory dataset construction from rare driving events recorded by dashcams. Existing VO methods are trained on a fixed observation frequency (e.g., 10Hz or 12Hz), completely overlooking temporal dynamics information. Many prior methods also require calibrated cameras with known intrinsic parameters. Consequently, their performance degrades when (1) deployed under unseen observation frequencies or (2) applied to uncalibrated cameras. These issues significantly limit their generalizability to many downstream tasks, such as extracting trajectories from dashcam footage. To address these challenges, OpenVO (1) explicitly encodes temporal dynamics information within a two-frame pose regression framework and (2) leverages 3D geometric priors derived from foundation models. We validate our method on three major autonomous-driving benchmarks - KITTI, nuScenes, and Argoverse 2 - achieving more than 20 performance improvement over state-of-the-art approaches. Under varying observation rate settings, our method is significantly more robust, achieving 46%-92% lower errors across all metrics. These results demonstrate the versatility of OpenVO for real-world 3D reconstruction and diverse downstream applications.

153. 【2602.19033】A Markovian View of Iterative-Feedback Loops in Image Generative Models: Neural Resonance and Model Collapse

Link: https://arxiv.org/abs/2602.19033

Authors: Vibhas Kumar Vats, David J. Crandall, Samuel Goree

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: training datasets, datasets will inevitably, inevitably contain AI-generated, feedback, impacts the training

Comments: A preprint -- Under review

Abstract:AI training datasets will inevitably contain AI-generated examples, leading to "feedback" in which the output of one model impacts the training of another. It is known that such iterative feedback can lead to model collapse, yet the mechanisms underlying this degeneration remain poorly understood. Here we show that a broad class of feedback processes converges to a low-dimensional invariant structure in latent space, a phenomenon we call neural resonance. By modeling iterative feedback as a Markov chain, we show that two conditions are needed for this resonance to occur: ergodicity of the feedback process and directional contraction of the latent representation. By studying diffusion models on MNIST and ImageNet, as well as CycleGAN and an audio feedback experiment, we map how local and global manifold geometry evolve, and we introduce an eight-pattern taxonomy of collapse behaviors. Neural resonance provides a unified explanation for long-term degenerate behavior in generative models and provides practical diagnostics for identifying, characterizing, and eventually mitigating collapse.
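
The ergodicity condition mentioned above has a simple textbook analogue: an ergodic finite Markov chain forgets its starting state and converges to a single invariant distribution. The sketch below iterates a toy two-state chain from two different starting distributions. The transition matrix `P` and the helper names are made up for illustration. This is not the paper's latent-space model and says nothing about the directional-contraction condition.

```python
def step(dist, P):
    """One Markov step: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def iterate(dist, P, n_steps):
    """Apply the transition matrix n_steps times."""
    for _ in range(n_steps):
        dist = step(dist, P)
    return dist


# Toy row-stochastic transition matrix (hypothetical values).
P = [[0.9, 0.1],
     [0.5, 0.5]]

from_state_0 = iterate([1.0, 0.0], P, 200)
from_state_1 = iterate([0.0, 1.0], P, 200)
print(from_state_0, from_state_1)  # both approach [5/6, 1/6]
```

Both runs end at the same distribution, which solves pi = pi P. In the feedback-loop setting, the analogous claim is that repeated generate-and-retrain cycles drift toward the same invariant structure regardless of where they start.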

154. 【2602.19024】Towards Calibrating Prompt Tuning of Vision-Language Models

Link: https://arxiv.org/abs/2602.19024

Authors: Ashshak Sharifdeen, Fahad Shamshad, Muhammad Akhtar Munir, Abhishek Basu, Mohamed Insaf Ismithdeen, Jeyapriyan Jeyamohan, Chathurika Sewwandi Silva, Karthik Nandakumar, Muhammad Haris Khan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: updating model weights, large-scale vision-language models, enables efficient task, efficient task adaptation, CLIP enables efficient

Comments: Accepted to CVPR 2026

Abstract:Prompt tuning of large-scale vision-language models such as CLIP enables efficient task adaptation without updating model weights. However, it often leads to poor confidence calibration and unreliable predictive uncertainty. We address this problem by proposing a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization. Our approach extends the standard cross-entropy loss with two complementary regularizers: (1) a mean-variance margin penalty that stabilizes inter-class logit margins by maximizing their average while minimizing dispersion, mitigating underconfidence and overconfidence spikes; and (2) a text moment-matching loss that aligns the first and second moments of tuned text embeddings with their frozen CLIP counterparts, preserving semantic dispersion crucial for generalization. Through extensive experiments across 7 prompt-tuning methods and 11 diverse datasets, we demonstrate that our approach significantly reduces the Expected Calibration Error (ECE) compared to competitive calibration techniques on both base and novel classes.
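
For readers unfamiliar with the metric, Expected Calibration Error is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below uses equal-width bins. The bin count, the function name, and the toy inputs are assumptions, and the paper may use a different binning variant.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example: four predictions at 0.9 confidence, three of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(round(ece, 4))  # prints 0.15
```

In the toy example the model claims 90% confidence but is right 75% of the time, so the calibration gap is 0.15. A well-calibrated model drives this gap toward zero in every bin.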

155. 【2602.19022】An interpretable framework using foundation models for fish sex identification

Link: https://arxiv.org/abs/2602.19022

Authors: Zheng Miao, Tien-Chieh Hung

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Accurate sex identification, Accurate sex, strategies in aquaculture, sex identification, vital for optimizing

点击查看摘要

Abstract: Accurate sex identification in fish is vital for optimizing breeding and management strategies in aquaculture, particularly for species at risk of extinction. However, most existing methods are invasive or stressful and may cause additional mortality, posing severe risks to threatened or endangered fish populations. To address these challenges, we propose FishProtoNet, a robust, non-invasive computer vision-based framework for sex identification of delta smelt (Hypomesus transpacificus), an endangered fish species native to California, across its full life cycle. Unlike traditional deep learning methods, FishProtoNet provides interpretability through learned prototype representations while improving robustness by leveraging foundation models to reduce the influence of background noise. Specifically, the FishProtoNet framework consists of three key components: extraction of fish regions of interest (ROIs) using a visual foundation model, feature extraction from the fish ROIs, and fish sex identification based on an interpretable prototype network. FishProtoNet demonstrates strong performance in delta smelt sex identification during the early spawning and post-spawning stages, achieving accuracies of 74.40% and 81.16% and corresponding F1 scores of 74.27% and 79.43%, respectively. In contrast, delta smelt sex identification at the subadult stage remains challenging for current computer vision methods, likely due to less pronounced morphological differences in immature fish. The source code of FishProtoNet is publicly available at: this https URL

156. 【2602.19019】TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery

Link: https://arxiv.org/abs/2602.19019

Authors: Li Zhang, Shruti Agarwal, John Collomosse, Pengtao Xie, Vishal Asnani

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: replicate unique artistic, unique artistic styles, intellectual property, pose a significant, significant challenge

Abstract:Generative AI models pose a significant challenge to intellectual property (IP), as they can replicate unique artistic styles and concepts without attribution. While watermarking offers a potential solution, existing methods often fail in complex scenarios where multiple concepts (e.g., an object and an artistic style) are composed within a single image. These methods struggle to disentangle and attribute each concept individually. In this work, we introduce TokenTrace, a novel proactive watermarking framework for robust, multi-concept attribution. Our method embeds secret signatures into the semantic domain by simultaneously perturbing the text prompt embedding and the initial latent noise that guide the diffusion model's generation process. For retrieval, we propose a query-based TokenTrace module that takes the generated image and a textual query specifying which concepts need to be retrieved (e.g., a specific object or style) as inputs. This query-based mechanism allows the module to disentangle and independently verify the presence of multiple concepts from a single generated image. Extensive experiments show that our method achieves state-of-the-art performance on both single-concept (object and style) and multi-concept attribution tasks, significantly outperforming existing baselines while maintaining high visual quality and robustness to common transformations.

157. 【2602.19005】GUIDE-US: Grade-Informed Unpaired Distillation of Encoder Knowledge from Histopathology to Micro-UltraSound

Link: https://arxiv.org/abs/2602.19005

Authors: Emma Willis, Tarek Elghareb, Paul F. R. Wilson, Minh Nguyen Nhat To, Mohammad Mahdi Abootorabi, Amoon Jamzad, Brian Wodlinger, Parvin Mousavi, Purang Abolmaesumi

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: infer tissue micro-structure, Non-invasive grading, coarse imaging resolutions, aggressive regions, current models struggle

Comments: Accepted to IPCAI 2026

Abstract:Purpose: Non-invasive grading of prostate cancer (PCa) from micro-ultrasound (micro-US) could expedite triage and guide biopsies toward the most aggressive regions, yet current models struggle to infer tissue micro-structure at coarse imaging resolutions. Methods: We introduce an unpaired histopathology knowledge-distillation strategy that trains a micro-US encoder to emulate the embedding distribution of a pretrained histopathology foundation model, conditioned on International Society of Urological Pathology (ISUP) grades. Training requires no patient-level pairing or image registration, and histopathology inputs are not used at inference. Results: Compared to the current state of the art, our approach increases sensitivity to clinically significant PCa (csPCa) at 60% specificity by 3.5% and improves overall sensitivity at 60% specificity by 1.2%. Conclusion: By enabling earlier and more dependable cancer risk stratification solely from imaging, our method advances clinical feasibility. Source code will be publicly released upon publication.

Submission history: v1 submitted by Mohammad Mahdi Abootorabi, Sun, 22 Feb 2026 02:02:36 UTC (3,570 KB)

158. 【2602.19004】MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment

Link: https://arxiv.org/abs/2602.19004

Authors: Duc Duy Nguyen, Tat-Jun Chin, Minh Hoai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inertial measurement unit, enabling accurate cross-modal, accurate cross-modal retrieval, pose sequences extracted, measurement unit

Comments: 8 pages, 6 tables, 7 figures, accepted to CVPR26

Abstract:We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRi, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities. Code is available at this https URL MoBind.

159. 【2602.19001】A Benchmark and Knowledge-Grounded Framework for Advanced Multimodal Personalization Study

Link: https://arxiv.org/abs/2602.19001

Authors: Xia Hu, Honglei Zhuang, Brian Potetz, Alireza Fathi, Bo Hu, Babak Samari, Howard Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Language Models, modern Vision Language, Language Models open, Vision Language, Language Models

Abstract:The powerful reasoning of modern Vision Language Models open a new frontier for advanced personalization study. However, progress in this area is critically hampered by the lack of suitable benchmarks. To address this gap, we introduce Life-Bench, a comprehensive, synthetically generated multimodal benchmark built on simulated user digital footprints. Life-Bench features over questions evaluating a wide spectrum of capabilities, from persona understanding to complex reasoning over historical data. These capabilities expand far beyond prior benchmarks, reflecting the critical demands essential for real-world applications. Furthermore, we propose LifeGraph, an end-to-end framework that organizes personal context into a knowledge graph to facilitate structured retrieval and reasoning. Our experiments on Life-Bench reveal that existing methods falter significantly on complex personalized tasks, exposing a large performance headroom, especially in relational, temporal and aggregative reasoning. While LifeGraph closes this gap by leveraging structured knowledge and demonstrates a promising direction, these advanced personalization tasks remain a critical open challenge, motivating new research in this area.

160. 【2602.18996】Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction

Link: https://arxiv.org/abs/2602.18996

Authors: Shannan Yan, Leqi Zheng, Keyu Lv, Jingchen Ni, Hongyang Wei, Jiajun Zhang, Guangting Wang, Jing Lyu, Chun Yuan, Fengyun Rao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: establishing object-level visual, object-level visual correspondence, study the task, task of establishing, establishing object-level

Comments: The paper has been accepted to CVPR 2026

Abstract:We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code is available at this https URL.

161. 【2602.18993】SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models

Link: https://arxiv.org/abs/2602.18993

Authors: Jiwoo Chung, Sangeek Hyun, MinKyu Lee, Byeongju Han, Geonho Cha, Dongyoon Wee, Youngjun Hong, Jae-Pil Heo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inherently sequential denoising, sequential denoising process, denoising process leads, slow inference, strong backbone

Comments: Accepted to CVPR 2026 Main. Project page: https://jiwoogit.github.io/SeaCache

Abstract:Diffusion models are a strong backbone for visual generation, but their inherently sequential denoising process leads to slow inference. Previous methods accelerate sampling by caching and reusing intermediate outputs based on feature distances between adjacent timesteps. However, existing caching strategies typically rely on raw feature differences that entangle content and noise. This design overlooks spectral evolution, where low-frequency structure appears early and high-frequency detail is refined later. We introduce Spectral-Evolution-Aware Cache (SeaCache), a training-free cache schedule that bases reuse decisions on a spectrally aligned representation. Through theoretical and empirical analysis, we derive a Spectral-Evolution-Aware (SEA) filter that preserves content-relevant components while suppressing noise. Employing SEA-filtered input features to estimate redundancy leads to dynamic schedules that adapt to content while respecting the spectral priors underlying the diffusion model. Extensive experiments on diverse visual generative models and the baselines show that SeaCache achieves state-of-the-art latency-quality trade-offs.

162. 【2602.18990】IDSelect: A RL-Based Cost-Aware Selection Agent for Video-based Multi-Modal Person Recognition

Link: https://arxiv.org/abs/2602.18990

Authors: Yuyang Ji, Yixuan Shen, Kien Nguyen, Lifeng Zhou, Feng Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieves robust identification, integrating face, robust identification, identification by integrating, Video-based person recognition

Abstract:Video-based person recognition achieves robust identification by integrating face, body, and gait. However, current systems waste computational resources by processing all modalities with fixed heavyweight ensembles regardless of input complexity. To address these limitations, we propose IDSelect, a reinforcement learning-based cost-aware selector that chooses one pre-trained model per modality per-sequence to optimize the accuracy-efficiency trade-off. Our key insight is that an input-conditioned selector can discover complementary model choices that surpass fixed ensembles while using substantially fewer resources. IDSelect trains a lightweight agent end-to-end using actor-critic reinforcement learning with budget-aware optimization. The reward balances recognition accuracy with computational cost, while entropy regularization prevents premature convergence. At inference, the policy selects the most probable model per modality and fuses modality-specific similarities for the final score. Extensive experiments on challenging video-based datasets demonstrate IDSelect's superior efficiency: on CCVID, it achieves 95.9% Rank-1 accuracy with 92.4% less computation than strong baselines while improving accuracy by 1.8%; on MEVID, it reduces computation by 41.3% while maintaining competitive performance.

163. 【2602.18977】Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding

Link: https://arxiv.org/abs/2602.18977

Authors: Thinesh Thiyakesan Ponbagavathi, Constantin Seibold, Alina Roitberg

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Adapting image-pretrained backbones, video typically relies, Adapting image-pretrained, single temporal scale, time-domain adapters tuned

Comments: Accepted to CVPR 2026 (Main Track)

Abstract: Adapting image-pretrained backbones to video typically relies on time-domain adapters tuned to a single temporal scale. Our experiments show that these modules pick up static image cues and very fast flicker changes, while overlooking medium-speed motion. Capturing dynamics across multiple time-scales is, however, crucial for fine-grained temporal analysis (e.g., opening vs. closing a bottle). To address this, we introduce Frame2Freq, a family of frequency-aware adapters that perform spectral encoding during image-to-video adaptation of pretrained Vision Foundation Models (VFMs), improving fine-grained action recognition. Frame2Freq uses Fast Fourier Transform (FFT) along time and learns frequency-band specific embeddings that adaptively highlight the most discriminative frequency ranges. Across five fine-grained activity recognition datasets, Frame2Freq outperforms prior PEFT methods and even surpasses fully fine-tuned models on four of them. These results provide encouraging evidence that frequency analysis methods are a powerful tool for modeling temporal dynamics in image-to-video transfer. Code is available at this https URL.
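
Reader note: the abstract describes taking an FFT along the time axis and working with frequency bands. The sketch below illustrates only that fixed, non-learned step and is not the authors' adapter: it transforms per-frame features along time, drops the DC bin (static content), and averages spectral magnitude within a few bands. The function name, the equal-size band split, and the choice of three bands are assumptions made for illustration; the learned band-specific embeddings from the paper are not modeled.

```python
import numpy as np


def temporal_band_energy(features, n_bands=3):
    """Mean spectral magnitude per frequency band.

    features: array of shape (T, D), one D-dimensional feature per frame.
    returns:  array of shape (n_bands, D), ordered from low to high frequency.
    """
    # Real FFT along the time axis: (T, D) -> (T // 2 + 1, D).
    spectrum = np.abs(np.fft.rfft(features, axis=0))
    # Drop the DC bin, which only captures the static (time-averaged) content.
    spectrum = spectrum[1:]
    # Split the remaining bins into contiguous low-to-high bands.
    bands = np.array_split(spectrum, n_bands, axis=0)
    return np.stack([band.mean(axis=0) for band in bands])


# Toy clip (hypothetical): 32 frames, three feature channels.
t = np.arange(32)
slow = np.sin(2 * np.pi * 2 * t / 32)    # 2 cycles per clip: slow motion
fast = np.sin(2 * np.pi * 14 * t / 32)   # 14 cycles per clip: fast flicker
static = np.ones(32)                     # no temporal change at all
features = np.stack([slow, fast, static], axis=1)
energy = temporal_band_energy(features)
```

In this toy example the slow channel concentrates its energy in the lowest band, the fast channel in the highest band, and the static channel contributes essentially nothing once the DC bin is removed.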

164. 【2602.18965】Face Presentation Attack Detection via Content-Adaptive Spatial Operators

Link: https://arxiv.org/abs/2602.18965

Authors: Shujaat Khan

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: securing facial authentication, Face presentation attack, Face presentation, presentation attack detection, authentication against print

Comments: 14 Pages, 8 Figures

Abstract: Face presentation attack detection (FacePAD) is critical for securing facial authentication against print, replay, and mask-based spoofing. This paper proposes CASO-PAD, an RGB-only, single-frame model that enhances MobileNetV3 with content-adaptive spatial operators (involution) to better capture localized spoof cues. Unlike spatially shared convolution kernels, the proposed operator generates location-specific, channel-shared kernels conditioned on the input, improving spatial selectivity with minimal overhead. CASO-PAD remains lightweight (3.6M parameters; 0.64 GFLOPs at $256\times256$) and is trained end-to-end using a standard binary cross-entropy objective. Extensive experiments on Replay-Attack, Replay-Mobile, ROSE-Youtu, and OULU-NPU demonstrate strong performance, achieving 100/100/98.9/99.7% test accuracy, AUC of 1.00/1.00/0.9995/0.9999, and HTER of 0.00/0.00/0.82/0.44%, respectively. On the large-scale SiW-Mv2 Protocol-1 benchmark, CASO-PAD further attains 95.45% accuracy with 3.11% HTER and 3.13% EER, indicating improved robustness under diverse real-world attacks. Ablation studies show that placing the adaptive operator near the network head and using moderate group sharing yields the best accuracy-efficiency balance. Overall, CASO-PAD provides a practical pathway for robust, on-device FacePAD with mobile-class compute and without auxiliary sensors or temporal stacks.

165. 【2602.18961】Depth-Enhanced YOLO-SAM2 Detection for Reliable Ballast Insufficiency Identification

Link: https://arxiv.org/abs/2602.18961

Authors: Shiyu Liu, Dylan Lester, Husnu Narman, Ammar Alzarrad, Pingping Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Systems and Control (eess.SY)

Keywords: detecting ballast insufficiency, framework for detecting, paper presents, RGB-D data, RGB-D data show

Comments: Submitted to the IEEE International Symposium on Robotic and Sensors Environments (ROSE) 2026

Abstract: This paper presents a depth-enhanced YOLO-SAM2 framework for detecting ballast insufficiency in railway tracks using RGB-D data. Although YOLOv8 provides reliable localization, the RGB-only model shows limited safety performance, achieving high precision (0.99) but low recall (0.49) for the insufficient-ballast class, as it tends to over-predict the sufficient class. To improve reliability, we incorporate depth-based geometric analysis enabled by a sleeper-aligned depth-correction pipeline that compensates for RealSense spatial distortion using polynomial modeling, RANSAC, and temporal smoothing. SAM2 segmentation further refines region-of-interest masks, enabling accurate extraction of sleeper and ballast profiles for geometric classification. Experiments on field-collected top-down RGB-D data show that depth-enhanced configurations substantially improve the detection of insufficient ballast. Depending on bounding-box sampling (AABB or RBB) and geometric criteria, recall increases from 0.49 to as high as 0.80, and F1-score improves from 0.66 to over 0.80. These results demonstrate that integrating depth correction with YOLO-SAM2 yields a more robust and reliable approach for automated railway ballast inspection, particularly in visually ambiguous or safety-critical scenarios.
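
Reader note: the baseline numbers in the abstract are mutually consistent, which can be checked with the standard F1 definition (the harmonic mean of precision and recall). The snippet below is only that arithmetic; the function and variable names are illustrative and nothing here comes from the paper's code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# RGB-only baseline reported in the abstract: precision 0.99, recall 0.49.
baseline_f1 = f1_score(0.99, 0.49)
print(round(baseline_f1, 2))  # 0.66
```

The same formula shows why the reported gain depends on more than recall: with recall at 0.80, F1 exceeds 0.80 only when precision is also above 0.80. The abstract does not state the precision of the depth-enhanced configurations.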

166. 【2602.18959】YOLOv10-Based Multi-Task Framework for Hand Localization and Laterality Classification in Surgical Videos

Link: https://arxiv.org/abs/2602.18959

Authors: Kedi Sun, Le Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: precise intraoperative decisions, Trauma THOMPSON Challenge, surgery is essential, essential for supporting, supporting rapid

Abstract: Real-time hand tracking in trauma surgery is essential for supporting rapid and precise intraoperative decisions. We propose a YOLOv10-based framework that simultaneously localizes hands and classifies their laterality (left or right) in complex surgical scenes. The model is trained on the Trauma THOMPSON Challenge 2025 Task 2 dataset, consisting of first-person surgical videos with annotated hand bounding boxes. Extensive data augmentation and a multi-task detection design improve robustness against motion blur, lighting variations, and diverse hand appearances. Evaluation demonstrates accurate left-hand (67%) and right-hand (71%) classification, while distinguishing hands from the background remains challenging. The model achieves an $mAP_{[0.5:0.95]}$ of 0.33 and maintains real-time inference, highlighting its potential for intraoperative deployment. This work establishes a foundation for advanced hand-instrument interaction analysis in emergency surgical procedures.

167. 【2602.18941】Global Commander and Local Operative: A Dual-Agent Framework for Scene Navigation

Link: https://arxiv.org/abs/2602.18941

Authors: Kaiming Jin, Yuefan Wu, Shengqiong Wu, Bobo Li, Shuicheng Yan, Tat-Seng Chua

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: embodied human-AI collaboration, follow natural language, natural language instructions, execute coherent action, coherent action sequences

Comments: 18 pages, 9 figures

Abstract:Vision-and-Language Scene navigation is a fundamental capability for embodied human-AI collaboration, requiring agents to follow natural language instructions to execute coherent action sequences in complex environments. Existing approaches either rely on multiple agents, incurring high coordination and resource costs, or adopt a single-agent paradigm, which overloads the agent with both global planning and local perception, often leading to degraded reasoning and instruction drift in long-horizon settings. To address these issues, we introduce DACo, a planning-grounding decoupled architecture that disentangles global deliberation from local grounding. Concretely, it employs a Global Commander for high-level strategic planning and a Local Operative for egocentric observing and fine-grained execution. By disentangling global reasoning from local action, DACo alleviates cognitive overload and improves long-horizon stability. The framework further integrates dynamic subgoal planning and adaptive replanning to enable structured and resilient navigation. Extensive evaluations on R2R, REVERIE, and R4R demonstrate that DACo achieves 4.9%, 6.5%, 5.4% absolute improvements over the best-performing baselines in zero-shot settings, and generalizes effectively across both closed-source (e.g., GPT-4o) and open-source (e.g., Qwen-VL Series) backbones. DACo provides a principled and extensible paradigm for robust long-horizon navigation. Project page: this https URL

168. 【2602.18936】CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion

Link: https://arxiv.org/abs/2602.18936

Authors: Yu Li, Yujun Cai, Chi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: synthesizing images based, Personalized image generation, Personalized image, requires effectively balancing, effectively balancing content

Abstract: Personalized image generation requires effectively balancing content fidelity with stylistic consistency when synthesizing images based on text and reference examples. Low-Rank Adaptation (LoRA) offers an efficient personalization approach, with potential for precise control through combining LoRA weights on different concepts. However, existing combination techniques face persistent challenges: entanglement between content and style representations, insufficient guidance for controlling elements' influence, and unstable weight fusion that often requires additional training. We address these limitations through CRAFT-LoRA, with complementary components: (1) rank-constrained backbone fine-tuning that injects low-rank projection residuals to encourage learning decoupled content and style subspaces; (2) a prompt-guided approach featuring an expert encoder with specialized branches that enables semantic extension and precise control through selective adapter aggregation; and (3) a training-free, timestep-dependent classifier-free guidance scheme that enhances generation stability by strategically adjusting noise predictions across diffusion steps. Our method significantly improves content-style disentanglement, enables flexible semantic control over LoRA module combinations, and achieves high-fidelity generation without additional retraining overhead.

169. 【2602.18907】DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation

Link: https://arxiv.org/abs/2602.18907

Authors: Yangchen Zeng

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: demonstrated remarkable scaling, remarkable scaling potential, reformulating item prediction, Recent generative recommendation, Recent generative

Abstract:Recent generative recommendation frameworks have demonstrated remarkable scaling potential by reformulating item prediction as autoregressive Semantic ID (SID) generation. However, existing methods primarily rely on shallow behavioral signals, encoding items solely through surface-level textual features such as titles and descriptions. This reliance results in a critical Shallow Interest problem: the model fails to capture the latent, semantically rich interests underlying user interactions, limiting both personalization depth and recommendation interpretability. DeepInterestGR introduces three key innovations: (1) Multi-LLM Interest Mining (MLIM): We leverage multiple frontier LLMs along with their multi-modal variants to extract deep textual and visual interest representations through Chain-of-Thought prompting. (2) Reward-Labeled Deep Interest (RLDI): We employ a lightweight binary classifier to assign reward labels to mined interests, enabling effective supervision signals for reinforcement learning. (3) Interest-Enhanced Item Discretization (IEID): The curated deep interests are encoded into semantic embeddings and quantized into SID tokens via RQ-VAE. We adopt a two-stage training pipeline: supervised fine-tuning aligns the generative model with deep interest signals and collaborative filtering patterns, followed by reinforcement learning with GRPO optimized by our Interest-Aware Reward. Experiments on three Amazon Review benchmarks demonstrate that DeepInterestGR consistently outperforms state-of-the-art baselines across HR@K and NDCG@K metrics.

170. 【2602.18906】Marginalized Bundle Adjustment: Multi-View Camera Pose from Monocular Depth Estimates

Link: https://arxiv.org/abs/2602.18906

Authors: Shengjie Zhu, Ahmed Abdelkader, Mark J. Matthews, Xiaoming Liu, Wen-Sheng Chu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Monocular Depth Estimation, recovering camera parameters, parameters and scene, scene geometry, Marginalized Bundle Adjustment

Abstract:Structure-from-Motion (SfM) is a fundamental 3D vision task for recovering camera parameters and scene geometry from multi-view images. While recent deep learning advances enable accurate Monocular Depth Estimation (MDE) from single images without depending on camera motion, integrating MDE into SfM remains a challenge. Unlike conventional triangulated sparse point clouds, MDE produces dense depth maps with significantly higher error variance. Inspired by modern RANSAC estimators, we propose Marginalized Bundle Adjustment (MBA) to mitigate MDE error variance leveraging its density. With MBA, we show that MDE depth maps are sufficiently accurate to yield SoTA or competitive results in SfM and camera relocalization tasks. Through extensive evaluations, we demonstrate consistently robust performance across varying scales, ranging from few-frame setups to large multi-view systems with thousands of images. Our method highlights the significant potential of MDE in multi-view 3D vision.

171. 【2602.18904】PCA-VAE: Differentiable Subspace Quantization without Codebook Collapse

Link: https://arxiv.org/abs/2602.18904

Authors: Hao Lu, Onur C. Koyun, Yongxin Guo, Zhengjie Zhu, Abbas Alili, Metin Nafi Gurcan

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vector-quantized autoencoders deliver, requires straight-through hacks, suffer inherent flaws, autoencoders deliver high-fidelity, Vector-quantized autoencoders

Abstract:Vector-quantized autoencoders deliver high-fidelity latents but suffer inherent flaws: the quantizer is non-differentiable, requires straight-through hacks, and is prone to collapse. We address these issues at the root by replacing VQ with a simple, principled, and fully differentiable alternative: an online PCA bottleneck trained via Oja's rule. The resulting model, PCA-VAE, learns an orthogonal, variance-ordered latent basis without codebooks, commitment losses, or lookup noise. Despite its simplicity, PCA-VAE exceeds VQ-GAN and SimVQ in reconstruction quality on CelebAHQ while using 10-100x fewer latent bits. It also produces naturally interpretable dimensions (e.g., pose, lighting, gender cues) without adversarial regularization or disentanglement objectives. These results suggest that PCA is a viable replacement for VQ: mathematically grounded, stable, bit-efficient, and semantically structured, offering a new direction for generative models beyond vector quantization.
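
Reader note: the abstract relies on Oja's rule for online PCA. The sketch below shows the classic single-component form of that rule, w <- w + lr * y * (x - y * w) with y = w . x, which drives w toward the leading principal direction of zero-mean data while keeping its norm close to 1. It is not the PCA-VAE bottleneck itself: the function name, learning rate, epoch count, and synthetic 2-D data are assumptions made for illustration, and the multi-component, variance-ordered basis described in the abstract is not reproduced here.

```python
import numpy as np


def oja_top_component(data, lr=0.002, epochs=5, seed=0):
    """Estimate the leading principal direction of zero-mean data with Oja's rule."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=data.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        # Visit the samples one at a time in random order (online updates).
        for x in rng.permutation(data):
            y = w @ x                    # projection of the sample onto w
            w += lr * y * (x - y * w)    # Hebbian term minus a norm-controlling decay
    return w


# Synthetic data (hypothetical): strong variance along the 45-degree diagonal.
data_rng = np.random.default_rng(1)
latent = data_rng.normal(size=(2000, 2)) * np.array([3.0, 0.5])
angle = np.pi / 4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
data = latent @ rotation.T
data -= data.mean(axis=0)

w = oja_top_component(data)
diagonal = np.array([1.0, 1.0]) / np.sqrt(2.0)
alignment = abs(w @ diagonal) / np.linalg.norm(w)
```

Because every step is an ordinary differentiable update on w, there is no codebook lookup involved, which is the property the abstract contrasts with vector quantization. The sign of the recovered direction is arbitrary, so the alignment is measured with an absolute value.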

172. 【2602.18903】SCHEMA for Gemini 3 Pro Image: A Structured Methodology for Controlled AI Image Generation on Google's Native Multimodal Model

Link: https://arxiv.org/abs/2602.18903

Authors: Luca Cazzaniga

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: Harmonized Engineered Modular, Google Gemini, paper presents SCHEMA, Pro Image, developed for Google

Comments: 24 pages, 8 tables. Based on SCHEMA Method v1.0 (deposited December 11, 2025). Previously published on Zenodo: https://doi.org/10.5281/zenodo.18721380

Abstract:This paper presents SCHEMA (Structured Components for Harmonized Engineered Modular Architecture), a structured prompt engineering methodology specifically developed for Google Gemini 3 Pro Image. Unlike generic prompt guidelines or model-agnostic tips, SCHEMA is an engineered framework built on systematic professional practice encompassing 850 verified API predictions within an estimated corpus of approximately 4,800 generated images, spanning six professional domains: real estate photography, commercial product photography, editorial content, storyboards, commercial campaigns, and information design. The methodology introduces a three-tier progressive system (BASE, MEDIO, AVANZATO) that scales practitioner control from exploratory (approximately 5%) to directive (approximately 95%), a modular label architecture with 7 core and 5 optional structured components, a decision tree with explicit routing rules to alternative tools, and systematically documented model limitations with corresponding workarounds. Key findings include an observed 91% Mandatory compliance rate and 94% Prohibitions compliance rate across 621 structured prompts, a comparative batch consistency test demonstrating substantially higher inter-generation coherence for structured prompts, independent practitioner validation (n=40), and a dedicated Information Design validation demonstrating 95% first-generation compliance for spatial and typographical control across approximately 300 publicly verifiable infographics. Previously published on Zenodo (doi:https://doi.org/10.5281/zenodo.18721380).

173. 【2602.18900】PrivacyBench: Privacy Isn't Free in Hybrid Privacy-Preserving Vision Systems

Link: https://arxiv.org/abs/2602.18900

Authors: Nnaemeka Obiefuna, Samuel Oyeneye, Similoluwa Odunaiya, Iremide Oyelaja, Steven Kolawole

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep learning applications, increasingly require combining, preserving machine learning, sensitive deep learning, require combining multiple

Comments: 20 pages, 2 figures

Abstract:Privacy-preserving machine learning deployments in sensitive deep learning applications, from medical imaging to autonomous systems, increasingly require combining multiple techniques. Yet practitioners lack systematic guidance to assess the synergistic and non-additive interactions of these hybrid configurations, relying instead on isolated technique analysis that misses critical system-level interactions. We introduce PrivacyBench, a benchmarking framework that reveals striking failures in privacy technique combinations with severe deployment implications. Through systematic evaluation across ResNet18 and ViT models on medical datasets, we uncover that FL + DP combinations exhibit severe convergence failure, with accuracy dropping from 98% to 13% while compute costs and energy consumption substantially increase. In contrast, FL + SMPC maintains near-baseline performance with modest overhead. Our framework provides the first systematic platform for evaluating privacy-utility-cost trade-offs through automated YAML configuration, resource monitoring, and reproducible experimental protocols. PrivacyBench enables practitioners to identify problematic technique interactions before deployment, moving privacy-preserving computer vision from ad-hoc evaluation toward principled systems design. These findings demonstrate that privacy techniques cannot be composed arbitrarily and provide critical guidance for robust deployment in resource-constrained environments.

174. 【2602.18896】Beyond Stationarity: Rethinking Codebook Collapse in Vector Quantization

Link: https://arxiv.org/abs/2602.18896

Authors: Hao Lu, Onur C. Koyun, Yongxin Guo, Zhengjie Zhu, Abbas Alili, Metin Nafi Gurcan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vector Quantization, Transformer-based Vector Quantization, latent diffusion models, modern generative frameworks, Non-Stationary Vector Quantization

Abstract:Vector Quantization (VQ) underpins many modern generative frameworks such as VQ-VAE, VQ-GAN, and latent diffusion models. Yet, it suffers from the persistent problem of codebook collapse, where a large fraction of code vectors remains unused during training. This work provides a new theoretical explanation by identifying the nonstationary nature of encoder updates as the fundamental cause of this phenomenon. We show that as the encoder drifts, unselected code vectors fail to receive updates and gradually become inactive. To address this, we propose two new methods: Non-Stationary Vector Quantization (NSVQ), which propagates encoder drift to non-selected codes through a kernel-based rule, and Transformer-based Vector Quantization (TransVQ), which employs a lightweight mapping to adaptively transform the entire codebook while preserving convergence to the k-means solution. Experiments on the CelebA-HQ dataset demonstrate that both methods achieve near-complete codebook utilization and superior reconstruction quality compared to baseline VQ variants, providing a principled and scalable foundation for future VQ-based generative models. The code is available at this https URL (LAB- WFUSM/NSVQ).
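To make "codebook collapse" and "codebook utilization" concrete, here is a minimal sketch of how utilization can be measured under plain nearest-neighbor assignment. It is not the paper's NSVQ or TransVQ method. The function names, the toy latents, and the choice of placing half the codes far from the data are all hypothetical, picked only to mimic codes that the encoder has drifted away from.

```python
import random


def nearest_code(z, codebook):
    """Return the index of the code vector closest to z (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(z, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def codebook_utilization(latents, codebook):
    """Fraction of code vectors that are selected by at least one latent."""
    used = {nearest_code(z, codebook) for z in latents}
    return len(used) / len(codebook)


rng = random.Random(0)
# Hypothetical toy data: encoder outputs cluster around the origin.
latents = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(500)]
# Eight codes near the data, eight codes far away (never the nearest neighbor).
near_codes = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(8)]
far_codes = [[100.0 + i, 100.0] for i in range(8)]

utilization = codebook_utilization(latents, near_codes + far_codes)
print(f"codebook utilization: {utilization:.2f}")
```

Because the far codes are never selected, they would also never receive an update in a standard VQ training loop, which is the failure mode the abstract attributes to encoder drift.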

175. 【2602.18887】SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World

Link: https://arxiv.org/abs/2602.18887

Authors: Jungho Kim, Jiyong Oh, Seunghoon Yu, Hongjae Shin, Donghyuk Kwak, Jun Won Choi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: maps sensor inputs, sensor inputs directly, recently attracted significant, attracted significant attention, significant attention due

Comments: Accepted to CVPR 2026, 19 pages, 9 figures

Abstract:The end-to-end (E2E) paradigm, which maps sensor inputs directly to driving decisions, has recently attracted significant attention due to its unified modeling capability and scalability. However, ensuring safety in this unified framework remains one of the most critical challenges. In this work, we propose SafeDrive, an E2E planning framework designed to perform explicit and interpretable safety reasoning through a trajectory-conditioned Sparse World Model. SafeDrive comprises two complementary networks: the Sparse World Network (SWNet) and the Fine-grained Reasoning Network (FRNet). SWNet constructs trajectory-conditioned sparse worlds that simulate the future behaviors of critical dynamic agents and road entities, providing interaction-centric representations for downstream reasoning. FRNet then evaluates agent-specific collision risks and temporal adherence to drivable regions, enabling precise identification of safety-critical events across future timesteps. SafeDrive achieves state-of-the-art performance on both open-loop and closed-loop benchmarks. On NAVSIM, it records a PDMS of 91.6 and an EPDMS of 87.5, with only 61 collisions out of 12,146 scenarios (0.5%). On Bench2Drive, SafeDrive attains a 66.8% driving score.

176. 【2602.18886】PhysConvex: Physics-Informed 3D Dynamic Convex Radiance Fields for Reconstruction and Simulation

Link: https://arxiv.org/abs/2602.18886

Authors: Dan Wang, Xinrui Cui, Serge Belongie, Ravi Ramamoorthi

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Reconstructing and simulating, physical consistency remains, fundamental challenge, consistency remains, remains a fundamental

Abstract:Reconstructing and simulating dynamic 3D scenes with both visual realism and physical consistency remains a fundamental challenge. Existing neural representations, such as NeRFs and 3DGS, excel in appearance reconstruction but struggle to capture complex material deformation and dynamics. We propose PhysConvex, a Physics-informed 3D Dynamic Convex Radiance Field that unifies visual rendering and physical simulation. PhysConvex represents deformable radiance fields using physically grounded convex primitives governed by continuum mechanics. We introduce a boundary-driven dynamic convex representation that models deformation through vertex and surface dynamics, capturing spatially adaptive, non-uniform deformation, and evolving boundaries. To efficiently simulate complex geometries and heterogeneous materials, we further develop a reduced-order convex simulation that advects dynamic convex fields using neural skinning eigenmodes as shape- and material-aware deformation bases with time-varying reduced DOFs under Newtonian dynamics. Convex dynamics also offers compact, gap-free volumetric coverage, enhancing both geometric efficiency and simulation fidelity. Experiments demonstrate that PhysConvex achieves high-fidelity reconstruction of geometry, appearance, and physical properties from videos, outperforming existing methods.

177. 【2602.18882】SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

Link: https://arxiv.org/abs/2602.18882

Authors: Mohammad Asim, Christopher Wewer, Jan Eric Lenssen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: present SceneTok, compressed and diffusable, encoding view sets, diffusable set, scene

Comments: Project website: [this https URL](https://geometric-rl.mpi-inf.mpg.de/scenetok/)

Abstract:We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation-invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by employing a light-weight rectified flow decoder. We show that the compression is 1-3 orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including ones deviating from the input trajectory, and we show that the decoder gracefully handles uncertainty. Finally, the highly-compressed set of unstructured latent scene tokens enables simple and efficient scene generation in 5 seconds, achieving a much better quality-speed trade-off than previous paradigms.

178. 【2602.18880】FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model

Link: https://arxiv.org/abs/2602.18880

Authors: Zhou Liu, Tonghua Su, Hongshi Zhang, Fuxiang Yang, Donglin Di, Yang Song, Lei Fan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: pose significant challenges, image tampering techniques, digital forensics, generative models, pose significant

Abstract:Advances in image tampering techniques, particularly generative models, pose significant challenges to media verification, digital forensics, and public trust. Existing image forgery detection and localization (IFDL) methods suffer from two key limitations: over-reliance on semantic content while neglecting textural cues, and limited interpretability of subtle low-level tampering traces. To address these issues, we propose FOCA, a multimodal large language model-based framework that integrates discriminative features from both the RGB spatial and frequency domains via a cross-attention fusion module. This design enables accurate forgery detection and localization while providing explicit, human-interpretable cross-domain explanations. We further introduce FSE-Set, a large-scale dataset with diverse authentic and tampered images, pixel-level masks, and dual-domain annotations. Extensive experiments show that FOCA outperforms state-of-the-art methods in detection performance and interpretability across both spatial and frequency domains.

179. 【2602.18874】Structure-Level Disentangled Diffusion for Few-Shot Chinese Font Generation

Link: https://arxiv.org/abs/2602.18874

Authors: Jie Li, Suorong Yang, Jian Zhao, Furao Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Few-shot Chinese font, Chinese font generation, Few-shot Chinese, font generation aims, Chinese font

Abstract:Few-shot Chinese font generation aims to synthesize new characters in a target style using only a handful of reference images. Achieving accurate content rendering and faithful style transfer requires effective disentanglement between content and style. However, existing approaches achieve only feature-level disentanglement, allowing the generator to re-entangle these features, leading to content distortion and degraded style fidelity. We propose the Structure-Level Disentangled Diffusion Model (SLD-Font), which receives content and style information from two separate channels. SimSun-style images are used as content templates and concatenated with noisy latent features as the input. Style features extracted by a CLIP model from target-style images are integrated via cross-attention. Additionally, we train a Background Noise Removal module in the pixel space to remove background noise in complex stroke regions. Based on theoretical validation of disentanglement effectiveness, we introduce a parameter-efficient fine-tuning strategy that updates only the style-related modules. This allows the model to better adapt to new styles while avoiding overfitting to the reference images' content. We further introduce the Grey and OCR metrics to evaluate the content quality of generated characters. Experimental results show that SLD-Font achieves significantly higher style fidelity while maintaining comparable content accuracy to existing state-of-the-art methods.

180. 【2602.18873】BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation

Link: https://arxiv.org/abs/2602.18873

Authors: Miaowei Wang, Qingxuan Yan, Zhi Cao, Yayuan Li, Oisin Mac Aodha, Jason J. Corso, Amir Vaxman

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: descriptions remains challenging, textual descriptions remains, Text-guided dynamic, faithfully reflects rich, reflects rich textual

Comments: CVPR 2026 Accepted with Scores 5,5,5

Abstract:Text-guided dynamic 3D character generation has advanced rapidly, yet producing high-quality motion that faithfully reflects rich textual descriptions remains challenging. Existing methods tend to generate limited sub-actions or incoherent motion due to fixed-length temporal inputs and discrete frame-wise representations that fail to capture rich motion semantics. We address these limitations by representing motion with continuous differentiable B-spline curves, enabling more effective motion generation without modifying the capabilities of the underlying generative model. Specifically, our closed-form, Laplacian-regularized B-spline solver efficiently compresses variable-length motion sequences into compact representations with a fixed number of control points. Further, we introduce a normal-fusion strategy for input shape adherence along with correspondence-aware and local-rigidity losses for motion-restoration quality. To train our model, we collate BIMO, a new dataset containing diverse variable-length 3D motion sequences with rich, high-quality text annotations. Extensive evaluations show that our feed-forward framework BiMotion generates more expressive, higher-quality, and better prompt-aligned motions than existing state-of-the-art methods, while also achieving faster generation. Our project page is at: this https URL.

181. 【2602.18869】Enhancing 3D LiDAR Segmentation by Shaping Dense and Accurate 2D Semantic Predictions

Link: https://arxiv.org/abs/2602.18869

Authors: Xiaoyu Dong, Tiankui Xian, Wanshui Gan, Naoto Yokoya

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world street environments, urban remote sensing, understanding real-world street, LiDAR point clouds, point clouds

Abstract:Semantic segmentation of 3D LiDAR point clouds is important in urban remote sensing for understanding real-world street environments. This task, by projecting LiDAR point clouds and 3D semantic labels as sparse maps, can be reformulated as a 2D problem. However, the intrinsic sparsity of the projected LiDAR and label maps can result in sparse and inaccurate intermediate 2D semantic predictions, which in turn limits the final 3D accuracy. To address this issue, we enhance this task by shaping dense and accurate 2D predictions. Specifically, we develop a multi-modal segmentation model, MM2D3D. By leveraging camera images as auxiliary data, we introduce cross-modal guided filtering to overcome label map sparsity by constraining intermediate 2D semantic predictions with dense semantic relations derived from the camera images; and we introduce dynamic cross pseudo supervision to overcome LiDAR map sparsity by encouraging the 2D predictions to emulate the dense distribution of the semantic predictions from the camera images. Experiments show that our techniques enable our model to achieve intermediate 2D semantic predictions with dense distribution and higher accuracy, which effectively enhances the final 3D accuracy. Comparisons with previous methods demonstrate our superior performance in both 2D and 3D spaces.

182. 【2602.18867】Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning

Link: https://arxiv.org/abs/2602.18867

Authors: Zhuofan Xie, Zishan Lin, Jinliang Lin, Jie Qi, Shaohua Hong, Shuo Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Active Learning, reduces annotation costs, data are scarce, labeled data, Active

Comments: 8 pages, 5 figures, Accepted to CVPR 2026 (to appear)

Abstract:Active Learning (AL) reduces annotation costs in medical imaging by selecting only the most informative samples for labeling, but suffers from cold-start when labeled data are scarce. Vision-Language Models (VLMs) address the cold-start problem via zero-shot predictions, yet their temperature-scaled softmax outputs treat text-image similarities as deterministic scores while ignoring inherent uncertainty, leading to overconfidence. This overconfidence misleads sample selection, wasting annotation budgets on uninformative cases. To overcome these limitations, the Similarity-as-Evidence (SaE) framework calibrates text-image similarities by introducing a Similarity Evidence Head (SEH), which reinterprets the similarity vector as evidence and parameterizes a Dirichlet distribution over labels. In contrast to a standard softmax that enforces confident predictions even under weak signals, the Dirichlet formulation explicitly quantifies lack of evidence (vacuity) and conflicting evidence (dissonance), thereby mitigating overconfidence caused by rigid softmax normalization. Building on this, SaE employs a dual-factor acquisition strategy: high-vacuity samples (e.g., rare diseases) are prioritized in early rounds to ensure coverage, while high-dissonance samples (e.g., ambiguous diagnoses) are prioritized later to refine boundaries, providing clinically interpretable selection rationales. Experiments on ten public medical imaging datasets with a 20% label budget show that SaE attains state-of-the-art macro-averaged accuracy of 82.57%. On the representative BTMRI dataset, SaE also achieves superior calibration, with a negative log-likelihood (NLL) of 0.425.
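The vacuity and dissonance quantities named in the abstract have standard definitions in subjective logic, and the sketch below computes them from a non-negative evidence vector. It is an illustration under stated assumptions, not the paper's Similarity Evidence Head. The softplus mapping in `similarity_to_evidence`, its `scale` parameter, and the linear round-based weighting in `acquisition_score` are hypothetical stand-ins for details the abstract does not give.

```python
import math


def similarity_to_evidence(sims, scale=10.0):
    """Hypothetical mapping: softplus of scaled similarities yields non-negative evidence."""
    return [math.log1p(math.exp(scale * s)) for s in sims]


def dirichlet_uncertainty(evidence):
    """Return (expected class probabilities, vacuity, dissonance) for Dirichlet(alpha = evidence + 1)."""
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    belief = [e / strength for e in evidence]
    vacuity = num_classes / strength  # high when total evidence is small
    dissonance = 0.0  # high when several classes hold similar, strong belief
    for i, b_i in enumerate(belief):
        others = [b_j for j, b_j in enumerate(belief) if j != i]
        total = sum(others)
        if b_i == 0.0 or total == 0.0:
            continue
        balance = sum(
            b_j * (1.0 - abs(b_j - b_i) / (b_j + b_i)) for b_j in others if b_j > 0.0
        )
        dissonance += b_i * balance / total
    probs = [a / strength for a in alpha]
    return probs, vacuity, dissonance


def acquisition_score(vacuity, dissonance, round_idx, num_rounds):
    """Hypothetical schedule: weight vacuity in early rounds and dissonance in later rounds."""
    w = round_idx / max(num_rounds - 1, 1)
    return (1.0 - w) * vacuity + w * dissonance


# Weak signal for every class gives high vacuity; two strong rivals give high dissonance.
_, vac_weak, dis_weak = dirichlet_uncertainty([0.0, 0.0, 0.0])
_, vac_conflict, dis_conflict = dirichlet_uncertainty([50.0, 50.0, 0.0])
```

Beliefs and vacuity sum to one by construction, so a sample cannot be both evidence-free and strongly conflicted, which is what makes the two factors usable as separate selection signals.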

183. 【2602.18861】Joint Post-Training Quantization of Vision Transformers with Learned Prompt-Guided Data Generation

Link: https://arxiv.org/abs/2602.18861

Authors: Shile Li, Markus Karmann, Onay Urfalioglu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Transformers trained, Vision Transformers, quantization of Vision, Transformers trained, joint quantization

Abstract:We present a framework for end-to-end joint quantization of Vision Transformers trained on ImageNet for the purpose of image classification. Unlike prior post-training or block-wise reconstruction methods, we jointly optimize over the entire set of all layers and inter-block dependencies without any labeled data, scaling effectively with the number of samples and completing in just one hour on a single GPU for ViT-small. We achieve state-of-the-art W4A4 and W3A3 accuracies on ImageNet and, to the best of our knowledge, the first PTQ results that maintain strong accuracy on ViT, DeiT, and Swin-T models under extremely low-bit settings (W1.58A8), demonstrating the potential for efficient edge deployment. Furthermore, we introduce a data-free calibration strategy that synthesizes diverse, label-free samples using Stable Diffusion Turbo guided by learned multi-mode prompts. By encouraging diversity in both the learned prompt embeddings and the generated image features, our data-free approach achieves performance on par with real-data ImageNet calibration and surpasses simple text-prompt baselines such as "a adjective photo of adjective cls".

184. 【2602.18858】Hyperbolic Busemann Neural Networks

Link: https://arxiv.org/abs/2602.18858

Authors: Ziheng Chen, Bernhard Schölkopf, Nicu Sebe

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: exponential volume growth, tree-structured data due, Multinomial Logistic Regression, volume growth, natural geometry

Comments: Accepted to CVPR 2026

Abstract:Hyperbolic spaces provide a natural geometry for representing hierarchical and tree-structured data due to their exponential volume growth. To leverage these benefits, neural networks require intrinsic and efficient components that operate directly in hyperbolic space. In this work, we lift two core components of neural networks, Multinomial Logistic Regression (MLR) and Fully Connected (FC) layers, into hyperbolic space via Busemann functions, resulting in Busemann MLR (BMLR) and Busemann FC (BFC) layers with a unified mathematical interpretation. BMLR provides compact parameters, a point-to-horosphere distance interpretation, batch-efficient computation, and a Euclidean limit, while BFC generalizes FC and activation layers with comparable complexity. Experiments on image classification, genome sequence learning, node classification, and link prediction demonstrate improvements in effectiveness and efficiency over prior hyperbolic layers. The code is available at this https URL.
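For readers unfamiliar with Busemann functions, the sketch below evaluates the standard closed form on the unit Poincaré ball with curvature -1, B_ω(x) = log(‖ω − x‖² / (1 − ‖x‖²)), where ω is an ideal point on the boundary. It shows only this building block. How the paper turns it into BMLR logits or BFC layers is not reproduced here, and the function name and 2D example points are hypothetical.

```python
import math


def busemann_poincare(x, omega):
    """Busemann function of ideal point omega (unit norm) at x inside the unit Poincare ball."""
    sq_norm_x = sum(v * v for v in x)
    if sq_norm_x >= 1.0:
        raise ValueError("x must lie strictly inside the unit ball")
    sq_dist = sum((w - v) ** 2 for w, v in zip(omega, x))
    return math.log(sq_dist / (1.0 - sq_norm_x))


omega = [1.0, 0.0]

# At the origin the value is zero.
at_origin = busemann_poincare([0.0, 0.0], omega)

# A point at hyperbolic distance t from the origin along the geodesic toward omega
# has Euclidean radius tanh(t / 2), and the Busemann value there is -t.
t = 2.0
toward_omega = busemann_poincare([math.tanh(t / 2.0), 0.0], omega)  # approximately -2.0
```

Level sets of this function are horospheres, which is the geometric object behind the point-to-horosphere distance interpretation mentioned in the abstract.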

185. 【2602.18853】Open-Vocabulary Domain Generalization in Urban-Scene Segmentation

Link: https://arxiv.org/abs/2602.18853

Authors: Dong Zhao, Qi Zang, Nan Pu, Wenjing Li, Nicu Sebe, Zhun Zhong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Semantic Segmentation, enable segmentation models, Open-Vocabulary Semantic Segmentation, aims to enable, perform robustly

Abstract:Domain Generalization in Semantic Segmentation (DG-SS) aims to enable segmentation models to perform robustly in unseen environments. However, conventional DG-SS methods are restricted to a fixed set of known categories, limiting their applicability in open-world scenarios. Recent progress in Vision-Language Models (VLMs) has advanced Open-Vocabulary Semantic Segmentation (OV-SS) by enabling models to recognize a broader range of concepts. Yet, these models remain sensitive to domain shifts and struggle to maintain robustness when deployed in unseen environments, a challenge that is particularly severe in urban-driving scenarios. To bridge this gap, we introduce Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), a new setting that jointly addresses unseen domains and unseen categories. We introduce the first benchmark for OVDG-SS in autonomous driving, addressing a previously unexplored problem and covering both synthetic-to-real and real-to-real generalization across diverse unseen domains and unseen categories. In OVDG-SS, we observe that domain shifts often distort text-image correlations in pre-trained VLMs, which hinders the performance of OV-SS models. To tackle this challenge, we propose S2-Corr, a state-space-driven text-image correlation refinement mechanism that mitigates domain-induced distortions and produces more consistent text-image correlations under distribution changes. Extensive experiments on our constructed benchmark demonstrate that the proposed method achieves superior cross-domain performance and efficiency compared to existing OV-SS approaches.

186. 【2602.18846】DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

Link: https://arxiv.org/abs/2602.18846

Authors: Aditya Kumar Singh, Hitesh Kandala, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: achieved remarkable multimodal, remarkable multimodal understanding, remain computationally expensive, computationally expensive due, Vision-language models

Comments: 15 Pages, 8 figures, 15 tables, CVPR 2026; Code: [this https URL](https://github.com/AMD-AGI/DUET-VLM)

Abstract:Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only redundancy aware compression of vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains 97% even at 89% reduction. With this dual-stage compression during training, it achieves 99.7% accuracy at 67% and 97.6% at 89%, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline -- achieving 100% accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% setting. These results highlight end-to-end training with DUET-VLM, enabling robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget. Our code is available at this https URL.

187. 【2602.18845】Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs

Link: https://arxiv.org/abs/2602.18845

Authors: Chengwei Xia, Fan Ma, Ruijie Quan, Yunqiu Xu, Kun Zhan, Yi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: intellectual property protection, raising significant concerns, multimodal large language, large language models, model version attribution

Comments: Accepted to CVPR 2026!

Abstract:With the rapid deployment and widespread adoption of multimodal large language models (MLLMs), disputes regarding model version attribution and ownership have become increasingly frequent, raising significant concerns about intellectual property protection. In this paper, we propose a framework for generating copyright triggers for MLLMs, enabling model publishers to embed verifiable ownership information into the model. The goal is to construct trigger images that elicit ownership-related textual responses exclusively in fine-tuned derivatives of the original model, while remaining inert in other non-derivative models. Our method constructs a tracking trigger image by treating the image as a learnable tensor, performing adversarial optimization with dual-injection of ownership-relevant semantic information. The first injection is achieved by enforcing textual consistency between the output of an auxiliary MLLM and a predefined ownership-relevant target text; the consistency loss is backpropagated to inject this ownership-related information into the image. The second injection is performed at the semantic-level by minimizing the distance between the CLIP features of the image and those of the target text. Furthermore, we introduce an additional adversarial training stage involving the auxiliary model derived from the original model itself. This auxiliary model is specifically trained to resist generating ownership-relevant target text, thereby enhancing robustness in heavily fine-tuned derivative models. Extensive experiments demonstrate the effectiveness of our dual-injection approach in tracking model lineage under various fine-tuning and domain-shift scenarios.

188. 【2602.18842】Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification

Link: https://arxiv.org/abs/2602.18842

Authors: Jiangling Zhang, Shuxuan Gao, Bofan Liu, Siqiang Feng, Jirui Huang, Yaxiong Chen, Ziyu Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: highly realistic AI-generated, poses critical challenges, demanding precise pixel-level, realistic AI-generated images, AI-generated images poses

Comments: Accepted to CVPR 2026

Abstract:The proliferation of highly realistic AI-generated images poses critical challenges for digital forensics, demanding precise pixel-level localization of manipulated regions. Existing methods predominantly learn discriminative patterns of specific forgeries and often struggle with novel manipulations as editing techniques continue to evolve. We propose the Iterative Forgery Amplifier Network (IFA-Net), which shifts from learning "what is fake" to modeling "what is real". Grounded in the principle that all manipulations deviate from the natural image manifold, IFA-Net leverages a frozen Masked Autoencoder (MAE) pretrained on real images as a universal realness prior. Our framework operates through a two-stage closed-loop process: an initial Dual-Stream Segmentation Network (DSSN) fuses the original image with MAE reconstruction residuals for coarse localization, followed by a Task-Adaptive Prior Injection (TAPI) module that converts this coarse prediction into guiding prompts to steer the MAE decoder and amplify reconstruction failures in suspicious regions for precise refinement. Extensive experiments on four diffusion-based inpainting benchmarks show that IFA-Net achieves an average improvement of 6.5% in IoU and 8.1% in F1-score over the second-best method, while demonstrating strong generalization to traditional manipulation types.

189. 【2602.18833】CLAP Convolutional Lightweight Autoencoder for Plant Disease Classification

Link: https://arxiv.org/abs/2602.18833

Authors: Asish Bera, Subhajit Roy, Sudiptendu Banerjee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: nutrition deficiency prediction, distinguishing plant diseases, severity grading, Convolutional neural networks, remarkably progressed

Abstract:Convolutional neural networks have markedly improved performance in distinguishing plant diseases, grading severity, and predicting nutrition deficiency from leaf images. However, these tasks become more challenging under realistic in-situ field conditions. Often, a traditional machine learning model may fail to capture and interpret discriminative characteristics of plant health, growth, and diseases due to subtle variations within leaf subcategories. A few deep learning methods have used additional preprocessing stages or network modules to address the problem, whereas several other methods have utilized pre-trained backbone CNNs, most of which are computationally intensive. Therefore, to address the challenge, we propose a lightweight autoencoder using separable convolutional layers in its encoder-decoder blocks. Sigmoid gating is applied to refine the encoder's feature discriminability, which is improved further by the decoder. Finally, the feature maps of the encoder and decoder are combined for a rich feature representation before classification. The proposed Convolutional Lightweight Autoencoder for Plant disease classification, called CLAP, has been evaluated on three public plant datasets consisting of cassava, tomato, maize, groundnut, grapes, etc. for determining plant health conditions. CLAP attains improved or competitive accuracies on the Integrated Plant Disease, Groundnut, and CCMT datasets, balancing performance against a small computational cost of 5 million parameters. The training time is 20 milliseconds and inference time is 1 ms per image.

190. 【2602.18831】IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbation

Link: https://arxiv.org/abs/2602.18831

Authors: Fadi Boutros, Eduarda Caldeira, Tahar Chettaoui, Naser Damer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: legal concerns increasingly, concerns increasingly restrict, real biometric data, practical alternative, alternative to authentic

Comments: Accepted at CVPR 2026

Abstract:Synthetic data has emerged as a practical alternative to authentic face datasets for training face recognition (FR) systems, especially as privacy and legal concerns increasingly restrict the use of real biometric data. Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images. However, many of these models suffer from limited intra-class variation, an essential property for training robust and generalizable FR models. In this work, we propose IDPERTURB, a simple yet effective geometric-driven sampling strategy to enhance diversity in synthetic face generation. IDPERTURB perturbs identity embeddings within a constrained angular region of the unit hyper-sphere, producing a diverse set of embeddings without modifying the underlying generative model. Each perturbed embedding serves as a conditioning vector for a pre-trained diffusion model, enabling the synthesis of visually varied yet identity-coherent face images suitable for training generalizable FR systems. Empirical results demonstrate that training FR on datasets generated using IDPERTURB yields improved performance across multiple FR benchmarks, compared to existing synthetic data generation approaches.
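One way to perturb an embedding "within a constrained angular region of the unit hyper-sphere" is to rotate it toward a random tangent direction by a bounded angle. The sketch below does that. It is not taken from the paper: the helper names, the uniform sampling of the angle, and the Gaussian choice of tangent direction are hypothetical. The perturbed vector would then serve as the conditioning input to a pre-trained identity-conditional generator, which is not shown.

```python
import math
import random


def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def perturb_identity(embedding, max_angle_rad, rng):
    """Return a unit vector whose angle to the normalized embedding is at most max_angle_rad."""
    e = normalize(embedding)
    # Random direction, made orthogonal to e so it lies in the tangent space at e.
    noise = [rng.gauss(0.0, 1.0) for _ in e]
    proj = sum(n * a for n, a in zip(noise, e))
    tangent = normalize([n - proj * a for n, a in zip(noise, e)])
    # Assumption: the rotation angle is drawn uniformly from [0, max_angle_rad].
    theta = rng.uniform(0.0, max_angle_rad)
    return [math.cos(theta) * a + math.sin(theta) * t for a, t in zip(e, tangent)]


rng = random.Random(0)
identity = [0.3, -1.2, 0.5, 2.0]  # toy 4-D stand-in for a face embedding
variants = [perturb_identity(identity, math.radians(30.0), rng) for _ in range(5)]
```

Because e and the tangent direction are orthonormal, every variant has unit norm and its cosine similarity to the original equals cos(theta), so the angular bound directly controls how far the conditioning vector can move from the identity.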

191. 【2602.18830】Spatial-Temporal State Propagation Autoregressive Model for 4D Object Generation

Link: https://arxiv.org/abs/2602.18830

Authors: Liying Yang, Jialun Liu, Jiakui Hu, Chenhao Guan, Haibin Huang, Fangqiu Yi, Chi Zhang, Yanyan Liang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generating high-quality, Spatial-Temporal State Propagation, State Propagation AutoRegressive, spatial-temporal, Propagation AutoRegressive Model

Abstract:Generating high-quality 4D objects with spatial-temporal consistency is still formidable. Existing diffusion-based methods often struggle with spatial-temporal inconsistency, as they fail to leverage outputs from all previous timesteps to guide the generation at the current timestep. Therefore, we propose a Spatial-Temporal State Propagation AutoRegressive Model (4DSTAR), which generates 4D objects maintaining temporal-spatial consistency. 4DSTAR formulates the generation problem as the prediction of tokens that represent the 4D object. It consists of two key components: (1) The dynamic spatial-temporal state propagation autoregressive model (STAR) is proposed, which achieves spatial-temporal consistent generation. Unlike standard autoregressive models, STAR divides prediction tokens into groups based on timesteps. It models long-term dependencies by propagating spatial-temporal states from previous groups and utilizes these dependencies to guide generation at the next timestep. To this end, a spatial-temporal container is proposed, which dynamically updates the effective spatial-temporal state features from all historical groups; the updated features then serve as conditional features to guide the prediction of the next token group. (2) The 4D VQ-VAE is proposed, which implicitly encodes the 4D structure into discrete space and decodes the discrete tokens predicted by STAR into temporally coherent dynamic 3D Gaussians. Experiments demonstrate that 4DSTAR generates spatial-temporal consistent 4D objects, and achieves performance competitive with diffusion models.

192. 【2602.18825】Bayesian Lottery Ticket Hypothesis

Link: https://arxiv.org/abs/2602.18825

Authors: Nicholas Kuhn, Arvid Weyrauch, Lars Heyen, Achim Streit, Markus Götz, Charlotte Debus

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: conventional neural networks, Bayesian neural networks, neural networks, Lottery Ticket Hypothesis, conventional neural

Abstract:Bayesian neural networks (BNNs) are a useful tool for uncertainty quantification, but require substantially more computational resources than conventional neural networks. For non-Bayesian networks, the Lottery Ticket Hypothesis (LTH) posits the existence of sparse subnetworks that can train to the same accuracy as the original dense network, or even surpass it. Such sparse networks can lower the demand for computational resources at inference and during training. If the LTH holds and corresponding sparse subnetworks exist in BNNs, this could motivate the development of sparse training algorithms and provide valuable insights into the underlying training process. Towards this end, we translate the LTH experiments to a Bayesian setting using common computer vision models. We investigate the defining characteristics of Bayesian lottery tickets, and extend our study towards a transplantation method connecting BNNs with deterministic Lottery Tickets. We generally find that the LTH holds in BNNs, and winning tickets of matching and surpassing accuracy are present independent of model size, with degradation at very high sparsities. However, the pruning strategy should rely primarily on magnitude, secondly on standard deviation. Furthermore, our results demonstrate that models rely on mask structure and weight initialization to varying degrees.

193. 【2602.18822】Robust Self-Supervised Cross-Modal Super-Resolution against Real-World Misaligned Observations

Link: https://arxiv.org/abs/2602.18822

Authors: Xiaoyu Dong, Jiahuan Li, Ziteng Cui, Naoto Yokoya

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex spatial misalignments, study cross-modal super-resolution, guide image pairs, real-world misaligned data, number of low-resolution

Abstract:We study cross-modal super-resolution (SR) on real-world misaligned data, where only a limited number of low-resolution (LR) source and high-resolution (HR) guide image pairs with complex spatial misalignments are available. To address this challenge, we propose RobSelf, a fully self-supervised model that is optimized online, requiring no training data, ground-truth supervision, or pre-alignment. RobSelf features two key techniques: a misalignment-aware feature translator and a content-aware reference filter. The translator reformulates unsupervised cross-modal and cross-resolution alignment as a weakly-supervised, misalignment-aware translation subtask, producing an aligned guide feature with inherent redundancy. Guided by this feature, the filter performs reference-based discriminative self-enhancement on the source, enabling SR predictions with high resolution and high fidelity. Across a variety of tasks, we demonstrate that RobSelf achieves state-of-the-art performance and superior efficiency. Additionally, we introduce a real-world dataset, RealMisSR, to advance research on this topic. Dataset and code: this https URL.

194. 【2602.18817】HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation

Link: https://arxiv.org/abs/2602.18817

Authors: Chongyang Xu, Shen Cheng, Haipeng Li, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Imitation learning, explicitly encode geometry, representations that explicitly, learning for robotic, explicitly encode

Comments: Accepted by ICRA 2026

Abstract:Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe's toe from heel). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on global and local fields using a permutation-invariant network architecture, thereby avoiding order-sensitive bias and producing a coherent control policy for pose-aware manipulation. In various tests, HeRO establishes a new state-of-the-art, improving success on Place Dual Shoes by 12.3% and averaging 6.5% gains across six challenging pose-aware tasks. Code is available at this https URL.

195. 【2602.18811】Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection

Link: https://arxiv.org/abs/2602.18811

Authors: Wanqi Wang, Jingcai Guo, Yuxiang Cai, Zhi Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Few-Shot Object Detection, Cross-Domain Few-Shot Object, Few-Shot Object, Object Detection, unseen target domains

Comments: Accepted to CVPR 2026 Findings

Abstract:Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel classes in unseen target domains given only a few labeled examples. While open-vocabulary detectors built on vision-language models (VLMs) transfer well, they depend almost entirely on text prompts, which encode domain-invariant semantics but miss domain-specific visual information needed for precise localization under few-shot supervision. We propose a dual-branch detector that Learns Multi-modal Prototypes, dubbed LMP, by coupling textual guidance with visual exemplars drawn from the target domain. A Visual Prototype Construction module aggregates class-level prototypes from support RoIs and dynamically generates hard-negative prototypes in query images via jittered boxes, capturing distractors and visually similar backgrounds. In the visual-guided branch, we inject these prototypes into the detection pipeline with components mirrored from the text branch as the starting point for training, while a parallel text-guided branch preserves open-vocabulary semantics. The branches are trained jointly and ensembled at inference by combining semantic abstraction with domain-adaptive details. On six cross-domain benchmark datasets and standard 1/5/10-shot settings, our method achieves state-of-the-art or highly competitive mAP.

196. 【2602.18799】Rethinking Preference Alignment for Diffusion Models with Classifier-Free Guidance

Link: https://arxiv.org/abs/2602.18799

Authors: Zhou Jiang, Yandong Wen, Zhen Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preferences remains challenging, nuanced human preferences, human preferences remains, Aligning large-scale, remains challenging

Abstract:Aligning large-scale text-to-image diffusion models with nuanced human preferences remains challenging. While direct preference optimization (DPO) is simple and effective, large-scale finetuning often shows a generalization gap. We take inspiration from test-time guidance and cast preference alignment as classifier-free guidance (CFG): a finetuned preference model acts as an external control signal during sampling. Building on this view, we propose a simple method that improves alignment without retraining the base model. To further enhance generalization, we decouple preference learning into two modules trained on positive and negative data, respectively, and form a contrastive guidance vector at inference by subtracting their predictions (positive minus negative), scaled by a user-chosen strength and added to the base prediction at each step. This yields a sharper and controllable alignment signal. We evaluate on Stable Diffusion 1.5 and Stable Diffusion XL with Pick-a-Pic v2 and HPDv3, showing consistent quantitative and qualitative gains.
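The contrastive guidance update stated in this abstract (positive minus negative, scaled by a strength, added to the base prediction) can be written out per sampling step as follows. This is a sketch only: the function and argument names are hypothetical, and flat Python lists stand in for the noise-prediction tensors a real diffusion model would produce.

```python
def contrastive_guidance(base_pred, pos_pred, neg_pred, strength):
    """Combine three per-step predictions into one guided prediction.

    guided = base + strength * (positive - negative), element-wise.
    All names here are illustrative assumptions, not the authors' API.
    """
    return [b + strength * (p - n)
            for b, p, n in zip(base_pred, pos_pred, neg_pred)]

# Toy 2-element "predictions" standing in for model outputs at one step.
base = [0.0, 1.0]
positive = [1.0, 2.0]
negative = [0.5, 1.0]
guided = contrastive_guidance(base, positive, negative, strength=2.0)
```

With `strength=0.0` the base prediction is returned unchanged, which matches the description of the strength as a user-chosen control on how strongly the preference signal steers sampling.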

197. 【2602.18792】MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations

Link: https://arxiv.org/abs/2602.18792

Authors: Changlu Guo, Anders Nymark Christensen, Anders Bjorholm Dahl, Morten Rieger Hannemose

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep neural networks, minimal semantic modifications, model prediction, providing causal, neural networks

Comments: Accepted by CVPR 2026

Abstract:Visual counterfactual explanations aim to reveal the minimal semantic modifications that can alter a model's prediction, providing causal and interpretable insights into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, and effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. Our approach adaptively focuses on decision-relevant regions to achieve localized and semantically consistent counterfactual generation while preserving high image fidelity. Our training-free framework, MaskDiME, achieves over 30x faster inference than the baseline method and achieves comparable or state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation.

198. 【2602.18766】Initialization matters in few-shot adaptation of vision-language models for histopathological image classification

Link: https://arxiv.org/abs/2602.18766

Authors: Pablo Meseguer, Rocío del Amor, Valery Naranjo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision language models, histopathological image-caption pairs, image-caption pairs enabled, Vision language, pairs enabled zero-shot

Comments: Accepted as oral presentation at CASEIB 2024 held in Sevilla, Spain

Abstract:Vision language models (VLMs) pre-trained on datasets of histopathological image-caption pairs have enabled zero-shot slide-level classification. The ability of VLM image encoders to extract discriminative features also opens the door for supervised fine-tuning for whole-slide image (WSI) classification, ideally using few labeled samples. Slide-level prediction frameworks require the incorporation of multiple instance learning (MIL) due to the gigapixel size of the WSI. Following patch-level feature extraction and aggregation, MIL frameworks rely on linear classifiers trained on top of the slide-level aggregated features. Classifier weight initialization has a large influence on Linear Probing performance in efficient transfer learning (ETL) approaches based on few-shot learning. In this work, we propose Zero-Shot Multiple-Instance Learning (ZS-MIL) to address the limitations of random classifier initialization, which underperforms zero-shot prediction in MIL problems. ZS-MIL uses the class-level embeddings of the VLM text encoder as the classification layer's starting point to compute each sample's bag-level probabilities. Through multiple experiments, we demonstrate the robustness of ZS-MIL compared to well-known weight initialization techniques both in terms of performance and variability in an ETL few-shot scenario for subtyping prediction.
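The initialization idea in this abstract, starting the linear classification layer from class-level text embeddings instead of random weights, can be sketched as below. Several pieces are assumptions for illustration: mean pooling as the patch aggregator, cosine-similarity logits with a fixed temperature, and all function names. The abstract does not specify these details.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_pool(bag):
    """Aggregate patch embeddings into one slide-level embedding (assumed aggregator)."""
    n = len(bag)
    return [sum(col) / n for col in zip(*bag)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_classifier_from_text(class_text_embeddings):
    """Use one normalized text embedding per class as the initial weight row."""
    return [l2_normalize(e) for e in class_text_embeddings]

def bag_probabilities(bag, weights, temperature=10.0):
    """Bag-level class probabilities from cosine similarity to each weight row."""
    slide = l2_normalize(mean_pool(bag))
    logits = [temperature * sum(w * s for w, s in zip(row, slide))
              for row in weights]
    return softmax(logits)

# Toy data: two classes, 2-D embeddings, a bag of two patch embeddings.
weights = init_classifier_from_text([[2.0, 0.0], [0.0, 3.0]])
probs = bag_probabilities([[0.9, 0.1], [0.8, 0.3]], weights)
```

Before any training step, this classifier already reproduces a zero-shot prediction, so few-shot fine-tuning starts from that point rather than from a random one.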

199. 【2602.18765】A high-resolution nationwide urban village mapping product for 342 Chinese cities based on foundation models

Link: https://arxiv.org/abs/2602.18765

Authors: Lubin Bai, Sheng Xiao, Ziyu Yin, Haoyu Wang, Siyang Wu, Xiuyuan Zhang, Shihong Du

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: China rapidly urbanizing, rapidly urbanizing cities, Urban Villages, represent a distinctive, distinctive form

Comments: Submitted to Earth System Science Data

Abstract:Urban Villages (UVs) represent a distinctive form of high-density informal settlement embedded within China's rapidly urbanizing cities. Accurate identification of UVs is critical for urban governance, renewal, and sustainable development. However, due to the pronounced heterogeneity and diversity of UVs across China's vast territory, a consistent and reliable nationwide dataset has been lacking. In this work, we present GeoLink-UV, a high-resolution nationwide UV mapping product that clearly delineates the locations and boundaries of UVs in 342 Chinese cities. The dataset is derived from multisource geospatial data, including optical remote sensing images and geo-vector data, and is generated through a foundation model-driven mapping framework designed to address the generalization issues and improve the product quality. A geographically stratified accuracy assessment based on independent samples from 28 cities confirms the reliability and scientific credibility of the nationwide dataset across heterogeneous urban contexts. Based on this nationwide product, we reveal substantial interregional disparities in UV prevalence and spatial configuration. On average, UV areas account for 8% of built-up land, with marked clustering in central and south China. Building-level analysis further confirms a consistent low-rise, high-density development pattern of UVs nationwide, while highlighting regionally differentiated morphological characteristics. The GeoLink-UV dataset provides an open and systematically validated geospatial foundation for urban studies, informal settlement monitoring, and evidence-based urban renewal planning, and contributes directly to large-scale assessments aligned with Sustainable Development Goal 11. The GeoLink-UV dataset introduced in this article is freely available at this https URL.

200. 【2602.18763】TAG: Thinking with Action Unit Grounding for Facial Expression Recognition

Link: https://arxiv.org/abs/2602.18763

Authors: Haobo Lin, Tianyi Bai, Jiajun Zhang, Xuanhao Chang, Sheng Lu, Fangming Gu, Zengjie Hu, Wentao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Facial Expression Recognition, Expression Recognition, meaningful facial cues, Action Unit Grounding, fine-grained visual understanding

Comments: 33 pages, 8 figures

Abstract:Facial Expression Recognition (FER) is a fine-grained visual understanding task where reliable predictions require reasoning over localized and meaningful facial cues. Recent vision-language models (VLMs) enable natural language explanations for FER, but their reasoning is often ungrounded, producing fluent yet unverifiable rationales that are weakly tied to visual evidence and prone to hallucination, leading to poor robustness across different datasets. We propose TAG (Thinking with Action Unit Grounding), a vision-language framework that explicitly constrains multimodal reasoning to be supported by facial Action Units (AUs). TAG requires intermediate reasoning steps to be grounded in AU-related facial regions, yielding predictions accompanied by verifiable visual evidence. The model is trained via supervised fine-tuning on AU-grounded reasoning traces followed by reinforcement learning with an AU-aware reward that aligns predicted regions with external AU detectors. Evaluated on RAF-DB, FERPlus, and AffectNet, TAG consistently outperforms strong open-source and closed-source VLM baselines while simultaneously improving visual faithfulness. Ablation and preference studies further show that AU-grounded rewards stabilize reasoning and mitigate hallucination, demonstrating the importance of structured grounded intermediate representations for trustworthy multimodal reasoning in FER. The code will be available at this https URL.

201. 【2602.18757】Driving with A Thousand Faces: A Benchmark for Closed-Loop Personalized End-to-End Autonomous Driving

Link: https://arxiv.org/abs/2602.18757

Authors: Xiaoru Dong, Ruiqin Li, Xiao Han, Zhenxuan Wu, Jiamin Wang, Jian Chen, Qi Jiang, SM Yiu, Xinge Zhu, Yuexin Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Human driving behavior, neglecting individual differences, single average driving, Human driving, average driving style

Abstract:Human driving behavior is inherently diverse, yet most end-to-end autonomous driving (E2E-AD) systems learn a single average driving style, neglecting individual differences. Achieving personalized E2E-AD faces challenges across three levels: limited real-world datasets with individual-level annotations, a lack of quantitative metrics for evaluating personal driving styles, and the absence of algorithms that can learn stylized representations from users' trajectories. To address these gaps, we propose Person2Drive, a comprehensive personalized E2E-AD platform and benchmark. It includes an open-source, flexible data collection system that simulates realistic scenarios to generate scalable and diverse personalized driving datasets; style vector-based evaluation metrics with Maximum Mean Discrepancy and KL divergence to comprehensively quantify individual driving behaviors; and a personalized E2E-AD framework with a style reward model that efficiently adapts E2E models for safe and individualized driving. Extensive experiments demonstrate that Person2Drive enables fine-grained analysis, reproducible evaluation, and effective personalization in end-to-end autonomous driving. Our dataset and code will be released after acceptance.
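This abstract names Maximum Mean Discrepancy (MMD) as one of its style metrics. As a reminder of what that quantity measures, here is a minimal sketch of the standard biased estimate of squared MMD with an RBF kernel. The kernel choice, the bandwidth, and the toy 2-D "style vectors" are assumptions for illustration; the benchmark's actual style features and settings are not given in the abstract.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

# Hypothetical style vectors for two drivers (values are made up).
driver_a = [[0.9, 0.2], [1.0, 0.3], [1.1, 0.25]]
driver_b = [[0.4, 0.8], [0.5, 0.9], [0.45, 0.85]]
gap = mmd_squared(driver_a, driver_b)
```

The estimate is zero when both samples are identical and grows as the two sets of style vectors separate, which is what makes it usable as a distance between driving styles.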

202. 【2602.18752】Optimizing ID Consistency in Multimodal Large Models: Facial Restoration via Alignment, Entanglement, and Disentanglement

Link: https://arxiv.org/abs/2602.18752

Authors: Yuran Dong, Hang Dai, Mang Ye

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: demonstrated powerful editing, powerful editing capabilities, diverse tasks, demonstrated powerful, capabilities across diverse

Comments: ICLR 26

Abstract:Multimodal editing large models have demonstrated powerful editing capabilities across diverse tasks. However, a persistent and long-standing limitation is the decline in facial identity (ID) consistency during realistic portrait editing. Due to the human eye's high sensitivity to facial features, such inconsistency significantly hinders the practical deployment of these models. Current facial ID preservation methods struggle to achieve consistent restoration of both facial identity and edited element IP due to Cross-source Distribution Bias and Cross-source Feature Contamination. To address these issues, we propose EditedID, an Alignment-Disentanglement-Entanglement framework for robust identity-specific facial restoration. By systematically analyzing diffusion trajectories, sampler behaviors, and attention properties, we introduce three key components: 1) Adaptive mixing strategy that aligns cross-source latent representations throughout the diffusion process. 2) Hybrid solver that disentangles source-specific identity attributes and details. 3) Attentional gating mechanism that selectively entangles visual elements. Extensive experiments show that EditedID achieves state-of-the-art performance in preserving original facial ID and edited element IP consistency. As a training-free and plug-and-play solution, it establishes a new benchmark for practical and reliable single/multi-person facial identity restoration in open-world settings, paving the way for the deployment of multimodal editing large models in real-person editing scenarios. The code is available at this https URL.

203. 【2602.18747】Benchmarking Computational Pathology Foundation Models For Semantic Segmentation

Link: https://arxiv.org/abs/2602.18747

Authors: Lavish Ramchandani, Aashay Tinaikar, Dev Kumar Das, Rohit Garg, Tijo Thomas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated remarkable domain, remarkable domain generalization, unsupervised feature extraction, feature extraction capabilities, recent years

Comments: 5 pages, submitted to IEEE ISBI 2026

Abstract:In recent years, foundation models such as CLIP, DINO, and CONCH have demonstrated remarkable domain generalization and unsupervised feature extraction capabilities across diverse imaging tasks. However, systematic and independent evaluations of these models for pixel-level semantic segmentation in histopathology remain scarce. In this study, we propose a robust benchmarking approach to assess 10 foundation models on four histopathological datasets covering both morphological tissue-region and cellular/nuclear segmentation tasks. Our method leverages attention maps of foundation models as pixel-wise features, which are then classified using a machine learning algorithm, XGBoost, enabling fast, interpretable, and model-agnostic evaluation without finetuning. We show that the vision-language foundation model CONCH performed the best across datasets when compared to vision-only foundation models, with PathDino a close second. Further analysis shows that models trained on distinct histopathology cohorts capture complementary morphological representations, and concatenating their features yields superior segmentation performance. Concatenating features from CONCH, PathDino and CellViT outperformed individual models across all the datasets by 7.95% (averaged across the datasets), suggesting that ensembles of foundation models can better generalize to diverse histopathological segmentation tasks.

204. 【2602.18746】MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions

Link: https://arxiv.org/abs/2602.18746

Authors: Haoyu Zhang, Yuwei Wu, Pengxiang Li, Xintong Zhang, Zhi Gao, Rui Gao, Mingyang Gao, Che Sun, Yunde Jia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enhancing multimodal reasoning, complex visual inputs, multimodal reasoning capabilities, reasoning capabilities remains, critical challenge

Abstract:In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs, where initial inferences often lead to hallucinations or logic errors. Existing VLMs often produce plausible yet ungrounded answers, and even when prompted to "reflect", their corrections may remain detached from the image evidence. To address this, we propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions. By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision, which are repeated until the output is visually grounded. To facilitate training of this model, we construct ReflectV, a visual reflective dataset for multi-turn supervision that explicitly contains reflection triggers, region-based verification actions, and answer revision grounded in visual evidence. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.
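The closed loop named in this abstract (draft, critique, region-based verification, revision, repeated until the answer is grounded) has roughly the control flow sketched below. Everything concrete here is a hypothetical stand-in: the four callables, the `max_rounds` cap, and the toy dictionary "image" exist only to make the loop runnable. In the paper these roles are played by a trained vision-language model, not by hand-written functions.

```python
def reflect_loop(question, image, draft, critique, verify, revise, max_rounds=3):
    """Run draft -> (critique -> verify region -> revise)* and return the answer.

    `critique` returns a region to re-inspect, or None when the answer is
    judged grounded. All callables are illustrative placeholders.
    """
    answer = draft(question, image)
    for _ in range(max_rounds):
        region = critique(question, image, answer)
        if region is None:
            break  # nothing left to check, stop reflecting
        evidence = verify(image, region)
        answer = revise(question, answer, evidence)
    return answer

# Toy stand-ins: the "image" is a dict mapping region names to what they show.
toy_image = {"cup": "red"}

def toy_draft(question, image):
    return "blue"  # deliberately wrong first guess

def toy_critique(question, image, answer):
    # Flags the cup region while the answer disagrees with it (toy oracle).
    return None if answer == image["cup"] else "cup"

def toy_verify(image, region):
    return image[region]

def toy_revise(question, answer, evidence):
    return evidence

final_answer = reflect_loop("What color is the cup?", toy_image,
                            toy_draft, toy_critique, toy_verify, toy_revise)
```

The point of the structure is that each revision is driven by evidence read from a specific region rather than by re-reading the text of the previous answer.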

205. 【2602.18745】Synthesizing Multimodal Geometry Datasets from Scratch and Enabling Visual Alignment via Plotting Code

Link: https://arxiv.org/abs/2602.18745

Authors: Haobo Lin, Tianyi Bai, Chen Chen, Jiajun Zhang, Bohan Zeng, Wentao Zhang, Binhang Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: limited training data, geometric constructions due, jointly understand visual, language models struggle, complex geometric constructions

Comments: 58 pages, 10 figures

Abstract:Multimodal geometry reasoning requires models to jointly understand visual diagrams and perform structured symbolic inference, yet current vision-language models struggle with complex geometric constructions due to limited training data and weak visual-symbolic alignment. We propose a pipeline for synthesizing complex multimodal geometry problems from scratch and construct a dataset named GeoCode, which decouples problem generation into symbolic seed construction, grounded instantiation with verification, and code-based diagram rendering, ensuring consistency across structure, text, reasoning, and images. Leveraging the plotting code provided in GeoCode, we further introduce code prediction as an explicit alignment objective, transforming visual understanding into a supervised structured prediction task. GeoCode exhibits substantially higher structural complexity and reasoning difficulty than existing benchmarks, while maintaining mathematical correctness through multi-stage validation. Extensive experiments show that models trained on GeoCode achieve consistent improvements on multiple geometry benchmarks, demonstrating both the effectiveness of the dataset and the proposed alignment strategy. The code will be available at this https URL.

206. 【2602.18742】RoboCurate: Harnessing Diversity with Action-Verified Neural Trajectory for Robot Learning

Link: https://arxiv.org/abs/2602.18742

Authors: Seungku Kim, Suhyeok Jang, Byungjun Yoon, Dongyoung Kim, John Won, Jinwoo Shin

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: video generative models, imperfectly generated videos, action quality due, inconsistent action quality, scalable pipeline

Comments: 20 pages; 6 figures; Project page: https://seungkukim.github.io/robocurate/

Abstract:Synthetic data generated by video generative models has shown promise for robot learning as a scalable pipeline, but it often suffers from inconsistent action quality due to imperfectly generated videos. Recently, vision-language models (VLMs) have been leveraged to validate video quality, but they have limitations in distinguishing physically accurate videos and, even then, cannot directly evaluate the generated actions themselves. To tackle this issue, we introduce RoboCurate, a novel synthetic robot data generation framework that evaluates and filters the quality of annotated actions by comparing them with simulation replay. Specifically, RoboCurate replays the predicted actions in a simulator and assesses action quality by measuring the consistency of motion between the simulator rollout and the generated video. In addition, we unlock observation diversity beyond the available dataset via image-to-image editing and apply action-preserving video-to-video transfer to further augment appearance. We observe that data generated by RoboCurate yields substantial relative improvements in success rates compared to using real data only, achieving +70.1% on GR-1 Tabletop (300 demos), +16.1% on DexMimicGen in the pre-training setup, and +179.9% in the challenging real-world ALLEX humanoid dexterous manipulation setting.

207. 【2602.18741】Compact Hadamard Latent Codes for Efficient Spectral Rendering

Link: https://arxiv.org/abs/2602.18741

Authors: Jiaqi Yu, Dar'ya Guarnera, Giuseppe Claudio Guarnera

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: accurately reproduces wavelength-dependent, reproduces wavelength-dependent appearance, scales roughly linearly, rendering accurately reproduces, RGB

Abstract:Spectral rendering accurately reproduces wavelength-dependent appearance but is computationally expensive, as shading must be evaluated at many wavelength samples and scales roughly linearly with the number of samples. It also requires spectral textures and lights throughout the rendering pipeline. We propose Hadamard spectral codes, a compact latent representation that enables spectral rendering using standard RGB rendering operations. Spectral images are approximated with a small number of RGB rendering passes, followed by a decoding step. Our key requirement is latent linearity: scaling and addition in spectral space correspond to scaling and addition of codes, and the element-wise product of spectra (for example reflectance times illumination) is approximated by the element-wise product of their latent codes. We show that an exact low-dimensional algebra-preserving representation cannot exist for arbitrary spectra when the latent dimension k is smaller than the number of spectral samples n. We therefore introduce a learned non-negative linear encoder and decoder architecture that preserves scaling and addition exactly while encouraging approximate multiplicativity under the Hadamard product. With k = 6, we render k/3 = 2 RGB images per frame using an unmodified RGB renderer, reconstruct the latent image, and decode to high-resolution spectra or XYZ or RGB. Experiments on 3D scenes demonstrate that k = 6 significantly reduces color error compared to RGB baselines while being substantially faster than naive n-sample spectral rendering. Using k = 9 provides higher-quality reference results. We further introduce a lightweight neural upsampling network that maps RGB assets directly to latent codes, enabling integration of legacy RGB content into the spectral pipeline while maintaining perceptually accurate colors in rendered images.
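The "latent linearity" requirement in this abstract can be made concrete with a toy linear encoder. The sketch below uses a fixed box-averaging matrix with n = 4 spectral samples and k = 2 latent dimensions; this matrix is an assumption chosen for readability and is not the learned non-negative encoder from the paper. It shows that scaling and addition are preserved exactly by any linear encoder, while the element-wise (Hadamard) product is preserved only for some spectra.

```python
def encode(E, spectrum):
    """Apply a linear encoder E (k rows, n columns) to an n-sample spectrum."""
    return [sum(w * s for w, s in zip(row, spectrum)) for row in E]

def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]

# Hand-picked non-negative encoder: each latent averages two adjacent samples.
E = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]]

reflectance = [2.0, 2.0, 3.0, 3.0]   # constant within each averaged band
illuminant = [1.0, 1.0, 4.0, 4.0]    # constant within each averaged band
spiky = [1.0, 3.0, 0.0, 0.0]         # varies inside the first band

# Scaling and addition: encode(3*r + s) equals 3*encode(r) + encode(s).
combined = [3.0 * r + s for r, s in zip(reflectance, spiky)]
lhs_linear = encode(E, combined)
rhs_linear = [3.0 * a + b
              for a, b in zip(encode(E, reflectance), encode(E, spiky))]

# Product: exact for band-constant spectra, only approximate otherwise.
exact_product = encode(E, hadamard(reflectance, illuminant))
code_product = hadamard(encode(E, reflectance), encode(E, illuminant))
spiky_true = encode(E, hadamard(spiky, spiky))
spiky_code = hadamard(encode(E, spiky), encode(E, spiky))
```

For the `spiky` spectrum the product of codes differs from the code of the product, which illustrates why multiplicativity can only be encouraged approximately when k is smaller than n.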

208. 【2602.18735】LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency

Link: https://arxiv.org/abs/2602.18735

Authors: Weilong Yan, Haipeng Li, Hao Xu, Nianjin Ye, Yihao Ai, Shuaicheng Liu, Jingyu Hu

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: rich geometric priors, paper introduces LaS-Comp, zero-shot and category-agnostic, leverages the rich, rich geometric

Comments: Accepted to CVPR 2026

Abstract:This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, LaS-Comp harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage that preserves the partial observation geometry to ensure faithful completion; and (ii) an implicit refinement stage that ensures seamless boundaries between the observed and synthesized regions. Second, our framework is training-free and compatible with different 3D foundation models. Third, we introduce Omni-Comp, a comprehensive benchmark combining real-world and synthetic data with diverse and challenging partial patterns, enabling a more thorough and realistic evaluation. Both quantitative and qualitative experiments demonstrate that our approach outperforms previous state-of-the-art approaches. Our code and data will be available at this https URL.

209. 【2602.18729】MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

Link: https://arxiv.org/abs/2602.18729

Authors: Sagarika Banerjee, Tangatar Madi, Advait Swaminathan, Nguyen Dao Minh Anh, Shivank Garg, Kevin Zhu, Vasu Sharma

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: significant real-world consequences, identifying real-world risk, real-world risk scenarios, socially critical contexts, Fine-grained image-caption alignment

Notes: EACL 2026, Main, Short Paper

Abstract:Fine-grained image-caption alignment is crucial for vision-language models (VLMs), especially in socially critical contexts such as identifying real-world risk scenarios or distinguishing cultural proxies, where correct interpretation hinges on subtle visual or linguistic clues and where minor misinterpretations can lead to significant real-world consequences. We present MiSCHiEF, a set of two benchmarking datasets based on a contrastive pair design in the domains of safety (MiS) and culture (MiC), and evaluate four VLMs on tasks requiring fine-grained differentiation of paired images and captions. In both datasets, each sample contains two minimally differing captions and corresponding minimally differing images. In MiS, the image-caption pairs depict a safe and an unsafe scenario, while in MiC, they depict cultural proxies in two distinct cultural contexts. We find that models generally perform better at confirming the correct image-caption pair than rejecting incorrect ones. Additionally, models achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image, compared to the converse task. The results, overall, highlight persistent modality misalignment challenges in current VLMs, underscoring the difficulty of precise cross-modal grounding required for applications with subtle semantic and visual distinctions.

210. 【2602.18728】Phase-Consistent Magnetic Spectral Learning for Multi-View Clustering

Link: https://arxiv.org/abs/2602.18728

Authors: Mingdong Lu, Zhikui Chen, Meng Liu, Shubin Ma, Liang Zhao

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: leveraging complementary information, reliable shared structural, Unsupervised multi-view clustering, guide representation learning, aims to partition

Notes: Preprint. Under review

Abstract:Unsupervised multi-view clustering (MVC) aims to partition data into meaningful groups by leveraging complementary information from multiple views without labels, yet a central challenge is to obtain a reliable shared structural signal to guide representation learning and cross-view alignment under view discrepancy and noise. Existing approaches often rely on magnitude-only affinities or early pseudo targets, which can be unstable when different views induce relations with comparable strengths but contradictory directional tendencies, thereby distorting the global spectral geometry and degrading clustering. In this paper, we propose Phase-Consistent Magnetic Spectral Learning for MVC: we explicitly model cross-view directional agreement as a phase term and combine it with a nonnegative magnitude backbone to form a complex-valued magnetic affinity, extract a stable shared spectral signal via a Hermitian magnetic Laplacian, and use it as structured self-supervision to guide unsupervised multi-view representation learning and clustering. To obtain robust inputs for spectral extraction at scale, we construct a compact shared structure with anchor-based high-order consensus modeling and apply a lightweight refinement to suppress noisy or inconsistent relations. Extensive experiments on multiple public multi-view benchmarks demonstrate that our method consistently outperforms strong baselines.
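
For readers unfamiliar with the term, a Hermitian magnetic Laplacian can be sketched as follows. This uses the standard directed-graph definition (symmetrized weights as magnitude, edge asymmetry as phase); it is not the paper's multi-view construction, where the phase comes from cross-view directional agreement. The function name, the parameter `q`, and the toy graph are assumptions made for illustration.

```python
import cmath
import math

def magnetic_laplacian(A, q=0.25):
    """Standard magnetic Laplacian L = D_s - A_s * exp(i * Theta) (element-wise).

    A is a dense directed adjacency matrix (list of lists of floats).
    A_s = (A + A^T) / 2 carries the magnitude, Theta = 2*pi*q*(A - A^T) the phase.
    """
    n = len(A)
    L = [[0j] * n for _ in range(n)]
    for i in range(n):
        degree = 0.0
        for j in range(n):
            sym = 0.5 * (A[i][j] + A[j][i])                  # symmetric magnitude
            theta = 2.0 * math.pi * q * (A[i][j] - A[j][i])  # antisymmetric phase
            degree += sym
            L[i][j] = -sym * cmath.exp(1j * theta)
        L[i][i] += degree                                    # add the degree term
    return L

# Toy example: a directed 3-cycle 0 -> 1 -> 2 -> 0.
A = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
L = magnetic_laplacian(A)

# Hermitian check: L[i][j] should equal the complex conjugate of L[j][i].
hermitian_gap = max(
    abs(L[i][j] - L[j][i].conjugate()) for i in range(3) for j in range(3)
)
print("max deviation from Hermitian symmetry:", hermitian_gap)
```

Because the matrix is Hermitian, its eigenvalues are real even though edge direction is encoded in the complex phase, which is what makes a spectral embedding well defined.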

211. 【2602.18726】WiCompass: Oracle-driven Data Scaling for mmWave Human Pose Estimation

Link: https://arxiv.org/abs/2602.18726

Authors: Bo Liang, Chen Gong, Haobo Wang, Qirui Liu, Rungui Zhou, Fengzhi Shao, Yubo Wang, Wei Gao, Kaichen Zhou, Guolong Cui, Chenren Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Millimeter-wave Human Pose, Human Pose Estimation, Millimeter-wave Human, Pose Estimation, Human Pose

Notes: This paper has been accepted by The 32nd Annual International Conference on Mobile Computing and Networking (MobiCom'26)

Abstract:Millimeter-wave Human Pose Estimation (mmWave HPE) promises privacy but suffers from poor generalization under distribution shifts. We demonstrate that brute-force data scaling is ineffective for out-of-distribution (OOD) robustness; efficiency and coverage are the true bottlenecks. To address this, we introduce WiCompass, a coverage-aware data-collection framework. WiCompass leverages large-scale motion-capture corpora to build a universal pose space "oracle" that quantifies dataset redundancy and identifies underrepresented motions. Guided by this oracle, WiCompass employs a closed-loop policy to prioritize collecting informative missing samples. Experiments show that WiCompass consistently improves OOD accuracy at matched budgets and exhibits superior scaling behavior compared to conventional collection strategies. By shifting focus from brute-force scaling to coverage-aware data acquisition, this work offers a practical path toward robust mmWave sensing.

212. 【2602.18720】Subtle Motion Blur Detection and Segmentation from Static Image Artworks

Link: https://arxiv.org/abs/2602.18720

Authors: Ganesh Samarth, Sibendu Paul, Solale Tabarestani, Caren Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Streaming services serve, services serve hundreds, Streaming services, box art, viewers worldwide

Notes: In Proceedings of the Winter Conference on Applications of Computer Vision 2026

Abstract:Streaming services serve hundreds of millions of viewers worldwide, where visual assets such as thumbnails, box art, and cover images are critical for engagement. Subtle motion blur remains a pervasive quality issue, reducing visual clarity and negatively affecting user trust and click-through rates. However, motion blur detection from static images is underexplored, as existing methods and datasets focus on severe blur and lack fine-grained pixel-level annotations needed for quality-critical applications. Benchmarks such as GOPRO and NFS are dominated by strong synthetic blur and often contain residual blur in their sharp references, leading to ambiguous supervision. We propose SMBlurDetect, a unified framework combining high-quality motion blur specific dataset generation with an end-to-end detector capable of zero-shot detection at multiple granularities. Our pipeline synthesizes realistic motion blur from super high resolution aesthetic images using controllable camera and object motion simulations over SAM segmented regions, enhanced with alpha-aware compositing and balanced sampling to generate subtle, spatially localized blur with precise ground truth masks. We train a U-Net based detector with ImageNet pretrained encoders using a hybrid mask and image centric strategy incorporating curriculum learning, hard negatives, focal loss, blur frequency channels, and resolution aware [...]. The method achieves strong zero-shot generalization, reaching 89.68% accuracy on GoPro (vs 66.50% baseline) and 59.77% Mean IoU on CUHK (vs 9.00% baseline), demonstrating 6.6x improvement in segmentation. Qualitative results show accurate localization of subtle blur artifacts, enabling automated filtering of low quality frames and precise region of interest extraction for intelligent cropping.

213. 【2602.18717】NeXt2Former-CD: Efficient Remote Sensing Change Detection with Modern Vision Architectures

Link: https://arxiv.org/abs/2602.18717

Authors: Yufan Wang, Sokratis Makrogiannis, Chandra Kambhamettu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: State Space Models, favorable scaling properties, State Space, recently gained traction, remote sensing change

Notes: Code will be released at [this https URL](https://github.com/VimsLab/NeXt2Former-CD)

Abstract:State Space Models (SSMs) have recently gained traction in remote sensing change detection (CD) for their favorable scaling properties. In this paper, we explore the potential of modern convolutional and attention-based architectures as a competitive alternative. We propose NeXt2Former-CD, an end-to-end framework that integrates a Siamese ConvNeXt encoder initialized with DINOv3 weights, a deformable attention-based temporal fusion module, and a Mask2Former decoder. This design is intended to better tolerate residual co-registration noise and small object-level spatial shifts, as well as semantic ambiguity in bi-temporal imagery. Experiments on LEVIR-CD, WHU-CD, and CDD datasets show that our method achieves the best results among the evaluated methods, improving over recent Mamba-based baselines in both F1 score and IoU. Furthermore, despite a larger parameter count, our model maintains inference latency comparable to SSM-based approaches, suggesting it is practical for high-resolution change detection tasks.

214. 【2602.18711】HIME: Mitigating Object Hallucinations in LVLMs via Hallucination Insensitivity Model Editing

Link: https://arxiv.org/abs/2602.18711

Authors: Ahmed Akl, Abdelwahed Khamis, Ali Cheraghian, Zhe Wang, Sara Khalifa, Kewen Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal understanding capabilities, incorrect factual information, reliable real-world deployment, demonstrated impressive multimodal, impressive multimodal understanding

Notes: (none)

Abstract:Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal understanding capabilities, yet they remain prone to object hallucination, where models describe non-existent objects or attribute incorrect factual information, raising serious concerns for reliable real-world deployment. While fine-tuning is a commonly adopted mitigation strategy, its high computational cost and practical difficulty motivate the need for training-free alternatives, among which model editing has recently emerged as a promising direction. However, indiscriminate editing risks disrupting the rich implicit knowledge encoded in pre-trained LVLMs, leading to a fundamental question: how much intervention is necessary at each layer to suppress hallucinations while preserving pre-trained knowledge? To address this question, we present a systematic analysis of LVLM decoders built on three widely used large language model backbones-Qwen, LLaMA, and Vicuna-revealing clear layer-wise differences in susceptibility to object hallucination. Building on these insights, we introduce the Hallucination Insensitivity Score (HIS), a principled metric that quantifies each layer's sensitivity to hallucination and provides guidance for targeted intervention. Leveraging HIS, we propose Hallucination Insensitivity Model Editing (HIME), a simple yet effective layer-adaptive weight editing approach that selectively modifies latent features to suppress hallucinations while preserving pre-trained knowledge. Extensive experiments demonstrate that HIME reduces hallucinations by an average of 61.8% across open-ended generation benchmarks, including CHAIR, MME, and GPT-4V-aided evaluation, without introducing additional parameters, inference-time latency, or computational overhead.

215. 【2602.18709】IRIS-SLAM: Unified Geo-Instance Representations for Robust Semantic Localization and Mapping

Link: https://arxiv.org/abs/2602.18709

Authors: Tingyang Xiao, Liu Liu, Wei Feng, Zhengyu Zou, Xiaolin Zhou, Wei Sui, Hao Li, Dingwen Zhang, Zhizhong Su

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: lack deep semantic, deep semantic understanding, RGB semantic SLAM, lack deep, understanding and robust

Notes: 15 pages

Abstract:Geometry foundation models have significantly advanced dense geometric SLAM, yet existing systems often lack deep semantic understanding and robust loop closure capabilities. Meanwhile, contemporary semantic mapping approaches are frequently hindered by decoupled architectures and fragile data association. We propose IRIS-SLAM, a novel RGB semantic SLAM system that leverages unified geometric-instance representations derived from an instance-extended foundation model. By extending a geometry foundation model to concurrently predict dense geometry and cross-view consistent instance embeddings, we enable a semantic-synergized association mechanism and instance-guided loop closure detection. Our approach effectively utilizes viewpoint-agnostic semantic anchors to bridge the gap between geometric reconstruction and open-vocabulary mapping. Experimental results demonstrate that IRIS-SLAM significantly outperforms state-of-the-art methods, particularly in map consistency and wide-baseline loop closure reliability.

216. 【2602.18702】Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding

Link: https://arxiv.org/abs/2602.18702

Authors: Houlun Chen, Xin Wang, Guangyao Li, Yuwei Zhou, Yihan Chen, Jia Jia, Wenwu Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Long video understanding, small short-video GQA, Two-stage Reinforced Curriculum, Reinforced Curriculum Strategy, fixed video context

Notes: (none)

Abstract:Long video understanding is challenging due to rich and complicated multimodal clues in long temporal [...]. [...] methods adopt reasoning to improve the model's ability to analyze complex video clues in long videos via text-form [...]. [...], the existing literature suffers from the fact that the text-only reasoning under fixed video context may exacerbate hallucinations since detailed crucial clues are often ignored under limited video context length due to the temporal redundancy of long [...]. To address this gap, we propose Video-TwG, a curriculum reinforced framework that employs a novel Think-with-Grounding paradigm, enabling video LLMs to actively decide when to perform on-demand grounding during interleaved text-video reasoning, selectively zooming into question-relevant clips only when [...]. Video-TwG can be trained end-to-end in a straightforward manner, without relying on complex auxiliary modules or heavily annotated reasoning traces. In detail, we design a Two-stage Reinforced Curriculum Strategy, where the model first learns think-with-grounding behavior on a small short-video GQA dataset with grounding labels, and then scales to diverse general QA data with videos of diverse domains to encourage generalization. Further, to handle complex think-with-grounding reasoning for various kinds of data, we propose the TwG-GRPO algorithm, which features a fine-grained grounding reward, a self-confirmed pseudo reward and accuracy-gated [...]. [...], we propose to construct a new TwG-51K dataset that facilitates training. Experiments on Video-MME, LongVideoBench, and MLVU show that Video-TwG consistently outperforms strong LVU [...]. [...] ablation validates the necessity of our Two-stage Reinforced Curriculum Strategy and shows our TwG-GRPO better leverages diverse unlabeled data to improve grounding quality and reduce redundant groundings without sacrificing QA performance.

217. 【2602.18697】Deep LoRA-Unfolding Networks for Image Restoration

Link: https://arxiv.org/abs/2602.18697

Authors: Xiangming Wang, Haijin Zeng, Benteng Sun, Jiezhang Cao, Kai Zhang, Qiangqiang Shen, Yongyong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gradient Descent Module, Proximal Mapping Module

Notes: Accepted by IEEE Transactions on Image Processing

Abstract:Deep unfolding networks (DUNs), combining conventional iterative optimization algorithms and deep neural networks into a multi-stage framework, have achieved remarkable accomplishments in Image Restoration (IR), such as spectral imaging reconstruction, compressive sensing and [...]. [...] unfolds the iterative optimization steps into a stack of sequentially linked [...]. [...] block consists of a Gradient Descent Module (GDM) and a Proximal Mapping Module (PMM) which is equivalent to a denoiser from a Bayesian perspective, operating on Gaussian noise with a known [...]. [...], existing DUNs suffer from two critical limitations: (i) their PMMs share identical architectures and denoising objectives across stages, ignoring the need for stage-specific adaptation to varying noise levels; and (ii) their chain of structurally repetitive blocks results in severe parameter redundancy and high memory consumption, hindering deployment in large-scale or resource-constrained [...]. To address these challenges, we introduce generalized Deep Low-rank Adaptation (LoRA) Unfolding Networks for image restoration, named LoRun, harmonizing denoising objectives and adapting different denoising levels between stages with compressed memory usage for more efficient [...]. [...] introduces a novel paradigm where a single pretrained base denoiser is shared across all stages, while lightweight, stage-specific LoRA adapters are injected into the PMMs to dynamically modulate denoising behavior according to the noise level at each unfolding [...]. [...] design decouples the core restoration capability from task-specific adaptation, enabling precise control over denoising intensity without duplicating full network parameters and achieving up to $N$ times parameter reduction for an $N$-stage DUN with on-par or better [...]. [...] experiments conducted on three IR tasks validate the efficiency of our method.
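
The "up to $N$ times" parameter claim can be read as simple bookkeeping, sketched below. The layer sizes, rank, and base parameter count are hypothetical round numbers, the helper names are invented for this example, and gradient-descent-module parameters are ignored; none of these values come from the paper.

```python
def lora_adapter_params(d_in, d_out, rank):
    """Parameters of one LoRA adapter: two low-rank factors (d_in x r and r x d_out)."""
    return rank * (d_in + d_out)

def unfolding_param_counts(base_params, n_stages, adapter_params_per_stage):
    """Compare N independent denoisers with one shared denoiser plus per-stage adapters."""
    separate = n_stages * base_params
    shared = base_params + n_stages * adapter_params_per_stage
    return separate, shared

# Hypothetical setup: a 1M-parameter denoiser, 8 unfolding stages,
# and 4 adapted 256x256 layers per stage at rank 4.
per_stage = 4 * lora_adapter_params(d_in=256, d_out=256, rank=4)
separate, shared = unfolding_param_counts(
    base_params=1_000_000, n_stages=8, adapter_params_per_stage=per_stage
)
ratio = separate / shared

print("separate denoisers:", separate)
print("shared + adapters:", shared)
print("reduction factor:", round(ratio, 2))
```

The reduction factor approaches the number of stages only when the adapters are negligible next to the base denoiser, which matches the abstract's "up to" wording.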

218. 【2602.18684】Systematic Analysis of Coupling Effects on Closed-Loop and Open-Loop Performance in Aerial Continuum Manipulators

Link: https://arxiv.org/abs/2602.18684

Authors: Niloufar Amiri, Shayan Sepahvand, Iraj Mantegh, Farrokh Janabi-Sharifi

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: aerial continuum manipulators, paper investigates, investigates two distinct, decoupled model, coupled formulations

Notes: Submitted to the 2026 International Conference on Unmanned Aircraft Systems (ICUAS 2026)

Abstract:This paper investigates two distinct approaches to the dynamic modeling of aerial continuum manipulators (ACMs): the decoupled and the coupled formulations. Both open-loop and closed-loop behaviors of a representative ACM are analyzed. The primary objective is to determine the conditions under which the decoupled model attains accuracy comparable to the coupled model while offering reduced computational cost under identical numerical conditions. The system dynamics are first derived using the Euler--Lagrange method under the piecewise constant curvature (PCC) assumption, with explicit treatment of the near-zero curvature singularity. A decoupled model is then obtained by neglecting the coupling terms in the ACM dynamics, enabling systematic evaluation of open-loop responses under diverse actuation profiles and external wrenches. To extend the analysis to closed-loop performance, a novel dynamics-based proportional-derivative sliding mode image-based visual servoing (DPD-SM-IBVS) controller is developed for regulating image feature errors in the presence of a moving target. The controller is implemented with both coupled and decoupled models, allowing a direct comparison of their effectiveness. The open-loop simulations reveal pronounced discrepancies between the two modeling approaches, particularly under varying torque inputs and continuum arm parameters. Conversely, the closed-loop experiments demonstrate that the decoupled model achieves tracking accuracy on par with the coupled model (within subpixel error) while incurring lower computational cost.

219. 【2602.18647】Information-Guided Noise Allocation for Efficient Diffusion Training

Link: https://arxiv.org/abs/2602.18647

Authors: Gabriel Raya, Bac Nguyen, Georgios Batzolis, Yuhta Takida, Dejan Stancevic, Naoki Murata, Chieh-Hsin Lai, Yuki Mitsufuji, Luca Ambrogioni

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)

Keywords: weakly informative noise, informative noise regions, models typically relies, typically relies, relies on manually

Notes: (none)

Abstract:Training diffusion models typically relies on manually tuned noise schedules, which can waste computation on weakly informative noise regions and limit transfer across datasets, resolutions, and representations. We revisit noise schedule allocation through an information-theoretic lens and propose the conditional entropy rate of the forward process as a theoretically grounded, data-dependent diagnostic for identifying suboptimal noise-level allocation in existing schedules. Based on this insight, we introduce InfoNoise, a principled data-adaptive training noise schedule that replaces heuristic schedule design with an information-guided noise sampling distribution derived from entropy-reduction rates estimated from denoising losses already computed during training. Across natural-image benchmarks, InfoNoise matches or surpasses tuned EDM-style schedules, in some cases with a substantial training speedup (about $1.4\times$ on CIFAR-10). On discrete datasets, where standard image-tuned schedules exhibit significant mismatch, it reaches superior quality in up to $3\times$ fewer training steps. Overall, InfoNoise makes noise scheduling data-adaptive, reducing the need for per-dataset schedule design as diffusion models expand across domains.

220. 【2602.18618】Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space

Link: https://arxiv.org/abs/2602.18618

Authors: Aashish Chandra, Aashutosh A V, Abhijit Das

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generating realistic speaking, multi-entangled latent space, voice profile, approach for generating, generating realistic

Notes: To appear in the Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026. Presented at Poster Session 1

Abstract:We present a novel approach for generating realistic speaking and talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt/driving text, the driving image, and the voice profile of an individual and then combines them to pass them to the multi-entangled latent space to foster key-value pairs and queries for the audio and video modality generation pipeline. The multi-entangled latent space is responsible for establishing the spatiotemporal person-specific features between the modalities. Further, entangled features are passed to the respective decoder of each modality for output audio and video generation.

221. 【2602.18614】Effect of Patch Size on Fine-Tuning Vision Transformers in Two-Dimensional and Three-Dimensional Medical Image Classification

Link: https://arxiv.org/abs/2602.18614

Authors: Massoud Dehghan, Ramona Woitek, Amirreza Mahbod

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Transformers, vision-language foundation models, computer vision tasks, patch sizes, patch size

Notes: 29 pages

Abstract:Vision Transformers (ViTs) and their variants have become state-of-the-art in many computer vision tasks and are widely used as backbones in large-scale vision and vision-language foundation models. While substantial research has focused on architectural improvements, the impact of patch size, a crucial initial design choice in ViTs, remains underexplored, particularly in medical domains where both two-dimensional (2D) and three-dimensional (3D) imaging modalities exist. In this study, using 12 medical imaging datasets from various imaging modalities (including seven 2D and five 3D datasets), we conduct a thorough evaluation of how different patch sizes affect ViT classification performance. Using a single graphics processing unit (GPU) and a range of patch sizes (1, 2, 4, 7, 14, 28), we fine-tune ViT models and observe consistent improvements in classification performance with smaller patch sizes (1, 2, and 4), which achieve the best results across nearly all datasets. More specifically, our results indicate improvements in balanced accuracy of up to 12.78% for 2D datasets (patch size 2 vs. 28) and up to 23.78% for 3D datasets (patch size 1 vs. 14), at the cost of increased computational expense. Moreover, by applying a straightforward ensemble strategy that fuses the predictions of the models trained with patch sizes 1, 2, and 4, we demonstrate a further boost in performance in most cases, especially for the 2D datasets. Our implementation is publicly available on GitHub: this https URL
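
The "increased computational expense" of small patches follows from how the token count grows. The sketch below assumes a square or cubic input with a side length of 28, chosen only because every listed patch size divides it; the abstract does not state the input resolution. It also ignores any class token, and the function name is invented for this example.

```python
def num_tokens(side, patch, dims=2):
    """Number of non-overlapping patches for a square (dims=2) or cubic (dims=3) input."""
    if side % patch != 0:
        raise ValueError("patch size must divide the input side length")
    return (side // patch) ** dims

side = 28  # assumed input side length, not taken from the paper
for patch in (1, 2, 4, 7, 14, 28):
    tokens_2d = num_tokens(side, patch, dims=2)
    tokens_3d = num_tokens(side, patch, dims=3)
    # Self-attention compares every token with every other token,
    # so its cost grows with the square of the token count.
    print(f"patch={patch:2d}  2D tokens={tokens_2d:4d}  "
          f"3D tokens={tokens_3d:6d}  2D attention pairs={tokens_2d ** 2}")
```

Under this assumption, halving the patch size quadruples the 2D token count and multiplies the 3D token count by eight, before the quadratic attention cost is applied.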

222. 【2602.18606】OVerSeeC: Open-Vocabulary Costmap Generation from Satellite Images and Natural Language

Link: https://arxiv.org/abs/2602.18606

Authors: Rwik Rana, Jesse Quattrociocchi, Dongmyeong Lee, Christian Ellis, Amanda Adkins, Adam Uccello, Garrett Warnell, Joydeep Biswas

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: essential global context, Aerial imagery, autonomous navigation, onboard sensing, enabling route planning

Notes: Website: [this https URL](https://amrl.cs.utexas.edu/overseec/)

Abstract:Aerial imagery provides essential global context for autonomous navigation, enabling route planning at scales inaccessible to onboard sensing. We address the problem of generating global costmaps for long-range planning directly from satellite imagery when entities and mission-specific traversal rules are expressed in natural language at test time. This setting is challenging since mission requirements vary, terrain entities may be unknown at deployment, and user prompts often encode compositional traversal logic. Existing approaches relying on fixed ontologies and static cost mappings cannot accommodate such flexibility. While foundation models excel at language interpretation and open-vocabulary perception, no single model can simultaneously parse nuanced mission directives, locate arbitrary entities in large-scale imagery, and synthesize them into an executable cost function for planners. We therefore propose OVerSeeC, a zero-shot modular framework that decomposes the problem into Interpret-Locate-Synthesize: (i) an LLM extracts entities and ranked preferences, (ii) an open-vocabulary segmentation pipeline identifies these entities from high-resolution imagery, and (iii) the LLM uses the user's natural language preferences and masks to synthesize executable costmap code. Empirically, OVerSeeC handles novel entities, respects ranked and compositional preferences, and produces routes consistent with human-drawn trajectories across diverse regions, demonstrating robustness to distribution shifts. This shows that modular composition of foundation models enables open-vocabulary, preference-aligned costmap generation for scalable, mission-adaptive global planning.

223. 【2602.18585】BloomNet: Exploring Single vs. Multiple Object Annotation for Flower Recognition Using YOLO Variants

Link: https://arxiv.org/abs/2602.18585

Authors: Safwat Nusrat, Prithwiraj Bhattacharjee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: advancing automated agriculture, Precise localization, automated agriculture, plant phenotyping, yield monitoring

Notes: Accepted for publication in 7th International Conference on Trends in Computational and Cognitive Engineering (TCCE-2025)

Abstract:Precise localization and recognition of flowers are crucial for advancing automated agriculture, particularly in plant phenotyping, crop estimation, and yield monitoring. This paper benchmarks several YOLO architectures such as YOLOv5s, YOLOv8n/s/m, and YOLOv12n for flower object detection under two annotation regimes: single-image single-bounding box (SISBB) and single-image multiple-bounding box (SIMBB). The FloralSix dataset, comprising 2,816 high-resolution photos of six different flower species, is also introduced. It is annotated for both dense (clustered) and sparse (isolated) scenarios. The models were evaluated using Precision, Recall, and Mean Average Precision (mAP) at IoU thresholds of 0.5 (mAP@0.5) and 0.5-0.95 (mAP@0.5:0.95). In SISBB, YOLOv8m (SGD) achieved the best results with Precision 0.956, Recall 0.951, mAP@0.5 0.978, and mAP@0.5:0.95 0.865, illustrating strong accuracy in detecting isolated flowers. With mAP@0.5 0.934 and mAP@0.5:0.95 0.752, YOLOv12n (SGD) performed best in the more complicated SIMBB scenario, proving robustness in dense, multi-object detection. Results show how annotation density, IoU thresholds, and model size interact: recall-optimized models perform better in crowded environments, whereas precision-oriented models perform best in sparse scenarios. In both cases, the Stochastic Gradient Descent (SGD) optimizer consistently performed better than alternatives. These density-sensitive sensors are helpful for non-destructive crop analysis, growth tracking, robotic pollination, and stress evaluation.
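
The IoU thresholds behind mAP@0.5 and mAP@0.5:0.95 can be made concrete with a short sketch. The corner-format boxes and helper names are illustrative, and the 0.05 threshold step is the usual COCO-style convention rather than something stated in the abstract.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def map_iou_thresholds():
    """Thresholds 0.50, 0.55, ..., 0.95 conventionally averaged for mAP@0.5:0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]

predicted = (0.0, 0.0, 2.0, 2.0)
ground_truth = (1.0, 1.0, 3.0, 3.0)
overlap = iou(predicted, ground_truth)

# A detection counts as a match at a given threshold only if its IoU reaches it.
matches = [t for t in map_iou_thresholds() if overlap >= t]
print("IoU:", round(overlap, 3), "thresholds satisfied:", matches)
```

A loosely placed box can count as correct at 0.5 and still fail the stricter thresholds, which is why mAP@0.5:0.95 values in the abstract are lower than the corresponding mAP@0.5 values.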

224. 【2602.18584】GIST: Targeted Data Selection for Instruction Tuning via Coupled Optimization Geometry

Link: https://arxiv.org/abs/2602.18584

Authors: Guanghui Min, Tianhao Huang, Ke Wan, Chen Chen

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: efficient instruction tuning, Targeted data selection, Targeted data, specific target task, instruction tuning

Notes: 27 pages, 8 figures, 11 tables

Abstract:Targeted data selection has emerged as a crucial paradigm for efficient instruction tuning, aiming to identify a small yet influential subset of training examples for a specific target task. In practice, influence is often measured through the effect of an example on parameter updates. To make selection scalable, many approaches leverage optimizer statistics (e.g., Adam states) as an axis-aligned surrogate for update geometry (i.e., a diagonal preconditioner), implicitly treating parameters as coordinate-wise independent. We show that this assumption breaks down in parameter-efficient fine-tuning (PEFT) methods such as LoRA. In this setting, the induced optimization geometry exhibits strong cross-parameter coupling with non-trivial off-diagonal interactions, while the task-relevant update directions are confined to a low-dimensional subspace. Motivated by this mismatch, we propose GIST (Gradient Isometric Subspace Transformation), a simple yet principled alternative that replaces axis-aligned scaling with robust subspace alignment. GIST recovers a task-specific subspace from validation gradients via spectral filtering (SVD), projects training gradients into this coupled subspace, and scores examples by their alignment with target [...]. [...] experiments have demonstrated that GIST matches or outperforms the state-of-the-art baseline with only 0.29% of the storage and 25% of the computational time under the same selection budget.

225. 【2602.18540】Rodent-Bench

Link: https://arxiv.org/abs/2602.18540

Authors: Thomas Heap, Laurence Aitchison, Emma Cahill, Adriana Casado Rodriguez

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, rodent behaviour footage

Notes: (none)

Abstract:We present Rodent-Bench, a novel benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to annotate rodent behaviour footage. We evaluate state-of-the-art MLLMs, including Gemini-2.5-Pro, Gemini-2.5-Flash and Qwen-VL-Max, using this benchmark and find that none of these models perform strongly enough to be used as an assistant for this task. Our benchmark encompasses diverse datasets spanning multiple behavioral paradigms including social interactions, grooming, scratching, and freezing behaviors, with videos ranging from 10 minutes to 35 minutes in length. We provide two benchmark versions to accommodate varying model capabilities and establish standardized evaluation metrics including second-wise accuracy, macro F1, mean average precision, mutual information, and Matthews correlation coefficient. While some models show modest performance on certain datasets (notably grooming detection), overall results reveal significant challenges in temporal segmentation, handling extended video sequences, and distinguishing subtle behavioral states. Our analysis identifies key limitations in current MLLMs for scientific video annotation and provides insights for future model development. Rodent-Bench serves as a foundation for tracking progress toward reliable automated behavioral annotation in neuroscience research.
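
Two of the listed metrics, second-wise accuracy and the Matthews correlation coefficient, can be sketched for per-second labels as follows. The label names, the toy sequences, and the one-vs-rest treatment of a single behaviour are assumptions for illustration; the benchmark's own evaluation code may differ.

```python
import math

def second_wise_accuracy(truth, pred):
    """Fraction of seconds whose predicted label equals the annotated label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

def mcc_for_behaviour(truth, pred, behaviour):
    """Binary Matthews correlation coefficient for one behaviour vs. everything else."""
    tp = sum(t == behaviour and p == behaviour for t, p in zip(truth, pred))
    fp = sum(t != behaviour and p == behaviour for t, p in zip(truth, pred))
    fn = sum(t == behaviour and p != behaviour for t, p in zip(truth, pred))
    tn = sum(t != behaviour and p != behaviour for t, p in zip(truth, pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is reported as 0 when it is undefined.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# One label per second of video (hypothetical six-second clip).
truth = ["groom", "groom", "rest", "rest", "groom", "rest"]
pred = ["groom", "rest", "rest", "rest", "groom", "groom"]

accuracy = second_wise_accuracy(truth, pred)
mcc = mcc_for_behaviour(truth, pred, "groom")
print("second-wise accuracy:", round(accuracy, 3))
print("MCC for grooming:", round(mcc, 3))
```

The coefficient stays near zero for a predictor that ignores the video, even when one behaviour dominates the footage, which is one reason to report it alongside plain accuracy.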

226. 【2602.18533】Morphological Addressing of Identity Basins in Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2602.18533

Authors: Andrew Fraser

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pressure creates navigable, generative pipeline, multiple levels, morphological pressure creates

Abstract:We demonstrate that morphological pressure creates navigable gradients at multiple levels of the text-to-image generative pipeline. In Study 1, identity basins in Stable Diffusion 1.5 can be navigated using morphological descriptors -- constituent features like "platinum blonde," "beauty mark," and "1950s glamour" -- without the target's name or photographs. A self-distillation loop (generating synthetic images from descriptor prompts, then training a LoRA on those outputs) achieves consistent convergence toward a specific identity as measured by ArcFace similarity. The trained LoRA creates a local coordinate system shaping not only the target identity but also its inverse: maximal away-conditioning produces "eldritch" structural breakdown in base SD1.5, while the LoRA-equipped model produces "uncanny valley" outputs -- coherent but precisely wrong. In Study 2, we extend this to prompt-level morphology. Drawing on phonestheme theory, we generate 200 novel nonsense words from English sound-symbolic clusters (e.g., cr-, sn-, -oid, -ax) and find that phonestheme-bearing candidates produce significantly more visually coherent outputs than random controls (mean Purity@1 = 0.371 vs. 0.209, p < 0.00001, Cohen's d = 0.55). Three candidates -- snudgeoid, crashax, and broomix -- achieve perfect visual consistency (Purity@1 = 1.0) with zero training data contamination, each generating a distinct, coherent visual identity from phonesthetic structure alone. Together, these studies establish that morphological structure -- whether in feature descriptors or prompt-level phonological form -- creates systematic navigational gradients through diffusion model latent spaces. We document phase transitions in identity basins, CFG-invariant identity stability, and novel visual concepts emerging from sub-lexical sound patterns.

227. 【2602.18532】VLANeXt: Recipes for Building Strong VLA Models

Link: https://arxiv.org/abs/2602.18532

Authors: Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, Chen Change Loy

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: general-purpose policy learning, leveraging strong visual, policy learning, rise of large, visual and language

Comments: 17 pages, 11 figures, Project Page: [this https URL](https://dravenalg.github.io/VLANeXt/)

Abstract:Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2 and OpenVLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. VLANeXt outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong generalization in real-world experiments. We will release a unified, easy-to-use codebase that serves as a common platform for the community to reproduce our findings, explore the design space, and build new VLA variants on top of a shared foundation.

228. 【2602.18530】Image-Based Classification of Olive Varieties Native to Turkiye Using Multiple Deep Learning Architectures: Analysis of Performance, Complexity, and Generalization

Link: https://arxiv.org/abs/2602.18530

Authors: Hatice Karatas, Irfan Atabas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: study compares multiple, compares multiple deep, locally cultivated black, cultivated black table, black table olive

Abstract:This study compares multiple deep learning architectures for the automated, image-based classification of five locally cultivated black table olive varieties in Turkey: Gemlik, Ayvalik, Uslu, Erkence, and Celebi. Using a dataset of 2500 images, ten architectures - MobileNetV2, EfficientNetB0, EfficientNetV2-S, ResNet50, ResNet101, DenseNet121, InceptionV3, ConvNeXt-Tiny, ViT-B16, and Swin-T - were trained using transfer learning. Model performance was evaluated using accuracy, precision, recall, F1-score, Matthews Correlation Coefficient (MCC), Cohen's Kappa, ROC-AUC, number of parameters, FLOPs, inference time, and generalization gap. EfficientNetV2-S achieved the highest classification accuracy (95.8%), while EfficientNetB0 provided the best trade-off between accuracy and computational complexity. Overall, the results indicate that under limited data conditions, parametric efficiency plays a more critical role than model depth alone.

229. 【2602.18527】JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Link: https://arxiv.org/abs/2602.18527

Authors: Zhan Liu, Changli Tang, Yuxin Wang, Zhiyuan Zhu, Youjun Chen, Yiwen Shao, Tianzi Wang, Lei Ke, Zengrui Jin, Chao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD)

Keywords: Current audio-visual large, audio-visual large language, relying on RGB, Current audio-visual, RGB video

Abstract:Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints and datasets will be released upon acceptance.

230. 【2602.18525】Do Generative Metrics Predict YOLO Performance? An Evaluation Across Models, Augmentation Ratios, and Dataset Complexity

Link: https://arxiv.org/abs/2602.18525

Authors: Vasile Marian, Yong-Bin Kang, Alexander Buddery

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: predict downstream detection, object-detection training sets, training remains difficult, augment object-detection training, standard global generative

Comments: 23 pages, 13 figures, includes appendix

Abstract:Synthetic images are increasingly used to augment object-detection training sets, but reliably evaluating a synthetic dataset before training remains difficult: standard global generative metrics (e.g., FID) often do not predict downstream detection mAP. We present a controlled evaluation of synthetic augmentation for YOLOv11 across three single-class detection regimes -- Traffic Signs (sparse/near-saturated), Cityscapes Pedestrian (dense/occlusion-heavy), and COCO PottedPlant (multi-instance/high-variability). We benchmark six GAN-, diffusion-, and hybrid-based generators over augmentation ratios from 10% to 150% of the real training split, and train YOLOv11 both from scratch and with COCO-pretrained initialization, evaluating on held-out real test splits (mAP@0.50:0.95). For each dataset-generator-augmentation configuration, we compute pre-training dataset metrics under a matched-size bootstrap protocol, including (i) global feature-space metrics in both Inception-v3 and DINOv2 embeddings and (ii) object-centric distribution distances over bounding-box statistics. Synthetic augmentation yields substantial gains in the more challenging regimes (up to +7.6% and +30.6% relative mAP in Pedestrian and PottedPlant, respectively) but is marginal in Traffic Signs and under pretrained fine-tuning. To separate metric signal from augmentation quantity, we report both raw and augmentation-controlled (residualized) correlations with multiple-testing correction, showing that metric-performance alignment is strongly regime-dependent and that many apparent raw associations weaken after controlling for augmentation level.

231. 【2602.18520】Sketch2Feedback: Grammar-in-the-Loop Framework for Rubric-Aligned Feedback on Student STEM Diagrams

Link: https://arxiv.org/abs/2602.18520

Authors: Aayam Bansal

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Providing timely, STEM education, challenge in STEM, persistent challenge, Providing

Abstract:Providing timely, rubric-aligned feedback on student-drawn diagrams is a persistent challenge in STEM education. While large multimodal models (LMMs) can jointly parse images and generate explanations, their tendency to hallucinate undermines trust in classroom deployments. We present Sketch2Feedback, a grammar-in-the-loop framework that decomposes the problem into four stages -- hybrid perception, symbolic graph construction, constraint checking, and constrained VLM feedback -- so that the language model verbalizes only violations verified by an upstream rule engine. We evaluate on two synthetic micro-benchmarks, FBD-10 (free-body diagrams) and Circuit-10 (circuit schematics), each with 500 images spanning standard and hard noise augmentation tiers, comparing our pipeline against end-to-end LMMs (LLaVA-1.5-7B, Qwen2-VL-7B), a vision-only detector, a YOLOv8-nano learned detector, and an ensemble oracle. On n=100 test samples per benchmark with 95% bootstrap CIs, results are mixed and instructive: Qwen2-VL-7B achieves the highest micro-F1 on both FBDs (0.570) and circuits (0.528), but with extreme hallucination rates (0.78, 0.98). An ensemble oracle that selects the best prediction per sample reaches F1=0.556 with hallucination 0.320 on FBDs, demonstrating exploitable complementarity between grammar and end-to-end approaches. Confidence thresholding at tau=0.7 reduces circuit hallucination from 0.970 to 0.880 with no F1 loss. Hard noise augmentation reveals domain-dependent robustness: FBD detection is resilient while circuit detection degrades sharply. An LLM-as-judge evaluation confirms that the grammar pipeline produces more actionable circuit feedback (4.85/5) than the end-to-end LMM (3.11/5). We release all code, datasets, and evaluation scripts.

232. 【2602.18519】Wide Open Gazes: Quantifying Visual Exploratory Behavior in Soccer with Pose Enhanced Positional Data

Link: https://arxiv.org/abs/2602.18519

Authors: Joris Bekkers

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: in-game future success, predict relevant short-term, relevant short-term in-game, short-term in-game future, visual exploratory behavior

Abstract:Traditional approaches to measuring visual exploratory behavior in soccer rely on counting visual exploratory actions (VEAs) based on rapid head movements exceeding 125°/s. These approaches suffer from player position bias (i.e., a focus on central midfielders), annotation challenges, and binary measurement constraints (i.e., a player is either scanning or not); they also lack the power to predict relevant short-term in-game future success and are incompatible with fundamental soccer analytics models such as pitch control. This research introduces a novel formulaic continuous stochastic vision layer to quantify players' visual perception from pose-enhanced spatiotemporal tracking. Our probabilistic field-of-view and occlusion models incorporate head and shoulder rotation angles to create speed-dependent vision maps for individual players in a two-dimensional top-down plane. We combine these vision maps with pitch control and pitch value surfaces to analyze the awaiting phase (when a player is waiting for the ball to arrive after a pass from a teammate) and the subsequent on-ball phase. Using 32 games of synchronized pose-enhanced tracking data and on-ball event data from the 2024 Copa America, we demonstrate that aggregated visual metrics - such as the percentage of defended area observed while awaiting a pass - are predictive of controlled pitch value gained at the end of dribbling actions. This methodology works regardless of player position, eliminates manual annotation requirements, and provides continuous measurements that seamlessly integrate into existing soccer analytics frameworks. To further support this integration, we open-source the tools required to make these calculations.

233. 【2602.18509】Depth from Defocus via Direct Optimization

Link: https://arxiv.org/abs/2602.18509

Authors: Holly Jackson, Caleb Adams, Ignacio Lopez-Francos, Benjamin Recht

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computationally challenging optimization, defocused images remains, optical physics, based on optical, collection of defocused

Abstract:Though there exists a reasonable forward model for blur based on optical physics, recovering depth from a collection of defocused images remains a computationally challenging optimization problem. In this paper, we show that with contemporary optimization methods and reasonable computing resources, a global optimization approach to depth from defocus is feasible. Our approach rests on alternating minimization. When holding the depth map fixed, the forward model is linear with respect to the all-in-focus image. When holding the all-in-focus image fixed, the depth at each pixel can be computed independently, enabling embarrassingly parallel computation. We show that alternating between convex optimization and parallel grid search can effectively solve the depth-from-defocus problem at higher resolutions than current deep learning methods. We demonstrate our approach on benchmark datasets with synthetic and real defocus blur and show promising results compared to prior approaches. Our code is available at this http URL.

234. 【2602.18505】Suppression or Deletion: A Restoration-Based Representation-Level Analysis of Machine Unlearning

Link: https://arxiv.org/abs/2602.18505

Authors: Yurim Jang, Jaeung Lee, Dohyun Kim, Jaemin Jo, Simon S. Woo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: delete sensitive, increasingly shared, forget or delete, unlearning, private information

Abstract:As pretrained models are increasingly shared on the web, ensuring that models can forget or delete sensitive, copyrighted, or private information upon request has become crucial. Machine unlearning has been proposed to address this challenge. However, current evaluations for unlearning methods rely on output-based metrics, which cannot verify whether information is completely deleted or merely suppressed at the representation level, where suppression is insufficient for true unlearning. To address this gap, we propose a novel restoration-based analysis framework that uses Sparse Autoencoders to identify class-specific expert features in intermediate layers and applies inference-time steering to quantitatively distinguish between suppression and deletion. Applying our framework to 12 major unlearning methods in image classification tasks, we find that most methods achieve high restoration rates of unlearned information, indicating that they only suppress information at the decision-boundary level, while preserving semantic features in intermediate representations. Notably, even retraining from pretrained checkpoints shows high restoration, revealing that robust semantic features inherited from pretraining are not removed by retraining. These results demonstrate that representation-level retention poses significant risks overlooked by output-based metrics, highlighting the need for new unlearning evaluation criteria. We propose new evaluation guidelines that prioritize representation-level verification, especially for privacy-critical applications in the era of pre-trained models.

235. 【2602.18504】A Computer Vision Framework for Multi-Class Detection and Tracking in Soccer Broadcast Footage

Link: https://arxiv.org/abs/2602.18504

Authors: Daniel Tshiani

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: expensive multi-camera setups, setups or GPS, GPS tracking systems, collect similar information, access to expensive

Comments: Presented at the Robyn Rafferty Mathias Research Conference. Additional information available at: [this https URL](https://DGT-International.com)

Abstract:Clubs with access to expensive multi-camera setups or GPS tracking systems gain a competitive advantage through detailed data, whereas lower-budget teams are often unable to collect similar information. This paper examines whether such data can instead be extracted directly from standard broadcast footage using a single-camera computer vision pipeline. This project develops an end-to-end system that combines a YOLO object detector with the ByteTrack tracking algorithm to identify and track players, referees, goalkeepers, and the ball throughout a match. Experimental results show that the pipeline achieves high performance in detecting and tracking players and officials, with strong precision, recall, and mAP50 scores, while ball detection remains the primary challenge. Despite this limitation, our findings demonstrate that AI can extract meaningful player-level spatial information from a single broadcast camera. By reducing reliance on specialized hardware, the proposed approach enables colleges, academies, and amateur clubs to adopt scalable, data-driven analysis methods previously accessible only to professional teams, highlighting the potential for affordable computer vision-based soccer analytics.

236. 【2602.18502】Mitigating Shortcut Learning via Feature Disentanglement in Medical Imaging: A Benchmark Study

Link: https://arxiv.org/abs/2602.18502

Authors: Sarah Müller, Philipp Berens

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: achieve excellent classification, target task, achieve excellent, causally related, exploiting spurious correlations

Abstract:Although deep learning models in medical imaging often achieve excellent classification performance, they can rely on shortcut learning, exploiting spurious correlations or confounding factors that are not causally related to the target task. This poses risks in clinical settings, where models must generalize across institutions, populations, and acquisition conditions. Feature disentanglement is a promising approach to mitigate shortcut learning by separating task-relevant information from confounder-related features in latent representations. In this study, we systematically evaluated feature disentanglement methods for mitigating shortcuts in medical imaging, including adversarial learning and latent space splitting based on dependence minimization. We assessed classification performance and disentanglement quality using latent space analyses across one artificial and two medical datasets with natural and synthetic confounders. We also examined robustness under varying levels of confounding and compared computational efficiency across methods. We found that shortcut mitigation methods improved classification performance under strong spurious correlations during training. Latent space analyses revealed differences in representation quality not captured by classification metrics, highlighting the strengths and limitations of each method. Model reliance on shortcuts depended on the degree of confounding in the training data. The best-performing models combine data-centric rebalancing with model-centric disentanglement, achieving stronger and more robust shortcut mitigation than rebalancing alone while maintaining similar computational efficiency.

237. 【2602.18500】Scaling Ultrasound Volumetric Reconstruction via Mobile Augmented Reality

Link: https://arxiv.org/abs/2602.18500

Authors: Kian Wei Ng, Yujia Gao, Deborah Khoo, Ying Zhen Tan, Chengzheng Mao, Haojie Cheng, Andrew Makmur, Kee Yuan Ngiam, Serene Goh, Eng Tat Khoo

Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)

Keywords: Computed Tomography provide, risk stratification, Accurate volumetric characterization, oncologic diagnosis, characterization of lesions

Comments: Submitted to MICCAI 2026

Abstract:Accurate volumetric characterization of lesions is essential for oncologic diagnosis, risk stratification, and treatment planning. While imaging modalities such as Computed Tomography provide high-quality 3D data, 2D ultrasound (2D-US) remains the preferred first-line modality for breast and thyroid imaging due to cost, portability, and safety factors. However, volume estimates derived from 2D-US suffer from high inter-user variability even among experienced clinicians. Existing 3D ultrasound (3D-US) solutions use specialized probes or external tracking hardware, but such configurations increase costs and diminish portability, constraining widespread clinical use. To address these limitations, we present Mobile Augmented Reality Volumetric Ultrasound (MARVUS), a resource-efficient system designed to increase accessibility to accurate and reproducible volumetric assessment. MARVUS is interoperable with conventional ultrasound (US) systems, using a foundation model to enhance cross-specialty generalization while minimizing hardware requirements relative to current 3D-US solutions. In a user study involving experienced clinicians performing measurements on breast phantoms, MARVUS yielded a substantial improvement in volume estimation accuracy (mean difference: 0.469 cm³) with reduced inter-user variability (mean difference: 0.417 cm³). Additionally, we prove that augmented reality (AR) visualizations enhance objective performance metrics and clinician-reported usability. Collectively, our findings suggest that MARVUS can enhance US-based cancer screening, diagnostic workflows, and treatment planning in a scalable, cost-conscious, and resource-efficient manner. Usage video demonstration available (this https URL).

238. 【2602.18496】A Patient-Specific Digital Twin for Adaptive Radiotherapy of Non-Small Cell Lung Cancer

Link: https://arxiv.org/abs/2602.18496

Authors: Anvi Sud, Jialu Huang, Gregory R. Hart, Keshav Saxena, John Kim, Lauren Tressel, Jun Deng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: regimens generating high, generating high frequency, high frequency imaging, streams ideally suited, dosimetry streams ideally

Abstract:Radiotherapy continues to become more precise and data-dense, with current treatment regimens generating high-frequency imaging and dosimetry streams ideally suited for AI-driven temporal modeling to characterize how normal tissues evolve with time. For non-small cell lung cancer (NSCLC) patients treated with biologically guided radiotherapy (BGRT), each fraction records new metabolic, anatomical, and dose information. However, clinical decision-making is largely informed by static, population-based NTCP models, which overlook the dynamic, unique biological trajectories encoded in sequential data. We developed COMPASS (Comprehensive Personalized Assessment System) for safe radiotherapy, functioning as a temporal digital twin architecture utilizing per-fraction PET, CT, dosiomics, radiomics, and cumulative biologically equivalent dose (BED) kinetics to model normal tissue biology as a dynamic time-series process. A GRU autoencoder was employed to learn organ-specific latent trajectories, which were classified via logistic regression to predict eventual CTCAE grade 1 or higher toxicity. Eight NSCLC patients undergoing BGRT contributed 99 organ-fraction observations covering 24 organ trajectories (spinal cord, heart, and esophagus). Despite the small cohort, intensive temporal phenotyping allowed for comprehensive analysis of individual dose-response dynamics. Our findings revealed a viable AI-driven early-warning window, as increasing risk ratings occurred several fractions before clinical toxicity. The dense BED-driven representation revealed biologically relevant spatial dose texture characteristics that occur before toxicity and are averaged out with traditional volume-based dosimetry. COMPASS establishes a proof of concept for AI-enabled adaptive radiotherapy, where treatment is guided by a continually updated digital twin that tracks each patient's evolving biological response.

239. 【2602.18466】Can Multimodal LLMs See Science Instruction? Benchmarking Pedagogical Reasoning in K-12 Classroom Videos

Link: https://arxiv.org/abs/2602.18466

Authors: Yixuan Shen, Peng He, Honglu Liu, Yuyang Ji, Tingting Li, Tianlong Chen, Kaidi Xu, Feng Liu

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: students coordinate phenomena, automated analysis elusive, made automated analysis, Generation Science Standards, science classroom discourse

Comments: 17 pages, 3 figures

Abstract:K-12 science classrooms are rich sites of inquiry where students coordinate phenomena, evidence, and explanatory models through discourse; yet, the multimodal complexity of these interactions has made automated analysis elusive. Existing benchmarks for classroom discourse focus primarily on mathematics and rely solely on transcripts, overlooking the visual artifacts and model-based reasoning emphasized by the Next Generation Science Standards (NGSS). We address this gap with SciIBI, the first video benchmark for analyzing science classroom discourse, featuring 113 NGSS-aligned clips annotated with Core Instructional Practices (CIP) and sophistication levels. By evaluating eight state-of-the-art LLMs and Multimodal LLMs, we reveal fundamental limitations: current models struggle to distinguish pedagogically similar practices, suggesting that CIP coding requires instructional reasoning beyond surface pattern matching. Furthermore, adding video input yields inconsistent gains across architectures. Crucially, our evidence-based evaluation reveals that models often succeed through surface shortcuts rather than genuine pedagogical understanding. These findings establish science classroom discourse as a challenging frontier for multimodal AI and point toward human-AI collaboration, where models retrieve evidence to accelerate expert review rather than replace it.

240. 【2602.18439】Replication Study: Federated Text-Driven Prompt Generation for Vision-Language Models

Link: https://arxiv.org/abs/2602.18439

Authors: Suraj Prasad, Anubha Pant

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: remarkable zero-shot capabilities, CLIP have demonstrated, demonstrated remarkable zero-shot, presents significant challenges

Comments: 6 pages, 2 figures

Abstract:Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities, yet their adaptation to federated learning scenarios presents significant challenges, particularly regarding generalization to unseen classes. The original FedTPG paper [Qiu2024] addresses this limitation by introducing a text-driven prompt generation network that dynamically creates prompts conditioned on class names, enabling better cross-class generalization in federated settings. In this work, we present a faithful replication study of FedTPG, evaluating the pre-trained model on six diverse vision datasets: Caltech101, Oxford Flowers, FGVC Aircraft, Oxford Pets, Food-101, and DTD. Our evaluation achieves results within 0.2% of the original paper's reported accuracies, with an average accuracy of 74.58% on seen (base) classes and 76.00% on unseen (new) classes, demonstrating a +1.43 percentage point improvement in generalization. These results validate the original paper's core claims: (1) text-driven prompt generation enables superior generalization to unseen classes compared to static prompt learning methods, and (2) federated training of prompt generators maintains high performance across diverse visual domains without sharing private data. Our successful replication confirms the robustness and reproducibility of the FedTPG approach.

241. 【2511.18765】NI-Tex: Non-isometric Image-based Garment Texture Generation

Link: https://arxiv.org/abs/2511.18765

Authors: Hui Shan, Ming Li, Haitao Yang, Kai Zheng, Sizhe Zheng, Yanwei Fu, Xiangru Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: real-world clothing geometries, diversity remains limited, Existing industrial, texture diversity remains, extract Physically-based Rendering

Comments: none

Click to view abstract

Abstract:Existing industrial 3D garment meshes already cover most real-world clothing geometries, yet their texture diversity remains limited. To acquire more realistic textures, generative methods are often used to extract Physically-based Rendering (PBR) textures and materials from large collections of wild images and project them back onto garment meshes. However, most image-conditioned texture generation approaches require strict topological consistency between the input image and the input 3D mesh, or rely on accurate mesh deformation to match to the image poses, which significantly constrains the texture generation quality and flexibility. To address the challenging problem of non-isometric image-based garment texture generation, we construct 3D Garment Videos, a physically simulated, garment-centric dataset that provides consistent geometry and material supervision across diverse deformations, enabling robust cross-pose texture learning. We further employ Nano Banana for high-quality non-isometric image editing, achieving reliable cross-topology texture generation between non-isometric image-geometry pairs. Finally, we propose an iterative baking method via uncertainty-guided view selection and reweighting that fuses multi-view predictions into seamless, production-ready PBR textures. Through extensive experiments, we demonstrate that our feedforward dual-branch architecture generates versatile and spatially aligned PBR materials suitable for industry-level 3D garment design.

242. 【2507.19418】DEFNet: Multitasks-based Deep Evidential Fusion Network for Blind Image Quality Assessment

Link: https://arxiv.org/abs/2507.19418

Authors: Yiwei Lou, Yuanpeng He, Rongchao Zhang, Yongzhi Cao, Hanpin Wang, Yu Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: Blind image quality, image quality assessment, incorporate auxiliary tasks, Blind image, quality assessment

Comments: none

Click to view abstract

Abstract:Blind image quality assessment (BIQA) methods often incorporate auxiliary tasks to improve performance. However, existing approaches face limitations due to insufficient integration and a lack of flexible uncertainty estimation, leading to suboptimal performance. To address these challenges, we propose a multitask-based Deep Evidential Fusion Network (DEFNet) for BIQA, which performs multitask optimization with the assistance of scene and distortion type classification tasks. To achieve a more robust and reliable representation, we design a novel trustworthy information fusion strategy. It first combines diverse features and patterns across sub-regions to enhance information richness, and then performs local-global information fusion by balancing fine-grained details with coarse-grained context. Moreover, DEFNet exploits an advanced uncertainty estimation technique, inspired by evidential learning, with the help of a normal-inverse gamma distribution mixture. Extensive experiments on both synthetic and authentic distortion datasets demonstrate the effectiveness and robustness of the proposed framework. Additional evaluation and analysis are carried out to highlight its strong generalization capability and adaptability to previously unseen scenarios.
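
The abstract mentions uncertainty estimation based on the normal-inverse gamma (NIG) distribution. As a rough illustration only, the sketch below computes the usual closed-form moments of a single NIG(γ, ν, α, β) as used in deep evidential regression. It is not DEFNet's mixture formulation, which the abstract does not spell out; the function name `nig_uncertainty` and the parameter values are hypothetical.

```python
def nig_uncertainty(gamma, nu, alpha, beta):
    """Return (prediction, aleatoric, epistemic) for one NIG(gamma, nu, alpha, beta).

    Assumption: the standard single-NIG parameterization from deep
    evidential regression, not the mixture used by DEFNet.
    """
    if nu <= 0 or alpha <= 1 or beta <= 0:
        raise ValueError("requires nu > 0, alpha > 1, beta > 0")
    prediction = gamma                        # E[mu]: predicted quality score
    aleatoric = beta / (alpha - 1.0)          # E[sigma^2]: noise in the data
    epistemic = beta / (nu * (alpha - 1.0))   # Var[mu]: uncertainty of the model
    return prediction, aleatoric, epistemic


# Illustrative values only; in practice a network head would output them.
pred, alea, epi = nig_uncertainty(gamma=3.2, nu=4.0, alpha=3.0, beta=1.0)
print(pred, alea, epi)  # 3.2 0.5 0.125
```

A larger ν (more "virtual evidence") shrinks the epistemic term while leaving the aleatoric term unchanged, which is the property that makes this parameterization attractive for separating the two kinds of uncertainty.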

243. 【2602.19891】Using Unsupervised Domain Adaptation Semantic Segmentation for Pulmonary Embolism Detection in Computed Tomography Pulmonary Angiogram (CTPA) Images

Link: https://arxiv.org/abs/2602.19891

Authors: Wen-Liang Lin, Yun-Chien Cheng

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Tomography Pulmonary Angiography, Computed Tomography Pulmonary, demonstrated considerable promise, Computed Tomography, Pulmonary Angiography

Comments: none

Click to view abstract

Abstract:While deep learning has demonstrated considerable promise in computer-aided diagnosis for pulmonary embolism (PE), practical deployment in Computed Tomography Pulmonary Angiography (CTPA) is often hindered by "domain shift" and the prohibitive cost of expert annotations. To address these challenges, an unsupervised domain adaptation (UDA) framework is proposed, utilizing a Transformer backbone and a Mean-Teacher architecture for cross-center semantic segmentation. The primary focus is placed on enhancing pseudo-label reliability by learning deep structural information within the feature space. Specifically, three modules are integrated and designed for this task: (1) a Prototype Alignment (PA) mechanism to reduce category-level distribution discrepancies; (2) Global and Local Contrastive Learning (GLCL) to capture both pixel-level topological relationships and global semantic representations; and (3) an Attention-based Auxiliary Local Prediction (AALP) module designed to reinforce sensitivity to small PE lesions by automatically extracting high-information slices from Transformer attention maps. Experimental validation conducted on cross-center datasets (FUMPE and CAD-PE) demonstrates significant performance gains. In the FUMPE → CAD-PE task, the IoU increased from 0.1152 to 0.4153, while the CAD-PE → FUMPE task saw an improvement from 0.1705 to 0.4302. Furthermore, the proposed method achieved a 69.9% Dice score in the CT → MRI cross-modality task on the MMWHS dataset without utilizing any target-domain labels for model selection, confirming its robustness and generalizability for diverse clinical environments.
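
The results above are reported as IoU and Dice scores. For readers less familiar with the two metrics, here is a minimal sketch of how they are commonly computed for a pair of binary masks. The function name `iou_and_dice` and the toy masks are hypothetical, and the paper may aggregate over volumes or classes differently.

```python
def iou_and_dice(pred, target):
    """Compute (IoU, Dice) for two equally sized 2D binary masks (lists of 0/1)."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # lesion pixel predicted correctly
            elif p:
                fp += 1      # predicted lesion, actually background
            elif t:
                fn += 1      # missed lesion pixel
    union = tp + fp + fn
    if union == 0:
        # Both masks are empty; returning perfect scores is one common convention.
        return 1.0, 1.0
    iou = tp / union
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice


# Toy 2x3 masks, purely illustrative.
prediction = [[1, 1, 0],
              [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
iou, dice = iou_and_dice(prediction, ground_truth)
print(round(iou, 4), round(dice, 4))  # 0.5 0.6667
```

For a single pair of masks the two metrics are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU on the same prediction.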

244. 【2602.19055】Automated Disentangling Analysis of Skin Colour for Lesion Images

Link: https://arxiv.org/abs/2602.19055

Authors: Wenbo Yang, Eman Rezk, Walaa M. Moursi, Zhou Wang

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Machine-learning models working, Machine-learning models, differs between training, training and deployment, models working

Comments: none

Click to view abstract

Abstract:Machine-learning models working on skin images often have degraded performance when the skin colour captured in images (SCCI) differs between training and deployment. Such differences arise from entangled environmental factors (e.g., illumination, camera settings), and intrinsic factors (e.g., skin tone) that cannot be accurately described by a single "skin tone" scalar. To mitigate such colour mismatch, we propose a skin-colour disentangling framework that adapts disentanglement-by-compression to learn a structured, manipulable latent space for SCCI from unlabelled dermatology images. To prevent information leakage that hinders proper learning of dark colour features, we introduce a randomized, mostly monotonic decolourization mapping. To suppress unintended colour shifts of localized patterns (e.g., ink marks, scars) during colour manipulation, we further propose a geometry-aligned post-processing step. Together, these components enable faithful counterfactual editing and answering an essential question: "What would this skin condition look like under a different SCCI?", as well as direct colour transfer between images and controlled traversal along physically meaningful directions (e.g., blood perfusion, camera white balance), enabling educational visualization of skin conditions under varying SCCI. We demonstrate that dataset-level augmentation and colour normalization based on our framework achieve competitive lesion classification performance.

245. 【2602.18883】Characterization of Residual Morphological Substructure Using Supervised and Unsupervised Deep Learning

Link: https://arxiv.org/abs/2602.18883

Authors: Kameswara Bharadwaj Mantha, Daniel H. McIntosh, Cody Ciaschi, Rubyet Evan, Luther Landry, Henry C. Ferguson, Camilla Pacifici, Joel Primack, Nimish Hathi, Anton Koekemoer, Yicheng Guo, The CANDELS Collaboration

Categories: Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)

Keywords: transformative physical processes, physical processes driving, Automated characterization, driving galaxy evolution, processes driving galaxy

Comments: This manuscript is a preprint that has not undergone peer review and is being shared to ensure dissemination and community access to the results and insights (see acknowledgements)

Click to view abstract

Abstract:Automated characterization of galactic substructure is an essential step in understanding the transformative physical processes driving galaxy evolution. In this study, we investigate the application of deep learning (DL) frameworks to characterize different galactic substructures hosted within parametric light-profile subtracted "residual" images of a large sample of galaxies from the CANDELS survey. We develop a supervised Convolutional Neural Network (CNN) and an unsupervised Convolutional Variational Autoencoder (CvAE) and train them on the single-Sérsic profile fitting based residual images of $10,046$ bright and massive galaxies ($H < 24.5\,{\rm mag}$ and $M_{\rm stellar} \geq 10^{9.5}\,M_{\odot}$) spanning $1 < z < 3$, in conjunction with their visual-based classification labels indicating the nature of residual substructures hosted within them. Using our unique data preprocessing approach, we prepare our residual images such that the inputs to our DL networks comprise only the "galaxy of interest", and augment them such that our sample spans uniformly across different residual characteristics. We assess the latent space of the CNN and CvAE using Principal Component Analysis (PCA) along with independently quantified metrics of residual strength (significant pixel flux $SPF$, Bumpiness, and Residual Flux Fraction). We also employ an unsupervised Gaussian Mixture Modeling (GMM) based clustering scheme with Support Vector Classification (SVC) to identify groupings in PCA space that correspond to similar residual substructure. We find that our supervised CNN latent features in PCA space correlate with the $SPF$ values and distinguish between qualitatively strong and weak residual substructures. Our unsupervised CvAE latent space also correlates with visual and quantitative residual characteristics, but it lacks clear discriminatory power when characterizing different residual substructures.

246. 【2602.18863】TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking

Link: https://arxiv.org/abs/2602.18863

Authors: Abdullah All Tanvir, Agnibh Dasgupta, Xin Zhong

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: complex optical degradations, deep watermarking systems, recapture introduces complex, introduces complex optical, Camera recapture introduces

Comments: This paper is accepted to CVPR 2026

Click to view abstract

Abstract:Camera recapture introduces complex optical degradations, such as perspective warping, illumination shifts, and Moiré interference, that remain challenging for deep watermarking systems. We present TIACam, a text-anchored invariant feature learning framework with auto-augmentation for camera-robust zero-watermarking. The method integrates three key innovations: (1) a learnable auto-augmentor that discovers camera-like distortions through differentiable geometric, photometric, and Moiré operators; (2) a text-anchored invariant feature learner that enforces semantic consistency via cross-modal adversarial alignment between image and text; and (3) a zero-watermarking head that binds binary messages in the invariant feature space without modifying image pixels. This unified formulation jointly optimizes invariance, semantic alignment, and watermark recoverability. Extensive experiments on both synthetic and real-world camera captures demonstrate that TIACam achieves state-of-the-art feature stability and watermark extraction accuracy, establishing a principled bridge between multimodal invariance learning and physically robust zero-watermarking.
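
Zero-watermarking, as described above, binds a message to image features without touching the pixels. The sketch below shows the classical XOR-based construction of that idea, assuming the features are simply binarized by sign. It is a stand-in for illustration: TIACam's zero-watermarking head is learned and its details are not given in the abstract. All names and values here are hypothetical.

```python
def binarize(features):
    """Turn a real-valued feature vector into bits using the sign of each entry."""
    return [1 if value >= 0.0 else 0 for value in features]


def register(features, message_bits):
    """Create the key stored outside the image: feature bits XOR message bits."""
    if len(features) != len(message_bits):
        raise ValueError("features and message must have the same length")
    return [f ^ m for f, m in zip(binarize(features), message_bits)]


def extract(features, key_bits):
    """Recover the message from (possibly distorted) features and the stored key."""
    if len(features) != len(key_bits):
        raise ValueError("features and key must have the same length")
    return [f ^ k for f, k in zip(binarize(features), key_bits)]


# Hypothetical invariant features of the original image and of a camera recapture.
original_features = [0.9, -0.4, 0.2, -1.3, 0.7, -0.1]
recaptured_features = [0.7, -0.5, 0.3, -1.0, 0.5, -0.2]  # perturbed, signs preserved
message = [1, 0, 1, 1, 0, 0]

key = register(original_features, message)   # the image itself is never modified
print(extract(recaptured_features, key))     # [1, 0, 1, 1, 0, 0]
```

The message survives exactly as long as the binarized features survive the distortion, which is why the quality of the invariant feature learner determines the extraction accuracy in schemes of this kind.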

247. 【2602.18690】Neural Fields as World Models

Link: https://arxiv.org/abs/2602.18690

Authors: Joshua Nunley

Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: brain predict physical, predict physical outcomes, brain predict, predict physical, physical outcomes

Comments: 6 pages, 6 figures. Submitted to the Annual Meeting of the Cognitive Science Society (CogSci 2026)

Click to view abstract

Abstract:How does the brain predict physical outcomes while acting in the world? Machine learning world models compress visual input into latent spaces, discarding the spatial structure that characterizes sensory cortex. We propose isomorphic world models: architectures preserving sensory topology so that physics prediction becomes geometric propagation rather than abstract state transition. We implement this using neural fields with motor-gated channels, where activity evolves through local lateral connectivity and motor commands multiplicatively modulate specific populations. Three experiments support this approach: (1) local connectivity is sufficient to learn ballistic physics, with predictions traversing intermediate locations rather than "teleporting"; (2) policies trained entirely in imagination transfer to real physics at nearly twice the rate of latent-space alternatives; and (3) motor-gated channels spontaneously develop body-selective encoding through visuomotor prediction alone. These findings suggest intuitive physics and body schema may share a common origin in spatially structured neural dynamics.

248. 【2602.18642】Auto Quantum Machine Learning for Multisource Classification

Link: https://arxiv.org/abs/2602.18642

Authors: Tomasz Rybotycki, Sebastian Dziura, Piotr Gawron

Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: data-intensive scientific fields, fault-tolerant quantum computing, applying quantum computational, remote sensing, growing interest

Comments: 15 pages, 4 figures, 3 tables. Submitted to ICCS2026

Click to view abstract

Abstract:With fault-tolerant quantum computing on the horizon, there is growing interest in applying quantum computational methods to data-intensive scientific fields like remote sensing. Quantum machine learning (QML) has already demonstrated potential for such demanding tasks. One area of particular focus is quantum data fusion -- a complex data analysis problem that has attracted significant recent attention. In this work, we introduce an automated QML (AQML) approach for addressing data fusion challenges. We evaluate how AQML-generated quantum circuits perform compared to classical multilayer perceptrons (MLPs) and manually designed QML models when processing multisource inputs. Furthermore, we apply our method to change detection using the multispectral ONERA dataset, achieving improved accuracy over previously reported QML-based change detection results.

249. 【2602.18589】DM4CT: Benchmarking Diffusion Models for Computed Tomography Reconstruction

Link: https://arxiv.org/abs/2602.18589

Authors: Jiayang Shi, Daniel M. Pelt, K. Joost Batenburg

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: solving inverse problems, Diffusion models, linear inverse problem, recently emerged, emerged as powerful

Comments: ICLR 2026

Click to view abstract

Abstract:Diffusion models have recently emerged as powerful priors for solving inverse problems. While computed tomography (CT) is theoretically a linear inverse problem, it poses many practical challenges. These include correlated noise, artifact structures, reliance on system geometry, and misaligned value ranges, which make the direct application of diffusion models more difficult than in domains like natural image generation. To systematically evaluate how diffusion models perform in this context and compare them with established reconstruction methods, we introduce DM4CT, a comprehensive benchmark for CT reconstruction. DM4CT includes datasets from both medical and industrial domains with sparse-view and noisy configurations. To explore the challenges of deploying diffusion models in practice, we additionally acquire a high-resolution CT dataset at a high-energy synchrotron facility and evaluate all methods under real experimental conditions. We benchmark ten recent diffusion-based methods alongside seven strong baselines, including model-based, unsupervised, and supervised approaches. Our analysis provides detailed insights into the behavior, strengths, and limitations of diffusion models for CT reconstruction. The real-world dataset is publicly available and the codebase is open-sourced; both links are given on the arXiv abstract page.
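
To make the phrase "linear inverse problem" concrete, here is a toy sketch, not taken from the benchmark: a 2x2 image is "scanned" by taking only its row and column sums (a stand-in for a sparse-view projection operator A), and the image is estimated with plain Landweber iterations on the data-fidelity term. The operator, image values, step size, and function names are all assumptions chosen for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def landweber(A, y, step, iterations):
    """Minimize ||A x - y||^2 by gradient steps x <- x + step * A^T (y - A x)."""
    At = transpose(A)
    x = [0.0] * len(A[0])
    for _ in range(iterations):
        residual = [yi - pi for yi, pi in zip(y, matvec(A, x))]
        x = [xi + step * gi for xi, gi in zip(x, matvec(At, residual))]
    return x


# Flattened 2x2 image [x00, x01, x10, x11]; measurements are 2 row sums + 2 column sums.
A = [[1, 1, 0, 0],
     [0, 0, 1, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1]]
true_image = [1.0, 2.0, 3.0, 5.0]
y = matvec(A, true_image)            # [3.0, 8.0, 4.0, 7.0]

estimate = landweber(A, y, step=0.25, iterations=100)
print([round(v, 4) for v in estimate])  # [0.75, 2.25, 3.25, 4.75]
```

The estimate reproduces the measurements exactly yet differs from the true image, because the four projections do not determine the four pixels uniquely. Filling in that missing information is the role a learned prior, such as a diffusion model, is meant to play in the methods benchmarked above.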

250. 【2602.18542】4D-UNet improves clutter rejection in human transcranial contrast enhanced ultrasound

Link: https://arxiv.org/abs/2602.18542

Authors: Tristan Beruard, Armand Delbos, Arthur Chavignon, Maxence Reberol, Vincent Hingot

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: high skull absorption, limiting vascular imaging, skull absorption, limiting vascular, largest vessels

Comments: 9 pages, 7 figures

Click to view abstract

Abstract:Transcranial ultrasound imaging is hampered by high skull absorption, which restricts vascular imaging to only the largest vessels. Traditional clutter filters struggle with low signal-to-noise ratio (SNR) ultrasound datasets, where blood and tissue signals cannot be easily separated, even when the echogenicity of the blood is improved with contrast agents. Here, we present a novel 4D U-Net approach for clutter filtering in transcranial 3D Contrast Enhanced Ultrasound (CEUS) that exploits spatial and temporal information to enhance microbubble detection in transcranial data acquired in human adults. Our results show that the 4D-UNet improves on temporal clutter filters. By integrating deep learning into CEUS, this study advances neurovascular imaging, offering improved clutter rejection and visualization. The findings underscore the potential of AI-driven approaches to enhance ultrasound-based medical imaging, paving the way for more accurate diagnostics and broader clinical applications.

251. 【2602.18536】Triggering hallucinations in model-based MRI reconstruction via adversarial perturbations

Link: https://arxiv.org/abs/2602.18536

Authors: Suna Buğday, Yvan Saeys, Jonathan Peck

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: computed tomography, hallucinations, Generative models, medical imaging, magnetic resonance

Comments: 20 pages
 
Click to view abstract

Abstract:Generative models are increasingly used to improve the quality of medical imaging, such as reconstruction of magnetic resonance images and computed tomography. However, it is well-known that such models are susceptible to hallucinations: they may insert features into the reconstructed image which are not actually present in the original image. In a medical setting, such hallucinations may endanger patient health as they can lead to incorrect diagnoses. In this work, we aim to quantify the extent to which state-of-the-art generative models suffer from hallucinations in the context of magnetic resonance image reconstruction. Specifically, we craft adversarial perturbations resembling random noise for the unprocessed input images which induce hallucinations when reconstructed using a generative model. We perform this evaluation on the brain and knee images from the fastMRI data set using UNet and end-to-end VarNet architectures to reconstruct the images. Our results show that these models are highly susceptible to small perturbations and can be easily coaxed into producing hallucinations. This fragility may partially explain why hallucinations occur in the first place and suggests that a carefully constructed adversarial training routine may reduce their prevalence. Moreover, these hallucinations cannot be reliably detected using traditional image quality metrics. Novel approaches will therefore need to be developed to detect when hallucinations have occurred.
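
The abstract does not say how the adversarial perturbations are crafted. As a hedged illustration of the general idea, the sketch below uses a sign-gradient step (the approach popularized by FGSM) against a toy linear "reconstructor", for which the gradient is available in closed form. Attacking a real UNet or VarNet would require automatic differentiation; the weights, input, and function names here are hypothetical.

```python
def sign(value):
    return (value > 0) - (value < 0)


def reconstruct(weights, measurement):
    """Toy linear reconstructor: each output pixel is a weighted sum of the inputs."""
    return [sum(w * m for w, m in zip(row, measurement)) for row in weights]


def adversarial_perturbation(weights, target_pixel, epsilon):
    """Perturbation with max-norm epsilon that most increases one output pixel.

    For a linear model the gradient of output[target_pixel] with respect to the
    input is simply weights[target_pixel], so the sign-gradient step is exact.
    """
    return [epsilon * sign(w) for w in weights[target_pixel]]


weights = [[0.5, -0.25, 0.25],
           [0.0, 1.0, -0.5]]
measurement = [1.0, 2.0, 0.0]
clean = reconstruct(weights, measurement)

delta = adversarial_perturbation(weights, target_pixel=0, epsilon=0.1)
attacked = reconstruct(weights, [m + d for m, d in zip(measurement, delta)])

# Pixel 0 brightens by epsilon * sum(|w|) although no input changed by more than 0.1.
print(clean[0], round(attacked[0], 6))  # 0.0 0.1
```

Even in this toy case the spurious intensity scales with the sum of the absolute weights feeding the pixel, so a perturbation that looks like low-amplitude noise can still produce a visible feature in the output.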