本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新517篇论文,其中:
- 自然语言处理59篇
- 信息检索15篇
- 计算机视觉125篇
自然语言处理
1. 【2602.21202】Multi-Vector Index Compression in Any Modality
链接:https://arxiv.org/abs/2602.21202
作者:Hanxiang Qin,Alexander Martin,Rohan Jha,Chunsheng Zuo,Reno Kriz,Benjamin Van Durme
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:study efficient multi-vector, efficient multi-vector retrieval, late interaction, study efficient, efficient multi-vector
备注: 12 pages, 4 figures
点击查看摘要
Abstract:We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: this http URL.
2. 【2602.21201】Aletheia tackles FirstProof autonomously
链接:https://arxiv.org/abs/2602.21201
作者:Tony Feng,Junehyuk Jung,Sang-hyun Kim,Carlo Pagano,Sergei Gukov,Chiang-Chiang Tsai,David Woodruff,Adel Javanmard,Aryan Mokhtari,Dawsen Hwang,Yuri Chervonyi,Jonathan N. Lee,Garrett Bingham,Trieu H. Trinh,Vahab Mirrokni,Quoc V. Le,Thang Luong
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:mathematics research agent, research agent powered, powered by Gemini, inaugural FirstProof challenge, report the performance
备注: 34 pages. Project page: [this https URL](https://github.com/google-deepmind/superhuman/tree/main/aletheia)
点击查看摘要
Abstract:We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 problems (2, 5, 7, 8, 9, 10) out of 10 according to majority expert assessments; we note that experts were not unanimous on Problem 8 (only). For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as our evaluation. Raw prompts and outputs are available at this https URL.
3. 【2602.21198】Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
链接:https://arxiv.org/abs/2602.21198
作者:Yining Hong,Huang Huang,Manling Li,Li Fei-Fei,Jiajun Wu,Yejin Choi
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Embodied LLMs endow, high-level task reasoning, LLMs endow robots, Embodied LLMs, Reflective Test-Time Planning
备注:
点击查看摘要
Abstract:Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \textit{reflection-in-action}, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and \textit{reflection-on-action}, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.
4. 【2602.21193】On Data Engineering for Scaling LLM Terminal Capabilities
链接:https://arxiv.org/abs/2602.21193
作者:Renjie Pi,Grace Lam,Mohammad Shoeybi,Pooya Jannaty,Bryan Catanzaro,Wei Ping
类目:Computation and Language (cs.CL)
关键词:remain largely undisclosed, rapid recent progress, agents remain largely, terminal agents remain, large language models
备注:
点击查看摘要
Abstract:Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3(8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0% Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at this https URL.
5. 【2602.21165】PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data
链接:https://arxiv.org/abs/2602.21165
作者:Samah Fodeh,Linhai Ma,Yan Wang,Srivani Talakokkul,Ganesh Puthiaraju,Afshan Khan,Ashley Hagaman,Sarah Lowe,Aimee Roundtree
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:reflecting communicative behaviors, Patient-generated text, reflecting communicative, interviews contains rich, rich expressions
备注:
点击查看摘要
Abstract:Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH). Traditional qualitative coding frameworks are labor intensive and do not scale to large volumes of patient-authored messages across health systems. Existing machine learning (ML) and natural language processing (NLP) approaches provide partial solutions but often treat patient-centered communication (PCC) and SDoH as separate tasks or rely on models not well suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code, Subcode, and Combo-level labels. Topic representations are incorporated during fine-tuning and inference to enrich semantic inputs. PVminer achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, achieving F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request for research use.
6. 【2602.21158】SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
链接:https://arxiv.org/abs/2602.21158
作者:Dengjia Zhang,Xiaoou Liu,Lu Cheng,Yaqing Wang,Kenton Murray,Hua Wei
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Large language models, Large language, multi-step decision-making agents, Evolving LLM Agent, effective reward design
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate, providing dense confidence-aligned supervision, and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently improves success rates over strong baselines. Ablation studies further demonstrate how uncertainty signals enhance exploration and robustness.
7. 【2602.21143】A Benchmark for Deep Information Synthesis
链接:https://arxiv.org/abs/2602.21143
作者:Debjit Paul,Daniel Murphy,Milan Gritta,Ronald Cardenas,Victor Prokhorov,Lena Sophia Bolliger,Aysim Toker,Roy Miles,Andreea-Maria Oncescu,Jasivan Alex Sivakumar,Philipp Borchert,Ismail Elezi,Meiru Zhang,Ka Yiu Lee,Guchun Zhang,Jun Wang,Gerasimos Lampouras
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Large language model, complex tasks involving, tasks involving tool, code execution, solve complex tasks
备注: Accepted at ICLR 2026
点击查看摘要
Abstract:Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights. DEEPSYNTH contains 120 tasks collected across 7 domains and data sources covering 67 countries. DEEPSYNTH is constructed using a multi-stage data collection pipeline that requires annotators to collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers. When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces, highlighting DEEPSYNTH as a crucial benchmark for guiding future research.
8. 【2602.21103】Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
链接:https://arxiv.org/abs/2602.21103
作者:Sanket Badhe,Deep Shah
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:test-time inference costs, Advanced reasoning typically, substantial test-time inference, reasoning typically requires, incurs prohibitive latency
备注:
点击查看摘要
Abstract:Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model's System Prompt. Evaluated on the StereoSet and Contract-NLI datasets using Gemma-3 4B, PLD improved Macro F1 scores from 57\% to 90.0\% and 67\% to 83\% respectively, enabling this compact model to match frontier performance with negligible latency overhead. These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-volume use cases and edge devices.
9. 【2602.21082】Beyond the Star Rating: A Scalable Framework for Aspect-Based Sentiment Analysis Using LLMs and Text Classification
链接:https://arxiv.org/abs/2602.21082
作者:Vishal Patil,Shree Vaishnavi Bacha,Revanth Yamani,Yidan Sun,Mayank Kejriwal
类目:Computation and Language (cs.CL)
关键词:Customer-provided reviews, important source, source of information, information for business, business owners
备注:
点击查看摘要
Abstract:Customer-provided reviews have become an important source of information for business owners and other customers alike. However, effectively analyzing millions of unstructured reviews remains challenging. While large language models (LLMs) show promise for natural language understanding, their application to large-scale review analysis has been limited by computational costs and scalability concerns. This study proposes a hybrid approach that uses LLMs for aspect identification while employing classic machine-learning methods for sentiment classification at scale. Using ChatGPT to analyze sampled restaurant reviews, we identified key aspects of dining experiences and developed sentiment classifiers using human-labeled reviews, which we subsequently applied to 4.7 million reviews collected over 17 years from a major online platform. Regression analysis reveals that our machine-labeled aspects significantly explain variance in overall restaurant ratings across different aspects of dining experiences, cuisines, and geographical regions. Our findings demonstrate that combining LLMs with traditional machine learning approaches can effectively automate aspect-based sentiment analysis of large-scale customer feedback, suggesting a practical framework for both researchers and practitioners in the hospitality industry and potentially, other service sectors.
10. 【2602.21059】An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
链接:https://arxiv.org/abs/2602.21059
作者:Anna Martin-Boyle,William Humphreys,Martha Brown,Cara Leckey,Harmanpreet Kaur
类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, reliability remains uncertain, transforming scholarly tasks
备注: 24 pages, 2 figures. Accepted at ACM CHI conference on Human Factors in Computing Systems, 2026
点击查看摘要
Abstract:Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain. Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice. We developed and validated a schema for evaluating LLM errors in scholarly question-answering systems that reflects the assessment strategies of practicing scientists. In collaboration with domain experts, we identified 20 error patterns across seven categories through thematic analysis of 68 question-answer pairs. We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues. Domain experts use systematic assessment strategies, including technical precision testing, value-based evaluation, and meta-evaluation of their own practices. We discuss implications for supporting expert evaluation of LLM outputs, including opportunities for personalized, schema-driven tools that adapt to individual evaluation patterns and expertise levels.
11. 【2602.21054】VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
链接:https://arxiv.org/abs/2602.21054
作者:Seongheon Park,Changdae Oh,Hyeong Kyu Choi,Xuefeng Du,Sharon Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, Large Vision-Language, frequently hallucinate, limiting their safe, real-world applications
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model's ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model's output depends on visual evidence. VAUQ introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to visual input, and an unsupervised core-region masking strategy that amplifies the influence of salient regions. Combining predictive entropy with this core-masked IS yields a training-free scoring function that reliably reflects answer correctness. Comprehensive experiments show that VAUQ consistently outperforms existing self-evaluation methods across multiple datasets.
12. 【2602.21045】PaperTrail: A Claim-Evidence Interface for Grounding Provenance in LLM-based Scholarly QA
链接:https://arxiv.org/abs/2602.21045
作者:Anna Martin-Boyle,Cara A.C. Leckey,Martha C. Brown,Harmanpreet Kaur
类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
关键词:Large language models, synthesize vast amounts, Large language, researchers synthesize vast, language models
备注: 25 pages, 3 figures. Accepted at the ACM CHI conference on Human Factors in Computing Systems 2026
点击查看摘要
Abstract:Large language models (LLMs) are increasingly used in scholarly question-answering (QA) systems to help researchers synthesize vast amounts of literature. However, these systems often produce subtle errors (e.g., unsupported claims, errors of omission), and current provenance mechanisms like source citations are not granular enough for the rigorous verification that scholarly domain requires. To address this, we introduce PaperTrail, a novel interface that decomposes both LLM answers and source documents into discrete claims and evidence, mapping them to reveal supported assertions, unsupported claims, and information omitted from the source texts. We evaluated PaperTrail in a within-subjects study with 26 researchers who performed two scholarly editing tasks using PaperTrail and a baseline interface. Our results show that PaperTrail significantly lowered participants' trust compared to the baseline. However, this increased caution did not translate to behavioral changes, as people continued to rely on LLM-generated scholarly edits to avoid a cognitively burdensome task. We discuss the value of claim-evidence matching for understanding LLM trustworthiness in scholarly settings, and present design implications for cognition-friendly communication of provenance information.
13. 【2602.21009】HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders
链接:https://arxiv.org/abs/2602.21009
作者:Kun Yuan,Junyu Bi,Daixuan Cheng,Changfa Wu,Shuwen Xiao,Binbin Cao,Jian Wu,Yuning Jiang
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:Modern recommender systems, recommender systems leverage, systems leverage ultra-long, leverage ultra-long user, Modern recommender
备注:
点击查看摘要
Abstract:Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints. While summarizing history via interest centers offers a practical alternative, existing methods struggle to (1) identify user-specific centers at appropriate granularity and (2) accurately assign behaviors, leading to quantization errors and loss of long-tail preferences. To alleviate these issues, we propose Hierarchical Sparse Activation Compression (HiSAC), an efficient framework for personalized sequence modeling. HiSAC encodes interactions into multi-level semantic IDs and constructs a global hierarchical codebook. A hierarchical voting mechanism sparsely activates personalized interest-agents as fine-grained preference centers. Guided by these agents, Soft-Routing Attention aggregates historical signals in semantic space, weighting by similarity to minimize quantization error and retain long-tail behaviors. Deployed on Taobao's "Guess What You Like" homepage, HiSAC achieves significant compression and cost reduction, with online A/B tests showing a consistent 1.65% CTR uplift -- demonstrating its scalability and real-world effectiveness.
14. 【2602.20995】Generative Pseudo-Labeling for Pre-Ranking with LLMs
链接:https://arxiv.org/abs/2602.20995
作者:Junyu Bi,Xinting Niu,Daixuan Cheng,Kun Yuan,Tao Wang,Binbin Cao,Jian Wu,Yuning Jiang
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:efficiently scoring thousands, tasked with efficiently, downstream ranking, critical stage, stage in industrial
备注:
点击查看摘要
Abstract:Pre-ranking is a critical stage in industrial recommendation systems, tasked with efficiently scoring thousands of recalled items for downstream ranking. A key challenge is the train-serving discrepancy: pre-ranking models are trained only on exposed interactions, yet must score all recalled candidates -- including unexposed items -- during online serving. This mismatch not only induces severe sample selection bias but also degrades generalization, especially for long-tail content. Existing debiasing approaches typically rely on heuristics (e.g., negative sampling) or distillation from biased rankers, which either mislabel plausible unexposed items as negatives or propagate exposure bias into pseudo-labels. In this work, we propose Generative Pseudo-Labeling (GPL), a framework that leverages large language models (LLMs) to generate unbiased, content-aware pseudo-labels for unexposed items, explicitly aligning the training distribution with the online serving space. By offline generating user-specific interest anchors and matching them with candidates in a frozen semantic space, GPL provides high-quality supervision without adding online latency. Deployed in a large-scale production system, GPL improves click-through rate by 3.07%, while significantly enhancing recommendation diversity and long-tail item discovery.
15. 【2602.20976】Evaluating Proactive Risk Awareness of Large Language Models
链接:https://arxiv.org/abs/2602.20976
作者:Xuan Luo,Yubin Chen,Zhiyu Hou,Linpu Yu,Geng Tu,Jing Li,Ruifeng Xu
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:explicit harmful intent, large language models, safety responsibilities extend, everyday decision-making, increasingly embedded
备注:
点击查看摘要
Abstract:As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended but consequential risks. In this work, we introduce a proactive risk awareness evaluation framework that measures whether LLMs can anticipate potential harms and provide warnings before damage occurs. We construct the Butterfly dataset to instantiate this framework in the environmental and ecological domain. It contains 1,094 queries that simulate ordinary solution-seeking activities whose responses may induce latent ecological impact. Through experiments across five widely used LLMs, we analyze the effects of response length, languages, and modality. Experimental results reveal consistent, significant declines in proactive awareness under length-restricted responses, cross-lingual similarities, and persistent blind spots in (multimodal) species protection. These findings highlight a critical gap between current safety alignment and the requirements of real-world ecological responsibility, underscoring the need for proactive safeguards in LLM deployment.
16. 【2602.20973】Linear Reasoning vs. Proof by Cases: Obstacles for Large Language Models in FOL Problem Solving
链接:https://arxiv.org/abs/2602.20973
作者:Yuliang Ji,Fuchen Shen,Jian Wu,Qiujie Xie,Yue Zhang
类目:Computation and Language (cs.CL)
关键词:Large Language Models, capabilities of Large, introduced abundant mathematical, Large Language, researchers have introduced
备注:
点击查看摘要
Abstract:To comprehensively evaluate the mathematical reasoning capabilities of Large Language Models (LLMs), researchers have introduced abundant mathematical reasoning datasets. However, most existing datasets primarily focus on linear reasoning, neglecting other parts such as proof by contradiction and proof by cases, which are crucial for investigating LLMs' reasoning abilities. To address this limitation, we first introduce a novel first-order logic (FOL) dataset named PC-FOL, annotated by professional mathematicians, focusing on case-based reasoning problems. All instances in this dataset are equipped with a manually written natural language proof, clearly distinguishing it from conventional linear reasoning datasets. Our experimental results over leading LLMs demonstrate a substantial performance gap between linear reasoning and case-based reasoning problems. To further investigate this phenomenon, we provide a theoretical analysis grounded in graphical model, which provides an explanation for the observed disparity between the two types of reasoning problems. We hope this work can reveal the core challenges in the field of automated natural language mathematical proof generation, paving the way for future research.
17. 【2602.20966】Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models
链接:https://arxiv.org/abs/2602.20966
作者:Paola Merlo,Chunyang Jiang,Giuseppe Samo,Vivi Nastase
类目:Computation and Language (cs.CL)
关键词:Blackbird Language Matrices, Language Matrices, Blackbird Language, inspired by intelligence, intelligence tests
备注: Under review, 46 pages, 5 tables, 28 figures
点击查看摘要
Abstract:This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems, structured at multiple levels: within each sentence, across the input sequence, within each candidate answer. Because of their rich structure, these curated, but naturalistic datasets are key to answer some core questions about current large language models abilities: do LLMs detect linguistic objects and their properties? Do they detect and use systematic patterns across sentences? Are they more prone to linguistic or reasoning errors, and how do these interact? We show that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models. We show that their representations contain the grammatical objects and attributes relevant to solve a linguistic task. We also show that these solutions are reached by detecting systematic patterns across sentences. The paper supports the point of view that curated, structured datasets support multi-faceted investigations of properties of language and large language models. Because they present a curated, articulated structure, because they comprise both learning contexts and expected answers, and because they are partly built by hand, BLMs fall in the category of datasets that can support explainability investigations, and be useful to ask why large language models behave the way they do.
Comments:
Under review, 46 pages, 5 tables, 28 figures
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2602.20966 [cs.CL]
(or
arXiv:2602.20966v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.20966
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
18. 【2602.20945】he Art of Efficient Reasoning: Data, Reward, and Optimization
链接:https://arxiv.org/abs/2602.20945
作者:Taiqiang Wu,Zenan Zu,Bo Zhou,Ngai Wong
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, Language Models, heavy computational overhead, consistently benefit
备注: Tech Report, Insights on Efficient Reasoning via Reward Shaping
点击查看摘要
Abstract:Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate for more fine-grained metrics, including length distribution conditioned on correctness and performance across a wide spectrum of token budgets ranging from 2k to 32k. First, we reveal that the training process follows a two-stage paradigm: length adaptation and reasoning refinement. After that, we conduct extensive experiments (about 0.2 million GPU hours) in a unified protocol, deconstructing training prompts and rollouts, reward shaping, and optimization strategies. In particular, a key finding is to train on relatively easier prompts, ensuring the density of positive reward signals and thus avoiding the length collapse. Meanwhile, the learned length bias can be generalized across domains. We distill all findings into valuable insights and practical guidelines, and further validate them across the Qwen3 series, ranging from 0.6B to 30B, demonstrating the robustness and generalization.
19. 【2602.20918】Predicting Sentence Acceptability Judgments in Multimodal Contexts
链接:https://arxiv.org/abs/2602.20918
作者:Hyewon Jang,Nikolai Ilinykh,Sharid Loáiciga,Jey Han Lau,Shalom Lappin
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:deep neural networks, neural networks, visual contexts, examined the capacity, capacity of deep
备注:
点击查看摘要
Abstract:Previous work has examined the capacity of deep neural networks (DNNs), particularly transformers, to predict human sentence acceptability judgments, both independently of context, and in document contexts. We consider the effect of prior exposure to visual images (i.e., visual context) on these judgments for humans and large language models (LLMs). Our results suggest that, in contrast to textual context, visual images appear to have little if any impact on human acceptability ratings. However, LLMs display the compression effect seen in previous work on human judgments in document contexts. Different sorts of LLMs are able to predict human acceptability judgments to a high degree of accuracy, but in general, their performance is slightly better when visual contexts are removed. Moreover, the distribution of LLM judgments varies among models, with Qwen resembling human patterns, and others diverging from them. LLM-generated predictions on sentence acceptability are highly correlated with their normalised log probabilities in general. However, the correlations decrease when visual contexts are present, suggesting that a higher gap exists between the internal representations of LLMs and their generated predictions in the presence of visual contexts. Our experimental work suggests interesting points of similarity and of difference between human and LLM processing of sentences in multimodal contexts.
20. 【2602.20892】Exa-PSD: a new Persian sentiment analysis dataset on Twitter
链接:https://arxiv.org/abs/2602.20892
作者:Seyed Himan Ghaderi,Saeed Sarbazi Azad,Mohammad Mehdi Jaziriyan,Ahmad Akbari
类目:Computation and Language (cs.CL)
关键词:widely used platforms, platforms for communication, Persian, Sentiment analysis, communication of people
备注: This is the original submitted (preprint) version of a paper published in Language Resources and Evaluation. The final published version is available at Springer via DOI: [this https URL](https://doi.org/10.1007/s10579-025-09886-5)
点击查看摘要
Abstract:Today, Social networks such as Twitter are the most widely used platforms for communication of people. Analyzing this data has useful information to recognize the opinion of people in tweets. Sentiment analysis plays a vital role in NLP, which identifies the opinion of the individuals about a specific topic. Natural language processing in Persian has many challenges despite the adventure of strong language models. The datasets available in Persian are generally in special topics such as products, foods, hotels, etc while users may use ironies, colloquial phrases in social media To overcome these challenges, there is a necessity for having a dataset of Persian sentiment analysis on Twitter. In this paper, we introduce the Exa sentiment analysis Persian dataset, which is collected from Persian tweets. This dataset contains 12,000 tweets, annotated by 5 native Persian taggers. The aforementioned data is labeled in 3 classes: positive, neutral and negative. We present the characteristics and statistics of this dataset and use the pre-trained Pars Bert and Roberta as the base model to evaluate this dataset. Our evaluation reached a 79.87 Macro F-score, which shows the model and data can be adequately valuable for a sentiment analysis system.
21. 【2602.20859】FinAnchor: Aligned Multi-Model Representations for Financial Prediction
链接:https://arxiv.org/abs/2602.20859
作者:Zirui He,Huopu Zhang,Yanguang Liu,Sirui Wu,Mengnan Du
类目:Computation and Language (cs.CL)
关键词:involves significant challenges, long documents involves, documents involves significant, generating embeddings varies, significant challenges
备注: 11 pages, 4 figures, 5 tables
点击查看摘要
Abstract:Financial prediction from long documents involves significant challenges, as actionable signals are often sparse and obscured by noise, and the optimal LLM for generating embeddings varies across tasks and time periods. In this paper, we propose FinAnchor(Financial Anchored Representations), a lightweight framework that integrates embeddings from multiple LLMs without fine-tuning the underlying models. FinAnchor addresses the incompatibility of feature spaces by selecting an anchor embedding space and learning linear mappings to align representations from other models into this anchor. These aligned features are then aggregated to form a unified representation for downstream prediction. Across multiple financial NLP tasks, FinAnchor consistently outperforms strong single-model baselines and standard ensemble methods, demonstrating the effectiveness of anchoring heterogeneous representations for robust financial prediction.
22. 【2602.20816】Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation
链接:https://arxiv.org/abs/2602.20816
作者:Sayantan Dasgupta,Trevor Cohn,Timothy Baldwin
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:core learning signal, language model distillation, standard Kullback-Leibler, core learning, learning signal
备注:
点击查看摘要
Abstract:The core learning signal used in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be dominated by the next tokens with the highest probabilities, i.e., the teacher's modes, thereby diminishing the influence of less probable yet potentially informative components of the output distribution. We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions, while maintaining the same computational profile as the KL Divergence. Our decoupled approach reduces the impact of the teacher modes and, consequently, increases the contribution of the tail of the distribution. Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models across various datasets. Furthermore, the distillation process is efficient and can be performed with a modest academic budget for large datasets, eliminating the need for industry-scale computing.
23. 【2602.20759】Overton Pluralistic Reinforcement Learning for Large Language Models
链接:https://arxiv.org/abs/2602.20759
作者:Yu Fu,Seongho Son,Ilija Bogunovic
类目:Computation and Language (cs.CL)
关键词:Existing alignment paradigms, alignment paradigms remain, paradigms remain limited, Existing alignment, Overton Pluralism
备注: 28 pages, 8 figures
点击查看摘要
Abstract:Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.
24. 【2602.20751】SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
链接:https://arxiv.org/abs/2602.20751
作者:Yifei Xu,Guilherme Potje,Shivam Shandilya,Tiancheng Yuan,Leonardo de Oliveira Nunes,Rakshanda Agarwal,Saeid Asgari,Adam Atkinson,Emre Kıcıman,Songwu Lu,Ranveer Chandra,Tusher Chakraborty
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Designing aligned, open-ended generation remains, aligned and robust, generation remains, remains a key
备注:
点击查看摘要
Abstract:Designing aligned and robust rewards for open-ended generation remains a key barrier to RL post-training. Rubrics provide structured, interpretable supervision, but scaling rubric construction is difficult: expert rubrics are costly, prompted rubrics are often superficial or inconsistent, and fixed-pool discriminative rubrics can saturate and drift, enabling reward hacking. We present SibylSense, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items. Memory is updated via verifier-based item rewards measured by reference-candidate answer discriminative gaps from a handful of examples. SibylSense alternates memory tuning with a rubric-adversarial policy update that produces rubric-satisfying candidate answers, shrinking discriminative gaps and driving the rubric generator to capture new quality dimensions. Experiments on two open-ended tasks show that SibylSense yields more discriminative rubrics and improves downstream RL performance over static and non-adaptive baselines.
25. 【2602.20749】Explicit Grammar Semantic Feature Fusion for Robust Text Classification
链接:https://arxiv.org/abs/2602.20749
作者:Azrin Sultana,Firoz Ahmed
类目:Computation and Language (cs.CL)
关键词:Natural Language Processing, Language Processing enables, Processing enables computers, understand human language, Natural Language
备注: 30 pages, 9 figures
点击查看摘要
Abstract:Natural Language Processing enables computers to understand human language by analysing and classifying text efficiently with deep-level grammatical and semantic features. Existing models capture features by learning from large corpora with transformer models, which are computationally intensive and unsuitable for resource-constrained environments. Therefore, our proposed study incorporates comprehensive grammatical rules alongside semantic information to build a robust, lightweight classification model without resorting to full parameterised transformer models or heavy deep learning architectures. The novelty of our approach lies in its explicit encoding of sentence-level grammatical structure, including syntactic composition, phrase patterns, and complexity indicators, into a compact grammar vector, which is then fused with frozen contextual embeddings. These heterogeneous elements unified a single representation that captures both the structural and semantic characteristics of the text. Deep learning models such as Deep Belief Networks (DBNs), Long Short-Term Memory (LSTMs), BiLSTMs, and transformer-based BERT and XLNET were used to train and evaluate the model, with the number of epochs varied. Based on experimental results, the unified feature representation model captures both the semantic and structural properties of text, outperforming baseline models by 2%-15%, enabling more effective learning across heterogeneous domains. Unlike prior syntax-aware transformer models that inject grammatical structure through additional attention layers, tree encoders, or full fine-tuning, the proposed framework treats grammar as an explicit inductive bias rather than a learnable module, resulting in a very lightweight model that delivers better performance on edge devices
26. 【2602.20743】Adaptive Text Anonymization: Learning Privacy-Utility Trade-offs via Prompt Optimization
链接:https://arxiv.org/abs/2602.20743
作者:Gabriel Loiseau,Damien Sileo,Damien Riquet,Maxime Meyer,Marc Tommasi
类目:Computation and Language (cs.CL)
关键词:Anonymizing textual documents, highly context-sensitive problem, Anonymizing textual, utility preservation varies, context-sensitive problem
备注:
点击查看摘要
Abstract:Anonymizing textual documents is a highly context-sensitive problem: the appropriate balance between privacy protection and utility preservation varies with the data domain, privacy objectives, and downstream application. However, existing anonymization methods rely on static, manually designed strategies that lack the flexibility to adjust to diverse requirements and often fail to generalize across domains. We introduce adaptive text anonymization, a new task formulation in which anonymization strategies are automatically adapted to specific privacy-utility requirements. We propose a framework for task-specific prompt optimization that automatically constructs anonymization instructions for language models, enabling adaptation to different privacy goals, domains, and downstream usage patterns. To evaluate our approach, we present a benchmark spanning five datasets with diverse domains, privacy constraints, and utility objectives. Across all evaluated settings, our framework consistently achieves a better privacy-utility trade-off than existing baselines, while remaining computationally efficient and effective on open-source language models, with performance comparable to larger closed-source models. Additionally, we show that our method can discover novel anonymization strategies that explore different points along the privacy-utility trade-off frontier.
27. 【2602.20735】RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition
链接:https://arxiv.org/abs/2602.20735
作者:Kun Ran,Marwah Alaofi,Danula Hettiachchi,Chenglong Ma,Khoi Nguyen Dinh Anh,Khoi Vo Nguyen,Sachin Pathiyan Cherumanal,Lida Rashidi,Falk Scholer,Damiano Spina,Shuoqi Sun,Oleg Zendel
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Toggle, Toggle Hugging Face, Code Toggle Papers, Bibliographic Explorer Toggle, Explorer Toggle Bibliographic
备注: MMU-RAG NeurIPS 2025 winning system
点击查看摘要
Abstract:This paper presents the award-winning RMIT-ADM+S system for the Text-to-Text track of the NeurIPS~2025 MMU-RAG Competition. We introduce Routing-to-RAG (R2RAG), a research-focused retrieval-augmented generation (RAG) architecture composed of lightweight components that dynamically adapt the retrieval strategy based on inferred query complexity and evidence sufficiency. The system uses smaller LLMs, enabling operation on a single consumer-grade GPU while supporting complex research tasks. It builds on the G-RAG system, winner of the ACM~SIGIR~2025 LiveRAG Challenge, and extends it with modules informed by qualitative review of outputs. R2RAG won the Best Dynamic Evaluation award in the Open Source category, demonstrating high effectiveness with careful design and efficient use of resources.
Comments:
MMU-RAG NeurIPS 2025 winning system
Subjects:
Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:
arXiv:2602.20735 [cs.IR]
(or
arXiv:2602.20735v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2602.20735
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Oleg Zendel [view email] [v1]
Tue, 24 Feb 2026 09:58:25 UTC (93 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition, by Kun Ran and 11 other authorsView PDFHTML (experimental)TeX Source
view license
Current browse context: cs.IR
prev
|
next
new
|
recent
| 2026-02
Change to browse by:
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked=“checked”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
Links to Code Toggle
Papers with Code (What is Papers with Code?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
28. 【2602.20727】ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition
链接:https://arxiv.org/abs/2602.20727
作者:Xindian Ma,Rundong Kong,Peng Zhang,Ruoxiang Huang,Yongyu Jiang
类目:Computation and Language (cs.CL)
关键词:equips Large Language, Large Language Models, Large Language, equips Large, Language Models
备注:
点击查看摘要
Abstract:LoRA has become a universal Parameter-Efficient Fine-Tuning (PEFT) technique that equips Large Language Models (LLMs) to adapt quickly to new tasks. However, when these models are scaled up, even the latest LoRA variants still introduce considerable overhead in trainable parameters. Conversely, aggressively lowering the rank to curb this overhead markedly degrades performance in complex multi-task settings. We propose ID-LoRA, a novel PEFT framework that breaks the trade-off. Its core innovation lies in extracting and reusing clustered parameter groups from the pretrained weight matrix. These groups are then used to form multiple low-rank components, all of which share only a single initialized trainable low-rank matrix. This approach cuts the number of trainable parameters while keeping the model's capacity intact. We evaluate ID-LoRA on five diverse benchmarks: Mathematical Reasoning, Code Generation, MMLU, CommonsenseQA, and Safety Alignment. ID-LoRA outperforms both full fine-tuning and existing PEFT baselines (e.g., LoRA, DoRA, HydraLoRA) while using up to 46% fewer trainable parameters than the standard LoRA. In multi-task scenarios, it surpasses LoRA and its recent variants (e.g., DoRA and HydraLoRA) on both Code and MMLU tasks, yet requires only 54% of the trainable parameters demanded by the conventional LoRA.
29. 【2602.20710】Counterfactual Simulation Training for Chain-of-Thought Faithfulness
链接:https://arxiv.org/abs/2602.20710
作者:Peter Hase,Christopher Potts
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:CST, Inspecting, LLM produced, CoT, Counterfactual Simulation Training
备注:
点击查看摘要
Abstract:Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals (by 35 accuracy points) as well as simulatability over generic counterfactuals (by 2 points). We further show that: (1) CST outperforms prompting baselines, (2) rewriting unfaithful CoTs with an LLM is 5x more efficient than RL alone, (3) faithfulness improvements do not generalize to dissuading cues (as opposed to persuading cues), and (4) larger models do not show more faithful CoT out of the box, but they do benefit more from CST. These results suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring. Code for experiments in this paper is available at this https URL
30. 【2602.20670】CAMEL: Confidence-Gated Reflection for Reward Modeling
链接:https://arxiv.org/abs/2602.20670
作者:Zirui Zhu,Hailun Xu,Yang Luo,Yong Liu,Kanchan Sarkar,Kun Xu,Yang You
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Reward models play, aligning large language, large language models, Reward models, play a fundamental
备注: Preprint. 13 pages
点击查看摘要
Abstract:Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.
31. 【2602.20648】CARE: An Explainable Computational Framework for Assessing Client-Perceived Therapeutic Alliance Using Large Language Models
链接:https://arxiv.org/abs/2602.20648
作者:Anqi Li,Chenxiao Wang,Yu Lu,Renjun Xu,Lizhi Ma,Zhenzhong Lan
类目:Computation and Language (cs.CL)
关键词:CARE, perceptions remains challenging, counseling effectiveness, interpretable rationales, lack interpretable rationales
备注: 14 pages, 4 figures
点击查看摘要
Abstract:Client perceptions of the therapeutic alliance are critical for counseling effectiveness. Accurately capturing these perceptions remains challenging, as traditional post-session questionnaires are burdensome and often delayed, while existing computational approaches produce coarse scores, lack interpretable rationales, and fail to model holistic session context. We present CARE, an LLM-based framework to automatically predict multi-dimensional alliance scores and generate interpretable rationales from counseling transcripts. Built on the CounselingWAI dataset and enriched with 9,516 expert-curated rationales, CARE is fine-tuned using rationale-augmented supervision with the LLaMA-3.1-8B-Instruct backbone. Experiments show that CARE outperforms leading LLMs and substantially reduces the gap between counselor evaluations and client-perceived alliance, achieving over 70% higher Pearson correlation with client ratings. Rationale-augmented supervision further improves predictive accuracy. CARE also produces high-quality, contextually grounded rationales, validated by both automatic and human evaluations. Applied to real-world Chinese online counseling sessions, CARE uncovers common alliance-building challenges, illustrates how interaction patterns shape alliance development, and provides actionable insights, demonstrating its potential as an AI-assisted tool for supporting mental health care.
32. 【2602.20647】Semantic Novelty at Scale: Narrative Shape Taxonomy and Readership Prediction in 28,606 Books
链接:https://arxiv.org/abs/2602.20647
作者:W. Frederick Zimmerman
类目:Computation and Language (cs.CL)
关键词:Piecewise Aggregate Approximation, paragraph sentence embedding, cosine distance, paragraph sentence, preceding paragraphs
备注: six figures. dataset available at Hugging Face
点击查看摘要
Abstract:I introduce semantic novelty--cosine distance between each paragraph's sentence embedding and the running centroid of all preceding paragraphs--as an information-theoretic measure of narrative structure at corpus scale. Applying it to 28,606 books in PG19 (pre-1920 English literature), I compute paragraph-level novelty curves using 768-dimensional SBERT embeddings, then reduce each to a 16-segment Piecewise Aggregate Approximation (PAA). Ward-linkage clustering on PAA vectors reveals eight canonical narrative shape archetypes, from Steep Descent (rapid convergence) to Steep Ascent (escalating unpredictability). Volume--variance of the novelty trajectory--is the strongest length-independent predictor of readership (partial rho = 0.32), followed by speed (rho = 0.19) and Terminal/Initial ratio (rho = 0.19). Circuitousness shows strong raw correlation (rho = 0.41) but is 93 percent correlated with length; after control, partial rho drops to 0.11--demonstrating that naive correlations in corpus studies can be dominated by length confounds. Genre strongly constrains narrative shape (chi squared = 2121.6, p 10 to the power negative 242), with fiction maintaining plateau profiles while nonfiction front-loads information. Historical analysis shows books became progressively more predictable between 1840 and 1910 (T/I ratio trend r = negative 0.74, p = 0.037). SAX analysis reveals 85 percent signature uniqueness, suggesting each book traces a nearly unique path through semantic space. These findings demonstrate that information-density dynamics, distinct from sentiment or topic, constitute a fundamental dimension of narrative structure with measurable consequences for reader engagement. Dataset: this https URL
33. 【2602.20634】Enhancing Hate Speech Detection on Social Media: A Comparative Analysis of Machine Learning Models and Text Transformation Approaches
链接:https://arxiv.org/abs/2602.20634
作者:Saurabh Mishra,Shivani Thakur,Radhika Mamidi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:social media platforms, hate speech, moderation tools, social media, media platforms
备注: 32 pages, 24 figures
点击查看摘要
Abstract:The proliferation of hate speech on social media platforms has necessitated the development of effective detection and moderation tools. This study evaluates the efficacy of various machine learning models in identifying hate speech and offensive language and investigates the potential of text transformation techniques to neutralize such content. We compare traditional models like CNNs and LSTMs with advanced neural network models such as BERT and its derivatives, alongside exploring hybrid models that combine different architectural features. Our results indicate that while advanced models like BERT show superior accuracy due to their deep contextual understanding, hybrid models exhibit improved capabilities in certain scenarios. Furthermore, we introduce innovative text transformation approaches that convert negative expressions into neutral ones, thereby potentially mitigating the impact of harmful content. The implications of these findings are discussed, highlighting the strengths and limitations of current technologies and proposing future directions for more robust hate speech detection systems.
34. 【2602.20610】SpecMind: Cognitively Inspired, Interactive Multi-Turn Framework for Postcondition Inference
链接:https://arxiv.org/abs/2602.20610
作者:Cuong Chi Le,Minh V.T Pham,Tung Vu Duy,Cuong Duc Van,Huy N. Phan,Hoang N. Phan,Tien N. Nguyen
类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)
关键词:manually remains challenging, challenging and time-intensive, vital for ensuring, writing them manually, manually remains
备注:
点击查看摘要
Abstract:Specifications are vital for ensuring program correctness, yet writing them manually remains challenging and time-intensive. Recent large language model (LLM)-based methods have shown successes in generating specifications such as postconditions, but existing single-pass prompting often yields inaccurate results. In this paper, we present SpecMind, a novel framework for postcondition generation that treats LLMs as interactive and exploratory reasoners rather than one-shot generators. SpecMind employs feedback-driven multi-turn prompting approaches, enabling the model to iteratively refine candidate postconditions by incorporating implicit and explicit correctness feedback, while autonomously deciding when to stop. This process fosters deeper code comprehension and improves alignment with true program behavior via exploratory attempts. Our empirical evaluation shows that SpecMind significantly outperforms state-of-the-art approaches in both accuracy and completeness of generated postconditions.
35. 【2602.20580】Personal Information Parroting in Language Models
链接:https://arxiv.org/abs/2602.20580
作者:Nishant Subramani,Kshitish Ghate,Mona Diab
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
关键词:increasing privacy risks, Modern language models, Modern language, personal information, LMs memorize
备注: EACL Findings 2026
点击查看摘要
Abstract:Modern language models (LM) are trained on large scrapes of the Web, containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (RR) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization: finding that 13.6% are parroted verbatim by the Pythia-6.9b model, i.e., when the model is prompted with the tokens that precede the PI in the original document, greedy decoding generates the entire PI span exactly. We expand this analysis to study models of varying sizes (160M-6.9B) and pretraining time steps (70k-143k iterations) in the Pythia model suite and find that both model size and amount of pretraining are positively correlated with memorization. Even the smallest model, Pythia-160m, parrots 2.7% of the instances exactly. Consequently, we strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.
36. 【2602.20574】GATES: Self-Distillation under Privileged Context with Consensus Gating
链接:https://arxiv.org/abs/2602.20574
作者:Alex Stein,Furong Huang,Tom Goldstein
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:ground truth labels, verifiable rewards, truth labels, study self-distillation, self-distillation in settings
备注: 10 Pages of main text with an additional 7 pages of supplementary material
点击查看摘要
Abstract:We study self-distillation in settings where supervision is unreliable: there are no ground truth labels, verifiable rewards, or external graders to evaluate answers. We focus on document-grounded question answering with asymmetric context, where a single model serves as both tutor (with access to a relevant source document during training) and student (answering from the question alone at test time). Rather than assuming tutor correctness, we derive supervision online from tutor consensus by sampling multiple document-grounded reasoning traces and using agreement to gate learning. Conditioned on this reliability signal, we distill knowledge through full tutor reasoning trajectories (not just final answers), providing a dense and stable learning signal. Empirically, this consensus-gated trajectory distillation substantially improves transfer to the document-free student. Held-out in-domain accuracy under asymmetric evaluation improves from 46.0\% to 62.0\%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2\% to 35.4\%.
37. 【2602.20532】Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training
链接:https://arxiv.org/abs/2602.20532
作者:Zhengyao Gu,Jonathan Light,Raul Astudillo,Ziyu Ye,Langzhou He,Henry Peng Zou,Wei Cheng,Santiago Paternain,Philip S. Yu,Yisong Yue
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:learning typically relies, making effective curriculum, reinforcement learning typically, effective curriculum learning, reinforcement learning post-training
备注: 37 pages, 8 figures, 1 table. Preprint under review. Equal contribution by first two authors
点击查看摘要
Abstract:Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
38. 【2602.20528】Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning
链接:https://arxiv.org/abs/2602.20528
作者:Justin Lovelace,Christian Belardi,Sofian Zalouk,Adhitya Polavaram,Srivatsa Kundurthy,Kilian Q. Weinberger
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:integrates latent diffusion, integrates latent, latent diffusion planning, latent diffusion, Language Diffusion Model
备注: COLM 2025
点击查看摘要
Abstract:The Stop-Think-AutoRegress Language Diffusion Model (STAR-LDM) integrates latent diffusion planning with autoregressive generation. Unlike conventional autoregressive language models limited to token-by-token decisions, STAR-LDM incorporates a "thinking" phase that pauses generation to refine a semantic plan through diffusion before continuing. This enables global planning in continuous space prior to committing to discrete tokens. Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves $70\%$ win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning. The architecture also allows straightforward control through lightweight classifiers, enabling fine-grained steering of attributes without model retraining while maintaining better fluency-control trade-offs than specialized approaches.
39. 【2602.20517】Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination
链接:https://arxiv.org/abs/2602.20517
作者:Rakshit Trivedi,Kartik Sharma,David C Parkes
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Effective human-AI coordination, coordination requires artificial, requires artificial agents, human-AI coordination requires, changing contexts
备注: Spotlight paper at NeurIPS 2025
点击查看摘要
Abstract:Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training them to mimic human-demonstrated behaviors. However, current methods struggle to capture the inherent diversity and non-Markovian nature of human behavior and lack the ability to steer behavior at inference time. Drawing inspiration from the theory of human cognitive processes, where inner speech guides action selection before execution, we propose MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an internal representation of behavioral intent. MIMIC employs the novel use of vision-language models as linguistic scaffolding to train a conditional variational autoencoder capable of generating inner speech from observations. A diffusion-based behavior cloning policy then selects actions conditioned on current observations and the generated inner speech. MIMIC enables fine-grained steering of behavior at inference time by conditioning the agent on behavior-specific speech. Experiments across robotic manipulation tasks and human-AI collaboration games demonstrate that MIMIC significantly enhances both behavior diversity and fidelity to human demonstrations while enabling nuanced behavioral steering without training on additional demonstrations. We open source our code and provide pre-trained MIMIC agents and qualitative demos at: this https URL.
40. 【2602.20513】From Performance to Purpose: A Sociotechnical Taxonomy for Evaluating Large Language Model Utility
链接:https://arxiv.org/abs/2602.20513
作者:Gavin Levinson,Keith Feldman
类目:Computation and Language (cs.CL)
关键词:completing discrete tasks, diverse real-world systems, continue to improve, discrete tasks, real-world systems
备注:
点击查看摘要
Abstract:As large language models (LLMs) continue to improve at completing discrete tasks, they are being integrated into increasingly complex and diverse real-world systems. However, task-level success alone does not establish a model's fit for use in practice. In applied, high-stakes settings, LLM effectiveness is driven by a wider array of sociotechnical determinants that extend beyond conventional performance measures. Although a growing set of metrics capture many of these considerations, they are rarely organized in a way that supports consistent evaluation, leaving no unified taxonomy for assessing and comparing LLM utility across use cases. To address this gap, we introduce the Language Model Utility Taxonomy (LUX), a comprehensive framework that structures utility evaluation across four domains: performance, interaction, operations, and governance. Within each domain, LUX is organized hierarchically into thematically aligned dimensions and components, each grounded in metrics that enable quantitative comparison and alignment of model selection with intended use. In addition, an external dynamic web tool is provided to support exploration of the framework by connecting each component to a repository of relevant metrics (factors) for applied evaluation.
41. 【2602.20459】PreScience: A Benchmark for Forecasting Scientific Contributions
链接:https://arxiv.org/abs/2602.20459
作者:Anirudh Ajith,Amanpreet Singh,Jay DeYoung,Nadav Kunievsky,Austin C. Kozlowski,Oyvind Tafjord,James Evans,Daniel S. Weld,Tom Hope,Doug Downey
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:advances that follow, systems trained, fixed point, point in time, time forecast
备注: 10 pages (53 with bibliography and appendix), 4 figures (13 with appendix), 4 tables (10 with appendix), 1 algorithm
点击查看摘要
Abstract:Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help researchers identify collaborators and impactful research directions, and anticipate which problems and methods will become central next. We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. PreScience is a carefully curated dataset of 98K recent AI-related research papers, featuring disambiguated author identities, temporally aligned scholarly metadata, and a structured graph of companion author publication histories and citations spanning 502K total papers. We develop baselines and evaluations for each task, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement. We find substantial headroom remains in each task -- e.g. in contribution generation, frontier LLMs achieve only moderate similarity to the ground-truth (GPT-5, averages 5.6 on a 1-10 scale). When composed into a 12-month end-to-end simulation of scientific production, the resulting synthetic corpus is systematically less diverse and less novel than human-authored research from the same period.
42. 【2602.20449】Protein Language Models Diverge from Natural Language: Comparative Analysis and Improved Inference
链接:https://arxiv.org/abs/2602.20449
作者:Anna Hart,Chi Han,Jeonghwan Kim,Huimin Zhao,Heng Ji
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
关键词:Modern Protein Language, Modern Protein, natural language processing, natural language, natural language domain
备注:
点击查看摘要
Abstract:Modern Protein Language Models (PLMs) apply transformer-based model architectures from natural language processing to biological sequences, predicting a variety of protein functions and properties. However, protein language has key differences from natural language, such as a rich functional space despite a vocabulary of only 20 amino acids. These differences motivate research into how transformer-based architectures operate differently in the protein domain and how we can better leverage PLMs to solve protein-related tasks. In this work, we begin by directly comparing how the distribution of information stored across layers of attention heads differs between the protein and natural language domain. Furthermore, we adapt a simple early-exit technique-originally used in the natural language domain to improve efficiency at the cost of performance-to achieve both increased accuracy and substantial efficiency gains in protein non-structural property prediction by allowing the model to automatically select protein representations from the intermediate layers of the PLMs for the specific task and protein at hand. We achieve performance gains ranging from 0.4 to 7.01 percentage points while simultaneously improving efficiency by over 10 percent across models and non-structural prediction tasks. Our work opens up an area of research directly comparing how language models change behavior when moved into the protein domain and advances language modeling in biological domains.
43. 【2602.20433】Disentangling Geometry, Performance, and Training in Language Models
链接:https://arxiv.org/abs/2602.20433
作者:Atharva Kulkarni,Jacob Mitchell Springer,Arjun Subramonian,Swabha Swayamdipta
类目:Computation and Language (cs.CL)
关键词:properties of Transformer, model interpretability research, interpretability research, Transformer weights, unembedding matrix
备注:
点击查看摘要
Abstract:Geometric properties of Transformer weights, particularly the unembedding matrix, have been widely useful in language model interpretability research. Yet, their utility for estimating downstream performance remains unclear. In this work, we systematically investigate the relationship between model performance and the unembedding matrix geometry, particularly its effective rank. Our experiments, involving a suite of 108 OLMo-style language models trained under controlled variation, reveal several key findings. While the best-performing models often exhibit a high effective rank, this trend is not universal across tasks and training setups. Contrary to prior work, we find that low effective rank does not cause late-stage performance degradation in small models, but instead co-occurs with it; we find adversarial cases where low-rank models do not exhibit saturation. Moreover, we show that effective rank is strongly influenced by pre-training hyperparameters, such as batch size and weight decay, which in-turn affect the model's performance. Lastly, extending our analysis to other geometric metrics and final-layer representation, we find that these metrics are largely aligned, but none can reliably predict downstream performance. Overall, our findings suggest that the model's geometry, as captured by existing metrics, primarily reflects training choices rather than performance.
44. 【2602.20423】MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
链接:https://arxiv.org/abs/2602.20423
作者:Taha Koleilat,Hojat Asgariandehkordi,Omid Nejati Manzari,Berardino Barile,Yiming Xiao,Hassan Rivaz
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:ambiguous anatomical features, Medical image segmentation, remains challenging due, image segmentation remains, Medical image
备注: CVPR 2026; Project Page: [this https URL](https://tahakoleilat.github.io/MedCLIPSeg)
点击查看摘要
Abstract:Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation.
45. 【2602.20379】Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems
链接:https://arxiv.org/abs/2602.20379
作者:Mukul Chhabra,Luigi Medrano,Arush Verma
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Enterprise Retrieval-Augmented Generation, reflect operational constraints, Retrieval-Augmented Generation, structured identifiers, error codes
备注: 12 pages including appendix, 6 figures
点击查看摘要
Abstract:Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, where evaluation must reflect operational constraints, structured identifiers (e.g., error codes, versions), and resolution workflows. Existing RAG evaluation frameworks are primarily designed for benchmark-style or single-turn settings and often fail to capture enterprise-specific failure modes such as case misidentification, workflow misalignment, and partial resolution across turns. We present a case-aware LLM-as-a-Judge evaluation framework for enterprise multi-turn RAG systems. The framework evaluates each turn using eight operationally grounded metrics that separate retrieval quality, grounding fidelity, answer utility, precision integrity, and case/workflow alignment. A severity-aware scoring protocol reduces score inflation and improves diagnostic clarity across heterogeneous enterprise cases. The system uses deterministic prompting with strict JSON outputs, enabling scalable batch evaluation, regression testing, and production monitoring. Through a comparative study of two instruction-tuned models across short and long workflows, we show that generic proxy metrics provide ambiguous signals, while the proposed framework exposes enterprise-critical tradeoffs that are actionable for system improvement.
Comments:
12 pages including appendix, 6 figures
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2602.20379 [cs.CL]
(or
arXiv:2602.20379v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.20379
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
46. 【2602.20372】How communicatively optimal are exact numeral systems? Once more on lexicon size and morphosyntactic complexity
链接:https://arxiv.org/abs/2602.20372
作者:Chundra Cathcart,Arne Rubehn,Katja Bocklage,Luca Ciucci,Kellen Parker van Dam,Alžběta Kučerová,Jekaterina Mažara,Carlo Y. Meloni,David Snee,Johann-Mattis List
类目:Computation and Language (cs.CL)
关键词:Recent research argues, optimize communicative efficiency, average morphosyntactic complexity, exact recursive numeral, systems optimize communicative
备注:
点击查看摘要
Abstract:Recent research argues that exact recursive numeral systems optimize communicative efficiency by balancing a tradeoff between the size of the numeral lexicon and the average morphosyntactic complexity (roughly length in morphemes) of numeral terms. We argue that previous studies have not characterized the data in a fashion that accounts for the degree of complexity languages display. Using data from 52 genetically diverse languages and an annotation scheme distinguishing between predictable and unpredictable allomorphy (formal variation), we show that many of the world's languages are decisively less efficient than one would expect. We discuss the implications of our findings for the study of numeral systems and linguistic evolution more generally.
47. 【2602.20336】Natural Language Processing Models for Robust Document Categorization
链接:https://arxiv.org/abs/2602.20336
作者:Radoslaw Roszczyk,Pawel Tecza,Maciej Stodolski,Krzysztof Siwek
类目:Computation and Language (cs.CL)
关键词:unbalanced document categorization, learning methods applied, machine learning methods, automated text classification, alongside the design
备注: 13 pages, 1 fiure, 5 tables
点击查看摘要
Abstract:This article presents an evaluation of several machine learning methods applied to automated text classification, alongside the design of a demonstrative system for unbalanced document categorization and distribution. The study focuses on balancing classification accuracy with computational efficiency, a key consideration when integrating AI into real world automation pipelines. Three models of varying complexity were examined: a Naive Bayes classifier, a bidirectional LSTM network, and a fine tuned transformer based BERT model. The experiments reveal substantial differences in performance. BERT achieved the highest accuracy, consistently exceeding 99\%, but required significantly longer training times and greater computational resources. The BiLSTM model provided a strong compromise, reaching approximately 98.56\% accuracy while maintaining moderate training costs and offering robust contextual understanding. Naive Bayes proved to be the fastest to train, on the order of milliseconds, yet delivered the lowest accuracy, averaging around 94.5\%. Class imbalance influenced all methods, particularly in the recognition of minority categories. A fully functional demonstrative system was implemented to validate practical applicability, enabling automated routing of technical requests with throughput unattainable through manual processing. The study concludes that BiLSTM offers the most balanced solution for the examined scenario, while also outlining opportunities for future improvements and further exploration of transformer architectures.
Comments:
13 pages, 1 fiure, 5 tables
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2602.20336 [cs.CL]
(or
arXiv:2602.20336v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.20336
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
48. 【2602.20332】No One Size Fits All: QueryBandits for Hallucination Mitigation
链接:https://arxiv.org/abs/2602.20332
作者:Nicole Cho,William Watson,Alec Koppel,Sumitra Ganesh,Manuela Veloso
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large Language Models, Advanced reasoning capabilities, Large Language, mitigation work focuses, Advanced reasoning
备注:
点击查看摘要
Abstract:Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations in closed-source models is especially concerning, as they constitute the vast majority of models in institutional deployments. We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function. Across 16 QA scenarios, our top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. Moreover, all contextual bandits outperform vanilla bandits across all datasets, with higher feature variance coinciding with greater variance in arm selection. This substantiates our finding that there is no single rewrite policy optimal for all queries. We also discover that certain static policies incur higher cumulative regret than No-Rewrite, indicating that an inflexible query-rewriting policy can worsen hallucinations. Thus, learning an online policy over semantic features with QueryBandits can shift model behavior purely through forward-pass mechanisms, enabling its use with closed-source models and bypassing the need for retraining or gradient-based adaptation.
49. 【2602.20324】An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
链接:https://arxiv.org/abs/2602.20324
作者:Cathy Shyr,Yan Hu,Rory J. Tinker,Thomas A. Cassini,Kevin W. Byram,Rizwan Hamid,Daniel V. Fabbri,Adam Wright,Josh F. Peterson,Lisa Bastarache,Hua Xu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Human Phenotype Ontology, HPO terms, difficult to scale, University Medical Center, Undiagnosed Diseases Network
备注:
点击查看摘要
Abstract:Phenotyping is fundamental to rare disease diagnosis, but manual curation of structured phenotypes from clinical notes is labor-intensive and difficult to scale. Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontology (HPO) terms, and prioritizing diagnostically informative HPO terms. We developed RARE-PHENIX, an end-to-end AI framework for rare disease phenotyping that integrates large language model-based phenotype extraction, ontology-grounded standardization to HPO terms, and supervised ranking of diagnostically informative phenotypes. We trained RARE-PHENIX using data from 2,671 patients across 11 Undiagnosed Diseases Network clinical sites, and externally validated it on 16,357 real-world clinical notes from Vanderbilt University Medical Center. Using clinician-curated HPO terms as the gold standard, RARE-PHENIX consistently outperformed a state-of-the-art deep learning baseline (PhenoBERT) across ontology-based similarity and precision-recall-F1 metrics in end-to-end evaluation (i.e., ontology-based similarity of 0.70 vs. 0.58). Ablation analyses demonstrated performance improvements with the addition of each module in RARE-PHENIX (extraction, standardization, and prioritization), supporting the value of modeling the full clinical phenotyping workflow. By modeling phenotyping as a clinically aligned workflow rather than a single extraction task, RARE-PHENIX provides structured, ranked phenotypes that are more concordant with clinician curation and has the potential to support human-in-the-loop rare disease diagnosis in real-world settings.
50. 【2602.20300】What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance
链接:https://arxiv.org/abs/2602.20300
作者:William Watson,Nicole Cho,Sumitra Ganesh,Manuela Veloso
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Model, Large Language, Language Model, decoding strategy, treated as defects
备注: EACL 2026 Findings
点击查看摘要
Abstract:Large Language Model (LLM) hallucinations are usually treated as defects of the model or its decoding strategy. Drawing on classical linguistics, we argue that a query's form can also shape a listener's (and model's) response. We operationalize this insight by constructing a 22-dimension query feature vector covering clause complexity, lexical rarity, and anaphora, negation, answerability, and intention grounding, all known to affect human comprehension. Using 369,837 real-world queries, we ask: Are there certain types of queries that make hallucination more likely? A large-scale analysis reveals a consistent "risk landscape": certain features such as deep clause nesting and underspecification align with higher hallucination propensity. In contrast, clear intention grounding and answerability align with lower hallucination rates. Others, including domain specificity, show mixed, dataset- and model-dependent effects. Thus, these findings establish an empirically observable query-feature representation correlated with hallucination risk, paving the way for guided query rewriting and future intervention studies.
51. 【2602.20294】InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation
链接:https://arxiv.org/abs/2602.20294
作者:Yu Li,Pranav Narayanan Venkit,Yada Pruksachatkun,Chien-Sheng Wu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:requires grounding generation, Simulating real personalities, Simulating real, language models requires, models requires grounding
备注:
点击查看摘要
Abstract:Simulating real personalities with large language models requires grounding generation in authentic personal data. Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said. We address this gap with an interview-grounded evaluation framework for personality simulation at a large scale. We extract over 671,000 question-answer pairs from 23,000 verified interview transcripts across 1,000 public personalities, each with an average of 11.5 hours of interview content. We propose a multi-dimensional evaluation framework with four complementary metrics measuring content similarity, factual consistency, personality alignment, and factual knowledge retention. Through systematic comparison, we demonstrate that methods grounded in real interview data substantially outperform those relying solely on biographical profiles or the model's parametric knowledge. We further reveal a trade-off in how interview data is best utilized: retrieval-augmented methods excel at capturing personality style and response quality, while chronological-based methods better preserve factual consistency and knowledge retention. Our evaluation framework enables principled method selection based on application requirements, and our empirical findings provide actionable insights for advancing personality simulation research.
52. 【2602.20224】Exploring Anti-Aging Literature via ConvexTopics and Large Language Models
链接:https://arxiv.org/abs/2602.20224
作者:Lana E. Yeganova,Won G. Kim,Shubo Tian,Natalie Xie,Donald C. Comeau,W. John Wilbur,Zhiyong Lu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:detecting emerging trends, biomedical publications creates, publications creates challenges, emerging trends, rapid expansion
备注:
点击查看摘要
Abstract:The rapid expansion of biomedical publications creates challenges for organizing knowledge and detecting emerging trends, underscoring the need for scalable and interpretable methods. Common clustering and topic modeling approaches such as K-means or LDA remain sensitive to initialization and prone to local optima, limiting reproducibility and evaluation. We propose a reformulation of a convex optimization based clustering algorithm that produces stable, fine-grained topics by selecting exemplars from the data and guaranteeing a global optimum. Applied to about 12,000 PubMed articles on aging and longevity, our method uncovers topics validated by medical experts. It yields interpretable topics spanning from molecular mechanisms to dietary supplements, physical activity, and gut microbiota. The method performs favorably, and most importantly, its reproducibility and interpretability distinguish it from common clustering approaches, including K-means, LDA, and BERTopic. This work provides a basis for developing scalable, web-accessible tools for knowledge discovery.
53. 【2602.20191】MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs
链接:https://arxiv.org/abs/2602.20191
作者:Dongwei Wang,Jinhee Kim,Seokho Han,Denis Gudovskiy,Yohei Nakata,Tomoyuki Okuno,KhayTze Peong,Kang Eun Jeon,Jong Hwan Ko,Yiran Chen,Huanrui Yang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Changing runtime complexity, large language model, edge devices necessitates, devices necessitates elastic, necessitates elastic large
备注: 17 pages, 12 figures
点击查看摘要
Abstract:Changing runtime complexity on cloud and edge devices necessitates elastic large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources. However, it has been observed that the calibration parameters for quantization are typically linked to specific precisions, which presents challenges during elastic-precision calibration and precision switching at runtime. In this work, we attribute the source of varying calibration parameters to the varying token-level sensitivity caused by a precision-dependent outlier migration this http URL by this observation, we propose \texttt{MoBiQuant}, a novel Mixture-of-Bits quantization framework that adjusts weight precision for elastic LLM inference based on token sensitivity. Specifically, we propose the many-in-one recursive residual quantization that can iteratively reconstruct higher-precision weights and the token-aware router to dynamically select the number of residual bit slices. MoBiQuant enables smooth precision switching while improving generalization for the distribution of token outliers. Experimental results demonstrate that MoBiQuant exhibits strong elasticity, enabling it to match the performance of bit-specific calibrated PTQ on LLaMA3-8B without repeated calibration.
54. 【2602.20166】ConceptRM: The Quest to Mitigate Alert Fatigue through Consensus-Based Purity-Driven Data Cleaning for Reflection Modelling
链接:https://arxiv.org/abs/2602.20166
作者:Yongda Yu,Lei Zhang,Xinxin Guo,Minghui Yu,Zhengqi Zhuang,Guoping Rong,Haifeng Shen,Zhengfeng Li,Boge Wang,Guoan Zhang,Bangyu Xiang,Xiaobin Xu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:involving intelligent agents, overlook critical issues, applications involving intelligent, so-called alert fatigue, intelligent agents
备注:
点击查看摘要
Abstract:In many applications involving intelligent agents, the overwhelming volume of alerts (mostly false) generated by the agents may desensitize users and cause them to overlook critical issues, leading to the so-called ''alert fatigue''. A common strategy is to train a reflection model as a filter to intercept false alerts with labelled data collected from user verification feedback. However, a key challenge is the noisy nature of such data as it is often collected in production environments. As cleaning noise via manual annotation incurs high costs, this paper proposes a novel method ConceptRM for constructing a high-quality corpus to train a reflection model capable of effectively intercepting false alerts. With only a small amount of expert annotations as anchors, ConceptRM creates perturbed datasets with varying noise ratios and utilizes co-teaching to train multiple distinct models for collaborative learning. By analyzing the consensus decisions of these models, it effectively identifies reliable negative samples from a noisy dataset. Experimental results demonstrate that ConceptRM significantly enhances the interception of false alerts with minimal annotation cost, outperforming several state-of-the-art LLM baselines by up to 53.31% on in-domain datasets and 41.67% on out-of-domain datasets.
55. 【2602.20164】Benchmarking Distilled Language Models: Performance and Efficiency in Resource-Constrained Settings
链接:https://arxiv.org/abs/2602.20164
作者:Sachin Gopal Wani,Eric Page,Ajay Dholakia,David Ellison
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Knowledge distillation offers, small language models, developing powerful, small language, suitable for resource-constrained
备注: 16 pages, 5 figures, accepted at the the 2025 TPCTC Conference
点击查看摘要
Abstract:Knowledge distillation offers a transformative pathway to developing powerful, yet efficient, small language models (SLMs) suitable for resource-constrained environments. In this paper, we benchmark the performance and computational cost of distilled models against their vanilla and proprietary counterparts, providing a quantitative analysis of their efficiency. Our results demonstrate that distillation creates a superior performance-tocompute curve. We find that creating a distilled 8B model is over 2,000 times more compute-efficient than training its vanilla counterpart, while achieving reasoning capabilities on par with, or even exceeding, standard models ten times its size. These findings validate distillation not just as a compression technique, but as a primary strategy for building state-of-the-art, accessible AI
56. 【2602.20163】Graph Modelling Analysis of Speech-Gesture Interaction for Aphasia Severity Estimation
链接:https://arxiv.org/abs/2602.20163
作者:Navya Martin Kollapally,Christa Akers,Renjith Nelson Joseph
类目:ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
关键词:acquired language disorder, language disorder caused, disorder caused, caused by injury, Aphasia
备注: IJCAI
点击查看摘要
Abstract:Aphasia is an acquired language disorder caused by injury to the regions of the brain that are responsible for language. Aphasia may impair the use and comprehension of written and spoken language. The Western Aphasia Battery-Revised (WAB-R) is an assessment tool administered by speech-language pathologists (SLPs) to evaluate the aphasia type and severity. Because the WAB-R measures isolated linguistic skills, there has been growing interest in the assessment of discourse production as a more holistic representation of everyday language abilities. Recent advancements in speech analysis focus on automated estimation of aphasia severity from spontaneous speech, relying mostly in isolated linguistic or acoustical features. In this work, we propose a graph neural network-based framework for estimating aphasia severity. We represented each participant's discourse as a directed multi-modal graph, where nodes represent lexical items and gestures and edges encode word-word, gesture-word, and word-gesture transitions. GraphSAGE is employed to learn participant-level embeddings, thus integrating information from immediate neighbors and overall graph structure. Our results suggest that aphasia severity is not encoded in isolated lexical distribution, but rather emerges from structured interactions between speech and gesture. The proposed architecture offers a reliable automated aphasia assessment, with possible uses in bedside screening and telehealth-based monitoring.
57. 【2602.20162】alking to Yourself: Defying Forgetting in Large Language Models
链接:https://arxiv.org/abs/2602.20162
作者:Yutao Sun,Mingshuai Chen,Tiancheng Zhao,Phillip Miao,Zilun Zhang,Haozhan Shen,Ruizhe Zhu,Jianwei Yin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:fine-tuning large language, reasoning abilities, Catastrophic forgetting remains, remains a major, major challenge
备注:
点击查看摘要
Abstract:Catastrophic forgetting remains a major challenge when fine-tuning large language models (LLMs) on narrow, task-specific data, often degrading their general knowledge and reasoning abilities. We propose SA-SFT, a lightweight self-augmentation routine in which an LLM generates self-dialogues prior to fine-tuning, and the resulting self-authored data are mixed with task data without modifying optimization or training schedules. Despite requiring no external data or additional tuning, SA-SFT consistently mitigates catastrophic forgetting while improving in-domain performance. Across 50 evaluation scenarios, it maintains performance comparable to the original model and achieves the best results in 40 cases, outperforming common baselines such as layer freezing and external data mixing. Guided by these empirical findings, we further present a theoretical analysis suggesting that forgetting can partly stem from style-induced parameter drift, and that self-alignment through self-generated data provides an effective means to counteract this effect. Overall, our results indicate that self-augmentation offers a simple and effective mechanism for robust LLM adaptation without incurring catastrophic forgetting.
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2602.20162 [cs.CL]
(or
arXiv:2602.20162v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.20162
Focus to learn more
arXiv-issued DOI via DataCite</p>
58. 【2601.12815】Multimodal Multi-Agent Empowered Legal Judgment Prediction
链接:https://arxiv.org/abs/2601.12815
作者:Zhaolu Kang,Junhao Gong,Qingxi Chen,Hao Zhang,Jiaxin Liu,Rong Fu,Zhiyuan Feng,Yuan Wang,Simon Fong,Kaiyue Zhou
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
关键词:Legal Judgment Prediction, Judgment Prediction, legal cases based, aims to predict, factual descriptions
备注: Accepted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
点击查看摘要
Abstract:Legal Judgment Prediction (LJP) aims to predict the outcomes of legal cases based on factual descriptions, serving as a fundamental task to advance the development of legal systems. Traditional methods often rely on statistical analyses or role-based simulations but face challenges with multiple allegations, diverse evidence, and lack adaptability. In this paper, we introduce JurisMMA, a novel framework for LJP that effectively decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation. Experiments on JurisMM and the benchmark LawBench validate our framework's effectiveness. These results indicate that our framework is effective not only for LJP but also for a broader range of legal applications, offering new perspectives for the development of future legal methods and datasets.
59. 【2602.20994】Multimodal MRI Report Findings Supervised Brain Lesion Segmentation with Substructures
链接:https://arxiv.org/abs/2602.20994
作者:Yubin Ge,Yongsong Huang,Xiaofeng Liu
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:dense tumor voxel, tumor voxel labels, learning seeks, seeks to alleviate, voxel labels
备注: IEEE International Symposium on Biomedical Imaging (ISBI) 2026
点击查看摘要
Abstract:Report-supervised (RSuper) learning seeks to alleviate the need for dense tumor voxel labels with constraints derived from radiology reports (e.g., volumes, counts, sizes, locations). In MRI studies of brain tumors, however, we often involve multi-parametric scans and substructures. Here, fine-grained modality/parameter-wise reports are usually provided along with global findings and are correlated with different substructures. Moreover, the reports often describe only the largest lesion and provide qualitative or uncertain cues (``mild,'' ``possible''). Classical RSuper losses (e.g., sum volume consistency) can over-constrain or hallucinate unreported findings under such incompleteness, and are unable to utilize these hierarchical findings or exploit the priors of varied lesion types in a merged dataset. We explicitly parse the global quantitative and modality-wise qualitative findings and introduce a unified, one-sided, uncertainty-aware formulation (MS-RSuper) that: (i) aligns modality-specific qualitative cues (e.g., T1c enhancement, FLAIR edema) with their corresponding substructures using existence and absence losses; (ii) enforces one-sided lower-bounds for partial quantitative cues (e.g., largest lesion size, minimal multiplicity); and (iii) adds extra- vs. intra-axial anatomical priors to respect cohort differences. Certainty tokens scale penalties; missing cues are down-weighted. On 1238 report-labeled BraTS-MET/MEN scans, our MS-RSuper largely outperforms both a sparsely-supervised baseline and a naive RSuper method.
信息检索
1. 【2602.21202】Multi-Vector Index Compression in Any Modality
链接:https://arxiv.org/abs/2602.21202
作者:Hanxiang Qin,Alexander Martin,Rohan Jha,Chunsheng Zuo,Reno Kriz,Benjamin Van Durme
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:study efficient multi-vector, efficient multi-vector retrieval, late interaction, study efficient, efficient multi-vector
备注: 12 pages, 4 figures
点击查看摘要
Abstract:We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: this http URL.
2. 【2602.21143】A Benchmark for Deep Information Synthesis
链接:https://arxiv.org/abs/2602.21143
作者:Debjit Paul,Daniel Murphy,Milan Gritta,Ronald Cardenas,Victor Prokhorov,Lena Sophia Bolliger,Aysim Toker,Roy Miles,Andreea-Maria Oncescu,Jasivan Alex Sivakumar,Philipp Borchert,Ismail Elezi,Meiru Zhang,Ka Yiu Lee,Guchun Zhang,Jun Wang,Gerasimos Lampouras
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Large language model, complex tasks involving, tasks involving tool, code execution, solve complex tasks
备注: Accepted at ICLR 2026
点击查看摘要
Abstract:Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights. DEEPSYNTH contains 120 tasks collected across 7 domains and data sources covering 67 countries. DEEPSYNTH is constructed using a multi-stage data collection pipeline that requires annotators to collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers. When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces, highlighting DEEPSYNTH as a crucial benchmark for guiding future research.
3. 【2602.21103】Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
链接:https://arxiv.org/abs/2602.21103
作者:Sanket Badhe,Deep Shah
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:test-time inference costs, Advanced reasoning typically, substantial test-time inference, reasoning typically requires, incurs prohibitive latency
备注:
点击查看摘要
Abstract:Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model's System Prompt. Evaluated on the StereoSet and Contract-NLI datasets using Gemma-3 4B, PLD improved Macro F1 scores from 57\% to 90.0\% and 67\% to 83\% respectively, enabling this compact model to match frontier performance with negligible latency overhead. These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-volume use cases and edge devices.
4. 【2602.21099】urning Semantics into Topology: LLM-Driven Attribute Augmentation for Collaborative Filtering
链接:https://arxiv.org/abs/2602.21099
作者:Junjie Meng,Ranxu zhang,Wei Wu,Rui Zhang,Chuan Qin,Qi Zhang,Qi Liu,Hui Xiong,Chao Wang
类目:Information Retrieval (cs.IR)
关键词:Large Language Models, Large Language, shown great potential, enhancing recommender systems, reasoning capabilities
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have shown great potential for enhancing recommender systems through their extensive world knowledge and reasoning capabilities. However, effectively translating these semantic signals into traditional collaborative embeddings remains an open challenge. Existing approaches typically fall into two extremes: direct inference methods are computationally prohibitive for large-scale retrieval, while embedding-based methods primarily focus on unilateral feature augmentation rather than holistic collaborative signal enhancement. To bridge this gap, we propose Topology-Augmented Graph Collaborative Filtering (TAGCF), a novel framework that transforms semantic knowledge into topological connectivity. Unlike existing approaches that depend on textual features or direct interaction synthesis, TAGCF employs LLMs to infer interaction intents and underlying causal relationships from user-item pairs, representing these insights as intermediate attribute nodes within an enriched User-Attribute-Item (U-A-I) graph. Furthermore, to effectively model the heterogeneous relations in this augmented structure, we propose Adaptive Relation-weighted Graph Convolution (ARGC), which employs relation-specific prediction networks to dynamically estimate the importance of each relation type. Extensive experiments across multiple benchmark datasets and CF backbones demonstrate consistent improvements, with comprehensive evaluations including cold-start scenarios validating the effectiveness and robustness of our framework. All code will be made publicly available. For anonymous review, our code is available at the following anonymous link: this https URL.
5. 【2602.21052】Position-Aware Sequential Attention for Accurate Next Item Recommendations
链接:https://arxiv.org/abs/2602.21052
作者:Timur Nabiev,Evgeny Frolov
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:inject positional information, additive positional embeddings, models usually rely, Sequential self-attention models, positional
备注:
点击查看摘要
Abstract:Sequential self-attention models usually rely on additive positional embeddings, which inject positional information into item representations at the input. In the absence of positional signals, the attention block is permutation-equivariant over sequence positions and thus has no intrinsic notion of temporal order beyond causal masking. We argue that additive positional embeddings make the attention mechanism only superficially sensitive to sequence order: positional information is entangled with item embedding semantics, propagates weakly in deep architectures, and limits the ability to capture rich sequential patterns. To address these limitations, we introduce a kernelized self-attention mechanism, where a learnable positional kernel operates purely in the position space, disentangled from semantic similarity, and directly modulates attention weights. When applied per attention block, this kernel enables adaptive multi-scale sequential modeling. Experiments on standard next-item prediction benchmarks show that our positional kernel attention consistently improves over strong competing baselines.
6. 【2602.21009】HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders
链接:https://arxiv.org/abs/2602.21009
作者:Kun Yuan,Junyu Bi,Daixuan Cheng,Changfa Wu,Shuwen Xiao,Binbin Cao,Jian Wu,Yuning Jiang
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:Modern recommender systems, recommender systems leverage, systems leverage ultra-long, leverage ultra-long user, Modern recommender
备注:
点击查看摘要
Abstract:Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints. While summarizing history via interest centers offers a practical alternative, existing methods struggle to (1) identify user-specific centers at appropriate granularity and (2) accurately assign behaviors, leading to quantization errors and loss of long-tail preferences. To alleviate these issues, we propose Hierarchical Sparse Activation Compression (HiSAC), an efficient framework for personalized sequence modeling. HiSAC encodes interactions into multi-level semantic IDs and constructs a global hierarchical codebook. A hierarchical voting mechanism sparsely activates personalized interest-agents as fine-grained preference centers. Guided by these agents, Soft-Routing Attention aggregates historical signals in semantic space, weighting by similarity to minimize quantization error and retain long-tail behaviors. Deployed on Taobao's "Guess What You Like" homepage, HiSAC achieves significant compression and cost reduction, with online A/B tests showing a consistent 1.65% CTR uplift -- demonstrating its scalability and real-world effectiveness.
7. 【2602.20995】Generative Pseudo-Labeling for Pre-Ranking with LLMs
链接:https://arxiv.org/abs/2602.20995
作者:Junyu Bi,Xinting Niu,Daixuan Cheng,Kun Yuan,Tao Wang,Binbin Cao,Jian Wu,Yuning Jiang
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:efficiently scoring thousands, tasked with efficiently, downstream ranking, critical stage, stage in industrial
备注:
点击查看摘要
Abstract:Pre-ranking is a critical stage in industrial recommendation systems, tasked with efficiently scoring thousands of recalled items for downstream ranking. A key challenge is the train-serving discrepancy: pre-ranking models are trained only on exposed interactions, yet must score all recalled candidates -- including unexposed items -- during online serving. This mismatch not only induces severe sample selection bias but also degrades generalization, especially for long-tail content. Existing debiasing approaches typically rely on heuristics (e.g., negative sampling) or distillation from biased rankers, which either mislabel plausible unexposed items as negatives or propagate exposure bias into pseudo-labels. In this work, we propose Generative Pseudo-Labeling (GPL), a framework that leverages large language models (LLMs) to generate unbiased, content-aware pseudo-labels for unexposed items, explicitly aligning the training distribution with the online serving space. By offline generating user-specific interest anchors and matching them with candidates in a frozen semantic space, GPL provides high-quality supervision without adding online latency. Deployed in a large-scale production system, GPL improves click-through rate by 3.07%, while significantly enhancing recommendation diversity and long-tail item discovery.
8. 【2602.20986】Naver Labs Europe @ WSDM CUP | Multilingual Retrieval
链接:https://arxiv.org/abs/2602.20986
作者:Thibault Formal,Maxime Louis,Hervé Déjean,Stéphane Clinchant
类目:Information Retrieval (cs.IR)
关键词:English queries, WSDM Cup, presents our participation, shared task, multilingual document retrieval
备注: Report paper of our submission to the WSDM Cup 2026
点击查看摘要
Abstract:This report presents our participation to the WSDM Cup 2026 shared task on multilingual document retrieval from English queries. The task provides a challenging benchmark for cross-lingual generalization. It also provides a natural testbed for evaluating SPLARE, our recently proposed learned sparse retrieval model, which produces generalizable sparse latent representations and is particularly well suited to multilingual retrieval settings. We evaluate five progressively enhanced runs, starting from a SPLARE-7B model and incorporating lightweight improvements, including reranking with Qwen3-Reranker-4B and simple score fusion strategies. Our results demonstrate the strength of SPLARE compared to state-of-the-art dense baselines such as Qwen3-8B-Embed. More broadly, our submission highlights the continued relevance and competitiveness of learned sparse retrieval models beyond English-centric scenarios.
Comments:
Report paper of our submission to the WSDM Cup 2026
Subjects:
Information Retrieval (cs.IR)
Cite as:
arXiv:2602.20986 [cs.IR]
(or
arXiv:2602.20986v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2602.20986
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
9. 【2602.20877】E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications
链接:https://arxiv.org/abs/2602.20877
作者:Jiwoo Kang,Yeon-Chang Lee
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:Multimodal recommender systems, enhance collaborative filtering, leveraging item-side modalities, task-specific objectives limits, Multimodal Knowledge Graph
备注:
点击查看摘要
Abstract:Multimodal recommender systems (MMRSs) enhance collaborative filtering by leveraging item-side modalities, but their reliance on a fixed set of modalities and task-specific objectives limits both modality extensibility and task generalization. We propose E-MMKGR, a framework that constructs an e-commerce-specific Multimodal Knowledge Graph E-MMKG and learns unified item representations through GNN-based propagation and KG-oriented optimization. These representations provide a shared semantic foundation applicable to diverse tasks. Experiments on real-world Amazon datasets show improvements of up to 10.18% in Recall@10 for recommendation and up to 21.72% over vector-based retrieval for product search, demonstrating the effectiveness and extensibility of our approach.
10. 【2602.20800】Mitigating Preference Leakage via Strict Estimator Separation for Normative Generative Ranking
链接:https://arxiv.org/abs/2602.20800
作者:Dalia Nahhas,Xiaohao Cai,Imran Razzak,Shoaib Jameel
类目:Information Retrieval (cs.IR)
关键词:Generative Information Retrieval, Information Retrieval, Generative Information, selection of candidates, bottleneck has shifted
备注:
点击查看摘要
Abstract:In Generative Information Retrieval (GenIR), the bottleneck has shifted from generation to the selection of candidates, particularly for normative criteria such as cultural relevance. Current LLM-as-a-Judge evaluations often suffer from circularity and preference leakage, where overlapping supervision and evaluation models inflate performance. We address this by formalising cultural relevance as a within-query ranking task and introducing a leakage-free two-judge framework that strictly separates supervision (Judge B) from evaluation (Judge A). On a new benchmark of 33,052 (NGR-33k) culturally grounded stories, we find that while classical baselines yield only modest gains, a dense bi-encoder distilled from a Judge-B-supervised Cross-Encoder is highly effective. Although the Cross-Encoder provides a strong supervision signal for distillation, the distilled BGE-M3 model substantially outperforms it under leakage-free Judge~A evaluation. We validate our framework on the human-curated Moral Stories dataset, showing strong alignment with human norms. Our results demonstrate that rigorous evaluator separation is a prerequisite for credible GenIR evaluation, proving that subtle cultural preferences can be distilled into efficient rankers without leakage.
11. 【2602.20735】RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition
链接:https://arxiv.org/abs/2602.20735
作者:Kun Ran,Marwah Alaofi,Danula Hettiachchi,Chenglong Ma,Khoi Nguyen Dinh Anh,Khoi Vo Nguyen,Sachin Pathiyan Cherumanal,Lida Rashidi,Falk Scholer,Damiano Spina,Shuoqi Sun,Oleg Zendel
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Toggle, Toggle Hugging Face, Code Toggle Papers, Bibliographic Explorer Toggle, Explorer Toggle Bibliographic
备注: MMU-RAG NeurIPS 2025 winning system
点击查看摘要
Abstract:This paper presents the award-winning RMIT-ADM+S system for the Text-to-Text track of the NeurIPS~2025 MMU-RAG Competition. We introduce Routing-to-RAG (R2RAG), a research-focused retrieval-augmented generation (RAG) architecture composed of lightweight components that dynamically adapt the retrieval strategy based on inferred query complexity and evidence sufficiency. The system uses smaller LLMs, enabling operation on a single consumer-grade GPU while supporting complex research tasks. It builds on the G-RAG system, winner of the ACM~SIGIR~2025 LiveRAG Challenge, and extends it with modules informed by qualitative review of outputs. R2RAG won the Best Dynamic Evaluation award in the Open Source category, demonstrating high effectiveness with careful design and efficient use of resources.
Comments:
MMU-RAG NeurIPS 2025 winning system
Subjects:
Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:
arXiv:2602.20735 [cs.IR]
(or
arXiv:2602.20735v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2602.20735
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Oleg Zendel [view email] [v1]
Tue, 24 Feb 2026 09:58:25 UTC (93 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition, by Kun Ran and 11 other authorsView PDFHTML (experimental)TeX Source
view license
Current browse context: cs.IR
prev
|
next
new
|
recent
| 2026-02
Change to browse by:
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked=“checked”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
Links to Code Toggle
Papers with Code (What is Papers with Code?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
12. 【2602.20704】IntRR: A Framework for Integrating SID Redistribution and Length Reduction
链接:https://arxiv.org/abs/2602.20704
作者:Zesheng Wang,Longfei Xu,Weidong Deng,Huimin Yan,Kaikui Liu,Xiangxiang Chu
类目:Information Retrieval (cs.IR)
关键词:traditional cascade ranking, cascade ranking system, generation task, discrete Semantic IDs, transformative paradigm
备注:
点击查看摘要
Abstract:Generative Recommendation (GR) has emerged as a transformative paradigm that reformulates the traditional cascade ranking system into a sequence-to-item generation task, facilitated by the use of discrete Semantic IDs (SIDs). However, current SIDs are suboptimal as the indexing objectives (Stage 1) are misaligned with the actual recommendation goals (Stage 2). Since these identifiers remain static (Stage 2), the backbone model lacks the flexibility to adapt them to the evolving complexities of user interactions. Furthermore, the prevailing strategy of flattening hierarchical SIDs into token sequences leads to sequence length inflation, resulting in prohibitive computational overhead and inference latency. To address these challenges, we propose IntRR, a novel framework that integrates objective-aligned SID Redistribution and structural Length Reduction. By leveraging item-specific Unique IDs (UIDs) as collaborative anchors, this approach dynamically redistributes semantic weights across hierarchical codebook layers. Concurrently, IntRR handles the SID hierarchy recursively, eliminating the need to flatten sequences. This ensures a fixed cost of one token per item. Extensive experiments on benchmark datasets demonstrate that IntRR yields substantial improvements over representative generative baselines, achieving superior performance in both recommendation accuracy and efficiency.
13. 【2602.20676】PRECTR-V2:Unified Relevance-CTR Framework with Cross-User Preference Mining, Exposure Bias Correction, and LLM-Distilled Encoder Optimization
链接:https://arxiv.org/abs/2602.20676
作者:Shuzhi Cao,Rong Chen,Ailong He,Shuguang Han,Jufeng Chen
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:enhancing platform revenue, discovering users' interests, effectively coordinating, click-through rate, prediction is crucial
备注: arXiv admin note: text overlap with [arXiv:2503.18395](https://arxiv.org/abs/2503.18395)
点击查看摘要
Abstract:In search systems, effectively coordinating the two core objectives of search relevance matching and click-through rate (CTR) prediction is crucial for discovering users' interests and enhancing platform revenue. In our prior work PRECTR, we proposed a unified framework to integrate these two subtasks,thereby eliminating their inconsistency and leading to mutual this http URL, our previous work still faces three main challenges. First, low-active users and new users have limited search behavioral data, making it difficult to achieve effective personalized relevance preference modeling. Second, training data for ranking models predominantly come from high-relevance exposures, creating a distribution mismatch with the broader candidate space in coarse-ranking, leading to generalization bias. Third, due to the latency constraint, the original model employs an Emb+MLP architecture with a frozen BERT encoder, which prevents joint optimization and creates misalignment between representation learning and CTR fine-tuning. To solve these issues, we further reinforce our method and propose PRECTR-V2. Specifically, we mitigate the low-activity users' sparse behavior problem by mining global relevance preferences under the specific query, which facilitates effective personalized relevance modeling for cold-start scenarios. Subsequently, we construct hard negative samples through embedding noise injection and relevance label reconstruction, and optimize their relative ranking against positive samples via pairwise loss, thereby correcting exposure bias. Finally, we pretrain a lightweight transformer-based encoder via knowledge distillation from LLM and SFT on the text relevance classification task. This encoder replaces the frozen BERT module, enabling better adaptation to CTR fine-tuning and advancing beyond the traditional Emb+MLP paradigm.
14. 【2602.20558】From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production
链接:https://arxiv.org/abs/2602.20558
作者:Yucheng Shi,Ying Li,Yu Wang,Yesu Feng,Arjun Rao,Rein Houthooft,Shradha Sehgal,Jin Wang,Hao Zhen,Ninghao Liu,Linas Baltrunas
类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Large language models, natural language inputs, challenge remains underexplored, key challenge remains, Large language
备注: Work in progress
点击查看摘要
Abstract:Large language models (LLMs) are promising backbones for generative recommender systems, yet a key challenge remains underexplored: verbalization, i.e., converting structured user interaction logs into effective natural language inputs. Existing methods rely on rigid templates that simply concatenate fields, yielding suboptimal representations for recommendation. We propose a data-centric framework that learns verbalization for LLM-based recommendation. Using reinforcement learning, a verbalization agent transforms raw interaction histories into optimized textual contexts, with recommendation accuracy as the training signal. This agent learns to filter noise, incorporate relevant metadata, and reorganize information to improve downstream predictions. Experiments on a large-scale industrial streaming dataset show that learned verbalization delivers up to 93% relative improvement in discovery item recommendation accuracy over template-based baselines. Further analysis reveals emergent strategies such as user interest summarization, noise removal, and syntax normalization, offering insights into effective context construction for LLM-based recommender systems.
15. 【2602.20507】Indaleko: The Unified Personal Index
链接:https://arxiv.org/abs/2602.20507
作者:William Anthony Mason
类目:Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
关键词:Personal information retrieval, Unified Personal Index, Personal information, Indaleko, Personal Index
备注: PhD dissertation, University of British Columbia, August 2025. 287 pages
点击查看摘要
Abstract:Personal information retrieval fails when systems ignore how human memory works. While existing platforms force keyword searches across isolated silos, humans naturally recall through episodic cues like when, where, and in what context information was encountered. This dissertation presents the Unified Personal Index (UPI), a memory-aligned architecture that bridges this fundamental gap. The Indaleko prototype demonstrates the UPI's feasibility on a 31-million file dataset spanning 160TB across eight storage platforms. By integrating temporal, spatial, and activity metadata into a unified graph database, Indaleko enables natural language queries like "photos near the conference venue last spring" that existing systems cannot process. The implementation achieves sub-second query responses through memory anchor indexing, eliminates cross-platform search fragmentation, and maintains perfect precision for well-specified memory patterns. Evaluation against commercial systems (Google Drive, OneDrive, Dropbox, Windows Search) reveals that all fail on memory-based queries, returning overwhelming result sets without contextual filtering. In contrast, Indaleko successfully processes multi-dimensional queries combining time, location, and activity patterns. The extensible architecture supports rapid integration of new data sources (10 minutes to 10 hours per provider) while preserving privacy through UUID-based semantic decoupling. The UPI's architectural synthesis bridges cognitive theory with distributed systems design, as demonstrated through the Indaleko prototype and rigorous evaluation. This work transforms personal information retrieval from keyword matching to memory-aligned finding, providing immediate benefits for existing data while establishing foundations for future context-aware systems.
计算机视觉
1. 【2602.21204】st-Time Training with KV Binding Is Secretly Linear Attention
链接:https://arxiv.org/abs/2602.21204
作者:Junchen Liu,Sven Elflein,Or Litany,Zan Gojcic,Ruilong Li
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:sequence modeling layer, test time, binding as sequence, sequence modeling, modeling layer
备注: Webpage: [this https URL](https://research.nvidia.com/labs/sil/projects/tttla/)
点击查看摘要
Abstract:Test-time training (TTT) with KV binding as sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a form of learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity.
2. 【2602.21203】Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics
链接:https://arxiv.org/abs/2602.21203
作者:Abdulaziz Almuzairee,Henrik I. Christensen
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Visual reinforcement learning, on-policy methods parallelize, robotics but expensive, sample-efficient yet slow, waste samples
备注: For website and code, see [this https URL](https://aalmuzairee.github.io/squint)
点击查看摘要
Abstract:Visual reinforcement learning is appealing for robotics but expensive -- off-policy methods are sample-efficient yet slow; on-policy methods parallelize well but waste samples. Recent work has shown that off-policy methods can train faster than on-policy methods in wall-clock time for state-based control. Extending this to vision remains challenging, where high-dimensional input images complicate training dynamics and introduce substantial storage and encoding overhead. To address these challenges, we introduce Squint, a visual Soft Actor Critic method that achieves faster wall-clock training than prior visual off-policy and on-policy methods. Squint achieves this via parallel simulation, a distributional critic, resolution squinting, layer normalization, a tuned update-to-data ratio, and an optimized implementation. We evaluate on the SO-101 Task Set, a new suite of eight manipulation tasks in ManiSkill3 with heavy domain randomization, and demonstrate sim-to-real transfer to a real SO-101 robot. We train policies for 15 minutes on a single RTX 3090 GPU, with most tasks converging in under 6 minutes.
3. 【2602.21202】Multi-Vector Index Compression in Any Modality
链接:https://arxiv.org/abs/2602.21202
作者:Hanxiang Qin,Alexander Martin,Rohan Jha,Chunsheng Zuo,Reno Kriz,Benjamin Van Durme
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:study efficient multi-vector, efficient multi-vector retrieval, late interaction, study efficient, efficient multi-vector
备注: 12 pages, 4 figures
点击查看摘要
Abstract:We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: this http URL.
4. 【2602.21198】Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
链接:https://arxiv.org/abs/2602.21198
作者:Yining Hong,Huang Huang,Manling Li,Li Fei-Fei,Jiajun Wu,Yejin Choi
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Embodied LLMs endow, high-level task reasoning, LLMs endow robots, Embodied LLMs, Reflective Test-Time Planning
备注:
点击查看摘要
Abstract:Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \textit{reflection-in-action}, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and \textit{reflection-on-action}, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.
5. 【2602.21195】Region of Interest Segmentation and Morphological Analysis for Membranes in Cryo-Electron Tomography
链接:https://arxiv.org/abs/2602.21195
作者:Xingyi Cheng,Julien Maufront,Aurélie Di Cicco,Daniël M. Pelt,Manuela Dezi,Daniel Lévy
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Cryo-electron tomography, enables high resolution, high resolution, three-dimensional reconstruction, reconstruction of biological
备注:
点击查看摘要
Abstract:Cryo-electron tomography (cryo-ET) enables high resolution, three-dimensional reconstruction of biological structures, including membranes and membrane proteins. Identification of regions of interest (ROIs) is central to scientific imaging, as it enables isolation and quantitative analysis of specific structural features within complex datasets. In practice, however, ROIs are typically derived indirectly through full structure segmentation followed by post hoc analysis. This limitation is especially apparent for continuous and geometrically complex structures such as membranes, which are segmented as single entities. Here, we developed TomoROIS-SurfORA, a two step framework for direct, shape-agnostic ROI segmentation and morphological surface analysis. TomoROIS performs deep learning-based ROI segmentation and can be trained from scratch using small annotated datasets, enabling practical application across diverse imaging data. SurfORA processes segmented structures as point clouds and surface meshes to extract quantitative morphological features, including inter-membrane distances, curvature, and surface roughness. It supports both closed and open surfaces, with specific considerations for open surfaces, which are common in cryo-ET due to the missing wedge effect. We demonstrate both tools using in vitro reconstituted membrane systems containing deformable vesicles with complex geometries, enabling automatic quantitative analysis of membrane contact sites and remodeling events such as invagination. While demonstrated here on cryo-ET membrane data, the combined approach is applicable to ROI detection and surface analysis in broader scientific imaging contexts.
6. 【2602.21188】Human Video Generation from a Single Image with 3D Pose and View Control
链接:https://arxiv.org/abs/2602.21188
作者:Tiantian Wang,Chun-Han Yao,Tao Hu,Mallikarjun Byrasandra Ramalinga Reddy,Ming-Hsuan Yang,Varun Jampani
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:made significant progress, visual generation capabilities, powerful visual generation, human video generation, Recent diffusion methods
备注:
点击查看摘要
Abstract:Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment to maintain smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.
7. 【2602.21186】Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
链接:https://arxiv.org/abs/2602.21186
作者:Haoyi Jiang,Liu Liu,Xinjie Wang,Yonghao He,Wei Sui,Zhizhong Su,Wenyu Liu,Xinggang Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:exhibit exceptional, remains superficial, ability to comprehend, comprehend and reason, Vision-Language Models
备注:
点击查看摘要
Abstract:While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space--a cornerstone of spatial intelligence--remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the proposed Predictive Spatial Field Modeling (PSFM) paradigm, where Spa3R learns to synthesize feature fields for arbitrary unseen views conditioned on a compact latent representation, thereby internalizing a holistic and coherent understanding of the underlying 3D scene. We further integrate the pre-trained Spa3R Encoder into existing VLMs via a lightweight adapter to form Spa3-VLM, effectively grounding language reasoning in a global spatial context. Experiments on the challenging VSI-Bench demonstrate that Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods. These results highlight PSFM as a scalable path toward advancing spatial intelligence. Code is available at this https URL.
8. 【2602.21179】Mask-HybridGNet: Graph-based segmentation with emergent anatomical correspondence from pixel-level supervision
链接:https://arxiv.org/abs/2602.21179
作者:Nicolás Gaggion,Maria J. Ledesma-Carbayo,Stergios Christodoulidis,Maria Vakalopoulou,Enzo Ferrante
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Graph-based medical image, providing fixed-topology landmarks, medical image segmentation, image segmentation represents, inherent population-level correspondences
备注:
点击查看摘要
Abstract:Graph-based medical image segmentation represents anatomical structures using boundary graphs, providing fixed-topology landmarks and inherent population-level correspondences. However, their clinical adoption has been hindered by a major requirement: training datasets with manually annotated landmarks that maintain point-to-point correspondences across patients rarely exist in practice. We introduce Mask-HybridGNet, a framework that trains graph-based models directly using standard pixel-wise masks, eliminating the need for manual landmark annotations. Our approach aligns variable-length ground truth boundaries with fixed-length landmark predictions by combining Chamfer distance supervision and edge-based regularization to ensure local smoothness and regular landmark distribution, further refined via differentiable rasterization. A significant emergent property of this framework is that predicted landmark positions become consistently associated with specific anatomical locations across patients without explicit correspondence supervision. This implicit atlas learning enables temporal tracking, cross-slice reconstruction, and morphological population analyses. Beyond direct segmentation, Mask-HybridGNet can extract correspondences from existing segmentation masks, allowing it to generate stable anatomical atlases from any high-quality pixel-based model. Experiments across chest radiography, cardiac ultrasound, cardiac MRI, and fetal imaging demonstrate that our model achieves competitive results against state-of-the-art pixel-based methods, while ensuring anatomical plausibility by enforcing boundary connectivity through a fixed graph adjacency matrix. This framework leverages the vast availability of standard segmentation masks to build structured models that maintain topological integrity and provide implicit correspondences.
9. 【2602.21178】XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence
链接:https://arxiv.org/abs/2602.21178
作者:Sepehr Salem Ghahfarokhi,M. Moein Esfahani,Raj Sunderraman,Vince Calhoun,Mohammed Alser
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:significantly advanced automated, clinical adoption remains, adoption remains limited, advanced automated brain, Deep learning
备注: Accepted in ICCABS 2026: The 14th International Conference on Computational Advances in Bio and Medical Sciences
点击查看摘要
Abstract:Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints. Conventional models often act as opaque ''black boxes'' and fail to quantify the complex, irregular tumor boundaries that characterize malignant growth. To address these challenges, we present XMorph, an explainable and computationally efficient framework for fine-grained classification of three prominent brain tumor types: glioma, meningioma, and pituitary tumors. We propose an Information-Weighted Boundary Normalization (IWBN) mechanism that emphasizes diagnostically relevant boundary regions alongside nonlinear chaotic and clinically validated features, enabling a richer morphological representation of tumor growth. A dual-channel explainable AI module combines GradCAM++ visual cues with LLM-generated textual rationales, translating model reasoning into clinically interpretable insights. The proposed framework achieves a classification accuracy of 96.0%, demonstrating that explainability and high performance can co-exist in AI-based medical imaging systems. The source code and materials for XMorph are all publicly available at: this https URL.
10. 【2602.21175】Seeing Through Words: Controlling Visual Retrieval Quality with Language Models
链接:https://arxiv.org/abs/2602.21175
作者:Jianglin Lu,Simon Jenni,Kushal Kafle,Jing Shi,Handong Zhao,Yun Fu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fundamental task, real-world scenarios, queries, vision-language learning, quality
备注:
点击查看摘要
Abstract:Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semantically meaningful but also quality-aware. The resulting system provides three key advantages: 1) flexibility, it is compatible with any pretrained vision-language model (VLMs) without modification; 2) transparency, enriched queries are explicitly interpretable by users; and 3) controllability, enabling retrieval results to be steered toward user-preferred quality levels. Extensive experiments demonstrate that our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries. Our code is available at this https URL.
11. 【2602.21172】NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
链接:https://arxiv.org/abs/2602.21172
作者:Ishaan Rawal,Shubh Gupta,Yihan Hu,Wei Zhan
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:replacing modular pipelines, models are advancing, pipelines with unified, advancing autonomous driving, driving by replacing
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with \modelname (\textbf{No} \textbf{R}easoning for \textbf{D}riving). Compared to existing VLAs, \modelname achieves competitive performance while being fine-tuned on $$60\% of the data and no reasoning annotations, resulting in 3$\times$ fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. \modelname overcomes this by incorporating Dr.~GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, \modelname achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems.
12. 【2602.21153】SPRITETOMESH: Automatic Mesh Generation for 2D Skeletal Animation Using Learned Segmentation and Contour-Aware Vertex Placement
链接:https://arxiv.org/abs/2602.21153
作者:Bastien Gimbert
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:skeletal animation frameworks, triangle meshes compatible, present SPRITETOMESH, fully automatic pipeline, fully automatic
备注: 11 pages, 17 figures. Code available at [this https URL](https://github.com/BastienGimbert/SpriteToMesh)
点击查看摘要
Abstract:We present SPRITETOMESH, a fully automatic pipeline for converting 2D game sprite images into triangle meshes compatible with skeletal animation frameworks such as Spine2D. Creating animation-ready meshes is traditionally a tedious manual process requiring artists to carefully place vertices along visual boundaries, a task that typically takes 15-60 minutes per sprite. Our method addresses this through a hybrid learned-algorithmic approach. A segmentation network (EfficientNet-B0 encoder with U-Net decoder) trained on over 100,000 sprite-mask pairs from 172 games achieves an IoU of 0.87, providing accurate binary masks from arbitrary input images. From these masks, we extract exterior contour vertices using Douglas-Peucker simplification with adaptive arc subdivision, and interior vertices along visual boundaries detected via bilateral-filtered multi-channel Canny edge detection with contour-following placement. Delaunay triangulation with mask-based centroid filtering produces the final mesh. Through controlled experiments, we demonstrate that direct vertex position prediction via neural network heatmap regression is fundamentally not viable for this task: the heatmap decoder consistently fails to converge (loss plateau at 0.061) while the segmentation decoder trains normally under identical conditions. We attribute this to the inherently artistic nature of vertex placement - the same sprite can be meshed validly in many different ways. This negative result validates our hybrid design: learned segmentation where ground truth is unambiguous, algorithmic placement where domain heuristics are appropriate. The complete pipeline processes a sprite in under 3 seconds, representing a speedup of 300x-1200x over manual creation. We release our trained model to the game development community.
13. 【2602.21142】LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis
链接:https://arxiv.org/abs/2602.21142
作者:Zhifan Jiang,Dong Yang,Vishwesh Nath,Abhijeet Parida,Nishad P. Kulkarni,Ziyue Xu,Daguang Xu,Syed Muhammad Anwar,Holger R. Roth,Marius George Linguraru
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Large vision-language models, Large vision-language, clinical domain, evolved from general-purpose, specialized use cases
备注: Accepted to IEEE International Symposium on Biomedical Imaging (ISBI) 2026
点击查看摘要
Abstract:Large vision-language models (VLMs) have evolved from general-purpose applications to specialized use cases such as in the clinical domain, demonstrating potential for decision support in radiology. One promising application is assisting radiologists in decision-making by the analysis of radiology imaging data such as chest X-rays (CXR) via a visual and natural language question-answering (VQA) interface. When longitudinal imaging is available, radiologists analyze temporal changes, which are essential for accurate diagnosis and prognosis. The manual longitudinal analysis is a time-consuming process, motivating the development of a training framework that can provide prognostic capabilities. We introduce a novel training framework LUMEN, that is optimized for longitudinal CXR interpretation, leveraging multi-image and multi-task instruction fine-tuning to enhance prognostic and diagnostic performance. We conduct experiments on the publicly available MIMIC-CXR and its associated Medical-Diff-VQA datasets. We further formulate and construct a novel instruction-following dataset incorporating longitudinal studies, enabling the development of a prognostic VQA task. Our method demonstrates significant improvements over baseline models in diagnostic VQA tasks, and more importantly, shows promising potential for prognostic capabilities. These results underscore the value of well-designed, instruction-tuned VLMs in enabling more accurate and clinically meaningful radiological interpretation of longitudinal radiological imaging data.
14. 【2602.21141】SynthRender and IRIS: Open-Source Framework and Dataset for Bidirectional Sim-Real Transfer in Industrial Object Perception
链接:https://arxiv.org/abs/2602.21141
作者:Jose Moises Araya-Martinez,Thushar Tom,Adrián Sanchis Reig,Pablo Rey Valiente,Jens Lambrecht,Jörg Krüger
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Object perception, robotic material handling, quality inspection, fundamental for tasks, material handling
备注:
点击查看摘要
Abstract:Object perception is fundamental for tasks such as robotic material handling and quality inspection. However, modern supervised deep-learning perception models require large datasets for robust automation under semi-uncontrolled conditions. The cost of acquiring and annotating such data for proprietary parts is a major barrier for widespread deployment. In this context, we release SynthRender, an open source framework for synthetic image generation with Guided Domain Randomization capabilities. Furthermore, we benchmark recent Reality-to-Simulation techniques for 3D asset creation from 2D images of real parts. Combined with Domain Randomization, these synthetic assets provide low-overhead, transferable data even for parts lacking 3D files. We also introduce IRIS, the Industrial Real-Sim Imagery Set, containing 32 categories with diverse textures, intra-class variation, strong inter-class similarities and about 20,000 labels. Ablations on multiple benchmarks outline guidelines for efficient data generation with SynthRender. Our method surpasses existing approaches, achieving 99.1% mAP@50 on a public robotics dataset, 98.3% mAP@50 on an automotive benchmark, and 95.3% mAP@50 on IRIS.
15. 【2602.21137】UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics
链接:https://arxiv.org/abs/2602.21137
作者:Joseph Raj Vishal,Nagasiri Poluri,Katha Naik,Rutuja Patil,Kashyap Hegde Kota,Krishna Vinod,Prithvi Jai Ramesh,Mohammad Farhadi,Yezhou Yang,Bharatesh Chakravarthi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:urban traffic remains, dynamic urban scenes, introduces Urban Dynamics, Urban Dynamics VideoQA, multi-agent dynamics
备注:
点击查看摘要
Abstract:Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event-driven dynamic blur technique to ensure privacy preservation without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging one question per second. Its taxonomy follows a hierarchical reasoning level, spanning basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic evaluation of both visual grounding and causal reasoning. Comprehensive experiments benchmark 10 SOTA VideoLMs on UDVideoQA and 8 models on a complementary video question generation benchmark. Results reveal a persistent perception-reasoning gap, showing models that excel in abstract inference often fail with fundamental visual grounding. While models like Gemini Pro achieve the highest zero-shot accuracy, fine-tuning the smaller Qwen2.5-VL 7B model on UDVideoQA bridges this gap, achieving performance comparable to proprietary systems. In VideoQGen, Gemini 2.5 Pro, and Qwen3 Max generate the most relevant and complex questions, though all models exhibit limited linguistic diversity, underscoring the need for human-centric evaluation. The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy-aware, and real-world multimodal reasoning. UDVideoQA is available at this https URL.
16. 【2602.21105】BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting
链接:https://arxiv.org/abs/2602.21105
作者:Jiaxing Yu,Dongyang Ren,Hangyu Xu,Zhouyuxiao Yang,Yuanqi Li,Jie Guo,Zhengkang Zhou,Yanwen Guo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:B-rep Gaussian Splatting, trimmed corners, Recovering B-rep representation, explicit boundaries, propose B-rep Gaussian
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:The boundary representation (B-rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering B-rep representation from unstructured data is a challenging and valuable task of computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images. We employ a Gaussian Splatting renderer with learnable features, followed by a specific fitting strategy. To disentangle geometry reconstruction and feature learning, we introduce a two-stage learning framework that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations. Extensive experiments demonstrate the superior performance of our approach to state-of-the-art methods. We will release our code and datasets upon acceptance.
17. 【2602.21101】Event-Aided Sharp Radiance Field Reconstruction for Fast-Flying Drones
链接:https://arxiv.org/abs/2602.21101
作者:Rong Zou,Marco Cannici,Davide Scaramuzza
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:promise rapid inspection, limited battery constraints, aerial robots promise, robots promise rapid, Fast-flying aerial robots
备注:
点击查看摘要
Abstract:Fast-flying aerial robots promise rapid inspection under limited battery constraints, with direct applications in infrastructure inspection, terrain exploration, and search and rescue. However, high speeds lead to severe motion blur in images and induce significant drift and noise in pose estimates, making dense 3D reconstruction with Neural Radiance Fields (NeRFs) particularly challenging due to their high sensitivity to such degradations. In this work, we present a unified framework that leverages asynchronous event streams alongside motion-blurred frames to reconstruct high-fidelity radiance fields from agile drone flights. By embedding event-image fusion into NeRF optimization and jointly refining event-based visual-inertial odometry priors using both event and frame modalities, our method recovers sharp radiance fields and accurate camera trajectories without ground-truth supervision. We validate our approach on both synthetic data and real-world sequences captured by a fast-flying drone. Despite highly dynamic drone flights, where RGB frames are severely degraded by motion blur and pose priors become unreliable, our method reconstructs high-fidelity radiance fields and preserves fine scene details, delivering a performance gain of over 50% on real-world data compared to state-of-the-art methods.
18. 【2602.21100】Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction
链接:https://arxiv.org/abs/2602.21100
作者:Noé Artru,Rukhshanda Hussain,Emeline Got,Alexandre Messier,David B. Lindell,Abdallah Dib
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:face fundamental limitations, existing methods face, head geometry, range of applications, Reconstructing high-fidelity
备注: 14 pages, 8 figures, to be published in proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
点击查看摘要
Abstract:Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms. Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then leverage these predictions as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details. Our approach outperforms state-of-the-art single-image and multi-view methods, achieving high-fidelity reconstruction on par with dense-view photogrammetry while reducing camera requirements and computational cost. The code and model will be released.
19. 【2602.21098】Optimizing Occupancy Sensor Placement in Smart Environments
链接:https://arxiv.org/abs/2602.21098
作者:Hao Lu,Richard J. Radke
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:realizing energy savings, Understanding the locations, commercial built environment, delivering lighting, commercial built
备注:
点击查看摘要
Abstract:Understanding the locations of occupants in a commercial built environment is critical for realizing energy savings by delivering lighting, heating, and cooling only where it is needed. The key to achieving this goal is being able to recognize zone occupancy in real time, without impeding occupants' activities or compromising privacy. While low-resolution, privacy-preserving time-of-flight (ToF) sensor networks have demonstrated good performance in zone counting, the performance depends on careful sensor placement. To address this issue, we propose an automatic sensor placement method that determines optimal sensor layouts for a given number of sensors, and can predict the counting accuracy of such a layout. In particular, given the geometric constraints of an office environment, we simulate a large number of occupant trajectories. We then formulate the sensor placement problem as an integer linear programming (ILP) problem and solve it with the branch and bound method. We demonstrate the effectiveness of the proposed method based on simulations of several different office environments.
20. 【2602.21078】ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning
链接:https://arxiv.org/abs/2602.21078
作者:Duowen Chen,Yan Wang
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Federated Semi-Supervised Learning, Federated Semi-Supervised, Semi-Supervised Learning, leveraging partially-annotated local, aims to collaboratively
备注: CVPR 2026. code: [this https URL](https://github.com/DuowenC/FSSLlib)
点击查看摘要
Abstract:Federated Semi-Supervised Learning (FSSL) aims to collaboratively train a global model across clients by leveraging partially-annotated local data in a privacy-preserving manner. In FSSL, data heterogeneity is a challenging issue, which exists both across clients and within clients. External heterogeneity refers to the data distribution discrepancy across different clients, while internal heterogeneity represents the mismatch between labeled and unlabeled data within clients. Most FSSL methods typically design fixed or dynamic parameter aggregation strategies to collect client knowledge on the server (external) and / or filter out low-confidence unlabeled samples to reduce mistakes in local client (internal). But, the former is hard to precisely fit the ideal global distribution via direct weights, and the latter results in fewer data participation into FL training. To this end, we propose a proxy-guided framework called ProxyFL that focuses on simultaneously mitigating external and internal heterogeneity via a unified proxy. I.e., we consider the learnable weights of classifier as proxy to simulate the category distribution both locally and globally. For external, we explicitly optimize global proxy against outliers instead of direct weights; for internal, we re-include the discarded samples into training by a positive-negative proxy pool to mitigate the impact of potentially-incorrect pseudo-labels. Insight experiments theoretical analysis show our significant performance and convergence in FSSL.
21. 【2602.21064】Motivation is Something You Need
链接:https://arxiv.org/abs/2602.21064
作者:Mehdi Acheli,Walid Gaaloul
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:affective neuroscience, work introduces, paradigm that draws, draws from affective, SEEKING motivational state
备注:
点击查看摘要
Abstract:This work introduces a novel training paradigm that draws from affective neuroscience. Inspired by the interplay of emotions and cognition in the human brain and more specifically the SEEKING motivational state, we design a dual-model framework where a smaller base model is trained continuously, while a larger motivated model is activated intermittently during predefined "motivation conditions". The framework mimics the emotional state of high curiosity and anticipation of reward in which broader brain regions are recruited to enhance cognitive performance. Exploiting scalable architectures where larger models extend smaller ones, our method enables shared weight updates and selective expansion of network capacity during noteworthy training steps. Empirical evaluation on the image classification task demonstrates that, not only does the alternating training scheme efficiently and effectively enhance the base model compared to a traditional scheme, in some cases, the motivational model also surpasses its standalone counterpart despite seeing less data per epoch. This opens the possibility of simultaneously training two models tailored to different deployment constraints with competitive or superior performance while keeping training cost lower than when training the larger model.
22. 【2602.21054】VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
链接:https://arxiv.org/abs/2602.21054
作者:Seongheon Park,Changdae Oh,Hyeong Kyu Choi,Xuefeng Du,Sharon Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, Large Vision-Language, frequently hallucinate, limiting their safe, real-world applications
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model's ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model's output depends on visual evidence. VAUQ introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to visual input, and an unsupervised core-region masking strategy that amplifies the influence of salient regions. Combining predictive entropy with this core-masked IS yields a training-free scoring function that reliably reflects answer correctness. Comprehensive experiments show that VAUQ consistently outperforms existing self-evaluation methods across multiple datasets.
23. 【2602.21053】OCR-Agent: Agentic OCR with Capability and Memory Reflection
链接:https://arxiv.org/abs/2602.21053
作者:Shimin Wen,Zeyu Zhang,Xingdou Bian,Hongjie Zhu,Lulu He,Layi Shama,Daji Ergu,Ying Cai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Vision-Language Models, rectify cognitive biases, demonstrated significant potential, generally lack effective, independently rectify cognitive
备注:
点击查看摘要
Abstract:Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization this http URL, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer this http URL address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally, optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on English and +1.2 on Chinese subsets, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5) - surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training. Code: this https URL.
24. 【2602.21042】OmniOCR: Generalist OCR for Ethnic Minority Languages
链接:https://arxiv.org/abs/2602.21042
作者:Bonan Liu,Zeyu Zhang,Bingbing Meng,Han Wang,Hanshuo Zhang,Chengping Wang,Daji Ergu,Ying Cai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Optical character recognition, Latin and Chinese, Optical character, character recognition, advanced rapidly
备注:
点击查看摘要
Abstract:Optical character recognition (OCR) has advanced rapidly with deep learning and multimodal models, yet most methods focus on well-resourced scripts such as Latin and Chinese. Ethnic minority languages remain underexplored due to complex writing systems, scarce annotations, and diverse historical and modern forms, making generalization in low-resource or zero-shot settings challenging. To address these challenges, we present OmniOCR, a universal framework for ethnic minority scripts. OmniOCR introduces Dynamic Low-Rank Adaptation (Dynamic LoRA) to allocate model capacity across layers and scripts, enabling effective adaptation while preserving knowledge.A sparsity regularization prunes redundant updates, ensuring compact and efficient adaptation without extra inference cost. Evaluations on TibetanMNIST, Shui, ancient Yi, and Dongba show that OmniOCR outperforms zero-shot foundation models and standard post training, achieving state-of-the-art accuracy with superior parameter efficiency, and compared with the state-of-the-art baseline models, it improves accuracy by 39%-66% on these four datasets. Code: this https URL.
25. 【2602.21035】Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning
链接:https://arxiv.org/abs/2602.21035
作者:Junhao Xiao,Zhiyu Wu,Hao Lin,Yi Chen,Yahui Liu,Xiaoran Zhao,Zixu Wang,Zejiang He
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Vision-Language Models, dog images, negatives similarly, struggle to understand, affirmatives and negatives
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.
26. 【2602.21033】MIP Candy: A Modular PyTorch Framework for Medical Image Processing
链接:https://arxiv.org/abs/2602.21033
作者:Tianhao Fu,Yucheng Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
关键词:heterogeneous file formats, demands specialized software, handles high-dimensional volumetric, processing demands specialized, Medical image processing
备注:
点击查看摘要
Abstract:Medical image processing demands specialized software that handles high-dimensional volumetric data, heterogeneous file formats, and domain-specific training procedures. Existing frameworks either provide low-level components that require substantial integration effort or impose rigid, monolithic pipelines that resist modification. We present MIP Candy (MIPCandy), a freely available, PyTorch-based framework designed specifically for medical image processing. MIPCandy provides a complete, modular pipeline spanning data loading, training, inference, and evaluation, allowing researchers to obtain a fully functional process workflow by implementing a single method, $\texttt{build_network}$, while retaining fine-grained control over every component. Central to the design is $\texttt{LayerT}$, a deferred configuration mechanism that enables runtime substitution of convolution, normalization, and activation modules without subclassing. The framework further offers built-in $k$-fold cross-validation, dataset inspection with automatic region-of-interest detection, deep supervision, exponential moving average, multi-frontend experiment tracking (Weights Biases, Notion, MLflow), training state recovery, and validation score prediction via quotient regression. An extensible bundle ecosystem provides pre-built model implementations that follow a consistent trainer--predictor pattern and integrate with the core framework without modification. MIPCandy is open-source under the Apache-2.0 license and requires Python~3.12 or later. Source code and documentation are available at this https URL.
27. 【2602.21015】From Perception to Action: An Interactive Benchmark for Vision Reasoning
链接:https://arxiv.org/abs/2602.21015
作者:Yuhao Wu,Maojia Song,Yihuai Lan,Lei Wang,Zhiqiang Hu,Yao Xiao,Heng Zhou,Weihua Zheng,Dylan Raharja,Soujanya Poria,Roy Ka-Wei Lee
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:embodied agents, essential for real-world, real-world applications, interactive design, Understanding
备注: Work in processing. Website: [this https URL](https://social-ai-studio.github.io/CHAIN/)
点击查看摘要
Abstract:Understanding the physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at this https URL.
28. 【2602.21010】Le-DETR: Revisiting Real-Time Detection Transformer with Efficient Encoder Design
链接:https://arxiv.org/abs/2602.21010
作者:Jiannan Huang,Aditya Kane,Fengzhe Zhou,Yunchao Wei,Humphrey Shi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:textbf, requires high accuracy, Real-time object detection, real-time DETR models, real-time DETR
备注: CVPR Findings
点击查看摘要
Abstract:Real-time object detection is crucial for real-world applications as it requires high accuracy with low latency. While Detection Transformers (DETR) have demonstrated significant performance improvements, current real-time DETR models are challenging to reproduce from scratch due to excessive pre-training overheads on the backbone, constraining research advancements by hindering the exploration of novel backbone architectures. In this paper, we want to show that by using general good design, it is possible to have \textbf{high performance} with \textbf{low pre-training cost}. After a thorough study of the backbone architecture, we propose EfficientNAT at various scales, which incorporates modern efficient convolution and local attention mechanisms. Moreover, we re-design the hybrid encoder with local attention, significantly enhancing both performance and inference speed. Based on these advancements, we present Le-DETR (\textbf{L}ow-cost and \textbf{E}fficient \textbf{DE}tection \textbf{TR}ansformer), which achieves a new \textbf{SOTA} in real-time detection using only ImageNet1K and COCO2017 training datasets, saving about 80\% images in pre-training stage compared with previous methods. We demonstrate that with well-designed, real-time DETR models can achieve strong performance without the need for complex and computationally expensive pretraining. Extensive experiments show that Le-DETR-M/L/X achieves \textbf{52.9/54.3/55.1 mAP} on COCO Val2017 with \textbf{4.45/5.01/6.68 ms} on an RTX4090. It surpasses YOLOv12-L/X by \textbf{+0.6/-0.1 mAP} while achieving similar speed and \textbf{+20\%} speedup. Compared with DEIM-D-FINE, Le-DETR-M achieves \textbf{+0.2 mAP} with slightly faster inference, and surpasses DEIM-D-FINE-L by \textbf{+0.4 mAP} with only \textbf{0.4 ms} additional latency. Code and weights will be open-sourced.
29. 【2602.20999】VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models
链接:https://arxiv.org/abs/2602.20999
作者:Bowen Zheng,Yongli Xiang,Ziming Hong,Zerong Lin,Chaojian Yu,Tongliang Liu,Xinge You
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:implicit control signals, shown emerging visual, emerging visual instruction-following, condition video generation, visual instruction-following capability
备注: Project page: [this https URL](https://Zbwwwwwwww.github.io/VII)
点击查看摘要
Abstract:Image-to-Video (I2V) generation models, which condition video generation on reference images, have shown emerging visual instruction-following capability, allowing certain visual cues in reference images to act as implicit control signals for video generation. However, this capability also introduces a previously overlooked risk: adversaries may exploit visual instructions to inject malicious intent through the image modality. In this work, we uncover this risk by proposing Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that intentionally disguises the malicious intent of unsafe text prompts as benign visual instructions in the safe reference image. Specifically, VII coordinates a Malicious Intent Reprogramming module to distill malicious intent from unsafe text prompts while minimizing their static harmfulness, and a Visual Instruction Grounding module to ground the distilled intent onto a safe input image by rendering visual instructions that preserve semantic consistency with the original unsafe text prompt, thereby inducing harmful content during I2V generation. Empirically, our extensive experiments on four state-of-the-art commercial I2V models (Kling-v2.5-turbo, Gemini Veo-3.1, Seedance-1.5-pro, and PixVerse-V5) demonstrate that VII achieves Attack Success Rates of up to 83.5% while reducing Refusal Rates to near zero, significantly outperforming existing baselines.
30. 【2602.20989】Cycle-Consistent Tuning for Layered Image Decomposition
链接:https://arxiv.org/abs/2602.20989
作者:Zheng Gu,Min Lu,Zhida Sun,Dani Lischinski,Daniel Cohen-O,Hui Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Disentangling visual layers, Disentangling visual, globally coupled interactions, including shading, vision and graphics
备注: Accepted to CVPR 2026. Project page: [this https URL](https://vcc.tech/research/2026/ImgDecom)
点击查看摘要
Abstract:Disentangling visual layers in real-world images is a persistent challenge in vision and graphics, as such layers often involve non-linear and globally coupled interactions, including shading, reflection, and perspective distortion. In this work, we present an in-context image decomposition framework that leverages large diffusion foundation models for layered separation. We focus on the challenging case of logo-object decomposition, where the goal is to disentangle a logo from the surface on which it appears while faithfully preserving both layers. Our method fine-tunes a pretrained diffusion model via lightweight LoRA adaptation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images. This bidirectional supervision substantially enhances robustness in cases where the layers exhibit complex interactions. Furthermore, we introduce a progressive self-improving process, which iteratively augments the training set with high-quality model-generated examples to refine performance. Extensive experiments demonstrate that our approach achieves accurate and coherent decompositions and also generalizes effectively across other decomposition types, suggesting its potential as a unified framework for layered image decomposition.
31. 【2602.20985】EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer
链接:https://arxiv.org/abs/2602.20985
作者:Munish Monga,Vishal Chudasama,Pankaj Wasnik,C.V. Jawahar
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Real-world object detection, accessing prior data, Real-world object, World Object Detection, Evolving World Object
备注: Accepted at CVPR 2026
点击查看摘要
Abstract:Real-world object detection must operate in evolving environments where new classes emerge, domains shift, and unseen objects must be identified as "unknown": all without accessing prior data. We introduce Evolving World Object Detection (EWOD), a paradigm coupling incremental learning, domain adaptation, and unknown detection under exemplar-free constraints. To tackle EWOD, we propose EW-DETR framework that augments DETR-based detectors with three synergistic modules: Incremental LoRA Adapters for exemplar-free incremental learning under evolving domains; a Query-Norm Objectness Adapter that decouples objectness-aware features from DETR decoder queries; and Entropy-Aware Unknown Mixing for calibrated unknown detection. This framework generalises across DETR-based detectors, enabling state-of-the-art RF-DETR to operate effectively in evolving-world settings. We also introduce FOGS (Forgetting, Openness, Generalisation Score) to holistically evaluate performance across these dimensions. Extensive experiments on Pascal Series and Diverse Weather benchmarks show EW-DETR outperforms other methods, improving FOGS by 57.24%.
32. 【2602.20981】Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
链接:https://arxiv.org/abs/2602.20981
作者:Christian Simon,MAsato Ishii,Wei-Yao Wang,Koichi Saito,Akio Hayakawa,Dongseok Shim,Zhi Zhong,Shuyang Cui,Shusuke Takahashi,Takashi Shibuya,Yuki Mitsufuji
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:frame-level video information, Scaling multimodal alignment, video information, frame-level video, due to limited
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To tackle this challenge, we present multimodal hierarchical networks so-called MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Our proposed method significantly improves long audio generation up to more than 5 minutes. We also prove that training short and testing long is possible in the video-to-audio generation tasks without training on the longer durations. We show in our experiments that our proposed method could achieve remarkable results on long-video to audio benchmarks, beating prior works in video-to-audio tasks. Moreover, we showcase our model capability in generating more than 5 minutes, while prior video-to-audio methods fall short in generating with long durations.
33. 【2602.20980】CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
链接:https://arxiv.org/abs/2602.20980
作者:Yang Zhang,Danyang Li,Yuxuan Li,Xin Zhang,Tianyu Xie,Mingming Cheng,Xiang Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Multimodal Large Language, integrating powerful language, powerful language backbones, Multimodal Large
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable performance by integrating powerful language backbones with large-scale visual encoders. Among these, latent Chain-of-Thought (CoT) methods enable implicit reasoning in continuous hidden states, facilitating seamless vision-language integration and faster inference. However, existing heuristically predefined supervision signals in latent CoT provide limited guidance for preserving critical visual information in intermediate latent states. To address this limitation, we propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images, respectively. By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics, without relying on auxiliary annotations or external modules. Extensive experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines, achieving substantial gains in fine-grained visual understanding while maintaining robust reasoning capabilities.
34. 【2602.20972】Are Multimodal Large Language Models Good Annotators for Image Tagging?
链接:https://arxiv.org/abs/2602.20972
作者:Ming-Kun Xie,Jia-Hao Xiao,Zhiqiang Kou,Zhongnian Li,Gang Niu,Masashi Sugiyama
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fundamental vision task, train multi-label classifiers, incurs significant labor, Large Language Models, Multimodal Large Language
备注:
点击查看摘要
Abstract:Image tagging, a fundamental vision task, traditionally relies on human-annotated datasets to train multi-label classifiers, which incurs significant labor and costs. While Multimodal Large Language Models (MLLMs) offer promising potential to automate annotation, their capability to replace human annotators remains underexplored. This paper aims to analyze the gap between MLLM-generated and human annotations and to propose an effective solution that enables MLLM-based annotation to replace manual labeling. Our analysis of MLLM annotations reveals that, under a conservative estimate, MLLMs can reduce annotation cost to as low as one-thousandth of the human cost, mainly accounting for GPU usage, which is nearly negligible compared to manual efforts. Their annotation quality reaches about 50\% to 80\% of human performance, while achieving over 90\% performance on downstream training this http URL by these findings, we propose TagLLM, a novel framework for image tagging, which aims to narrow the gap between MLLM-generated and human annotations. TagLLM comprises two components: Candidates generation, which employs structured group-wise prompting to efficiently produce a compact candidate set that covers as many true labels as possible while reducing subsequent annotation workload; and label disambiguation, which interactively calibrates the semantic concept of categories in the prompts and effectively refines the candidate labels. Extensive experiments show that TagLLM substantially narrows the gap between MLLM-generated and human annotations, especially in downstream training performance, where it closes about 60\% to 80\% of the difference.
35. 【2602.20951】See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
链接:https://arxiv.org/abs/2602.20951
作者:Jaehyun Park,Minyoung Ahn,Minkyu Kim,Jonghyun Lee,Jae-Gil Lee,Dongmin Park
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:compromise realism, recent advances, visual artifacts, generated images, artifacts
备注:
点击查看摘要
Abstract:Despite recent advances in diffusion models, AI generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and bigger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a highly crucial area of study. Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets. In this paper, we propose ArtiAgent, which efficiently creates pairs of real and artifact-injected images. It comprises three agents: a perception agent that recognizes and grounds entities and subentities from real images, a synthesis agent that introduces artifacts via artifact injection tools through novel patch-wise embedding manipulation within a diffusion transformer, and a curation agent that filters the synthesized artifacts and generates both local and global explanations for each instance. Using ArtiAgent, we synthesize 100K images with rich artifact annotations and demonstrate both efficacy and versatility across diverse applications. Code is available at link.
36. 【2602.20947】Estimation of Confidence Bounds in Binary Classification using Wilson Score Kernel Density Estimation
链接:https://arxiv.org/abs/2602.20947
作者:Thorbjørn Mosekjær Iversen,Zebin Duan,Frederik Hagelskjær
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Wilson Score Kernel, Score Kernel Density, deep learning-based binary, recent years, learning-based binary classifiers
备注:
点击查看摘要
Abstract:The performance and ease of use of deep learning-based binary classifiers have improved significantly in recent years. This has opened up the potential for automating critical inspection tasks, which have traditionally only been trusted to be done manually. However, the application of binary classifiers in critical operations depends on the estimation of reliable confidence bounds such that system performance can be ensured up to a given statistical significance. We present Wilson Score Kernel Density Classification, which is a novel kernel-based method for estimating confidence bounds in binary classification. The core of our method is the Wilson Score Kernel Density Estimator, which is a function estimator for estimating confidence bounds in Binomial experiments with conditionally varying success probabilities. Our method is evaluated in the context of selective classification on four different datasets, illustrating its use as a classification head of any feature extractor, including vision foundation models. Our proposed method shows similar performance to Gaussian Process Classification, but at a lower computational complexity.
37. 【2602.20943】UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling
链接:https://arxiv.org/abs/2602.20943
作者:Kaiyuan Tan,Yingying Shen,Mingfei Tu,Haohui Zhu,Bing Wang,Guang Chen,Hangjun Ye,Haiyang Sun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:autonomous driving simulation, closed-loop learning, critical for autonomous, simulation and closed-loop, feed-forward methods
备注:
点击查看摘要
Abstract:Dynamic driving scene reconstruction is critical for autonomous driving simulation and closed-loop learning. While recent feed-forward methods have shown promise for 3D reconstruction, they struggle with long-range driving sequences due to quadratic complexity in sequence length and challenges in modeling dynamic objects over extended durations. We propose UFO, a novel recurrent paradigm that combines the benefits of optimization-based and feed-forward methods for efficient long-range 4D reconstruction. Our approach maintains a 4D scene representation that is iteratively refined as new observations arrive, using a visibility-based filtering mechanism to select informative scene tokens and enable efficient processing of long sequences. For dynamic objects, we introduce an object pose-guided modeling approach that supports accurate long-range motion capture. Experiments on the Waymo Open Dataset demonstrate that our method significantly outperforms both per-scene optimization and existing feed-forward methods across various sequence lengths. Notably, our approach can reconstruct 16-second driving logs within 0.5 second while maintaining superior visual quality and geometric accuracy.
38. 【2602.20933】Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting
链接:https://arxiv.org/abs/2602.20933
作者:Shuangkang Fang,I-Chao Shen,Xuanyang Zhang,Zesheng Wang,Yufeng Wang,Wenrui Ding,Gang Yu,Takeo Igarashi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:nullifying Gaussian opacities, Gaussian Splatting, randomly nullifying Gaussian, Gaussian opacities, sparse-view conditions
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:Recent 3D Gaussian Splatting (3DGS) Dropout methods address overfitting under sparse-view conditions by randomly nullifying Gaussian opacities. However, we identify a neighbor compensation effect in these approaches: dropped Gaussians are often compensated by their neighbors, weakening the intended regularization. Moreover, these methods overlook the contribution of high-degree spherical harmonic coefficients (SH) to overfitting. To address these issues, we propose DropAnSH-GS, a novel anchor-based Dropout strategy. Rather than dropping Gaussians independently, our method randomly selects certain Gaussians as anchors and simultaneously removes their spatial neighbors. This effectively disrupts local redundancies near anchors and encourages the model to learn more robust, globally informed representations. Furthermore, we extend the Dropout to color attributes by randomly dropping higher-degree SH to concentrate appearance information in lower-degree SH. This strategy further mitigates overfitting and enables flexible post-training model compression via SH truncation. Experimental results demonstrate that DropAnSH-GS substantially outperforms existing Dropout methods with negligible computational overhead, and can be readily integrated into various 3DGS variants to enhance their performances. Project Website: this https URL
39. 【2602.20930】Computing a Characteristic Orientation for Rotation-Independent Image Analysis
链接:https://arxiv.org/abs/2602.20930
作者:Cristian Valero-Abundio,Emilio Sansano-Sansano,Raúl Montoliu,Marina Martínez García
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Handling geometric transformations, Handling geometric, geometric transformations, computer vision, challenge in deep
备注: Accepted for publication at the 21st International Conference on Computer Vision Theory and Applications (VISAPP 2026). 8 pages
点击查看摘要
Abstract:Handling geometric transformations, particularly rotations, remains a challenge in deep learning for computer vision. Standard neural networks lack inherent rotation invariance and typically rely on data augmentation or architectural modifications to improve robustness. Although effective, these approaches increase computational demands, require specialised implementations, or alter network structures, limiting their applicability. This paper introduces General Intensity Direction (GID), a preprocessing method that improves rotation robustness without modifying the network architecture. The method estimates a global orientation for each image and aligns it to a canonical reference frame, allowing standard models to process inputs more consistently across different rotations. Unlike moment-based approaches that extract invariant descriptors, this method directly transforms the image while preserving spatial structure, making it compatible with convolutional networks. Experimental evaluation on the rotated MNIST dataset shows that the proposed method achieves higher accuracy than state-of-the-art rotation-invariant architectures. Additional experiments on the CIFAR-10 dataset, confirm that the method remains effective under more complex conditions.
40. 【2602.20925】LST-SLAM: A Stereo Thermal SLAM System for Kilometer-Scale Dynamic Environments
链接:https://arxiv.org/abs/2602.20925
作者:Zeyu Jiang,Kuan Xu,Changhao Chen
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Thermal cameras offer, offer strong potential, cameras offer strong, thermal Simultaneous Localization, weather conditions
备注: ICRA 2026
点击查看摘要
Abstract:Thermal cameras offer strong potential for robot perception under challenging illumination and weather conditions. However, thermal Simultaneous Localization and Mapping (SLAM) remains difficult due to unreliable feature extraction, unstable motion tracking, and inconsistent global pose and map construction, particularly in dynamic large-scale outdoor environments. To address these challenges, we propose LST-SLAM, a novel large-scale stereo thermal SLAM system that achieves robust performance in complex, dynamic scenes. Our approach combines self-supervised thermal feature learning, stereo dual-level motion tracking, and geometric pose optimization. We also introduce a semantic-geometric hybrid constraint that suppresses potentially dynamic features lacking strong inter-frame geometric consistency. Furthermore, we develop an online incremental bag-of-words model for loop closure detection, coupled with global pose optimization to mitigate accumulated drift. Extensive experiments on kilometer-scale dynamic thermal datasets show that LST-SLAM significantly outperforms recent representative SLAM systems, including AirSLAM and DROID-SLAM, in both robustness and accuracy.
41. 【2602.20913】LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
链接:https://arxiv.org/abs/2602.20913
作者:Jihao Qiu,Lingxi Xie,Xinyue Huo,Qi Tian,Qixiang Ye
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:low computational budgets, computational budgets, paper addresses, addresses the critical, critical and underexplored
备注: 17 pages, 9 figures, 8 tables, accepted to CVPR 2026
点击查看摘要
Abstract:This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of name, which enjoys superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available. Code and data are available at: this https URL
42. 【2602.20911】From Isolation to Integration: Building an Adaptive Expert Forest for Pre-Trained Model-based Class-Incremental Learning
链接:https://arxiv.org/abs/2602.20911
作者:Ruiqi Liu,Boyu Diao,Hangda Liu,Zhulin An,Fei Wang,Yongjun Xu
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Class-Incremental Learning, learn new classes, requires models, CIL, Learning
备注:
点击查看摘要
Abstract:Class-Incremental Learning (CIL) requires models to learn new classes without forgetting old ones. A common method is to freeze a pre-trained model and train a new, lightweight adapter for each task. While this prevents forgetting, it treats the learned knowledge as a simple, unstructured collection and fails to use the relationships between tasks. To this end, we propose the Semantic-guided Adaptive Expert Forest (SAEF), a new method that organizes adapters into a structured hierarchy for better knowledge sharing. SAEF first groups tasks into conceptual clusters based on their semantic relationships. Then, within each cluster, it builds a balanced expert tree by creating new adapters from merging the adapters of similar tasks. At inference time, SAEF finds and activates a set of relevant experts from the forest for any given input. The final prediction is made by combining the outputs of these activated experts, weighted by how confident each expert is. Experiments on several benchmark datasets show that SAEF achieves SOTA performance.
43. 【2602.20903】xtPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
链接:https://arxiv.org/abs/2602.20903
作者:Hanshen Zhu,Yuliang Liu,Xuecheng Wu,An-Lan Wang,Hao Feng,Dingkang Yang,Chao Feng,Can Huang,Jingqun Tang,Xiang Bai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:advanced models frequently, models frequently produce, frequently produce text, frequently produce, specialist OCR models
备注: Code: [this https URL](https://github.com/CIawevy/TextPecker)
点击查看摘要
Abstract:Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. As a result, even state-of-the-art generators (e.g., SeedDream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals and works with any textto-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand structural-error coverage. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it significantly yields average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state-of-the-art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step towards reliable and structural faithful visual text generation.
44. 【2602.20901】SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
链接:https://arxiv.org/abs/2602.20901
作者:Yuechen Xie,Xiaoyan Zhang,Yicheng Shan,Hao Zhu,Rui Tang,Rong Wei,Mingli Song,Yuanyu Wan,Jie Song
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:spatial logical reasoning, spatial logical, real-world scenarios due, logical reasoning, logical
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects in complex scenes, but also the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs, outperforming all previous methods. Code and dataset are available at this https URL.
45. 【2602.20880】When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
链接:https://arxiv.org/abs/2602.20880
作者:Yongli Xiang,Ziming Hong,Zhaoqing Wang,Xiangyu Zhao,Bo Han,Tongliang Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generating high-quality images, demonstrated significant advancements, raising potential safety, potential safety concerns, high-quality images
备注: CVPR 2026; Code is released at [this https URL](https://github.com/tmllab/2026_CVPR_CASG)
点击查看摘要
Abstract:Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to "harmful conflicts" where mitigating one type of harm may inadvertently amplify another, thus increasing overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model's evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.
46. 【2602.20873】MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
链接:https://arxiv.org/abs/2602.20873
作者:Jiahao Xu,Sheng Huang,Xin Zhang,Zhixiong Nan,Jiajun Dong,Nankun Mu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:slide image classification, slide image, expert-labeled slides, image classification, classification is primarily
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:In computational pathology, few-shot whole slide image classification is primarily driven by the extreme scarcity of expert-labeled slides. Recent vision-language methods incorporate textual semantics generated by large language models, but treat these descriptions as static class-level priors that are shared across all samples and lack sample-wise refinement. This limits both the diversity and precision of visual-semantic alignment, hindering generalization under limited supervision. To overcome this, we propose the stochastic MUlti-view Semantic Enhancement (MUSE), a framework that first refines semantic precision via sample-wise adaptation and then enhances semantic richness through retrieval-augmented multi-view generation. Specifically, MUSE introduces Sample-wise Fine-grained Semantic Enhancement (SFSE), which yields a fine-grained semantic prior for each sample through MoE-based adaptive visual-semantic interaction. Guided by this prior, Stochastic Multi-view Model Optimization (SMMO) constructs an LLM-generated knowledge base of diverse pathological descriptions per class, then retrieves and stochastically integrates multiple matched textual views during training. These dynamically selected texts serve as enriched semantic supervisions to stochastically optimize the vision-language model, promoting robustness and mitigating overfitting. Experiments on three benchmark WSI datasets show that MUSE consistently outperforms existing vision-language baselines in few-shot settings, demonstrating that effective few-shot pathology learning requires not only richer semantic sources but also their active and sample-aware semantic optimization. Our code is available at: this https URL.
47. 【2602.20860】DA-Cal: Towards Cross-Domain Calibration in Semantic Segmentation
链接:https://arxiv.org/abs/2602.20860
作者:Wangkai Li,Rui Sun,Zhaoyang Li,Yujia Chen,Tianzhu Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:methods greatly enhance, unsupervised domain adaptation, greatly enhance target, methods greatly, resulting in misalignment
备注:
点击查看摘要
Abstract:While existing unsupervised domain adaptation (UDA) methods greatly enhance target domain performance in semantic segmentation, they often neglect network calibration quality, resulting in misalignment between prediction confidence and actual accuracy -- a significant risk in safety-critical applications. Our key insight emerges from observing that performance degrades substantially when soft pseudo-labels replace hard pseudo-labels in cross-domain scenarios due to poor calibration, despite the theoretical equivalence of perfectly calibrated soft pseudo-labels to hard pseudo-labels. Based on this finding, we propose DA-Cal, a dedicated cross-domain calibration framework that transforms target domain calibration into soft pseudo-label optimization. DA-Cal introduces a Meta Temperature Network to generate pixel-level calibration parameters and employs bi-level optimization to establish the relationship between soft pseudo-labels and UDA supervision, while utilizing complementary domain-mixing strategies to prevent overfitting and reduce domain discrepancies. Experiments demonstrate that DA-Cal seamlessly integrates with existing self-training frameworks across multiple UDA segmentation benchmarks, significantly improving target domain calibration while delivering performance gains without inference overhead. The code will be released.
48. 【2602.20853】On the Explainability of Vision-Language Models in Art History
链接:https://arxiv.org/abs/2602.20853
作者:Stefanie Schneider
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:shared embedding space, Vision-Language Models, embedding space, Explainable Artificial Intelligence, textual data
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) transfer visual and textual data into a shared embedding space. In so doing, they enable a wide range of multimodal tasks, while also raising critical questions about the nature of machine 'understanding.' In this paper, we examine how Explainable Artificial Intelligence (XAI) methods can render the visual reasoning of a VLM - namely, CLIP - legible in art-historical contexts. To this end, we evaluate seven methods, combining zero-shot localization experiments with human interpretability studies. Our results indicate that, while these methods capture some aspects of human interpretation, their effectiveness hinges on the conceptual stability and representational availability of the examined categories.
49. 【2602.20851】Hybrid Fusion: One-Minute Efficient Training for Zero-Shot Cross-Domain Image Fusion
链接:https://arxiv.org/abs/2602.20851
作者:Ran Zhang,Xuanhua He,Liu Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Image fusion seeks, superior image, integrate complementary information, seeks to integrate, integrate complementary
备注:
点击查看摘要
Abstract:Image fusion seeks to integrate complementary information from multiple sources into a single, superior image. While traditional methods are fast, they lack adaptability and performance. Conversely, deep learning approaches achieve state-of-the-art (SOTA) results but suffer from critical inefficiencies: their reliance on slow, resource-intensive, patch-based training introduces a significant gap with full-resolution inference. We propose a novel hybrid framework that resolves this trade-off. Our method utilizes a learnable U-Net to generate a dynamic guidance map that directs a classic, fixed Laplacian pyramid fusion kernel. This decoupling of policy learning from pixel synthesis enables remarkably efficient full-resolution training, eliminating the train-inference gap. Consequently, our model achieves SOTA-comparable performance in about one minute on a RTX 4090 or two minutes on a consumer laptop GPU from scratch without any external model and demonstrates powerful zero-shot generalization across diverse tasks, from infrared-visible to medical imaging. By design, the fused output is linearly constructed solely from source information, ensuring high faithfulness for critical applications. The codes are available at this https URL
50. 【2602.20845】FLIM Networks with Bag of Feature Points
链接:https://arxiv.org/abs/2602.20845
作者:João Deltregia Martinelli,Marcelo Luis Rodrigues Filho,Felipe Crispim da Rocha Salvagnini,Gilson Junior Soares,Jefersson A. dos Santos,Alexandre X. Falcão
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Convolutional networks require, extensive image annotation, require extensive image, networks require extensive, Convolutional networks
备注: Accepted at the 28th Iberoamerican Congress on Pattern Recognition (CIARP 2025). To appear in Lecture Notes in Computer Science (LNCS), Springer
点击查看摘要
Abstract:Convolutional networks require extensive image annotation, which can be costly and time-consuming. Feature Learning from Image Markers (FLIM) tackles this challenge by estimating encoder filters (i.e., kernel weights) from user-drawn markers on discriminative regions of a few representative images without traditional optimization. Such an encoder combined with an adaptive decoder comprises a FLIM network fully trained without backpropagation. Prior research has demonstrated their effectiveness in Salient Object Detection (SOD), being significantly lighter than existing lightweight models. This study revisits FLIM SOD and introduces FLIM-Bag of Feature Points (FLIM-BoFP), a considerably faster filter estimation method. The previous approach, FLIM-Cluster, derives filters through patch clustering at each encoder's block, leading to computational overhead and reduced control over filter locations. FLIM-BoFP streamlines this process by performing a single clustering at the input block, creating a bag of feature points, and defining filters directly from mapped feature points across all blocks. The paper evaluates the benefits in efficiency, effectiveness, and generalization of FLIM-BoFP compared to FLIM-Cluster and other state-of-the-art baselines for parasite detection in optical microscopy images.
51. 【2602.20839】raining-Free Multi-Concept Image Editing
链接:https://arxiv.org/abs/2602.20839
作者:Niki Foteinopoulou,Ignas Budvytis,Stephan Liwicki
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:training remains challenging, remains challenging, training remains, Abstract, unifies Optimised DDS
备注: 17 pages, 13 figures
点击查看摘要
Abstract:Editing images with diffusion models without training remains challenging. While recent optimisation-based methods achieve strong zero-shot edits from text, they struggle to preserve identity or capture details that language alone cannot express. Many visual concepts such as facial structure, material texture, or object geometry are impossible to express purely through text prompts alone. To address this gap, we introduce a training-free framework for concept-based image editing, which unifies Optimised DDS with LoRA-driven concept composition, where the training data of the LoRA represent the concept. Our approach enables combining and controlling multiple visual concepts directly within the diffusion process, integrating semantic guidance from text with low-level cues from pretrained concept adapters. We further refine DDS for stability and controllability through ordered timesteps, regularisation, and negative-prompt guidance. Quantitative and qualitative results demonstrate consistent improvements over existing training-free diffusion editing methods on InstructPix2Pix and ComposLoRA benchmarks. Code will be made publicly available.
52. 【2602.20818】GatedCLIP: Gated Multimodal Fusion for Hateful Memes Detection
链接:https://arxiv.org/abs/2602.20818
作者:Yingying Guo,Ke Zhang,Zirong Zeng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Detecting hateful content, presents unique challenges, memes presents unique, multimodal memes presents, Detecting hateful
备注: Preprint
点击查看摘要
Abstract:Detecting hateful content in multimodal memes presents unique challenges, as harmful messages often emerge from the complex interplay between benign images and text. We propose GatedCLIP, a Vision-Language model that enhances CLIP's multimodal capabilities with specialized architectural improvements for hateful memes detection. Our approach introduces learned projection heads that map CLIP embeddings to a task-optimized semantic space, a dynamic gated fusion mechanism that adaptively weights visual and textual features, and a contrastive learning objective that maintains cross-modal semantic alignment. Experiments on the Hateful Memes dataset demonstrate that GatedCLIP achieves an AUROC of 0.66, substantially outperforming the CLIP baseline (AUROC 0.49) while maintaining computational efficiency with only 350K trainable parameters.
53. 【2602.20807】RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction
链接:https://arxiv.org/abs/2602.20807
作者:Yangfan Zhao,Hanwei Zhang,Ke Huang,Qiufeng Wang,Zhenzhou Shao,Dengyu Wu
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Simultaneous Localization, Gaussian Splatting SLAM, Gaussian splatting, enables continuous, gained popularity
备注:
点击查看摘要
Abstract:Combining 3D Gaussian splatting with Simultaneous Localization and Mapping (SLAM) has gained popularity as it enables continuous 3D environment reconstruction during motion. However, existing methods struggle in dynamic environments, particularly moving objects complicate 3D reconstruction and, in turn, hinder reliable tracking. The emergence of 4D reconstruction, especially 4D Gaussian splatting, offers a promising direction for addressing these challenges, yet its potential for 4D-aware SLAM remains largely underexplored. Along this direction, we propose a robust and efficient framework, namely Reweighting Uncertainty in Gaussian Splatting SLAM (RU4D-SLAM) for 4D scene reconstruction, that introduces temporal factors into spatial 3D representation while incorporating uncertainty-aware perception of scene changes, blurred image synthesis, and dynamic scene reconstruction. We enhance dynamic scene representation by integrating motion blur rendering, and improve uncertainty-aware tracking by extending per-pixel uncertainty modeling, which is originally designed for static scenarios, to handle blurred images. Furthermore, we propose a semantic-guided reweighting mechanism for per-pixel uncertainty estimation in dynamic scenes, and introduce a learnable opacity weight to support adaptive 4D mapping. Extensive experiments on standard benchmarks demonstrate that our method substantially outperforms state-of-the-art approaches in both trajectory accuracy and 4D scene reconstruction, particularly in dynamic environments with moving objects and low-quality inputs. Code available: this https URL
54. 【2602.20794】VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
链接:https://arxiv.org/abs/2602.20794
作者:Jie Wang,Guang Li,Zhijian Huang,Chenxu Dang,Hangjun Ye,Yahong Han,Long Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:geometric modeling capabilities, autonomous driving, existing Vision-Language Models, cross-view geometric grounding, inherently lack
备注: CVPR 2026
点击查看摘要
Abstract:The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing QA data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM's 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks like cross-view risk perception, motion prediction, and trajectory planning. It's our belief that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.
55. 【2602.20792】SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking
链接:https://arxiv.org/abs/2602.20792
作者:Muhammad Saif Ullah Khan,Didier Stricker
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:complex multi-joint kinematics, spine complex multi-joint, lack of large-scale, fundamental to understanding, remains underexplored
备注: Accepted at CVPR 2026
点击查看摘要
Abstract:Modeling spinal motion is fundamental to understanding human biomechanics, yet remains underexplored in computer vision due to the spine's complex multi-joint kinematics and the lack of large-scale 3D annotations. We present a biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. Using this framework, we create the first open dataset, named SIMSPINE, which provides sparse vertebra-level 3D spinal annotations for natural full-body motions in indoor multi-camera capture without external restraints. With 2.14 million frames, this enables data-driven learning of vertebral kinematics from subtle posture variations and bridges the gap between musculoskeletal simulation and computer vision. In addition, we release pretrained baselines covering fine-tuned 2D detectors, monocular 3D pose lifting models, and multi-view reconstruction pipelines, establishing a unified benchmark for biomechanically valid spine motion estimation. Specifically, our 2D spine baselines improve the state-of-the-art from 0.63 to 0.80 AUC in controlled environments, and from 0.91 to 0.93 AP for in-the-wild spine tracking. Together, the simulation framework and SIMSPINE dataset advance research in vision-based biomechanics, motion analysis, and digital human modeling by enabling reproducible, anatomically grounded 3D spine estimation under natural conditions.
56. 【2602.20790】Real-time Motion Segmentation with Event-based Normal Flow
链接:https://arxiv.org/abs/2602.20790
作者:Sheng Zhong,Zhongyang Ren,Xiya Zhu,Dehao Yuan,Cornelia Fermuller,Yi Zhou
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:handle visual tasks, microsecond resolution, offering the potential, challenging scenarios, cameras are bio-inspired
备注:
点击查看摘要
Abstract:Event-based cameras are bio-inspired sensors with pixels that independently and asynchronously respond to brightness changes at microsecond resolution, offering the potential to handle visual tasks in challenging scenarios. However, due to the sparse information content in individual events, directly processing the raw event data to solve vision tasks is highly inefficient, which severely limits the applicability of state-of-the-art methods in real-time tasks, such as motion segmentation, a fundamental task for dynamic scene understanding. Incorporating normal flow as an intermediate representation to compress motion information from event clusters within a localized region provides a more effective solution. In this work, we propose a normal flow-based motion segmentation framework for event-based vision. Leveraging the dense normal flow directly learned from event neighborhoods as input, we formulate the motion segmentation task as an energy minimization problem solved via graph cuts, and optimize it iteratively with normal flow clustering and motion model fitting. By using a normal flow-based motion model initialization and fitting method, the proposed system is able to efficiently estimate the motion models of independently moving objects with only a limited number of candidate models, which significantly reduces the computational complexity and ensures real-time performance, achieving nearly a 800x speedup in comparison to the open-source state-of-the-art method. Extensive evaluations on multiple public datasets fully demonstrate the accuracy and efficiency of our framework.
57. 【2602.20773】Federated Learning for Cross-Modality Medical Image Segmentation via Augmentation-Driven Generalization
链接:https://arxiv.org/abs/2602.20773
作者:Sachin Dudda Nagaraju,Ashkan Moradi,Bendik Skarre Abrahamsen,Mattijs Elschot
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:medical image analysis, remains difficult due, Artificial intelligence, models remains difficult, privacy-constrained imaging data
备注: Submitted to IEEE JBHI
点击查看摘要
Abstract:Artificial intelligence has emerged as a transformative tool in medical image analysis, yet developing robust and generalizable segmentation models remains difficult due to fragmented, privacy-constrained imaging data siloed across institutions. While federated learning (FL) enables collaborative model training without centralizing data, cross-modality domain shifts pose a critical challenge, particularly when models trained on one modality fail to generalize to another. Many existing solutions require paired multimodal data per patient or rely on complex architectures, both of which are impractical in real clinical settings. In this work, we consider a realistic FL scenario where each client holds single-modality data (CT or MRI), and systematically investigate augmentation strategies for cross-modality generalization. Using abdominal organ segmentation and whole-heart segmentation as representative multi-class and binary segmentation benchmarks, we evaluate convolution-based spatial augmentation, frequency-domain manipulation, domain-specific normalization, and global intensity nonlinear (GIN) augmentation. Our results show that GIN consistently outperforms alternatives in both centralized and federated settings by simulating cross-modality appearance variations while preserving anatomical structure. For the pancreas, Dice score improved from 0.073 to 0.437, a 498% gain. Our federated approach achieves 93-98% of centralized training accuracy, demonstrating strong cross-modality generalization without compromising data privacy, pointing toward feasible federated AI deployment across diverse healthcare systems.
58. 【2602.20752】OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation
链接:https://arxiv.org/abs/2602.20752
作者:Tian Lan,Lei Xu,Zimu Yuan,Shanggui Liu,Jiajun Liu,Jiaxin Liu,Weilai Xiang,Hongyu Yang,Dong Jiang,Jianxin Yin,Dingyu Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:global health burden, Musculoskeletal disorders represent, significant global health, musculoskeletal MRI interpretation, disability worldwide
备注:
点击查看摘要
Abstract:Musculoskeletal disorders represent a significant global health burden and are a leading cause of disability worldwide. While MRI is essential for accurate diagnosis, its interpretation remains exceptionally challenging. Radiologists must identify multiple potential abnormalities within complex anatomical structures across different imaging planes, a process that requires significant expertise and is prone to variability. We developed OrthoDiffusion, a unified diffusion-based foundation model designed for multi-task musculoskeletal MRI interpretation. The framework utilizes three orientation-specific 3D diffusion models, pre-trained in a self-supervised manner on 15,948 unlabeled knee MRI scans, to learn robust anatomical features from sagittal, coronal, and axial views. These view-specific representations are integrated to support diverse clinical tasks, including anatomical segmentation and multi-label diagnosis. Our evaluation demonstrates that OrthoDiffusion achieves excellent performance in the segmentation of 11 knee structures and the detection of 8 knee abnormalities. The model exhibited remarkable robustness across different clinical centers and MRI field strengths, consistently outperforming traditional supervised models. Notably, in settings where labeled data was scarce, OrthoDiffusion maintained high diagnostic precision using only 10\% of training labels. Furthermore, the anatomical representations learned from knee imaging proved highly transferable to other joints, achieving strong diagnostic performance across 11 diseases of the ankle and shoulder. These findings suggest that diffusion-based foundation models can serve as a unified platform for multi-disease diagnosis and anatomical segmentation, potentially improving the efficiency and accuracy of musculoskeletal MRI interpretation in real-world clinical workflows.
59. 【2602.20739】PyVision-RL: Forging Open Agentic Vision Models via RL
链接:https://arxiv.org/abs/2602.20739
作者:Shitian Zhao,Shaoheng Lin,Ming Li,Haoquan Zhang,Wenshuo Peng,Kaipeng Zhang,Chen Wei
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:agentic multimodal models, Reinforcement learning, reinforcement learning framework, agentic behavior, limiting the benefits
备注: preprint
点击查看摘要
Abstract:Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
60. 【2602.20731】Communication-Inspired Tokenization for Structured Image Representations
链接:https://arxiv.org/abs/2602.20731
作者:Aram Davtyan,Yusuf Sahin,Yasaman Haghighi,Sebastian Stapf,Pablo Acuaviva,Alexandre Alahi,Paolo Favaro
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:multimodal systems, transformer-based architectures, Discrete image tokenizers, tokenizers have emerged, key component
备注: Project website: [this https URL](https://araachie.github.io/comit/)
点击查看摘要
Abstract:Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.
61. 【2602.20725】Bridging Physically Based Rendering and Diffusion Models with Stochastic Differential Equation
链接:https://arxiv.org/abs/2602.20725
作者:Junwei Shu,Wenjie Liu,Changgu Chen,Hantang Liu,Yang Li,Changbo Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:producing realistic content, limited explicit control, image generators excel, Diffusion-based image generators, generators excel
备注: preprint
点击查看摘要
Abstract:Diffusion-based image generators excel at producing realistic content from text or image conditions, but they offer only limited explicit control over low-level, physically grounded shading and material properties. In contrast, physically based rendering (PBR) offers fine-grained physical control but lacks prompt-driven flexibility. Although these two paradigms originate from distinct communities, both share a common evolution -- from noisy observations to clean images. In this paper, we propose a unified stochastic formulation that bridges Monte Carlo rendering and diffusion-based generative modeling. First, a general stochastic differential equation (SDE) formulation for Monte Carlo integration under the Central Limit Theorem is modeled. Through instantiation via physically based path tracing, we convert it into a physically grounded SDE representation. Moreover, we provide a systematic analysis of how the physical characteristics of path tracing can be extended to existing diffusion models from the perspective of noise variance. Extensive experiments across multiple tasks show that our method can exert physically grounded control over diffusion-generated results, covering tasks such as rendering and material editing.
62. 【2602.20721】CleanStyle: Plug-and-Play Style Conditioning Purification for Text-to-Image Stylization
链接:https://arxiv.org/abs/2602.20721
作者:Xiaoman Feng,Mingkun Lei,Yang Wang,Dingwen Fu,Chi Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Style, style image undesirably, Style transfer, style embedding, tail components
备注: 26 pages
点击查看摘要
Abstract:Style transfer in diffusion models enables controllable visual generation by injecting the style of a reference image. However, recent encoder-based methods, while efficient and tuning-free, often suffer from content leakage, where semantic elements from the style image undesirably appear in the output, impairing prompt fidelity and stylistic consistency. In this work, we introduce CleanStyle, a plug-and-play framework that filters out content-related noise from the style embedding without retraining. Motivated by empirical analysis, we observe that such leakage predominantly stems from the tail components of the style embedding, which are isolated via Singular Value Decomposition (SVD). To address this, we propose CleanStyleSVD (CS-SVD), which dynamically suppresses tail components using a time-aware exponential schedule, providing clean, style-preserving conditional embeddings throughout the denoising process. Furthermore, we present Style-Specific Classifier-Free Guidance (SS-CFG), which reuses the suppressed tail components to construct style-aware unconditional inputs. Unlike conventional methods that use generic negative embeddings (e.g., zero vectors), SS-CFG introduces targeted negative signals that reflect style-specific but prompt-irrelevant visual elements. This enables the model to effectively suppress these distracting patterns during generation, thereby improving prompt fidelity and enhancing the overall visual quality of stylized outputs. Our approach is lightweight, interpretable, and can be seamlessly integrated into existing encoder-based diffusion models without retraining. Extensive experiments demonstrate that CleanStyle substantially reduces content leakage, improves stylization quality and improves prompt alignment across a wide range of style references and prompts.
63. 【2602.20718】Monocular Endoscopic Tissue 3D Reconstruction with Multi-Level Geometry Regularization
链接:https://arxiv.org/abs/2602.20718
作者:Yangsen Chen,Hao Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieving robot-assisted surgery, Reconstructing deformable endoscopic, Gaussian Splatting, Reconstructing deformable, robot-assisted surgery
备注: ijcnn 2025
点击查看摘要
Abstract:Reconstructing deformable endoscopic tissues is crucial for achieving robot-assisted surgery. However, 3D Gaussian Splatting-based approaches encounter challenges in achieving consistent tissue surface reconstruction, while existing NeRF-based methods lack real-time rendering capabilities. In pursuit of both smooth deformable surfaces and real-time rendering, we introduce a novel approach based on 3D Gaussian Splatting. Specifically, we introduce surface-aware reconstruction, initially employing a Sign Distance Field-based method to construct a mesh, subsequently utilizing this mesh to constrain the Gaussian Splatting reconstruction process. Furthermore, to ensure the generation of physically plausible deformations, we incorporate local rigidity and global non-rigidity restrictions to guide Gaussian deformation, tailored for the highly deformable nature of soft endoscopic tissue. Based on 3D Gaussian Splatting, our proposed method delivers a fast rendering process and smooth surface appearances. Quantitative and qualitative analysis against alternative methodologies shows that our approach achieves solid reconstruction quality in both textures and geometries.
64. 【2602.20709】Onboard-Targeted Segmentation of Straylight in Space Camera Sensors
链接:https://arxiv.org/abs/2602.20709
作者:Riccardo Gallon,Fabian Schiemenz,Alessandra Menicucci,Eberhard Gill
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:space camera faults, artificial intelligence, study details, details an artificial, camera faults
备注: Submitted to Aerospace Science and Technology
点击查看摘要
Abstract:This study details an artificial intelligence (AI)-based methodology for the semantic segmentation of space camera faults. Specifically, we address the segmentation of straylight effects induced by solar presence around the camera's Field of View (FoV). Anomalous images are sourced from our published dataset. Our approach emphasizes generalization across diverse flare textures, leveraging pre-training on a public dataset (Flare7k++) including flares in various non-space contexts to mitigate the scarcity of realistic space-specific data. A DeepLabV3 model with MobileNetV3 backbone performs the segmentation task. The model design targets deployment in spacecraft resource-constrained hardware. Finally, based on a proposed interface between our model and the onboard navigation pipeline, we develop custom metrics to assess the model's performance in the system-level context.
65. 【2602.20700】NGL-Prompter: Training-Free Sewing Pattern Estimation from a Single Image
链接:https://arxiv.org/abs/2602.20700
作者:Anna Badalyan,Pratheba Selvaraju,Giorgio Becherini,Omid Taheri,Victoria Fernandez Abrevaya,Michael Black
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Estimating sewing patterns, Estimating sewing, creating high-quality, language, Estimating
备注: 10 pages, 7 figures
点击查看摘要
Abstract:Estimating sewing patterns from images is a practical approach for creating high-quality 3D garments. Due to the lack of real-world pattern-image paired data, prior approaches fine-tune large vision language models (VLMs) on synthetic garment datasets generated by randomly sampling from a parametric garment model GarmentCode. However, these methods often struggle to generalize to in-the-wild images, fail to capture real-world correlations between garment parts, and are typically restricted to single-layer outfits. In contrast, we observe that VLMs are effective at describing garments in natural language, yet perform poorly when asked to directly regress GarmentCode parameters from images. To bridge this gap, we propose NGL (Natural Garment Language), a novel intermediate language that restructures GarmentCode into a representation more understandable to language models. Leveraging this language, we introduce NGL-Prompter, a training-free pipeline that queries large VLMs to extract structured garment parameters, which are then deterministically mapped to valid GarmentCode. We evaluate our method on the Dress4D, CloSe and a newly collected dataset of approximately 5,000 in-the-wild fashion images. Our approach achieves state-of-the-art performance on standard geometry metrics and is strongly preferred in both human and GPT-based perceptual evaluations compared to existing baselines. Furthermore, NGL-prompter can recover multi-layer outfits whereas competing methods focus mostly on single-layer garments, highlighting its strong generalization to real-world images even with occluded parts. These results demonstrate that accurate sewing pattern reconstruction is possible without costly model training. Our code and data will be released for research use.
66. 【2602.20689】MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision
链接:https://arxiv.org/abs/2602.20689
作者:Bedrettin Cetinkaya,Sinan Kalkan,Emre Akbas
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:edge maps remains, Generating crisp, edge detection, affecting both traditional, maps remains
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:Generating crisp, i.e., one-pixel-wide, edge maps remains one of the fundamental challenges in edge detection, affecting both traditional and learning-based methods. To obtain crisp edges, most existing approaches rely on two hand-crafted post-processing algorithms, Non-Maximum Suppression (NMS) and skeleton-based thinning, which are non-differentiable and hinder end-to-end optimization. Moreover, all existing crisp edge detection methods still depend on such post-processing to achieve satisfactory results. To address this limitation, we propose \MethodLPP, a lightweight, only $\sim$21K additional parameters, and plug-and-play matching-based supervision module that can be appended to any edge detection model for joint end-to-end learning of crisp edges. At each training iteration, \MethodLPP performs one-to-one matching between predicted and ground-truth edges based on spatial distance and confidence, ensuring consistency between training and testing protocols. Extensive experiments on four popular datasets demonstrate that integrating \MethodLPP substantially improves the performance of existing edge detection models. In particular, \MethodLPP increases the Average Crispness (AC) metric by up to 2--4$\times$ compared to baseline models. Under the crispness-emphasized evaluation (CEval), \MethodLPP further boosts baseline performance by up to 20--35\% in ODS and achieves similar gains in OIS and AP, achieving SOTA performance that matches or surpasses standard post-processing for the first time. Code is available at this https URL.
67. 【2602.20685】RAYNOVA: 3D-Geometry-Free Auto-Regressive Driving World Modeling with Unified Spatio-Temporal Representation
链接:https://arxiv.org/abs/2602.20685
作者:Yichen Xie,Chensheng Peng,Mazen Abdelfattah,Yihan Hu,Jiezhi Yang,Eric Higgins,Ryan Brigden,Masayoshi Tomizuka,Wei Zhan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:physically plausible behavior, foundation models aim, World foundation models, plausible behavior, multiview world model
备注: Accepted by CVPR 2026; Project website: [this http URL](http://yichen928.github.io/raynova)
点击查看摘要
Abstract:World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-agonistic multiview world model for driving scenarios that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RAYNOVA achieves state-of-the-art multi-view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without explicit 3D scene representation. Our code will be released at this https URL.
68. 【2602.20673】GA-Drive: Geometry-Appearance Decoupled Modeling for Free-viewpoint Driving Scene Generatio
链接:https://arxiv.org/abs/2602.20673
作者:Hao Zhang,Lue Fan,Qitai Wang,Wenbo Li,Zehuan Wu,Lewei Lu,Zhaoxiang Zhang,Hongsheng Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:autonomous driving systems, high-fidelity driving simulator, autonomous driving, driving systems, training and evaluating
备注:
点击查看摘要
Abstract:A free-viewpoint, editable, and high-fidelity driving simulator is crucial for training and evaluating end-to-end autonomous driving systems. In this paper, we present GA-Drive, a novel simulation framework capable of generating camera views along user-specified novel trajectories through Geometry-Appearance Decoupling and Diffusion-Based Generation. Given a set of images captured along a recorded trajectory and the corresponding scene geometry, GA-Drive synthesizes novel pseudo-views using geometry information. These pseudo-views are then transformed into photorealistic views using a trained video diffusion model. In this way, we decouple the geometry and appearance of scenes. An advantage of such decoupling is its support for appearance editing via state-of-the-art video-to-video editing techniques, while preserving the underlying geometry, enabling consistent edits across both original and novel trajectories. Extensive experiments demonstrate that GA-Drive substantially outperforms existing methods in terms of NTA-IoU, NTL-IoU, and FID scores.
69. 【2602.20672】BBQ-to-Image: Numeric Bounding Box and Qolor Control in Large-Scale Text-to-Image Models
链接:https://arxiv.org/abs/2602.20672
作者:Eliran Kachlon,Alexander Visheratin,Nimrod Sarid,Tal Hacham,Eyal Gutflaish,Saar Huberman,Hezi Zisman,David Ruppin,Ron Mokady
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:approaches leveraging long, recent approaches leveraging, support fine-grained generation, realism and controllability, leveraging long
备注:
点击查看摘要
Abstract:Text-to-image models have rapidly advanced in realism and controllability, with recent approaches leveraging long, detailed captions to support fine-grained generation. However, a fundamental parametric gap remains: existing models rely on descriptive language, whereas professional workflows require precise numeric control over object location, size, and color. In this work, we introduce BBQ, a large-scale text-to-image model that directly conditions on numeric bounding boxes and RGB triplets within a unified structured-text framework. We obtain precise spatial and chromatic control by training on captions enriched with parametric annotations, without architectural modifications or inference-time optimization. This also enables intuitive user interfaces such as object dragging and color pickers, replacing ambiguous iterative prompting with precise, familiar controls. Across comprehensive evaluations, BBQ achieves strong box alignment and improves RGB color fidelity over state-of-the-art baselines. More broadly, our results support a new paradigm in which user intent is translated into an intermediate structured language, consumed by a flow-based transformer acting as a renderer and naturally accommodating numeric parameters.
70. 【2602.20666】BoxSplitGen: A Generative Model for 3D Part Bounding Boxes in Varying Granularity
链接:https://arxiv.org/abs/2602.20666
作者:Juil Koo,Wei-Tung Lin,Chanho Park,Chanhyeok Park,Minhyuk Sung
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:model, bounding boxes, generative model, abstract ideas, Human creativity
备注: Project page: [this https URL](https://boxsplitgen.github.io)
点击查看摘要
Abstract:Human creativity follows a perceptual process, moving from abstract ideas to finer details during creation. While 3D generative models have advanced dramatically, models specifically designed to assist human imagination in 3D creation -- particularly for detailing abstractions from coarse to fine -- have not been explored. We propose a framework that enables intuitive and interactive 3D shape generation by iteratively splitting bounding boxes to refine the set of bounding boxes. The main technical components of our framework are two generative models: the box-splitting generative model and the box-to-shape generative model. The first model, named BoxSplitGen, generates a collection of 3D part bounding boxes with varying granularity by iteratively splitting coarse bounding boxes. It utilizes part bounding boxes created through agglomerative merging and learns the reverse of the merging process -- the splitting sequences. The model consists of two main components: the first learns the categorical distribution of the box to be split, and the second learns the distribution of the two new boxes, given the set of boxes and the indication of which box to split. The second model, the box-to-shape generative model, is trained by leveraging the 3D shape priors learned by an existing 3D diffusion model while adapting the model to incorporate bounding box conditioning. In our experiments, we demonstrate that the box-splitting generative model outperforms token prediction models and the inpainting approach with an unconditional diffusion model. Also, we show that our box-to-shape model, based on a state-of-the-art 3D diffusion model, provides superior results compared to a previous model.
71. 【2602.20664】AnimeAgent: Is the Multi-Agent via Image-to-Video models a Good Disney Storytelling Artist?
链接:https://arxiv.org/abs/2602.20664
作者:Hailong Yan,Shice Liu,Tao Wang,Xiangtao Zhang,Yijie Zhong,Jinwei Chen,Le Zhang,Bo Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Custom Storyboard Generation, Custom Storyboard, Storyboard Generation, multi-character consistent storytelling, aims to produce
备注: Tech Report
点击查看摘要
Abstract:Custom Storyboard Generation (CSG) aims to produce high-quality, multi-character consistent storytelling. Current approaches based on static diffusion models, whether used in a one-shot manner or within multi-agent frameworks, face three key limitations: (1) Static models lack dynamic expressiveness and often resort to "copy-paste" pattern. (2) One-shot inference cannot iteratively correct missing attributes or poor prompt adherence. (3) Multi-agents rely on non-robust evaluators, ill-suited for assessing stylized, non-realistic animation. To address these, we propose AnimeAgent, the first Image-to-Video (I2V)-based multi-agent framework for CSG. Inspired by Disney's "Combination of Straight Ahead and Pose to Pose" workflow, AnimeAgent leverages I2V's implicit motion prior to enhance consistency and expressiveness, while a mixed subjective-objective reviewer enables reliable iterative refinement. We also collect a human-annotated CSG benchmark with ground-truth. Experiments show AnimeAgent achieves SOTA performance in consistency, prompt fidelity, and stylization.
72. 【2602.20658】Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video
链接:https://arxiv.org/abs/2602.20658
作者:Mohammad Sadra Rajabi,Aanuoluwapo Ojelade,Sunwook Kim,Maury A. Nussbaum
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
关键词:ergonomic risk assessment, work-related musculoskeletal disorders, quantifying physical exposure, informing ergonomic interventions, NIOSH Lifting Equation
备注:
点击查看摘要
Abstract:Manual lifting tasks are a major contributor to work-related musculoskeletal disorders, and effective ergonomic risk assessment is essential for quantifying physical exposure and informing ergonomic interventions. The Revised NIOSH Lifting Equation (RNLE) is a widely used ergonomic risk assessment tool for lifting tasks that relies on six task variables, including horizontal (H) and vertical (V) hand distances; such distances are typically obtained through manual measurement or specialized sensing systems and are difficult to use in real-world environments. We evaluated the feasibility of using innovative vision-language models (VLMs) to non-invasively estimate H and V from RGB video streams. Two multi-stage VLM-based pipelines were developed: a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. Both pipelines used text-guided localization of task-relevant regions of interest, visual feature extraction from those regions, and transformer-based temporal regression to estimate H and V at the start and end of a lift. For a range of lifting tasks, estimation performance was evaluated using leave-one-subject-out validation across the two pipelines and seven camera view conditions. Results varied significantly across pipelines and camera view conditions, with the segmentation-based, multi-view pipeline consistently yielding the smallest errors, achieving mean absolute errors of approximately 6-8 cm when estimating H and 5-8 cm when estimating V. Across pipelines and camera view configurations, pixel-level segmentation reduced estimation error by approximately 20-30% for H and 35-40% for V relative to the detection-only pipeline. These findings support the feasibility of VLM-based pipelines for video-based estimation of RNLE distance parameters.
73. 【2602.20653】SD4R: Sparse-to-Dense Learning for 3D Object Detection with 4D Radar
链接:https://arxiv.org/abs/2602.20653
作者:Xiaokai Bai,Jiahao Cheng,Songkai Wang,Yixuan Luo,Lianqing Zheng,Xiaohan Zhang,Si-Yuan Cao,Hui-Liang Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:radar measurements offer, radar point clouds, point clouds, measurements offer, offer an affordable
备注: 7 pages, 5 figures, 4 tables
点击查看摘要
Abstract:4D radar measurements offer an affordable and weather-robust solution for 3D perception. However, the inherent sparsity and noise of radar point clouds present significant challenges for accurate 3D object detection, underscoring the need for effective and robust point clouds densification. Despite recent progress, existing densification methods often fail to address the extreme sparsity of 4D radar point clouds and exhibit limited robustness when processing scenes with a small number of points. In this paper, we propose SD4R, a novel framework that transforms sparse radar point clouds into dense representations. SD4R begins by utilizing a foreground point generator (FPG) to mitigate noise propagation and produce densified point clouds. Subsequently, a logit-query encoder (LQE) enhances conventional pillarization, resulting in robust feature representations. Through these innovations, our SD4R demonstrates strong capability in both noise reduction and foreground point densification. Extensive experiments conducted on the publicly available View-of-Delft dataset demonstrate that SD4R achieves state-of-the-art performance. Source code is available at this https URL.
74. 【2602.20650】Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression
链接:https://arxiv.org/abs/2602.20650
作者:Chenyue Yu,Lingao Xiao,Jinhong Deng,Ivor W. Tsang,Yang He
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:demands pose challenges, Large-scale image datasets, Large-scale image, high storage demands, storage demands pose
备注: Accepted by ICLR 2026
点击查看摘要
Abstract:Large-scale image datasets are fundamental to deep learning, but their high storage demands pose challenges for deployment in resource-constrained environments. While existing approaches reduce dataset size by discarding samples, they often ignore the significant redundancy within each image -- particularly in the color space. To address this, we propose Dataset Color Quantization (DCQ), a unified framework that compresses visual datasets by reducing color-space redundancy while preserving information crucial for model training. DCQ achieves this by enforcing consistent palette representations across similar images, selectively retaining semantically important colors guided by model perception, and maintaining structural details necessary for effective feature learning. Extensive experiments across CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that DCQ significantly improves training performance under aggressive compression, offering a scalable and robust solution for dataset-level storage reduction. Code is available at \href{this https URL}{this https URL}.
75. 【2602.20636】SurgAtt-Tracker: Online Surgical Attention Tracking via Temporal Proposal Reranking and Motion-Aware Refinement
链接:https://arxiv.org/abs/2602.20636
作者:Rulin Zhou,Guankun Wang,An Wang,Yujie Ma,Lixin Ouyang,Bolin Cui,Junyan Li,Chaowei Zhu,Mingyang Li,Ming Chen,Xiaopin Zhong,Peng Lu,Jiankun Wang,Xianming Liu,Hongliang Ren
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:minimally invasive surgery, efficient minimally invasive, direct object-centric assumptions, Accurate and stable, conflate visual attention
备注:
点击查看摘要
Abstract:Accurate and stable field-of-view (FoV) guidance is critical for safe and efficient minimally invasive surgery, yet existing approaches often conflate visual attention estimation with downstream camera control or rely on direct object-centric assumptions. In this work, we formulate surgical attention tracking as a spatio-temporal learning problem and model surgeon focus as a dense attention heatmap, enabling continuous and interpretable frame-wise FoV guidance. We propose SurgAtt-Tracker, a holistic framework that robustly tracks surgical attention by exploiting temporal coherence through proposal-level reranking and motion-aware refinement, rather than direct regression. To support systematic training and evaluation, we introduce SurgAtt-1.16M, a large-scale benchmark with a clinically grounded annotation protocol that enables comprehensive heatmap-based attention analysis across procedures and institutions. Extensive experiments on multiple surgical datasets demonstrate that SurgAtt-Tracker consistently achieves state-of-the-art performance and strong robustness under occlusion, multi-instrument interference, and cross-domain settings. Beyond attention tracking, our approach provides a frame-wise FoV guidance signal that can directly support downstream robotic FoV planning and automatic camera control.
76. 【2602.20632】Boosting Instance Awareness via Cross-View Correlation with 4D Radar and Camera for 3D Object Detection
链接:https://arxiv.org/abs/2602.20632
作者:Xiaokai Bai,Lianqing Zheng,Si-Yuan Cao,Xiaohan Zhang,Zhe Wu,Beinan Yu,Fang Wang,Jie Bai,Hui-Liang Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:promising sensing modality, autonomous driving due, robustness and affordability, promising sensing, sensing modality
备注: 14 pages, 10 figures, 13 tables
点击查看摘要
Abstract:4D millimeter-wave radar has emerged as a promising sensing modality for autonomous driving due to its robustness and affordability. However, its sparse and weak geometric cues make reliable instance activation difficult, limiting the effectiveness of existing radar-camera fusion paradigms. BEV-level fusion offers global scene understanding but suffers from weak instance focus, while perspective-level fusion captures instance details but lacks holistic context. To address these limitations, we propose SIFormer, a scene-instance aware transformer for 3D object detection using 4D radar and camera. SIFormer first suppresses background noise during view transformation through segmentation- and depth-guided localization. It then introduces a cross-view activation mechanism that injects 2D instance cues into BEV space, enabling reliable instance awareness under weak radar geometry. Finally, a transformer-based fusion module aggregates complementary image semantics and radar geometry for robust perception. As a result, with the aim of enhancing instance awareness, SIFormer bridges the gap between the two paradigms, combining their complementary strengths to address inherent sparse nature of radar and improve detection accuracy. Experiments demonstrate that SIFormer achieves state-of-the-art performance on View-of-Delft, TJ4DRadSet and NuScenes datasets. Source code is available at this http URL.
77. 【2602.20630】From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
链接:https://arxiv.org/abs/2602.20630
作者:Yepeng Liu,Hao Li,Liwen Yang,Fangzhen Li,Xudi Ge,Yuliang Gu,kuang Gao,Bing Wang,Guang Chen,Hangjun Ye,Yongchao Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vision systems, component of modern, Keypoint-based matching, fundamental component, Keypoint-based
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the \textbf{Tra}ck-\textbf{q}uality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art (SOTA) keypoint detection and description methods.
78. 【2602.20627】Object-Scene-Camera Decomposition and Recomposition for Data-Efficient Monocular 3D Object Detection
链接:https://arxiv.org/abs/2602.20627
作者:Zhaonian Kuang,Rui Ding,Meng Yang,Xinhu Zheng,Gang Hua
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:deep learning based, strong human bias, high-performance deep learning, http URL, complicated visual variation
备注: IJCV
点击查看摘要
Abstract:Monocular 3D object detection (M3OD) is intrinsically ill-posed, hence training a high-performance deep learning based M3OD model requires a humongous amount of labeled data with complicated visual variation from diverse scenes, variety of objects and camera this http URL, we observe that, due to strong human bias, the three independent entities, i.e., object, scene, and camera pose, are always tightly entangled when an image is captured to construct training data. More specifically, specific 3D objects are always captured in particular scenes with fixed camera poses, and hence lacks necessary diversity. Such tight entanglement induces the challenging issues of insufficient utilization and overfitting to uniform training data. To mitigate this, we propose an online object-scene-camera decomposition and recomposition data manipulation scheme to more efficiently exploit the training data. We first fully decompose training images into textured 3D object point models and background scenes in an efficient computation and storage manner. We then continuously recompose new training images in each epoch by inserting the 3D objects into the freespace of the background scenes, and rendering them with perturbed camera poses from textured 3D point representation. In this way, the refreshed training data in all epochs can cover the full spectrum of independent object, scene, and camera pose combinations. This scheme can serve as a plug-and-play component to boost M3OD models, working flexibly with both fully and sparsely supervised settings. In the sparsely-supervised setting, objects closest to the ego-camera for all instances are sparsely annotated. We then can flexibly increase the annotated objects to control annotation cost. For validation, our method is widely applied to five representative M3OD models and evaluated on both the KITTI and the more complicated Waymo datasets.
79. 【2602.20618】RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces
链接:https://arxiv.org/abs/2602.20618
作者:Haonan An,Xiaohui Ye,Guang Hua,Yihang Tao,Hangcheng Cao,Xiangyu Yu,Yuguang Fang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:posing unprecedented challenges, severely undermining visual, undermining visual integrity, facilitated sophisticated face, severely undermining
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:The proliferation of AI-generated content has facilitated sophisticated face manipulation, severely undermining visual integrity and posing unprecedented challenges to intellectual property. In response, a common proactive defense leverages fragile watermarks to detect, localize, or even recover manipulated regions. However, these methods always assume an adversary unaware of the embedded watermark, overlooking their inherent vulnerability to watermark removal attacks. Furthermore, this fragility is exacerbated in the commonly used dual-watermark strategy that adds a robust watermark for image ownership verification, where mutual interference and limited embedding capacity reduce the fragile watermark's effectiveness. To address the gap, we propose RecoverMark, a watermarking framework that achieves robust manipulation localization, content recovery, and ownership verification simultaneously. Our key insight is twofold. First, we exploit a critical real-world constraint: an adversary must preserve the background's semantic consistency to avoid visual detection, even if they apply global, imperceptible watermark removal attacks. Second, using the image's own content (face, in this paper) as the watermark enhances extraction robustness. Based on these insights, RecoverMark treats the protected face content itself as the watermark and embeds it into the surrounding background. By designing a robust two-stage training paradigm with carefully crafted distortion layers that simulate comprehensive potential attacks and a progressive training strategy, RecoverMark achieves a robust watermark embedding in no fragile manner for image manipulation localization, recovery, and image IP protection simultaneously. Extensive experiments demonstrate the proposed RecoverMark's robustness against both seen and unseen attacks and its generalizability to in-distribution and out-of-distribution data.
80. 【2602.20616】Knowing the Unknown: Interpretable Open-World Object Detection via Concept Decomposition Model
链接:https://arxiv.org/abs/2602.20616
作者:Xueqiang Lv,Shizhou Zhang,Yinghui Xing,Di Xu,Peng Wang,Yanning Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Open-world object detection, requires incrementally detecting, reliably identifying unknown, Open-world object, requires incrementally
备注:
点击查看摘要
Abstract:Open-world object detection (OWOD) requires incrementally detecting known categories while reliably identifying unknown objects. Existing methods primarily focus on improving unknown recall, yet overlook interpretability, often leading to known-unknown confusion and reduced prediction reliability. This paper aims to make the entire OWOD framework interpretable, enabling the detector to truly "knowing the unknown". To this end, we propose a concept-driven InterPretable OWOD framework(IPOW) by introducing a Concept Decomposition Model (CDM) for OWOD, which explicitly decomposes the coupled RoI features in Faster R-CNN into discriminative, shared, and background concepts. Discriminative concepts identify the most discriminative features to enlarge the distances between known categories, while shared and background concepts, due to their strong generalization ability, can be readily transferred to detect unknown categories. Leveraging the interpretable framework, we identify that known-unknown confusion arises when unknown objects fall into the discriminative space of known classes. To address this, we propose Concept-Guided Rectification (CGR) to further resolve such confusion. Extensive experiments show that IPOW significantly improves unknown recall while mitigating confusion, and provides concept-level interpretability for both known and unknown predictions.
81. 【2602.20608】VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos
链接:https://arxiv.org/abs/2602.20608
作者:Aihua Mao,Kaihang Huang,Yong-Jin Liu,Chee Seng Chan,Ying He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:embodied visual reasoning, aims to identify, capability essential, essential to embodied, visual reasoning
备注:
点击查看摘要
Abstract:3D object affordance grounding aims to identify regions on 3D objects that support human-object interaction (HOI), a capability essential to embodied visual reasoning. However, most existing approaches rely on static visual or textual cues, neglecting that affordances are inherently defined by dynamic actions. As a result, they often struggle to localize the true contact regions involved in real interactions. We take a different perspective. Humans learn how to use objects by observing and imitating actions, not just by examining shapes. Motivated by this intuition, we introduce video-guided 3D affordance grounding, which leverages dynamic interaction sequences to provide functional supervision. To achieve this, we propose VAGNet, a framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities that static cues cannot address. To support this new setting, we introduce PVAD, the first HOI video-3D pairing affordance dataset, providing functional supervision unavailable in prior works. Extensive experiments on PVAD show that VAGNet achieves state-of-the-art performance, significantly outperforming static-based baselines. The code and dataset will be open publicly.
82. 【2602.20597】Interaction-aware Representation Modeling with Co-occurrence Consistency for Egocentric Hand-Object Parsing
链接:https://arxiv.org/abs/2602.20597
作者:Yuejiao Su,Yi Wang,Lei Yao,Yawen Cui,Lap-Pui Chau
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:next-generation embodied agents, developing next-generation embodied, egocentric human-environment interactions, embodied agents, fine-grained understanding
备注:
点击查看摘要
Abstract:A fine-grained understanding of egocentric human-environment interactions is crucial for developing next-generation embodied agents. One fundamental challenge in this area involves accurately parsing hands and active objects. While transformer-based architectures have demonstrated considerable potential for such tasks, several key limitations remain unaddressed: 1) existing query initialization mechanisms rely primarily on semantic cues or learnable parameters, demonstrating limited adaptability to changing active objects across varying input scenes; 2) previous transformer-based methods utilize pixel-level semantic features to iteratively refine queries during mask generation, which may introduce interaction-irrelevant content into the final embeddings; and 3) prevailing models are susceptible to "interaction illusion", producing physically inconsistent predictions. To address these issues, we propose an end-to-end Interaction-aware Transformer (InterFormer), which integrates three key components, i.e., a Dynamic Query Generator (DQG), a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss. The DQG explicitly grounds query initialization in the spatial dynamics of hand-object contact, enabling targeted generation of interaction-aware queries for hands and various active objects. The DFS fuses coarse interactive cues with semantic features, thereby suppressing interaction-irrelevant noise and emphasizing the learning of interactive relationships. The CoCo loss incorporates hand-object relationship constraints to enhance physical consistency in prediction. Our model achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets, demonstrating its effectiveness and strong generalization ability. Code and models are publicly available at this https URL.
83. 【2602.20584】Long-Term Multi-Session 3D Reconstruction Under Substantial Appearance Change
链接:https://arxiv.org/abs/2602.20584
作者:Beverley Gorry,Tobias Fischer,Michael Milford,Alejandro Fontan
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:repeated site visits, site visits separated, environmental monitoring requires, Long-term environmental monitoring, reconstruct and align
备注:
点击查看摘要
Abstract:Long-term environmental monitoring requires the ability to reconstruct and align 3D models across repeated site visits separated by months or years. However, existing Structure-from-Motion (SfM) pipelines implicitly assume near-simultaneous image capture and limited appearance change, and therefore fail when applied to long-term monitoring scenarios such as coral reef surveys, where substantial visual and structural change is common. In this paper, we show that the primary limitation of current approaches lies in their reliance on post-hoc alignment of independently reconstructed sessions, which is insufficient under large temporal appearance change. We address this limitation by enforcing cross-session correspondences directly within a joint SfM reconstruction. Our approach combines complementary handcrafted and learned visual features to robustly establish correspondences across large temporal gaps, enabling the reconstruction of a single coherent 3D model from imagery captured years apart, where standard independent and joint SfM pipelines break down. We evaluate our method on long-term coral reef datasets exhibiting significant real-world change, and demonstrate consistent joint reconstruction across sessions in cases where existing methods fail to produce coherent reconstructions. To ensure scalability to large datasets, we further restrict expensive learned feature matching to a small set of likely cross-session image pairs identified via visual place recognition, which reduces computational cost and improves alignment robustness.
84. 【2602.20583】PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
链接:https://arxiv.org/abs/2602.20583
作者:Wonyong Seo,Jaeho Moon,Jaehyup Lee,Soo Ye Kim,Munchurl Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:precise user control, Propagation-based video editing, single edited frame, enables precise user, Propagation-based video
备注: The first two authors contributed equally to this work (equal contribution)
点击查看摘要
Abstract:Propagation-based video editing enables precise user control by propagating a single edited frame into following frames while maintaining the original context such as motion and structures. However, training such models requires large-scale, paired (source and edited) video datasets, which are costly and complex to acquire. Hence, we propose the PropFly, a training pipeline for Propagation-based video editing, relying on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets. Specifically, our PropFly leverages one-step clean latent estimations from intermediate noised latents with varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of 'source' (low-CFG) and 'edited' (high-CFG) latents on-the-fly. The source latent serves as structural information of the video, while the edited latent provides the target transformation for learning propagation. Our pipeline enables an additional adapter attached to the pre-trained VDM to learn to propagate edits via Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation. Our on-the-fly supervision ensures the model to learn temporally consistent and dynamic transformations. Extensive experiments demonstrate that our PropFly significantly outperforms the state-of-the-art methods on various video editing tasks, producing high-quality editing results.
85. 【2602.20577】Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion
链接:https://arxiv.org/abs/2602.20577
作者:Jiaru Zhang,Manav Gagvani,Can Cui,Juntong Peng,Ruqi Zhang,Ziran Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Large Language, emerged as promising, promising candidates, Vision-Language Models
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged as promising candidates for end-to-end autonomous driving. However, these models typically face challenges in inference latency, action precision, and explainability. Existing autoregressive approaches struggle with slow token-by-token generation, while prior diffusion-based planners often rely on verbose, general-purpose language tokens that lack explicit geometric structure. In this work, we propose Masked Vision-Language-Action Diffusion for Autonomous Driving (MVLAD-AD), a novel framework designed to bridge the gap between efficient planning and semantic explainability via a masked vision-language-action diffusion model. Unlike methods that force actions into the language space, we introduce a discrete action tokenization strategy that constructs a compact codebook of kinematically feasible waypoints from real-world driving distributions. Moreover, we propose geometry-aware embedding learning to ensure that embeddings in the latent space approximate physical geometric metrics. Finally, an action-priority decoding strategy is introduced to prioritize trajectory generation. Extensive experiments on nuScenes and derived benchmarks demonstrate that MVLAD-AD achieves superior efficiency and outperforms state-of-the-art autoregressive and diffusion baselines in planning precision, while providing high-fidelity and explainable reasoning.
86. 【2602.20575】An interactive enhanced driving dataset for autonomous driving
链接:https://arxiv.org/abs/2602.20575
作者:Haojie Feng,Peizhi Zhang,Mengjie Tian,Xinrui Zhang,Zhuoren Li,Junpeng Huang,Xiurong Wang,Junfan Zhu,Jianzhou Wang,Dongxiao Yin,Lu Xiong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:full automation demands, automation demands robust, inadequate multimodal alignment, demands robust interactive, Interactive Enhanced Driving
备注:
点击查看摘要
Abstract:The evolution of autonomous driving towards full automation demands robust interactive capabilities; however, the development of Vision-Language-Action (VLA) models is constrained by the sparsity of interactive scenarios and inadequate multimodal alignment in existing data. To this end, this paper proposes the Interactive Enhanced Driving Dataset (IEDD). We develop a scalable pipeline to mine million-level interactive segments from naturalistic driving data based on interactive trajectories, and design metrics to quantify the interaction processes. Furthermore, the IEDD-VQA dataset is constructed by generating synthetic Bird's Eye View (BEV) videos where semantic actions are strictly aligned with structured language. Benchmark results evaluating ten mainstream Vision Language Models (VLMs) are provided to demonstrate the dataset's reuse value in assessing and fine-tuning the reasoning capabilities of autonomous driving models.
87. 【2602.20569】AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents
链接:https://arxiv.org/abs/2602.20569
作者:Jiaqi Wu,Yuchen Zhou,Muduo Xu,Zisheng Liang,Simiao Ren,Jiayu Xue,Meige Yang,Siying Chen,Jingheng Huan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:benchmark targeting exclusively, dedicated benchmark targeting, targeting exclusively, Adobe Photoshop, AUC
备注: 17 pages, 10 figures
点击查看摘要
Abstract:We present AIForge-Doc, the first dedicated benchmark targeting exclusively diffusion-model-based inpainting in financial and form documents with pixel-level annotation. Existing document forgery datasets rely on traditional digital editing tools (e.g., Adobe Photoshop, GIMP), creating a critical gap: state-of-the-art detectors are blind to the rapidly growing threat of AI-forged document fraud. AIForge-Doc addresses this gap by systematically forging numeric fields in real-world receipt and form images using two AI inpainting APIs -- Gemini 2.5 Flash Image and Ideogram v2 Edit -- yielding 4,061 forged images from four public document datasets (CORD, WildReceipt, SROIE, XFUND) across nine languages, annotated with pixel-precise tampered-region masks in DocTamper-compatible format. We benchmark three representative detectors -- TruFor, DocTamper, and a zero-shot GPT-4o judge -- and find that all existing methods degrade substantially: TruFor achieves AUC=0.751 (zero-shot, out-of-distribution) vs. AUC=0.96 on NIST16; DocTamper achieves AUC=0.563 vs. AUC=0.98 in-distribution, with pixel-level IoU=0.020; GPT-4o achieves only 0.509 -- essentially at chance -- confirming that AI-forged values are indistinguishable to automated detectors and VLMs. These results demonstrate that AIForge-Doc represents a qualitatively new and unsolved challenge for document forensics.
88. 【2602.20566】BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model
链接:https://arxiv.org/abs/2602.20566
作者:Haosheng Li,Weixin Mao,Zihan Lan,Hongwei Xiong,Hongan Wang,Chenyang Si,Ziwei Liu,Xiaoming Deng,Hua Chen
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Vision Language, leveraging Large Vision, Vision Language Models, Large Vision, Vision Language
备注: 9 pages, 10 figures
点击查看摘要
Abstract:Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision Language Models (VLMs) to jointly interpret instructions and visual inputs. However, the substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. Existing acceleration techniques for VLMs, such as token pruning, often result in degraded performance when directly applied to VLA models, as they overlook the relationships between different views and fail to account for the dynamic and task-specific characteristics of robotic operation. To address this, we propose BFA++, a dynamic token pruning framework designed specifically for VLA models. BFA++ introduces a hierarchical pruning strategy guided by two-level importance predictors: an intra-view predictor highlights task-relevant regions within each image to suppress spatial noise, while an inter-view predictor identifies critical camera views throughout different manipulation phases to reduce cross-view redundancy. This design enables efficient token selection while preserving essential visual cues, resulting in improved computational efficiency and higher manipulation success rates. Evaluations on the RoboTwin benchmark and real-world robotic tasks demonstrate that BFA++ consistently outperforms existing methods. BFA++ improves the success rate by about 10% on both the {\pi}0 and RDT models, achieving speedup of 1.8X and 1.5X, respectively. Our results highlight that context-sensitive and task-aware token pruning serves as a more effective strategy than full visual processing, enabling faster inference and improved manipulation accuracy in real-world robotic systems.
89. 【2602.20556】WildGHand: Learning Anti-Perturbation Gaussian Hand Avatars from Monocular In-the-Wild Videos
链接:https://arxiv.org/abs/2602.20556
作者:Hanhui Li,Xuan Huang,Wanquan Liu,Yuhao Cheng,Long Chen,Yiqiang Yan,Xiaodan Liang,Chenqiang Gao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:existing methods rely, extreme poses, hand-object interactions, motion blur, recent progress
备注:
点击查看摘要
Abstract:Despite recent progress in 3D hand reconstruction from monocular videos, most existing methods rely on data captured in well-controlled environments and therefore degrade in real-world settings with severe perturbations, such as hand-object interactions, extreme poses, illumination changes, and motion blur. To tackle these issues, we introduce WildGHand, an optimization-based framework that enables self-adaptive 3D Gaussian splatting on in-the-wild videos and produces high-fidelity hand avatars. WildGHand incorporates two key components: (i) a dynamic perturbation disentanglement module that explicitly represents perturbations as time-varying biases on 3D Gaussian attributes during optimization, and (ii) a perturbation-aware optimization strategy that generates per-frame anisotropic weighted masks to guide optimization. Together, these components allow the framework to identify and suppress perturbations across both spatial and temporal dimensions. We further curate a dataset of monocular hand videos captured under diverse perturbations to benchmark in-the-wild hand avatar reconstruction. Extensive experiments on this dataset and two public datasets demonstrate that WildGHand achieves state-of-the-art performance and substantially improves over its base model across multiple metrics (e.g., up to a $15.8\%$ relative gain in PSNR and a $23.1\%$ relative reduction in LPIPS). Our implementation and dataset are available at this https URL.
90. 【2602.20551】CAD-Prompted SAM3: Geometry-Conditioned Instance Segmentation for Industrial Objects
链接:https://arxiv.org/abs/2602.20551
作者:Zhenran Tang,Rohan Nagabhirava,Changliu Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:scenarios frequently encountered, printing environments, struggles with uncommon, scenarios frequently, Verbal-prompted segmentation
备注:
点击查看摘要
Abstract:Verbal-prompted segmentation is inherently limited by the expressiveness of natural language and struggles with uncommon, instance-specific, or difficult-to-describe objects: scenarios frequently encountered in manufacturing and 3D printing environments. While image exemplars provide an alternative, they primarily encode appearance cues such as color and texture, which are often unrelated to a part's geometric identity. In industrial settings, a single component may be produced in different materials, finishes, or colors, making appearance-based prompting unreliable. In contrast, such objects are typically defined by precise CAD models that capture their canonical geometry. We propose a CAD-prompted segmentation framework built on SAM3 that uses canonical multi-view renderings of a CAD model as prompt input. The rendered views provide geometry-based conditioning independent of surface appearance. The model is trained using synthetic data generated from mesh renderings in simulation under diverse viewpoints and scene contexts. Our approach enables single-stage, CAD-prompted mask prediction, extending promptable segmentation to objects that cannot be robustly described by language or appearance alone.
91. 【2602.20550】he Finite Primitive Basis Theorem for Computational Imaging: Formal Foundations of the OperatorGraph Representation
链接:https://arxiv.org/abs/2602.20550
作者:Chengshuai Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Computational imaging forward, coded aperture spectral, aperture spectral cameras, MRI scanners, cameras to MRI
备注:
点击查看摘要
Abstract:Computational imaging forward models, from coded aperture spectral cameras to MRI scanners, are traditionally implemented as monolithic, modality-specific codes. We prove that every forward model in a broad, precisely defined operator class Cimg (encompassing clinical, scientific, and industrial imaging modalities, both linear and nonlinear) admits an epsilon-approximate representation as a typed directed acyclic graph (DAG) whose nodes are drawn from a library of exactly 11 canonical primitives: Propagate, Modulate, Project, Encode, Convolve, Accumulate, Detect, Sample, Disperse, Scatter, and Transform. We call this the Finite Primitive Basis Theorem. The proof is constructive: we provide an algorithm that, given any H in Cimg, produces a DAG G with relative operator error at most epsilon and graph complexity within prescribed bounds. We further prove that the library is minimal: removing any single primitive causes at least one modality to lose its epsilon-approximate representation. A systematic analysis of nonlinearities in imaging physics shows they fall into two structural categories: pointwise scalar functions (handled by Transform) and self-consistent iterations (unrolled into existing linear primitives). Empirical validation on 31 linear modalities confirms eimg below 0.01 with at most 5 nodes and depth 5, and we provide constructive DAG decompositions for 9 additional nonlinear modalities. These results establish mathematical foundations for the Physics World Model (PWM) framework.
92. 【2602.20549】Sample-efficient evidence estimation of score based priors for model selection
链接:https://arxiv.org/abs/2602.20549
作者:Frederic Wang,Katherine L. Bouman
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
关键词:avoid severe bias, model evidence, Bayesian inverse problems, inverse problems, solving inverse problems
备注: ICLR 2026
点击查看摘要
Abstract:The choice of prior is central to solving ill-posed imaging inverse problems, making it essential to select one consistent with the measurements $y$ to avoid severe bias. In Bayesian inverse problems, this could be achieved by evaluating the model evidence $p(y \mid M)$ under different models $M$ that specify the prior and then selecting the one with the highest value. Diffusion models are the state-of-the-art approach to solving inverse problems with a data-driven prior; however, directly computing the model evidence with respect to a diffusion prior is intractable. Furthermore, most existing model evidence estimators require either many pointwise evaluations of the unnormalized prior density or an accurate clean prior score. We propose \method, an estimator of the model evidence of a diffusion prior by integrating over the time-marginals of posterior sampling methods. Our method leverages the large amount of intermediate samples naturally obtained during the reverse diffusion sampling process to obtain an accurate estimation of the model evidence using only a handful of posterior samples (e.g., 20). We also demonstrate how to implement our estimator in tandem with recent diffusion posterior sampling methods. Empirically, our estimator matches the model evidence when it can be computed analytically, and it is able to both select the correct diffusion model prior and diagnose prior misfit under different highly ill-conditioned, non-linear inverse problems, including a real-world black hole imaging problem.
93. 【2602.20548】Robust Spiking Neural Networks Against Adversarial Attacks
链接:https://arxiv.org/abs/2602.20548
作者:Shuai Wang,Malu Zhang,Yulin Jiang,Dehao Zhang,Ammar Belatreche,Yu Liang,Yimeng Shan,Zijian Zhou,Yang Yang,Haizhou Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Spiking Neural Networks, Neural Networks, Spiking Neural, represent a promising, spike-driven characteristics
备注: Published as a conference paper at ICLR 2026
点击查看摘要
Abstract:Spiking Neural Networks (SNNs) represent a promising paradigm for energy-efficient neuromorphic computing due to their bio-plausible and spike-driven characteristics. However, the robustness of SNNs in complex adversarial environments remains significantly constrained. In this study, we theoretically demonstrate that those threshold-neighboring spiking neurons are the key factors limiting the robustness of directly trained SNNs. We find that these neurons set the upper limits for the maximum potential strength of adversarial attacks and are prone to state-flipping under minor disturbances. To address this challenge, we propose a Threshold Guarding Optimization (TGO) method, which comprises two key aspects. First, we incorporate additional constraints into the loss function to move neurons' membrane potentials away from their thresholds. It increases SNNs' gradient sparsity, thereby reducing the theoretical upper bound of adversarial attacks. Second, we introduce noisy spiking neurons to transition the neuronal firing mechanism from deterministic to probabilistic, decreasing their state-flipping probability due to minor disturbances. Extensive experiments conducted in standard adversarial scenarios prove that our method significantly enhances the robustness of directly trained SNNs. These findings pave the way for advancing more reliable and secure neuromorphic computing in real-world applications.
94. 【2602.20543】Beyond Human Performance: A Vision-Language Multi-Agent Approach for Quality Control in Pharmaceutical Manufacturing
链接:https://arxiv.org/abs/2602.20543
作者:Subhra Jyoti Mandal,Lara Rachidi,Puneet Jain,Matthieu Duvinage,Sander W. Timmer
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Environmental Monitoring programs, Environmental Monitoring, Colony-forming unit, stringent quality standards, component of Environmental
备注:
点击查看摘要
Abstract:Colony-forming unit (CFU) detection is critical in pharmaceutical manufacturing, serving as a key component of Environmental Monitoring programs and ensuring compliance with stringent quality standards. Manual counting is labor-intensive and error-prone, while deep learning (DL) approaches, though accurate, remain vulnerable to sample quality variations and artifacts. Building on our earlier CNN-based framework (Beznik et al., 2020), we evaluated YOLOv5, YOLOv7, and YOLOv8 for CFU detection; however, these achieved only 97.08 percent accuracy, insufficient for pharmaceutical-grade requirements. A custom Detectron2 model trained on GSK's dataset of over 50,000 Petri dish images achieved 99 percent detection rate with 2 percent false positives and 0.6 percent false negatives. Despite high validation accuracy, Detectron2 performance degrades on outlier cases including contaminated plates, plastic artifacts, or poor optical clarity. To address this, we developed a multi-agent framework combining DL with vision-language models (VLMs). The VLM agent first classifies plates as valid or invalid. For valid samples, both DL and VLM agents independently estimate colony counts. When predictions align within 5 percent, results are automatically recorded in Postgres and SAP; otherwise, samples are routed for expert review. Expert feedback enables continuous retraining and self-improvement. Initial DL-based automation reduced human verification by 50 percent across vaccine manufacturing sites. With VLM integration, this increased to 85 percent, delivering significant operational savings. The proposed system provides a scalable, auditable, and regulation-ready solution for microbiological quality control, advancing automation in biopharmaceutical production.
95. 【2602.20537】PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning
链接:https://arxiv.org/abs/2602.20537
作者:Xinyong Cai,Changbin Sun,Yong Wang,Hongyu Yang,Yuankai Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:forecast future frames, Spatiotemporal predictive learning, predictive learning, aims to forecast, range of applications
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:Spatiotemporal predictive learning (STPL) aims to forecast future frames from past observations and is essential across a wide range of applications. Compared with recurrent or hybrid architectures, pure convolutional models offer superior efficiency and full parallelism, yet their fixed receptive fields limit their ability to adaptively capture spatially varying motion patterns. Inspired by biological center-surround organization and frequency-selective signal processing, we propose PFGNet, a fully convolutional framework that dynamically modulates receptive fields through pixel-wise frequency-guided gating. The core Peripheral Frequency Gating (PFG) block extracts localized spectral cues and adaptively fuses multi-scale large-kernel peripheral responses with learnable center suppression, effectively forming spatially adaptive band-pass filters. To maintain efficiency, all large kernels are decomposed into separable 1D convolutions ($1 \times k$ followed by $k \times 1$), reducing per-channel computational cost from $O(k^2)$ to $O(2k)$. PFGNet enables structure-aware spatiotemporal modeling without recurrence or attention. Experiments on Moving MNIST, TaxiBJ, Human3.6M, and KTH show that PFGNet delivers SOTA or near-SOTA forecasting performance with substantially fewer parameters and FLOPs. Our code is available at this https URL.
96. 【2602.20531】A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata
链接:https://arxiv.org/abs/2602.20531
作者:Azrin Sultana,Firoz Ahmed
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:app rating prediction, existing app rating, app rating, significant indicators, rating prediction
备注: 24 pages, 10 figures
点击查看摘要
Abstract:App ratings are among the most significant indicators of the quality, usability, and overall user satisfaction of mobile applications. However, existing app rating prediction models are largely limited to textual data or user interface (UI) features, overlooking the importance of jointly leveraging UI and semantic information. To address these limitations, this study proposes a lightweight vision--language framework that integrates both mobile UI and semantic information for app rating prediction. The framework combines MobileNetV3 to extract visual features from UI layouts and DistilBERT to extract textual features. These multimodal features are fused through a gated fusion module with Swish activations, followed by a multilayer perceptron (MLP) regression head. The proposed model is evaluated using mean absolute error (MAE), root mean square error (RMSE), mean squared error (MSE), coefficient of determination (R2), and Pearson correlation. After training for 20 epochs, the model achieves an MAE of 0.1060, an RMSE of 0.1433, an MSE of 0.0205, an R2 of 0.8529, and a Pearson correlation of 0.9251. Extensive ablation studies further demonstrate the effectiveness of different combinations of visual and textual encoders. Overall, the proposed lightweight framework provides valuable insights for developers and end users, supports sustainable app development, and enables efficient deployment on edge devices.
97. 【2602.20520】How Do Inpainting Artifacts Propagate to Language?
链接:https://arxiv.org/abs/2602.20520
作者:Pratham Yashwante,Davit Abrahamyan,Shresth Grover,Sukruth Rao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:diffusion-based inpainting affect, introduced by diffusion-based, affect language generation, inpainting affect language, visual artifacts introduced
备注:
点击查看摘要
Abstract:We study how visual artifacts introduced by diffusion-based inpainting affect language generation in vision-language models. We use a two-stage diagnostic setup in which masked image regions are reconstructed and then provided to captioning models, enabling controlled comparisons between captions generated from original and reconstructed inputs. Across multiple datasets, we analyze the relationship between reconstruction fidelity and downstream caption quality. We observe consistent associations between pixel-level and perceptual reconstruction metrics and both lexical and semantic captioning performance. Additional analysis of intermediate visual representations and attention patterns shows that inpainting artifacts lead to systematic, layer-dependent changes in model behavior. Together, these results provide a practical diagnostic framework for examining how visual reconstruction quality influences language generation in multimodal systems.
98. 【2602.20511】Leveraging Causal Reasoning Method for Explaining Medical Image Segmentation Models
链接:https://arxiv.org/abs/2602.20511
作者:Limai Jiang,Ruitao Xie,Bokai Yang,Huazhen Huang,Juan He,Yufu Huo,Zikai Wang,Yang Wei,Yunpeng Cai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabling precise localization, image segmentation plays, clinical decision-making, enabling precise, guiding interventions
备注: Preprint
点击查看摘要
Abstract:Medical image segmentation plays a vital role in clinical decision-making, enabling precise localization of lesions and guiding interventions. Despite significant advances in segmentation accuracy, the black-box nature of most deep models has raised growing concerns about their trustworthiness in high-stakes medical scenarios. Current explanation techniques have primarily focused on classification tasks, leaving the segmentation domain relatively underexplored. We introduced an explanation model for segmentation task which employs the causal inference framework and backpropagates the average treatment effect (ATE) into a quantification metric to determine the influence of input regions, as well as network components, on target segmentation areas. Through comparison with recent segmentation explainability techniques on two representative medical imaging datasets, we demonstrated that our approach provides more faithful explanations than existing approaches. Furthermore, we carried out a systematic causal analysis of multiple foundational segmentation models using our method, which reveals significant heterogeneity in perceptual strategies across different models, and even between different inputs for the same model. Suggesting the potential of our method to provide notable insights for optimizing segmentation models. Our code can be found at this https URL.
99. 【2602.20501】Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models
链接:https://arxiv.org/abs/2602.20501
作者:Qing Zhang,Xuesong Li,Jing Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Visual Foundation Models, visual system, Visual Foundation, models, interaction
备注: 11 pages, 12 figures, Accepted to CVPR 2026
点击查看摘要
Abstract:What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent's actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures, while generative models like Flux contain rich, verb-conditioned spatial attention maps that serve as implicit interaction priors. Crucially, we demonstrate that these two dimensions are not merely correlated but are composable elements of affordance. By simply fusing DINO's geometric prototypes with Flux's interaction maps in a training-free and zero-shot manner, we achieve affordance estimation competitive with weakly-supervised methods. This final fusion experiment confirms that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.
100. 【2602.20500】Strategy-Supervised Autonomous Laparoscopic Camera Control via Event-Driven Graph Mining
链接:https://arxiv.org/abs/2602.20500
作者:Keyu Zhou,Peisen Xu,Yahao Wu,Jiming Chen,Gaofeng Li,Shunlei Li
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Autonomous laparoscopic camera, rapid tool-tissue interactions, laparoscopic camera control, Autonomous laparoscopic, safe surgical view
备注: Submitted to IEEE Transactions on Robotics (T-RO). 19 pages, 9 figures
点击查看摘要
Abstract:Autonomous laparoscopic camera control must maintain a stable and safe surgical view under rapid tool-tissue interactions while remaining interpretable to surgeons. We present a strategy-grounded framework that couples high-level vision-language inference with low-level closed-loop control. Offline, raw surgical videos are parsed into camera-relevant temporal events (e.g., interaction, working-distance deviation, and view-quality degradation) and structured as attributed event graphs. Mining these graphs yields a compact set of reusable camera-handling strategy primitives, which provide structured supervision for learning. Online, a fine-tuned Vision-Language Model (VLM) processes the live laparoscopic view to predict the dominant strategy and discrete image-based motion commands, executed by an IBVS-RCM controller under strict safety constraints; optional speech input enables intuitive human-in-the-loop conditioning. On a surgeon-annotated dataset, event parsing achieves reliable temporal localization (F1-score 0.86), and the mined strategies show strong semantic alignment with expert interpretation (cluster purity 0.81). Extensive ex vivo experiments on silicone phantoms and porcine tissues demonstrate that the proposed system outperforms junior surgeons in standardized camera-handling evaluations, reducing field-of-view centering error by 35.26% and image shaking by 62.33%, while preserving smooth motion and stable working-distance regulation.
101. 【2602.20497】LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
链接:https://arxiv.org/abs/2602.20497
作者:Peiliang Cai,Jiacheng Liu,Haowen Xu,Xinyu Wang,Chang Zou,Linfeng Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:achieved remarkable success, video generation tasks, achieved remarkable, remarkable success, success in image
备注:
点击查看摘要
Abstract:Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.
102. 【2602.20496】Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching
链接:https://arxiv.org/abs/2602.20496
作者:Jintu Zheng,Qizhe Liu,HuangXin Xu,Zhuojie Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recurrent Neural Networks, Neural Networks, Recurrent Neural, hinders edge deployment, dependence on Recurrent
备注: Accepted to CVPR 2026 (3D vision track)
点击查看摘要
Abstract:While iterative stereo matching achieves high accuracy, its dependence on Recurrent Neural Networks (RNN) hinders edge deployment, a challenge underexplored in existing researches. We analyze iterative refinement and reveal that disparity updates are spatially sparse and temporally redundant. First, we introduce a progressive iteration pruning strategy that suppresses redundant update steps, effectively collapsing the recursive computation into a near-single-pass inference. Second, we propose a collaborative monocular prior transfer framework that implicitly embeds depth priors without requiring a dedicated monocular encoder, thereby eliminating its associated computational burden. Third, we develop FlashGRU, a hardware-aware RNN operator leveraging structured sparsity and I/O-conscious design, achieving a 7.28$\times$ speedup, 76.6\% memory peak reduction and 80.9\% global memory requests reduction over natvie ConvGRUs under 2K resolution. Our PipStereo enables real-time, high-fidelity stereo matching on edge hardware: it processes 320$\times$640 frames in just 75ms on an NVIDIA Jetson Orin NX (FP16) and 19ms on RTX 4090, matching the accuracy of large iterative based models, and our generalization ability and accuracy far exceeds that of existing real-time methods. Our embedded AI projects will be updated at: this https URL.
103. 【2602.20479】Path-Decoupled Hyperbolic Flow Matching for Few-Shot Adaptation
链接:https://arxiv.org/abs/2602.20479
作者:Lin Li,Ziqi Jiang,Gefan Ye,Zhenqi He,Jiahui Li,Jun Xiao,Kwang-Ting Cheng,Long Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:cross-modal few-shot adaptation, few-shot adaptation treat, adaptation treat visual-semantic, Recent advances, Hyperbolic Flow Matching
备注:
点击查看摘要
Abstract:Recent advances in cross-modal few-shot adaptation treat visual-semantic alignment as a continuous feature transport problem via Flow Matching (FM). However, we argue that Euclidean-based FM overlooks fundamental limitations of flat geometry, where polynomial volume growth fails to accommodate diverse feature distributions, leading to severe path entanglement. To this end, we propose path-decoupled Hyperbolic Flow Matching (HFM), leveraging the Lorentz manifold's exponential expansion for trajectory decoupling. HFM structures the transport via two key designs: 1) Centripetal hyperbolic alignment: It constructs a centripetal hierarchy by anchoring textual roots, which pushes visual leaves to the boundary to initialize orderly flows. 2) Path-decoupled objective: It acts as a ``semantic guardrail'' rigidly confining trajectories within isolated class-specific geodesic corridors via step-wise supervision. Furthermore, we devise an adaptive diameter-based stopping to prevent over-transportation into the crowded origin based on the intrinsic semantic scale. Extensive ablations on 11 benchmarks have shown that HFM establishes a new state-of-the-art, consistently outperforming its Euclidean counterparts. Our codes and models will be released.
104. 【2602.20476】SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens
链接:https://arxiv.org/abs/2602.20476
作者:Anindita Ghosh,Vladislav Golyanik,Taku Komura,Philipp Slusallek,Christian Theobalt,Rishabh Dabral
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Synthesizing text-driven, realistic scenes requires, scenes requires learning, avoiding collisions, requires learning
备注: 13 pages, 6 figures, 4 tables
点击查看摘要
Abstract:Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent ("walk to the couch") and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues and relying on (1) a text-conditioned autoregressive global motion planner that operates on a bird's-eye-view (BEV) image rendered from an elevated corner of the scene, encoded with DINOv2 features, as the scene representation, and (2) a geometry-grounded motion tokenizer trained via a conditional VQ-VAE, that uses 2D local scene heightmap, thus embedding surface physics directly into a discrete vocabulary. This 2D factorization reaches an efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.
105. 【2602.20423】MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
链接:https://arxiv.org/abs/2602.20423
作者:Taha Koleilat,Hojat Asgariandehkordi,Omid Nejati Manzari,Berardino Barile,Yiming Xiao,Hassan Rivaz
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:ambiguous anatomical features, Medical image segmentation, remains challenging due, image segmentation remains, Medical image
备注: CVPR 2026; Project Page: [this https URL](https://tahakoleilat.github.io/MedCLIPSeg)
点击查看摘要
Abstract:Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation.
106. 【2602.20417】gQIR: Generative Quanta Image Reconstruction
链接:https://arxiv.org/abs/2602.20417
作者:Aryan Garg,Sizhuo Ma,Mohit Gupta
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Capturing high-quality images, Capturing high-quality, fundamental challenge, challenge in computational, Capturing
备注: CVPR 2026
点击查看摘要
Abstract:Capturing high-quality images from only a few detected photons is a fundamental challenge in computational imaging. Single-photon avalanche diode (SPAD) sensors promise high-quality imaging in regimes where conventional cameras fail, but raw \emph{quanta frames} contain only sparse, noisy, binary photon detections. Recovering a coherent image from a burst of such frames requires handling alignment, denoising, and demosaicing (for color) under noise statistics far outside those assumed by standard restoration pipelines or modern generative models. We present an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. Our method leverages the structural and semantic priors of internet-scale diffusion models while introducing mechanisms to handle Bernoulli photon statistics. By integrating latent-space restoration with burst-level spatio-temporal reasoning, our approach produces reconstructions that are both photometrically faithful and perceptually pleasing, even under high-speed motion. We evaluate the method on synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging \textit{Deforming (XD)} video benchmark. Across all settings, the approach substantially improves perceptual quality over classical and modern learning-based baselines, demonstrating the promise of adapting large generative priors to extreme photon-limited sensing. Code at \href{this https URL}{this https URL}.
107. 【2602.20412】SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images
链接:https://arxiv.org/abs/2602.20412
作者:Aayush Dhakal,Subash Khanal,Srikumar Sastry,Jacob Arndt,Philipe Ambrozio Dias,Dalton Lunga,Nathan Jacobs
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fake image detection, research and society, rapid advancement, advancement of generative, critical challenge
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:The rapid advancement of generative models has made the detection of AI-generated images a critical challenge for both research and society. Recent works have shown that most state-of-the-art fake image detection methods overfit to their training data and catastrophically fail when evaluated on curated hard test sets with strong distribution shifts. In this work, we argue that it is more principled to learn a tight decision boundary around the real image distribution and treat the fake category as a sink class. To this end, we propose SimLBR, a simple and efficient framework for fake image detection using Latent Blending Regularization (LBR). Our method significantly improves cross-generator generalization, achieving up to +24.85\% accuracy and +69.62\% recall on the challenging Chameleon benchmark. SimLBR is also highly efficient, training orders of magnitude faster than existing approaches. Furthermore, we emphasize the need for reliability-oriented evaluation in fake image detection, introducing risk-adjusted metrics and worst-case estimates to better assess model robustness. All code and models will be released on HuggingFace and GitHub.
108. 【2602.20409】CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
链接:https://arxiv.org/abs/2602.20409
作者:Mainak Singha,Sarthak Mehrotra,Paolo Casari,Subhasis Chaudhuri,Elisa Ricci,Biplab Banerjee
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent vision-language models, impressive cross-modal reasoning, demonstrate impressive cross-modal, Recent vision-language, CLIP demonstrate impressive
备注: Accepted in CVPR 2026
点击查看摘要
Abstract:Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Codes are available at this https URL.
109. 【2602.20363】Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field
链接:https://arxiv.org/abs/2602.20363
作者:Sheyang Tang,Armin Shafiee Sarvestani,Jialu Xu,Xiaoyu Xu,Zhou Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:scene depends strongly, depends strongly, aesthetic, scene depends, limited camera adjustments
备注: 14 pages, 10 figures
点击查看摘要
Abstract:The aesthetic quality of a scene depends strongly on camera viewpoint. Existing approaches for aesthetic viewpoint suggestion are either single-view adjustments, predicting limited camera adjustments from a single image without understanding scene geometry, or 3D exploration approaches, which rely on dense captures or prebuilt 3D environments coupled with costly reinforcement learning (RL) searches. In this work, we introduce the notion of 3D aesthetic field that enables geometry-grounded aesthetic reasoning in 3D with sparse captures, allowing efficient viewpoint suggestions in contrast to costly RL searches. We opt to learn this 3D aesthetic field using a feedforward 3D Gaussian Splatting network that distills high-level aesthetic knowledge from a pretrained 2D aesthetic model into 3D space, enabling aesthetic prediction for novel viewpoints from only sparse input views. Building on this field, we propose a two-stage search pipeline that combines coarse viewpoint sampling with gradient-based refinement, efficiently identifying aesthetically appealing viewpoints without dense captures or RL exploration. Extensive experiments show that our method consistently suggests viewpoints with superior framing and composition compared to existing approaches, establishing a new direction toward 3D-aware aesthetic modeling.
110. 【2602.20360】Momentum Guidance: Plug-and-Play Guidance for Flow Models
链接:https://arxiv.org/abs/2602.20360
作者:Runlong Liao,Jian Yu,Baiyu Su,Chi Zhang,Lizhang Chen,Qiang Liu
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:high-quality generative modeling, vanilla conditional form, lack fine-grained detail, fine-grained detail due, generative modeling
备注:
点击查看摘要
Abstract:Flow-based generative models have become a strong framework for high-quality generative modeling, yet pretrained models are rarely used in their vanilla conditional form: conditional samples without guidance often appear diffuse and lack fine-grained detail due to the smoothing effects of neural networks. Existing guidance techniques such as classifier-free guidance (CFG) improve fidelity but double the inference cost and typically reduce sample diversity. We introduce Momentum Guidance (MG), a new dimension of guidance that leverages the ODE trajectory itself. MG extrapolates the current velocity using an exponential moving average of past velocities and preserves the standard one-evaluation-per-step cost. It matches the effect of standard guidance without extra computation and can further improve quality when combined with CFG. Experiments demonstrate MG's effectiveness across benchmarks. Specifically, on ImageNet-256, MG achieves average improvements in FID of 36.68% without CFG and 25.52% with CFG across various sampling settings, attaining an FID of 1.597 at 64 sampling steps. Evaluations on large flow-based models like Stable Diffusion 3 and FLUX.1-dev further confirm consistent quality enhancements across standard metrics.
111. 【2602.20354】3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism
链接:https://arxiv.org/abs/2602.20354
作者:Bhavik Chandna,Kelsey R. Allen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:evolving rapidly, generation is evolving, video, video generation, realism
备注:
点击查看摘要
Abstract:AI video generation is evolving rapidly. For video generators to be useful for applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process -- requiring human annotation or bespoke evaluation datasets which have restricted scope. Here we develop an automated evaluation framework for video realism which captures both semantics and coherent 3D structure and which does not require access to a reference video. Our method, 3DSPA, is a 3D spatiotemporal point autoencoder which integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation for video evaluation. 3DSPA models how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show that 3DSPA reliably identifies videos which violate physical laws, is more sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism across multiple datasets. Our results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models, and implicitly captures physical rule violations. The code and pretrained model weights will be available at this https URL.
112. 【2602.20351】BiRQA: Bidirectional Robust Quality Assessment for Images
链接:https://arxiv.org/abs/2602.20351
作者:Aleksandr Gushchin,Dmitriy S. Vatolin,Anastasia Antsiferova
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Full-Reference image quality, image quality assessment, current neural metrics, neural metrics remain, metrics remain slow
备注:
点击查看摘要
Abstract:Full-Reference image quality assessment (FR IQA) is important for image compression, restoration and generative modeling, yet current neural metrics remain slow and vulnerable to adversarial perturbations. We present BiRQA, a compact FR IQA metric model that processes four fast complementary features within a bidirectional multiscale pyramid. A bottom-up attention module injects fine-scale cues into coarse levels through an uncertainty-aware gate, while a top-down cross-gating block routes semantic context back to high resolution. To enhance robustness, we introduce Anchored Adversarial Training, a theoretically grounded strategy that uses clean "anchor" samples and a ranking loss to bound pointwise prediction error under attacks. On five public FR IQA benchmarks BiRQA outperforms or matches the previous state of the art (SOTA) while running ~3x faster than previous SOTA models. Under unseen white-box attacks it lifts SROCC from 0.30-0.57 to 0.60-0.84 on KADID-10k, demonstrating substantial robustness gains. To our knowledge, BiRQA is the only FR IQA model combining competitive accuracy with real-time throughput and strong adversarial resilience.
113. 【2602.20342】Large-scale Photorealistic Outdoor 3D Scene Reconstruction from UAV Imagery Using Gaussian Splatting Techniques
链接:https://arxiv.org/abs/2602.20342
作者:Christos Maikos,Georgios Angelidis,Georgios Th. Papadopoulos
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:converting drone-captured video, drone-captured video streams, pipeline capable, capable of converting, converting drone-captured
备注: 7 pages, 2 figures
点击查看摘要
Abstract:In this study, we present an end-to-end pipeline capable of converting drone-captured video streams into high-fidelity 3D reconstructions with minimal latency. Unmanned aerial vehicles (UAVs) are extensively used in aerial real-time perception applications. Moreover, recent advances in 3D Gaussian Splatting (3DGS) have demonstrated significant potential for real-time neural rendering. However, their integration into end-to-end UAV-based reconstruction and visualization systems remains underexplored. Our goal is to propose an efficient architecture that combines live video acquisition via RTMP streaming, synchronized sensor fusion, camera pose estimation, and 3DGS optimization, achieving continuous model updates and low-latency deployment within interactive visualization environments that supports immersive augmented and virtual reality (AR/VR) applications. Experimental results demonstrate that the proposed method achieves competitive visual fidelity, while delivering significantly higher rendering performance and substantially reduced end-to-end latency, compared to NeRF-based approaches. Reconstruction quality remains within 4-7\% of high-fidelity offline references, confirming the suitability of the proposed system for real-time, scalable augmented perception from aerial platforms.
114. 【2602.20330】Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking
链接:https://arxiv.org/abs/2602.20330
作者:Jingcheng Yang,Tianhu Xiong,Shengyi Qian,Klara Nahrstedt,Mingyuan Wu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:opaque black boxes, remain opaque black, Vision-language models, black boxes, powerful but remain
备注: To appear in the Findings of CVPR 2026
点击查看摘要
Abstract:Vision-language models (VLMs) are powerful but remain opaque black boxes. We introduce the first framework for transparent circuit tracing in VLMs to systematically analyze multimodal reasoning. By utilizing transcoders, attribution graphs, and attention-based methods, we uncover how VLMs hierarchically integrate visual and semantic concepts. We reveal that distinct visual feature circuits can handle mathematical reasoning and support cross-modal associations. Validated through feature steering and circuit patching, our framework proves these circuits are causal and controllable, laying the groundwork for more explainable and reliable VLMs.
115. 【2602.20328】GSNR: Graph Smooth Null-Space Representation for Inverse Problems
链接:https://arxiv.org/abs/2602.20328
作者:Romario Gualdrón-Hurtado,Roman Jacome,Rafael S. Suarez,Henry Arguello
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Optimization and Control (math.OC)
关键词:imaging are ill-posed, leading to infinitely, priors promote solutions, null-space, non-trivial null-space
备注: 23 pages, 24 figures, Accepted to The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026
点击查看摘要
Abstract:Inverse problems in imaging are ill-posed, leading to infinitely many solutions consistent with the measurements due to the non-trivial null-space of the sensing matrix. Common image priors promote solutions on the general image manifold, such as sparsity, smoothness, or score function. However, as these priors do not constrain the null-space component, they can bias the reconstruction. Thus, we aim to incorporate meaningful null-space information in the reconstruction framework. Inspired by smooth image representation on graphs, we propose Graph-Smooth Null-Space Representation (GSNR), a mechanism that imposes structure only into the invisible component. Particularly, given a graph Laplacian, we construct a null-restricted Laplacian that encodes similarity between neighboring pixels in the null-space signal, and we design a low-dimensional projection matrix from the $p$-smoothest spectral graph modes (lowest graph frequencies). This approach has strong theoretical and practical implications: i) improved convergence via a null-only graph regularizer, ii) better coverage, how much null-space variance is captured by $p$ modes, and iii) high predictability, how well these modes can be inferred from the measurements. GSNR is incorporated into well-known inverse problem solvers, e.g., PnP, DIP, and diffusion solvers, in four scenarios: image deblurring, compressed sensing, demosaicing, and image super-resolution, providing consistent improvement of up to 4.3 dB over baseline formulations and up to 1 dB compared with end-to-end learned models in terms of PSNR.
116. 【2602.20312】N4MC: Neural 4D Mesh Compression
链接:https://arxiv.org/abs/2602.20312
作者:Guodong Chen,Huanshuo Dong,Mallesham Dasari
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:efficiently compress time-varying, compress time-varying mesh, neural compression framework, time-varying mesh sequences, framework to efficiently
备注:
点击查看摘要
Abstract:We present N4MC, the first 4D neural compression framework to efficiently compress time-varying mesh sequences by exploiting their temporal redundancy. Unlike prior neural mesh compression methods that treat each mesh frame independently, N4MC takes inspiration from inter-frame compression in 2D video codecs, and learns motion compensation in long mesh sequences. Specifically, N4MC converts consecutive irregular mesh frames into regular 4D tensors to provide a uniform and compact representation. These tensors are then condensed using an auto-decoder, which captures both spatial and temporal correlations for redundancy removal. To enhance temporal coherence, we introduce a transformer-based interpolation model that predicts intermediate mesh frames conditioned on latent embeddings derived from tracked volume centers, eliminating motion ambiguities. Extensive evaluations show that N4MC outperforms state-of-the-art in rate-distortion performance, while enabling real-time decoding of 4D mesh sequences. The implementation of our method is available at: this https URL.
117. 【2602.20291】De-rendering, Reasoning, and Repairing Charts with Vision-Language Models
链接:https://arxiv.org/abs/2602.20291
作者:Valentin Bonas,Martin Sinnona,Viviana Siless,Emmanuel Iarussi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Data visualizations, scientific communication, everyday decision-making, mislead audiences, central to scientific
备注:
点击查看摘要
Abstract:Data visualizations are central to scientific communication, journalism, and everyday decision-making, yet they are frequently prone to errors that can distort interpretation or mislead audiences. Rule-based visualization linters can flag violations, but they miss context and do not suggest meaningful design changes. Directly querying general-purpose LLMs about visualization quality is unreliable: lacking training to follow visualization design principles, they often produce inconsistent or incorrect feedback. In this work, we introduce a framework that combines chart de-rendering, automated analysis, and iterative improvement to deliver actionable, interpretable feedback on visualization design. Our system reconstructs the structure of a chart from an image, identifies design flaws using vision-language reasoning, and proposes concrete modifications supported by established principles in visualization research. Users can selectively apply these improvements and re-render updated figures, creating a feedback loop that promotes both higher-quality visualizations and the development of visualization literacy. In our evaluation on 1,000 charts from the Chart2Code benchmark, the system generated 10,452 design recommendations, which clustered into 10 coherent categories (e.g., axis formatting, color accessibility, legend consistency). These results highlight the promise of LLM-driven recommendation systems for delivering structured, principle-based feedback on visualization design, opening the door to more intelligent and accessible authoring tools.
118. 【2602.20231】UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
链接:https://arxiv.org/abs/2602.20231
作者:Manish Kumar Govind,Dominick Reilly,Pu Wang,Srijan Das
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:robot action supervision, Latent action, Latent action representations, unified latent action, explicit robot action
备注: [this https URL](https://manishgovind.github.io/unilact-vla/)
点击查看摘要
Abstract:Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.
119. 【2602.20205】OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport
链接:https://arxiv.org/abs/2602.20205
作者:Xiwen Chen,Wenhui Zhu,Gen Li,Xuanzhao Dong,Yujian Xiong,Hao Wang,Peijie Qiu,Qingquan Song,Zhipeng Wang,Shao Tang,Yalin Wang,Abolfazl Razi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multi-modal large language, large language models, strong visual-language reasoning, Multi-modal large, redundant visual tokens
备注: Accepted by CVPR2026 (Findings). arXiv admin note: text overlap with [arXiv:2503.02175](https://arxiv.org/abs/2503.02175) by other authors
点击查看摘要
Abstract:Multi-modal large language models (MLLMs) achieve strong visual-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work explores visual token pruning to accelerate inference, while existing pruning methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, OTPrune preserves both local diversity and global representativeness while reducing inference cost. Moreover, we derive a tractable submodular objective that enables efficient optimization, and theoretically prove its monotonicity and submodularity, providing a principled foundation for stable and efficient pruning. We further provide a comprehensive analysis that explains how distributional alignment contributes to stable and semantically faithful pruning. Comprehensive experiments on wider benchmarks demonstrate that OTPrune achieves superior performance-efficiency tradeoffs compared to state-of-the-art methods. The code is available at this https URL.
120. 【2602.20200】Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
链接:https://arxiv.org/abs/2602.20200
作者:Zaijing Li,Bing Hu,Rui Shao,Gongwei Chen,Dongmei Jiang,Pengwei Xie,Jianye Hao,Liqiang Nie
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:robotic manipulation, dominant paradigm, paradigm for robotic, Hierarchical, action generation
备注: 17 pages, 8 figures
点击查看摘要
Abstract:Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. It typically comprising a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, its performance is increasingly bottlenecked by the action generation proceess. (i) Low inference efficiency. A pronounced distributional gap between isotropic noise priors and target action distributions, which increases denoising steps and the incidence of infeasible samples. (ii) Poor robustness. Existing policies condition solely on the current observation, neglecting the constraint of history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the umber of function evaluations (NFE). LCM dynamically models executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 Hard. In Real-World evaluation, OptimusVLA ranks best on Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering 2.9x inference speedup.
121. 【2602.20165】VISION-ICE: Video-based Interpretation and Spatial Identification of Arrhythmia Origins via Neural Networks in Intracardiac Echocardiography
链接:https://arxiv.org/abs/2602.20165
作者:Dorsa EPMoghaddam,Feng Gao,Drew Bernard,Kavya Sinha,Mehdi Razavi,Behnaam Aazhang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Contemporary high-density mapping, MRI remain time, high-density mapping techniques, Contemporary high-density, MRI remain
备注: 8 pages, 3 figures, 3 tabels
点击查看摘要
Abstract:Contemporary high-density mapping techniques and preoperative CT/MRI remain time and resource intensive in localizing arrhythmias. AI has been validated as a clinical decision aid in providing accurate, rapid real-time analysis of echocardiographic images. Building on this, we propose an AI-enabled framework that leverages intracardiac echocardiography (ICE), a routine part of electrophysiology procedures, to guide clinicians toward areas of arrhythmogenesis and potentially reduce procedural time. Arrhythmia source localization is formulated as a three-class classification task, distinguishing normal sinus rhythm, left-sided, and right-sided arrhythmias, based on ICE video data. We developed a 3D Convolutional Neural Network trained to discriminate among the three aforementioned classes. In ten-fold cross-validation, the model achieved a mean accuracy of 66.2% when evaluated on four previously unseen patients (substantially outperforming the 33.3% random baseline). These results demonstrate the feasibility and clinical promise of using ICE videos combined with deep learning for automated arrhythmia localization. Leveraging ICE imaging could enable faster, more targeted electrophysiological interventions and reduce the procedural burden of cardiac ablation. Future work will focus on expanding the dataset to improve model robustness and generalizability across diverse patient populations.
122. 【2310.15741】Interpretable Medical Image Classification using Prototype Learning and Privileged Information
链接:https://arxiv.org/abs/2310.15741
作者:Luisa Gallee,Meinrad Beer,Michael Goetz
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:medical imaging, essential requirement, requirement in medical, Advanced deep learning, Abstract
备注: MICCAI 2023 Medical Image Computing and Computer Assisted Intervention
点击查看摘要
Abstract:Interpretability is often an essential requirement in medical imaging. Advanced deep learning methods are required to address this need for explainability and high performance. In this work, we investigate whether additional information available during the training process can be used to create an understandable and powerful model. We propose an innovative solution called Proto-Caps that leverages the benefits of capsule networks, prototype learning and the use of privileged information. Evaluating the proposed solution on the LIDC-IDRI dataset shows that it combines increased interpretability with above state-of-the-art prediction performance. Compared to the explainable baseline model, our method achieves more than 6 % higher accuracy in predicting both malignancy (93.0 %) and mean characteristic features of lung nodules. Simultaneously, the model provides case-based reasoning with prototype representations that allow visual validation of radiologist-defined attributes.
123. 【2602.20994】Multimodal MRI Report Findings Supervised Brain Lesion Segmentation with Substructures
链接:https://arxiv.org/abs/2602.20994
作者:Yubin Ge,Yongsong Huang,Xiaofeng Liu
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:dense tumor voxel, tumor voxel labels, learning seeks, seeks to alleviate, voxel labels
备注: IEEE International Symposium on Biomedical Imaging (ISBI) 2026
点击查看摘要
Abstract:Report-supervised (RSuper) learning seeks to alleviate the need for dense tumor voxel labels with constraints derived from radiology reports (e.g., volumes, counts, sizes, locations). In MRI studies of brain tumors, however, we often involve multi-parametric scans and substructures. Here, fine-grained modality/parameter-wise reports are usually provided along with global findings and are correlated with different substructures. Moreover, the reports often describe only the largest lesion and provide qualitative or uncertain cues (``mild,'' ``possible''). Classical RSuper losses (e.g., sum volume consistency) can over-constrain or hallucinate unreported findings under such incompleteness, and are unable to utilize these hierarchical findings or exploit the priors of varied lesion types in a merged dataset. We explicitly parse the global quantitative and modality-wise qualitative findings and introduce a unified, one-sided, uncertainty-aware formulation (MS-RSuper) that: (i) aligns modality-specific qualitative cues (e.g., T1c enhancement, FLAIR edema) with their corresponding substructures using existence and absence losses; (ii) enforces one-sided lower-bounds for partial quantitative cues (e.g., largest lesion size, minimal multiplicity); and (iii) adds extra- vs. intra-axial anatomical priors to respect cohort differences. Certainty tokens scale penalties; missing cues are down-weighted. On 1238 report-labeled BraTS-MET/MEN scans, our MS-RSuper largely outperforms both a sparsely-supervised baseline and a naive RSuper method.
124. 【2602.20539】Progressive Per-Branch Depth Optimization for DEFOM-Stereo and SAM3 Joint Analysis in UAV Forestry Applications
链接:https://arxiv.org/abs/2602.20539
作者:Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:complex forest canopies, individual branch analysis, dense disparity maps, autonomous UAV-based tree, modern stereo matchers
备注:
点击查看摘要
Abstract:Accurate per-branch 3D reconstruction is a prerequisite for autonomous UAV-based tree pruning; however, dense disparity maps from modern stereo matchers often remain too noisy for individual branch analysis in complex forest canopies. This paper introduces a progressive pipeline integrating DEFOM-Stereo foundation-model disparity estimation, SAM3 instance segmentation, and multi-stage depth optimization to deliver robust per-branch point clouds. Starting from a naive baseline, we systematically identify and resolve three error families through successive refinements. Mask boundary contamination is first addressed through morphological erosion and subsequently refined via a skeleton-preserving variant to safeguard thin-branch topology. Segmentation inaccuracy is then mitigated using LAB-space Mahalanobis color validation coupled with cross-branch overlap arbitration. Finally, depth noise - the most persistent error source - is initially reduced by outlier removal and median filtering, before being superseded by a robust five-stage scheme comprising MAD global detection, spatial density consensus, local MAD filtering, RGB-guided filtering, and adaptive bilateral filtering. Evaluated on 1920x1080 stereo imagery of Radiata pine (Pinus radiata) acquired with a ZED Mini camera (63 mm baseline) from a UAV in Canterbury, New Zealand, the proposed pipeline reduces the average per-branch depth standard deviation by 82% while retaining edge fidelity. The result is geometrically coherent 3D point clouds suitable for autonomous pruning tool positioning. All code and processed data are publicly released to facilitate further UAV forestry research.
125. 【2602.20316】Inspectorch: Efficient rare event exploration in solar observations
链接:https://arxiv.org/abs/2602.20316
作者:C. J. Díaz Baso,I. J. Soler Poquet,C. Kuckein,M. van Noort,N. Poirier
类目:olar and Stellar Astrophysics (astro-ph.SR); Computer Vision and Pattern Recognition (cs.CV)
关键词:small spatiotemporal scales, Sun is observed, unprecedented detail, enabling studies, spatiotemporal scales
备注: Comments: 12+1 pages, 11+2 figures, submitted to AA
点击查看摘要
Abstract:The Sun is observed in unprecedented detail, enabling studies of its activity on very small spatiotemporal scales. However, the large volume of data collected by our telescopes cannot be fully analyzed with conventional methods. Popular machine learning methods identify general trends from observations, but tend to overlook unusual events due to their low frequency of occurrence. We study the applicability of unsupervised probabilistic methods to efficiently identify rare events in multidimensional solar observations and optimize our computational resources to the study of these extreme phenomena. We introduce Inspectorch, an open-source framework that utilizes flow-based models: flexible density estimators capable of learning the multidimensional distribution of solar observations. Once optimized, it assigns a probability to each sample, allowing us to identify unusual events. We apply this approach by applying it to observations from the Hinode Spectro-Polarimeter, the Interface Region Imaging Spectrograph, the Microlensed Hyperspectral Imager at Swedish 1-m Solar Telescope, the Atmospheric Imaging Assembly on board the Solar Dynamics Observatory and the Extreme Ultraviolet Imager on board Solar Orbiter. We find that the algorithm assigns consistently lower probabilities to spectra that exhibit unusual features. For example, it identifies profiles with very strong Doppler shifts, uncommon broadening, and temporal dynamics associated with small-scale reconnection events, among others. As a result, Inspectorch demonstrates that density estimation using flow-based models offers a powerful approach to identifying rare events in large solar datasets. The resulting probabilistic anomaly scores allow computational resources to be focused on the most informative and physically relevant events. We make our Python package publicly available at this https URL.



